Weakly labeled data augmentation for social media named entity recognition

Cited by: 10
Authors
Kim, Juae [1]
Kim, Yejin [2]
Kang, Sangwoo [3]
Affiliations
[1] AIRS Co, Hyundai Motor Grp, Seoul 06620, South Korea
[2] George Washington Univ, Dept Comp Sci, Graph Lab, Washington, DC 20037 USA
[3] Gachon Univ, Sch Comp, Gyeonggi Do 13120, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Named entity recognition; Social-media text mining; Weakly labeled data; Transfer learning;
DOI
10.1016/j.eswa.2022.118217
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Named entity recognition (NER) is the task of extracting entities that belong to predefined categories. Although NER is important for processing user-generated texts such as those obtained from social media, it remains challenging because such texts tend to contain numerous unseen words and abbreviations. To address this issue, we propose two methods for generating weakly labeled data that allow named entities to be extracted from social media texts more effectively: alias augmentation and typo augmentation. Using these methods, weakly labeled data are generated by automatically annotating unlabeled Wikipedia texts and Tweets, and a model is then trained on these data through transfer learning. Our experimental results suggest that the proposed approach improves NER performance, with our best F1-score of 51.43% representing the highest score reported to date.
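The abstract does not say how the alias and typo variants are produced or how the automatic annotation is carried out. The Python sketch below is therefore only an illustration of the general idea, not the authors' implementation. It assumes a simple dictionary-matching annotator with BIO tags, typo variants limited to single-character deletions and adjacent swaps, and a hand-written toy lexicon; the names `typo_variants`, `build_lexicon`, and `weak_label` are hypothetical.

```python
from typing import Dict, List, Tuple

Lexicon = Dict[Tuple[str, ...], str]


def typo_variants(word: str) -> List[str]:
    """Return simple misspellings: one deleted character or one adjacent swap.

    Assumption: the paper's typo augmentation may use a different procedure.
    """
    variants = set()
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])  # delete character i
    for i in range(len(word) - 1):
        # swap characters i and i + 1
        variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    variants.discard(word)
    variants.discard("")
    return sorted(variants)


def build_lexicon(entities: Dict[str, str],
                  aliases: Dict[str, List[str]],
                  min_typo_len: int = 5) -> Lexicon:
    """Map lowercased surface forms (canonical names, aliases, typos) to types."""
    lexicon: Lexicon = {}
    for name, etype in entities.items():
        for surface in [name] + aliases.get(name, []):
            tokens = tuple(surface.lower().split())
            lexicon[tokens] = etype
            # Only single-token, reasonably long names get typo variants,
            # to limit accidental collisions with ordinary words.
            if len(tokens) == 1 and len(tokens[0]) >= min_typo_len:
                for variant in typo_variants(tokens[0]):
                    lexicon.setdefault((variant,), etype)
    return lexicon


def weak_label(tokens: List[str], lexicon: Lexicon) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup in the lexicon."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    max_len = max((len(key) for key in lexicon), default=0)
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            etype = lexicon.get(tuple(lowered[i:i + n]))
            if etype is not None:
                tags[i] = "B-" + etype
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + etype
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return tags


# Toy data, invented for illustration only.
entities = {"New York City": "LOC", "Microsoft": "ORG"}
aliases = {"New York City": ["NYC"], "Microsoft": ["MSFT"]}
lexicon = build_lexicon(entities, aliases)
noisy_tags = weak_label("I love nyc and Micorsoft".split(), lexicon)
```

Here the alias "nyc" and the misspelling "Micorsoft" both receive entity tags even though neither matches a canonical name exactly. Labels produced this way are noisy, which is why such data are described as weakly labeled.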
Pages: 10