Weakly labeled data augmentation for social media named entity recognition

Cited by: 10
Authors
Kim, Juae [1]
Kim, Yejin [2]
Kang, Sangwoo [3]
Affiliations
[1] AIRS Co, Hyundai Motor Grp, Seoul 06620, South Korea
[2] George Washington Univ, Dept Comp Sci, Graph Lab, Washington, DC 20037 USA
[3] Gachon Univ, Sch Comp, Gyeonggi Do 13120, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Named entity recognition; Social-media text mining; Weakly labeled data; Transfer learning;
DOI
10.1016/j.eswa.2022.118217
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Named entity recognition (NER) is the task of extracting entities that belong to predefined categories. Although NER is important for processing user-generated texts such as those obtained from social media, it remains challenging because such texts tend to contain numerous unseen words and abbreviations. To address this issue, we propose two methods for generating weakly labeled data that help extract named entities from social media texts more effectively: alias augmentation and typo augmentation. Using these methods, weakly labeled data are generated by automatically annotating unlabeled Wikipedia texts and Tweets, and the model is then trained on these data through transfer learning. Our experimental results suggest that the proposed approach improves NER performance, with our best F1-score of 51.43% being the highest score reported to date.
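The abstract does not say how typo augmentation is implemented, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes BIO-tagged tokens and uses a single adjacent-character swap as the typo model. The names `swap_typo` and `typo_augment` and the example sentence are hypothetical.

```python
import random


def swap_typo(word, rng):
    """Return the word with two adjacent characters swapped (unchanged if too short)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def typo_augment(tokens, labels, seed=0):
    """Copy a labeled sentence, injecting one typo into each entity token.

    Labels are carried over unchanged, so the result is only weakly labeled:
    nobody has verified that the perturbed token is still a valid mention.
    """
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    new_tokens = [
        swap_typo(tok, rng) if lab != "O" else tok
        for tok, lab in zip(tokens, labels)
    ]
    return new_tokens, list(labels)


# Hypothetical example sentence with BIO labels (an assumption, not from the paper).
tokens = ["Justin", "Bieber", "visits", "Seoul"]
labels = ["B-PER", "I-PER", "O", "B-LOC"]
aug_tokens, aug_labels = typo_augment(tokens, labels)
```

In a transfer-learning setup like the one the abstract describes, sentences produced this way would plausibly be used for an initial training stage before fine-tuning on gold-labeled data, but that ordering is an assumption here.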
Pages: 10