Weakly labeled data augmentation for social media named entity recognition

Cited by: 10

Authors:

Kim, Juae [1]

Kim, Yejin [2]

Kang, Sangwoo [3]

Affiliations:

[1] AIRS Co, Hyundai Motor Grp, Seoul 06620, South Korea

[2] George Washington Univ, Dept Comp Sci, Graph Lab, Washington, DC 20037 USA

[3] Gachon Univ, Sch Comp, Gyeonggi Do 13120, South Korea

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2022 / Volume 209

Funding:

National Research Foundation of Singapore;

Keywords:

Named entity recognition; Social-media text mining; Weakly labeled data; Transfer learning;

DOI:

10.1016/j.eswa.2022.118217

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Named entity recognition (NER) is the task of extracting entities that belong to predefined categories. Although NER is important for processing user-generated texts such as social media posts, it remains challenging because such texts tend to contain numerous unseen words and abbreviations. To address this issue, we propose two methods for generating weakly labeled data that help extract named entities from social media texts more effectively: alias augmentation and typo augmentation. With these methods, weakly labeled data are generated by automatically annotating unlabeled Wikipedia texts and tweets, and a model is then trained on these data through transfer learning. Our experimental results suggest that the proposed approach improves NER performance, with our best F1-score of 51.43% representing the highest score ever reported.

Pages: 10