Weakly labeled data augmentation for social media named entity recognition

Cited by: 7
Authors
Kim, Juae [1]
Kim, Yejin [2]
Kang, Sangwoo [3]
Affiliations
[1] AIRS Co, Hyundai Motor Grp, Seoul 06620, South Korea
[2] George Washington Univ, Dept Comp Sci, Graph Lab, Washington, DC 20037 USA
[3] Gachon Univ, Sch Comp, Gyeonggi Do 13120, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Named entity recognition; Social-media text mining; Weakly labeled data; Transfer learning;
DOI
10.1016/j.eswa.2022.118217
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Named entity recognition (NER) is the task of extracting entities that correspond to predefined categories. Although NER is important for processing user-generated text such as social media posts, it remains challenging because such text tends to contain many unseen words and abbreviations. To address this issue, we propose two methods for generating weakly labeled data that help extract named entities from social media text more effectively: alias augmentation and typo augmentation. With these methods, weakly labeled data are generated by automatically annotating unlabeled Wikipedia texts and tweets, and the model is then trained on them through transfer learning. Our experimental results suggest that the proposed approach improves NER performance, with our best F1-score of 51.43% being the highest score reported to date.
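The abstract gives no implementation details, so the following is only a minimal sketch of how weak labeling with alias and typo augmentation might look, not the authors' method. It assumes a hypothetical gazetteer mapping surface forms (including aliases) to entity types, a simple single-edit noise model for typos, and greedy longest-match BIO tagging of whitespace tokens. All function names, the noise operations, and the example entries are illustrative assumptions.

```python
import random


def typo_variants(word, rng, n=3):
    """Return up to n distinct single-edit misspellings of `word`.

    Assumed noise model (not from the paper): swap two adjacent
    characters, drop one character, or repeat one character.
    """
    if len(word) < 3:
        return []
    variants = set()
    attempts = 0
    while len(variants) < n and attempts < 50:
        attempts += 1
        i = rng.randrange(len(word) - 1)
        op = rng.choice(["swap", "drop", "repeat"])
        if op == "swap":
            cand = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        elif op == "drop":
            cand = word[:i] + word[i + 1:]
        else:  # repeat the character at position i
            cand = word[:i] + word[i] + word[i:]
        if cand != word:
            variants.add(cand)
    return sorted(variants)


def augment_with_typos(gazetteer, rng):
    """Copy the gazetteer and add misspelled surface forms with the same label."""
    out = dict(gazetteer)
    for surface, label in gazetteer.items():
        for variant in typo_variants(surface, rng):
            out.setdefault(variant, label)  # never overwrite a real entry
    return out


def weak_label(tokens, gazetteer, max_len=3):
    """Assign BIO tags by greedy longest match against the gazetteer."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n]).lower()
            if span in gazetteer:
                label = gazetteer[span]
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return tags
```

In this sketch, alias augmentation amounts to adding entries such as "nyc" alongside "new york city" in the gazetteer, and typo augmentation extends the same gazetteer with noisy spellings before unlabeled text is tagged. The resulting tags are noisy by construction, which is why such data is called weakly labeled.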
Pages: 10