Multimodular Text Normalization of Dutch User-Generated Content

Cited by: 14
Authors
Schulz, Sarah [1, 3]
De Pauw, Guy [2]
De Clercq, Orphee [1]
Desmet, Bart [1]
Hoste, Veronique [1]
Daelemans, Walter [2]
Macken, Lieve [1]
Affiliations
[1] Univ Ghent, Dept Translat Interpreting & Commun, Groot Brittannielaan 45, B-9000 Ghent, Belgium
[2] Univ Antwerp, Computat Linguist & Psycholinguist Res Ctr, Prinsstr 13, B-2000 Antwerp, Belgium
[3] Univ Stuttgart, Inst Nat Language Proc, Pfaffenwaldring 5B, D-70569 Stuttgart, Germany
Keywords
Social media; text normalization; user-generated content
DOI
10.1145/2850422
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
As social media constitutes a valuable source for data analysis for a wide range of applications, the need for handling such data arises. However, the nonstandard language used on social media poses problems for natural language processing (NLP) tools, as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multimodular approach to account for the diversity of normalization issues encountered in user-generated content (UGC). We consider three different types of UGC written in Dutch (SNS, SMS, and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also apply an extrinsic evaluation by evaluating the performance of a part-of-speech tagger, lemmatizer, and named-entity recognizer before and after normalization.
Pages: 22