Low-resource neural character-based noisy text normalization

Cited by: 6
Authors
Mager, Manuel [1 ]
Jasso Rosales, Monica [2 ,3 ]
Cetinoglu, Ozlem [1 ]
Meza, Ivan [4 ]
Affiliations
[1] Univ Stuttgart, Inst Nat Language Proc, Stuttgart, Germany
[2] Univ Nacl Autonoma Mexico, Fac Filosofia & Letras, Mexico City, DF, Mexico
[3] Univ Nacl Autonoma Mexico, Inst Ingn, Mexico City, DF, Mexico
[4] Univ Nacl Autonoma Mexico, Inst Invest Matemat Aplicadas & Sistemas, Mexico City, DF, Mexico
Keywords
Noisy text; normalization; recurrent neural networks; low-resource; autoencoding
DOI
10.3233/JIFS-179039
CLC classification (Chinese Library Classification)
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
User-generated text in social networks is often not written in its standard form. Such text introduces large lexical dispersion into datasets and makes the data inconsistent, so normalizing it is a crucial preprocessing step for common Natural Language Processing tools. In this paper we explore the state of the art of the machine-translation approach to text normalization under low-resource conditions. We also propose an auxiliary task for the sequence-to-sequence (seq2seq) neural architecture, novel to the text normalization task, that improves the base seq2seq model by up to 5%. This performance gain closes the gap between statistical machine translation approaches and neural ones for low-resource text normalization.
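The auxiliary task named in the abstract (autoencoding) can be sketched as a multi-task objective: the seq2seq normalizer's loss on noisy-to-standard pairs is combined with a weighted reconstruction loss on the noisy input itself. The function names, data layout, and weighting scheme below are illustrative assumptions, not the authors' implementation.

```python
import math

def cross_entropy(pred_probs, target_ids):
    """Mean negative log-likelihood of the target characters under the
    model's per-step output distributions (one dict of probabilities per step)."""
    return -sum(math.log(step[t]) for step, t in zip(pred_probs, target_ids)) / len(target_ids)

def multitask_loss(norm_probs, norm_targets, ae_probs, ae_targets, lam=0.5):
    """Normalization loss plus a lambda-weighted auxiliary autoencoding loss.

    norm_*: decoder outputs/targets for noisy -> standard normalization.
    ae_*:   decoder outputs/targets for reconstructing the noisy input itself.
    """
    return (cross_entropy(norm_probs, norm_targets)
            + lam * cross_entropy(ae_probs, ae_targets))
```

In such a setup the normalization and autoencoding tasks would share the same encoder-decoder parameters, with the autoencoding term acting as a regularizer under low-resource conditions; only the loss combination is shown here.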
Pages: 4921-4929 (9 pages)