Learning from mistakes: Improving spelling correction performance with automatic generation of realistic misspellings

Cited by: 3
Authors
Buyuk, Osman [1]
Arslan, Levent M. [2, 3]
Affiliations
[1] Izmir Demokrasi Univ, Dept Elect & Elect Engn, Izmir, Turkey
[2] Bogazici Univ, Dept Elect & Elect Engn, Istanbul, Turkey
[3] Sestek Speech Enabled Software Technol Inc, Sestek Res & Dev Ctr, Istanbul, Turkey
Keywords
deep learning; machine learning; natural language processing; neural network
DOI
10.1111/exsy.12692
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Sequence-to-sequence (seq2seq) models require a large amount of labelled training data to learn the mapping between input and output. Training a seq2seq spelling correction system requires a large set of misspelled words together with their corrections. Low-resource languages such as Turkish usually lack such large annotated datasets. Although misspelling-reference pairs can be synthesized with a random procedure, the generated dataset may not match genuine human-made misspellings well, which might degrade performance in realistic test scenarios. In this paper, we propose a novel procedure to automatically introduce human-like misspellings into legitimate Turkish words. The generated human-like misspellings are used to improve the performance of a seq2seq spelling correction system. The proposed system consists of two separate models: a misspelling generator and a spelling corrector. The generator is trained using a relatively small number of human-made misspellings and their manual corrections. Reference words and their misspellings are used as the inputs and outputs of the generator, respectively. As a result, it is trained to add realistic spelling errors to valid words. The training data of the spelling corrector is augmented with the generator's human-like misspellings. In the experiments, we observe that the data augmentation significantly improves spelling correction performance. Our proposed method yields a 5% absolute improvement over state-of-the-art Turkish spelling correction systems on a test set that contains human-made misspellings from Twitter messages.
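The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the Turkish example words, and the alphabet string are all assumptions made for this sketch, and the trained seq2seq generator is replaced by a plain random-edit function (the kind of baseline the abstract contrasts against) so that the code runs on its own.

```python
import random

# Hypothetical alphabet used only by the random baseline below.
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"


def random_misspelling(word, rng, alphabet=TURKISH_ALPHABET):
    """Random baseline: apply one random character edit to a word."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["delete", "insert", "substitute", "transpose"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(alphabet) + word[i:]
    if op == "substitute":
        return word[:i] + rng.choice(alphabet) + word[i + 1:]
    # transpose the characters at positions i and i + 1
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def generator_pairs(human_pairs):
    """Build generator training pairs: reference word in, misspelling out.

    human_pairs holds (misspelling, reference) tuples collected from people.
    """
    return [(reference, misspelling) for misspelling, reference in human_pairs]


def augment_corrector_data(human_pairs, clean_words, generate):
    """Build corrector training pairs: misspelling in, reference word out.

    `generate` stands in for the trained misspelling generator; any callable
    that maps a valid word to a misspelled form can be plugged in.
    """
    augmented = list(human_pairs)
    for word in clean_words:
        augmented.append((generate(word), word))
    return augmented


# Illustrative, made-up data (not taken from the paper).
human_pairs = [("geliyom", "geliyorum"), ("degil", "değil")]
clean_words = ["merhaba", "teşekkürler", "kitap"]

rng = random.Random(0)
gen_train = generator_pairs(human_pairs)
corrector_train = augment_corrector_data(
    human_pairs, clean_words, lambda w: random_misspelling(w, rng)
)
```

In the setting the abstract describes, `generate` would be the seq2seq generator trained on `gen_train`, so the added pairs would resemble human errors rather than uniformly random edits.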
Pages: 16
Related papers
38 in total
[1]   EFFICIENT STRING MATCHING - AID TO BIBLIOGRAPHIC SEARCH [J].
AHO, AV ;
CORASICK, MJ .
COMMUNICATIONS OF THE ACM, 1975, 18 (06) :333-340
[2]  
Akin A. A., 2007, Structure, V10, P1
[3]  
Bahdanau D, 2016, Arxiv, DOI [arXiv:1409.0473, DOI 10.48550/ARXIV.1409.0473]
[4]   Both Complete and Correct? Multi-Objective Optimization of Touchscreen Keyboard [J].
Bi, Xiaojun ;
Ouyang, Tom ;
Zhai, Shumin .
32ND ANNUAL ACM CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS (CHI 2014), 2014, :2297-2306
[5]  
Bolucu N, 2019, IEEE SCI M EL EL BIO, P1, DOI [10.1109/EBBT.2019.8742067, DOI 10.1109/EBBT.2019.8742067]
[6]  
Buyuk, 2005, THESIS SABANCI U TUR
[7]   Learning from mistakes: Improving spelling correction performance with automatic generation of realistic misspellings [J].
Buyuk, Osman ;
Arslan, Levent M. .
EXPERT SYSTEMS, 2021, 38 (05)
[8]   Context-Dependent Sequence-to-Sequence Turkish Spelling Correction [J].
Buyuk, Osman .
ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2020, 19 (04)
[9]   Context Influence on Sequence to Sequence Turkish Spelling Correction [J].
Buyuk, Osman ;
Erden, Mustafa ;
Arslan, Levent M. .
2019 27TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2019,
[10]  
Chiu CC, 2018, 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4774, DOI 10.1109/ICASSP.2018.8462105