Learning from mistakes: Improving spelling correction performance with automatic generation of realistic misspellings

Cited by: 3

Authors:

Buyuk, Osman ^{[1]}

Arslan, Levent M. ^{[2,3]}

Affiliations:

[1] Izmir Demokrasi Univ, Dept Elect & Elect Engn, Izmir, Turkey

[2] Bogazici Univ, Dept Elect & Elect Engn, Istanbul, Turkey

[3] Sestek Speech Enabled Software Technol Inc, Sestek Res & Dev Ctr, Istanbul, Turkey

Source:

EXPERT SYSTEMS | 2021 / Volume 38 / Issue 05

Keywords:

deep learning; machine learning; natural language processing; neural network;

DOI:

10.1111/exsy.12692

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Sequence-to-sequence (seq2seq) models require a large amount of labelled training data to learn the mapping between the input and output. A large set of misspelled words together with their corrections is needed to train a seq2seq spelling correction system. Low-resource languages such as Turkish usually lack such large annotated datasets. Although misspelling-reference pairs can be synthesized with a random procedure, the generated dataset may not match genuine human-made misspellings well. This might degrade the performance in realistic test scenarios. In this paper, we propose a novel procedure to automatically introduce human-like misspellings into legitimate words in the Turkish language. The generated human-like misspellings are used to improve the performance of a seq2seq spelling correction system. The proposed system consists of two separate models: a misspelling generator and a spelling corrector. The generator is trained using a relatively small number of human-made misspellings and their manual corrections. Reference words and their misspellings are used as the inputs and outputs of the generator, respectively. As a result, it is trained to add realistic spelling errors to valid words. The training data of the spelling corrector is augmented with the generator's human-like misspellings. In the experiments, we observe that the data augmentation significantly improves the spelling correction performance. Our proposed method yields a 5% absolute improvement over the state-of-the-art Turkish spelling correction systems on a test set that contains human-made misspellings from Twitter messages.
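The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the alphabet, and the single-edit random procedure are all assumptions made for illustration. It shows only the random baseline and the two pair directions, with the generator mapping reference to misspelling and the corrector mapping misspelling to reference. The seq2seq models themselves are not shown.

```python
import random

# Hypothetical character set for random edits; the paper's actual
# alphabet and edit operations are not specified in this record.
ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"


def random_misspelling(word, rng):
    """Random baseline: apply one random character-level edit.

    A substitution may pick the same character, in which case the
    word is returned unchanged.
    """
    if len(word) < 2:
        return word
    op = rng.choice(["delete", "insert", "substitute", "transpose"])
    i = rng.randrange(len(word) - 1)
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    if op == "substitute":
        return word[:i] + rng.choice(ALPHABET) + word[i + 1:]
    # transpose two adjacent characters
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def generator_pairs(human_pairs):
    """Flip (misspelling, reference) pairs into (reference, misspelling),
    the input/output direction the abstract gives for the generator."""
    return [(ref, mis) for mis, ref in human_pairs]


def augment_corrector_data(human_pairs, vocabulary, misspell):
    """Build corrector training pairs as (misspelling, reference).

    `misspell` stands in for any misspelling source: the random
    baseline above, or a trained generator (not implemented here).
    """
    synthetic = [(misspell(word), word) for word in vocabulary]
    return list(human_pairs) + synthetic
```

In this sketch, replacing the `misspell` callable with a trained generator's sampling function would correspond to the augmentation step the abstract describes, while the random baseline corresponds to the procedure it argues against.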

References

Pages: 16

38 entries in total

[1] EFFICIENT STRING MATCHING - AID TO BIBLIOGRAPHIC SEARCH [J].

AHO, AV ;

CORASICK, MJ .

COMMUNICATIONS OF THE ACM, 1975, 18 (06) :333-340

[2]

Akin A. A., 2007, Structure, V10, P1

[3]

Bahdanau D, 2016, Arxiv, DOI [arXiv:1409.0473, DOI 10.48550/ARXIV.1409.0473]

[4] Both Complete and Correct? Multi-Objective Optimization of Touchscreen Keyboard [J].

Bi, Xiaojun ;

Ouyang, Tom ;

Zhai, Shumin .

32ND ANNUAL ACM CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS (CHI 2014), 2014, :2297-2306

[5]

Bolucu N, 2019, IEEE SCI M EL EL BIO, P1, DOI 10.1109/EBBT.2019.8742067

[6]

Buyuk, 2005, THESIS SABANCI U TUR

[7] Learning from mistakes: Improving spelling correction performance with automatic generation of realistic misspellings [J].

Buyuk, Osman ;

Arslan, Levent M. .

EXPERT SYSTEMS, 2021, 38 (05)

[8] Context-Dependent Sequence-to-Sequence Turkish Spelling Correction [J].

Buyuk, Osman .

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2020, 19 (04)

[9] Context Influence on Sequence to Sequence Turkish Spelling Correction [J].

Buyuk, Osman ;

Erden, Mustafa ;

Arslan, Levent M. .

2019 27TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2019,

[10]

Chiu CC, 2018, 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4774, DOI 10.1109/ICASSP.2018.8462105
