Arabic spelling error detection and correction

Cited by: 18
Authors
Attia, Mohammed [1, 2]
Pecina, Pavel [3]
Samih, Younes [4]
Shaalan, Khaled [2]
Van Genabith, Josef [1]
Affiliations
[1] Dublin City Univ, Sch Comp, Dublin, Ireland
[2] British Univ Dubai, Fac Engn & IT, Dubai, U Arab Emirates
[3] Charles Univ Prague, Fac Math & Phys, Prague, Czech Republic
[4] Univ Dusseldorf, Dept Linguist & Informat Sci, Dusseldorf, Germany
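Funding
National Research Foundation of Singapore; Science Foundation Ireland
Keywords
WORDS
DOI
10.1017/S1351324915000030
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract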
A spelling error detection and correction application is typically based on three main components: a dictionary (or reference word list), an error model and a language model. While most of the attention in the literature has been directed to the language model, we show how improvements in any of the three components can lead to significant cumulative improvements in the overall performance of the system. We develop our dictionary of 9.2 million fully-inflected Arabic words (types) from a morphological transducer and a large corpus, validated and manually revised. We improve the error model by analyzing error types and creating an edit distance re-ranker. We also improve the language model by analyzing the level of noise in different data sources and selecting an optimal subset to train the system on. Testing and evaluation experiments show that our system significantly outperforms Microsoft Word 2013, OpenOffice Ayaspell 3.4 and Google Docs.
Pages: 751-773
Page count: 23