Automatic diacritization of Arabic text using recurrent neural networks

Cited by: 0
Authors
Gheith A. Abandah
Alex Graves
Balkees Al-Shagoor
Alaa Arabiyat
Fuad Jamour
Majid Al-Taee
Affiliations
[1] University of Jordan, Computer Engineering Department
[2] Google DeepMind
[3] King Abdullah University of Science and Technology
Source
International Journal on Document Analysis and Recognition (IJDAR) | 2015 / Volume 18
Keywords
Automatic diacritization; Arabic text; Machine learning; Sequence transcription; Recurrent neural networks; Deep neural networks; Long short-term memory
DOI
Not available
Abstract
This paper presents a sequence transcription approach for the automatic diacritization of Arabic text. A recurrent neural network is trained to transcribe undiacritized Arabic text into fully diacritized sentences. We use a deep bidirectional long short-term memory network that builds high-level linguistic abstractions of text and exploits long-range context in both input directions. This approach differs from previous approaches in that no lexical, morphological, or syntactical analysis is performed on the data before it is processed by the network. Nonetheless, when the network's output is post-processed with our error correction techniques, it achieves state-of-the-art performance, yielding average diacritic and word error rates of 2.09% and 5.82%, respectively, on samples from 11 books. For the LDC ATB3 benchmark, this approach reduces the diacritic error rate by 25%, the word error rate by 20%, and the last-letter diacritization error rate by 33% relative to the best published results.
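The two headline metrics in the abstract, diacritic error rate and word error rate, can be illustrated with a minimal Python sketch. The counting conventions here are assumptions, since the abstract does not define them: every base character counts as one diacritization decision, a word is wrong if any of its characters carries the wrong marks, and mark order within a character is ignored. The names `split_diacritics` and `error_rates` are hypothetical helpers, not taken from the paper.

```python
# Arabic combining diacritics: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_diacritics(word):
    """Return a list of (base character, diacritic string) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            # Attach the mark to the most recent base character.
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def error_rates(reference, hypothesis):
    """Return (diacritic error rate, word error rate) in percent.

    Assumes both strings contain the same words and base letters and
    differ only in their diacritics.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("inputs must be non-empty with equal word counts")

    letters = letter_errors = word_errors = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_pairs = split_diacritics(ref_word)
        hyp_pairs = split_diacritics(hyp_word)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError("base letters differ between inputs")
        # Compare marks per letter, ignoring the order of stacked marks.
        wrong = sum(
            sorted(r_marks) != sorted(h_marks)
            for (_, r_marks), (_, h_marks) in zip(ref_pairs, hyp_pairs)
        )
        letters += len(ref_pairs)
        letter_errors += wrong
        word_errors += 1 if wrong else 0

    return 100.0 * letter_errors / letters, 100.0 * word_errors / len(ref_words)
```

Under these assumptions, a two-word reference with six letters in which the hypothesis gets only the final letter's case ending wrong yields one letter error out of six and one word error out of two. The last-letter error rate mentioned in the abstract would restrict the comparison to the final pair of each word.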
Pages: 183-197
Page count: 14