Rich Morphology Based N-gram Language Models for Arabic

Cited by: 0
Authors
Emami, Ahmad [1]
Zitouni, Imed [1]
Mangu, Lidia [1]
Affiliations
[1] IBM Corp, TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
Keywords
Language Modeling; Arabic Morphology; Rich Language Modeling
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper we investigate the use of rich morphology, such as word segmentation, part-of-speech tagging and diacritic restoration, to improve Arabic language modeling. We enrich the context by performing morphological analysis on the word history. We use neural network models to integrate this additional information, due to their ability to handle long and enriched dependencies. We experimented with models with increasing order of morphological features, starting with Arabic segmentation, and later adding part-of-speech labels as well as words with restored diacritics. Experiments on Arabic broadcast news and broadcast conversations data showed significant improvements in perplexity, reducing the baseline N-gram and the neural network N-gram model perplexities by 35% and 31% respectively.
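The idea in the abstract, enriching the word history with progressively more morphological features and measuring the effect on perplexity, can be sketched as below. This is only an illustrative sketch, not the authors' implementation: the `Token` fields, the `enriched_context` function, the feature levels, the transliterated example tokens, and the perplexity values used in the usage lines are all hypothetical and are not taken from the paper.

```python
import math
from typing import NamedTuple, Sequence, Tuple


class Token(NamedTuple):
    """One history word with hypothetical morphological annotations."""
    word: str                  # surface form
    segments: Tuple[str, ...]  # assumed segmentation (e.g. prefix, stem)
    pos: str                   # assumed part-of-speech label
    diacritized: str           # assumed form with restored diacritics


def enriched_context(history: Sequence[Token], order: int, level: int) -> Tuple[str, ...]:
    """Flatten the last (order - 1) history tokens into one feature tuple.

    level 0: words only
    level 1: + segments
    level 2: + part-of-speech labels
    level 3: + diacritized forms
    """
    if order <= 1:
        return ()
    feats = []
    for tok in history[-(order - 1):]:
        feats.append(tok.word)
        if level >= 1:
            feats.extend(tok.segments)
        if level >= 2:
            feats.append(tok.pos)
        if level >= 3:
            feats.append(tok.diacritized)
    return tuple(feats)


def perplexity(probs: Sequence[float]) -> float:
    """Perplexity from the model probability assigned to each test word."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative perplexity reduction, e.g. 0.35 for a 35% reduction."""
    return (baseline - improved) / baseline


# Usage with made-up, transliterated tokens and made-up perplexities.
history = [
    Token("wktb", ("w", "ktb"), "VERB", "wakataba"),
    Token("Alwld", ("Al", "wld"), "NOUN", "Alwaladu"),
]
context = enriched_context(history, order=3, level=2)
reduction = relative_reduction(200.0, 130.0)
```

In a neural network language model, a feature tuple like `context` would typically be mapped to embeddings and concatenated at the input layer; that step is omitted here.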
Pages: 829-832
Page count: 4