NOVEL TOPIC N-GRAM COUNT LM INCORPORATING DOCUMENT-BASED TOPIC DISTRIBUTIONS AND N-GRAM COUNTS

Cited by: 0
Authors
Haidar, Md. Akmal [1]
O'Shaughnessy, Douglas [1]
Affiliations
[1] EMT, INRS, 6900-800 De La Gauchetiere Ouest, Montreal, PQ H5A 1K6, Canada
Source
2014 PROCEEDINGS OF THE 22ND EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO) | 2014
Keywords
Statistical n-gram language model; speech recognition; mixture models; topic models;
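DOI
Not available
Chinese Library Classification number
TM [Electrical engineering]; TN [Electronics and communication technology];
Discipline classification code
0808; 0809;
Abstract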
In this paper, we introduce a novel topic n-gram count language model (NTNCLM) that uses the topic probabilities of training documents together with document-based n-gram counts. The topic probabilities of a document are computed by averaging the topic probabilities of the words seen in that document. For each document, these topic probabilities are multiplied by the document's n-gram counts, and the products are summed over all training documents. The resulting sums serve as the n-gram counts of the respective topics, from which the NTNCLMs are built. The NTNCLMs are adapted using the topic probabilities of a development test set, computed in the same way. We compare our approach with a recently proposed TNCLM [1], which does not take into account long-range information outside the n-gram events. On the Wall Street Journal (WSJ) corpus, our approach yields significant perplexity and word error rate (WER) reductions over that approach.
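The counting step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy word-topic table, the token-level averaging, the skipping of words without a topic estimate, and the uniform fallback are all assumptions made for illustration. The abstract does not say how the word topic probabilities are obtained, so they are simply given as input here.

```python
from collections import Counter, defaultdict


def document_topic_probs(doc_words, word_topic_probs, num_topics):
    """Average the per-word topic probabilities over the tokens of one document."""
    totals = [0.0] * num_topics
    seen = 0
    for word in doc_words:
        probs = word_topic_probs.get(word)
        if probs is None:
            # Assumption: words without a topic estimate are skipped.
            continue
        for k in range(num_topics):
            totals[k] += probs[k]
        seen += 1
    if seen == 0:
        # Assumption: fall back to a uniform distribution.
        return [1.0 / num_topics] * num_topics
    return [t / seen for t in totals]


def ngrams(words, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def topic_ngram_counts(docs, word_topic_probs, num_topics, n=2):
    """Sum, over documents, P(topic | doc) * count(n-gram in doc) for each topic."""
    counts = [defaultdict(float) for _ in range(num_topics)]
    for doc in docs:
        p_topic = document_topic_probs(doc, word_topic_probs, num_topics)
        for gram, c in Counter(ngrams(doc, n)).items():
            for k in range(num_topics):
                counts[k][gram] += p_topic[k] * c
    return counts


# Hypothetical toy data: two topics, three words, two documents.
word_topic_probs = {
    "stock": [0.9, 0.1],
    "market": [0.8, 0.2],
    "game": [0.1, 0.9],
}
docs = [["stock", "market", "stock"], ["game", "market"]]
topic_counts = topic_ngram_counts(docs, word_topic_probs, num_topics=2, n=2)
```

Because each document's topic probabilities sum to one, the fractional counts of an n-gram summed across topics equal its raw corpus count. Each per-topic count table could then be passed to any n-gram estimator that accepts fractional counts, which is outside the scope of this sketch.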
Pages: 2310-2314
Page count: 5
Related papers
26 in total
  • [1] [Anonymous], 1999, EUR C SPEECH COMMUN
  • [2] Exploiting latent semantic information in statistical language modeling
    Bellegarda, JR
    [J]. PROCEEDINGS OF THE IEEE, 2000, 88 (08) : 1279 - 1296
  • [3] Latent Dirichlet allocation
    Blei, DM
    Ng, AY
    Jordan, MI
    [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (4-5) : 993 - 1022
  • [4] DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391, DOI 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
  • [6] Garofolo J. S., 2013, CARNEGIE MELLON U CM
  • [7] Garofolo J. S., 1993, TIMIT ACOUSTIC PHONE
  • [8] Finding scientific topics
    Griffiths, TL
    Steyvers, M
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2004, 101 : 5228 - 5235
  • [9] Haidar MA, 2012, IEEE W SP LANG TECH, P165, DOI 10.1109/SLT.2012.6424216
  • [10] Haidar MA, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P2438