NOVEL TOPIC N-GRAM COUNT LM INCORPORATING DOCUMENT-BASED TOPIC DISTRIBUTIONS AND N-GRAM COUNTS

Cited by: 0
Authors
Haidar, Md. Akmal [1 ]
O'Shaughnessy, Douglas [1 ]
Affiliations
[1] EMT, INRS, 6900-800 De La Gauchetiere Ouest, Montreal, PQ H5A 1K6, Canada
Source
2014 Proceedings of the 22nd European Signal Processing Conference (EUSIPCO) | 2014
Keywords
Statistical n-gram language model; speech recognition; mixture models; topic models
DOI
Not available
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Classification Codes
0808; 0809
Abstract
In this paper, we introduce a novel topic n-gram count language model (NTNCLM) that uses the topic probabilities of training documents and document-based n-gram counts. The topic probabilities of a document are computed by averaging the topic probabilities of the words seen in that document. Each document's topic probabilities are multiplied by its n-gram counts, and the products are summed over all training documents; the sums serve as the counts of the respective topics from which the NTNCLMs are created. The NTNCLMs are then adapted using the topic probabilities of a development test set, computed in the same way. We compare our approach with a recently proposed TNCLM [1], which does not capture long-range information outside the n-gram events. On the Wall Street Journal (WSJ) corpus, our approach yields significant perplexity and word error rate (WER) reductions over that model.
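The counting scheme summarized in the abstract can be illustrated with a minimal Python sketch. All names here (topic_ngram_counts, word_topic_probs, and the toy data layout) are hypothetical, and the source of the per-word topic probabilities (e.g., an LDA-style topic model) is assumed rather than taken from the record; the sketch only mirrors the three steps described above: average the word topic probabilities per document, count n-grams per document, and accumulate topic-weighted counts over all documents.

from collections import defaultdict

def topic_ngram_counts(docs, word_topic_probs, num_topics, n=3):
    """Accumulate topic-dependent n-gram counts (illustrative sketch).

    docs: list of documents, each a list of word tokens.
    word_topic_probs: dict mapping a word to a length-num_topics list of
        P(topic | word) values, e.g., taken from a trained topic model.
    Returns: dict mapping topic index -> {n-gram tuple: fractional count}.
    """
    counts = {k: defaultdict(float) for k in range(num_topics)}
    for doc in docs:
        # Step 1: document topic probabilities = average of the topic
        # probabilities of the words seen in the document.
        doc_probs = [0.0] * num_topics
        known = 0
        for w in doc:
            if w in word_topic_probs:
                known += 1
                for k in range(num_topics):
                    doc_probs[k] += word_topic_probs[w][k]
        if known == 0:
            continue  # no topic information for this document
        doc_probs = [p / known for p in doc_probs]

        # Step 2: document-based n-gram counts.
        ngrams = defaultdict(int)
        for i in range(len(doc) - n + 1):
            ngrams[tuple(doc[i:i + n])] += 1

        # Step 3: weight the document's n-gram counts by its topic
        # probabilities and sum the products over all documents.
        for g, c in ngrams.items():
            for k in range(num_topics):
                counts[k][g] += doc_probs[k] * c
    return counts

The resulting per-topic counts would then be used to build the topic LMs. The adaptation step mentioned in the abstract, in which topic probabilities of a development test set are computed by the same averaging and used to adapt the NTNCLMs, is not sketched here.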
Pages: 2310-2314
Number of pages: 5
References (26 in total)
[21] Stolcke, A., 2002, Proceedings of INTERSPEECH, p. 901.
[22] Tam, Y.-C., 2005, Proceedings of INTERSPEECH, p. 5.
[23] Tam, Y.-C., 2006, INTERSPEECH 2006 and 9th International Conference on Spoken Language Processing, Vols. 1-5, p. 2206.
[24] Vertanen, K., 2013, HTK Wall Street Journal Training Recipe.
[25] Woodland, P. C., 1994, International Conference on Acoustics, Speech, and Signal Processing, p. 125.
[26] Young, S., 2013, The HTK Toolkit, version 3.4.1.