Unsupervised language model adaptation using LDA-based mixture models and latent semantic marginals

Cited by: 10
Authors
Haidar, Md. Akmal [1 ]
O'Shaughnessy, Douglas [1 ]
Affiliations
[1] INRS EMT, Montreal, PQ H5A 1K6, Canada
Keywords
Language model; Topic model; Mixture model; Speech recognition; Minimum discriminant information
DOI
10.1016/j.csl.2014.06.002
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we present unsupervised language model (LM) adaptation approaches using latent Dirichlet allocation (LDA) and latent semantic marginals (LSM). The LSM is the unigram probability distribution over words calculated from LDA-adapted unigram models. The LDA model extracts topic information from a training corpus in an unsupervised manner and yields a document-topic matrix that records how many words in each document are assigned to each topic. A hard-clustering method is applied to this matrix to form topics, and an adapted model is created as a weighted combination of the resulting n-gram topic models. The stand-alone adapted model outperforms the background model, and interpolating the background and adapted models gives a further improvement. We then modify these models using the LSM: the LSM forms a new adapted model through the minimum discriminant information (MDI) adaptation approach known as unigram scaling, which minimizes the distance between the new adapted model and the original model. Unigram scaling of the adapted model using the LSM yields better results than a conventional unigram scaling approach, and unigram scaling of the interpolated background-plus-adapted model using the LSM outperforms the background model, unigram scaling of the background model, unigram scaling of the adapted model, and the interpolation of the background and adapted models. We perform experiments on the '87-89 Wall Street Journal (WSJ) corpus with a multi-pass continuous speech recognition (CSR) system: in the first pass, the background n-gram language model generates lattices; in the second pass, the LM adaptation approaches rescore those lattices. (C) 2014 Elsevier Ltd. All rights reserved.
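The three combination steps the abstract describes (mixing topic n-gram models with weights, interpolating with the background model, and MDI unigram scaling against a marginal distribution) can be sketched on toy unigram models. This is an illustrative sketch only, not the paper's implementation; the function names, the toy probabilities, and the scaling exponent `beta` are assumptions for illustration.

```python
def mix_topic_models(topic_models, weights):
    """Adapted model as a weighted mixture: P_adapt(w) = sum_k w_k * P_k(w)."""
    vocab = topic_models[0].keys()
    return {w: sum(wt * m[w] for wt, m in zip(weights, topic_models))
            for w in vocab}

def interpolate(p_bg, p_adapt, lam):
    """Linear interpolation of the adapted model with the background model."""
    return {w: lam * p_adapt[w] + (1.0 - lam) * p_bg[w] for w in p_bg}

def unigram_scale(p_base, marginal, p_bg_unigram, beta=0.5):
    """MDI-style unigram scaling (sketch): reweight each probability by
    (marginal(w) / P_bg(w))**beta, then renormalize over the vocabulary.
    In the paper the marginal is the LSM; here it is a toy distribution."""
    scaled = {w: p_base[w] * (marginal[w] / p_bg_unigram[w]) ** beta
              for w in p_base}
    z = sum(scaled.values())  # normalization constant
    return {w: p / z for w, p in scaled.items()}
```

For example, mixing two toy topic models `{'a': 0.7, 'b': 0.3}` and `{'a': 0.2, 'b': 0.8}` with weights `[0.6, 0.4]` gives an adapted model that can then be interpolated and rescaled; each step returns a proper distribution summing to 1.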
Pages: 20-31
Page count: 12