Joint acoustic and language modeling for speech recognition

Cited by: 25
Authors
Chien, Jen-Tzung [1]
Chueh, Chuang-Hua [1]
Affiliation
[1] National Cheng Kung University, Department of Computer Science and Information Engineering, Tainan 70101, Taiwan
Keywords
Hidden Markov model; n-gram; Conditional random field; Maximum entropy; Discriminative training; Speech recognition
DOI
10.1016/j.specom.2009.10.003
CLC number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
In a traditional speech recognition system, the acoustic and linguistic information sources are assumed to be independent of each other: the parameters of the hidden Markov model (HMM) and of the n-gram language model are estimated separately and then combined for maximum a posteriori classification. However, speech features and lexical words are inherently correlated in natural language, so estimating the two models without coupling them is inefficient. This paper reports on joint acoustic and linguistic modeling for speech recognition, in which the acoustic evidence is used when estimating the linguistic model parameters, and vice versa, according to the maximum entropy (ME) principle. Discriminative ME (DME) models are built by exploiting features from competing sentences. Moreover, a mutual ME (MME) model is built for the sentence posterior probability, which is maximized to estimate the model parameters while characterizing the dependence between acoustic and linguistic features. An N-best Viterbi approximation is presented for implementing the DME and MME models. The new models are further incorporated with high-order feature statistics and word regularities. In the experiments, the proposed methods increase the sentence posterior probability or the model separation. Compared with separately estimated HMM and n-gram models, recognition errors are significantly reduced: from 32.2% to 27.4% on the MATBN corpus and from 5.4% to 4.8% on the WSJ corpus (5K condition). (C) 2009 Elsevier B.V. All rights reserved.
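As a rough sketch of the modeling idea summarized above (our restatement of the abstract, not the paper's exact formulation; the feature functions f_i and weights lambda_i below are generic placeholders), a conventional recognizer decodes with independently trained acoustic and language models, whereas a maximum-entropy formulation ties the two information sources together in a single log-linear sentence posterior:

\[
\hat{W} \;=\; \arg\max_{W}\; p_{\mathrm{HMM}}(X \mid W)\; p_{n\text{-}\mathrm{gram}}(W)
\qquad \text{(separate estimation, MAP decoding)}
\]
\[
p_{\Lambda}(W \mid X) \;=\; \frac{1}{Z_{\Lambda}(X)}\,
\exp\!\Big(\sum_{i} \lambda_i\, f_i(X, W)\Big),
\qquad
Z_{\Lambda}(X) \;=\; \sum_{W'} \exp\!\Big(\sum_{i} \lambda_i\, f_i(X, W')\Big)
\]

Here the feature functions f_i(X, W) can cover acoustic scores, n-gram word regularities, and joint acoustic-linguistic statistics, and the weights Lambda = {lambda_i} are estimated to raise the sentence posterior (discriminatively against competing sentences in the DME case, or via the mutual dependence between the two feature types in the MME case). Because the normalization sum over all candidate sentences W' is intractable, it is restricted in practice to the N best hypotheses from a first decoding pass, which is how we read the N-best Viterbi approximation mentioned in the abstract.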
Pages: 223-235
Number of pages: 13