Bayesian Unsupervised Batch and Online Speaker Adaptation of Activation Function Parameters in Deep Models for Automatic Speech Recognition

Cited by: 11
Authors
Huang, Zhen [1 ]
Siniscalchi, Sabato Marco [2 ,3 ]
Lee, Chin-Hui [1 ]
Affiliations
[1] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA
[2] Univ Enna Kore, Fac Engn & Architecture, I-94100 Enna, Italy
[3] Georgia Inst Technol, Atlanta, GA 30332 USA
Keywords
Automatic speech recognition; Bayesian learning; deep neural networks; online adaptation; prior evolution; transfer learning; unsupervised speaker adaptation; HIDDEN MARKOV-MODELS; NEURAL-NETWORK; TRANSFORMATIONS;
DOI
10.1109/TASLP.2016.2621669
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
We present a Bayesian framework to obtain maximum a posteriori (MAP) estimates of a small set of hidden activation function parameters in context-dependent deep neural network hidden Markov model (CD-DNN-HMM)-based automatic speech recognition (ASR) systems. When applied to speaker adaptation, we aim at transferring knowledge from a well-trained deep model for "general" use to a "personalized" model geared toward a particular talker by using a collection of speaker-specific data. To make the framework applicable to practical situations, we perform adaptation in an unsupervised manner, assuming that transcriptions of the adaptation utterances are not readily available to the ASR system. We conduct a series of comprehensive batch adaptation experiments on the Switchboard ASR task and show that the proposed approach is effective even with CD-DNN-HMMs built with discriminative sequential training. Indeed, MAP speaker adaptation reduces the word error rate (WER) from an initial 21.9% to 20.1% on the full NIST 2000 Hub5 benchmark test set. Moreover, MAP speaker adaptation compares favorably with other techniques evaluated on the same speech tasks. We also demonstrate its complementarity to other approaches by applying MAP adaptation to CD-DNN-HMMs trained with speaker-adaptive features generated through constrained maximum likelihood linear regression, further reducing the WER to 18.6%. Leveraging the intrinsic recursive nature of Bayesian adaptation and mitigating possible system constraints on batch learning, we also propose an incremental approach to unsupervised online speaker adaptation that sequentially updates both the hyperparameters of the approximate posterior densities and the DNN parameters. The advantage of such a sequential learning algorithm over a batch method lies not necessarily in the final performance, but in computational efficiency and reduced storage needs, without having to wait for all the data to be processed. So far, the experimental results are promising.
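The abstract describes adapting only a small set of activation function parameters under a Gaussian prior centered on the speaker-independent values. The following is a minimal, hypothetical sketch of that idea (not the authors' implementation): per-unit slope parameters of a parameterized sigmoid are updated by gradient descent on a cross-entropy loss plus a prior-deviation penalty, while all DNN weight matrices stay fixed. All names (`map_adapt_slopes`, `tau`, etc.) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_adapt_slopes(X, Y, W1, W2, alpha0, tau=1.0, lr=0.1, steps=100):
    """Hedged sketch of MAP adaptation of per-unit activation slopes.

    Hidden activation: h = sigmoid(alpha * (x @ W1)), with one slope per
    hidden unit. Only `alpha` is adapted; W1, W2 are frozen.
    Objective: cross-entropy(softmax(h @ W2), Y) + (tau/2)*||alpha - alpha0||^2,
    i.e., a Gaussian prior with mean alpha0 (the speaker-independent slopes).
    """
    alpha = alpha0.copy()
    for _ in range(steps):
        Z = X @ W1                      # pre-activations, shape (N, H)
        H = sigmoid(alpha * Z)          # slope-scaled sigmoid outputs
        logits = H @ W2                 # (N, C)
        P = np.exp(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        dlogits = (P - Y) / len(X)      # softmax cross-entropy gradient
        dH = dlogits @ W2.T
        # Chain rule through sigmoid(alpha*Z), plus the prior-penalty term:
        dalpha = np.sum(dH * H * (1.0 - H) * Z, axis=0) + tau * (alpha - alpha0)
        alpha -= lr * dalpha
    return alpha
```

With little adaptation data, a larger `tau` keeps the adapted slopes close to the prior mean, which is the usual way a MAP formulation guards against overfitting to a few speaker-specific utterances.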
Pages: 64-75
Number of pages: 12