Hermitian Polynomial for Speaker Adaptation of Connectionist Speech Recognition Systems

Cited by: 60
Authors
Siniscalchi, Sabato Marco [1, 2]
Li, Jinyu [3]
Lee, Chin-Hui [2]
Affiliations
[1] Kore Univ Enna, Dept Comp Engn, I-94100 Enna, Italy
[2] Georgia Inst Technol, Dept Elect & Comp Engn, Atlanta, GA 30332 USA
[3] Microsoft Corp, Redmond, WA 98052 USA
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2013 / Vol. 21 / No. 10
Keywords
Artificial neural networks; model adaptation; speech processing; DEEP NEURAL-NETWORKS; MIXTURE OBSERVATIONS; MAXIMUM; TRANSFORMATIONS;
DOI
10.1109/TASL.2013.2270370
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206; 082403;
Abstract
Model adaptation techniques are an efficient way to reduce the mismatch that typically occurs between the training and test conditions of any automatic speech recognition (ASR) system. This work addresses the problem of increased degradation in performance when moving from speaker-dependent (SD) to speaker-independent (SI) conditions for connectionist (or hybrid) hidden Markov model/artificial neural network (HMM/ANN) systems in the context of large vocabulary continuous speech recognition (LVCSR). Adapting hybrid HMM/ANN systems on a small amount of adaptation data has proven to be a difficult task and has been a limiting factor in the widespread deployment of hybrid techniques in operational ASR systems. Addressing the crucial issue of speaker adaptation (SA) for hybrid HMM/ANN systems can therefore have a great impact on the connectionist paradigm, which will play a major role in the design of next-generation LVCSR systems, considering the great success reported on many speech tasks by deep neural networks (ANNs with many hidden layers that adopt the pre-training technique). Current adaptation techniques for ANNs, based on injecting an adaptable linear transformation network connected to either the input or the output layer, are not effective, especially with a small amount of adaptation data, e.g., a single adaptation utterance. In this paper, a novel solution is proposed to overcome those limits and make adaptation robust to scarce adaptation resources. The key idea is to adapt the hidden activation functions rather than the network weights. The adoption of Hermitian activation functions makes this possible. Experimental results on an LVCSR task demonstrate the effectiveness of the proposed approach.
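The abstract states the key idea (adapt the hidden activation functions, not the network weights) without giving a formula. The following is a minimal Python sketch of one way that idea could look, under stated assumptions: each hidden unit's activation is taken to be a weighted sum of orthonormal Hermite functions, and only those per-unit coefficients are updated on the adaptation data while the weights stay frozen. The function names, the toy squared-error objective, the network sizes, and the learning rate are all hypothetical choices for illustration and are not taken from the paper.

```python
import math
import numpy as np


def hermite_basis(z, degree):
    """Orthonormal Hermite functions h_0..h_degree at z; shape z.shape + (degree + 1,)."""
    z = np.asarray(z, dtype=float)
    # Physicists' Hermite polynomials via H_{n+1} = 2 z H_n - 2 n H_{n-1}.
    H = [np.ones_like(z), 2.0 * z]
    for n in range(1, degree):
        H.append(2.0 * z * H[n] - 2.0 * n * H[n - 1])
    gauss = np.exp(-0.5 * z ** 2)
    basis = [
        H[n] * gauss / math.sqrt(2.0 ** n * math.factorial(n) * math.sqrt(math.pi))
        for n in range(degree + 1)
    ]
    return np.stack(basis, axis=-1)


def hidden_forward(x, W, b, coeffs):
    """Hidden activations a[b, h] = sum_k coeffs[h, k] * h_k(z[b, h]), with z = x W + b."""
    z = x @ W + b
    phi = hermite_basis(z, coeffs.shape[1] - 1)
    return np.einsum("bhk,hk->bh", phi, coeffs), phi


def loss_and_coeff_grad(x, t, W, b, coeffs, V):
    """Toy squared-error loss (an assumption) and its gradient w.r.t. the coefficients only."""
    a, phi = hidden_forward(x, W, b, coeffs)
    err = a @ V - t
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    grad_a = (err / x.shape[0]) @ V.T
    grad_c = np.einsum("bh,bhk->hk", grad_a, phi)
    return loss, grad_c


def adapt_coeffs(x, t, W, b, coeffs, V, lr=0.01, steps=50):
    """Gradient descent on the activation coefficients; W, b and V are never modified."""
    coeffs = coeffs.copy()
    for _ in range(steps):
        _, grad_c = loss_and_coeff_grad(x, t, W, b, coeffs, V)
        coeffs -= lr * grad_c
    return coeffs


# Toy "adaptation set": random data standing in for a single short utterance.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3))
t = rng.standard_normal((8, 2))
W = rng.standard_normal((3, 4))        # frozen input-to-hidden weights
b = np.zeros(4)                        # frozen hidden biases
V = 0.5 * rng.standard_normal((4, 2))  # frozen hidden-to-output weights
coeffs0 = 0.1 * rng.standard_normal((4, 4))  # 4 hidden units, Hermite degree 3

W_snapshot = W.copy()
loss_before, _ = loss_and_coeff_grad(x, t, W, b, coeffs0, V)
coeffs_adapted = adapt_coeffs(x, t, W, b, coeffs0, V)
loss_after, _ = loss_and_coeff_grad(x, t, W, b, coeffs_adapted, V)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

In this sketch the number of adapted parameters is the number of hidden units times the number of Hermite terms, far fewer than the weight matrices. That is one plausible reading of why adapting activations could suit very small adaptation sets, though the paper's actual parameterisation and training criterion may differ.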
Pages: 2152-2161
Number of pages: 10