SPEAKER ADAPTIVE TRAINING FOR DEEP NEURAL NETWORKS EMBEDDING LINEAR TRANSFORMATION NETWORKS

Cited by: 0
Authors
Ochiai, Tsubasa [1 ,2 ]
Matsuda, Shigeki [2 ]
Watanabe, Hideyuki [1 ]
Lu, Xugang [1 ]
Hori, Chiori [1 ]
Katagiri, Shigeru [2 ]
Affiliations
[1] Natl Inst Informat & Commun Technol, Kyoto, Japan
[2] Doshisha Univ, Grad Sch Engn, Kyoto, Japan
Source
2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP) | 2015
Keywords
Speaker Adaptive Training; Deep Neural Network; Linear Transformation Network;
DOI
Not available
Chinese Library Classification
O42 [Acoustics];
Discipline Classification Codes
070206; 082403;
Abstract
Recently, a novel speaker adaptation method was proposed that applied the Speaker Adaptive Training (SAT) concept to a speech recognizer consisting of a Deep Neural Network (DNN) and a Hidden Markov Model (HMM), and its utility was demonstrated. This method implements the SAT scheme by allocating one Speaker Dependent (SD) module for each training speaker to one of the intermediate layers of the front-end DNN. It then jointly optimizes the SD modules and the remaining part of the network, which is shared by all speakers. In this paper, we propose an improved version of this SAT-based adaptation scheme for a DNN-HMM recognizer. Our new training adopts a Linear Transformation Network (LTN) as the SD module; employing an LTN enables more appropriate regularization in both the SAT and adaptation stages by replacing the empirically selected network anchorage used for regularization in the preceding SAT-DNN-HMM with a SAT-optimized anchorage. We demonstrate the effectiveness of our proposed method on TED Talks corpus data. Our experimental results show that a speaker-adapted recognizer using our method achieves a significant word error rate reduction of 9.2 points from a baseline SI-DNN recognizer and also steadily outperforms speaker-adapted recognizers, each of which originates from the preceding SAT-based DNN-HMM.
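The architecture the abstract describes can be illustrated with a minimal sketch: a shared feed-forward network with a per-speaker linear transformation inserted at an intermediate layer. This is an assumption-laden toy (NumPy, arbitrary layer sizes, forward pass only, no HMM back end or training loop), not the paper's implementation; names like `forward` and `ltn` are hypothetical. Initializing each speaker's transform to the identity, as is common for linear-transformation adaptation layers, makes the SD module a no-op before training.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Shared (speaker-independent) DNN weights: two hidden layers.
D, H, C = 8, 16, 4                       # input dim, hidden width, output classes
W1 = rng.standard_normal((H, D)) * 0.1
W2 = rng.standard_normal((H, H)) * 0.1
W_out = rng.standard_normal((C, H)) * 0.1

# One speaker-dependent linear transformation per training speaker,
# inserted after the first hidden layer.  Identity initialization means
# the SAT model starts from the speaker-independent network.
n_speakers = 3
ltn = {s: np.eye(H) for s in range(n_speakers)}

def forward(x, speaker):
    """Forward pass: shared layers plus the given speaker's own
    linear transformation at the intermediate layer."""
    h1 = relu(W1 @ x)
    h1 = ltn[speaker] @ h1               # speaker-dependent linear transform
    h2 = relu(W2 @ h1)
    logits = W_out @ h2
    e = np.exp(logits - logits.max())
    return e / e.sum()                   # softmax posteriors

x = rng.standard_normal(D)
p = forward(x, speaker=0)
print(p.sum())                           # posteriors sum to 1
```

In joint SAT optimization, the gradients for each `ltn[s]` would be accumulated only from that speaker's utterances, while `W1`, `W2`, and `W_out` receive gradients from all speakers; at adaptation time, only the new speaker's transform is estimated.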
Pages: 4605-4609
Page count: 5