Context adaptive neural network for rapid adaptation of deep CNN based acoustic models

Cited by: 9

Authors:

Delcroix, Marc [1]

Kinoshita, Keisuke [1]

Ogawa, Atsunori [1]

Yoshioka, Takuya [1]

Tran, Dung [1]

Nakatani, Tomohiro [1]

Affiliations:

[1] NTT Corp, NTT Commun Sci Labs, 2-4 Hikaridai, Seika Cho, Keihanna Sci City, Kyoto 6190237, Japan

Source:

17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016

Keywords:

Acoustic model adaptation; context adaptive network; auxiliary features; deep convolutional neural networks

DOI:

10.21437/Interspeech.2016-203

Chinese Library Classification (CLC):

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

Using auxiliary input features has been seen as one of the most effective ways to adapt deep neural network (DNN)-based acoustic models to a speaker or environment. However, this approach has several limitations. It only compensates the bias term of the hidden layer and therefore does not fully exploit the network's capabilities. Moreover, it may not be well suited to certain types of architectures, such as convolutional neural networks (CNNs), because the auxiliary features have different time-frequency structures from speech features. This paper resolves these problems by extending the recently proposed context adaptive DNN (CA-DNN) framework to CNN architectures. A CA-DNN is a DNN with one or several layers factorized into sub-layers, each associated with an acoustic context class representing a speaker or environment. The output of the factorized layer is obtained as the weighted sum of the contributions of each sub-layer, weighted by acoustic context weights that are derived from auxiliary features such as i-vectors. Importantly, a CA-DNN can compensate both the bias terms and the weight matrices. In this paper, we investigate the use of CA-DNN for deep CNN-based architectures. We demonstrate consistent performance gains for utterance-level rapid adaptation on the AURORA4 task over a strong network-in-network-based deep CNN architecture.
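
Note: the factorized layer described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a fully connected layer (not the CNN variant the paper studies), a single affine-plus-softmax mapping from auxiliary features to context weights, and no nonlinearity; all function and parameter names (`context_weights`, `factorized_layer`, `aux_weights`, `aux_bias`) are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affine(weights, bias, x):
    """Compute W x + b for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def context_weights(aux_features, aux_weights, aux_bias):
    """Map auxiliary features (e.g. an i-vector) to one weight per context class.

    Assumption: a single affine layer followed by a softmax.
    """
    return softmax(affine(aux_weights, aux_bias, aux_features))


def factorized_layer(x, sub_layers, alphas):
    """Weighted sum of sub-layer outputs: sum_k alpha_k * (W_k x + b_k).

    sub_layers is a list of (W_k, b_k) pairs, one per context class.
    """
    out = [0.0] * len(sub_layers[0][1])
    for (weights, bias), alpha in zip(sub_layers, alphas):
        contribution = affine(weights, bias, x)
        out = [o + alpha * c for o, c in zip(out, contribution)]
    return out


# Toy example with two context classes and a 2-dimensional input.
x = [1.0, 2.0]
sub_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # sub-layer for class 0
    ([[0.0, 1.0], [1.0, 0.0]], [1.0, 1.0]),  # sub-layer for class 1
]
alphas = context_weights([0.3, -0.1], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
y = factorized_layer(x, sub_layers, alphas)
```

Because the sum is linear, the result equals applying one layer whose weight matrix is sum_k alpha_k W_k and whose bias is sum_k alpha_k b_k, which is why the context weights affect both the weight matrix and the bias rather than the bias alone.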


Pages: 1573-1577

Page count: 5
