Single-channel Dereverberation for Distant-Talking Speech Recognition by Combining Denoising Autoencoder and Temporal Structure Normalization

Cited by: 7

Authors:

Ueda, Yuma ^{[1]}

Wang, Longbiao ^{[2]}

Kai, Atsuhiko ^{[1]}

Xiao, Xiong ^{[3]}

Chng, Eng Siong ^{[4]}

Li, Haizhou ^{[5]}

Affiliations:

[1] Shizuoka Univ, Grad Sch Engn, Hamamatsu, Shizuoka 4328561, Japan

[2] Nagaoka Univ Technol, Nagaoka, Niigata 9402188, Japan

[3] Nanyang Technol Univ, Temasek Labs NTU, Singapore 138632, Singapore

[4] Nanyang Technol Univ, Sch Comp Engn, Singapore 138632, Singapore

[5] ASTAR, Inst Infocomm Res, Human Language Technol, Singapore 138632, Singapore

Source:

JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY | 2016 / Vol. 82 / No. 2

Keywords:

Speech recognition; Dereverberation; Denoising autoencoder; Environment adaptation; Distant-talking speech; SPECTRAL SUBTRACTION; REVERBERATION; ADAPTATION; ALGORITHM; DOMAIN; NOISE; MODEL;

DOI:

10.1007/s11265-015-1007-3

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

In this paper, we propose a robust distant-talking speech recognition method that combines a cepstral-domain denoising autoencoder (DAE) with a temporal structure normalization (TSN) filter. Because a DAE has a deep structure and nonlinear processing steps, it is flexible enough to model a highly nonlinear mapping between the input and output spaces. We train a DAE to map reverberant and noisy speech features to the underlying clean speech features in the cepstral domain. In the proposed method, the DAE is first applied in the cepstral domain to suppress reverberation; a TSN filter is then applied as a post-processing step to reduce the remaining noise and reverberation effects by normalizing the modulation spectra of the features to reference spectra of clean speech. The proposed method was evaluated using speech in simulated and real reverberant environments. By combining the cepstral-domain DAE and TSN, the average word error rate (WER) was reduced from 25.2 % for the baseline system to 21.2 % in simulated environments, and from 47.5 % to 41.3 % in real environments.
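The WER figures in the abstract are absolute values; the relative reduction they imply can be worked out directly. The following is a minimal sketch of that arithmetic only. The helper name `relative_wer_reduction` and the `results` dictionary are illustrative assumptions, not part of the paper.

```python
def relative_wer_reduction(baseline_wer, proposed_wer):
    """Return the relative WER reduction in percent.

    Hypothetical helper for illustration; both arguments are WERs in percent.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# (baseline WER, DAE + TSN WER) in percent, as reported in the abstract.
results = {
    "simulated": (25.2, 21.2),
    "real": (47.5, 41.3),
}

for env, (base, prop) in results.items():
    absolute = base - prop
    relative = relative_wer_reduction(base, prop)
    print(f"{env}: {absolute:.1f} points absolute, {relative:.1f} % relative")
```

Under these numbers, the absolute reductions of 4.0 and 6.2 points correspond to relative reductions of roughly 15.9 % (simulated) and 13.1 % (real).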

Pages: 151-161

Number of pages: 11

17 items in total

[1] Single-channel dereverberation for distant-talking speech recognition by combining denoising autoencoder and temporal structure normalization
Ueda, Yuma
Wang, Longbiao
Kai, Atsuhiko
Xiao, Xiong
Chng, Eng Siong
Li, Haizhou
[J]. 2014 9TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2014, : 379 - +
[2] Single-channel Dereverberation for Distant-Talking Speech Recognition by Combining Denoising Autoencoder and Temporal Structure Normalization
Yuma Ueda
Longbiao Wang
Atsuhiko Kai
Xiong Xiao
Eng Siong Chng
Haizhou Li
[J]. Journal of Signal Processing Systems, 2016, 82 : 151 - 161
[3] Environment-dependent denoising autoencoder for distant-talking speech recognition
Ueda, Yuma
Wang, Longbiao
Kai, Atsuhiko
Ren, Bo
[J]. EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, 2015,
[4] Environment-dependent denoising autoencoder for distant-talking speech recognition
Yuma Ueda
Longbiao Wang
Atsuhiko Kai
Bo Ren
[J]. EURASIP Journal on Advances in Signal Processing, 2015
[5] Deep neural network-based bottleneck feature and denoising autoencoder-based dereverberation for distant-talking speaker identification
Zhang, Zhaofeng
Wang, Longbiao
Kai, Atsuhiko
Yamada, Takanori
Li, Weifeng
Iwahashi, Masahiro
[J]. EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2015,
[6] Denoising autoencoder and environment adaptation for distant-talking speech recognition with asynchronous speech recording
Wang, Longbiao
Ren, Bo
Ueda, Yuma
Kai, Atsuhiko
Teraoka, Shunta
Fukushima, Taku
[J]. 2014 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA), 2014,
[7] Combination of bottleneck feature extraction and dereverberation for distant-talking speech recognition
Bo Ren
Longbiao Wang
Liang Lu
Yuma Ueda
Atsuhiko Kai
[J]. Multimedia Tools and Applications, 2016, 75 : 5093 - 5108
[8] Combination of bottleneck feature extraction and dereverberation for distant-talking speech recognition
Ren, Bo
Wang, Longbiao
Lu, Liang
Ueda, Yuma
Kai, Atsuhiko
[J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2016, 75 (09) : 5093 - 5108
[9] Single-channel dereverberation by feature mapping using cascade neural networks for robust distant speaker identification and speech recognition
Nugraha, Aditya Arie
Yamamoto, Kazumasa
Nakagawa, Seiichi
[J]. EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2014,
[10] MODEL-BASED DEREVERBERATION IN THE LOGMELSPEC DOMAIN FOR ROBUST DISTANT-TALKING SPEECH RECOGNITION
Sehr, Armin
Maas, Roland
Kellermann, Walter
[J]. 2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 4298 - 4301