Modelling non-stationary noise with spectral factorisation in automatic speech recognition

Cited by: 16
Authors
Hurmalainen, Antti [1]
Gemmeke, Jort F. [2]
Virtanen, Tuomas [1]
Affiliations
[1] Tampere University of Technology, Department of Signal Processing, FI-33101 Tampere, Finland
[2] Katholieke Universiteit Leuven, Department ESAT-PSI, B-3001 Louvain, Belgium
Funding
Academy of Finland
Keywords
Automatic speech recognition; Noise robustness; Non-stationary noise; Non-negative spectral factorisation; Exemplar-based; Nonnegative matrix factorization; Separation; Algorithms
DOI
10.1016/j.csl.2012.07.008
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Speech recognition systems intended for everyday use must be able to cope with a large variety of noise types and levels, including highly non-stationary multi-source mixtures. This study applies spectral factorisation algorithms and long temporal context for separating speech and noise from mixed signals. To adapt the system to varying environments, noise models are acquired from the context, or learnt from the mixture itself without prior information. We also propose methods for reducing the size of the bases used for speech and noise modelling by 20-40 times for better practical applicability. We evaluate the performance of the methods both as a standalone classifier and as a signal-enhancing front-end for external recognisers. For the CHiME noisy speech corpus containing non-stationary multi-source household noises at signal-to-noise ratios ranging from +9 to -6 dB, we report average keyword recognition rates up to 87.8% using a single-stream sparse classification algorithm. (c) 2012 Elsevier Ltd. All rights reserved.
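The paper itself provides no code, but as a rough illustration of the kind of non-negative spectral factorisation the abstract refers to, the sketch below factorises a mixture magnitude spectrogram over fixed speech and noise bases with standard multiplicative Kullback-Leibler updates and recovers the speech via a Wiener-style mask. All names and sizes here (separate_speech, B_speech, B_noise, n_iter, the toy dimensions) are hypothetical, and this is a generic single-frame NMF sketch, not the paper's method: the exemplar-based system described in the abstract additionally stacks long temporal context windows, uses sparse activations, and acquires or learns its noise bases from the context or the mixture itself.

```python
import numpy as np

def separate_speech(X, B_speech, B_noise, n_iter=200, eps=1e-12):
    """Factorise a mixture magnitude spectrogram X (freq x frames) over
    fixed speech and noise bases and return a masked speech estimate.
    A generic sketch, not the paper's exemplar-based system."""
    B = np.concatenate([B_speech, B_noise], axis=1)   # joint basis [speech | noise]
    k_s = B_speech.shape[1]                           # number of speech atoms
    A = np.random.rand(B.shape[1], X.shape[1])        # non-negative activations

    # Standard multiplicative updates minimising the generalised
    # Kullback-Leibler divergence D(X || B @ A); A stays non-negative.
    for _ in range(n_iter):
        V = B @ A + eps
        A *= (B.T @ (X / V)) / (B.T @ np.ones_like(X) + eps)

    S = B_speech @ A[:k_s]          # speech-only part of the reconstruction
    N = B_noise @ A[k_s:]           # noise-only part of the reconstruction
    return X * S / (S + N + eps)    # Wiener-style mask applied to the mixture

# Toy usage with random non-negative data: 257 frequency bins, 100 frames,
# 40 speech atoms and 20 noise atoms (all sizes are arbitrary examples).
X = np.abs(np.random.randn(257, 100))
speech_estimate = separate_speech(X,
                                  np.abs(np.random.randn(257, 40)),
                                  np.abs(np.random.randn(257, 20)))
```

In the paper, the speech basis is built from exemplar spectrogram segments spanning several consecutive frames rather than single frames, and the bases are further compacted by the proposed size-reduction methods; the sketch above only illustrates the underlying factorisation and masking step.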
Pages: 763-779 (17 pages)