Normalization of the Speech Modulation Spectra for Robust Speech Recognition

Cited by: 45
Authors
Xiao, Xiong [1]
Chng, Eng Siong [1]
Li, Haizhou [1,2]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Engn, Singapore 639798, Singapore
[2] Inst Infocomm Res, Singapore 119613, Singapore
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2008, Vol. 16, No. 8
Keywords
Aurora task; feature normalization; modulation spectrum; robust speech recognition; temporal filter
DOI
10.1109/TASL.2008.2002082
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
In this paper, we study a novel technique that normalizes the modulation spectra of speech signals for robust speech recognition. The modulation spectra of a speech signal are the power spectral density (PSD) functions of the feature trajectories generated from the signal; hence, they describe the temporal structure of the features. The modulation spectra are distorted when the speech signal is corrupted by noise. We propose the temporal structure normalization (TSN) filter to reduce the effects of noise by normalizing the modulation spectra to reference spectra. The TSN filter differs from other feature normalization methods, such as histogram equalization (HEQ), which normalize only the probability distributions of the speech features. Our previous work showed promising results for TSN on the small-vocabulary Aurora-2 task. In this paper, we examine theoretical and practical issues of the TSN filter, as follows. 1) We investigate the effects of noise on the speech modulation spectra and show the general characteristics of noisy speech modulation spectra. These observations help to further explain and justify the TSN filter. 2) We evaluate the TSN filter on the Aurora-4 task and demonstrate its effectiveness on a large-vocabulary task. 3) We propose a segment-based implementation of the TSN filter that significantly reduces the processing delay without affecting performance. Overall, the TSN filter produces significant improvements over the baseline systems and delivers competitive results compared with other state-of-the-art temporal filters.
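The abstract defines a modulation spectrum as the PSD of a feature trajectory and describes TSN as normalizing that spectrum to a reference. The following is a minimal sketch of that idea only, not the authors' filter design, which the abstract does not specify. It assumes a plain periodogram as the PSD estimate and a per-bin gain equal to the square root of the reference-to-test PSD ratio; the function names, the `floor` parameter, and the synthetic trajectories are all hypothetical.

```python
import cmath
import math
import random

def modulation_spectrum(trajectory):
    """Periodogram PSD estimate of one feature trajectory (bins 0..N/2).

    Assumption: the mean is removed first so the DC bin does not dominate.
    """
    n = len(trajectory)
    mean = sum(trajectory) / n
    centered = [v - mean for v in trajectory]
    psd = []
    for k in range(n // 2 + 1):
        # Naive DFT coefficient for modulation-frequency bin k.
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        psd.append(abs(coeff) ** 2 / n)
    return psd

def normalizing_gain(ref_psd, test_psd, floor=1e-10):
    """Per-bin magnitude gain that maps the test PSD onto the reference PSD.

    Assumption: gain = sqrt(P_ref / P_test), with a floor to avoid
    dividing by a near-zero bin.
    """
    return [math.sqrt(r / max(p, floor)) for r, p in zip(ref_psd, test_psd)]

# Synthetic stand-ins for a clean (reference) and a noisy (test) trajectory.
rng = random.Random(0)
N = 64
clean = [math.sin(2 * math.pi * 4 * t / N) + 0.1 * rng.gauss(0, 1)
         for t in range(N)]
noisy = [0.5 * c + 0.3 * rng.gauss(0, 1) for c in clean]

ref_psd = modulation_spectrum(clean)
test_psd = modulation_spectrum(noisy)
gain = normalizing_gain(ref_psd, test_psd)

# Applying the gain scales the PSD by gain**2, which recovers the reference.
normalized_psd = [g * g * p for g, p in zip(gain, test_psd)]
```

In this sketch the reference comes from a single synthetic trajectory purely for illustration; where the reference spectra come from in the paper is not stated in the abstract.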
Pages: 1662-1674
Number of pages: 13