Frequency warping for VTLN and speaker adaptation by linear transformation of standard MFCC

Cited by: 27
Authors
Panchapagesan, Sankaran [1]
Alwan, Abeer [1]
Affiliations
[1] Univ Calif Los Angeles, Henry Samueli Sch Engn & Appl Sci, Dept Elect Engn, Los Angeles, CA 90095 USA
Keywords
Automatic speech recognition; Speaker normalization; VTLN; Frequency warping; Linear transformation; Speaker adaptation; MAXIMUM-LIKELIHOOD; SPEECH;
DOI
10.1016/j.csl.2008.02.003
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Vocal tract length normalization (VTLN) for standard filterbank-based Mel frequency cepstral coefficient (MFCC) features is usually implemented by warping the center frequencies of the Mel filterbank, and the warping factor is estimated using the maximum likelihood score (MLS) criterion. A linear transform (LT) equivalent for frequency warping (FW) would enable more efficient MLS estimation. We recently proposed a novel LT to perform FW for VTLN and model adaptation with standard MFCC features. In this paper, we present the mathematical derivation of the LT and give a compact formula to calculate it for any FW function. We also show that our LT is closely related to different LTs previously proposed for FW with cepstral features, and these LTs for FW are all shown to be numerically almost identical for the sine-log all-pass transform (SLAPT) warping functions. Our formula for the transformation matrix is, however, computationally simpler and, unlike other previous LT approaches to VTLN with MFCC features, no modification of the standard MFCC feature extraction scheme is required. In VTLN and speaker adaptive modeling (SAM) experiments with the DARPA resource management (RM1) database, the performance of the new LT was comparable to that of regular VTLN implemented by warping the Mel filterbank, when the MLS criterion was used for FW estimation. This demonstrates that the approximations involved do not lead to any performance degradation. Performance comparable to front end VTLN was also obtained with LT adaptation of HMM means in the back end, combined with mean bias and variance adaptation according to the maximum likelihood linear regression (MLLR) framework. The FW methods performed significantly better than standard MLLR for very limited adaptation data (1 utterance), and were equally effective with unsupervised parameter estimation. We also performed speaker adaptive training (SAT) with feature space LT denoted CLTFW. Global CLTFW SAT gave results comparable to SAM and VTLN. By estimating multiple CLTFW transforms using a regression tree, and including an additive bias, we obtained significantly improved results compared to VTLN, with increasing adaptation data. © 2008 Elsevier Ltd. All rights reserved.
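The idea in the abstract of replacing filterbank warping with a linear transform of the cepstra, and then choosing the warping factor by maximum likelihood, can be illustrated with a minimal sketch. It is not the paper's formula: it assumes an orthonormal DCT-II, a piecewise-linear warp on a normalized frequency axis rather than the Mel axis, a single diagonal-covariance Gaussian in place of HMM scoring, and it omits any Jacobian term. All function names, the knee position `u0`, and the sizes (13 cepstra, 26 channels) are hypothetical choices made only for illustration.

```python
import numpy as np


def piecewise_linear_warp(u, alpha, u0=0.8):
    """Warp normalized frequency u in [0, 1]: slope alpha below u0,
    then a straight line to (1, 1) so the band edge stays fixed."""
    u = np.asarray(u, dtype=float)
    upper = alpha * u0 + (1.0 - alpha * u0) * (u - u0) / (1.0 - u0)
    return np.where(u <= u0, alpha * u, upper)


def warp_matrix(n_ceps, n_chan, alpha):
    """Cepstral-domain matrix T = DCT @ (IDCT evaluated at warped frequencies).

    With alpha = 1 the warped IDCT is the ordinary one and T is the identity.
    """
    u = (np.arange(n_chan) + 0.5) / n_chan            # channel centers in [0, 1]
    k = np.arange(n_ceps)[:, None]                    # cepstral indices
    scale = np.full((n_ceps, 1), np.sqrt(2.0 / n_chan))
    scale[0, 0] = np.sqrt(1.0 / n_chan)               # orthonormal DCT-II scaling
    dct = scale * np.cos(np.pi * k * u[None, :])      # n_ceps x n_chan
    warped_u = piecewise_linear_warp(u, alpha)
    warped_idct = (scale * np.cos(np.pi * k * warped_u[None, :])).T  # n_chan x n_ceps
    return dct @ warped_idct                          # n_ceps x n_ceps


def estimate_alpha(frames, mean, var, grid, n_chan=26):
    """Grid search: return the alpha whose warped frames score the highest
    log-likelihood under a diagonal Gaussian (constant terms dropped)."""
    n_ceps = frames.shape[1]
    scores = []
    for alpha in grid:
        warped = frames @ warp_matrix(n_ceps, n_chan, alpha).T
        scores.append(-0.5 * np.sum((warped - mean) ** 2 / var))
    return float(grid[int(np.argmax(scores))])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mean = rng.normal(size=13)
    frames = mean + 0.1 * rng.normal(size=(50, 13))   # synthetic "cepstra"
    grid = np.linspace(0.8, 1.2, 21)
    print("estimated alpha:", estimate_alpha(frames, mean, np.ones(13), grid))
```

Because the warp acts on already-extracted cepstra, each candidate factor costs one matrix product per utterance instead of a fresh feature extraction, which is the efficiency argument the abstract makes for a linear-transform equivalent.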
Pages: 42-64
Number of pages: 23