Single and multiple F0 contour estimation through parametric spectrogram modeling of speech in noisy environments

Cited by: 29
Authors
Le Roux, Jonathan [1]
Kameoka, Hirokazu
Ono, Nobutaka
de Cheveigne, Alain
Sagayama, Shigeki
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo 1138656, Japan
[2] Univ Paris 05, CNRS, F-75230 Paris 05, France
[3] Ecole Normale Super, F-75230 Paris 05, France
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2007 / Vol. 15 / No. 04
Funding
Japan Science and Technology Agency;
Keywords
acoustic scene analysis; expectation-maximization (EM) algorithm; harmonic-temporal structured clustering (HTC); multipitch estimation; noisy speech; spline F0 contour;
DOI
10.1109/TASL.2007.894510
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
This paper proposes a novel F0 contour estimation algorithm based on a precise parametric description of the voiced parts of speech derived from the power spectrum. The algorithm is able to perform in a wide variety of noisy environments as well as to estimate the F0s of cochannel concurrent speech. The speech spectrum is modeled as a sequence of spectral clusters governed by a common F0 contour expressed as a spline curve. These clusters are obtained by an unsupervised 2-D time-frequency clustering of the power density using a new formulation of the EM algorithm, and their common F0 contour is estimated at the same time. A smooth F0 contour is extracted for the whole utterance, linking together its voiced parts. A noise model is used to cope with non-harmonic background noise, which would otherwise interfere with the clustering of the harmonic portions of speech. We evaluate our algorithm in comparison with existing methods on several tasks, and show 1) that it is competitive on clean single-speaker speech, 2) that it outperforms existing methods in the presence of noise, and 3) that it outperforms existing methods for the estimation of multiple F0 contours of cochannel concurrent speech.
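The idea of spectral clusters governed by a common F0 contour can be illustrated with a rough sketch. This is not the paper's algorithm: the abstract does not give the model equations, so everything below is an assumption made for illustration. In particular, the function names `f0_contour` and `harmonic_model` are hypothetical, the contour is interpolated linearly between knots as a stand-in for the spline, the harmonics are Gaussians on a linear frequency axis with equal weights, and no EM fitting or noise model is included.

```python
import numpy as np

def f0_contour(knot_times, knot_f0_hz, frame_times):
    """Evaluate an F0 contour at each frame time.

    Assumption: piecewise-linear interpolation between knots, used here
    as a simple stand-in for the spline curve named in the abstract.
    """
    return np.interp(frame_times, knot_times, knot_f0_hz)

def harmonic_model(f0_hz, freqs_hz, n_harmonics=10, sigma_hz=20.0, weights=None):
    """Build a model power density of shape (frames, frequency bins).

    Each frame is a weighted sum of Gaussians centred at k * F0,
    k = 1..n_harmonics, so all clusters share the same F0 contour.
    The Gaussian shape, sigma_hz, and the weights are illustrative choices.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    if weights is None:
        weights = np.ones(n_harmonics) / n_harmonics  # equal weights, sum to 1
    k = np.arange(1, n_harmonics + 1)
    centers = f0_hz[:, None, None] * k[None, None, :]      # (T, 1, K)
    diff = freqs_hz[None, :, None] - centers                # (T, F, K)
    gauss = np.exp(-0.5 * (diff / sigma_hz) ** 2) / (sigma_hz * np.sqrt(2.0 * np.pi))
    return gauss @ np.asarray(weights, dtype=float)         # (T, F)

if __name__ == "__main__":
    frame_times = np.array([0.0, 0.5, 1.0])
    contour = f0_contour([0.0, 1.0], [100.0, 200.0], frame_times)
    freqs = np.arange(0.0, 1000.0, 5.0)
    model = harmonic_model(contour, freqs, n_harmonics=3)
    print(contour)
    print(model.shape)
```

In a full estimator, the knot values and harmonic weights would be the parameters adjusted to fit an observed power spectrogram; the sketch only shows how a single contour determines every harmonic cluster at once.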
Pages: 1135-1145
Page count: 11