Speech/music segmentation using entropy and dynamism features in a HMM classification framework

Cited by: 77
Authors
Ajmera, J
McCowan, I
Bourlard, H
Affiliations
[1] IDIAP, CH-1920 Martigny, Switzerland
[2] Ecole Polytech Fed Lausanne, CH-1015 Lausanne, Switzerland
Funding
Swiss National Science Foundation;
Keywords
speech/music discrimination; audio segmentation; entropy; dynamism; HMM; GMM; MLP;
DOI
10.1016/S0167-6393(02)00087-0
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
In this paper, we present a new approach toward high-performance speech/music discrimination on realistic tasks related to the automatic transcription of broadcast news. In the approach presented here, an artificial neural network (ANN) trained on clean speech only (as used in a standard large-vocabulary speech recognition system) is used as a channel model, at the output of which the entropy and "dynamism" are measured every 10 ms. These features are then integrated over time through an ergodic 2-state (speech and non-speech) hidden Markov model (HMM) with minimum duration constraints on each HMM state. In the case of entropy, for instance, it is clear (and observed in practice) that, on average, the entropy at the output of the ANN is larger for non-speech segments than for speech segments presented at its input. In our case, the ANN acoustic model was a multi-layer perceptron (MLP, as often used in hybrid HMM/ANN systems) generating at its output estimates of the phonetic posterior probabilities based on the acoustic vectors at its input. It is from these outputs, thus from "real" probabilities, that the entropy and dynamism are estimated. The 2-state speech/non-speech HMM takes these two-dimensional features (entropy and dynamism), whose distributions are modeled through multi-Gaussian densities or a secondary MLP. The parameters of this HMM are trained in a supervised manner using the Viterbi algorithm. Although the proposed method can easily be adapted to other speech/non-speech discrimination applications, the present paper focuses only on speech/music segmentation. Different experiments, covering different speech and music styles as well as different temporal distributions of the speech and music signals (real data distribution, mostly speech, or mostly music), illustrate the robustness of the approach, which always achieves a correct segmentation performance higher than 90%. Finally, we show how a confidence measure can be used to further improve the segmentation results, and also discuss how this may be used to extend the technique to the case of speech/music mixtures. (C) 2002 Elsevier Science B.V. All rights reserved.
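The two posterior-based features described in the abstract can be written down directly. The following is a minimal sketch (not the authors' code) of how per-frame entropy and dynamism might be computed from the MLP's posterior outputs; the array shape, the 10 ms frame convention, and the exact dynamism formula (squared frame-to-frame difference of the posterior vector, a common definition for this feature) are assumptions for illustration.

```python
import numpy as np

def entropy_and_dynamism(posteriors, eps=1e-12):
    """Per-frame entropy and dynamism from MLP posterior outputs.

    posteriors : (T, K) array; row t holds the K phonetic posterior
    probabilities estimated by the MLP for frame t (one frame per 10 ms).
    Returns two length-T arrays (dynamism is padded at the first frame).
    Sketch only: the paper does not publish reference code.
    """
    p = np.clip(posteriors, eps, 1.0)
    # Entropy H_t = -sum_k p_t(k) * log p_t(k): high when the MLP is
    # "confused" by its input (expected for music/non-speech), low when
    # one phonetic class dominates (expected for clean speech).
    entropy = -np.sum(p * np.log(p), axis=1)
    # Dynamism: squared change of the posterior vector between
    # consecutive frames, which tends to be larger for speech.
    diff = np.diff(posteriors, axis=0)
    dynamism = np.sum(diff ** 2, axis=1)
    dynamism = np.concatenate(([dynamism[0]], dynamism))  # pad frame 0
    return entropy, dynamism
```

In the paper, these two values per frame form the observation vector whose class-conditional distributions are modeled by multi-Gaussian densities or a secondary MLP inside the 2-state HMM; the sketch stops at feature extraction and leaves the HMM smoothing (Viterbi decoding with minimum duration constraints) to a standard toolkit.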
Pages: 351-363
Number of pages: 13
References
12 records in total
[1] Ajmera J, 2002, Int. Conf. Acoust. Speech Signal Process. (ICASSP), p. 297
[2] Bernardis G, 1998, Int. Conf. Spoken Lang. Process. (ICSLP), vol. 3, p. 775
[3] Carey MJ, 1999, IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP)
[4] Chen S, 1998, IBM Tech. J.
[5] El-Maleh K, 2000, Int. Conf. Acoust. Speech Signal Process. (ICASSP), p. 2445, DOI 10.1109/ICASSP.2000.859336
[6] Morgan N, 1995, IEEE Signal Process. Mag., vol. 12, p. 25
[7] Papoulis A, 1991, Probability, Random Variables, and Stochastic Processes
[8] Parris ES, 1999, Eur. Conf. Speech Commun. Technol. (Eurospeech), p. 2191
[9] Saunders J, 1996, Int. Conf. Acoust. Speech Signal Process. (ICASSP), p. 993, DOI 10.1109/ICASSP.1996.543290
[10] Scheirer E, 1997, Int. Conf. Acoust. Speech Signal Process. (ICASSP), p. 1331, DOI 10.1109/ICASSP.1997.596192