Data-driven voice source waveform analysis and synthesis

Cited by: 10
Authors
Gudnason, Jon [1]
Thomas, Mark R. P. [2]
Ellis, Daniel P. W. [3]
Naylor, Patrick A. [2]
Affiliations
[1] Reykjavik Univ, Sch Sci & Engn, Reykjavik, Iceland
[2] Univ London Imperial Coll Sci Technol & Med, Dept Elect & Elect Engn, London SW7 2AZ, England
[3] Columbia Univ, LabROSA, New York, NY 10027 USA
Keywords
Voice source signal; Inverse filtering; Vocal tract modeling; Principal component analysis; Gaussian mixture model; Segmental signal to reconstruction ratio; GLOTTAL CLOSURE; LINEAR PREDICTION; QUALITY; PHASE; MODEL
DOI
10.1016/j.specom.2011.08.003
Chinese Library Classification number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
A data-driven approach is introduced for studying, analyzing and processing the voice source signal. Existing approaches parameterize the voice source signal by using models that are motivated, for example, by a physical model or function-fitting. Such parameterization is often difficult to achieve and it produces a poor approximation to a large variety of real voice source waveforms of the human voice. This paper presents a novel data-driven approach to analyze different types of voice source waveforms using principal component analysis and Gaussian mixture modeling. This approach models certain voice source features that many other approaches fail to model. Prototype voice source waveforms are obtained from each mixture component and analyzed with respect to speaker, phone and pitch. An analysis/synthesis scheme was set up to demonstrate the effectiveness of the method. Compression of the proposed voice source by discarding 75% of the features yields a segmental signal-to-reconstruction error ratio of 13 dB and a Bark spectral distortion of 0.14. (C) 2011 Elsevier B.V. All rights reserved.
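The abstract reports reconstruction quality as a segmental signal-to-reconstruction error ratio in dB. The record does not give the paper's exact definition, so the following is only a minimal sketch of one common reading of that metric: the per-frame ratio of signal energy to error energy, in dB, averaged over frames. The function name `segmental_srr_db`, the fixed non-overlapping framing, and the rule for skipping degenerate frames are all assumptions made for illustration, not details taken from the paper.

```python
import math


def segmental_srr_db(reference, reconstruction, frame_len):
    """Mean per-frame signal-to-reconstruction-error ratio in dB.

    Assumed definition (not taken from the paper): split both signals into
    non-overlapping frames of frame_len samples, compute
    10*log10(signal energy / error energy) per frame, then average.
    """
    if len(reference) != len(reconstruction):
        raise ValueError("signals must have equal length")
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")

    ratios_db = []
    # Trailing samples that do not fill a whole frame are ignored.
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        rec = reconstruction[start:start + frame_len]
        signal_energy = sum(x * x for x in ref)
        error_energy = sum((x - y) ** 2 for x, y in zip(ref, rec))
        # Skip frames where the ratio is undefined or infinite
        # (silent frame, or frame reconstructed exactly).
        if signal_energy == 0.0 or error_energy == 0.0:
            continue
        ratios_db.append(10.0 * math.log10(signal_energy / error_energy))

    if not ratios_db:
        raise ValueError("no frame with a finite ratio")
    return sum(ratios_db) / len(ratios_db)
```

Under this assumed definition, a reconstruction whose every sample is 10% too small has an error energy of 1% of the signal energy in each frame, so it scores about 20 dB. The 13 dB figure in the abstract would correspond to an error energy of roughly 5% of the signal energy per frame, on average in the dB domain.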
Pages: 199-211
Page count: 13