Use of temporal information: Detection of periodicity, aperiodicity, and pitch in speech

Cited by: 71
Authors
Deshmukh, O [1]
Espy-Wilson, CY
Salomon, A
Singh, J
Affiliations
[1] Univ Maryland, Dept Elect & Comp Engn, College Pk, MD 20742 USA
[2] Univ Maryland, Inst Syst Res, College Pk, MD 20742 USA
[3] MIT, Speech Commun Grp, Res Lab Elect, Cambridge, MA 02142 USA
Source
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING | 2005, Vol. 13, No. 5
Funding
National Science Foundation (USA)
Keywords
aperiodic and periodic energy; average magnitude difference function (AMDF); pitch detection; speech preprocessing; voiced obstruents; voice quality;
DOI
10.1109/TSA.2005.851910
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
In this paper, we present a time-domain aperiodicity, periodicity, and pitch (APP) detector that estimates 1) the proportion of periodic and aperiodic energy in a speech signal and 2) the pitch period of the periodic component. The APP system is particularly useful in situations where the speech signal contains simultaneous periodic and aperiodic energy, as in the case of breathy vowels and some voiced obstruents. The performance of the APP system was evaluated on synthetic speech-like signals corrupted with noise at various levels of signal-to-noise ratio (SNR) and on three different natural speech databases that consist of simultaneously recorded electroglottograph (EGG) and acoustic data. When compared on a frame basis (at a frame rate of 2.5 ms), the results show excellent agreement between the periodic/aperiodic decisions made by the APP system and the estimates obtained from the EGG data (94.43% for periodicity and 96.32% for aperiodicity). The results also support previous studies showing that voiced obstruents are frequently manifested with either little or no aperiodic energy, or with strong periodic and aperiodic components. The EGG data were used as a reference for evaluating the pitch detection algorithm. The ground truth was not manually checked to rectify or exclude incorrect estimates. The overall gross error rate in pitch prediction across the three speech databases was 5.67%. In the case of synthetic speech-like data, the estimated SNR was found to be in close proportion to the actual SNR, and the pitch was always accurately found regardless of the presence of any shimmer or jitter.
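The keywords list the average magnitude difference function (AMDF), so a small illustration of a generic AMDF pitch estimate may help readers unfamiliar with the idea. The sketch below is not the APP system described in the abstract: the function names `amdf` and `estimate_pitch`, the 70 to 400 Hz search range, and the synthetic test frame are all assumptions made for illustration only.

```python
import math

def amdf(frame, max_lag):
    """Average magnitude difference function for lags 0..max_lag."""
    n = len(frame) - max_lag  # same number of sample pairs at every lag
    return [
        sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n
        for lag in range(max_lag + 1)
    ]

def estimate_pitch(frame, fs, f_min=70.0, f_max=400.0):
    """Return an F0 estimate in Hz from the deepest AMDF dip.

    f_min and f_max are illustrative search limits, not values from the paper.
    """
    min_lag = int(fs / f_max)  # shortest candidate period in samples
    max_lag = int(fs / f_min)  # longest candidate period in samples
    d = amdf(frame, max_lag)
    best_lag = min(range(min_lag, max_lag + 1), key=lambda lag: d[lag])
    return fs / best_lag

# Synthetic periodic frame: 125 Hz fundamental plus a weaker second harmonic.
fs = 8000
f0 = 125.0
frame = [
    math.sin(2 * math.pi * f0 * t / fs)
    + 0.5 * math.sin(2 * math.pi * 2 * f0 * t / fs)
    for t in range(400)
]
pitch = estimate_pitch(frame, fs)
print(pitch)
```

For a perfectly periodic frame the AMDF dips to nearly zero at lags equal to the pitch period; here that is a lag of 64 samples. Aperiodic energy makes the dips shallower, which is the kind of cue a periodicity detector can exploit, although how the APP system actually uses it is not stated in the abstract.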
Pages: 776-786
Number of pages: 11