Monaural speech segregation based on pitch tracking and amplitude modulation

Cited by: 283
Authors
Hu, GN [1]
Wang, DL
Affiliations
[1] Ohio State Univ, Biophys Program, Columbus, OH 43210 USA
[2] Ohio State Univ, Dept Comp & Informat Sci, Columbus, OH 43210 USA
[3] Ohio State Univ, Ctr Cognit Sci, Columbus, OH 43210 USA
Source
IEEE TRANSACTIONS ON NEURAL NETWORKS | 2004 / Vol. 15 / No. 5
Funding
U.S. National Science Foundation;
Keywords
amplitude modulation (AM); computational auditory scene analysis; grouping; monaural speech segregation; pitch tracking; segmentation;
DOI
10.1109/TNN.2004.832812
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Segregating speech from one monaural recording has proven to be very challenging. Monaural segregation of voiced speech has been studied in previous systems that incorporate auditory scene analysis principles. A major problem for these systems is their inability to deal with the high-frequency part of speech. Psychoacoustic evidence suggests that different perceptual mechanisms are involved in handling resolved and unresolved harmonics. We propose a novel system for voiced speech segregation that segregates resolved and unresolved harmonics differently. For resolved harmonics, the system generates segments based on temporal continuity and cross-channel correlation, and groups them according to their periodicities. For unresolved harmonics, it generates segments based on common amplitude modulation (AM) in addition to temporal continuity and groups them according to AM rates. Underlying the segregation process is a pitch contour that is first estimated from speech segregated according to dominant pitch and then adjusted according to psychoacoustic constraints. Our system is systematically evaluated and compared with previous systems, and it yields substantially better performance, especially for the high-frequency part of speech.
Pages: 1135-1150
Page count: 16
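The abstract only names the grouping cues, so the following is a minimal, self-contained Python sketch of the two cues it describes: cross-channel correlation for resolved harmonics and AM-rate estimation for unresolved harmonics. This is not the authors' implementation; the Butterworth bandpass filters stand in for a gammatone filterbank, and the 120 Hz fundamental, channel edges, and helper names (band, cross_channel_corr, am_rate) are all illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 16000
t = np.arange(0, 0.5, 1 / fs)
f0 = 120.0  # assumed fundamental; its harmonics above ~1 kHz are unresolved
x = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(1, 21))

def band(sig, lo, hi):
    # 4th-order Butterworth bandpass as a stand-in for one gammatone channel
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, sig)

def cross_channel_corr(u, v):
    # Normalized correlation between adjacent channel responses:
    # the segment-formation cue for resolved harmonics
    u = u - u.mean()
    v = v - v.mean()
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def am_rate(sig, lo_f=50.0, hi_f=400.0):
    # AM rate taken as the autocorrelation peak of the Hilbert envelope:
    # unresolved harmonics in one channel beat at the fundamental
    env = np.abs(hilbert(sig))
    env = env - env.mean()
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    lo, hi = int(fs / hi_f), int(fs / lo_f)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

# Low-frequency channels: individual harmonics are resolved, so adjacent
# channels driven by the same harmonic are strongly correlated.
low1, low2 = band(x, 200, 280), band(x, 220, 300)
print("cross-channel corr (resolved):", round(cross_channel_corr(low1, low2), 3))

# High-frequency channels: several harmonics interact and the envelope
# beats at f0; channels sharing a common AM rate would be grouped together.
hi1, hi2 = band(x, 1800, 2200), band(x, 2200, 2600)
print("AM rate ch1 (Hz):", round(am_rate(hi1), 1))
print("AM rate ch2 (Hz):", round(am_rate(hi2), 1))
```

In the paper's actual system these cues are computed per time frame on a correlogram and combined with temporal continuity and the estimated pitch contour; the sketch above isolates only the two cues to show why both high-frequency channels report an AM rate near the 120 Hz fundamental.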