Binary mask estimation strategies for constrained imputation-based speech enhancement

Cited by: 1
Authors
Marxer, Ricard [1 ]
Barker, Jon [1 ]
Affiliation
[1] Univ Sheffield, Sheffield, S Yorkshire, England
Source
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
speech recognition; speech enhancement; imputation;
DOI
10.21437/Interspeech.2017-1257
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, speech enhancement by analysis-resynthesis has emerged as an alternative to conventional noise filtering approaches. Analysis-resynthesis replaces noisy speech with a signal that has been reconstructed from a clean speech model. It can deliver high-quality signals with no residual noise, but at the expense of losing information from the original signal that is not well represented by the model. A recent compromise solution, called constrained resynthesis, solves this problem by only resynthesising spectro-temporal regions that are estimated to be masked by noise (conditioned on the evidence in the unmasked regions). In this paper we first extend the approach by: i) introducing multi-condition training and a deep discriminative model for the analysis stage; ii) introducing an improved resynthesis model that captures within-state cross-frequency dependencies. We then extend the previous stationary-noise evaluation by using real domestic audio noise from the CHiME-2 evaluation. We compare various mask estimation strategies while varying the degree of constraint by tuning the threshold for reliable speech detection. PESQ and log-spectral distance measures show that although mask estimation remains a challenge, it is only necessary to estimate a few reliable signal regions in order to achieve performance close to that achieved with an optimal oracle mask.
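To make the abstract's central objects concrete, here is a minimal sketch (not the paper's implementation; all function names and the thresholding convention are illustrative assumptions) of an "oracle" ideal binary mask, which labels spectro-temporal bins where clean speech dominates the noise, and a simple log-spectral distance, one of the evaluation measures mentioned above:

```python
import numpy as np

def ideal_binary_mask(speech_spec, noise_spec, threshold_db=0.0):
    """Boolean mask: True where the local speech-to-noise ratio (dB)
    exceeds threshold_db. These are the 'reliable' regions that
    constrained resynthesis would keep from the observed signal."""
    eps = 1e-12  # avoid log of zero
    local_snr_db = 10.0 * np.log10(
        (np.abs(speech_spec) ** 2 + eps) / (np.abs(noise_spec) ** 2 + eps)
    )
    return local_snr_db > threshold_db

def log_spectral_distance(ref_spec, est_spec):
    """A simple global variant of log-spectral distance (dB): RMS of the
    log power-spectrum difference over all time-frequency bins."""
    eps = 1e-12
    diff_db = 10.0 * np.log10(
        (np.abs(ref_spec) ** 2 + eps) / (np.abs(est_spec) ** 2 + eps)
    )
    return float(np.sqrt(np.mean(diff_db ** 2)))

# Toy example on random magnitude "spectrograms" (freq bins x frames).
rng = np.random.default_rng(0)
speech = rng.rayleigh(1.0, size=(64, 100))
noise = rng.rayleigh(0.5, size=(64, 100))

mask = ideal_binary_mask(speech, noise, threshold_db=0.0)
# Mixture where masked (unreliable) bins are dominated by noise:
reconstructed = speech * mask + noise * ~mask
lsd = log_spectral_distance(speech, reconstructed)
```

Raising `threshold_db` shrinks the set of bins treated as reliable, which corresponds to the abstract's "degree of constraint": tighter thresholds force more of the signal to be resynthesised from the speech model.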
Pages: 1988-1992
Page count: 5