On the relevance of auditory-based Gabor features for deep learning in robust speech recognition

Citations: 0
Authors
Martinez, Angel Mario Castro [1 ,2 ]
Mallidi, Sri Harish [3 ]
Meyer, Bernd T. [3 ]
Affiliations
[1] Carl von Ossietzky Univ Oldenburg, Dept Med Phys & Akust, Oldenburg, Germany
[2] Exzellenzcluster Hearing4all, D-26111 Oldenburg, Germany
[3] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD USA
Keywords
Auditory features; Spectro-temporal processing; Deep neural networks; Automatic speech recognition; MODULATION SPECTRUM; FRONT-END; PERCEPTION; NETWORKS; MODEL;
DOI
10.1016/j.csl.2017.02.006
CLC classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Previous studies support the idea of merging auditory-based Gabor features with deep learning architectures to achieve robust automatic speech recognition; however, the cause of the gain from this combination is still unknown. We believe these representations provide the deep learning decoder with more discriminable cues. This paper aims to validate this hypothesis by performing experiments on three different recognition tasks (Aurora 4, CHiME 2 and CHiME 3) and assessing the discriminability of the information encoded by Gabor filterbank features. Additionally, to identify the contribution of low, medium and high temporal modulation frequencies, subsets of the Gabor filterbank were used as features (dubbed LTM, MTM and HTM, respectively). With temporal modulation frequencies between 16 and 25 Hz, HTM consistently outperformed the other subsets in every condition, highlighting the robustness of these representations against channel distortions, low signal-to-noise ratios and acoustically challenging real-life scenarios, with relative improvements of 11 to 56% over a Mel-filterbank-DNN baseline. To explain the results, a measure of similarity between phoneme classes derived from DNN activations is proposed and linked to their acoustic properties. We find this measure to be consistent with the observed error rates, and we highlight specific phoneme-level differences to pinpoint the benefit of the proposed features. (C) 2017 Elsevier Ltd. All rights reserved.
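To illustrate the kind of spectro-temporal Gabor filter the abstract refers to, the sketch below builds a complex 2D Gabor kernel (a sinusoidal carrier under a Hann envelope) parameterized by a temporal modulation frequency, so that e.g. the 16–25 Hz range of the HTM subset can be selected. This is a minimal illustration under assumed parameters (filter size, frame rate, separable Hann envelope), not the authors' exact filterbank implementation.

```python
import numpy as np

def gabor_filter_2d(omega_t, omega_s, size_t=40, size_s=23, frame_rate=100.0):
    """Complex 2D spectro-temporal Gabor filter (illustrative sketch).

    omega_t : temporal modulation frequency in Hz
              (e.g. 16-25 Hz would correspond to an HTM-style subset)
    omega_s : spectral modulation frequency in cycles per filterbank channel
    """
    # centered time (frames) and spectral (channel) axes
    t = np.arange(size_t) - size_t // 2
    s = np.arange(size_s) - size_s // 2
    T, S = np.meshgrid(t, s, indexing="ij")
    # carrier: complex exponential oscillating at the chosen modulation rates
    carrier = np.exp(1j * 2 * np.pi * (omega_t / frame_rate * T + omega_s * S))
    # envelope: separable Hann window localizes the filter in time and frequency
    envelope = np.outer(np.hanning(size_t), np.hanning(size_s))
    return envelope * carrier

# Convolving a log-Mel spectrogram with one such kernel yields one feature map;
# a Gabor filterbank stacks maps from many (omega_t, omega_s) pairs.
g = gabor_filter_2d(omega_t=20.0, omega_s=0.1)
```

Restricting `omega_t` to a sub-range is what distinguishes the LTM, MTM and HTM feature subsets described in the abstract.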
Pages: 21-38 (18 pages)