Recognizing articulatory gestures from speech for robust speech recognition

Cited by: 20
Authors
Mitra, Vikramjit [2 ]
Nam, Hosung [1 ]
Espy-Wilson, Carol [3 ]
Saltzman, Elliot [4 ]
Goldstein, Louis [5 ]
Affiliations
[1] Haskins Labs Inc, New Haven, CT 06511 USA
[2] SRI Int, Speech Technol & Res Lab, Menlo Pk, CA 94025 USA
[3] Univ Maryland, Dept Elect & Comp Engn, Speech Commun Lab, College Pk, MD 20742 USA
[4] Boston Univ, Dept Phys Therapy & Athlet Training, Boston, MA 02215 USA
[5] Univ So Calif, Dept Linguist, Los Angeles, CA 90089 USA
Keywords
MODELS; ENHANCEMENT; SYNTHESIZER; ACOUSTICS; MOVEMENT;
DOI
10.1121/1.3682038
Chinese Library Classification
O42 [Acoustics];
Discipline Classification Codes
070206 ; 082403 ;
Abstract
Studies have shown that supplementary articulatory information can help to improve the recognition rate of automatic speech recognition systems. Unfortunately, articulatory information is not directly observable, necessitating its estimation from the speech signal. This study describes a system that recognizes articulatory gestures from speech and uses the recognized gestures in a speech recognition system. Recognizing gestures for a given utterance involves recovering the set of underlying gestural activations and their associated dynamic parameters. This paper proposes a neural network architecture for recognizing articulatory gestures from speech and presents ways to incorporate articulatory gestures into a digit recognition task. The lack of a natural speech database containing gestural information prompted a three-stage evaluation. First, the proposed gestural annotation architecture was tested on a synthetic speech dataset, which showed that using estimated tract variable time functions improved gesture recognition performance. In the second stage, the gesture-recognition models were applied to natural speech waveforms, and word recognition experiments revealed that the recognized gestures can improve the noise robustness of a word recognition system. In the final stage, a gesture-based Dynamic Bayesian Network was trained, and the results indicate that incorporating gestural information can improve word recognition performance compared to acoustic-only systems. (C) 2012 Acoustical Society of America. [DOI: 10.1121/1.3682038]
Pages: 2270-2287
Number of pages: 18