Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection

Cited by: 20
Authors
Tao, Fei [1]
Busso, Carlos [1]
Affiliations
[1] Univ Texas Dallas, Dept Elect Engn, Multimodal Signal Proc MSP Lab, Richardson, TX 75080 USA
Source
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017
Keywords
voice activity detection; multimodal signal processing; deep learning; recurrent neural network; speech activity detection; recognition; information
DOI
10.21437/Interspeech.2017-1573
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Voice activity detection (VAD) is an important preprocessing step in speech-based systems, especially for emerging hands-free intelligent assistants. Conventional VAD systems relying on audio-only features are typically impaired by environmental noise. An alternative approach to address this problem is audiovisual VAD (AV-VAD). A key challenge in AV-VAD is modeling the timing dependencies between acoustic and visual features. This study proposes a bimodal recurrent neural network (RNN) that combines audiovisual features in a principled, unified framework, capturing the timing dependencies within and across modalities. Each modality is modeled with a separate bidirectional long short-term memory (BLSTM) network, and the output layers of these networks serve as input to a third BLSTM network. The experimental evaluation considers a large audiovisual corpus with clean and noisy recordings to assess the robustness of the approach. The proposed approach outperforms audio-only VAD by 7.9% (absolute) under clean/ideal conditions (i.e., high-definition (HD) camera, close-talk microphone) and by 18.5% (absolute) under more challenging conditions (i.e., camera and microphone from a tablet, with noise in the environment). The proposed approach shows the best performance and robustness across a variety of conditions, demonstrating its potential for real-world applications.
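As a concrete illustration of the architecture the abstract describes, below is a minimal PyTorch sketch of the bimodal RNN: one BLSTM per modality, whose frame-level outputs are concatenated and fed to a third BLSTM for fusion, followed by a per-frame speech/non-speech score. The class name, feature dimensions, and layer sizes are illustrative assumptions, not values from the paper.

# Minimal sketch of the bimodal BLSTM architecture (PyTorch assumed).
# All dimensions below are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class BimodalRNNVAD(nn.Module):
    def __init__(self, audio_dim=40, video_dim=30, hidden=64):
        super().__init__()
        # Modality-specific BLSTMs capture timing dependencies within each stream.
        self.audio_blstm = nn.LSTM(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.video_blstm = nn.LSTM(video_dim, hidden, batch_first=True, bidirectional=True)
        # Fusion BLSTM takes the concatenated per-modality outputs (2*hidden each)
        # and captures timing dependencies across modalities.
        self.fusion_blstm = nn.LSTM(4 * hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 1)  # per-frame speech/non-speech score

    def forward(self, audio, video):
        # audio: (batch, frames, audio_dim); video: (batch, frames, video_dim),
        # assumed frame-synchronous.
        a, _ = self.audio_blstm(audio)
        v, _ = self.video_blstm(video)
        fused, _ = self.fusion_blstm(torch.cat([a, v], dim=-1))
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)  # (batch, frames)

# Usage: frame-synchronous audio and visual features for a 100-frame clip.
model = BimodalRNNVAD()
scores = model(torch.randn(1, 100, 40), torch.randn(1, 100, 30))  # shape (1, 100)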
Pages: 1938-1942
Page count: 5