Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection

Cited by: 20
Authors
Tao, Fei [1]
Busso, Carlos [1]
Affiliations
[1] Univ Texas Dallas, Dept Elect Engn, Multimodal Signal Proc MSP Lab, Richardson, TX 75080 USA
Source
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017
Keywords
voice activity detection; multimodal signal processing; deep learning; recurrent neural network; speech activity detection; recognition; information
DOI
10.21437/Interspeech.2017-1573
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Voice activity detection (VAD) is an important preprocessing step in speech-based systems, especially for emerging hands-free intelligent assistants. Conventional VAD systems relying on audio-only features are typically impaired by environmental noise. An alternative approach to address this problem is audiovisual VAD (AV-VAD). A key challenge in AV-VAD is modeling the timing dependencies between acoustic and visual features. This study proposes a bimodal recurrent neural network (RNN) that combines audiovisual features in a principled, unified framework, capturing the timing dependencies within and across modalities. Each modality is modeled with a separate bidirectional long short-term memory (BLSTM) network, and the output layers of these networks serve as input to a third BLSTM network. The experimental evaluation considers a large audiovisual corpus with clean and noisy recordings to assess the robustness of the approach. The proposed approach outperforms audio-only VAD by 7.9% (absolute) under clean/ideal conditions (i.e., high-definition (HD) camera, close-talk microphone) and by 18.5% (absolute) under more challenging conditions (i.e., camera and microphone from a tablet, with noise in the environment). The proposed approach shows the best performance and robustness across a variety of conditions, demonstrating its potential for real-world applications.
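As a concrete illustration of the architecture the abstract describes, below is a minimal PyTorch sketch of the bimodal RNN: one BLSTM per modality, whose frame-level outputs are concatenated and fed to a third BLSTM for fusion, followed by a per-frame speech/non-speech score. The class name, feature dimensions, and layer sizes are illustrative assumptions, not values from the paper.

# Minimal sketch of the bimodal BLSTM architecture (PyTorch assumed).
# All dimensions below are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class BimodalRNNVAD(nn.Module):
    def __init__(self, audio_dim=40, video_dim=30, hidden=64):
        super().__init__()
        # Modality-specific BLSTMs capture timing dependencies within each stream.
        self.audio_blstm = nn.LSTM(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.video_blstm = nn.LSTM(video_dim, hidden, batch_first=True, bidirectional=True)
        # Fusion BLSTM takes the concatenated per-modality outputs (2*hidden each)
        # and captures timing dependencies across modalities.
        self.fusion_blstm = nn.LSTM(4 * hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 1)  # per-frame speech/non-speech score

    def forward(self, audio, video):
        # audio: (batch, frames, audio_dim); video: (batch, frames, video_dim),
        # assumed frame-synchronous.
        a, _ = self.audio_blstm(audio)
        v, _ = self.video_blstm(video)
        fused, _ = self.fusion_blstm(torch.cat([a, v], dim=-1))
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)  # (batch, frames)

# Usage: frame-synchronous audio and visual features for a 100-frame clip.
model = BimodalRNNVAD()
scores = model(torch.randn(1, 100, 40), torch.randn(1, 100, 30))  # shape (1, 100)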
Pages: 1938-1942
Page count: 5