SELF-ATTENTIVE VAD: CONTEXT-AWARE DETECTION OF VOICE FROM NOISE

Cited by: 11
Authors
Jo, Yong Rae [1 ]
Moon, Young Ki [1 ,2 ]
Cho, Won Ik [3 ]
Jo, Geun Sik [2 ]
Affiliations
[1] Voithru Inc, Seoul, South Korea
[2] Inha Univ, Incheon, South Korea
[3] Seoul Natl Univ, Seoul, South Korea
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
voice activity detection; self-attention; real-world noise;
DOI
10.1109/ICASSP39728.2021.9413961
CLC Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Recent voice activity detection (VAD) schemes have aimed at leveraging modern neural architectures, but few have successfully applied attention networks because of attention's typical reliance on the encoder-decoder framework. As a result, existing systems often depend heavily on recurrent neural networks, which are computationally costly and sometimes insufficiently context-sensitive given the scale and properties of acoustic frames. To address this issue with the self-attention mechanism and achieve a simple, powerful, and environment-robust VAD, we first adopt a self-attention architecture in building the modules for voice detection and boosted prediction. Our model surpasses previous neural architectures in low signal-to-noise-ratio and noisy real-world scenarios, while also displaying robustness across noise types. We make the test labels on movie data publicly available for fair comparison and future progress.
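The abstract describes applying self-attention, rather than recurrent layers, to acoustic frames for frame-level voice detection. As a rough illustration only, and not the authors' released model, the following PyTorch sketch shows one way a Transformer-style self-attention encoder can classify log-mel frames as speech or non-speech; the feature dimensionality, layer counts, head counts, and class/variable names are assumptions chosen for demonstration, and positional encoding is omitted for brevity.

# Illustrative sketch only: a self-attention encoder over acoustic frames
# for frame-level voice activity detection. Hyperparameters (80-dim log-mel
# features, 2 layers, 4 heads) are assumptions, not the paper's settings.
import torch
import torch.nn as nn

class SelfAttentiveVAD(nn.Module):
    def __init__(self, n_mels=80, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.input_proj = nn.Linear(n_mels, d_model)  # project features to model dim
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, 1)  # per-frame speech logit

    def forward(self, feats):
        # feats: (batch, time, n_mels) log-mel frames
        x = self.input_proj(feats)
        x = self.encoder(x)  # self-attention mixes context across all frames
        return self.classifier(x).squeeze(-1)  # (batch, time) logits

# Usage: per-frame speech probabilities for 300 dummy frames
model = SelfAttentiveVAD()
frames = torch.randn(1, 300, 80)       # placeholder log-mel features
probs = torch.sigmoid(model(frames))   # values near 1 indicate speech

Unlike a recurrent layer, the self-attention encoder lets every frame attend to all other frames in the window in a single step, which is the context-sensitivity argument the abstract makes against RNN-based VAD.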
Pages: 6808-6812
Number of pages: 5