SELF-ATTENTIVE VAD: CONTEXT-AWARE DETECTION OF VOICE FROM NOISE

Cited by: 8
Authors
Jo, Yong Rae [1]
Moon, Young Ki [1,2]
Cho, Won Ik [3]
Jo, Geun Sik [2]
Affiliations
[1] Voithru Inc, Seoul, South Korea
[2] Inha Univ, Incheon, South Korea
[3] Seoul Natl Univ, Seoul, South Korea
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
voice activity detection; self-attention; real-world noise
DOI
10.1109/ICASSP39728.2021.9413961
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Recent voice activity detection (VAD) schemes have sought to leverage capable neural architectures, but few have successfully applied attention networks, because attention has relied heavily on the encoder-decoder framework. As a result, the built systems often depend strongly on recurrent neural networks, which are costly and sometimes less context-sensitive given the scale and properties of acoustic frames. To address this issue with the self-attention mechanism and achieve a simple, powerful, and environment-robust VAD, we first adopt the self-attention architecture in building the modules for voice detection and boosted prediction. Our model surpasses previous neural architectures in low signal-to-noise ratio and noisy real-world scenarios, while remaining robust across noise types. We make the test labels on movie data publicly available for fair comparison and future progress.
Pages: 6808-6812
Page count: 5