Voice Activity Detection in the Wild: A Data-Driven Approach Using Teacher-Student Training

Times cited: 30
Authors
Dinkel, Heinrich [1, 2]
Wang, Shuai [1, 2]
Xu, Xuenan [1, 2]
Wu, Mengyue [1, 2]
Yu, Kai [1, 2]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, X LANCE Lab,Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[2] State Key Lab Media Convergence Prod Technol & Sy, Beijing 100803, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Hidden Markov models; Training; Data models; Speech recognition; Mathematical model; Training data; Speech enhancement; Voice activity detection; Speech activity detection; Weakly supervised learning; Convolutional neural networks; Teacher-student learning; Sound event detection; Neural networks; Algorithm; Features
DOI
10.1109/TASLP.2021.3073596
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Voice activity detection (VAD) is an essential pre-processing component for speech-related tasks such as automatic speech recognition (ASR). Traditional supervised VAD systems obtain frame-level labels from an ASR pipeline, e.g., by using a hidden Markov model. These ASR models are commonly trained on clean, fully transcribed data, which restricts VAD training to clean or synthetically noised datasets. A major challenge for supervised VAD systems is therefore generalization to noisy, real-world data. This work proposes a data-driven teacher-student approach for VAD that uses vast, unconstrained audio data for training. Unlike previous approaches, it requires only weak labels during teacher training, which allows any real-world, potentially noisy dataset to be used. The approach first trains a teacher model on a source dataset (Audioset) using clip-level supervision. After training, the teacher provides frame-level guidance to a student model on an unlabeled target dataset. Multiple student models trained on mid- to large-sized datasets (Audioset, Voxceleb, NIST SRE) are investigated. The approach is evaluated on clean, artificially noised, and real-world data, and we observe significant performance gains in the artificially noised and real-world scenarios. Lastly, we compare our approach against other unsupervised and supervised VAD methods, demonstrating our method's superiority.
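The two training stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, mean pooling is an assumed way of turning frame-level outputs into a clip-level prediction, and binary cross-entropy is an assumed loss. It only shows how a teacher can be trained from a single clip-level label and then hand frame-level targets to a student.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction p against target y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def teacher_clip_loss(frame_logits, clip_label):
    """Weak supervision: the loss only sees one label per clip.

    Frame probabilities are pooled by their mean (an assumption made
    for this sketch) and compared with the clip-level label.
    """
    frame_probs = [sigmoid(z) for z in frame_logits]
    clip_prob = sum(frame_probs) / len(frame_probs)
    return bce(clip_prob, clip_label)

def teacher_frame_targets(frame_logits, threshold=None):
    """Frame-level guidance from a trained teacher.

    Returns soft probabilities, or hard 0/1 labels when a threshold
    is given.
    """
    probs = [sigmoid(z) for z in frame_logits]
    if threshold is None:
        return probs
    return [1.0 if p >= threshold else 0.0 for p in probs]

def student_frame_loss(student_probs, targets):
    """Average frame-level loss of the student against teacher targets."""
    return sum(bce(p, t) for p, t in zip(student_probs, targets)) / len(targets)

# Hypothetical teacher logits for a three-frame clip from an unlabeled dataset.
targets = teacher_frame_targets([-3.0, 0.0, 4.0], threshold=0.5)
loss = student_frame_loss([0.1, 0.8, 0.9], targets)
```

In this sketch the student never sees a human-provided frame label: its targets come entirely from the teacher, which is why the target dataset can be unlabeled.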
Pages: 1542-1555
Page count: 14