Voice Activity Detection in the Wild: A Data-Driven Approach Using Teacher-Student Training

Cited by: 30

Authors:

Dinkel, Heinrich [1,2]

Wang, Shuai [1,2]

Xu, Xuenan [1,2]

Wu, Mengyue [1,2]

Yu, Kai [1,2]

Affiliations:

[1] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, X LANCE Lab, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China

[2] State Key Lab Media Convergence Prod Technol & Sy, Beijing 100803, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2021 / Vol. 29

Funding:

National Natural Science Foundation of China;

Keywords:

Hidden Markov models; Training; Data models; Speech recognition; Mathematical model; Training data; Speech enhancement; Voice activity detection; Speech activity detection; Weakly supervised learning; Convolutional neural networks; Teacher-student learning; SPEECH ACTIVITY DETECTION; SOUND EVENT DETECTION; NEURAL-NETWORKS; ALGORITHM; FEATURES;

DOI:

10.1109/TASLP.2021.3073596

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Voice activity detection is an essential pre-processing component for speech-related tasks such as automatic speech recognition (ASR). Traditional supervised VAD systems obtain frame-level labels from an ASR pipeline by using, e.g., a Hidden Markov model. These ASR models are commonly trained on clean and fully transcribed data, limiting VAD systems to be trained on clean or synthetically noised datasets. Therefore, a major challenge for supervised VAD systems is their generalization towards noisy, real-world data. This work proposes a data-driven teacher-student approach for VAD, which utilizes vast and unconstrained audio data for training. Unlike previous approaches, only weak labels during teacher training are required, enabling the utilization of any real-world, potentially noisy dataset. Our approach firstly trains a teacher model on a source dataset (Audioset) using clip-level supervision. After training, the teacher provides frame-level guidance to a student model on an unlabeled, target dataset. A multitude of student models trained on mid- to large-sized datasets are investigated (Audioset, Voxceleb, NIST SRE). Our approach is then respectively evaluated on clean, artificially noised, and real-world data. We observe significant performance gains in artificially noised and real-world scenarios. Lastly, we compare our approach against other unsupervised and supervised VAD methods, demonstrating our method's superiority.
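The two training stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the mean pooling used to turn frame-level probabilities into a clip-level score, the 0.5 threshold for pseudo-labels, and all function names are assumptions chosen for illustration. The teacher is penalized only against a clip-level (weak) label, while the student is penalized per frame against labels derived from the teacher's frame-level outputs.

```python
import math
from typing import List


def binary_cross_entropy(pred: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one prediction, clamped for numerical safety."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def clip_probability(frame_probs: List[float]) -> float:
    """Pool frame-level speech probabilities into one clip-level score.

    Mean pooling is an assumption made here for simplicity.
    """
    return sum(frame_probs) / len(frame_probs)


def teacher_loss(frame_probs: List[float], clip_label: float) -> float:
    """Stage 1: the teacher sees only a clip-level (weak) label."""
    return binary_cross_entropy(clip_probability(frame_probs), clip_label)


def frame_pseudo_labels(teacher_frame_probs: List[float], threshold: float = 0.5) -> List[float]:
    """Turn teacher frame probabilities into hard frame labels (hypothetical threshold)."""
    return [1.0 if p >= threshold else 0.0 for p in teacher_frame_probs]


def student_loss(student_frame_probs: List[float],
                 teacher_frame_probs: List[float],
                 threshold: float = 0.5) -> float:
    """Stage 2: the student is supervised per frame by the teacher's outputs."""
    targets = frame_pseudo_labels(teacher_frame_probs, threshold)
    losses = [binary_cross_entropy(p, t) for p, t in zip(student_frame_probs, targets)]
    return sum(losses) / len(losses)
```

In this sketch, a student whose frame predictions agree with the teacher's pseudo-labels receives a lower loss than one that disagrees, which is the frame-level guidance the abstract refers to. No transcription or frame-level human annotation enters either loss.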


Pages: 1542-1555

Number of pages: 14
