SELF-ADAPTIVE SOFT VOICE ACTIVITY DETECTION USING DEEP NEURAL NETWORKS FOR ROBUST SPEAKER VERIFICATION

Cited by: 0
Authors
Jung, Youngmoon [1]
Choi, Yeunju [1]
Kim, Hoirin [1]
Affiliations
[1] Korea Adv Inst Sci & Technol, Sch Elect Engn, Daejeon, South Korea
Source
2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019) | 2019
Keywords
speaker verification; voice activity detection; unsupervised domain adaptation; soft VAD;
DOI
10.1109/asru46091.2019.9003935
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Voice activity detection (VAD), which classifies frames as speech or non-speech, is an important module in many speech applications, including speaker verification. In this paper, we propose a novel method, called self-adaptive soft VAD, to incorporate a deep neural network (DNN)-based VAD into a deep speaker embedding system. The proposed method combines two approaches. The first is soft VAD, which performs a soft selection of the frame-level features extracted by a speaker feature extractor: the frame-level features are weighted by their corresponding speech posteriors, estimated by the DNN-based VAD, and then aggregated to generate a speaker embedding. The second is self-adaptive VAD, which fine-tunes the pre-trained VAD on the speaker verification data to reduce the domain mismatch. Here, we introduce two unsupervised domain adaptation (DA) schemes, namely speech posterior-based DA (SP-DA) and joint learning-based DA (JL-DA). Experiments on a Korean speech database demonstrate that self-adaptive soft VAD significantly improves verification performance in real-world environments.
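The soft selection described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact aggregation function, so the posterior-weighted mean used here is an assumption, and the function name `soft_vad_pooling`, the array shapes, and the `eps` stabilizer are hypothetical names chosen for illustration rather than taken from the paper.

```python
import numpy as np


def soft_vad_pooling(frame_features, speech_posteriors, eps=1e-8):
    """Aggregate frame-level features into one utterance-level vector.

    Assumption (not confirmed by the abstract): aggregation is a mean
    weighted by the per-frame speech posterior from the VAD.

    frame_features:    array of shape (T, D), one row per frame
    speech_posteriors: array of shape (T,), values in [0, 1]
    """
    features = np.asarray(frame_features, dtype=float)
    weights = np.asarray(speech_posteriors, dtype=float)

    if features.ndim != 2 or weights.shape != (features.shape[0],):
        raise ValueError("expected features of shape (T, D) and posteriors of shape (T,)")

    # Scale each frame by its speech posterior, then sum over time.
    weighted_sum = (weights[:, None] * features).sum(axis=0)

    # Normalize by the total posterior mass; eps guards against all-zero weights.
    return weighted_sum / (weights.sum() + eps)


# Toy input: two speech-like frames and one frame the VAD scores as non-speech.
features = np.array([[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]])
posteriors = np.array([1.0, 1.0, 0.0])
embedding = soft_vad_pooling(features, posteriors)
```

With hard VAD, the third frame would be dropped outright by thresholding; in this sketch it contributes in proportion to its posterior, which is zero in the toy input, so the result is approximately the mean of the first two frames.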
Pages: 365-372
Number of pages: 8