SELF-ADAPTIVE SOFT VOICE ACTIVITY DETECTION USING DEEP NEURAL NETWORKS FOR ROBUST SPEAKER VERIFICATION

Authors:

Jung, Youngmoon [1]

Choi, Yeunju [1]

Kim, Hoirin [1]

Affiliations:

[1] Korea Adv Inst Sci & Technol, Sch Elect Engn, Daejeon, South Korea

Source:

2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019) | 2019

Keywords:

speaker verification; voice activity detection; unsupervised domain adaptation; soft VAD

DOI:

10.1109/asru46091.2019.9003935

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Voice activity detection (VAD), which classifies frames as speech or non-speech, is an important module in many speech applications, including speaker verification. In this paper, we propose a novel method, called self-adaptive soft VAD, to incorporate a deep neural network (DNN)-based VAD into a deep speaker embedding system. The proposed method combines two approaches. The first is soft VAD, which performs a soft selection of frame-level features extracted from a speaker feature extractor. The frame-level features are weighted by their corresponding speech posteriors estimated from the DNN-based VAD, and then aggregated to generate a speaker embedding. The second is self-adaptive VAD, which fine-tunes the pre-trained VAD on the speaker verification data to reduce the domain mismatch. Here, we introduce two unsupervised domain adaptation (DA) schemes, namely speech posterior-based DA (SP-DA) and joint learning-based DA (JL-DA). Experiments on a Korean speech database demonstrate that the verification performance is improved significantly in real-world environments by using self-adaptive soft VAD.
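The soft selection described in the abstract can be illustrated with a minimal sketch. It assumes the aggregation is a posterior-weighted mean over frames; the abstract does not state the exact pooling formula, so the normalization, the function name `soft_vad_pool`, and the `eps` parameter are hypothetical choices made for illustration only.

```python
from typing import List, Sequence


def soft_vad_pool(features: Sequence[Sequence[float]],
                  speech_posteriors: Sequence[float],
                  eps: float = 1e-8) -> List[float]:
    """Aggregate frame-level features into one vector, weighting each
    frame by its speech posterior (assumed form: weighted mean)."""
    if len(features) != len(speech_posteriors):
        raise ValueError("need exactly one posterior per frame")
    if not features:
        raise ValueError("need at least one frame")

    dim = len(features[0])
    pooled = [0.0] * dim
    for frame, posterior in zip(features, speech_posteriors):
        for d in range(dim):
            # Frames judged unlikely to be speech contribute little.
            pooled[d] += posterior * frame[d]

    # Normalize by the total posterior mass; eps guards against a
    # segment in which every frame has posterior zero.
    total = sum(speech_posteriors)
    return [value / (total + eps) for value in pooled]


# Hypothetical toy input: three 2-dimensional frames, the last one
# treated as non-speech by the VAD.
frames = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
posteriors = [1.0, 1.0, 0.0]
embedding = soft_vad_pool(frames, posteriors)
```

With posteriors restricted to 0 or 1, this reduces to conventional hard VAD followed by mean pooling, which is one way to see soft VAD as a relaxation of frame dropping.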

Pages: 365-372

Page count: 8
