Coupling a Generative Model With a Discriminative Learning Framework for Speaker Verification

Cited by: 1

Authors:

Lu, Xugang [1]

Shen, Peng [1]

Tsao, Yu [2]

Kawai, Hisashi [1]

Affiliations:

[1] Natl Inst Informat & Commun Technol, Adv Speech Translat Res & Dev Promot Ctr, Kyoto 6190288, Japan

[2] Acad Sinica, Res Ctr Informat Technol Innovat, Taipei 115, Taiwan

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2021 / Vol. 29

Keywords:

Feature extraction; Data models; Task analysis; Measurement; Training; Solid modeling; Neural networks; Discriminative model; generative model; joint Bayesian model; speaker verification; RECOGNITION; MACHINES;

DOI:

10.1109/TASLP.2021.3129360

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

The task of speaker verification (SV) is to decide whether an utterance is spoken by a target or an impostor speaker. In most studies of SV, a log-likelihood ratio (LLR) score is estimated based on a generative probability model on speaker features, and compared with a threshold for making a decision. However, the generative model usually focuses on individual feature distributions, lacks discriminative feature selection ability, and is easily distracted by nuisance features. SV, as a hypothesis test, can be formulated as a binary discrimination task to which neural network based discriminative learning can be applied. In discriminative learning, the nuisance features can be removed with the help of label supervision. However, discriminative learning pays more attention to classification boundaries, and is prone to overfitting to the training set, which may result in poor generalization on a test set. In this paper, we propose a hybrid learning framework, i.e., coupling a joint Bayesian (JB) generative model structure and parameters with a neural discriminative learning framework for SV. In the hybrid framework, a two-branch Siamese neural network is built with dense layers that are coupled with factorized affine transforms as used in the JB model. The LLR score estimation in the JB model is formulated according to the distance metric in the discriminative learning framework. After initializing the two-branch neural network with the generatively learned parameters of the JB model, we further train the model parameters with pairwise samples as a binary discrimination task. Moreover, a direct evaluation metric (DEM) in SV based on minimum empirical Bayes risk (EBR) is designed and integrated as an objective function in the discriminative learning. We carried out SV experiments on Speakers in the Wild (SITW) and VoxCeleb. Experimental results showed that our proposed model improved the performance by a large margin compared with state-of-the-art models for SV.
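The LLR scoring step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the code assumes the commonly used joint Bayesian formulation in which a zero-mean speaker feature is x = mu + eps, with speaker identity mu ~ N(0, S_mu) and within-speaker variation eps ~ N(0, S_eps). The function names (`gaussian_logpdf_zero_mean`, `jb_llr`, `decide`), the covariance values, and the threshold are hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_logpdf_zero_mean(z, cov):
    """Log density of a zero-mean multivariate Gaussian evaluated at z."""
    d = z.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    quad = z @ np.linalg.solve(cov, z)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)


def jb_llr(x1, x2, s_mu, s_eps):
    """LLR of 'same speaker' vs 'different speakers' for a feature pair.

    Assumption (not taken from the paper): x = mu + eps with
    mu ~ N(0, s_mu) and eps ~ N(0, s_eps), features already mean-centred.
    """
    total = s_mu + s_eps
    zeros = np.zeros_like(s_mu)
    # Same speaker: the two features share mu, so they are correlated via s_mu.
    cov_same = np.block([[total, s_mu], [s_mu, total]])
    # Different speakers: independent identities, so no cross-covariance.
    cov_diff = np.block([[total, zeros], [zeros, total]])
    z = np.concatenate([x1, x2])
    return gaussian_logpdf_zero_mean(z, cov_same) - gaussian_logpdf_zero_mean(z, cov_diff)


def decide(llr, threshold=0.0):
    """Accept the target hypothesis when the LLR exceeds the threshold."""
    return bool(llr > threshold)


if __name__ == "__main__":
    # Hypothetical toy covariances: large between-speaker, small within-speaker spread.
    s_mu = 4.0 * np.eye(2)
    s_eps = 0.1 * np.eye(2)
    enrol = np.array([2.0, 0.0])
    print(decide(jb_llr(enrol, enrol, s_mu, s_eps)))   # identical features
    print(decide(jb_llr(enrol, -enrol, s_mu, s_eps)))  # opposite features
```

In this sketch the score is computed directly from the two stacked Gaussian densities rather than from a factorized closed form, which keeps the example short; the hybrid model described in the abstract instead couples the JB structure with trainable dense layers.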


Pages: 3631-3641

Number of pages: 11

50 items in total

[21] Machine learning: Discriminative and generative
Marina Meila
The Mathematical Intelligencer, 2006, 28 (1) : 67 - 69
[22] A generative-discriminative learning model for noisy information fusion
Hecht, Thomas
Gepperth, Alexander
5TH INTERNATIONAL CONFERENCE ON DEVELOPMENT AND LEARNING AND ON EPIGENETIC ROBOTICS (ICDL-EPIROB), 2015, : 242 - 247
[23] Learning A Joint Discriminative-Generative Model for Action Recognition
Alexiou, Ioannis
Xiang, Tao
Gong, Shaogang
2015 INTERNATIONAL CONFERENCE ON SYSTEMS, SIGNALS AND IMAGE PROCESSING (IWSSIP 2015), 2015, : 1 - 4
[24] Combination of Cepstral and Phonetically Discriminative Features for Speaker Verification
Sarkar, Achintya K.
Cong-Thanh Do
Le, Viet-Bac
Barras, Claude
IEEE SIGNAL PROCESSING LETTERS, 2014, 21 (09) : 1040 - 1044
[25] A Discriminative Method for Speaker Verification Using the Difference Information
Lei, Zhenchun
Yang, Yingchun
Wu, Zhaohui
INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 497 - 500
[26] DISCRIMINATIVE MULTI-DOMAIN PLDA FOR SPEAKER VERIFICATION
Sholokhov, Alexey
Kinnunen, Tomi
Cumani, Sandro
2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5030 - 5034
[27] Deep Discriminative Embeddings for Duration Robust Speaker Verification
Li, Na
Tuo, Deyi
Su, Dan
Li, Zhifeng
Yu, Dong
19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2262 - 2266
[28] A DISCRIMINATIVE CONDITION-AWARE BACKEND FOR SPEAKER VERIFICATION
Ferrer, Luciana
McLaren, Mitchell
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6604 - 6608
[29] Comparison of Generative and Discriminative Approaches for Speaker Recognition with Limited Data
Silovsky, Jan
Cerva, Petr
Zdansky, Jindrich
RADIOENGINEERING, 2009, 18 (03) : 307 - 316
[30] Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification
Wang, Shuai
Huang, Zili
Qian, Yanmin
Yu, Kai
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (11) : 1686 - 1696
