Coupling a Generative Model With a Discriminative Learning Framework for Speaker Verification

Cited by: 1
Authors
Lu, Xugang [1]
Shen, Peng [1]
Tsao, Yu [2]
Kawai, Hisashi [1]
Affiliations
[1] Natl Inst Informat & Commun Technol, Adv Speech Translat Res & Dev Promot Ctr, Kyoto 6190288, Japan
[2] Acad Sinica, Res Ctr Informat Technol Innovat, Taipei 115, Taiwan
Keywords
Feature extraction; Data models; Task analysis; Measurement; Training; Solid modeling; Neural networks; Discriminative model; generative model; joint Bayesian model; speaker verification; RECOGNITION; MACHINES
DOI
10.1109/TASLP.2021.3129360
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
The task of speaker verification (SV) is to decide whether an utterance is spoken by a target speaker or an impostor. In most SV studies, a log-likelihood ratio (LLR) score is estimated from a generative probability model of speaker features and compared with a threshold to make a decision. However, a generative model usually focuses on individual feature distributions, lacks discriminative feature selection ability, and is easily distracted by nuisance features. SV, as a hypothesis test, can be formulated as a binary discrimination task to which neural-network-based discriminative learning can be applied. In discriminative learning, nuisance features can be removed with the help of label supervision. However, discriminative learning pays more attention to classification boundaries and is prone to overfitting the training set, which may result in poor generalization to a test set. In this paper, we propose a hybrid learning framework that couples the structure and parameters of a joint Bayesian (JB) generative model with a neural discriminative learning framework for SV. In the hybrid framework, a two-branch Siamese neural network is built with dense layers that are coupled with factorized affine transforms as used in the JB model. The LLR score estimation in the JB model is formulated according to the distance metric in the discriminative learning framework. After initializing the two-branch neural network with the generatively learned parameters of the JB model, we further train the model parameters on pairwise samples as a binary discrimination task. Moreover, a direct evaluation metric (DEM) for SV based on minimum empirical Bayes risk (EBR) is designed and integrated as an objective function in the discriminative learning. We carried out SV experiments on Speakers in the Wild (SITW) and VoxCeleb. Experimental results showed that our proposed model improved performance by a large margin compared with state-of-the-art models for SV.
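The generative LLR scoring that the abstract takes as its starting point can be illustrated with a minimal sketch. It assumes the standard joint Bayesian formulation, in which a speaker feature is the sum of a zero-mean Gaussian speaker component (covariance `S_mu`) and a zero-mean Gaussian within-speaker component (covariance `S_eps`). This is not the paper's hybrid Siamese network; the function names, the toy covariances, and the default threshold of 0 are illustrative assumptions only.

```python
import numpy as np


def gaussian_logpdf_zero_mean(z, cov):
    """Log-density of a zero-mean multivariate Gaussian evaluated at z."""
    d = z.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    quad = z @ np.linalg.solve(cov, z)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)


def jb_llr(x1, x2, S_mu, S_eps):
    """LLR of 'same speaker' versus 'different speakers' for a feature pair.

    Assumed model: x = mu + eps, with mu ~ N(0, S_mu) and eps ~ N(0, S_eps).
    Same speaker: x1 and x2 share mu, so their cross-covariance is S_mu.
    Different speakers: x1 and x2 are independent.
    """
    S = S_mu + S_eps
    zero = np.zeros_like(S)
    cov_same = np.block([[S, S_mu], [S_mu, S]])
    cov_diff = np.block([[S, zero], [zero, S]])
    z = np.concatenate([x1, x2])
    return gaussian_logpdf_zero_mean(z, cov_same) - gaussian_logpdf_zero_mean(z, cov_diff)


def accept(x1, x2, S_mu, S_eps, threshold=0.0):
    """Accept the target-speaker hypothesis when the LLR exceeds the threshold."""
    return bool(jb_llr(x1, x2, S_mu, S_eps) > threshold)


# Toy example: strong between-speaker variance, weak within-speaker variance.
S_mu = 4.0 * np.eye(2)
S_eps = 0.1 * np.eye(2)
enroll = np.array([1.0, 1.0])
test_close = np.array([1.0, 1.0])
test_far = np.array([-1.0, -1.0])
print(accept(enroll, test_close, S_mu, S_eps), accept(enroll, test_far, S_mu, S_eps))
```

In this sketch the covariances are fixed by hand; in a generative system they would be estimated from labeled training features, and the abstract's proposal is to use such generatively learned parameters as the initialization for further discriminative training.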
Pages: 3631-3641
Number of pages: 11