Bayesian Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification

Cited by: 9
Authors
Zhu, Yingke [1]
Mak, Brian [1]
Affiliations
[1] Hong Kong Univ Sci & Technol, Comp Sci & Engn, Hong Kong, Peoples R China
Keywords
Speaker verification; deep neural network; self-attention; speaker embedding; x-vectors
DOI
10.1109/TASLP.2023.3244502
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Learning effective and discriminative speaker embeddings is a crucial task in speaker verification. Usually, speaker embeddings are extracted from a speaker-classification DNN that averages the hidden vectors over all the spoken frames of a speaker; the hidden vectors produced from all the frames are assumed to be equally important. In our previous work, we relaxed this assumption and computed the speaker embedding as a weighted average of a speaker's frame-level hidden vectors, with the weights determined automatically by a self-attention mechanism. The effect of multiple attention heads has also been investigated to capture different aspects of a speaker's input speech. One challenge for multi-head attention is the attention redundancy problem: if there is no constraint during the training of multi-head attention, different heads may extract similar attentive features. In this paper, we generalize the deterministic multi-head attention to a Bayesian attention framework, and provide a new understanding of multi-head attention from a Bayesian perspective. Under the Bayesian framework, we adopt a recently developed sampling method in optimization, which explicitly enforces repulsiveness among the multiple heads. Systematic evaluation of the proposed Bayesian self-attentive speaker embeddings is performed on the VoxCeleb and SITW evaluation sets. Significant and consistent improvements over other multi-head attention systems are achieved on all the evaluation datasets. The best Bayesian system with eight heads improves the EER by around 26% on VoxCeleb and 9% on SITW over the single-head baseline.
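The weighted-average pooling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product scoring function, the names `attentive_pooling` and `head_params`, and the concatenation of per-head outputs are all assumptions made for illustration, and the Bayesian training with repulsive heads is not shown. The sketch only shows how each head turns per-frame scores into softmax weights and pools the frame-level hidden vectors accordingly.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pooling(frames: List[Vector], head_params: List[Vector]) -> Vector:
    """Pool T frame-level hidden vectors into one utterance-level vector.

    frames:      T hidden vectors, each of dimension D.
    head_params: one scoring vector of dimension D per attention head
                 (hypothetical simplification: score = dot(w, h)).
    Returns the per-head weighted averages concatenated (length D * heads).
    """
    dim = len(frames[0])
    embedding: Vector = []
    for w in head_params:
        # One scalar score per frame, then normalise across frames.
        scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in frames]
        weights = softmax(scores)
        # Weighted average of the frames under this head's attention weights.
        pooled = [sum(a * h[d] for a, h in zip(weights, frames)) for d in range(dim)]
        embedding.extend(pooled)
    return embedding


if __name__ == "__main__":
    frames = [[1.0, 0.0], [3.0, 2.0]]
    # A zero scoring vector gives uniform weights, i.e. plain averaging.
    print(attentive_pooling(frames, [[0.0, 0.0]]))
    # Two heads with different scoring vectors attend to different frames.
    print(attentive_pooling(frames, [[10.0, 0.0], [-10.0, 0.0]]))
```

With a zero scoring vector the head reduces to the equal-weight averaging that the abstract describes as the usual baseline. Heads with similar scoring vectors produce similar pooled outputs, which is the redundancy the abstract says unconstrained multi-head training can suffer from.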
Pages: 1000-1012
Page count: 13