Bayesian Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification

Cited by: 9

Authors:

Zhu, Yingke [1]

Mak, Brian [1]

Affiliations:

[1] Hong Kong Univ Sci & Technol, Comp Sci & Engn, Hong Kong, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2023 / Vol. 31

Keywords:

Speaker verification; deep neural network; self-attention; speaker embedding; x-vectors;

DOI:

10.1109/TASLP.2023.3244502

Chinese Library Classification number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Learning effective and discriminative speaker embeddings is a crucial task in speaker verification. Usually, speaker embeddings are extracted from a speaker-classification DNN that averages the hidden vectors over all the spoken frames of a speaker; the hidden vectors produced from all the frames are assumed to be equally important. In our previous work, we relaxed this assumption and computed the speaker embedding as a weighted average of a speaker's frame-level hidden vectors, with the weights determined automatically by a self-attention mechanism. The effect of multiple attention heads has also been investigated to capture different aspects of a speaker's input speech. One challenge for multi-head attention is information redundancy: if there is no constraint during the training of multi-head attention, different heads may extract similar attentive features. In this paper, we generalize the deterministic multi-head attention to a Bayesian attention framework, and provide a new understanding of multi-head attention from a Bayesian perspective. Under the Bayesian framework, we adopt a recently developed sampling method in optimization, which explicitly enforces repulsiveness among the multiple heads. Systematic evaluation of the proposed Bayesian self-attentive speaker embeddings is performed on the VoxCeleb and SITW evaluation sets. Significant and consistent improvements over other multi-head attention systems are achieved on all the evaluation datasets. The best Bayesian system with eight heads improves the EER by around 26% on VoxCeleb and 9% on SITW over the single-head baseline.
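The pooling step described in the abstract (a weighted average of frame-level hidden vectors, with weights from self-attention, optionally with several heads) can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring form softmax(v · tanh(W h_t)), the function names, and the toy values of `W` and `heads` are all assumptions chosen for illustration, and the Bayesian sampling step that enforces repulsiveness among heads is not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(frames):
    """Baseline: every frame-level hidden vector gets equal weight."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def attentive_pool(frames, W, heads):
    """Multi-head self-attentive pooling (illustrative form, an assumption).

    frames: T hidden vectors of dimension d
    W:      projection matrix with d_a rows of length d (hypothetical)
    heads:  one scoring vector of length d_a per attention head (hypothetical)
    Returns the per-head weighted averages concatenated into one vector.
    """
    dim = len(frames[0])
    # Project every frame once: tanh(W h_t)
    projected = [
        [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]
        for h in frames
    ]
    embedding = []
    for v in heads:
        # One scalar score per frame, normalised into attention weights
        scores = [sum(a * p for a, p in zip(v, proj)) for proj in projected]
        weights = softmax(scores)
        # Weighted average of the original hidden vectors
        pooled = [
            sum(w * h[i] for w, h in zip(weights, frames)) for i in range(dim)
        ]
        embedding.extend(pooled)
    return embedding


# Toy example: three frames of dimension 2, two attention heads
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, -0.5], [0.25, 0.75]]
heads = [[1.0, 0.0], [0.0, 1.0]]
print(attentive_pool(frames, W, heads))
```

With an all-zero scoring vector the attention weights become uniform, so a single head reduces to the equal-weight average that the abstract describes as the usual baseline.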

Citation

Pages: 1000-1012

Number of pages: 13

50 records in total

[41] Maximum Likelihood Discriminant Feature for Text-Independent Speaker Verification
Liu, Qingsong
Dai, Beiqian
PROCEEDINGS OF THE 2009 2ND INTERNATIONAL CONGRESS ON IMAGE AND SIGNAL PROCESSING, VOLS 1-9, 2009, : 3733 - 3736
[42] Text-independent speaker verification using predictive neural networks
Finan, RA
Sapeluk, AT
Damper, RI
FIFTH INTERNATIONAL CONFERENCE ON ARTIFICIAL NEURAL NETWORKS, 1997, (440): : 274 - 279
[43] SMALL FOOTPRINT TEXT-INDEPENDENT SPEAKER VERIFICATION FOR EMBEDDED SYSTEMS
Balian, Julien
Tavarone, Raffaele
Poumeyrol, Mathieu
Coucke, Alice
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6179 - 6183
[44] Score Fusion Methods for Text-Independent Speaker Verification Applications
Rastoceanu, Florin
Lazar, Marilena
2011 6TH CONFERENCE ON SPEECH TECHNOLOGY AND HUMAN-COMPUTER DIALOGUE (SPED), 2011,
[45] Acoustic Feature Shuffling Network for Text-Independent Speaker Verification
Li, Jin
Fang, Xin
Chu, Fan
Gao, Tian
Song, Yan
Dai, Lirong
INTERSPEECH 2022, 2022, : 4790 - 4794
[46] Robust text-independent speaker verification using genetic programming
Day, Peter
Nandi, Asoke K.
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2007, 15 (01): : 285 - 295
[47] SpeakerNet for Cross-lingual Text-Independent Speaker Verification
Habib, Hafsa
Tauseef, Huma
Fahiem, Muhammad Abuzar
Farhan, Saima
Usman, Ghousia
ARCHIVES OF ACOUSTICS, 2020, 45 (04) : 573 - 583
[48] PROTOTYPICAL NETWORKS FOR SMALL FOOTPRINT TEXT-INDEPENDENT SPEAKER VERIFICATION
Ko, Tom
Chen, Yangbin
Li, Qing
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6804 - 6808
[49] A Text-Independent Speaker Verification System Based on Cross Entropy
Lu, Xiaochun
Yin, Junxun
COMPUTATIONAL INTELLIGENCE AND INTELLIGENT SYSTEMS, 2009, 51 : 419 - 426
[50] USEFULNESS OF THE LPC-RESIDUE IN TEXT-INDEPENDENT SPEAKER VERIFICATION
THEVENAZ, P
HUGLI, H
SPEECH COMMUNICATION, 1995, 17 (1-2) : 145 - 157
