Bayesian Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification

Cited by: 9

Authors:

Zhu, Yingke [1]

Mak, Brian [1]

Affiliations:

[1] Hong Kong Univ Sci & Technol, Comp Sci & Engn, Hong Kong, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2023 / Vol. 31

Keywords:

Speaker verification; deep neural network; self-attention; speaker embedding; x-vectors;

DOI:

10.1109/TASLP.2023.3244502

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Learning effective and discriminative speaker embeddings is a crucial task in speaker verification. Usually, speaker embeddings are extracted from a speaker-classification DNN that averages the hidden vectors over all the spoken frames of a speaker; the hidden vectors produced from all the frames are assumed to be equally important. In our previous work, we relaxed this assumption and computed the speaker embedding as a weighted average of a speaker's frame-level hidden vectors, and their weights were automatically determined by a self-attention mechanism. The effect of multiple attention heads has also been investigated to capture different aspects of a speaker's input speech. One challenge for multi-head attention is the information redundancy problem. If there is no constraint during the training of multi-head attention, different heads may extract similar attentive features, leading to the attention redundancy problem. In this paper, we generalize the deterministic multi-head attention to a Bayesian attention framework, and provide a new understanding of multi-head attention from a Bayesian perspective. Under the Bayesian framework, we adopt the recently developed sampling method in optimization, which explicitly enforces the repulsiveness among the multiple heads. Systematic evaluation of the proposed Bayesian self-attentive speaker embeddings is performed on VoxCeleb and SITW evaluation sets. Significant and consistent improvements over other multi-head attention systems are achieved on all the evaluation datasets. The best Bayesian system with eight heads improves the EER by around 26% on VoxCeleb and 9% on SITW over the single-head baseline.
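The pooling idea in the abstract (a weighted average of frame-level hidden vectors, with weights from self-attention, versus a plain average) can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product scoring, the function names, and the `head_vectors` parameters are assumptions made for illustration, and the Bayesian sampling step that enforces repulsiveness among heads is not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pooling(frames):
    """Baseline pooling: every frame gets the same weight 1/T."""
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(h[d] for h in frames) / num_frames for d in range(dim)]


def attentive_pooling(frames, head_vectors):
    """Pool frame-level vectors into one weighted average per attention head.

    frames:       list of T frame-level hidden vectors, each of length D
    head_vectors: list of K scoring vectors, each of length D
                  (hypothetical parameters; assumed dot-product scoring)
    Returns the K pooled vectors concatenated into one list of length K * D.
    """
    dim = len(frames[0])
    embedding = []
    for v in head_vectors:
        # One scalar score per frame: dot product with this head's vector.
        scores = [sum(h_d * v_d for h_d, v_d in zip(h, v)) for h in frames]
        weights = softmax(scores)  # weights over frames sum to 1
        pooled = [sum(w * h[d] for w, h in zip(weights, frames)) for d in range(dim)]
        embedding.extend(pooled)
    return embedding
```

With an all-zero scoring vector every frame receives the same weight, so the attentive pooling reduces to the plain average that the abstract describes as the usual approach. If two heads end up with identical scoring vectors, they produce identical pooled vectors, which is the redundancy the abstract refers to.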


Pages: 1000-1012

Page count: 13

50 items in total

[31] Score normalization for text-independent speaker verification systems
Auckenthaler, R
Carey, M
Lloyd-Thomas, H
DIGITAL SIGNAL PROCESSING, 2000, 10 (1-3) : 42 - 54
[32] Text-Independent Speaker Verification Using Rank Threshold in Large Number of Speaker Models
Okamoto, Haruka
Tsuge, Satoru
Abdelwahab, Amira
Nishida, Masafumi
Horiuchi, Yasuo
Kuroiwa, Shingo
INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2319 - +
[33] DeltaVLAD: An efficient optimization algorithm to discriminate speaker embedding for text-independent speaker verification
Guo, Xin
Luo, Chengfang
Deng, Aiwen
Deng, Feiqi
AIMS MATHEMATICS, 2022, 7 (04) : 6381 - 6395
[34] Cross similarity measurement for speaker adaptive test normalization in text-independent speaker verification
ZHAO Jian
The Journal of China Universities of Posts and Telecommunications, 2008, (02) : 130 - 134
[35] Introducing Self-Supervised Phonetic Information for Text-Independent Speaker Verification
Zhang, Ziyang
Guo, Wu
Gu, Bin
INTERSPEECH 2023, 2023, : 4698 - 4702
[36] SPEAKER DIARISATION USING 2D SELF-ATTENTIVE COMBINATION OF EMBEDDINGS
Sun, G.
Zhang, C.
Woodland, P. C.
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 5801 - 5805
[37] Improving the Generalized Performance of Deep Embedding for Text-Independent Speaker Verification
Li, Rongjin
Li, Lin
Hong, Qingyang
Guo, Huiyang
Zhao, Miao
PROCEEDINGS OF 2018 12TH IEEE INTERNATIONAL CONFERENCE ON ANTI-COUNTERFEITING, SECURITY, AND IDENTIFICATION (ASID), 2018, : 21 - 25
[38] IMPROVING TEXT-INDEPENDENT SPEAKER VERIFICATION WITH AUXILIARY SPEAKERS USING GRAPH
Li, Jingyu
Ng, Si-Ioi
Lee, Tan
2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 198 - 205
[39] Significance of Constraining Text in Limited Data Text-independent Speaker Verification
Das, Rohan Kumar
Jelil, Sarfaraz
Prasanna, S. R. Mahadeva
2016 INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND COMMUNICATIONS (SPCOM), 2016,
[40] GENERATIVE X-VECTORS FOR TEXT-INDEPENDENT SPEAKER VERIFICATION
Xu, Longting
Das, Rohan Kumar
Yilmaz, Emre
Yang, Jichen
Li, Haizhou
2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 1014 - 1020
