Self-Attention Encoding and Pooling for Speaker Recognition

Cited by: 35
Authors
Safari, Pooyan [1 ]
India, Miquel [1 ]
Hernando, Javier [1 ]
Affiliations
[1] Univ Politecn Cataluna, TALP Res Ctr, Barcelona, Spain
Source
INTERSPEECH 2020
Keywords
Self-Attention Encoding; Self-Attention Pooling; Speaker Verification; Speaker Embedding;
DOI
10.21437/Interspeech.2020-1446
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Codes
100104; 100213;
Abstract
The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory, and energy consumption. These limitations motivate the design of more efficient deep models. At the same time, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capability and strong performance on a variety of Natural Language Processing (NLP) tasks. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from variable-length speech utterances. SAEP is a stack of identical blocks relying solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. This approach encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification. We have evaluated this approach on both the VoxCeleb1 and VoxCeleb2 datasets. The proposed architecture outperforms the baseline x-vector and shows performance competitive with other convolution-based benchmarks, with a significant reduction in model size: it uses 94%, 95%, and 73% fewer parameters than ResNet-34, ResNet-50, and x-vector, respectively. This indicates that the proposed fully attention-based architecture is more efficient at extracting time-invariant features from speaker utterances.
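
The abstract describes two components: a stack of identical encoder blocks built from self-attention and position-wise feed-forward networks applied to frame-level spectral features, and a self-attention pooling step that collapses the variable-length sequence into one fixed-size speaker embedding. The following is a minimal PyTorch sketch of that idea, not the authors' implementation; the layer sizes, number of blocks and heads, and the use of torch.nn.MultiheadAttention are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    """One encoder block: multi-head self-attention plus a position-wise FFN."""
    def __init__(self, dim: int, heads: int = 4, ffn_dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                       # x: (batch, frames, dim)
        a, _ = self.attn(x, x, x)               # self-attention across frames
        x = self.norm1(x + a)                   # residual connection + layer norm
        return self.norm2(x + self.ffn(x))      # position-wise feed-forward

class SelfAttentionPooling(nn.Module):
    """Collapse (batch, frames, dim) into (batch, dim) with learned attention weights."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1, bias=False)

    def forward(self, x):
        w = F.softmax(self.score(x), dim=1)     # one weight per frame, summing to 1
        return (w * x).sum(dim=1)               # weighted average = utterance-level embedding

class SAEP(nn.Module):
    """Stack of identical self-attention blocks followed by self-attention pooling."""
    def __init__(self, feat_dim: int = 80, dim: int = 128, n_blocks: int = 2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)    # project short-term spectral features
        self.blocks = nn.ModuleList([SelfAttentionBlock(dim) for _ in range(n_blocks)])
        self.pool = SelfAttentionPooling(dim)

    def forward(self, feats):                   # feats: (batch, frames, feat_dim)
        x = self.proj(feats)
        for blk in self.blocks:
            x = blk(x)
        return self.pool(x)                     # fixed-size speaker embedding

# Usage: a 300-frame and a 500-frame utterance both map to a 128-dim embedding.
model = SAEP()
print(model(torch.randn(1, 300, 80)).shape)     # torch.Size([1, 128])
print(model(torch.randn(1, 500, 80)).shape)     # torch.Size([1, 128])

Because the pooling step is a weighted average over frames, the embedding size is independent of utterance length, which is what makes this kind of encoder applicable to text-independent verification on non-fixed length inputs.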
Pages: 941-945
Page count: 5
Related Papers
50 in total
  • [1] Lee, Junhyun; Lee, Inyeop; Kang, Jaewoo. Self-Attention Graph Pooling. International Conference on Machine Learning, Vol. 97, 2019.
  • [2] Zhao, Y.-F.; Jin, F.-S.; Li, R.-H.; Qin, H.-C.; Cui, P.; Wang, G.-R. Self-attention Hypergraph Pooling Network. Ruan Jian Xue Bao/Journal of Software, 2023, 34(10).
  • [3] Wang, Rui; Ao, Junyi; Zhou, Long; Liu, Shujie; Wei, Zhihua; Ko, Tom; Li, Qing; Zhang, Yu. Multi-View Self-Attention Based Transformer for Speaker Recognition. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 6732-6736.
  • [4] Li, Dongdong; Yang, Zhuo; Liu, Jinlin; Yang, Hai; Wang, Zhe. Emotion embedding framework with emotional self-attention mechanism for speaker recognition. Expert Systems with Applications, 2024, 238.
  • [5] Wang, Fangwei; Song, Ruixin; Tan, Zhiyuan; Li, Qingru; Wang, Changguang; Yang, Yong. Self-attention is What You Need to Fool a Speaker Recognition System. 2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom/BigDataSE/CSE/EUC/ISCI 2023), 2024: 929-936.
  • [6] Bian, Tengyue; Chen, Fangzhou; Xu, Li. Self-attention based speaker recognition using Cluster-Range Loss. Neurocomputing, 2019, 368: 59-68.
  • [7] Nguyen Nang An; Nguyen Quang Thanh; Liu, Yanbing. Deep CNNs With Self-Attention for Speaker Identification. IEEE Access, 2019, 7: 85327-85337.
  • [8] Fu, Pengbin; Ma, Yuchen; Yang, Huirong. Speaker diarization with variants of self-attention and joint speaker embedding extractor. Journal of Intelligent & Fuzzy Systems, 2023, 45(5): 9169-9180.
  • [9] Li, Huifang; Huang, Jingwei; Zhou, Mengchu; Shi, Qisong; Fei, Qing. Self-Attention Pooling-Based Long-Term Temporal Network for Action Recognition. IEEE Transactions on Cognitive and Developmental Systems, 2023, 15(1): 65-77.
  • [10] Lin, Ju; Van Wijngaarden, Adriaan J.; Smith, Melissa C.; Wang, Kuang-Ching. Speaker-Aware Speech Enhancement with Self-Attention. 29th European Signal Processing Conference (EUSIPCO 2021), 2021: 486-490.