Self-Attention Encoding and Pooling for Speaker Recognition

Cited by: 35
Authors
Safari, Pooyan [1 ]
India, Miquel [1 ]
Hernando, Javier [1 ]
Affiliations
[1] Univ Politecn Cataluna, TALP Res Ctr, Barcelona, Spain
Source
INTERSPEECH 2020
Keywords
Self-Attention Encoding; Self-Attention Pooling; Speaker Verification; Speaker Embedding;
DOI
10.21437/Interspeech.2020-1446
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Codes
100104; 100213;
Abstract
The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory, and energy consumption. These limitations motivate the design of more efficient deep models. At the same time, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capability and strong performance on a variety of Natural Language Processing (NLP) tasks. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from variable-length speech utterances. SAEP is a stack of identical blocks relying solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. This approach encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification. We have evaluated this approach on both the VoxCeleb1 and VoxCeleb2 datasets. The proposed architecture outperforms the baseline x-vector and shows performance competitive with other convolution-based benchmarks, with a significant reduction in model size: it uses 94%, 95%, and 73% fewer parameters than ResNet-34, ResNet-50, and x-vector, respectively. This indicates that the proposed fully attention-based architecture is more efficient at extracting time-invariant features from speaker utterances.
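
The abstract describes two components: a stack of identical encoder blocks built from self-attention and position-wise feed-forward networks applied to frame-level spectral features, and a self-attention pooling step that collapses the variable-length sequence into one fixed-size speaker embedding. The following is a minimal PyTorch sketch of that idea, not the authors' implementation; the layer sizes, number of blocks and heads, and the use of torch.nn.MultiheadAttention are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    """One encoder block: multi-head self-attention plus a position-wise FFN."""
    def __init__(self, dim: int, heads: int = 4, ffn_dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                       # x: (batch, frames, dim)
        a, _ = self.attn(x, x, x)               # self-attention across frames
        x = self.norm1(x + a)                   # residual connection + layer norm
        return self.norm2(x + self.ffn(x))      # position-wise feed-forward

class SelfAttentionPooling(nn.Module):
    """Collapse (batch, frames, dim) into (batch, dim) with learned attention weights."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1, bias=False)

    def forward(self, x):
        w = F.softmax(self.score(x), dim=1)     # one weight per frame, summing to 1
        return (w * x).sum(dim=1)               # weighted average = utterance-level embedding

class SAEP(nn.Module):
    """Stack of identical self-attention blocks followed by self-attention pooling."""
    def __init__(self, feat_dim: int = 80, dim: int = 128, n_blocks: int = 2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)    # project short-term spectral features
        self.blocks = nn.ModuleList([SelfAttentionBlock(dim) for _ in range(n_blocks)])
        self.pool = SelfAttentionPooling(dim)

    def forward(self, feats):                   # feats: (batch, frames, feat_dim)
        x = self.proj(feats)
        for blk in self.blocks:
            x = blk(x)
        return self.pool(x)                     # fixed-size speaker embedding

# Usage: a 300-frame and a 500-frame utterance both map to a 128-dim embedding.
model = SAEP()
print(model(torch.randn(1, 300, 80)).shape)     # torch.Size([1, 128])
print(model(torch.randn(1, 500, 80)).shape)     # torch.Size([1, 128])

Because the pooling step is a weighted average over frames, the embedding size is independent of utterance length, which is what makes this kind of encoder applicable to text-independent verification on non-fixed length inputs.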
Pages: 941-945
Page count: 5
Related Papers
50 in total
  • [1] Lee, Junhyun; Lee, Inyeop; Kang, Jaewoo. Self-Attention Graph Pooling. International Conference on Machine Learning, Vol. 97, 2019.
  • [2] Zhao, Y.-F.; Jin, F.-S.; Li, R.-H.; Qin, H.-C.; Cui, P.; Wang, G.-R. Self-attention Hypergraph Pooling Network. Ruan Jian Xue Bao/Journal of Software, 2023, 34(10).
  • [3] Wang, Rui; Ao, Junyi; Zhou, Long; Liu, Shujie; Wei, Zhihua; Ko, Tom; Li, Qing; Zhang, Yu. Multi-View Self-Attention Based Transformer for Speaker Recognition. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 6732-6736.
  • [4] Li, Dongdong; Yang, Zhuo; Liu, Jinlin; Yang, Hai; Wang, Zhe. Emotion embedding framework with emotional self-attention mechanism for speaker recognition. Expert Systems with Applications, 2024, 238.
  • [5] Wang, Fangwei; Song, Ruixin; Tan, Zhiyuan; Li, Qingru; Wang, Changguang; Yang, Yong. Self-attention is What You Need to Fool a Speaker Recognition System. 2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom/BigDataSE/CSE/EUC/ISCI 2023), 2024: 929-936.
  • [6] Bian, Tengyue; Chen, Fangzhou; Xu, Li. Self-attention based speaker recognition using Cluster-Range Loss. Neurocomputing, 2019, 368: 59-68.
  • [7] Nguyen Nang An; Nguyen Quang Thanh; Liu, Yanbing. Deep CNNs With Self-Attention for Speaker Identification. IEEE Access, 2019, 7: 85327-85337.
  • [8] Fu, Pengbin; Ma, Yuchen; Yang, Huirong. Speaker diarization with variants of self-attention and joint speaker embedding extractor. Journal of Intelligent & Fuzzy Systems, 2023, 45(5): 9169-9180.
  • [9] Li, Huifang; Huang, Jingwei; Zhou, Mengchu; Shi, Qisong; Fei, Qing. Self-Attention Pooling-Based Long-Term Temporal Network for Action Recognition. IEEE Transactions on Cognitive and Developmental Systems, 2023, 15(1): 65-77.
  • [10] Lin, Ju; Van Wijngaarden, Adriaan J.; Smith, Melissa C.; Wang, Kuang-Ching. Speaker-Aware Speech Enhancement with Self-Attention. 29th European Signal Processing Conference (EUSIPCO 2021), 2021: 486-490.