Self-Attention Encoding and Pooling for Speaker Recognition

Cited: 35
Authors
Safari, Pooyan [1]
India, Miquel [1]
Hernando, Javier [1]
Affiliation
[1] Univ Politecn Cataluna, TALP Res Ctr, Barcelona, Spain
Source
INTERSPEECH 2020
Keywords
Self-Attention Encoding; Self-Attention Pooling; Speaker Verification; Speaker Embedding
DOI
10.21437/Interspeech.2020-1446
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline Classification Codes
100104; 100213
Abstract
The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory, and energy consumption. These limitations motivate researchers to design more efficient deep models. At the same time, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capability and strong performance on a variety of Natural Language Processing (NLP) tasks. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from variable-length speech utterances. SAEP is a stack of identical blocks that rely solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. The approach encodes short-term speaker spectral features into speaker embeddings for text-independent speaker verification. We have evaluated the approach on both the VoxCeleb1 and VoxCeleb2 datasets. The proposed architecture outperforms the baseline x-vector system and shows performance competitive with other convolution-based benchmarks, with a significant reduction in model size: it uses 94%, 95%, and 73% fewer parameters than ResNet-34, ResNet-50, and the x-vector, respectively. This indicates that the proposed fully attention-based architecture is more efficient at extracting time-invariant features from speaker utterances.
Pages: 941-945 (5 pages)
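
As a rough illustration of the mechanism the abstract describes, below is a minimal PyTorch sketch of a SAEP-style model: a stack of identical blocks built from multi-head self-attention and position-wise feed-forward networks encodes a variable-length sequence of short-term spectral features, and a self-attention pooling layer collapses the frame-level outputs into one fixed-size speaker embedding. The hyperparameters (feature dimension 40, model dimension 64, 4 heads, 2 blocks) and the single-vector additive pooling formulation are illustrative assumptions, not the exact values or equations from the paper.

# Minimal sketch of a SAEP-style encoder + pooling model (assumed
# hyperparameters; not the paper's exact configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    """One encoder block: multi-head self-attention plus a position-wise
    feed-forward network, each with a residual connection and layer norm."""
    def __init__(self, dim: int, heads: int, ffn_dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, frames, dim)
        attn_out, _ = self.attn(x, x, x)       # self-attention over frames
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))

class SelfAttentionPooling(nn.Module):
    """Collapse frame-level features into one vector: a learned scoring
    vector weights each frame, and the embedding is the weighted average."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1, bias=False)

    def forward(self, x):                       # x: (batch, frames, dim)
        w = F.softmax(self.score(x), dim=1)     # (batch, frames, 1)
        return (w * x).sum(dim=1)               # (batch, dim)

class SAEP(nn.Module):
    def __init__(self, feat_dim=40, dim=64, heads=4, ffn_dim=128, num_blocks=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)    # project spectral features
        self.blocks = nn.ModuleList(
            SelfAttentionBlock(dim, heads, ffn_dim) for _ in range(num_blocks)
        )
        self.pool = SelfAttentionPooling(dim)

    def forward(self, feats):                   # feats: (batch, frames, feat_dim)
        x = self.proj(feats)
        for block in self.blocks:
            x = block(x)
        return self.pool(x)                     # fixed-size speaker embedding

# Utterances of different lengths map to embeddings of the same size:
model = SAEP()
print(model(torch.randn(1, 200, 40)).shape)    # torch.Size([1, 64])
print(model(torch.randn(1, 350, 40)).shape)    # torch.Size([1, 64])

The final two calls show the property the abstract emphasizes: the pooled embedding has a fixed dimension regardless of how many frames the input utterance contains.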