MULTI-VIEW SELF-ATTENTION BASED TRANSFORMER FOR SPEAKER RECOGNITION

Cited by: 23
Authors
Wang, Rui [1 ,4 ]
Ao, Junyi [2 ,3 ,4 ]
Zhou, Long [4 ]
Liu, Shujie [4 ]
Wei, Zhihua [1 ]
Ko, Tom [2 ]
Li, Qing [3 ]
Zhang, Yu [2 ]
Affiliations
[1] Tongji Univ, Dept Comp Sci & Technol, Shanghai, Peoples R China
[2] Southern Univ Sci & Technol, Dept Comp Sci & Engn, Shenzhen, Guangdong, Peoples R China
[3] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
[4] Microsoft Res Asia, Beijing, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
speaker recognition; Transformer; speaker identification; speaker verification;
DOI
10.1109/ICASSP43922.2022.9746639
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline codes
070206 ; 082403 ;
Abstract
Initially developed for natural language processing (NLP), the Transformer model is now widely used for speech processing tasks such as speaker recognition, owing to its powerful sequence modeling capabilities. However, conventional self-attention mechanisms were originally designed for modeling textual sequences and do not consider the characteristics of speech and speaker modeling. Moreover, different Transformer variants for speaker recognition have not been well studied. In this work, we propose a novel multi-view self-attention mechanism and present an empirical study of different Transformer variants, with and without the proposed attention mechanism, for speaker recognition. Specifically, to balance the ability to capture global dependencies against modeling locality, we propose a multi-view self-attention mechanism for the speaker Transformer, in which different attention heads can attend to different ranges of the receptive field. Furthermore, we introduce and compare five Transformer variants with different network architectures, embedding locations, and pooling methods to learn speaker embeddings. Experimental results on the VoxCeleb1 and VoxCeleb2 datasets show that the proposed multi-view self-attention mechanism improves speaker recognition performance, and that the proposed speaker Transformer network attains excellent results compared with state-of-the-art models.
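The abstract describes attention heads that each attend to a different range of the receptive field, mixing global and local views within one layer. The following is a minimal PyTorch sketch of that general idea only; the head count, window sizes, and masking scheme below are illustrative assumptions and not the authors' actual configuration.

```python
# Sketch of multi-view self-attention: each head is restricted to a different
# receptive-field range (some global, some local). All hyperparameters here
# (d_model=256, 4 heads, windows of None/64/16/4 frames) are assumptions for
# illustration, not taken from the paper.
import torch
import torch.nn as nn


class MultiViewSelfAttention(nn.Module):
    def __init__(self, d_model=256, num_heads=4, windows=(None, 64, 16, 4)):
        # windows[h] is the view of head h: None = global attention,
        # integer w = each frame attends only to frames within +/- w.
        super().__init__()
        assert d_model % num_heads == 0 and len(windows) == num_heads
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.windows = windows
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, time, d_model) acoustic frame representations
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):  # (batch, time, d_model) -> (batch, heads, time, d_head)
            return z.view(b, t, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.d_head ** 0.5

        # Per-head mask: heads with a finite window only see nearby frames.
        pos = torch.arange(t, device=x.device)
        dist = (pos[None, :] - pos[:, None]).abs()            # (time, time)
        mask = torch.zeros(self.num_heads, t, t, device=x.device, dtype=torch.bool)
        for h, w in enumerate(self.windows):
            if w is not None:                                  # local view
                mask[h] = dist > w
        scores = scores.masked_fill(mask[None], float("-inf"))

        attn = torch.softmax(scores, dim=-1)
        ctx = torch.matmul(attn, v).transpose(1, 2).reshape(b, t, -1)
        return self.out(ctx)


if __name__ == "__main__":
    layer = MultiViewSelfAttention()
    frames = torch.randn(2, 200, 256)   # e.g. 2 utterances, 200 frames each
    print(layer(frames).shape)          # torch.Size([2, 200, 256])
```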
Pages: 6732 - 6736
Number of pages: 5
Related papers
50 records in total
  • [21] MVCformer: A transformer-based multi-view clustering method
    Zhao, Mingyu
    Yang, Weidong
    Nie, Feiping
    INFORMATION SCIENCES, 2023, 649
  • [22] Re-Transformer: A Self-Attention Based Model for Machine Translation
    Liu, Huey-Ing
    Chen, Wei-Lin
    AI IN COMPUTATIONAL LINGUISTICS, 2021, 189 : 3 - 10
  • [23] An Improved End-to-End Multi-Target Tracking Method Based on Transformer Self-Attention
    Hong, Yong
    Li, Deren
    Luo, Shupei
    Chen, Xin
    Yang, Yi
    Wang, Mi
    REMOTE SENSING, 2022, 14 (24)
  • [24] Action Transformer: A self-attention model for short-time pose-based human action recognition
    Mazzia, Vittorio
    Angarano, Simone
    Salvetti, Francesco
    Angelini, Federico
    Chiaberge, Marcello
    PATTERN RECOGNITION, 2022, 124
  • [25] Multi-Region and Multi-Band Electroencephalogram Emotion Recognition Based on Self-Attention and Capsule Network
    Ke, Sheng
    Ma, Chaoran
    Li, Wenjie
    Lv, Jidong
    Zou, Ling
    Prati, Andrea
    APPLIED SCIENCES-BASEL, 2024, 14 (02)
  • [26] Local self-attention in transformer for visual question answering
    Shen, Xiang
    Han, Dezhi
    Guo, Zihan
    Chen, Chongqing
    Hua, Jie
    Luo, Gaofeng
    APPLIED INTELLIGENCE, 2023, 53 (13) : 16706 - 16723
  • [28] A Multi-Head Self-Attention Transformer-Based Model for Traffic Situation Prediction in Terminal Areas
    Yu, Zhou
    Shi, Xingyu
    Zhang, Zhaoning
    IEEE ACCESS, 2023, 11 : 16156 - 16165
  • [29] Class token and knowledge distillation for multi-head self-attention speaker verification systems
    Mingote, Victoria
    Miguel, Antonio
    Ortega, Alfonso
    Lleida, Eduardo
    DIGITAL SIGNAL PROCESSING, 2023, 133
  • [30] An efficient parallel self-attention transformer for CSI feedback
    Liu, Ziang
    Song, Tianyu
    Zhao, Ruohan
    Jin, Jiyu
    Jin, Guiyue
    PHYSICAL COMMUNICATION, 2024, 66