MULTI-VIEW SELF-ATTENTION BASED TRANSFORMER FOR SPEAKER RECOGNITION

Cited by: 26
Authors
Wang, Rui [1 ,4 ]
Ao, Junyi [2 ,3 ,4 ]
Zhou, Long [4 ]
Liu, Shujie [4 ]
Wei, Zhihua [1 ]
Ko, Tom [2 ]
Li, Qing [3 ]
Zhang, Yu [2 ]
Affiliations
[1] Tongji Univ, Dept Comp Sci & Technol, Shanghai, Peoples R China
[2] Southern Univ Sci & Technol, Dept Comp Sci & Engn, Shenzhen, Guangdong, Peoples R China
[3] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
[4] Microsoft Res Asia, Beijing, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
speaker recognition; Transformer; speaker identification; speaker verification;
DOI
10.1109/ICASSP43922.2022.9746639
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Initially developed for natural language processing (NLP), the Transformer model is now widely used for speech processing tasks such as speaker recognition, owing to its powerful sequence modeling capabilities. However, the conventional self-attention mechanism was originally designed for modeling textual sequences and does not account for the characteristics of speech and speaker modeling. In addition, different Transformer variants for speaker recognition have not been well studied. In this work, we propose a novel multi-view self-attention mechanism and present an empirical study of different Transformer variants, with and without the proposed attention mechanism, for speaker recognition. Specifically, to balance the ability to capture global dependencies against the need to model locality, we propose a multi-view self-attention mechanism for the speaker Transformer, in which different attention heads attend to different ranges of the receptive field. Furthermore, we introduce and compare five Transformer variants with different network architectures, embedding locations, and pooling methods for learning speaker embeddings. Experimental results on the VoxCeleb1 and VoxCeleb2 datasets show that the proposed multi-view self-attention mechanism improves speaker recognition performance, and that the proposed speaker Transformer network achieves excellent results compared with state-of-the-art models.
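To make the core idea concrete, the following is a minimal sketch (not the authors' released code) of a multi-view self-attention layer: each head's attention scores are restricted by a band mask of a different width, so some heads stay global while others attend only to nearby frames. The window sizes, model dimension, and head count below are illustrative assumptions, not values reported in the paper.

# Minimal sketch of multi-view self-attention, assuming per-head band masks.
# Hyperparameters (d_model, n_heads, windows) are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewSelfAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4, windows=(None, 64, 16, 4)):
        """windows[h] is the half-width of head h's view; None means a global head."""
        super().__init__()
        assert d_model % n_heads == 0 and len(windows) == n_heads
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.windows = windows
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, time, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):                            # -> (batch, heads, time, d_head)
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # (b, h, t, t)

        # One band mask per head: frame i may only attend to frames j with
        # |i - j| <= window; global heads (window=None) are left unmasked.
        idx = torch.arange(t, device=x.device)
        dist = (idx[None, :] - idx[:, None]).abs()              # (t, t)
        for h, w in enumerate(self.windows):
            if w is not None:
                scores[:, h] = scores[:, h].masked_fill(dist > w, float("-inf"))

        attn = F.softmax(scores, dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, t, -1)       # (b, t, d_model)
        return self.out(ctx)

if __name__ == "__main__":
    frames = torch.randn(2, 200, 256)            # e.g. 200 acoustic frames
    print(MultiViewSelfAttention()(frames).shape)  # torch.Size([2, 200, 256])

In this sketch the mix of global and narrow-window heads is what lets a single layer capture both long-range speaker characteristics and local spectral detail; how the paper's five Transformer variants arrange such layers, embeddings, and pooling is described in the abstract above.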
Pages: 6732-6736
Page count: 5