MULTI-VIEW SELF-ATTENTION BASED TRANSFORMER FOR SPEAKER RECOGNITION

Cited by: 26
Authors
Wang, Rui [1 ,4 ]
Ao, Junyi [2 ,3 ,4 ]
Zhou, Long [4 ]
Liu, Shujie [4 ]
Wei, Zhihua [1 ]
Ko, Tom [2 ]
Li, Qing [3 ]
Zhang, Yu [2 ]
Affiliations
[1] Tongji Univ, Dept Comp Sci & Technol, Shanghai, Peoples R China
[2] Southern Univ Sci & Technol, Dept Comp Sci & Engn, Shenzhen, Guangdong, Peoples R China
[3] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
[4] Microsoft Res Asia, Beijing, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
speaker recognition; Transformer; speaker identification; speaker verification;
DOI
10.1109/ICASSP43922.2022.9746639
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject Classification Codes
070206; 082403
Abstract
Initially developed for natural language processing (NLP), the Transformer model is now widely used for speech processing tasks such as speaker recognition, owing to its powerful sequence modeling capability. However, conventional self-attention mechanisms were originally designed for textual sequences and do not account for the characteristics of speech and speaker modeling. Moreover, different Transformer variants for speaker recognition have not been well studied. In this work, we propose a novel multi-view self-attention mechanism and present an empirical study of different Transformer variants, with and without the proposed attention mechanism, for speaker recognition. Specifically, to balance the ability to capture global dependencies with the ability to model locality, we propose a multi-view self-attention mechanism for the speaker Transformer, in which different attention heads attend to different ranges of the receptive field. Furthermore, we introduce and compare five Transformer variants with different network architectures, embedding locations, and pooling methods for learning speaker embeddings. Experimental results on the VoxCeleb1 and VoxCeleb2 datasets show that the proposed multi-view self-attention mechanism improves speaker recognition performance, and that the proposed speaker Transformer network achieves excellent results compared with state-of-the-art models.
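One common way to realize the idea of heads attending to "different ranges of the receptive field" is to give each head its own attention window. The sketch below is a minimal illustration of that idea, not the authors' implementation: the function names, the mask construction, and the example window sizes are assumptions made for clarity.

```python
# Hypothetical sketch of multi-view self-attention: each head uses its own
# attention window, from purely local to fully global. Not the paper's code.
import torch
import torch.nn.functional as F


def multi_view_attention_masks(seq_len, window_sizes):
    """Build one boolean mask per head; True marks positions a query may attend to.

    window_sizes: one entry per head. None means an unrestricted (global) view;
    an integer w restricts each query frame to positions within +/- w frames.
    """
    idx = torch.arange(seq_len)
    dist = (idx[None, :] - idx[:, None]).abs()          # (T, T) frame distances
    masks = []
    for w in window_sizes:
        if w is None:
            masks.append(torch.ones(seq_len, seq_len, dtype=torch.bool))
        else:
            masks.append(dist <= w)
    return torch.stack(masks)                            # (heads, T, T)


def multi_view_self_attention(q, k, v, window_sizes):
    """q, k, v: (batch, heads, T, d_head); returns (batch, heads, T, d_head)."""
    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5     # (B, H, T, T)
    masks = multi_view_attention_masks(q.size(-2), window_sizes).to(q.device)
    scores = scores.masked_fill(~masks[None], float("-inf"))
    return F.softmax(scores, dim=-1) @ v


# Toy example: two local heads (windows of 5 and 20 frames) and two global heads.
B, H, T, d = 2, 4, 100, 64
q, k, v = (torch.randn(B, H, T, d) for _ in range(3))
out = multi_view_self_attention(q, k, v, window_sizes=[5, 20, None, None])
print(out.shape)  # torch.Size([2, 4, 100, 64])
```

In this toy configuration, the local heads capture short-range detail while the global heads attend over the whole utterance; how the paper actually assigns receptive-field ranges to heads may differ.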
Pages: 6732-6736
Number of pages: 5