An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification

Cited by: 3

Authors:

Chen, Yafeng [1]

Zheng, Siqi [1]

Wang, Hui [1]

Cheng, Luyao [1]

Chen, Qian [1]

Qi, Jiajun [2]

Affiliations:

[1] Alibaba Group, Speech Lab, Hangzhou, People's Republic of China

[2] University of Science and Technology of China, Hefei, People's Republic of China

Source:

INTERSPEECH 2023 | 2023

Keywords:

speaker verification; local feature fusion; global feature fusion; attentional feature fusion; aggregation

DOI:

10.21437/Interspeech.2023-1294

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

Effective fusion of multi-scale features is crucial for improving speaker verification performance. However, most existing methods aggregate multi-scale features in a layer-wise manner via simple operations such as summation or concatenation. This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion to improve performance. Local feature fusion (LFF) fuses the features within a single residual block to extract the local signal. Global feature fusion (GFF) takes acoustic features of different scales as input to aggregate the global signal. To facilitate effective feature fusion in both LFF and GFF, an attentional feature fusion module is employed in the ERes2Net architecture, replacing the summation or concatenation operations. A range of experiments conducted on the VoxCeleb datasets demonstrates the superiority of ERes2Net in speaker verification.
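The contrast the abstract draws between plain summation and an attentional fusion can be illustrated with a small sketch. This is not the ERes2Net module itself, whose exact form is not given in this record. It assumes a generic design in which a per-channel sigmoid gate, computed from the concatenated inputs, weights the two feature vectors. The function name `attentional_fusion` and the parameters `weights` and `bias` are hypothetical and stand in for learned parameters.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def attentional_fusion(x, y, weights, bias):
    """Fuse two equal-length feature vectors with a per-channel gate.

    Assumption (not from the paper): the gate for channel i is
    sigmoid(weights[i] . concat(x, y) + bias[i]), and the output is
    gate * x[i] + (1 - gate) * y[i].
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concat = list(x) + list(y)
    fused = []
    for i, (xi, yi) in enumerate(zip(x, y)):
        # Score the channel from both inputs, then squash it to a gate.
        score = sum(w * c for w, c in zip(weights[i], concat)) + bias[i]
        gate = sigmoid(score)
        # A gate near 1 keeps x, a gate near 0 keeps y.
        fused.append(gate * xi + (1.0 - gate) * yi)
    return fused


# Hypothetical inputs: two 2-channel feature vectors.
x = [1.0, 2.0]
y = [3.0, 4.0]
zero_weights = [[0.0] * 4 for _ in range(2)]

# With all-zero parameters every gate is 0.5, so the result is the mean.
print(attentional_fusion(x, y, zero_weights, [0.0, 0.0]))
```

With all-zero parameters the gate is 0.5 for every channel, so the fusion reduces to a scaled summation. Non-zero parameters let the weighting depend on the inputs, which is the property that a fixed summation or concatenation lacks.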


Pages: 2228-2232

Page count: 5
