An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification

Times cited: 3
Authors
Chen, Yafeng [1]
Zheng, Siqi [1]
Wang, Hui [1]
Cheng, Luyao [1]
Chen, Qian [1]
Qi, Jiajun [2]
Affiliations
[1] Alibaba Grp, Speech Lab, Hangzhou, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
Source
INTERSPEECH 2023 | 2023
Keywords
speaker verification; local feature fusion; global feature fusion; attentional feature fusion; AGGREGATION;
DOI
10.21437/Interspeech.2023-1294
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Effective fusion of multi-scale features is crucial for improving speaker verification performance. However, most existing methods aggregate multi-scale features in a layer-wise manner via simple operations such as summation or concatenation. This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion to improve performance. The local feature fusion (LFF) fuses the features within a single residual block to extract the local signal. The global feature fusion (GFF) takes acoustic features of different scales as input to aggregate the global signal. To facilitate effective feature fusion in both LFF and GFF, an attentional feature fusion module is employed in the ERes2Net architecture, replacing the summation or concatenation operations. A range of experiments conducted on the VoxCeleb datasets demonstrates the superiority of ERes2Net in speaker verification.
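The contrast the abstract draws between plain summation and attention-based fusion can be illustrated with a minimal sketch. The abstract does not specify the internals of the attentional feature fusion module, so everything below is an assumption made for illustration: the function names, the per-element sigmoid gate, and the `weights` and `bias` parameters are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sum_fusion(x, y):
    """Baseline: fuse two feature vectors by element-wise summation."""
    return [a + b for a, b in zip(x, y)]


def attentional_fusion(x, y, weights, bias):
    """Fuse two feature vectors with a gate computed from both inputs.

    Hypothetical simplification: each element gets its own gate
    g = sigmoid(wx * x_i + wy * y_i + b), and the output is
    g * x_i + (1 - g) * y_i. In a trained model the weights and
    biases would be learned; here they are passed in by the caller.
    """
    if not (len(x) == len(y) == len(weights) == len(bias)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for xi, yi, (wx, wy), b in zip(x, y, weights, bias):
        gate = sigmoid(wx * xi + wy * yi + b)
        fused.append(gate * xi + (1.0 - gate) * yi)
    return fused


if __name__ == "__main__":
    x = [1.0, 2.0]
    y = [3.0, 4.0]
    zero_w = [(0.0, 0.0), (0.0, 0.0)]
    print(sum_fusion(x, y))
    print(attentional_fusion(x, y, zero_w, [0.0, 0.0]))
```

With all weights and biases at zero the gate is 0.5 everywhere, so the result is the plain average of the two inputs; non-zero parameters let the gate favor one input over the other on a per-element basis, which is the flexibility that a fixed summation lacks.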
Pages: 2228-2232
Page count: 5