Improved Convolutional Neural Network-Time-Delay Neural Network Structure with Repeated Feature Fusions for Speaker Verification

Cited by: 2
Authors
Gao, Miaomiao [1, 2, 3]
Zhang, Xiaojuan [1, 2]
Affiliations
[1] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing 100094, Peoples R China
[2] Chinese Acad Sci, Key Lab Electromagnet Radiat & Sensing Technol, Beijing 100190, Peoples R China
[3] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing 100049, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2024 / Volume 14 / Issue 08
Keywords
speaker verification; speaker embedding; repeated multi-scale fusions; dilated convolution; gridding effect; ATTENTION;
DOI
10.3390/app14083471
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Advances in deep learning have greatly accelerated progress in speaker verification (SV). Studies show that both convolutional neural networks (CNNs) and dilated time-delay neural networks (TDNNs) achieve advanced performance in text-independent SV, owing to their ability to extract local features and temporal contextual information, respectively. Combining the two has achieved even better results. However, we found a serious gridding effect when applying the 1D-Res2Net-based dilated TDNN proposed in ECAPA-TDNN to SV, which indicates discontinuity and local information loss in the frame-level features. To achieve high-resolution processing for speaker embedding, we improve the CNN-TDNN structure with the proposed repeated multi-scale feature fusions. The proposed structure effectively improves the channel utilization of the TDNN and achieves higher performance with the same number of TDNN channels. In addition, unlike previous studies, all of which converted CNN features to TDNN features directly, we also study the latent space transformation between the CNN and the TDNN to achieve efficient conversion. Our best method obtains 0.72 EER and 0.0672 MinDCF on the VoxCeleb-O test set, and the proposed method performs better in cross-domain SV without additional parameters or computational complexity.
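The gridding effect mentioned in the abstract can be illustrated with a small sketch. It is not the authors' code: the kernel size, the dilation rates, and the helper names `receptive_positions` and `coverage` are assumptions chosen for illustration, and the sketch ignores Res2Net channel splits, residual connections, and learned weights. It only lists which input frames can reach one output frame when 1D dilated convolutions are stacked.

```python
def receptive_positions(kernel_size, dilations):
    """Frame offsets that can influence one output frame after stacking
    1D convolutions with the given dilation rates (odd kernel size assumed)."""
    half = kernel_size // 2
    offsets = {0}
    for d in dilations:
        # Each layer reads taps at multiples of its dilation rate.
        taps = [k * d for k in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)


def coverage(kernel_size, dilations):
    """Fraction of frames inside the receptive span that are actually used."""
    pos = receptive_positions(kernel_size, dilations)
    span = pos[-1] - pos[0] + 1
    return len(pos) / span


# Illustrative settings only (not taken from the paper).
same_rate = receptive_positions(3, [2, 2, 2])   # every odd offset is skipped
mixed_rate = receptive_positions(3, [1, 2, 3])  # contiguous offsets
```

With a repeated dilation rate of 2, only even offsets within the span are reachable, so neighbouring frames never contribute to the same output; this is the kind of discontinuity and local information loss that the abstract attributes to the gridding effect. Mixing rates, as in the second call, fills the gaps in this toy setting.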
Pages: 11