TEMPORAL DYNAMIC CONVOLUTIONAL NEURAL NETWORK FOR TEXT-INDEPENDENT SPEAKER VERIFICATION AND PHONEMIC ANALYSIS

Cited by: 18
Authors
Kim, Seong-Hu [1]
Nam, Hyeonuk [1]
Park, Yong-Hwa [1]
Affiliations
[1] Korea Adv Inst Sci & Technol, Dept Mech Engn, Daejeon, South Korea
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
Speaker verification; text-independent; temporal dynamic convolutional neural network; phoneme-adaptive kernel
DOI
10.1109/ICASSP43922.2022.9747421
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
In text-independent speaker recognition, dynamic models that adapt along the time axis have been proposed to account for the phoneme-varying characteristics of speech. However, detailed analysis of how dynamic models behave depending on phonemes has been lacking. In this paper, we propose a temporal dynamic CNN (TDY-CNN) that accounts for the temporal variation of phonemes by applying kernels that adapt optimally to each time bin. Each adaptive kernel is formed as a weighted sum of trained basis kernels. We then analyze how the adaptive kernels act on different phonemes in various layers. TDY-ResNet-38(x0.5) with six basis kernels improved the equal error rate (EER), a measure of speaker verification performance, by 17.3% compared to the baseline model ResNet-38(x0.5). In addition, we show that the adaptive kernels depend on phoneme groups and are more phoneme-specific in early layers. The temporal dynamic model adapts itself to phonemes without explicitly given phoneme information during training, and the results show the need to consider phoneme variation within utterances for more accurate and robust text-independent speaker verification.
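The kernel adaptation described in the abstract (a per-time-bin kernel built as a weighted sum of trained basis kernels) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it uses a 1D convolution along time rather than the ResNet 2D layers, and the way the weights are computed here (a softmax over a linear projection of each frame, via the hypothetical `attn_proj` matrix) is an assumption made only for illustration. The function names are likewise hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(x, attn_proj):
    """Per-time-bin weights over the K basis kernels.

    x:         input features, shape [C_in][T]
    attn_proj: hypothetical linear projection, shape [K][C_in]
               (an assumption; the abstract does not specify this step)
    returns:   weights of shape [T][K]; each row sums to 1
    """
    c_in, t_len = len(x), len(x[0])
    weights = []
    for t in range(t_len):
        logits = [sum(row[c] * x[c][t] for c in range(c_in)) for row in attn_proj]
        weights.append(softmax(logits))
    return weights


def temporal_dynamic_conv1d(x, basis, attn_proj):
    """1D convolution whose kernel changes at every time bin.

    basis: K basis kernels, shape [K][C_out][C_in][W] with W odd.
    Zero padding keeps the output length equal to T.
    """
    c_in, t_len = len(x), len(x[0])
    k_num, c_out, width = len(basis), len(basis[0]), len(basis[0][0][0])
    pad = width // 2
    pi = attention_weights(x, attn_proj)
    y = [[0.0] * t_len for _ in range(c_out)]
    for t in range(t_len):
        for o in range(c_out):
            acc = 0.0
            for c in range(c_in):
                for w in range(width):
                    src = t + w - pad
                    if 0 <= src < t_len:
                        # Kernel tap for time bin t: weighted sum of basis kernels.
                        tap = sum(pi[t][k] * basis[k][o][c][w] for k in range(k_num))
                        acc += tap * x[c][src]
            y[o][t] = acc
    return y, pi
```

With a single basis kernel the weights are all 1 and the layer reduces to an ordinary static convolution, which is a convenient sanity check for the sketch.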
Pages: 6742-6746
Page count: 5