ReptoNet: A 3D Log Mel Spectrogram-Based Few-Shot Speaker Identification with Reptile Algorithm

Times Cited: 0
Authors
Saritha, Banala [1]
Laskar, Mohammad Azharuddin [1]
Monsley, K. Anish [2]
Laskar, Rabul Hussain [1]
Choudhury, Madhuchhanda [1]
Affiliations
[1] National Institute of Technology Silchar, Department of Electronics and Communication Engineering, Silchar 788010, Assam, India
[2] Indian Institute of Technology, Department of Applied Mechanics, Chennai, India
Keywords
Speaker identification; Meta-learning; 3D log Mel spectrogram; Few-shot learning; Reptile algorithm
DOI
10.1007/s13369-024-09426-3
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline Classification Codes
07; 0710; 09
Abstract
A speech-based speaker identification system offers an alternative to conventional biometric identification systems that rely on physical contact. Recent advances in deep learning have yielded impressive results when large amounts of data are available, but these methods break down in forensic and law-enforcement applications, where data are scarce, so speaker identification may not be feasible under such limited-data conditions. Recent developments in meta-learning have opened the door to numerous few-shot learning applications. Nevertheless, significant challenges remain, including limited data, noisy and multi-variable input conditions, overfitting, and the need to generalize to unseen speakers. This paper introduces a meta-learning-based speaker identification system built on a new framework named ReptoNet, which uses three-dimensional (3D) log Mel spectrogram inputs to reduce overfitting and improve generalization. ReptoNet performs speaker identification with the Reptile meta-learning algorithm and is implemented in the Keras framework. The system is evaluated against state-of-the-art techniques on three diverse speech databases: VCTK, VoxCeleb1, and the IIT Guwahati multi-variability (IITG-MV) corpus. It outperforms existing methods, improving accuracy on VCTK by 7% and 6% with log Mel and 3D log Mel inputs, respectively, and on VoxCeleb1 by 2%.
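The abstract names two concrete ingredients: a three-channel ("3D") log Mel spectrogram input and the Reptile meta-update. The sketch below is a minimal, hypothetical illustration of both in Python with Keras and librosa. It assumes a common delta/delta-delta reading of "3D log Mel" (the paper's exact feature construction may differ), and every function name (log_mel_3d, build_embedding_net, reptile_meta_step) and hyper-parameter value is illustrative rather than taken from the paper.

import numpy as np
import librosa
from tensorflow import keras

def log_mel_3d(wav, sr=16000, n_mels=64):
    """Stack the log Mel spectrogram with its delta and delta-delta
    features into a (time, n_mels, 3) tensor; one common reading of a
    '3D' (three-channel) log Mel input. Illustrative, not the paper's
    confirmed construction."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    d1 = librosa.feature.delta(log_mel, order=1)   # first temporal derivative
    d2 = librosa.feature.delta(log_mel, order=2)   # second temporal derivative
    return np.stack([log_mel, d1, d2], axis=-1).transpose(1, 0, 2)

def build_embedding_net(input_shape, n_way):
    """Small CNN classifier over the three-channel spectrogram input,
    a stand-in for whatever backbone ReptoNet actually uses."""
    return keras.Sequential([
        keras.layers.Input(shape=input_shape),
        keras.layers.Conv2D(32, 3, activation="relu"),
        keras.layers.MaxPooling2D(),
        keras.layers.Conv2D(64, 3, activation="relu"),
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(n_way, activation="softmax"),
    ])

def reptile_meta_step(model, episode_x, episode_y,
                      inner_steps=5, inner_lr=1e-3, meta_lr=0.1):
    """One Reptile outer step: save the meta-weights, adapt the model on
    a single N-way episode, then move the meta-weights a fraction
    meta_lr toward the adapted weights."""
    theta = [w.copy() for w in model.get_weights()]
    model.compile(optimizer=keras.optimizers.SGD(inner_lr),
                  loss="sparse_categorical_crossentropy")
    model.fit(episode_x, episode_y, epochs=inner_steps, verbose=0)
    theta_task = model.get_weights()
    model.set_weights([t + meta_lr * (tt - t)
                       for t, tt in zip(theta, theta_task)])

Here meta_lr plays the role of the Reptile step size epsilon in the update theta <- theta + epsilon * (theta_task - theta); a full training loop would repeat reptile_meta_step over many sampled N-way, K-shot speaker episodes.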
Pages: 7495-7510
Page count: 16