ReptoNet: A 3D Log Mel Spectrogram-Based Few-Shot Speaker Identification with Reptile Algorithm

Cited by: 0
Authors
Saritha, Banala [1 ]
Laskar, Mohammad Azharuddin [1 ]
Monsley, K. Anish [2 ]
Laskar, Rabul Hussain [1 ]
Choudhury, Madhuchhanda [1 ]
Affiliations
[1] Natl Inst Technol Silchar, Dept Elect & Commun Engn, Silchar 788010, Assam, India
[2] Indian Inst Technol, Dept Appl Mech, Chennai, India
Keywords
Speaker identification; Meta-learning; 3D log Mel spectrogram; Few-shot learning; Reptile algorithm; Recognition
DOI
10.1007/s13369-024-09426-3
CLC classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09
Abstract
A speech-based speaker identification system offers an alternative to conventional biometric identification systems that rely on physical contact. Recent advances in deep learning have yielded impressive results when large amounts of data are available, but such methods are ineffective in forensic and law enforcement applications, where data are scarce; speaker identification may therefore not be feasible with limited data. Recent developments in meta-learning have opened the door to numerous few-shot learning applications. Nevertheless, significant challenges remain, including limited data, noisy and multivariable input conditions, overfitting, and the need to generalize to unseen speakers. This paper introduces a meta-learning-based speaker identification system built on a new framework named ReptoNet, which uses three-dimensional (3D) log Mel spectrogram inputs to reduce overfitting and improve generalization. ReptoNet performs speaker identification with the Reptile algorithm and is implemented in the Keras framework. The system is evaluated against state-of-the-art techniques on three diverse speech databases: VCTK, VoxCeleb1, and IIT Guwahati multivariability (IITG-MV). It outperforms existing methods, improving accuracy on VCTK by 7% and 6% with log Mel and 3D log Mel inputs, respectively, and on VoxCeleb1 by 2%.
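The Reptile meta-update at the core of the approach is simple to state: adapt a copy of the meta-weights to one task with a few SGD steps, then move the meta-weights a fraction of the way toward the adapted weights. Below is a minimal sketch on a toy family of 1-D regression tasks; the model, task distribution, and all hyperparameters are illustrative assumptions, not the paper's ReptoNet architecture or data.

```python
import numpy as np

# Toy sketch of the Reptile meta-update (Nichol et al.'s algorithm, which
# ReptoNet builds on). The "network" is a single weight w in y = w * x;
# each task is a different slope. All settings here are assumptions.
rng = np.random.default_rng(0)

def sample_task():
    """Each task is a linear function y = a * x with its own slope a."""
    a = rng.uniform(-2.0, 2.0)
    x = rng.uniform(-1.0, 1.0, size=10)
    return x, a * x

def inner_sgd(theta, x, y, steps=5, lr=0.1):
    """A few gradient steps on one task, starting from the meta-weights."""
    w = theta
    for _ in range(steps):
        grad = 2.0 * np.mean((w * x - y) * x)  # d/dw of the MSE loss
        w = w - lr * grad
    return w

theta = 0.0        # meta-initialization (a single scalar weight here)
epsilon = 0.5      # Reptile outer step size
for _ in range(200):
    x, y = sample_task()
    phi = inner_sgd(theta, x, y)              # task-adapted weights
    theta = theta + epsilon * (phi - theta)   # Reptile: move toward phi

# theta drifts toward an initialization that adapts quickly to any task slope
print(float(theta))
```

In the paper's setting, `theta` would be the full parameter vector of the spectrogram classifier and each task a few-shot episode of speakers; the outer update is identical in form.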
Pages: 7495-7510
Page count: 16