ReptoNet: A 3D Log Mel Spectrogram-Based Few-Shot Speaker Identification with Reptile Algorithm

Cited by: 0
Authors
Saritha, Banala [1 ]
Laskar, Mohammad Azharuddin [1 ]
Monsley, K. Anish [2 ]
Laskar, Rabul Hussain [1 ]
Choudhury, Madhuchhanda [1 ]
Affiliations
[1] Natl Inst Technol Silchar, Dept Elect & Commun Engn, Silchar 788010, Assam, India
[2] Indian Inst Technol, Dept Appl Mech, Chennai, India
Keywords
Speaker identification; Meta-learning; 3D log Mel spectrogram; Few-shot learning; Reptile algorithm; Recognition
DOI
10.1007/s13369-024-09426-3
CLC classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09
Abstract
A speech-based speaker identification system offers an alternative to conventional biometric identification systems that rely on physical contact. Recent advances in deep learning have yielded impressive results when large amounts of data are available, but such methods are ineffective in forensic and law enforcement applications, where data are scarce; speaker identification may therefore not be feasible with limited data. Recent developments in meta-learning have opened the door to numerous few-shot learning applications. Nevertheless, significant challenges remain, including limited data, noisy and multivariable input conditions, overfitting, and the need to generalize to unseen speakers. This paper introduces a meta-learning-based speaker identification system built on a new framework named ReptoNet, which uses three-dimensional (3D) log Mel spectrogram inputs to reduce overfitting and improve generalization. ReptoNet performs speaker identification with the Reptile algorithm and is implemented in the Keras framework. The system is evaluated against state-of-the-art techniques on three diverse speech databases: VCTK, VoxCeleb1, and IIT Guwahati multivariability (IITG-MV). It outperforms existing methods, improving accuracy on VCTK by 7% and 6% with log Mel and 3D log Mel inputs, respectively, and on VoxCeleb1 by 2%.
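The Reptile meta-update at the core of the approach is simple to state: adapt a copy of the meta-weights to one task with a few SGD steps, then move the meta-weights a fraction of the way toward the adapted weights. Below is a minimal sketch on a toy family of 1-D regression tasks; the model, task distribution, and all hyperparameters are illustrative assumptions, not the paper's ReptoNet architecture or data.

```python
import numpy as np

# Toy sketch of the Reptile meta-update (Nichol et al.'s algorithm, which
# ReptoNet builds on). The "network" is a single weight w in y = w * x;
# each task is a different slope. All settings here are assumptions.
rng = np.random.default_rng(0)

def sample_task():
    """Each task is a linear function y = a * x with its own slope a."""
    a = rng.uniform(-2.0, 2.0)
    x = rng.uniform(-1.0, 1.0, size=10)
    return x, a * x

def inner_sgd(theta, x, y, steps=5, lr=0.1):
    """A few gradient steps on one task, starting from the meta-weights."""
    w = theta
    for _ in range(steps):
        grad = 2.0 * np.mean((w * x - y) * x)  # d/dw of the MSE loss
        w = w - lr * grad
    return w

theta = 0.0        # meta-initialization (a single scalar weight here)
epsilon = 0.5      # Reptile outer step size
for _ in range(200):
    x, y = sample_task()
    phi = inner_sgd(theta, x, y)              # task-adapted weights
    theta = theta + epsilon * (phi - theta)   # Reptile: move toward phi

# theta drifts toward an initialization that adapts quickly to any task slope
print(float(theta))
```

In the paper's setting, `theta` would be the full parameter vector of the spectrogram classifier and each task a few-shot episode of speakers; the outer update is identical in form.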
Pages: 7495-7510
Page count: 16