Deep Learning-Based End-to-End Speaker Identification Using Time-Frequency Representation of Speech Signal

Cited: 5
Authors
Saritha, Banala [1 ]
Laskar, Mohammad Azharuddin [1 ]
Kirupakaran, Anish Monsley [1 ]
Laskar, Rabul Hussain [1 ]
Choudhury, Madhuchhanda [1 ]
Shome, Nirupam [2 ]
Affiliations
[1] Natl Inst Technol Silchar, Dept Elect & Commun Engn, Silchar, Assam, India
[2] Assam Univ, Dept Elect & Commun Engn, Silchar, Assam, India
Keywords
Spectrogram; Log Mel spectrogram; Cochleagram; Deep convolutional neural network; Speaker identification; End-to-end system
DOI
10.1007/s00034-023-02542-9
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline classification codes
0808; 0809;
Abstract
A speech-based speaker identification system is an alternative to conventional contact-based biometric identification systems. Recent work demonstrates growing research interest in this field and highlights the practical usability of speech for speaker identification across various applications. In this work, we address limitations of existing state-of-the-art approaches and demonstrate the suitability of convolutional neural networks for speaker identification. We examine the use of the spectrogram as input to these spatial networks and its robustness in the presence of noise. For faster training (computation) and reduced memory requirements (storage), a SpectroNet model for speech-based speaker identification is introduced. The proposed system is evaluated on the VoxCeleb1 database and Part 1 of the RSR2015 database. Experimental results show a relative improvement of ~16% (accuracy 96.21%) with the spectrogram and ~10% (accuracy 98.92%) with the log Mel spectrogram in identifying the speaker, compared with existing models. Using the cochleagram yields an identification accuracy of 99.26%. Analysis of the results shows the applicability of the proposed approach in situations where (i) minimal speech data are available for speaker identification, and (ii) the speech data are noisy.
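The abstract describes feeding time-frequency representations such as the log spectrogram to a convolutional network. As a minimal sketch of the feature-extraction step only (the paper's actual STFT settings and the SpectroNet architecture are not given here; the frame length, hop size, and window below are illustrative assumptions), a log-magnitude spectrogram can be computed with NumPy alone:

```python
import numpy as np

def log_spectrogram(signal, n_fft=512, hop=160):
    """Log-magnitude STFT spectrogram.

    n_fft and hop are illustrative choices (32 ms frames with a
    10 ms shift at 16 kHz), not values taken from the paper.
    Returns an array of shape (n_fft // 2 + 1, n_frames), which
    can serve as a single-channel image input to a CNN.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))   # (n_frames, n_fft//2 + 1)
    return np.log(mag.T + 1e-10)                # small floor avoids log(0)

# One second of a synthetic 440 Hz tone standing in for a speech utterance.
sr = 16000
sig = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
feat = log_spectrogram(sig)
print(feat.shape)  # (257, 97)
```

A log Mel spectrogram or cochleagram would add a perceptually motivated filterbank (Mel or gammatone) on top of the magnitude spectrum before the log; the framing and FFT step shown here is common to all three representations.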
Pages: 1839-1861
Page count: 23