A robust DNN model for text-independent speaker identification using non-speaker embeddings in diverse data conditions

Cited by: 5
Authors
Shome, Nirupam [1]
Saritha, Banala [2]
Kashyap, Richik [1]
Laskar, Rabul Hussain [2]
Affiliations
[1] Assam Univ, Dept Elect & Commun Engn, Silchar, Assam, India
[2] Natl Inst Technol, Dept Elect & Commun Engn, Silchar, Assam, India
Keywords
Deep learning; Speaker recognition; Speaker identification; Non-speaker embeddings; RAW WAVE-FORM; RECOGNITION; VERIFICATION; EXTRACTION; ATTENTION; NOISE;
DOI
10.1007/s00521-023-08736-1
Chinese Library Classification (CLC) code
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Deep learning has provided many advantages, including the ability to extract features from voice samples and represent the data in a more discriminative form. In speaker identification, many studies have sought to extract increasingly meaningful speaker embeddings to improve system performance. These models perform quite satisfactorily under clean and unambiguous data conditions, but in the presence of noise and non-speech segments their performance drops drastically. For real-world classification problems, class noise is more significant than attribute noise. Identical instances in a sample carrying different class labels, and different classes containing similar instances, are referred to as class noise. The non-speech regions in different speakers' samples carry the same properties and hinder the classification process. We have observed that this non-speaker information (present in noise and non-speech segments) plays a crucial role in the operation of a system. This paper emphasizes non-speaker embeddings, which are essential for developing an effective model, yet sufficient research has not been done in this direction. In this study, we concentrate on this problem and analyze the effect of non-speaker embeddings on the speaker identification process. We introduce non-speaker classes to address the issue by learning the non-speaker parameters. Two state-of-the-art methods, a convolutional neural network (CNN) and SincNet, are analyzed with non-speaker embeddings to assess their performance. The models are trained on non-speaker classes (such as silence and noise) along with the individual speaker classes, resulting in improved system performance with respect to the baseline CNN and SincNet models. We also compare the performance of our approach with hand-crafted-feature (MFCC and FBANK)-based speaker identification, and here too we observe a substantial improvement in classification. Learning the non-speaker embeddings improves system performance because the non-speaker segments that would otherwise be misclassified are handled explicitly. For this analysis, speaker data are taken from the LibriSpeech dataset, and non-speaker data are taken from the NOISEX-92 (synthetic noise), TESDHE (natural noise), and Silent CD (silence) databases.
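To make the label-space idea in the abstract concrete, the following is a minimal PyTorch sketch (not the authors' implementation): a small CNN speaker classifier whose output layer includes extra non-speaker classes (silence and noise) alongside the individual speaker classes, so that non-speech segments can be assigned to their own labels instead of being forced onto a speaker. The class counts, feature dimensions, layer sizes, and toy data are illustrative assumptions, not values reported in the paper.

# Minimal sketch, assuming FBANK-style frame features and a handful of speakers.
import torch
import torch.nn as nn

NUM_SPEAKERS = 40          # assumed number of speaker classes (e.g. a LibriSpeech subset)
NUM_NON_SPEAKER = 2        # assumed non-speaker classes: silence, noise
NUM_CLASSES = NUM_SPEAKERS + NUM_NON_SPEAKER

class SpeakerCNN(nn.Module):
    """Small 1-D CNN over frame-level features (e.g. 40-dim FBANK)."""
    def __init__(self, feat_dim: int = 40, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),               # pool frames into an utterance-level embedding
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):                          # x: (batch, feat_dim, frames)
        emb = self.encoder(x).squeeze(-1)          # (batch, 128)
        return self.classifier(emb)                # logits over speaker + non-speaker classes

model = SpeakerCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: 8 utterances of 200 frames each; a label is either a speaker index
# (0..NUM_SPEAKERS-1) or one of the appended non-speaker classes (silence, noise).
features = torch.randn(8, 40, 200)
labels = torch.tensor([0, 3, 7, NUM_SPEAKERS, 12, NUM_SPEAKERS + 1, 5, 9])

optimizer.zero_grad()
logits = model(features)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()

The same output-layer augmentation could in principle be applied to a SincNet front end; at test time, segments classified into the non-speaker classes are simply excluded from the speaker decision.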
Pages: 18933-18947 (15 pages)