A robust DNN model for text-independent speaker identification using non-speaker embeddings in diverse data conditions

Cited by: 0
Authors
Nirupam Shome
Banala Saritha
Richik Kashyap
Rabul Hussain Laskar
Affiliations
[1] Assam University, Department of Electronics and Communication Engineering
[2] National Institute of Technology, Department of Electronics and Communication Engineering
Source
Neural Computing and Applications | 2023 / Vol. 35
Keywords
Deep learning; Speaker recognition; Speaker identification; Non-speaker embeddings
DOI
Not available
Abstract
Deep learning offers many advantages, including the ability to extract features directly from voice samples and to represent the data in a more discriminative form. In speaker identification, many studies have aimed to extract increasingly meaningful speaker embeddings to improve system performance. These models perform satisfactorily under clean, unambiguous data conditions, but their performance drops drastically in the presence of noise and non-speech segments. In real-world scenarios, class noise is more significant than attribute noise for classification problems: identical instances assigned different class labels, or different classes containing highly similar instances, constitute class noise. The non-speech regions in different speakers' samples carry the same properties and therefore hinder classification. We observe that this non-speaker information (present in noise and non-speech segments) plays a crucial role in system behaviour, yet sufficient research has not been done in this direction. This paper emphasizes non-speaker embeddings, which are essential for developing an effective model. In this study, we concentrate on this problem and analyze the effect of non-speaker embeddings on the speaker identification process. We introduce non-speaker classes so that the non-speaker parameters are learned explicitly. Two state-of-the-art methods, a convolutional neural network (CNN) and SincNet, are analyzed with non-speaker embeddings. The models are trained on non-speaker classes (such as silence and noise) alongside the individual speaker classes, which improves system performance with respect to the baseline CNN and SincNet models. We also compare our approach with hand-crafted feature (MFCC and FBANK)-based speaker identification and again observe a substantial improvement in classification. Learning the non-speaker embeddings improves system performance because misclassified non-speaker segments are handled explicitly by our approach. For this analysis, speaker data are taken from the LibriSpeech dataset, and non-speaker data are taken from the NOISEX-92 (synthetic noise), TESDHE (natural noise), and Silent CD (silence) databases.
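The core idea described in the abstract is to widen the classifier's output layer so that silence and noise segments receive their own target labels instead of being forced onto speaker classes. The sketch below is a minimal PyTorch illustration of that idea, not the authors' implementation: the layer sizes, chunk length, and class counts are assumptions, and the paper's SincNet variant would replace the first convolution with a parameterized sinc filterbank.

```python
# Illustrative sketch (assumed architecture, not the authors' code):
# a 1-D CNN speaker classifier whose output layer covers both speaker classes
# and extra non-speaker classes (e.g. "silence" and "noise"), so that
# non-speech segments are modelled explicitly instead of polluting speaker labels.
import torch
import torch.nn as nn

class SpeakerCNN(nn.Module):
    def __init__(self, n_speakers: int, n_nonspeaker: int = 2):
        super().__init__()
        # Total classes = speaker classes + non-speaker classes (silence, noise).
        self.n_classes = n_speakers + n_nonspeaker
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=251, stride=10), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, self.n_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, samples) raw audio chunk
        h = self.features(waveform).squeeze(-1)
        return self.classifier(h)

# Toy usage: 40 speakers plus 2 non-speaker classes, 200 ms chunks at 16 kHz.
model = SpeakerCNN(n_speakers=40, n_nonspeaker=2)
x = torch.randn(8, 1, 3200)
logits = model(x)                    # (8, 42)
labels = torch.randint(0, 42, (8,))  # silence/noise chunks use the last two labels
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()
```

With this setup, training simply mixes silence/noise chunks into the batches with their own labels; no change to the loss or the feature extractor is required.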
Pages: 18933-18947
Page count: 14