Short-Utterance-Based Children's Speaker Verification in Low-Resource Conditions

Cited by: 1
Authors
Aziz, Shahid [1]
Ankita [1]
Shahnawazuddin, S. [1]
Affiliations
[1] Natl Inst Technol Patna, Dept Elect & Commun Engn, Patna, Bihar, India
Keywords
Automatic speaker verification; Out-of-domain data augmentation; Prosody modification; Formant modification; Feature concatenation; Frequency-domain linear prediction; DOMAIN LINEAR PREDICTION; LIMITED DATA; SPEECH; RECOGNITION; IDENTIFICATION; FEATURES; SYSTEM
DOI
10.1007/s00034-023-02535-8
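Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809
Abstract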
Developing an automatic speaker verification (ASV) system for children is extremely challenging because sufficiently large, freely available speech corpora from child speakers do not exist. In contrast, hundreds of hours of speech data from adult speakers are freely available. Consequently, the majority of the speaker verification work reported in the literature deals predominantly with adults' speech, and only a few works on children's speech have been published. The challenges in developing a robust ASV system for child speakers are further exacerbated when short utterances are used, a condition that is largely unexplored for children's speech. In this paper, we therefore focus on children's speaker verification using short utterances. To deal with data scarcity, several out-of-domain data augmentation techniques are used. Since the out-of-domain data used in this study comes from adult speakers and is acoustically very different from children's speech, we apply prosody modification, formant modification, and voice conversion to render it acoustically similar to children's speech prior to augmentation. This not only increases the amount of training data but also helps capture the missing target attributes relevant to children's speech. The proposed data augmentation yields a relative improvement of 33.57% in equal error rate over the baseline system trained solely on the child dataset. We also propose frame-level concatenation of Mel-frequency cepstral coefficients (MFCC) with frequency-domain linear prediction coefficients in order to model the spectral and temporal envelopes simultaneously. Frame-level concatenation is expected to further enhance discrimination among speakers. This novel approach, when combined with data augmentation, further improves the performance of the speaker verification system. The experimental results support our claims: we achieve an overall relative reduction of 38.04% in equal error rate.
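The two quantitative ideas in the abstract, frame-level feature concatenation and relative reduction in equal error rate, can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation. The function names `concat_frame_level` and `relative_reduction` are hypothetical, and so are the toy feature values and the error rates in the usage lines. The sketch also assumes that both feature streams share the same frame rate and that a small length mismatch is handled by truncating to the shorter stream, which the record does not specify.

```python
from typing import List, Sequence

Frame = Sequence[float]


def concat_frame_level(mfcc: Sequence[Frame], fdlp: Sequence[Frame]) -> List[List[float]]:
    """Join two per-frame feature streams into one vector per frame.

    Assumption (not stated in the source): both streams use the same
    frame rate, and any length mismatch is resolved by truncating to
    the shorter stream.
    """
    num_frames = min(len(mfcc), len(fdlp))
    return [list(mfcc[t]) + list(fdlp[t]) for t in range(num_frames)]


def relative_reduction(baseline_eer: float, new_eer: float) -> float:
    """Relative reduction in equal error rate, in percent."""
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Toy inputs: hypothetical values, not taken from the paper.
mfcc = [[0.1, 0.2, 0.3] for _ in range(5)]  # 5 frames, 3 coefficients each
fdlp = [[1.0, 2.0] for _ in range(4)]       # 4 frames, 2 coefficients each

combined = concat_frame_level(mfcc, fdlp)
print(len(combined), len(combined[0]))      # 4 5

# Hypothetical error rates: a drop from 10.0% to 8.0% EER.
print(relative_reduction(10.0, 8.0))        # 20.0
```

Each combined frame holds the spectral-envelope coefficients followed by the temporal-envelope coefficients, so a downstream speaker model sees both descriptions of the same frame in a single vector.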
Pages: 1715-1740
Number of pages: 26