Multilingual Speech Corpus in Low-Resource Eastern and Northeastern Indian Languages for Speaker and Language Identification

Cited by: 6
Authors
Basu, Joyanta [1]
Khan, Soma [1]
Roy, Rajib [1]
Basu, Tapan Kumar [2]
Majumder, Swanirbhar [3]
Affiliations
[1] CDAC, Sect 5, Kolkata, India
[2] Indian Inst Technol, Dept Elect Engn, Kharagpur, W Bengal, India
[3] Tripura Univ, Dept Informat Technol, Suryamaninagar, Tripura, India
Keywords
Low-resource language (LRL); Speaker identification (SID); Language identification (LID); Mel frequency cepstral coefficients (MFCCs); i-Vectors; Deep neural networks (DNN); Recognition
DOI
10.1007/s00034-021-01704-x
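Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809
Abstract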
Research and development of speech technology applications for low-resource languages (LRLs) is challenging because suitable speech corpora are not available. For most Indian languages in particular, the data found in digital sources are sparse in both amount and type, and prior work is too limited to meet large-scale development needs. This paper describes the creation of an LRL corpus comprising sixteen rarely studied Eastern and Northeastern (E&NE) Indian languages and presents statistics on the variability of the data. Several experiments are then carried out on the collected corpus to build baseline speaker identification (SID) and language identification (LID) systems for acceptance evaluation. To investigate the presence of speaker- and language-specific information, spectral features such as Mel frequency cepstral coefficients (MFCCs), shifted delta cepstral (SDC) features, and relative spectral transform-perceptual linear prediction (RASTA-PLP) features are used. Vector quantization (VQ), Gaussian mixture model (GMM), support vector machine (SVM), and multilayer perceptron (MLP)-based models are developed to represent the speaker- and language-specific information captured by these features. In addition, SID and LID models based on i-vectors, time delay neural networks (TDNNs), and recurrent neural networks with long short-term memory (LSTM-RNN) are evaluated in line with recent approaches. The performance of the developed systems is analyzed on the LRL corpus in terms of SID and LID accuracy. The best SID and LID accuracies, 94.49% and 95.69%, respectively, are obtained by the baseline systems using LSTM-RNN with MFCC + SDC features.
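As a rough illustration of the MFCC + SDC feature named in the abstract, the sketch below stacks shifted delta blocks onto per-frame cepstral vectors using the commonly cited definition, delta(t, i) = c(t + iP + d) - c(t + iP - d) for i = 0..k-1. The abstract does not give the paper's SDC parameters or edge handling, so the defaults d=1, P=3, k=7, the edge clamping, and the function names `shifted_delta_cepstra` and `append_sdc` are assumptions made for illustration, not the authors' implementation. The input is assumed to be a list of already computed cepstral frames.

```python
from typing import List

Frame = List[float]


def shifted_delta_cepstra(cepstra: List[Frame], d: int = 1, p: int = 3, k: int = 7) -> List[Frame]:
    """Stack k delta blocks per frame: c[t + i*p + d] - c[t + i*p - d], i = 0..k-1.

    d, p, k defaults are illustrative assumptions, not values from the paper.
    """
    if not cepstra:
        return []
    last = len(cepstra) - 1

    def frame_at(idx: int) -> Frame:
        # Assumption: indices past either end are clamped to the edge frame.
        return cepstra[min(max(idx, 0), last)]

    sdc: List[Frame] = []
    for t in range(len(cepstra)):
        stacked: Frame = []
        for i in range(k):
            ahead = frame_at(t + i * p + d)
            behind = frame_at(t + i * p - d)
            stacked.extend(a - b for a, b in zip(ahead, behind))
        sdc.append(stacked)
    return sdc


def append_sdc(cepstra: List[Frame], d: int = 1, p: int = 3, k: int = 7) -> List[Frame]:
    """Concatenate each static frame with its SDC vector (an 'MFCC + SDC' layout)."""
    sdc = shifted_delta_cepstra(cepstra, d, p, k)
    return [list(c) + s for c, s in zip(cepstra, sdc)]
```

With N static coefficients per frame, each output frame of `append_sdc` has N + N*k values; for example, 7 coefficients with k=7 would give 56 values per frame.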
Pages: 4986-5013
Number of pages: 28