END-TO-END LANGUAGE RECOGNITION USING ATTENTION BASED HIERARCHICAL GATED RECURRENT UNIT MODELS

Times Cited: 0
Authors
Padi, Bharat [1 ]
Mohan, Anand [2 ]
Ganapathy, Sriram [2 ]
Affiliations
[1] Minds Ai, Bengaluru, India
[2] Indian Inst Sci, Elect Engn, LEAP Lab, Bengaluru, India
Source
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019
Keywords
End-to-end language identification; hierarchical GRU; attention; SPEECH; MULTILINGUALITY; IDENTIFICATION; SPEAKER
DOI
10.1109/icassp.2019.8683895
Chinese Library Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
The task of automatic language identification (LID) involving multiple dialects of the same language family on short speech recordings is a challenging problem. It is further complicated for short-duration audio snippets in the presence of noise sources. In these scenarios, the identity of the language/dialect may be reliably present only in parts of the speech embedded in the temporal sequence. Conventional approaches to LID (and to speaker recognition) ignore this sequence information by extracting a long-term statistical summary of the recording, assuming independence of the feature frames. In this paper, we propose an end-to-end neural network framework that exploits short-sequence information for language recognition. A hierarchical gated recurrent unit (HGRU) model with an attention module is proposed for incorporating relevance into language recognition, where parts of the speech data are weighted more heavily based on their relevance to the language recognition task. Experiments are performed on the language recognition task of the NIST LRE 2017 Challenge using clean, noisy and multi-speaker speech data. In these experiments, the proposed approach yields significant improvements over conventional i-vector based language recognition approaches as well as a previously proposed approach to language recognition using recurrent networks.
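For intuition, the following is a minimal sketch (in PyTorch) of an attention-based hierarchical GRU classifier of the kind the abstract describes: a lower GRU encodes short sub-sequences of frames, an upper GRU models the sequence of sub-sequence embeddings, and soft attention pooling weights the parts of the signal by their learned relevance before classification. The layer sizes, the 20-frame chunking, the 40-dimensional features, the 14 target languages, and the use of attention pooling at both levels are illustrative assumptions, not the authors' exact configuration.

# Minimal sketch of an attention-based hierarchical GRU language classifier
# (illustrative hyperparameters; not the paper's exact architecture).
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Soft attention over time: frames are weighted by a learned relevance score."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (batch, time, dim)
        w = torch.softmax(self.score(x), dim=1)  # relevance weights over time
        return (w * x).sum(dim=1)                # (batch, dim)

class HierarchicalGRU(nn.Module):
    """Lower GRU encodes short sub-sequences; upper GRU models their order."""
    def __init__(self, feat_dim=40, hidden=256, n_langs=14, chunk=20):
        super().__init__()
        self.chunk = chunk
        self.lower = nn.GRU(feat_dim, hidden, batch_first=True)
        self.lower_pool = AttentivePooling(hidden)
        self.upper = nn.GRU(hidden, hidden, batch_first=True)
        self.upper_pool = AttentivePooling(hidden)
        self.classifier = nn.Linear(hidden, n_langs)

    def forward(self, feats):                    # feats: (batch, frames, feat_dim)
        b, t, d = feats.shape
        t = (t // self.chunk) * self.chunk       # drop an incomplete trailing chunk
        x = feats[:, :t].reshape(b * (t // self.chunk), self.chunk, d)
        low, _ = self.lower(x)                   # encode each short sub-sequence
        emb = self.lower_pool(low).reshape(b, t // self.chunk, -1)
        up, _ = self.upper(emb)                  # model the sequence of sub-sequence embeddings
        utt = self.upper_pool(up)                # relevance-weighted utterance embedding
        return self.classifier(utt)              # language/dialect logits

model = HierarchicalGRU()
dummy = torch.randn(4, 300, 40)                  # 4 utterances, 300 frames of 40-dim features
print(model(dummy).shape)                        # torch.Size([4, 14])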
Pages: 5966-5970
Page count: 5