Crosslingual and Multilingual Speech Recognition Based on the Speech Manifold

Cited by: 12
Authors
Sahraeian, Reza [1]
Van Compernolle, Dirk [1]
Affiliations
[1] Katholieke Univ Leuven, Dept Elect Engn, Ctr Proc Speech & Image, B-3000 Leuven, Belgium
Keywords
Crosslingual and multilingual speech recognition; acoustic-to-articulatory mapping; manifold learning; deep neural networks; NEURAL-NETWORK; ACOUSTICS; MATRICES; FEATURES;
DOI
10.1109/TASLP.2017.2751747
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Speech signals are produced by the smooth and continuous movements of the human articulators. An articulatory representation of speech is considered a more compact, more universal, and language-independent feature space and can therefore improve crosslingual and multilingual speech recognition systems, especially when porting components from one language to another in low-resource scenarios. However, learning the acoustic-to-articulatory conversion has proven to be a very challenging task. In this paper, we use a manifold learning technique to derive a nonlinear feature transformation from the conventional filterbank feature space to an articulatory-like feature space. The coordinates of the resulting representation, some of which have demonstrable phonological meaning, are shown to be highly portable across languages. We propose a framework for data selection and graph construction to train the coordinates from multilingual data, which allows the coordinate space to be trained when abundant out-of-language data are available. Deep neural network (DNN) bottleneck features are shown to exhibit a greater degree of language independence when this representation, rather than filterbank features, is used as input. The usability of this representation is further demonstrated in a number of DNN-based speech recognition experiments in a variety of crosslingual and multilingual scenarios on the multilingual GlobalPhone dataset. In particular, speech recognition systems developed in low-resource settings profit from the improved portability across languages.
Pages: 2301-2312
Page count: 12
Related papers
61 in total
[1] [Anonymous], 2012, PROC SPOKEN LANG TEC
[2] [Anonymous], 2014, ARXIV14107455
[3] [Anonymous], 2011, P IEEE WORKSH AUT SP
[4] [Anonymous], 2013, ARXIV13013605
[5]   Laplacian eigenmaps for dimensionality reduction and data representation [J].
Belkin, M ;
Niyogi, P .
NEURAL COMPUTATION, 2003, 15 (06) :1373-1396
[6] Belkin M, 2006, J MACH LEARN RES, V7, P2399
[7]   Automatic speech recognition for under-resourced languages: A survey [J].
Besacier, Laurent ;
Barnard, Etienne ;
Karpov, Alexey ;
Schultz, Tanja .
SPEECH COMMUNICATION, 2014, 56 :85-100
[8]   MULTILINGUAL ACOUSTIC MODELING FOR SPEECH RECOGNITION BASED ON SUBSPACE GAUSSIAN MIXTURE MODELS [J].
Burget, Lukas ;
Schwarz, Petr ;
Agarwal, Mohit ;
Akyazi, Pinar ;
Feng, Kai ;
Ghoshal, Arnab ;
Glembek, Ondrej ;
Goel, Nagendra ;
Karafiat, Martin ;
Povey, Daniel ;
Rastrow, Ariya ;
Rose, Richard C. ;
Thomas, Samuel .
2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, :4334-4337
[9] Byrne W, 2000, INT CONF ACOUST SPEE, P1029
[10] Cayton L., 2005, Univ. of California at San Diego Tech. Rep, V12, P1