Crosslingual and Multilingual Speech Recognition Based on the Speech Manifold

Cited by: 12

Authors:

Sahraeian, Reza [1]

Van Compernolle, Dirk [1]

Affiliations:

[1] Katholieke Univ Leuven, Dept Elect Engn, Ctr Proc Speech & Image, B-3000 Leuven, Belgium

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2017 / Vol. 25 / No. 12

Keywords:

Crosslingual and multilingual speech recognition; acoustic-to-articulatory mapping; manifold learning; deep neural networks; NEURAL-NETWORK; ACOUSTICS; MATRICES; FEATURES;

DOI:

10.1109/TASLP.2017.2751747

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Speech signals are produced by the smooth and continuous movements of the human articulators. An articulatory representation of speech is considered to be a more compact, more universal, and language-independent speech feature space and can, therefore, improve crosslingual and multilingual speech recognition systems, especially when porting components from one language to another in low-resource scenarios. However, learning the acoustic-to-articulatory conversion has proven to be a very challenging task. In this paper, we utilize a manifold learning technique to derive a nonlinear feature transformation from the conventional filterbank feature space to an articulatory-like feature space. The coordinates in the resulting representation, some of which have demonstrable phonological meaning, are shown to be highly portable across languages. We propose a framework for data selection and graph construction to train the coordinates from multilingual data, which allows the coordinate space to be trained when abundant out-of-language data is available. Deep neural network (DNN) bottleneck features are demonstrated to exhibit a greater degree of language independence when this representation is used as input than when filterbank features are used. The usability of this representation is further demonstrated in a number of DNN-based speech recognition experiments in a variety of crosslingual and multilingual scenarios on the multilingual GlobalPhone dataset. In particular, speech recognition systems developed in low-resource settings profit from the improved portability across languages.


Pages: 2301-2312

Page count: 12

