Utterance-based selective training for the automatic creation of task-dependent acoustic models

Cited by: 5
Authors
Cincarek, T [1 ]
Toda, T [1 ]
Saruwatari, H [1 ]
Shikano, K [1 ]
Affiliations
[1] Nara Inst Sci & Technol, Grad Sch Informat Sci, Ikoma 6300192, Japan
Keywords
acoustic modeling; task-dependency; development costs; selective training; sufficient statistics;
DOI
10.1093/ietisy/e89-d.3.962
Chinese Library Classification (CLC) code
TP [automation and computer technology];
Discipline classification code
0812;
Abstract
To obtain a robust acoustic model for a certain speech recognition task, a large amount of speech data is necessary. However, the preparation of speech data, including recording and transcription, is very costly and time-consuming. Although there are attempts to build generic acoustic models which are portable among different applications, speech recognition performance is typically task-dependent. This paper introduces a method for automatically building task-dependent acoustic models based on selective training. Instead of setting up a new database, only a small amount of task-specific development data needs to be collected. Based on the likelihood of the target model parameters given this development data, utterances which are acoustically close to the development data are selected from existing speech data resources. Since there are in general too many possibilities for selecting a data subset from a larger database, a heuristic has to be employed. The proposed algorithm deletes single utterances temporarily or alternates between successive deletion and addition of multiple utterances. In order to make selective training computationally practical, model retraining and likelihood calculation need to be fast. It is shown that the model likelihood can be calculated quickly and easily from sufficient statistics, without the need for explicit reconstruction of model parameters. The algorithm is applied to obtain infant- and elderly-dependent acoustic models with only very little development data available. There is an improvement in word accuracy of up to 9% in comparison to conventional EM training without selection. Furthermore, the approach also outperforms MLLR and MAP adaptation with the development data.
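A minimal sketch of the selection idea described in the abstract, not the authors' implementation: per-utterance sufficient statistics are accumulated once, so the model for any candidate subset is obtained by subtracting an utterance's statistics rather than re-running EM, and the development-data likelihood decides whether a temporary deletion is kept. For illustration, a single diagonal-covariance Gaussian stands in for the HMM acoustic model, and all names and data below are hypothetical.

# Illustrative sketch (assumed simplification, not the paper's exact procedure):
# greedy deletion of training utterances guided by development-data likelihood,
# using per-utterance sufficient statistics so candidate models are cheap.
import numpy as np

def utterance_stats(frames):
    # Zeroth-, first-, and second-order sufficient statistics of one utterance.
    return len(frames), frames.sum(axis=0), (frames ** 2).sum(axis=0)

def gaussian_from_stats(n, s1, s2, var_floor=1e-3):
    # Re-derive diagonal Gaussian parameters from pooled statistics.
    mean = s1 / n
    var = np.maximum(s2 / n - mean ** 2, var_floor)
    return mean, var

def dev_log_likelihood(dev_frames, mean, var):
    # Average per-frame log-likelihood of the development data.
    ll = -0.5 * (np.log(2.0 * np.pi * var) + (dev_frames - mean) ** 2 / var)
    return ll.sum(axis=1).mean()

def select_utterances(train_utts, dev_frames):
    stats = [utterance_stats(u) for u in train_utts]
    selected = set(range(len(train_utts)))
    n = sum(s[0] for s in stats)
    s1 = sum(s[1] for s in stats)
    s2 = sum(s[2] for s in stats)
    best = dev_log_likelihood(dev_frames, *gaussian_from_stats(n, s1, s2))
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for i in sorted(selected):
            if len(selected) == 1:
                break
            ni, s1i, s2i = stats[i]
            # Temporarily delete utterance i by subtracting its statistics.
            cand = gaussian_from_stats(n - ni, s1 - s1i, s2 - s2i)
            ll = dev_log_likelihood(dev_frames, *cand)
            if ll > best:  # deletion helps the development data: keep it
                best, improved = ll, True
                selected.discard(i)
                n, s1, s2 = n - ni, s1 - s1i, s2 - s2i
    return selected, best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic "existing resources": 20 matched and 5 mismatched utterances.
    matched = [rng.normal(0.0, 1.0, size=(200, 13)) for _ in range(20)]
    mismatched = [rng.normal(3.0, 1.0, size=(200, 13)) for _ in range(5)]
    dev = rng.normal(0.0, 1.0, size=(300, 13))  # small task-specific dev set
    kept, ll = select_utterances(matched + mismatched, dev)
    print(f"kept {len(kept)} of 25 utterances, dev log-likelihood {ll:.2f}")

Running this sketch drops the acoustically mismatched utterances, mirroring how the paper selects task-relevant data from a larger existing database using only a small development set.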
Pages: 962-969
Page count: 8
References
14 in total
[1] Anastasakos T., Proc. ICSLP '96 - Fourth International Conference on Spoken Language Processing, Vols. 1-4, 1996, p. 1137. DOI: 10.1109/ICSLP.1996.607807
[2] Arslan L.M., Hansen J.H.L., "Selective training for hidden Markov models with applications to speech classification," IEEE Transactions on Speech and Audio Processing, 1999, 7(1): 46-54
[3] Dempster A.P., Laird N.M., Rubin D.B., "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), 1977, 39(1): 1-38
[4] Gao Y.Q., Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005, p. 1017
[5] Gauvain J.-L., Lee C.-H., "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Transactions on Speech and Audio Processing, 1994, 2(2): 291-298
[6] Hakkani-Tür D., Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2002, p. 3904
[7] Hart, Pattern Classification, 2006
[8] Huang C., Proc. Int. Conf. on Spoken Language Processing, 2004, p. 1001
[9] Kamm T.M., Proc. Int. Conf. on Spoken Language Processing, 2004, p. 1095
[10] Lefevre F., Gauvain J.L., Lamel L., "Genericity and portability for task-independent speech recognition," Computer Speech and Language, 2005, 19(3): 345-363