Turbo Automatic Speech Recognition

Cited by: 17

Authors:

Receveur, Simon [1]

Weiss, Robin [1]

Fingscheidt, Tim [1]

Affiliations:

[1] Tech Univ Carolo Wilhelmina Braunschweig, Inst Commun Technol, D-38106 Braunschweig, Germany

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2016, Vol. 24, No. 05

Keywords:

Speech recognition; iterative decoding; hidden Markov models; robustness; multimedia systems; FEATURE ENHANCEMENT; CONFIDENCE MEASURES; INFORMATION FUSION;

DOI:

10.1109/TASLP.2016.2520364

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Performance of automatic speech recognition (ASR) systems can significantly be improved by integrating further sources of information such as additional modalities, or acoustic channels, or acoustic models. Given the arising problem of information fusion, striking parallels to problems in digital communications are exhibited, where the discovery of the turbo codes by Berrou et al. was a groundbreaking innovation. In this paper, we show ways how to successfully apply the turbo principle to the domain of ASR and thereby provide solutions to the above-mentioned information fusion problem. The contribution of our work is fourfold: First, we review the turbo decoding forward-backward algorithm (FBA), giving detailed insights into turbo ASR, and providing a new interpretation and formulation of the so-called extrinsic information being passed between the recognizers. Second, we present a real-time capable turbo-decoding Viterbi algorithm suitable for practical information fusion and recognition tasks. Then we present simulation results for a multimodal example of information fusion. Finally, we prove the suitability of both our turbo FBA and turbo Viterbi algorithm also for a single-channel multimodel recognition task obtained by using two acoustic feature extraction methods. On a small vocabulary task (challenging, since spelling is included), our proposed turbo ASR approach outperforms even the best reference system on average over all SNR conditions and investigated noise types by a relative word error rate (WER) reduction of 22.4% (audio-visual task) and 18.2% (audio-only task), respectively.
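The abstract reports its gains as relative word error rate (WER) reductions. As a reading aid, the following minimal sketch shows how such a relative reduction is conventionally computed from a reference WER and a proposed-system WER. The function name and the numbers in the usage line are illustrative assumptions and are not taken from the paper.

```python
def relative_wer_reduction(wer_reference: float, wer_proposed: float) -> float:
    """Return the relative WER reduction in percent.

    Both inputs are absolute WERs in the same unit (e.g. percent).
    The name and signature are hypothetical, chosen for illustration only.
    """
    if wer_reference <= 0:
        raise ValueError("reference WER must be positive")
    return 100.0 * (wer_reference - wer_proposed) / wer_reference


# Hypothetical values, NOT results from the paper:
# a reference system at 10.0% WER and a proposed system at 8.0% WER.
print(relative_wer_reduction(10.0, 8.0))
```

Under this convention, a relative reduction of 22.4% means that the proposed system's WER is 77.6% of the best reference system's WER; the absolute WER values themselves are not stated in the abstract.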

References

Pages: 846-862

Page count: 17

63 entries in total

[1] Learning Dynamic Stream Weights For Coupled-HMM-Based Audio-Visual Speech Recognition [J].

Abdelaziz, Ahmed Hussen ;

Zeiler, Steffen ;

Kolossa, Dorothea .

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2015, 23 (05) :863-876

[2]

[Anonymous], 1983, Error control coding

[3]

[Anonymous], 2006, SAND20065315 SAND NA

[4]

[Anonymous], 1993, Fundamentals of speech recognition

[5]

[Anonymous], 2002, ETSI ES

[6]

Audhkhasi K, 2013, INTERSPEECH, P3081

[7] Theoretical Analysis of Diversity in an Ensemble of Automatic Speech Recognition Systems [J].

Audhkhasi, Kartik ;

Zavou, Andreas M. ;

Georgiou, Panayiotis G. ;

Narayanan, Shrikanth S. .

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2014, 22 (03) :711-726

[8] OPTIMAL DECODING OF LINEAR CODES FOR MINIMIZING SYMBOL ERROR RATE [J].

BAHL, LR ;

COCKE, J ;

JELINEK, F ;

RAVIV, J .

IEEE TRANSACTIONS ON INFORMATION THEORY, 1974, 20 (02) :284-287

[9] A MAXIMUM-LIKELIHOOD APPROACH TO CONTINUOUS SPEECH RECOGNITION [J].

BAHL, LR ;

JELINEK, F ;

MERCER, RL .

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 1983, 5 (02) :179-190

[10]

BERROU C, 1993, IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS 93 : TECHNICAL PROGRAM, CONFERENCE RECORD, VOLS 1-3, P1064, DOI 10.1109/ICC.1993.397441
