Learning Dynamic Stream Weights For Coupled-HMM-Based Audio-Visual Speech Recognition

Cited by: 60

Authors:

Abdelaziz, Ahmed Hussen [1]

Zeiler, Steffen [1]

Kolossa, Dorothea [1]

Affiliations:

[1] Ruhr Univ Bochum, Cognit Signal Proc Grp, Inst Commun Acoust, D-44780 Bochum, Germany

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2015, Vol. 23, No. 5

Keywords:

Audio-visual speech recognition; coupled hidden Markov model; logistic regression; multilayer perceptron; reliability measure; stream weight; NOISE; NETWORKS; ENTROPY; MODELS;

DOI:

10.1109/TASLP.2015.2409785

Chinese Library Classification:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

With the increasing use of multimedia data in communication technologies, the idea of employing visual information in automatic speech recognition (ASR) has recently gathered momentum. In conjunction with the acoustical information, the visual data enhances the recognition performance and improves the robustness of ASR systems in noisy and reverberant environments. In audio-visual systems, dynamic weighting of audio and video streams according to their instantaneous confidence is essential for reliably and systematically achieving high performance. In this paper, we present a complete framework that allows blind estimation of dynamic stream weights for audio-visual speech recognition based on coupled hidden Markov models (CHMMs). As a stream weight estimator, we consider using multilayer perceptrons and logistic functions to map multidimensional reliability measure features to audiovisual stream weights. Training the parameters of the stream weight estimator requires numerous input-output tuples of reliability measure features and their corresponding stream weights. We estimate these stream weights based on oracle knowledge using an expectation maximization algorithm. We define 31-dimensional feature vectors that combine model-based and signal-based reliability measures as inputs to the stream weight estimator. During decoding, the trained stream weight estimator is used to blindly estimate stream weights. The entire framework is evaluated using the Grid audio-visual corpus and compared to state-of-the-art stream weight estimation strategies. The proposed framework significantly enhances the performance of the audio-visual ASR system in all examined test conditions.

Pages: 863-876

Page count: 14
