Audio-visual continuous speech recognition using MPEG-4 compliant visual features

Cited by: 0
Authors
Aleksic, PS [1]
Williams, JJ [1]
Wu, ZL [1]
Katsaggelos, AK [1]
Affiliations
[1] Northwestern Univ, Dept Elect & Comp Engn, Evanston, IL 60208 USA
Source
2002 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOL I, PROCEEDINGS | 2002
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper we utilize Facial Animation Parameters (FAPs), supported by the MPEG-4 standard for the visual representation of speech, to significantly improve automatic speech recognition (ASR). We describe a robust, automatic algorithm for extracting FAPs from visual data that requires no hand labeling or extensive training procedures. Multi-stream hidden Markov models (HMMs) were used to integrate audio and visual information. ASR experiments were performed under both clean and noisy audio conditions using a relatively large vocabulary (approximately 1000 words). The proposed system reduces the word error rate (WER) by 20% to 23% relative to the audio-only ASR WER at various SNRs with additive white Gaussian noise, and by 19% relative to the audio-only ASR WER under clean audio conditions.
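The abstract names two quantitative ideas without spelling them out: stream integration in a multi-stream HMM and relative WER reduction. The following is a minimal sketch of both, not the authors' implementation. It assumes the common multi-stream formulation in which the per-state emission log-likelihood is a weighted sum of the audio and visual stream log-likelihoods, with the two weights summing to one; the record does not state the weighting scheme actually used. The function names and all numeric values are hypothetical.

```python
import math


def combined_log_likelihood(log_b_audio, log_b_visual, audio_weight):
    """Stream-weighted emission log-likelihood for one HMM state.

    Assumed formulation (not given in the record):
        log b(o) = w_a * log b_a(o_a) + w_v * log b_v(o_v),  w_a + w_v = 1
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight  # assumption: weights sum to one
    return audio_weight * log_b_audio + visual_weight * log_b_visual


def relative_wer_reduction(wer_audio_only, wer_audio_visual):
    """Relative WER reduction of an audio-visual system over an audio-only baseline."""
    if wer_audio_only <= 0:
        raise ValueError("baseline WER must be positive")
    return (wer_audio_only - wer_audio_visual) / wer_audio_only


if __name__ == "__main__":
    # Hypothetical values for illustration only; none come from the paper.
    print(combined_log_likelihood(-10.0, -20.0, audio_weight=0.7))
    print(relative_wer_reduction(wer_audio_only=50.0, wer_audio_visual=40.0))
```

Under this assumption, lowering `audio_weight` as the SNR drops shifts reliance toward the visual stream. A relative reduction of 0.20 means the audio-visual WER is 80% of the audio-only WER, which is different from a drop of 20 percentage points.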
Pages: 960-963
Number of pages: 4