Experimenting with lipreading for large vocabulary continuous speech recognition

Cited by: 0
Authors
Karel Paleček
Affiliation
[1] Technical University of Liberec,Institute of Information Technology and Electronics
Source
Journal on Multimodal User Interfaces | 2018 / Vol. 12
Keywords
Audiovisual speech recognition; Lipreading; LVCSR
DOI
Not available
Abstract
The vast majority of current research on audiovisual speech recognition via lipreading from frontal face videos focuses on simple cases such as isolated phrase recognition or structured speech, where the vocabulary is limited to several tens of units. In this paper, we diverge from these traditional applications and investigate the effect of incorporating visual and depth information in continuous speech recognition with vocabulary sizes ranging from several hundred to half a million words. To this end, we evaluate various visual speech parametrizations, both existing and novel, that are designed to capture different kinds of information in the video and depth signals. The experiments are conducted on a moderately sized dataset of 54 speakers, each uttering 100 sentences in Czech. Both the video and depth data were captured by the Microsoft Kinect device. We show that even for large vocabularies the visual signal contains enough information to improve word accuracy by up to 22% relative to acoustic-only recognition. Somewhat surprisingly, a relative improvement of up to 16% has also been reached using the interpolated depth data.
Pages: 309-318
Page count: 9