Improving Automatic Speech Recognition Through Head Pose Driven Visual Grounding

Cited by: 1
Authors
Vosoughi, Soroush [1]
Affiliations
[1] MIT, Media Lab, 75 Amherst St, E14-574K, Cambridge, MA 02139 USA
Source
32nd Annual ACM Conference on Human Factors in Computing Systems (CHI 2014) | 2014
Keywords
visual grounding; language models; automatic speech recognition; head pose estimation; visual attention
DOI
10.1145/2556288.2556957
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, we present a multimodal speech recognition system for real-world scene description tasks. Given a visual scene, the system dynamically biases its language model based on the content of the visual scene and the visual attention of the speaker. Visual attention is used to focus on likely objects within the scene. Given a spoken description, the system then uses the visually biased language model to process the speech. The system uses head pose as a proxy for the visual attention of the speaker. Readily available standard computer vision algorithms are used to recognize the objects in the scene, and automatic real-time head pose estimation is done using depth data captured via a Microsoft Kinect. The system was evaluated on multiple participants. Overall, incorporating visual information into the speech recognizer greatly improved speech recognition accuracy. The rapidly decreasing cost of 3D sensing technologies such as the Kinect allows systems with similar underlying principles to be used for many speech recognition tasks where there is visual information.
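The abstract does not specify how the language model is biased, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a unigram language model, known direction vectors from the speaker to each recognized object, and a head direction vector from a pose estimator. The function names (`attention_weights`, `bias_unigram_lm`) and the parameters `sharpness` and `mix` are hypothetical, introduced purely for illustration.

```python
import math


def attention_weights(head_dir, object_dirs, sharpness=5.0):
    """Turn a head direction into a distribution over object names.

    Assumption: attention is modeled as a softmax over the cosine
    similarity between the head direction and each object direction.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return tuple(x / norm for x in v)

    h = unit(head_dir)
    scores = {
        name: sharpness * sum(a * b for a, b in zip(h, unit(d)))
        for name, d in object_dirs.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def bias_unigram_lm(base_lm, attention, mix=0.3):
    """Linearly interpolate a base unigram LM with the attention distribution.

    P'(w) = (1 - mix) * P(w) + mix * A(w), which stays normalized when
    both inputs sum to 1.
    """
    vocab = set(base_lm) | set(attention)
    return {
        w: (1 - mix) * base_lm.get(w, 0.0) + mix * attention.get(w, 0.0)
        for w in vocab
    }


if __name__ == "__main__":
    # Hypothetical scene: a cup to the right, a ball to the left.
    base_lm = {"the": 0.5, "is": 0.3, "cup": 0.1, "ball": 0.1}
    object_dirs = {"cup": (1.0, 0.0, 1.0), "ball": (-1.0, 0.0, 1.0)}
    head_dir = (0.9, 0.0, 1.0)  # speaker is facing roughly toward the cup

    attention = attention_weights(head_dir, object_dirs)
    biased = bias_unigram_lm(base_lm, attention)
    for word in sorted(biased, key=biased.get, reverse=True):
        print(f"{word}: {biased[word]:.3f}")
```

In this sketch, words naming objects near the speaker's head direction gain probability mass at the expense of the rest of the vocabulary. A real recognizer would more likely apply such a bias to an n-gram or neural language model inside the decoder.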
Pages: 3235-3238
Number of pages: 4