Multimodal Turn-Taking Model Using Visual Cues for End-of-Utterance Prediction in Spoken Dialogue Systems

Cited by: 0
Authors
Kurata, Fuma [1]
Saeki, Mao [1]
Fujie, Shinya [2]
Matsuyama, Yoichi [1]
Affiliations
[1] Waseda Univ, Tokyo, Japan
[2] Chiba Inst Technol, Chiba, Japan
Source
INTERSPEECH 2023 | 2023
Keywords
spoken dialog systems; turn-taking; multimodal machine learning;
DOI
10.21437/Interspeech.2023-578
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206; 082403;
Abstract
In this study, we propose a multimodal model for predicting the end-of-utterance probability in spoken dialogue systems, highlighting the unique role of visual cues in addition to acoustic and linguistic information. Although the effectiveness of visual cues, such as gaze, mouth, and head movements, has been suggested, few studies have fully incorporated them into turn-taking models, and the relative importance of these visual cues remains under-researched. To address these issues, we first conducted an ablation study on visual features, which showed that eye movements contribute more than mouth and head movements. Additionally, an end-to-end visual feature extraction model based on a 3D-CNN is employed to comprehensively capture these visual cues. By combining visual features with acoustic and verbal information, the AUC score for end-of-utterance prediction improved from 0.896 to 0.920, demonstrating the effectiveness of incorporating these visual cues in turn-taking models.
Pages: 2658-2662
Number of pages: 5