Multimodal Turn-Taking Model Using Visual Cues for End-of-Utterance Prediction in Spoken Dialogue Systems

Cited by: 0

Authors:

Kurata, Fuma [1]

Saeki, Mao [1]

Fujie, Shinya [2]

Matsuyama, Yoichi [1]

Affiliations:

[1] Waseda Univ, Tokyo, Japan

[2] Chiba Inst Technol, Chiba, Japan

Source:

INTERSPEECH 2023 | 2023

Keywords:

spoken dialog systems; turn-taking; multimodal machine learning

DOI:

10.21437/Interspeech.2023-578

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

In this study, we propose a multimodal model for predicting the end-of-utterance probability in spoken dialogue systems, highlighting the unique role of visual cues in addition to acoustic and linguistic information. Although the effectiveness of visual cues, such as gaze, mouth, and head movements, has been suggested, few studies have fully incorporated them into turn-taking models, and the relative importance of these cues remains under-researched. To address these issues, we first conducted an ablation study on visual features, which showed that eye movements contribute more than mouth and head movements. Additionally, an end-to-end visual feature extraction model based on a 3D-CNN is employed to capture these visual cues comprehensively. By combining visual features with acoustic and verbal information, the AUC score for end-of-utterance prediction improved from 0.896 to 0.920, demonstrating the effectiveness of incorporating visual cues into turn-taking models.

Pages: 2658-2662

Page count: 5
