GATED MULTIMODAL FUSION WITH CONTRASTIVE LEARNING FOR TURN-TAKING PREDICTION IN HUMAN-ROBOT DIALOGUE

Cited by: 6
Authors
Yang, Jiudong [1]
Wang, Peiying [1]
Zhu, Yi [1,2]
Feng, Mingchao [1]
Chen, Meng [1]
He, Xiaodong [1]
Affiliations
[1] JD AI, Beijing, People's Republic of China
[2] University of Cambridge, LTL, Cambridge, England
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022
Funding
National Key R&D Program of China
Keywords
Multimodal Fusion; Turn-taking; Barge-in; Endpointing; Spoken Dialogue System;
DOI
10.1109/ICASSP43922.2022.9747056
Chinese Library Classification (CLC) Number
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Turn-taking, which aims to decide when the next speaker can start talking, is an essential component of human-robot spoken dialogue systems. Previous studies indicate that multimodal cues can facilitate this challenging task. However, due to the paucity of public multimodal datasets, current methods are mostly limited to unimodal features or simplistic multimodal ensemble models. Moreover, the inherent class imbalance in real scenarios (e.g., a sentence ending with a short pause is usually regarded as the end of a turn) also poses a great challenge to the turn-taking decision. In this paper, we first collect a large-scale annotated corpus for turn-taking with over 5,000 real human-robot dialogues in the speech and text modalities. Then, a novel gated multimodal fusion mechanism is devised to seamlessly exploit information from multiple modalities for turn-taking prediction. More importantly, to tackle the data imbalance issue, we design a simple yet effective data augmentation method that constructs negative instances without supervision, and we apply contrastive learning to obtain better feature representations. Extensive experiments demonstrate the superiority and competitiveness of our model over several state-of-the-art baselines.
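
To make the two components named in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released code: it assumes a sigmoid-gated convex combination of projected speech and text embeddings for the fusion step, and a standard InfoNCE objective for the contrastive step. The abstract does not specify the gating form, the augmentation scheme, or any dimensions, so all names and parameters here (GatedMultimodalFusion, info_nce_loss, hidden_dim, temperature) are illustrative assumptions.

# Minimal sketch of (1) sigmoid-gated fusion of speech and text features
# and (2) an InfoNCE-style contrastive loss over augmented negatives.
# All names, dimensions, and the exact gating form are assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultimodalFusion(nn.Module):
    """Fuse speech and text features with a learned sigmoid gate."""

    def __init__(self, speech_dim: int, text_dim: int, hidden_dim: int):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # The gate sees both modalities and outputs per-dimension weights.
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, 2)  # end-of-turn vs. hold

    def forward(self, speech_feat: torch.Tensor, text_feat: torch.Tensor):
        h_s = torch.tanh(self.speech_proj(speech_feat))
        h_t = torch.tanh(self.text_proj(text_feat))
        z = torch.sigmoid(self.gate(torch.cat([h_s, h_t], dim=-1)))
        fused = z * h_s + (1.0 - z) * h_t  # per-dimension convex combination
        return self.classifier(fused), fused

def info_nce_loss(anchor, positive, negatives, temperature: float = 0.1):
    """InfoNCE loss: pull (anchor, positive) together, push negatives away.

    anchor, positive: (B, D); negatives: (B, K, D), e.g. K representations of
    negative instances built by unsupervised augmentation (exact scheme assumed).
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum(dim=-1, keepdim=True)      # (B, 1)
    neg = torch.einsum("bd,bkd->bk", anchor, negatives)      # (B, K)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long)   # positive is index 0
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    model = GatedMultimodalFusion(speech_dim=128, text_dim=768, hidden_dim=256)
    speech, text = torch.randn(4, 128), torch.randn(4, 768)
    logits, fused = model(speech, text)
    loss = info_nce_loss(fused, torch.randn(4, 256), torch.randn(4, 8, 256))
    print(logits.shape, loss.item())

The appeal of a learned gate over plain concatenation is that the model can discount a noisy modality (e.g., an unreliable ASR transcript) on a per-utterance basis; this sketch only illustrates that general pattern.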
Pages: 7747-7751
Number of Pages: 5