GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations

Cited by: 0
Authors
Ilaslan, Muhammet Furkan [1 ,2 ]
Song, Chenan [1 ]
Chen, Joya [1 ]
Gao, Difei [1 ]
Lei, Weixian [1 ]
Xu, Qianli [2 ]
Lim, Joo Hwee [2 ]
Shou, Mike Zheng [1 ]
Affiliations
[1] Natl Univ Singapore, Show Lab, Singapore, Singapore
[2] ASTAR, Inst Infocomm Res, Singapore, Singapore
Source
2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023) | 2023
Funding
National Research Foundation, Singapore;
Keywords
LANGUAGE; VISION;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The usage of exocentric and egocentric videos in Video Question Answering (VQA) is a new endeavor in human-robot interaction and collaboration studies. Particularly for egocentric videos, one may leverage eye-gaze information to understand human intentions during the task. In this paper, we build a novel task-oriented VQA dataset, called GazeVQA, for collaborative tasks where gaze information is captured during the task process. GazeVQA is designed with a novel QA format that covers thirteen different reasoning types to capture multiple aspects of task information and user intent. For each participant, GazeVQA consists of more than 1,100 textual questions and more than 500 labeled images that were annotated with the assistance of the Segment Anything Model. In total, 2,967 video clips, 12,491 labeled images, and 25,040 questions from 22 participants were included in the dataset. Additionally, inspired by the assisting models and common ground theory for industrial task collaboration, we propose a new AI model called AssistGaze that is designed to answer the questions with three different answer types, namely textual, image, and video. AssistGaze can effectively ground the perceptual input into semantic information while reducing ambiguities. We conduct comprehensive experiments to demonstrate the challenges of GazeVQA and the effectiveness of AssistGaze.
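To make the described dataset composition concrete, the following is a minimal, hypothetical sketch of how a single GazeVQA QA sample could be represented, based only on what the abstract states (22 participants, 2,967 video clips, 12,491 SAM-assisted labeled images, 25,040 questions, thirteen reasoning types, and three answer types: textual, image, and video). All field names and enum values here are illustrative assumptions, not the authors' release format.

```python
# Illustrative sketch only: a hypothetical schema for one GazeVQA sample.
# Field names and AnswerType values are assumptions inferred from the abstract.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class AnswerType(Enum):
    TEXT = "textual"
    IMAGE = "image"
    VIDEO = "video"


@dataclass
class GazeVQASample:
    participant_id: str                 # one of the 22 participants
    video_clip_path: str                # one of the 2,967 task video clips
    question: str                       # one of the 25,040 textual questions
    reasoning_type: str                 # one of the thirteen reasoning types
    answer_type: AnswerType             # textual, image, or video answer
    answer_text: Optional[str] = None
    answer_image_paths: List[str] = field(default_factory=list)  # SAM-assisted labeled images
    answer_clip_path: Optional[str] = None


# Example usage with placeholder paths.
sample = GazeVQASample(
    participant_id="P01",
    video_clip_path="clips/P01_task_001.mp4",
    question="Which part should be attached next?",
    reasoning_type="next-step",
    answer_type=AnswerType.IMAGE,
    answer_image_paths=["images/P01_task_001_part.png"],
)
print(sample.answer_type.value)
```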
Pages: 10462-10479
Number of pages: 18