Equivariant and Invariant Grounding for Video Question Answering

Cited by: 10
Authors
Li, Yicong [1]
Wang, Xiang [2]
Xiao, Junbin [1]
Chua, Tat-Seng [1]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] Univ Sci & Technol China, Hefei, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Keywords
Video Question Answering; Invariant Learning; Equivariant Learning; Interpretability
DOI
10.1145/3503161.3548035
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Video Question Answering (VideoQA) is the task of answering natural-language questions about a video. Producing an answer requires understanding the interplay between the visual scenes in the video and the linguistic semantics of the question. However, most leading VideoQA models work as black boxes, which makes the visual-linguistic alignment behind the answering process obscure. This black-box nature calls for visual explainability that reveals "What part of the video should the model look at to answer the question?". Only a few works present visual explanations, and they do so in a post-hoc fashion, emulating the target model's answering process via an additional method. Nonetheless, the emulation struggles to faithfully exhibit the visual-linguistic alignment during answering. Instead of post-hoc explainability, we focus on intrinsic interpretability to make the answering process transparent. At its core is grounding the question-critical cues as the causal scene that yields answers, while treating the question-irrelevant information as the environment scene. Taking a causal look at VideoQA, we devise a self-interpretable framework, Equivariant and Invariant Grounding for Interpretable VideoQA (EIGV). Specifically, the equivariant grounding encourages the answering to be sensitive to semantic changes in the causal scene and question; in contrast, the invariant grounding enforces the answering to be insensitive to changes in the environment scene. By imposing both on the answering process, EIGV is able to distinguish the causal scene from the environment information and to explicitly present the visual-linguistic alignment. Extensive experiments on three benchmark datasets justify the superiority of EIGV in terms of accuracy and visual interpretability over the leading baselines. Our code is available at https://github.com/yl3800/EIGV.
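The two properties named in the abstract can be illustrated with a toy check. The following Python sketch is not the EIGV implementation and is not taken from the authors' repository; every name in it (`split_scenes`, `toy_answer`, `answer_video`, the integer "frame features", and the hand-written mask) is a hypothetical stand-in. It assumes a grounding mask is already given and only shows what "invariant to the environment scene" and "sensitive to the causal scene" would mean for an answerer that reads the causal scene alone.

```python
# Toy illustration only; NOT the EIGV implementation.
# All names and the integer "features" are hypothetical assumptions.

def split_scenes(frames, causal_mask):
    """Partition frame features into a causal scene and an environment scene."""
    causal = [f for f, m in zip(frames, causal_mask) if m]
    environment = [f for f, m in zip(frames, causal_mask) if not m]
    return causal, environment


def toy_answer(causal, question):
    """Stand-in answerer that depends only on the causal scene and the question."""
    return (sum(causal) + question) % 3


def answer_video(frames, causal_mask, question):
    """Ground the causal scene first, then answer from it alone."""
    causal, _environment = split_scenes(frames, causal_mask)
    return toy_answer(causal, question)


frames = [1, 5, 2, 7]               # four toy frame features
mask = [True, False, True, False]   # frames 0 and 2 are assumed question-critical
question = 1                        # toy question encoding

base = answer_video(frames, mask, question)

# Invariance check: replacing only the environment frames keeps the answer.
env_swapped = [1, 9, 2, 4]
invariant_ok = answer_video(env_swapped, mask, question) == base

# Equivariance-style check: changing a causal frame changes the answer here.
causal_swapped = [2, 5, 2, 7]
equivariant_ok = answer_video(causal_swapped, mask, question) != base

print(invariant_ok, equivariant_ok)
```

In this sketch the invariance holds by construction, because the answerer never sees the environment frames. In a learned model such behaviour would have to be encouraged during training, which is the role the abstract assigns to the invariant and equivariant grounding.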
Pages: 4714-4722
Page count: 9