Robust video question answering via contrastive cross-modality representation learning

Cited by: 3
Authors
Yang, Xun [1 ]
Zeng, Jianming [1 ,3 ]
Guo, Dan [2 ]
Wang, Shanshan [4 ]
Dong, Jianfeng [5 ]
Wang, Meng [2 ,3 ]
Affiliations
[1] University of Science and Technology of China, School of Information Science and Technology, Hefei 230026, People's Republic of China
[2] Hefei University of Technology, School of Computer Science and Information Engineering, Hefei 230601, People's Republic of China
[3] Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230088, People's Republic of China
[4] Anhui University, Institute of Physical Science and Information Technology, Hefei 230601, People's Republic of China
[5] Zhejiang Gongshang University, School of Computer Science and Technology, Hangzhou 310018, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
video question answering; cross-modality fusion; contrastive learning; cross-media reasoning; network
DOI
10.1007/s11432-023-4084-6
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Video question answering (VideoQA) is a challenging yet important task that requires a joint understanding of low-level video content and high-level textual semantics. Despite the promising progress of existing efforts, recent studies have revealed that current VideoQA models tend to over-rely on superficial correlations rooted in dataset bias while overlooking the key video content, leading to unreliable results. Effectively understanding and modeling the temporal and semantic characteristics of a given video for robust VideoQA is crucial but, to our knowledge, has not been well investigated. To fill this research gap, we propose a robust VideoQA framework that effectively models cross-modality fusion and forces the model to focus on the temporal and global content of videos when making a QA decision, instead of exploiting dataset shortcuts. Specifically, we design a self-supervised contrastive learning objective to contrast positive and negative pairs of multimodal input, in which the fused representation of the original multimodal input is pulled closer to that of the intervened input obtained by video perturbation. We expect the fused representation to focus more on the global context of videos rather than on a few static keyframes. Moreover, we introduce an effective temporal order regularization that enforces the inherent sequential structure of videos in the video representation. We also design a Kullback-Leibler divergence-based perturbation invariance regularization on the predicted answer distribution to improve the robustness of the model against temporal content perturbation of videos. Our method is model-agnostic and easily compatible with various VideoQA backbones. Extensive experimental results and analyses on several public datasets demonstrate the advantage of our method over state-of-the-art methods in terms of both accuracy and robustness.
Pages: 16
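The two robustness objectives named in the abstract (the contrastive objective over original vs. perturbation-intervened inputs, and the KL-based perturbation invariance regularization) correspond to well-known loss formulations. Below is a minimal PyTorch-style sketch of how such losses are commonly written; the InfoNCE formulation, the symmetric KL, and all function names and tensor shapes are our illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only; the exact losses in the paper may differ.
import torch
import torch.nn.functional as F


def contrastive_fusion_loss(z_orig, z_pos, z_neg, tau=0.07):
    """InfoNCE-style objective (assumed formulation): pull the fused
    representation of the original multimodal input (z_orig) toward that
    of the perturbation-intervened input (z_pos) and away from negatives.

    Shapes (hypothetical): z_orig (B, D), z_pos (B, D), z_neg (B, K, D).
    """
    z_orig = F.normalize(z_orig, dim=-1)
    z_pos = F.normalize(z_pos, dim=-1)
    z_neg = F.normalize(z_neg, dim=-1)
    pos = (z_orig * z_pos).sum(-1, keepdim=True) / tau      # (B, 1)
    neg = torch.einsum("bd,bkd->bk", z_orig, z_neg) / tau   # (B, K)
    logits = torch.cat([pos, neg], dim=1)                   # (B, 1+K)
    target = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)              # positive at index 0
    return F.cross_entropy(logits, target)


def kl_invariance_loss(logits_orig, logits_pert):
    """Symmetric KL divergence between the answer distributions predicted
    from the original and the temporally perturbed video, encouraging
    perturbation-invariant predictions (symmetrization is our assumption)."""
    log_p = F.log_softmax(logits_orig, dim=-1)
    log_q = F.log_softmax(logits_pert, dim=-1)
    kl_pq = F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q, reduction="batchmean", log_target=True)  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)
```

In a training loop, terms like these would typically be added to the standard QA cross-entropy loss with weighting hyperparameters; since the method is described as model-agnostic, the fused representations and answer logits could come from any VideoQA backbone.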