PAXION: Patching Action Knowledge in Video-Language Foundation Models

Cited by: 0
Authors
Wang, Zhenhailong [1 ]
Blume, Ansel [1 ]
Li, Sha [1 ]
Liu, Genglin [1 ]
Cho, Jaemin [2 ]
Tang, Zineng [2 ]
Bansal, Mohit [2 ]
Ji, Heng [1 ]
Affiliations
[1] UIUC, Champaign, IL 61820 USA
[2] UNC, Chapel Hill, NC USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Keywords
DOI
N/A
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Action knowledge involves the understanding of textual, visual, and temporal aspects of actions. We introduce the Action Dynamics Benchmark (ActionBench), containing two carefully designed probing tasks: Action Antonym and Video Reversal, which target the model's multimodal alignment capabilities and temporal understanding skills, respectively. Despite the impressive performance of recent video-language models (VidLMs) on various benchmark tasks, our diagnostic tasks reveal their surprising deficiency (near-random performance) in action knowledge, suggesting that current models rely on object recognition abilities as a shortcut for action understanding. To remedy this, we propose a novel framework, PAXION, along with a new Discriminative Video Dynamics Modeling (DVDM) objective. The PAXION framework utilizes a Knowledge Patcher network to encode new action knowledge and a Knowledge Fuser component to integrate the Patcher into frozen VidLMs without compromising their existing capabilities. Because the widely used Video-Text Contrastive (VTC) loss is limited for learning action knowledge, we introduce the DVDM objective to train the Knowledge Patcher. DVDM forces the model to encode the correlation between the action text and the correct ordering of video frames. Our extensive analyses show that PAXION and DVDM together effectively fill the gap in action knowledge understanding (~50% → 80%), while maintaining or improving performance on a wide spectrum of both object- and action-centric downstream tasks. The code and data will be made publicly available for research purposes at https://github.com/MikeWangWZHL/Paxion.git.
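The two probes named in the abstract (Action Antonym and Video Reversal) translate naturally into a discriminative training signal. The PyTorch snippet below is a minimal illustrative sketch, not the authors' released implementation (see the repository linked above): it assumes precomputed embeddings from a frozen VidLM plus a small trainable patcher head, and every name in it (dvdm_style_loss, the temperature 0.07, the encode_video/encode_text/antonym_swap helpers in the usage comment) is a hypothetical stand-in.

import torch
import torch.nn.functional as F

def dvdm_style_loss(video_emb, rev_video_emb, text_emb, antonym_text_emb, tau=0.07):
    """Contrastive loss with action-antonym captions and frame-reversed clips
    as hard negatives (illustrative sketch of a DVDM-style objective)."""
    # Normalize so dot products are cosine similarities.
    video_emb = F.normalize(video_emb, dim=-1)                # (B, D) correctly ordered clips
    rev_video_emb = F.normalize(rev_video_emb, dim=-1)        # (B, D) frame-reversed clips
    text_emb = F.normalize(text_emb, dim=-1)                  # (B, D) action captions
    antonym_text_emb = F.normalize(antonym_text_emb, dim=-1)  # (B, D) antonym captions

    B = video_emb.size(0)
    labels = torch.arange(B, device=video_emb.device)  # positive pairs sit on the diagonal

    # Video -> text: each clip must pick its own caption out of all in-batch
    # captions plus all antonym-swapped captions (Action Antonym negatives).
    logits_v2t = video_emb @ torch.cat([text_emb, antonym_text_emb]).T / tau  # (B, 2B)
    loss_antonym = F.cross_entropy(logits_v2t, labels)

    # Text -> video: each caption must pick the correctly ordered clip out of
    # all in-batch clips plus all reversed clips (Video Reversal negatives).
    logits_t2v = text_emb @ torch.cat([video_emb, rev_video_emb]).T / tau     # (B, 2B)
    loss_reversal = F.cross_entropy(logits_t2v, labels)

    return loss_antonym + loss_reversal

# Hypothetical usage with a frozen backbone and a trainable patcher head,
# assuming video tensors of shape (B, T, ...) so dim 1 is the frame axis:
# loss = dvdm_style_loss(patcher(vidlm.encode_video(v)),
#                        patcher(vidlm.encode_video(v.flip(dims=[1]))),
#                        vidlm.encode_text(t),
#                        vidlm.encode_text(antonym_swap(t)))

Note the asymmetry in this sketch: reversed clips are appended only as text-to-video negatives, because a reversed clip still depicts the same objects as its caption; only the frame ordering distinguishes it from the positive, which is precisely the object-recognition shortcut the abstract says VTC-trained models exploit.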
Pages: 21
Related Papers
50 records in total
  • [31] Zhang, Haonan; Gao, Lianli; Zeng, Pengpeng; Hanjalic, Alan; Shen, Heng Tao. Depth-Aware Sparse Transformer for Video-Language Learning. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 4778-4787
  • [32] Huang, Jingjia; Li, Yinan; Feng, Jiashi; Wu, Xinglong; Sun, Xiaoshuai; Ji, Rongrong. Clover: Towards A Unified Video-Language Alignment and Fusion Model. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023: 14856-14866
  • [33] Yang, Xu; Li, Zhangzikang; Xu, Haiyang; Zhang, Hanwang; Ye, Qinghao; Li, Chenliang; Yan, Ming; Zhang, Yu; Huang, Fei; Huang, Songfang. Learning Trajectory-Word Alignments for Video-Language Tasks. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2023: 2504-2514
  • [34] Shao, Bin; Liu, Jianzhuang; Pei, Renjing; Xu, Songcen; Dai, Peng; Lu, Juwei; Li, Weimian; Yan, Youliang. HiVLP: Hierarchical Interactive Video-Language Pre-Training. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 13710-13720
  • [35] Ruan, Ludan; Jin, Qin. Survey: Transformer based video-language pre-training. AI OPEN, 2022, 3: 1-13
  • [36] Bansal, Hritik; Bitton, Yonatan; Szpektor, Idan; Chang, Kai-Wei; Grover, Aditya. VideoCon: Robust Video-Language Alignment via Contrast Captions. 2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024: 13927-13937
  • [37] Wang, Alex Jinpeng; Ge, Yixiao; Cai, Guanyu; Yan, Rui; Lin, Xudong; Shan, Ying; Qie, Xiaohu; Shou, Mike Zheng. Object-aware Video-language Pre-training for Retrieval. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 3303-3312
  • [38] Zhong, Weihong; Zheng, Mao; Tang, Duyu; Luo, Xuan; Gong, Heng; Feng, Xiaocheng; Qin, Bing. STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023: 3715-3723
  • [39] Cui, Chenhao; Liang, Xinnian; Wu, Shuangzhi; Li, Zhoujun. Learning Unified Video-Language Representations via Joint Modeling and Contrastive Learning for Natural Language Video Localization. 2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2023
  • [40] Xue, Hongwei; Hang, Tiankai; Zeng, Yanhong; Sun, Yuchong; Liu, Bei; Yang, Huan; Fu, Jianlong; Guo, Baining. Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 5026-5035