PAXION: Patching Action Knowledge in Video-Language Foundation Models

Cited by: 0
Authors
Wang, Zhenhailong [1 ]
Blume, Ansel [1 ]
Li, Sha [1 ]
Liu, Genglin [1 ]
Cho, Jaemin [2 ]
Tang, Zineng [2 ]
Bansal, Mohit [2 ]
Ji, Heng [1 ]
Affiliations
[1] UIUC, Champaign, IL 61820 USA
[2] UNC, Chapel Hill, NC USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Action knowledge involves the understanding of textual, visual, and temporal aspects of actions. We introduce the Action Dynamics Benchmark (ActionBench), containing two carefully designed probing tasks: Action Antonym and Video Reversal, which target the model's multimodal alignment capabilities and temporal understanding skills, respectively. Despite recent video-language models' (VidLMs) impressive performance on various benchmark tasks, our diagnostic tasks reveal their surprising deficiency (near-random performance) in action knowledge, suggesting that current models rely on object recognition abilities as a shortcut for action understanding. To remedy this, we propose a novel framework, PAXION, along with a new Discriminative Video Dynamics Modeling (DVDM) objective. The PAXION framework utilizes a Knowledge Patcher network to encode new action knowledge and a Knowledge Fuser component to integrate the Patcher into frozen VidLMs without compromising their existing capabilities. Because the widely used Video-Text Contrastive (VTC) loss is limited in its ability to learn action knowledge, we introduce the DVDM objective to train the Knowledge Patcher. DVDM forces the model to encode the correlation between the action text and the correct ordering of video frames. Our extensive analyses show that PAXION and DVDM together effectively fill the gap in action knowledge understanding (~50% → 80%), while maintaining or improving performance on a wide spectrum of both object- and action-centric downstream tasks. The code and data will be made publicly available for research purposes at https://github.com/MikeWangWZHL/Paxion.git.
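The abstract describes the DVDM objective only at a high level. As a rough illustration (a minimal sketch, not the paper's actual implementation), the Python/PyTorch snippet below shows one way such a discriminative objective could be set up: the true (video, action-text) pair is contrasted against two hard negatives, the same clip with its frame order reversed and the same clip paired with an action-antonym caption, mirroring the two ActionBench probes. All names here (dvdm_style_loss, the temperature value, the embedding shapes) are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def dvdm_style_loss(video_emb, rev_video_emb, text_emb, antonym_emb,
                        temperature=0.07):
        """Sketch of a DVDM-style contrastive loss (hypothetical).

        Each argument is a (batch, dim) embedding; in PAXION's setting these
        would come from a frozen VidLM plus a trainable Knowledge Patcher.
        """
        # Cosine similarities via L2-normalized dot products.
        v, vr = F.normalize(video_emb, dim=-1), F.normalize(rev_video_emb, dim=-1)
        t, ta = F.normalize(text_emb, dim=-1), F.normalize(antonym_emb, dim=-1)

        pos = (v * t).sum(-1) / temperature             # correct pairing
        neg_reversed = (vr * t).sum(-1) / temperature   # wrong frame order
        neg_antonym = (v * ta).sum(-1) / temperature    # wrong action text

        # Index 0 (the true pair) must score higher than both negatives.
        logits = torch.stack([pos, neg_reversed, neg_antonym], dim=-1)
        labels = torch.zeros(logits.size(0), dtype=torch.long)
        return F.cross_entropy(logits, labels)

    # Toy usage: random embeddings standing in for Patcher outputs.
    B, D = 4, 256
    v = torch.randn(B, D, requires_grad=True)
    loss = dvdm_style_loss(v, torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    loss.backward()  # in practice, gradients would update only the Patcher

Penalizing the reversed clip is what ties the action text to the correct temporal ordering of frames; in the full framework, only the Knowledge Patcher (and later the Knowledge Fuser) would be trained while the underlying VidLM stays frozen.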
Pages: 21
Related Papers
Showing 10 of 50
  • [1] Verbs in Action: Improving verb understanding in video-language models
    Momeni, Liliane; Caron, Mathilde; Nagrani, Arsha; Zisserman, Andrew; Schmid, Cordelia
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 15533-15545
  • [2] DeVAn: Dense Video Annotation for Video-Language Models
    Liu, Tingkai; Tao, Yunzhe; Liu, Haogeng; Fan, Qihang; Zhou, Ding; Huang, Huaibo; He, Ran; Yang, Hongxia
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024: 14305-14321
  • [3] OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
    Wang, Junke; Chen, Dongdong; Wu, Zuxuan; Luo, Chong; Zhou, Luowei; Zhao, Yucheng; Xie, Yujia; Liu, Ce; Jiang, Yu-Gang; Yuan, Lu
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [4] Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering
    Yu, Ting; Fu, Kunhao; Wang, Shuhui; Huang, Qingming; Yu, Jun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35(2): 1615-1630
  • [5] Robustness Analysis of Video-Language Models Against Visual and Language Perturbations
    Schiappa, Madeline C.; Vyas, Shruti; Palangi, Hamid; Rawat, Yogesh S.; Vineet, Vibhav
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [6] Test of Time: Instilling Video-Language Models with a Sense of Time
    Bagad, Piyush; Tapaswi, Makarand; Snoek, Cees G. M.
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2023), 2023: 2503-2516
  • [7] Egocentric Video-Language Pretraining
    Lin, Kevin Qinghong; Wang, Alex Jinpeng; Soldan, Mattia; Wray, Michael; Yan, Rui; Xu, Eric Zhongcong; Gao, Difei; Tu, Rongcheng; Zhao, Wenzhe; Kong, Weijie; Cai, Chengfei; Wang, Hongfa; Damen, Dima; Ghanem, Bernard; Liu, Wei; Shou, Mike Zheng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [8] Revisiting the "Video" in Video-Language Understanding
    Buch, Shyamal; Eyzaguirre, Cristobal; Gaidon, Adrien; Wu, Jiajun; Fei-Fei, Li; Niebles, Juan Carlos
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 2907-2917
  • [9] Deep Video Understanding with Video-Language Model
    Liu, Runze; Fang, Yaqun; Yu, Fan; Tian, Ruiqi; Ren, Tongwei; Wu, Gangshan
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 9551-9555
  • [10] Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
    Wang, Zhenhailong; Li, Manling; Xu, Ruochen; Zhou, Luowei; Lei, Jie; Lin, Xudong; Wang, Shuohang; Yang, Ziyi; Zhu, Chenguang; Hoiem, Derek; Chang, Shih-Fu; Bansal, Mohit; Ji, Heng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022