PAXION: Patching Action Knowledge in Video-Language Foundation Models

Cited by: 0
Authors
Wang, Zhenhailong [1 ]
Blume, Ansel [1 ]
Li, Sha [1 ]
Liu, Genglin [1 ]
Cho, Jaemin [2 ]
Tang, Zineng [2 ]
Bansal, Mohit [2 ]
Ji, Heng [1 ]
Affiliations
[1] UIUC, Champaign, IL 61820 USA
[2] UNC, Chapel Hill, NC USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Action knowledge involves the understanding of textual, visual, and temporal aspects of actions. We introduce the Action Dynamics Benchmark (ActionBench), containing two carefully designed probing tasks: Action Antonym and Video Reversal, which target the model's multimodal alignment capabilities and temporal understanding skills, respectively. Despite recent video-language models' (VidLMs) impressive performance on various benchmark tasks, our diagnostic tasks reveal their surprising deficiency (near-random performance) in action knowledge, suggesting that current models rely on object recognition abilities as a shortcut for action understanding. To remedy this, we propose a novel framework, PAXION, along with a new Discriminative Video Dynamics Modeling (DVDM) objective. The PAXION framework utilizes a Knowledge Patcher network to encode new action knowledge and a Knowledge Fuser component to integrate the Patcher into frozen VidLMs without compromising their existing capabilities. Because the widely used Video-Text Contrastive (VTC) loss is limited in its ability to learn action knowledge, we introduce the DVDM objective to train the Knowledge Patcher. DVDM forces the model to encode the correlation between the action text and the correct ordering of video frames. Our extensive analyses show that PAXION and DVDM together effectively fill the gap in action knowledge understanding (~50% → 80%), while maintaining or improving performance on a wide spectrum of both object- and action-centric downstream tasks. The code and data will be made publicly available for research purposes at https://github.com/MikeWangWZHL/Paxion.git.
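The abstract describes the DVDM objective only at a high level. As a rough illustration (a minimal sketch, not the paper's actual implementation), the Python/PyTorch snippet below shows one way such a discriminative objective could be set up: the true (video, action-text) pair is contrasted against two hard negatives, the same clip with its frame order reversed and the same clip paired with an action-antonym caption, mirroring the two ActionBench probes. All names here (dvdm_style_loss, the temperature value, the embedding shapes) are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def dvdm_style_loss(video_emb, rev_video_emb, text_emb, antonym_emb,
                        temperature=0.07):
        """Sketch of a DVDM-style contrastive loss (hypothetical).

        Each argument is a (batch, dim) embedding; in PAXION's setting these
        would come from a frozen VidLM plus a trainable Knowledge Patcher.
        """
        # Cosine similarities via L2-normalized dot products.
        v, vr = F.normalize(video_emb, dim=-1), F.normalize(rev_video_emb, dim=-1)
        t, ta = F.normalize(text_emb, dim=-1), F.normalize(antonym_emb, dim=-1)

        pos = (v * t).sum(-1) / temperature             # correct pairing
        neg_reversed = (vr * t).sum(-1) / temperature   # wrong frame order
        neg_antonym = (v * ta).sum(-1) / temperature    # wrong action text

        # Index 0 (the true pair) must score higher than both negatives.
        logits = torch.stack([pos, neg_reversed, neg_antonym], dim=-1)
        labels = torch.zeros(logits.size(0), dtype=torch.long)
        return F.cross_entropy(logits, labels)

    # Toy usage: random embeddings standing in for Patcher outputs.
    B, D = 4, 256
    v = torch.randn(B, D, requires_grad=True)
    loss = dvdm_style_loss(v, torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    loss.backward()  # in practice, gradients would update only the Patcher

Penalizing the reversed clip is what ties the action text to the correct temporal ordering of frames; in the full framework, only the Knowledge Patcher (and later the Knowledge Fuser) would be trained while the underlying VidLM stays frozen.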
Pages: 21
Related Papers
Showing 10 of 50
  • [1] Verbs in Action: Improving verb understanding in video-language models
    Momeni, Liliane; Caron, Mathilde; Nagrani, Arsha; Zisserman, Andrew; Schmid, Cordelia
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 15533-15545
  • [2] DeVAn: Dense Video Annotation for Video-Language Models
    Liu, Tingkai; Tao, Yunzhe; Liu, Haogeng; Fan, Qihang; Zhou, Ding; Huang, Huaibo; He, Ran; Yang, Hongxia
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024: 14305-14321
  • [3] OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
    Wang, Junke; Chen, Dongdong; Wu, Zuxuan; Luo, Chong; Zhou, Luowei; Zhao, Yucheng; Xie, Yujia; Liu, Ce; Jiang, Yu-Gang; Yuan, Lu
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [4] Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering
    Yu, Ting; Fu, Kunhao; Wang, Shuhui; Huang, Qingming; Yu, Jun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35(2): 1615-1630
  • [5] Robustness Analysis of Video-Language Models Against Visual and Language Perturbations
    Schiappa, Madeline C.; Vyas, Shruti; Palangi, Hamid; Rawat, Yogesh S.; Vineet, Vibhav
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [6] Test of Time: Instilling Video-Language Models with a Sense of Time
    Bagad, Piyush; Tapaswi, Makarand; Snoek, Cees G. M.
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2023), 2023: 2503-2516
  • [7] Egocentric Video-Language Pretraining
    Lin, Kevin Qinghong; Wang, Alex Jinpeng; Soldan, Mattia; Wray, Michael; Yan, Rui; Xu, Eric Zhongcong; Gao, Difei; Tu, Rongcheng; Zhao, Wenzhe; Kong, Weijie; Cai, Chengfei; Wang, Hongfa; Damen, Dima; Ghanem, Bernard; Liu, Wei; Shou, Mike Zheng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [8] Revisiting the "Video" in Video-Language Understanding
    Buch, Shyamal; Eyzaguirre, Cristobal; Gaidon, Adrien; Wu, Jiajun; Fei-Fei, Li; Niebles, Juan Carlos
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 2907-2917
  • [9] Deep Video Understanding with Video-Language Model
    Liu, Runze; Fang, Yaqun; Yu, Fan; Tian, Ruiqi; Ren, Tongwei; Wu, Gangshan
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 9551-9555
  • [10] Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
    Wang, Zhenhailong; Li, Manling; Xu, Ruochen; Zhou, Luowei; Lei, Jie; Lin, Xudong; Wang, Shuohang; Yang, Ziyi; Zhu, Chenguang; Hoiem, Derek; Chang, Shih-Fu; Bansal, Mohit; Ji, Heng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022