PAXION: Patching Action Knowledge in Video-Language Foundation Models

Cited by: 0
Authors
Wang, Zhenhailong [1 ]
Blume, Ansel [1 ]
Li, Sha [1 ]
Liu, Genglin [1 ]
Cho, Jaemin [2 ]
Tang, Zineng [2 ]
Bansal, Mohit [2 ]
Ji, Heng [1 ]
Affiliations
[1] UIUC, Champaign, IL 61820 USA
[2] UNC, Chapel Hill, NC USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Keywords
DOI
N/A
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Action knowledge involves the understanding of textual, visual, and temporal aspects of actions. We introduce the Action Dynamics Benchmark (ActionBench), containing two carefully designed probing tasks: Action Antonym and Video Reversal, which target the model's multimodal alignment capabilities and temporal understanding skills, respectively. Despite the impressive performance of recent video-language models (VidLMs) on various benchmark tasks, our diagnostic tasks reveal their surprising deficiency (near-random performance) in action knowledge, suggesting that current models rely on object recognition abilities as a shortcut for action understanding. To remedy this, we propose a novel framework, PAXION, along with a new Discriminative Video Dynamics Modeling (DVDM) objective. The PAXION framework utilizes a Knowledge Patcher network to encode new action knowledge and a Knowledge Fuser component to integrate the Patcher into frozen VidLMs without compromising their existing capabilities. Because the widely used Video-Text Contrastive (VTC) loss is limited for learning action knowledge, we introduce the DVDM objective to train the Knowledge Patcher. DVDM forces the model to encode the correlation between the action text and the correct ordering of video frames. Our extensive analyses show that PAXION and DVDM together effectively fill the gap in action knowledge understanding (~50% → 80%), while maintaining or improving performance on a wide spectrum of both object- and action-centric downstream tasks. The code and data will be made publicly available for research purposes at https://github.com/MikeWangWZHL/Paxion.git.
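The two probes named in the abstract (Action Antonym and Video Reversal) translate naturally into a discriminative training signal. The PyTorch snippet below is a minimal illustrative sketch, not the authors' released implementation (see the repository linked above): it assumes precomputed embeddings from a frozen VidLM plus a small trainable patcher head, and every name in it (dvdm_style_loss, the temperature 0.07, the encode_video/encode_text/antonym_swap helpers in the usage comment) is a hypothetical stand-in.

import torch
import torch.nn.functional as F

def dvdm_style_loss(video_emb, rev_video_emb, text_emb, antonym_text_emb, tau=0.07):
    """Contrastive loss with action-antonym captions and frame-reversed clips
    as hard negatives (illustrative sketch of a DVDM-style objective)."""
    # Normalize so dot products are cosine similarities.
    video_emb = F.normalize(video_emb, dim=-1)                # (B, D) correctly ordered clips
    rev_video_emb = F.normalize(rev_video_emb, dim=-1)        # (B, D) frame-reversed clips
    text_emb = F.normalize(text_emb, dim=-1)                  # (B, D) action captions
    antonym_text_emb = F.normalize(antonym_text_emb, dim=-1)  # (B, D) antonym captions

    B = video_emb.size(0)
    labels = torch.arange(B, device=video_emb.device)  # positive pairs sit on the diagonal

    # Video -> text: each clip must pick its own caption out of all in-batch
    # captions plus all antonym-swapped captions (Action Antonym negatives).
    logits_v2t = video_emb @ torch.cat([text_emb, antonym_text_emb]).T / tau  # (B, 2B)
    loss_antonym = F.cross_entropy(logits_v2t, labels)

    # Text -> video: each caption must pick the correctly ordered clip out of
    # all in-batch clips plus all reversed clips (Video Reversal negatives).
    logits_t2v = text_emb @ torch.cat([video_emb, rev_video_emb]).T / tau     # (B, 2B)
    loss_reversal = F.cross_entropy(logits_t2v, labels)

    return loss_antonym + loss_reversal

# Hypothetical usage with a frozen backbone and a trainable patcher head,
# assuming video tensors of shape (B, T, ...) so dim 1 is the frame axis:
# loss = dvdm_style_loss(patcher(vidlm.encode_video(v)),
#                        patcher(vidlm.encode_video(v.flip(dims=[1]))),
#                        vidlm.encode_text(t),
#                        vidlm.encode_text(antonym_swap(t)))

Note the asymmetry in this sketch: reversed clips are appended only as text-to-video negatives, because a reversed clip still depicts the same objects as its caption; only the frame ordering distinguishes it from the positive, which is precisely the object-recognition shortcut the abstract says VTC-trained models exploit.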
Pages: 21
Related Papers
50 records in total
  • [31] Zhang, Haonan; Gao, Lianli; Zeng, Pengpeng; Hanjalic, Alan; Shen, Heng Tao. Depth-Aware Sparse Transformer for Video-Language Learning. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 4778-4787
  • [32] Huang, Jingjia; Li, Yinan; Feng, Jiashi; Wu, Xinglong; Sun, Xiaoshuai; Ji, Rongrong. Clover: Towards A Unified Video-Language Alignment and Fusion Model. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023: 14856-14866
  • [33] Yang, Xu; Li, Zhangzikang; Xu, Haiyang; Zhang, Hanwang; Ye, Qinghao; Li, Chenliang; Yan, Ming; Zhang, Yu; Huang, Fei; Huang, Songfang. Learning Trajectory-Word Alignments for Video-Language Tasks. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2023: 2504-2514
  • [34] Shao, Bin; Liu, Jianzhuang; Pei, Renjing; Xu, Songcen; Dai, Peng; Lu, Juwei; Li, Weimian; Yan, Youliang. HiVLP: Hierarchical Interactive Video-Language Pre-Training. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 13710-13720
  • [35] Ruan, Ludan; Jin, Qin. Survey: Transformer based video-language pre-training. AI OPEN, 2022, 3: 1-13
  • [36] Bansal, Hritik; Bitton, Yonatan; Szpektor, Idan; Chang, Kai-Wei; Grover, Aditya. VideoCon: Robust Video-Language Alignment via Contrast Captions. 2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024: 13927-13937
  • [37] Wang, Alex Jinpeng; Ge, Yixiao; Cai, Guanyu; Yan, Rui; Lin, Xudong; Shan, Ying; Qie, Xiaohu; Shou, Mike Zheng. Object-aware Video-language Pre-training for Retrieval. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 3303-3312
  • [38] Zhong, Weihong; Zheng, Mao; Tang, Duyu; Luo, Xuan; Gong, Heng; Feng, Xiaocheng; Qin, Bing. STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023: 3715-3723
  • [39] Cui, Chenhao; Liang, Xinnian; Wu, Shuangzhi; Li, Zhoujun. Learning Unified Video-Language Representations via Joint Modeling and Contrastive Learning for Natural Language Video Localization. 2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2023
  • [40] Xue, Hongwei; Hang, Tiankai; Zeng, Yanhong; Sun, Yuchong; Liu, Bei; Yang, Huan; Fu, Jianlong; Guo, Baining. Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 5026-5035