VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models

Cited by: 0
Authors
Zhou, Wangchunshu [1]
Zeng, Yan [1]
Diao, Shizhe [2]
Zhang, Xinsong [1]
Affiliations
[1] ByteDance AI Lab, Beijing, People's Republic of China
[2] Hong Kong University of Science and Technology, Hong Kong, People's Republic of China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent advances in vision-language pre-training (VLP) have demonstrated impressive performance in a range of vision-language (VL) tasks. However, several challenges remain in measuring the community's progress toward general multi-modal intelligence. First, most downstream VL datasets are annotated using raw images that have already been seen during pre-training, which may result in an overestimation of current VLP models' generalization ability. Second, recent VLP work mainly focuses on absolute performance but overlooks the efficiency-performance trade-off, which is also an important indicator of progress. To this end, we introduce the Vision-Language Understanding Evaluation (VLUE) benchmark, a multi-task, multi-dimensional benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off ("Pareto SOTA") of VLP models. We demonstrate that there is a sizable generalization gap for all VLP models when they are tested on out-of-distribution test sets annotated on images from a more diverse distribution that spreads across cultures. Moreover, we find that measuring the efficiency-performance trade-off of VLP models leads to complementary insights for several design choices of VLP. We release the VLUE benchmark to promote research on building vision-language models that generalize well to more diverse images and concepts unseen during pre-training, and that are practical in terms of the efficiency-performance trade-off.
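The two quantities named in the abstract, the generalization gap and the efficiency-performance trade-off ("Pareto SOTA"), can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the model names, latencies, and scores are hypothetical placeholders rather than results from the benchmark, and the function names `pareto_front` and `generalization_gap` are not taken from the VLUE release. A model is treated here as Pareto-optimal when no other model is both at least as fast and at least as accurate.

```python
from typing import List, NamedTuple


class ModelPoint(NamedTuple):
    name: str
    latency_ms: float  # inference cost, lower is better (hypothetical unit)
    score: float       # task score, higher is better


def pareto_front(points: List[ModelPoint]) -> List[ModelPoint]:
    """Return the models that no other model beats on both latency and score."""
    front: List[ModelPoint] = []
    best_score = float("-inf")
    # Visit models from fastest to slowest; on equal latency, best score first.
    for p in sorted(points, key=lambda p: (p.latency_ms, -p.score)):
        # A slower model stays only if it strictly improves on every faster one.
        if p.score > best_score:
            front.append(p)
            best_score = p.score
    return front


def generalization_gap(in_domain_score: float, ood_score: float) -> float:
    """In-domain score minus out-of-distribution score (positive means a drop)."""
    return in_domain_score - ood_score


# Hypothetical numbers for illustration only, not taken from the paper.
points = [
    ModelPoint("model_a", latency_ms=10.0, score=60.0),
    ModelPoint("model_b", latency_ms=20.0, score=70.0),
    ModelPoint("model_c", latency_ms=25.0, score=65.0),  # slower and worse than model_b
    ModelPoint("model_d", latency_ms=40.0, score=72.0),
]
front = pareto_front(points)
print([p.name for p in front])
print(generalization_gap(in_domain_score=75.0, ood_score=68.5))
```

In this sketch `model_c` drops out because `model_b` is both faster and more accurate, while `model_d` stays on the front despite being the slowest because it has the highest score. The benchmark may measure efficiency with a different metric than latency.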
Pages: 17