VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models

Cited by: 0
Authors
Zhou, Wangchunshu [1]
Zeng, Yan [1]
Diao, Shizhe [2]
Zhang, Xinsong [1]
Affiliations
[1] ByteDance AI Lab, Beijing, Peoples R China
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Keywords
DOI
Not available
CLC classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Recent advances in vision-language pre-training (VLP) have demonstrated impressive performance on a range of vision-language (VL) tasks. However, several challenges remain in measuring the community's progress toward general multi-modal intelligence. First, most downstream VL datasets are annotated on raw images that were already seen during pre-training, which may lead to an overestimation of current VLP models' generalization ability. Second, recent VLP work focuses mainly on absolute performance and overlooks the efficiency-performance trade-off, which is also an important indicator of progress. To this end, we introduce the Vision-Language Understanding Evaluation (VLUE) benchmark, a multi-task, multi-dimensional benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off ("Pareto SOTA") of VLP models. We demonstrate that all VLP models show a sizable generalization gap when tested on out-of-distribution test sets annotated on images drawn from a more diverse distribution spanning cultures. Moreover, we find that measuring the efficiency-performance trade-off of VLP models yields complementary insights into several VLP design choices. We release the VLUE benchmark to promote research on building vision-language models that generalize well to more diverse images and concepts unseen during pre-training, and that are practical in terms of the efficiency-performance trade-off.
Pages: 17
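The abstract evaluates models along two axes: the generalization gap between in-distribution and out-of-distribution test sets, and the efficiency-performance trade-off, where a "Pareto SOTA" model is one that no other model beats on both efficiency and performance at once. Below is a minimal sketch, not taken from the paper, of how these two quantities could be computed; the ModelResult fields, model names, and numbers are illustrative assumptions rather than VLUE results.

from dataclasses import dataclass
from typing import List


@dataclass
class ModelResult:
    name: str
    in_dist_score: float    # accuracy on the original (seen-image) test set
    ood_score: float        # accuracy on an out-of-distribution test set
    latency_ms: float       # inference cost; lower means more efficient

    @property
    def generalization_gap(self) -> float:
        # Drop from seen-distribution to OOD performance.
        return self.in_dist_score - self.ood_score


def pareto_frontier(results: List[ModelResult]) -> List[ModelResult]:
    """Return models not dominated in both efficiency and OOD performance.

    A model is dominated if some other model is at least as fast and at
    least as accurate, and strictly better on one of the two axes.
    """
    frontier = []
    for m in results:
        dominated = any(
            (o.latency_ms <= m.latency_ms and o.ood_score >= m.ood_score)
            and (o.latency_ms < m.latency_ms or o.ood_score > m.ood_score)
            for o in results
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m.latency_ms)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    results = [
        ModelResult("model-A", in_dist_score=78.0, ood_score=65.0, latency_ms=120.0),
        ModelResult("model-B", in_dist_score=75.0, ood_score=66.5, latency_ms=60.0),
        ModelResult("model-C", in_dist_score=73.0, ood_score=61.0, latency_ms=200.0),
    ]
    for m in results:
        print(f"{m.name}: generalization gap = {m.generalization_gap:.1f}")
    for m in pareto_frontier(results):
        print(f"Pareto-optimal: {m.name} ({m.latency_ms} ms, {m.ood_score} OOD)")

In this toy example model-B is the only Pareto-optimal point, since it is both faster and better on the OOD set than the other two; a larger or slower model would still make the frontier if nothing else matched it on both axes.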
Related Papers
50 in total
  • [21] Learning to Prompt for Vision-Language Models
    Zhou, Kaiyang
    Yang, Jingkang
    Loy, Chen Change
    Liu, Ziwei
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2022, 130 (09) : 2337 - 2348
  • [22] Vision-Language Models for Biomedical Applications
    Thapa, Surendrabikram
    Naseem, Usman
    Zhou, Luping
    Kim, Jinman
    PROCEEDINGS OF THE FIRST INTERNATIONAL WORKSHOP ON VISION-LANGUAGE MODELS FOR BIOMEDICAL APPLICATIONS, VLM4BIO 2024, 2024, : 1 - 2
  • [24] The Neglected Tails in Vision-Language Models
    Parashar, Shubham
    Lin, Zhiqiu
    Liu, Tian
    Dong, Xiangjue
    Li, Yanan
    Ramanan, Deva
    Caverlee, James
    Kong, Shu
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 12988 - 12997
  • [25] VISION-LANGUAGE MODELS AS SUCCESS DETECTORS
    Du, Yuqing
    Konyushkova, Ksenia
    Denil, Misha
    Raju, Akhil
    Landon, Jessica
    Hill, Felix
    de Freitas, Nando
    Cabi, Serkan
    CONFERENCE ON LIFELONG LEARNING AGENTS, VOL 232, 2023, 232 : 120 - 136
  • [26] Task-to-Instance Prompt Learning for Vision-Language Models at Test Time
    Lu, Zhihe
    Bai, Jiawang
    Li, Xin
    Xiao, Zeyu
    Wang, Xinchao
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2025, 34 : 1908 - 1920
  • [27] Debiasing vision-language models for vision tasks: a survey
    Zhu, Beier
    Zhang, Hanwang
    FRONTIERS OF COMPUTER SCIENCE, 2025, 19 (01)
  • [28] VisGraphVar: A benchmark generator for Assessing Variability in Graph Analysis Using Large Vision-Language Models
    Sartori, Camilo Chacon
    Blum, Christian
    Bistaffa, Filippo
    IEEE ACCESS, 2025, 13 : 21788 - 21810
  • [29] 12-in-1: Multi-Task Vision and Language Representation Learning
    Lu, Jiasen
    Goswami, Vedanuj
    Rohrbach, Marcus
    Parikh, Devi
    Lee, Stefan
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, : 10434 - 10443
  • [30] AutoDistiller: An Automatic Compression Method for Multi-task Language Models
    Wang, Hongsheng
    Xiao, Geyang
    Liang, Yuan
    2022 41ST CHINESE CONTROL CONFERENCE (CCC), 2022, : 2410 - 2415