RobustMixGen: Data augmentation for enhancing robustness of visual-language models in the presence of distribution shift

Cited by: 0
Authors
Kim, Sunwoo [1 ]
Im, Hun [2 ]
Lee, Woojun [1 ]
Lee, Seonggye [1 ]
Kang, Pilsung [2 ]
Affiliations
[1] Korea Univ, Sch Ind & Management Engn, Seoul, South Korea
[2] Seoul Natl Univ, Dept Ind Engn, Seoul, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Data augmentation; Distribution shift; Spurious correlation; Multimodal;
DOI
10.1016/j.neucom.2024.129167
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
With the increasing deployment of Vision-Language Models (VLMs) in real-world applications, there is growing interest in enhancing their robustness to noise. Data augmentation has emerged as a prominent approach for improving robustness, and in the context of VLMs, MixGen has been widely adopted. Despite its success in improving performance, our experiments indicate that MixGen significantly degrades performance under distribution shift conditions, primarily due to the model's reliance on spurious correlations induced by MixGen-augmented data. To address this limitation, we propose a novel augmentation method that enhances both model performance and robustness by mitigating the learning of spurious correlations. Our approach involves the pre-classification of object and background categories. For image synthesis, we introduce the CutMixup technique, while for text synthesis, we employ a conjunction concatenation strategy, both aimed at reducing the impact of spurious correlations. We evaluated the efficacy of our method using the COCO dataset, a large-scale benchmark comprising images and text. The effectiveness of our approach was assessed in a retrieval task under simulated distribution shift conditions. Our experimental results demonstrate the superiority of the proposed method, with a 17.11% improvement in the robustness metric (MMI) under distribution shift scenarios, establishing it as a more effective data augmentation technique. We would like to broaden the applicability of the augmentation method to various vision-language tasks beyond retrieval.
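A minimal sketch of the two synthesis steps described in the abstract, stated under assumptions: CutMixup is read here as pasting a CutMix-style rectangular region of one image into another and blending that region with Mixup-style interpolation, and conjunction concatenation as joining two captions with a connective such as "and". The function names, the `lam` parameter, and these exact formulations are illustrative guesses, not the authors' implementation.

```python
# Illustrative sketch only: the paper's exact CutMixup formulation is not
# given here, so region sizing follows CutMix and blending follows Mixup.
import numpy as np

def cutmixup(img_a: np.ndarray, img_b: np.ndarray, lam: float = 0.7) -> np.ndarray:
    """Blend a random rectangle of img_b into img_a (assumes equal shapes)."""
    h, w = img_a.shape[:2]
    # CutMix-style region whose area scales with (1 - lam).
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    y = np.random.randint(0, h - cut_h + 1)
    x = np.random.randint(0, w - cut_w + 1)
    out = img_a.astype(np.float32).copy()
    # Mixup-style interpolation restricted to the cut region.
    out[y:y + cut_h, x:x + cut_w] = (
        lam * img_a[y:y + cut_h, x:x + cut_w]
        + (1.0 - lam) * img_b[y:y + cut_h, x:x + cut_w]
    )
    return out.astype(img_a.dtype)

def concat_captions(cap_a: str, cap_b: str, conjunction: str = "and") -> str:
    """Join two non-empty captions with a conjunction rather than raw concatenation."""
    return f"{cap_a.rstrip('.')} {conjunction} {cap_b[0].lower()}{cap_b[1:]}"
```

For example, `concat_captions("A dog runs.", "A man sits on a bench.")` yields "A dog runs and a man sits on a bench.", keeping the synthesized caption grammatical rather than simply concatenating the two texts as MixGen does.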
Pages: 15
Related papers
43 in total
  • [1] Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
    Bugliarello, Emanuele
    Cotterell, Ryan
    Okazaki, Naoaki
    Elliott, Desmond
    [J]. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2021, 9 : 978 - 994
  • [2] Data augmentation for sentiment classification with semantic preservation and diversity
    Chao, Guoqing
    Liu, Jingyao
    Wang, Mingyu
    Chu, Dianhui
    [J]. KNOWLEDGE-BASED SYSTEMS, 2023, 280
  • [3] Chen M., 2022, arXiv, DOI 10.48550/arXiv.2204.06125
  • [4] Chen X., 2022, arXiv, DOI 10.48550/arXiv.2209.06794
  • [5] Randaugment: Practical automated data augmentation with a reduced search space
    Cubuk, Ekin D.
    Zoph, Barret
    Shlens, Jonathon
    Le, Quoc V.
    [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 3008 - 3017
  • [6] Ding Kaize, 2022, ACM SIGKDD Explorations Newsletter, P61, DOI 10.1145/3575637.3575646
  • [7] Euijong Whang S., 2021, arXiv
  • [8] Gur S., 2021, arXiv, DOI 10.48550/arXiv.2104.08108
  • [9] MixGen: A New Multi-Modal Data Augmentation
    Hao, Xiaoshuai
    Zhu, Yi
    Appalaraju, Srikar
    Zhang, Aston
    Zhang, Wanqian
    Li, Bo
    Li, Mu
    [J]. 2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS (WACVW), 2023, : 379 - 389
  • [10] Hazarika Devamanyu, 2022, arXiv