RobustMixGen: Data augmentation for enhancing robustness of visual-language models in the presence of distribution shift

Cited by: 0
Authors
Kim, Sunwoo [1 ]
Im, Hun [2 ]
Lee, Woojun [1 ]
Lee, Seonggye [1 ]
Kang, Pilsung [2 ]
Affiliations
[1] Korea Univ, Sch Ind & Management Engn, Seoul, South Korea
[2] Seoul Natl Univ, Dept Ind Engn, Seoul, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Data augmentation; Distribution shift; Spurious correlation; Multimodal;
DOI
10.1016/j.neucom.2024.129167
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
With the increasing deployment of Vision-Language Models (VLMs) in real-world applications, there is growing interest in enhancing their robustness to noise. Data augmentation has emerged as a prominent approach for improving robustness, and in the context of VLMs, MixGen has been widely adopted. Despite its success in improving performance, our experiments indicate that MixGen significantly degrades performance under distribution shift conditions, primarily due to the model's reliance on spurious correlations induced by MixGen-augmented data. To address this limitation, we propose a novel augmentation method that enhances both model performance and robustness by mitigating the learning of spurious correlations. Our approach involves the pre-classification of object and background categories. For image synthesis, we introduce the CutMixup technique, while for text synthesis, we employ a conjunction concatenation strategy, both aimed at reducing the impact of spurious correlations. We evaluated the efficacy of our method using the COCO dataset, a large-scale benchmark comprising images and text. The effectiveness of our approach was assessed in a retrieval task under simulated distribution shift conditions. Our experimental results demonstrate the superiority of the proposed method, with a 17.11% improvement in the robustness metric (MMI) under distribution shift scenarios, establishing it as a more effective data augmentation technique. We would like to broaden the applicability of the augmentation method to various vision-language tasks beyond retrieval.
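A minimal sketch of the two synthesis steps described in the abstract, stated under assumptions: CutMixup is read here as pasting a CutMix-style rectangular region of one image into another and blending that region with Mixup-style interpolation, and conjunction concatenation as joining two captions with a connective such as "and". The function names, the `lam` parameter, and these exact formulations are illustrative guesses, not the authors' implementation.

```python
# Illustrative sketch only: the paper's exact CutMixup formulation is not
# given here, so region sizing follows CutMix and blending follows Mixup.
import numpy as np

def cutmixup(img_a: np.ndarray, img_b: np.ndarray, lam: float = 0.7) -> np.ndarray:
    """Blend a random rectangle of img_b into img_a (assumes equal shapes)."""
    h, w = img_a.shape[:2]
    # CutMix-style region whose area scales with (1 - lam).
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    y = np.random.randint(0, h - cut_h + 1)
    x = np.random.randint(0, w - cut_w + 1)
    out = img_a.astype(np.float32).copy()
    # Mixup-style interpolation restricted to the cut region.
    out[y:y + cut_h, x:x + cut_w] = (
        lam * img_a[y:y + cut_h, x:x + cut_w]
        + (1.0 - lam) * img_b[y:y + cut_h, x:x + cut_w]
    )
    return out.astype(img_a.dtype)

def concat_captions(cap_a: str, cap_b: str, conjunction: str = "and") -> str:
    """Join two non-empty captions with a conjunction rather than raw concatenation."""
    return f"{cap_a.rstrip('.')} {conjunction} {cap_b[0].lower()}{cap_b[1:]}"
```

For example, `concat_captions("A dog runs.", "A man sits on a bench.")` yields "A dog runs and a man sits on a bench.", keeping the synthesized caption grammatical rather than simply concatenating the two texts as MixGen does.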
Pages: 15
Related papers
43 in total
  • [1] Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
    Bugliarello, Emanuele
    Cotterell, Ryan
    Okazaki, Naoaki
    Elliott, Desmond
    [J]. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2021, 9 : 978 - 994
  • [2] Data augmentation for sentiment classification with semantic preservation and diversity
    Chao, Guoqing
    Liu, Jingyao
    Wang, Mingyu
    Chu, Dianhui
    [J]. KNOWLEDGE-BASED SYSTEMS, 2023, 280
  • [3] Chen M., 2022, arXiv, DOI 10.48550/arXiv.2204.06125
  • [4] Chen X., 2022, arXiv, DOI 10.48550/arXiv.2209.06794
  • [5] Randaugment: Practical automated data augmentation with a reduced search space
    Cubuk, Ekin D.
    Zoph, Barret
    Shlens, Jonathon
    Le, Quoc V.
    [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 3008 - 3017
  • [6] Ding Kaize, 2022, ACM SIGKDD Explorations Newsletter, P61, DOI 10.1145/3575637.3575646
  • [7] Euijong Whang S., 2021, arXiv
  • [8] Gur S., 2021, arXiv, DOI 10.48550/arXiv.2104.08108
  • [9] MixGen: A New Multi-Modal Data Augmentation
    Hao, Xiaoshuai
    Zhu, Yi
    Appalaraju, Srikar
    Zhang, Aston
    Zhang, Wanqian
    Li, Bo
    Li, Mu
    [J]. 2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS (WACVW), 2023, : 379 - 389
  • [10] Hazarika Devamanyu, 2022, arXiv