VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

Times Cited: 0
Authors
Yin, Ziyi [1 ]
Ye, Muchao [1 ]
Zhang, Tianrong [1 ]
Du, Tianyu [2 ]
Zhu, Jinguo [3 ]
Liu, Han [4 ]
Chen, Jinghui [1 ]
Wang, Ting [5 ]
Ma, Fenglong [1 ]
Affiliations
[1] Penn State Univ, University Pk, PA 16802 USA
[2] Zhejiang Univ, Hangzhou, Peoples R China
[3] Xi An Jiao Tong Univ, Xian, Peoples R China
[4] Dalian Univ Technol, Dalian, Peoples R China
[5] SUNY Stony Brook, Stony Brook, NY USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Funding
National Science Foundation (USA)
Keywords
None listed
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks. However, the adversarial robustness of such models has not been fully explored. Existing approaches mainly focus on the white-box setting, which is unrealistic in practice. In this paper, we investigate a new yet practical task: crafting image and text perturbations using pre-trained VL models to attack black-box fine-tuned models on different downstream tasks. Towards this end, we propose VLATTACK, which generates adversarial samples by fusing perturbations of images and texts at both the single-modal and multimodal levels. At the single-modal level, we propose a new block-wise similarity attack (BSA) strategy that learns image perturbations to disrupt universal representations; in addition, we adopt an existing text attack strategy to generate text perturbations independently of the image-modal attack. At the multimodal level, we design a novel iterative cross-search attack (ICSA) method that periodically updates adversarial image-text pairs, starting from the outputs of the single-modal level. We conduct extensive experiments attacking five widely used VL pre-trained models on six tasks. Experimental results show that VLATTACK achieves the highest attack success rates on all tasks compared with state-of-the-art baselines, revealing a blind spot in the deployment of pre-trained VL models.
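The abstract only names the two attack components, so the following is a minimal, hypothetical PyTorch sketch of the block-wise similarity attack (BSA) idea: a PGD-style loop that pushes the adversarial image's per-block encoder features away from the clean ones, disrupting the universal representation rather than any task-specific head. The encoder interface (assumed here to return a list of per-block feature tensors), the L-infinity budget, the step size, and every name below are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def bsa_attack(encoder, image, eps=8 / 255, alpha=2 / 255, steps=10):
    """Illustrative PGD-style sketch of a block-wise similarity attack.

    encoder: a frozen vision backbone assumed (for this sketch) to return a
             list of per-block feature tensors for an input batch.
    image:   clean images in [0, 1], shape (B, C, H, W).
    """
    # Cache the clean per-block features once; the encoder stays frozen.
    with torch.no_grad():
        clean_feats = [f.detach() for f in encoder(image)]

    # Random start inside the L-inf ball of radius eps.
    delta = torch.empty_like(image).uniform_(-eps, eps).requires_grad_(True)

    for _ in range(steps):
        adv_feats = encoder((image + delta).clamp(0, 1))
        # Total cosine similarity between clean and adversarial features,
        # summed over blocks; minimizing it disrupts the shared representation.
        sim = sum(
            F.cosine_similarity(a.flatten(1), c.flatten(1), dim=1).mean()
            for a, c in zip(adv_feats, clean_feats)
        )
        sim.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # gradient *descent* on similarity
            delta.clamp_(-eps, eps)             # project back into the eps-ball
        delta.grad = None

    return (image + delta).clamp(0, 1).detach()

At the multimodal level, the iterative cross-search attack (ICSA) described in the abstract would then alternate between image updates of this kind and discrete text substitutions, querying the black-box fine-tuned model after each update and keeping the first image-text pair that changes its prediction; that outer loop is omitted from this sketch.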
Pages: 21
Related Papers
50 records in total
[31] Wang, Jiayi; Bao, Rongzhou; Zhang, Zhuosheng; Zhao, Hai. Rethinking Textual Adversarial Defense for Pre-Trained Language Models. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30: 2526-2540.
[32] Du, Wei; Li, Peixuan; Zhao, Haodong; Ju, Tianjie; Ren, Ge; Liu, Gongshen. UOR: Universal Backdoor Attacks on Pre-trained Language Models. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024: 7865-7877.
[33] Agarwal, Oshin; Nenkova, Ani. Temporal Effects on Pre-trained Models for Language Processing Tasks. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2022, 10: 904-921.
[34] Li, Belinda Z.; Yu, Jane; Khabsa, Madian; Zettlemoyer, Luke; Halevy, Alon; Andreas, Jacob. Quantifying Adaptability in Pre-trained Language Models with 500 Tasks. NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022: 4696-4715.
[35] Shi, Jiang-Xin; Zhang, Chi; Wei, Tong; Li, Yu-Feng. Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model. PROCEEDINGS OF THE 30TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2024, 2024: 2663-2673.
[36] Wu, Wenhao; Wang, Xiaohan; Luo, Haipeng; Wang, Jingdong; Yang, Yi; Ouyang, Wanli. Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023: 6620-6630.
[37] Tang, Longxiang; Tian, Zhuotao; Li, Kai; He, Chunming; Zhou, Hantao; Zhao, Hengshuang; Li, Xiu; Jia, Jiaya. Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models. COMPUTER VISION - ECCV 2024, PT XXXVI, 2025, 15094: 346-365.
[38] Kawaharazuka, Kento; Obinata, Yoshiki; Kanazawa, Naoaki; Okada, Kei; Inaba, Masayuki. Robotic environmental state recognition with pre-trained vision-language models and black-box optimization. ADVANCED ROBOTICS, 2024, 38 (18): 1255-1264.
[39] Zhang, Jingyi; Huang, Jiaxing; Jin, Sheng; Lu, Shijian. Vision-Language Models for Vision Tasks: A Survey. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (08): 5625-5644.
[40] Seok, Byoungjin; Sohn, Kiwook. Adversarial Attacks on Pre-trained Deep Learning Models for Encrypted Traffic Analysis. JOURNAL OF WEB ENGINEERING, 2024, 23 (06): 749-768.