Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Models

Cited by: 0

Authors:

Cheng, Hao [1]

Xiao, Erjia [1]

Gu, Jindong [2]

Yang, Le [3]

Duan, Jinhao [4]

Zhang, Jize [5]

Cao, Jiahang [1]

Xu, Kaidi [4]

Xu, Renjing [1]

Affiliations:

[1] Hong Kong Univ Sci & Technol Guangzhou, Hong Kong, Peoples R China

[2] Univ Oxford, Oxford, England

[3] Xi An Jiao Tong Univ, Xian, Peoples R China

[4] Drexel Univ, Philadelphia, PA USA

[5] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China

Source:

COMPUTER VISION - ECCV 2024, PT LIX | 2025 / Vol. 15117

Keywords:

Vision-Language Model; Typographic Attack; Attention

DOI:

10.1007/978-3-031-73202-7_11

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Large Vision-Language Models (LVLMs) rely on vision encoders and Large Language Models (LLMs) to exhibit remarkable capabilities on various multi-modal tasks in the joint space of vision and language. However, typographic attacks, which disrupt Vision-Language Models (VLMs) such as Contrastive Language-Image Pretraining (CLIP), are also expected to pose a security threat to LVLMs. First, we verify typographic attacks on current well-known commercial and open-source LVLMs and uncover the widespread existence of this threat. Second, to better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic Dataset to date. The Typographic Dataset not only evaluates typographic attacks under various multi-modal tasks but also evaluates how their effects are influenced by texts generated with diverse factors. Based on the evaluation results, we investigate why typographic attacks affect VLMs and LVLMs, leading to three highly insightful discoveries. In the process of further validating these discoveries, we reduce the performance degradation caused by typographic attacks from 42.07% to 13.90%. Code and Dataset are available at https://github.com/ChaduCheng/TypoDeceptions.

Pages: 179-196

Page count: 18
