Exposing fake images generated by text-to-image diffusion models

Cited by: 0
Authors
Xu, Qiang [1,2]
Wang, Hao [3]
Meng, Laijin [4]
Mi, Zhongjie [4]
Yuan, Jianye [5]
Yan, Hong [1,2]
Affiliations
[1] City Univ Hong Kong, Dept Elect Engn, Kowloon, Hong Kong, Peoples R China
[2] City Univ Hong Kong, Ctr Intelligent Multidimens Data Anal, Kowloon, Hong Kong, Peoples R China
[3] Chongqing Univ Posts & Telecommun, Coll Comp Sci & Technol, Chongqing 400065, Peoples R China
[4] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Shanghai 200240, Peoples R China
[5] Wuhan Univ, Sch Elect Informat, Wuhan 473072, Peoples R China
Keywords
Text-to-image; Diffusion models (DM); Image forensics; Attention mechanism; Vision transformers (ViTs)
DOI: not available
CLC number: TP18 [Theory of Artificial Intelligence]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Text-to-image diffusion models (DM) have posed unprecedented challenges to the authenticity and integrity of digital images, making the detection of computer-generated images one of the most important image forensics techniques. However, the detection of images generated by text-to-image diffusion models is rarely reported in the literature. To tackle this issue, we first analyze the acquisition process of DM images. We then construct a hybrid neural network built on an attention-guided feature extraction (AGFE) module and a vision transformer (ViT)-based feature extraction (ViTFE) module. The AGFE module adopts an attention mechanism to capture long-range feature interactions and boost the representation capability. The ViTFE module, which stacks a MobileNetV2 (MNV2) block and MobileViT blocks in sequence, is designed to learn global representations. Extensive experiments on different types of generated images demonstrate the effectiveness and robustness of our method in exposing fake images generated by text-to-image diffusion models.
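As a rough illustration of this high-level design, the PyTorch sketch below chains an AGFE-style convolutional stem with self-attention, a ViTFE-style stage (a MobileNetV2 inverted-residual block followed by a simplified MobileViT-style block), and a binary real/fake head. All layer widths, block counts, and the exact attention and fusion forms are assumptions for illustration; the record above only describes the architecture at a high level.

# Minimal sketch of the described hybrid detector (hypothetical sizes/blocks).
import torch
import torch.nn as nn

class AGFE(nn.Module):
    """Attention-guided feature extraction: conv features refined by
    self-attention to capture long-range interactions (assumed form)."""
    def __init__(self, in_ch=3, ch=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.BatchNorm2d(ch), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.BatchNorm2d(ch), nn.SiLU(),
        )
        self.attn = nn.MultiheadAttention(embed_dim=ch, num_heads=4, batch_first=True)

    def forward(self, x):
        f = self.conv(x)                             # (B, C, H, W)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)        # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        return attn_out.transpose(1, 2).reshape(b, c, h, w) + f

class MNV2Block(nn.Module):
    """MobileNetV2-style inverted residual block."""
    def __init__(self, ch, expand=4):
        super().__init__()
        hidden = ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, 1), nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, ch, 1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)

class MobileViTBlock(nn.Module):
    """Simplified MobileViT-style block: local conv plus a global transformer
    over flattened positions (patch folding/unfolding omitted for brevity)."""
    def __init__(self, ch, depth=2):
        super().__init__()
        self.local = nn.Conv2d(ch, ch, 3, padding=1)
        layer = nn.TransformerEncoderLayer(d_model=ch, nhead=4,
                                           dim_feedforward=2 * ch, batch_first=True)
        self.global_enc = nn.TransformerEncoder(layer, num_layers=depth)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        local = self.local(x)
        tokens = local.flatten(2).transpose(1, 2)    # (B, H*W, C)
        glob = self.global_enc(tokens).transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([x, glob], dim=1))

class DMImageDetector(nn.Module):
    """AGFE stem, then a ViTFE stage (MNV2 + MobileViT), then a binary head."""
    def __init__(self, ch=64):
        super().__init__()
        self.agfe = AGFE(ch=ch)
        self.vitfe = nn.Sequential(MNV2Block(ch), MobileViTBlock(ch))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, 2))

    def forward(self, x):
        return self.head(self.vitfe(self.agfe(x)))

if __name__ == "__main__":
    logits = DMImageDetector()(torch.randn(2, 3, 256, 256))  # real vs. DM-generated
    print(logits.shape)  # torch.Size([2, 2])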
Pages: 76-82 (7 pages)