Learning From Text: A Multimodal Face Inpainting Network for Irregular Holes

Times Cited: 1
Authors
Zhan, Dandan [1 ]
Wu, Jiahao [1 ]
Luo, Xing [2 ]
Jin, Zhi [1 ,3 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Intelligent Syst Engn, Shenzhen Campus, Shenzhen 518107, Guangdong, Peoples R China
[2] Peng Cheng Lab, Dept Math & Theories, Shenzhen 518055, Guangdong, Peoples R China
[3] Guangdong Prov Key Lab Fire Sci & Technol, Guangzhou 510006, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Faces; Face recognition; Feature extraction; Visualization; Transformers; Task analysis; Semantics; Face inpainting; irregular hole; multimodality; text description;
DOI
10.1109/TCSVT.2024.3370578
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
Irregular hole face inpainting is a challenging task, since the appearance of faces varies greatly (e.g., different expressions and poses) and human vision is more sensitive to subtle blemishes in inpainted face images. Without external information, most existing methods struggle to generate new content containing semantic information for face components when sufficient contextual information is absent. Text, which can describe the content of an image in most cases, is flexible and user-friendly. In this work, a concise and effective Multimodal Face Inpainting Network (MuFIN) is proposed, which simultaneously utilizes the known regions and the descriptive text of the input image to address irregular hole face inpainting. To fully exploit the remaining parts of corrupted face images, a plug-and-play Multi-scale Multi-level Skip Fusion Module (MMSFM) is designed, which extracts multi-scale features and fuses shallow features into deep features at multiple levels. Moreover, to bridge the gap between textual and visual modalities and effectively fuse cross-modal features, a Multi-scale Text-Image Fusion Block (MTIFB) is developed, which incorporates text features into image features at both local and global scales. Extensive experiments conducted on two commonly used datasets, CelebA and Multi-Modal-CelebA-HQ, demonstrate that our method outperforms state-of-the-art methods both qualitatively and quantitatively, and generates realistic and controllable results.
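The abstract describes the MTIFB as incorporating text features into image features. A common generic mechanism for this kind of cross-modal fusion is cross-attention, in which image tokens query text tokens. The sketch below illustrates that generic idea only; it is not the paper's actual MTIFB implementation, and all function and variable names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(img_feats, txt_feats):
    """Generic text-to-image fusion sketch (illustrative, not MuFIN's MTIFB).

    img_feats: (N_img, d) image tokens acting as queries.
    txt_feats: (N_txt, d) text tokens acting as keys/values.
    Returns image features with text information mixed in via a
    residual connection.
    """
    d = img_feats.shape[-1]
    scores = img_feats @ txt_feats.T / np.sqrt(d)  # (N_img, N_txt)
    attn = softmax(scores, axis=-1)                # each image token attends to text
    return img_feats + attn @ txt_feats            # residual cross-modal fusion
```

With all-zero text features, the attention contribution vanishes and the image features pass through unchanged, which is one reason a residual formulation is a common design choice for optional conditioning signals.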
Pages: 7484-7497
Page count: 14