Open-Vocabulary Text-Driven Human Image Generation

Cited by: 1
Authors
Zhang, Kaiduo [1,2]
Sun, Muyi [1,3]
Sun, Jianxin [1,2]
Zhang, Kunbo [1,2]
Sun, Zhenan [1,2]
Tan, Tieniu [1,2,4]
Affiliations
[1] CASIA, CRIPAC, MAIS, Beijing 100190, Peoples R China
[2] UCAS, Sch AI, Beijing 101408, Peoples R China
[3] BUPT, Sch AI, Beijing 100875, Peoples R China
[4] Nanjing Univ, Nanjing 210008, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Multi-modal biometric analysis; Human image generation; Text-to-human generation; Human image editing; Manipulation
DOI
10.1007/s11263-024-02079-7
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Generating human images from open-vocabulary text descriptions is an exciting but challenging task. Previous methods (e.g., Text2Human) face two problems: (1) they cannot handle the open-vocabulary setting with arbitrary text inputs (e.g., unseen clothing appearances) well and rely heavily on a limited set of preset words (e.g., pattern styles of clothing appearances); (2) the generated human images are inaccurate in open-vocabulary settings. To alleviate these drawbacks, we propose a flexible diffusion-based framework, HumanDiffusion, for open-vocabulary text-driven human image generation (HIG). The proposed framework consists of two novel modules: the Stylized Memory Retrieval (SMR) module and the Multi-scale Feature Mapping (MFM) module. Using the vision-language pretrained CLIP model as an encoder, we first obtain coarse features of the local human appearance. The SMR module then refines these initial coarse features with an external database containing clothing texture details; through this refinement, HIG can be performed with arbitrary text inputs, greatly expanding the range of expressible styles. Finally, the MFM module, embedded in the diffusion backbone, learns fine-grained appearance features, achieving precise, semantically coherent alignment between different body parts and their appearance features, and thus an accurate rendering of the desired human appearance. The seamless combination of these modules allows HumanDiffusion to perform freestyle, high-accuracy text-guided HIG and editing. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance, especially in the open-vocabulary setting.
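This record contains no code, but the retrieval step described in the abstract can be illustrated with a minimal sketch. The following is a hypothetical PyTorch illustration, not the authors' implementation, of CLIP-feature-based retrieval in the spirit of the SMR module: coarse text embeddings are refined with the most similar entries from an external bank of clothing-texture embeddings. All identifiers (retrieve_styles, memory_bank, top_k) and the similarity-weighted fusion are assumptions made for illustration.

```python
# Hypothetical sketch of SMR-style retrieval: refine coarse CLIP text
# features with the nearest texture features from an external memory bank.
# This is NOT the paper's code; names and the fusion rule are assumptions.
import torch
import torch.nn.functional as F

def retrieve_styles(text_features: torch.Tensor,
                    memory_bank: torch.Tensor,
                    top_k: int = 5) -> torch.Tensor:
    """text_features: (B, D) CLIP text embeddings.
    memory_bank:   (N, D) precomputed clothing-texture embeddings.
    Returns:       (B, D) retrieval-refined features."""
    text = F.normalize(text_features, dim=-1)
    bank = F.normalize(memory_bank, dim=-1)
    sim = text @ bank.t()                    # (B, N) cosine similarities
    weights, idx = sim.topk(top_k, dim=-1)   # top-k nearest bank entries
    weights = weights.softmax(dim=-1)        # similarity-weighted mixture
    retrieved = (weights.unsqueeze(-1) * bank[idx]).sum(dim=1)  # (B, D)
    # Fuse coarse text features with the retrieved texture details.
    return F.normalize(text + retrieved, dim=-1)

# Usage with random placeholders standing in for real CLIP embeddings:
feats = retrieve_styles(torch.randn(2, 512), torch.randn(1000, 512))
print(feats.shape)  # torch.Size([2, 512])
```

A soft, similarity-weighted mixture is used here rather than a hard nearest-neighbor lookup so the refinement stays differentiable; the actual SMR design may differ.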
Pages: 4379-4397
Number of pages: 19