Control With Style: Style Embedding-Based Variational Autoencoder for Controlled Stylized Caption Generation Framework

Cited by: 0
Authors
Sharma, Dhruv [1 ]
Dhiman, Chhavi [1 ]
Kumar, Dinesh [1 ]
Affiliations
[1] Delhi Technological University, Department of Electronics and Communication Engineering, Delhi 110042, India
Keywords
Visualization; Task analysis; Long short-term memory; Decoding; Adaptation models; Transformers; Generators; Bag of captions (BoCs); computer vision; controlled text generation; image captioning; natural language processing; smooth maximum unit (SMU); stylized image captioning; variational autoencoder (VAE)
DOI
10.1109/TCDS.2024.3405573
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Automatic image captioning is a computationally intensive and structurally complex task that describes the contents of an image in a natural language sentence. Methods developed in the recent past focused mainly on describing the factual content of images, ignoring the different emotions and styles (romantic, humorous, angry, etc.) associated with an image. To overcome this, a few works incorporated style-based caption generation to capture the variability in the generated descriptions. This article presents a style embedding-based variational autoencoder for controlled stylized caption generation framework (RFCG+SE-VAE-CSCG), which generates controlled, text-based stylized descriptions of images. It works in two phases: 1) refined factual caption generation (RFCG); and 2) SE-VAE-CSCG. The former defines an encoder-decoder model for generating refined factual captions, while the latter presents an SE-VAE for controlled stylized caption generation. The overall framework generates style-based descriptions of images by leveraging bags of captions (BoCs). Moreover, with the use of a controlled text generation model, the proposed work efficiently learns disentangled representations and generates realistic stylized descriptions of images. Experiments on MSCOCO, Flickr30K, and FlickrStyle10K yield state-of-the-art results for both refined and style-based caption generation, supported by an ablation study.
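Since the record describes the SE-VAE only at a high level, the following is a minimal, hypothetical sketch of the general idea behind a style embedding-conditioned VAE caption model: a caption is encoded into a Gaussian latent, fused with a learned style embedding, and decoded conditioned on both. All module names, dimensions, and the single-layer LSTM choice are illustrative assumptions, not the authors' RFCG+SE-VAE-CSCG implementation.

# Sketch (PyTorch) of a style-embedding-conditioned VAE text decoder;
# names and sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class StyleEmbeddingVAE(nn.Module):
    def __init__(self, vocab_size, num_styles,
                 embed_dim=256, hidden_dim=512, latent_dim=128):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        # One learned embedding per style (romantic, humorous, angry, ...).
        self.style_embed = nn.Embedding(num_styles, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim + embed_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim + embed_dim, latent_dim)
        self.latent_to_hidden = nn.Linear(latent_dim + embed_dim, hidden_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, style_id):
        # Encode the caption and fuse the final state with the style embedding.
        emb = self.token_embed(tokens)                 # (B, T, E)
        _, (h, _) = self.encoder(emb)                  # h: (1, B, H)
        s = self.style_embed(style_id)                 # (B, E)
        enc = torch.cat([h[-1], s], dim=-1)            # (B, H+E)
        mu, logvar = self.to_mu(enc), self.to_logvar(enc)
        # Reparameterization trick: z = mu + sigma * eps.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # Decode conditioned on both the latent and the style embedding
        # (teacher forcing; in practice the decoder inputs would be shifted).
        h0 = torch.tanh(self.latent_to_hidden(torch.cat([z, s], dim=-1))).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(emb, (h0, c0))
        logits = self.out(dec_out)                     # (B, T, V)
        # KL divergence of q(z | caption, style) from the standard normal prior.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
        return logits, kl

In training, the token-level cross-entropy reconstruction loss over the logits would typically be combined with the KL term (often with an annealed weight) to encourage the disentangled latent representations the abstract refers to.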
Pages: 2032-2042
Page count: 11