An Unsupervised Vision-related Keywords Retrieval and Fusion Method for Visual Storytelling

Cited by: 0
Authors
Li, Bing [1 ,2 ]
Ma, Can [1 ]
Gao, Xiyan [1 ]
Jia, Guangheng [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China
Source
2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI) | 2023
Keywords
Visual storytelling; multi-modal; text feature; knowledge
DOI
10.1109/ICTAI59109.2023.00120
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Visual storytelling is a multi-modal generation task that aims to generate a coherent story for a sequence of images. Previous visual storytelling models utilize task-beneficial non-visual features, e.g., emotion, sentiment, or knowledge graphs, as supplements to the visual features to further improve the quality of the generated stories. However, these non-visual features must be carefully designed and selected by specialized researchers, and high-quality external knowledge sources are not readily available, which increases the development cost of visual storytelling (VST) models. To alleviate this problem, this paper explores mining the knowledge embedded in multi-modal pre-trained models (MM-PTMs). First, we propose an Unsupervised Keywords Retrieval module (UKR), which takes an MM-PTM as an expert to select image-related keywords from a prepared task-related text corpus. The retrieved keywords not only complement and illustrate the visual features of the images but also provide more explicit generative signals, improving the interpretability and controllability of the generation process. Furthermore, we propose a Local Multi-modal Adaptive Fusion module (LMAF) to better fuse the textual and visual features and to avoid noise introduced by irrelevant keywords. LMAF dynamically aggregates features from both modalities through finer-grained correlation matching. Experimental results on the VST dataset, VIST, show that our proposed method achieves competitive results on several automatic metrics. Results comparable to previous methods can be achieved even when the model does not refer to visual features during story generation.
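To make the UKR idea concrete, below is a minimal, hypothetical sketch of unsupervised keyword retrieval with a frozen multi-modal pre-trained model. CLIP is assumed here as the MM-PTM, and the candidate corpus, model checkpoint, and top_k value are illustrative stand-ins, not the authors' actual configuration: each candidate keyword is scored against an image by image-text similarity, and the top-k matches are kept.

# Hypothetical sketch: CLIP-style keyword retrieval (assumed MM-PTM, not the paper's exact setup)
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def retrieve_keywords(image: Image.Image, corpus: list[str], top_k: int = 5) -> list[str]:
    """Rank candidate keywords by image-text similarity and return the top-k."""
    inputs = processor(text=corpus, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (1, num_keywords): similarity of the image to each keyword
    scores = outputs.logits_per_image.squeeze(0)
    top = scores.topk(min(top_k, len(corpus))).indices.tolist()
    return [corpus[i] for i in top]

# Illustrative usage with a hypothetical task-related keyword corpus:
# corpus = ["beach", "wedding", "birthday cake", "hiking trail", "fireworks"]
# keywords = retrieve_keywords(Image.open("photo.jpg"), corpus)

In the paper's pipeline, keywords retrieved in this fashion would then be fused with the visual features (by LMAF) before story decoding; the sketch above covers only the retrieval step.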
Pages: 784-790
Number of pages: 7