ML2MG-VLCR: A Multimodal LLM Guided Zero-shot Method for Visio-linguistic Compositional Reasoning with Autoregressive Generative Language Model

Cited by: 0
|
Authors
Gong, Ziyu [1 ]
Mai, Chengcheng [2 ,3 ]
Huang, Yihua [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Jiangsu, Peoples R China
[2] Nanjing Normal Univ, Sch Comp Sci & Elect Informat, Nanjing, Jiangsu, Peoples R China
[3] Nanjing Normal Univ, Sch Artif Intelligence, Nanjing, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visio-linguistic Compositional Reasoning; Multimodal LLM; Autoregressive GLM; Zero-shot; Multimodal retrieval;
DOI
10.1145/3652583.3658016
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visio-linguistic compositional reasoning is a challenging task that requires matching two images with two captions, where the images differ but the corresponding captions are composed of the same words in different orders. The matching model must therefore understand both the compositional structure of each image and the word order of each caption. However, when faced with compositional reasoning tasks, existing vision-language models are insensitive to image structure and text order, behaving more like bag-of-words models. To address this challenge, we propose a zero-shot visio-linguistic compositional reasoning method assisted by a multimodal LLM and an autoregressive generative language model. Given an image and candidate texts composed of the same words in different orders, we first leverage LLaVA to generate a descriptive text for the image, reflecting the image's compositional structure in the word order of the text. We then propose an order-sensitive image-text matching method that scores each candidate text by its generation probability conditioned on the textualized image information produced by LLaVA, where the autoregressive generative language model explicitly models and evaluates word order. Experimental results on VG-Relation, VG-Attribution and Flickr30K-Order demonstrate the superiority of our method in understanding the compositional structure and order of images and texts.
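The core scoring idea in the abstract (ranking candidate captions by their generation probability conditioned on a textualized image description) can be sketched in miniature. The toy code below is an illustration only, not the paper's implementation: it stands in for the pretrained autoregressive generative language model with a tiny add-alpha-smoothed bigram model, and the `desc` string is a hypothetical LLaVA-style caption, not actual model output. It shows why an order-sensitive scorer separates two candidates built from the same words:

```python
import math
from collections import defaultdict

def bigram_logprob(context_tokens, candidate_tokens, alpha=0.1):
    """Log-probability of candidate_tokens under a bigram LM fit to context_tokens.

    Toy stand-in for scoring P(candidate | textualized image) with an
    autoregressive LM: the score is a sum of per-token conditional
    log-probabilities, so it is sensitive to word order.
    """
    vocab = set(context_tokens) | set(candidate_tokens)
    counts = defaultdict(lambda: defaultdict(float))
    for a, b in zip(context_tokens, context_tokens[1:]):
        counts[a][b] += 1.0
    score = 0.0
    prev = candidate_tokens[0]
    for tok in candidate_tokens[1:]:
        # Add-alpha smoothing so unseen transitions get small, nonzero mass
        num = counts[prev][tok] + alpha
        den = sum(counts[prev].values()) + alpha * len(vocab)
        score += math.log(num / den)
        prev = tok
    return score

# Hypothetical LLaVA-style description of the image (assumed, for illustration)
desc = "the horse is eating the grass in the field".split()
cand_a = "the horse is eating the grass".split()  # order matches the image
cand_b = "the grass is eating the horse".split()  # same words, swapped order

# The order-consistent candidate receives the higher log-probability
assert bigram_logprob(desc, cand_a) > bigram_logprob(desc, cand_b)
```

In the paper's setting the bigram model is replaced by a full pretrained autoregressive generative LM, but the ranking principle is the same: a bag-of-words similarity would score `cand_a` and `cand_b` identically, whereas a left-to-right generation probability does not.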
Pages: 842-850
Page count: 9