Exploring the Limits of VLMs: A Dataset for Evaluating Text-to-Video Generation

Cited by: 0
Authors
Srivastava, Avnish [1]
Sista, Raviteja [1]
Chakrabarti, Partha P. [1]
Sheet, Debdoot [1]
Affiliations
[1] Indian Inst Technol Kharagpur, Kharagpur, W Bengal, India
Source
PROCEEDINGS OF FIFTEENTH INDIAN CONFERENCE ON COMPUTER VISION, GRAPHICS AND IMAGE PROCESSING, ICVGIP 2024 | 2024
Keywords
Benchmark dataset; Diffusion models; Temporal consistency; Video generation; Vision language models; IMAGE;
DOI
10.1145/3702250.3702298
Chinese Library Classification (CLC) Number
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Vision language models (VLMs) integrate vision and text by learning from images and their textual descriptions, enabling the generation of text from images and vice versa. VLMs are used for tasks such as image captioning and visual question answering. As VLMs extend to video generation, significant challenges arise: the generated videos often lack temporal consistency, and the generated content is frequently misaligned with the input text. This work proposes a set of prompts for systematically evaluating VLMs on text-to-video generation, focusing on object coherence and temporal consistency. The prompt dataset covers two categories: first, the complexity of prompts, and second, the number of objects and actions. The first category consists of four levels of prompt complexity: simple, mid-level, high-level, and unique and rare object prompts. The second category also consists of four levels: single object, single object in action, multiple objects, and multiple objects in action. The dataset thus comprises 16 prompt scenarios with 10 prompts each, for a total of 160 prompts. This work evaluates the outputs of three models from the family of diffusion-based VLMs on the task of text-to-video generation. The videos generated by the models were assessed by five participants on a 0-5 Likert scale. The VLMs under study generated temporally consistent videos in only 33.63% of the evaluated prompts and rendered the correct objects in only 39.43%, with most of these successes falling in the single-object and simple-prompt categories. Model performance drops as the number of objects and actions and the prompt complexity increase, and all models perform poorly on action scenarios with rare-object prompts, reflecting the limitations of VLMs in generating videos. The proposed dataset can be extended to other video generation models to benchmark their performance on the basic aspects of consistency and alignment in videos.
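For illustration, the 4 x 4 prompt taxonomy described in the abstract can be enumerated with a minimal sketch; the level labels and variable names used here (COMPLEXITY_LEVELS, OBJECT_ACTION_LEVELS, PROMPTS_PER_SCENARIO) are hypothetical placeholders, not identifiers from the released dataset.

    # Minimal sketch (assumed naming, not from the paper) of the 16-scenario prompt taxonomy.
    from itertools import product

    COMPLEXITY_LEVELS = ["simple", "mid-level", "high-level", "unique and rare object"]
    OBJECT_ACTION_LEVELS = ["single object", "single object in action",
                            "multiple objects", "multiple objects in action"]
    PROMPTS_PER_SCENARIO = 10

    # Every pairing of a complexity level with an object/action level is one scenario.
    scenarios = list(product(COMPLEXITY_LEVELS, OBJECT_ACTION_LEVELS))
    assert len(scenarios) == 16                             # 4 complexity x 4 object/action levels
    assert len(scenarios) * PROMPTS_PER_SCENARIO == 160     # total prompts in the dataset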
Pages: 9