Exploring a CLIP-Enhanced Automated Approach for Video Description Generation

Cited by: 0
Authors
Zhang, Siang-Ling [1]
Cheng, Huai-Hsun [1]
Chen, Yen-Hsin [1]
Yeh, Mei-Chen [1]
Affiliations
[1] Natl Taiwan Normal Univ, Dept Comp Sci & Informat Engn, Taipei, Taiwan
DOI
10.1109/APSIPAASC58517.2023.10317231
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Visual storytelling is an ability that humans have developed over the course of their evolution. In contrast to written records and descriptions, people now share glimpses of their lives through short videos. Human conversation relies largely on visual and auditory inputs, followed by corresponding feedback. For machines, converting images to text serves as a bridge between visual and linguistic information, enabling more natural human-machine interaction. Video captioning, which automatically generates textual descriptions from videos, is one of the core technologies behind such applications. In this work, we present CLIP-CAP, an automatic method for transforming visual content into concise textual descriptions. We investigate the pretrained CLIP model and its potential for this task. Through experiments on the ActivityNet Captions dataset, we show that the proposed CLIP-CAP model outperforms existing video captioning methods on several metrics.
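The abstract does not detail the CLIP-CAP architecture, but the core idea it names is reusing a pretrained CLIP image encoder for video description. The following is a minimal, hypothetical sketch (not the authors' implementation) of that first stage using the Hugging Face transformers CLIP API: sampled video frames are encoded into per-frame CLIP features, which a captioning decoder (not specified in the abstract and therefore only indicated in a comment) would then consume.

```python
# Minimal sketch, assuming sampled frames are available as PIL images.
# The captioning decoder itself is hypothetical and omitted here.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def encode_frames(frames: list[Image.Image]) -> torch.Tensor:
    """Return one CLIP feature vector per sampled frame, shape (num_frames, 512)."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    # L2-normalize, as is conventional for CLIP embeddings.
    return feats / feats.norm(dim=-1, keepdim=True)

# frame_features = encode_frames(sampled_frames)   # (T, 512) per-frame features
# video_feature = frame_features.mean(dim=0)       # simple temporal pooling (one choice among many)
# caption = decoder(video_feature)                 # hypothetical captioning decoder
```

How the per-frame features are aggregated and decoded into sentences is a design choice of the paper itself; the mean pooling above is only a placeholder.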
Pages: 1506 - 1511
Page count: 6
Related Papers
50 records in total
  • [1] Triplane-Smoothed Video Dehazing with CLIP-Enhanced Generalization
    Ren, Jingjing
    Chen, Haoyu
    Ye, Tian
    Wu, Hongtao
    Zhu, Lei
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, 133 (01) : 475 - 488
  • [2] CLIP-enhanced multimodal machine translation: integrating visual and label features with transformer fusion
    Cui, ShaoDong
    Yin, Xinyan
    Duan, Kaibo
    Shinnou, Hiroyuki
    MULTIMEDIA TOOLS AND APPLICATIONS, 2025, 84 (14) : 12699 - 12713
  • [3] Approach for video retrieval by video clip
    Peng, Yu-Xin
    Ngo, Chong-Wah
    Dong, Qing-Jie
    Guo, Zong-Ming
    Xiao, Jian-Guo
    Ruan Jian Xue Bao/Journal of Software, 2003, 14 (08) : 1409 - 1417
  • [4] VCLIPSeg: Voxel-Wise CLIP-Enhanced Model for Semi-supervised Medical Image Segmentation
    Li, Lei
    Lian, Sheng
    Luo, Zhiming
    Wang, Beizhan
    Li, Shaozi
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT IX, 2024, 15009 : 692 - 701
  • [5] An Approach for Automated Kannada Subtitle Generation from Kannada Video
    Santosh
    Jenila Livingston, L. M.
    INTERNATIONAL JOURNAL OF UNCERTAINTY FUZZINESS AND KNOWLEDGE-BASED SYSTEMS, 2023, 31 (SUPP01) : 101 - 119
  • [6] A Drone Video Clip Dataset and its Applications in Automated Cinematography
    Ashtari, Amirsaman
    Jung, Raehyuk
    Li, Mingxiao
    Noh, Junyong
    COMPUTER GRAPHICS FORUM, 2022, 41 (07) : 189 - 203
  • [7] Unsupervised Video Summarization based on Consistent Clip Generation
    Ai, Xin
    Song, Yan
    Li, Zechao
    2018 IEEE FOURTH INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM), 2018,
  • [8] Automated description generation for software patches
    Vu, Thanh Trong
    Bui, Tuan-Dung
    Do, Thanh-Dat
    Nguyen, Thu-Trang
    Vo, Hieu Dinh
    Nguyen, Son
    INFORMATION AND SOFTWARE TECHNOLOGY, 2025, 177
  • [9] ST-CLIP: Spatio-Temporal Enhanced CLIP Towards Dense Video Captioning
    Chen, Huimin
    Duan, Pengfei
    Huang, Mingru
    Guo, Jingyi
    Xiong, Shengwu
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT XI, ICIC 2024, 2024, 14872 : 396 - 407
  • [10] A SOUNDTRACK GENERATION SYSTEM TO SYNCHRONIZE THE CLIMAX OF A VIDEO CLIP WITH MUSIC
    Sato, Haruki
    Hirai, Tatsunori
    Nakano, Tomoyasu
    Goto, Masataka
    Morishima, Shigeo
    2016 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO (ICME), 2016,