Exploring a CLIP-Enhanced Automated Approach for Video Description Generation

Authors
Zhang, Siang-Ling [1]
Cheng, Huai-Hsun [1]
Chen, Yen-Hsin [1]
Yeh, Mei-Chen [1]
Affiliations
[1] Natl Taiwan Normal Univ, Dept Comp Sci & Informat Engn, Taipei, Taiwan
DOI
10.1109/APSIPAASC58517.2023.10317231
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Visual storytelling is a learned ability that humans have developed over the course of their evolution. Rather than relying on written records and descriptions, people now share glimpses of their lives through short videos. Human conversation relies largely on visual and auditory input, followed by corresponding feedback. For machines, converting images to text serves as a bridge between visual and linguistic information, making machine-human interaction more natural. Video captioning, the automatic generation of textual descriptions from videos, is one of the core technologies that enable such applications. In this work, we present CLIP-CAP, an automatic method for transforming visual content into concise textual descriptions. We investigate the CLIP pre-trained model and its potential for this task. Through experiments on the ActivityNet Captions dataset, we show that the proposed CLIP-CAP model outperforms existing video captioning methods on several metrics.
Pages: 1506-1511
Page count: 6