Exploring a CLIP-Enhanced Automated Approach for Video Description Generation

Cited by: 0
Authors
Zhang, Siang-Ling [1]
Cheng, Huai-Hsun [1]
Chen, Yen-Hsin [1]
Yeh, Mei-Chen [1]
Affiliations
[1] Natl Taiwan Normal Univ, Dept Comp Sci & Informat Engn, Taipei, Taiwan
DOI
10.1109/APSIPAASC58517.2023.10317231
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Visual storytelling is an ability that humans have developed over the course of their evolution. In contrast to written records and descriptions, people now share glimpses of their lives through short videos. Human conversation relies largely on visual and auditory inputs, followed by corresponding feedback. For machines, converting images to text serves as a bridge between visual and linguistic information, enabling more natural human-machine interaction. Video captioning, which automatically generates textual descriptions from videos, is one of the core technologies behind such applications. In this work, we present CLIP-CAP, an automatic method for transforming visual content into concise textual descriptions. We investigate the pretrained CLIP model and its potential for this task. Through experiments on the ActivityNet Captions dataset, we show that the proposed CLIP-CAP model outperforms existing video captioning methods on several metrics.
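The abstract does not detail the CLIP-CAP architecture, but the core idea it names is reusing a pretrained CLIP image encoder for video description. The following is a minimal, hypothetical sketch (not the authors' implementation) of that first stage using the Hugging Face transformers CLIP API: sampled video frames are encoded into per-frame CLIP features, which a captioning decoder (not specified in the abstract and therefore only indicated in a comment) would then consume.

```python
# Minimal sketch, assuming sampled frames are available as PIL images.
# The captioning decoder itself is hypothetical and omitted here.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def encode_frames(frames: list[Image.Image]) -> torch.Tensor:
    """Return one CLIP feature vector per sampled frame, shape (num_frames, 512)."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    # L2-normalize, as is conventional for CLIP embeddings.
    return feats / feats.norm(dim=-1, keepdim=True)

# frame_features = encode_frames(sampled_frames)   # (T, 512) per-frame features
# video_feature = frame_features.mean(dim=0)       # simple temporal pooling (one choice among many)
# caption = decoder(video_feature)                 # hypothetical captioning decoder
```

How the per-frame features are aggregated and decoded into sentences is a design choice of the paper itself; the mean pooling above is only a placeholder.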
Pages: 1506 - 1511
Page count: 6
Related Papers
50 records in total
  • [1] Triplane-Smoothed Video Dehazing with CLIP-Enhanced Generalization
    Ren, Jingjing
    Chen, Haoyu
    Ye, Tian
    Wu, Hongtao
    Zhu, Lei
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, 133 (01) : 475 - 488
  • [2] CLIP-enhanced multimodal machine translation: integrating visual and label features with transformer fusion
    Cui, ShaoDong
    Yin, Xinyan
    Duan, Kaibo
    Shinnou, Hiroyuki
    MULTIMEDIA TOOLS AND APPLICATIONS, 2025, 84 (14) : 12699 - 12713
  • [3] Approach for video retrieval by video clip
    Peng, Yu-Xin
    Ngo, Chong-Wah
    Dong, Qing-Jie
    Guo, Zong-Ming
    Xiao, Jian-Guo
    Ruan Jian Xue Bao/Journal of Software, 2003, 14 (08) : 1409 - 1417
  • [4] VCLIPSeg: Voxel-Wise CLIP-Enhanced Model for Semi-supervised Medical Image Segmentation
    Li, Lei
    Lian, Sheng
    Luo, Zhiming
    Wang, Beizhan
    Li, Shaozi
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT IX, 2024, 15009 : 692 - 701
  • [5] An Approach for Automated Kannada Subtitle Generation from Kannada Video
    Santosh
    Jenila Livingston, L. M.
    INTERNATIONAL JOURNAL OF UNCERTAINTY FUZZINESS AND KNOWLEDGE-BASED SYSTEMS, 2023, 31 (SUPP01) : 101 - 119
  • [6] A Drone Video Clip Dataset and its Applications in Automated Cinematography
    Ashtari, Amirsaman
    Jung, Raehyuk
    Li, Mingxiao
    Noh, Junyong
    COMPUTER GRAPHICS FORUM, 2022, 41 (07) : 189 - 203
  • [7] Unsupervised Video Summarization based on Consistent Clip Generation
    Ai, Xin
    Song, Yan
    Li, Zechao
    2018 IEEE FOURTH INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM), 2018,
  • [8] Automated description generation for software patches
    Vu, Thanh Trong
    Bui, Tuan-Dung
    Do, Thanh-Dat
    Nguyen, Thu-Trang
    Vo, Hieu Dinh
    Nguyen, Son
    INFORMATION AND SOFTWARE TECHNOLOGY, 2025, 177
  • [9] ST-CLIP: Spatio-Temporal Enhanced CLIP Towards Dense Video Captioning
    Chen, Huimin
    Duan, Pengfei
    Huang, Mingru
    Guo, Jingyi
    Xiong, Shengwu
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT XI, ICIC 2024, 2024, 14872 : 396 - 407
  • [10] A SOUNDTRACK GENERATION SYSTEM TO SYNCHRONIZE THE CLIMAX OF A VIDEO CLIP WITH MUSIC
    Sato, Haruki
    Hirai, Tatsunori
    Nakano, Tomoyasu
    Goto, Masataka
    Morishima, Shigeo
    2016 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO (ICME), 2016,