Fine-Grained Length Controllable Video Captioning With Ordinal Embeddings

Cited by: 0
Authors
Nitta, Tomoya [1 ,2 ]
Fukuzawa, Takumi [2 ]
Tamaki, Toru [2 ]
Affiliations
[1] Toshiba, Kawasaki 2128582, Japan
[2] Nagoya Inst Technol, Nagoya 4668555, Japan
Funding
Japan Society for the Promotion of Science;
Keywords
Decoding; Vectors; Earth Observing System; Training; Long short term memory; Data models; Web sites; Video on demand; Reviews; Reliability; Video captioning; length controllable generation; ordinal embedding;
DOI
10.1109/ACCESS.2024.3506751
Chinese Library Classification (CLC)
TP [automation technology, computer technology];
Discipline Classification Code
0812;
Abstract
This paper proposes a video captioning method that controls the length of generated captions. Previous work on length control typically offered only a few coarse length levels. In this study, we propose two length-embedding methods for fine-grained length control. The conventional embedding is linear, multiplying a one-hot vector by an embedding matrix; we instead represent length with multi-hot vectors. One is a bit embedding that expresses length in binary representation, and the other is an ordinal embedding that uses the cumulative binary representation often used in ordinal regression. These multi-hot length representations are converted into a length embedding by a nonlinear MLP. The method allows control not only of the length of caption sentences but also of the time needed to read the caption aloud. Experiments on ActivityNet Captions and Spoken Moments in Time show that the proposed method effectively controls the length of the generated captions. Analysis of the embedding vectors with Independent Component Analysis (ICA) shows that length and semantics are learned separately, demonstrating the effectiveness of the proposed embedding methods. Our code and an online demo are available at https://huggingface.co/spaces/fztkm/length_controllable_video_captioning.
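The three length representations named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the maximum length, the LSB-first bit order, the function names, and the MLP sizes are all assumptions; only the general shapes (one-hot, binary bit vector, cumulative ordinal vector, nonlinear MLP) come from the abstract.

```python
import numpy as np

MAX_LEN = 8  # assumed maximum caption length, for illustration only


def one_hot(length: int) -> np.ndarray:
    """Conventional linear-embedding input: a single 1 at position length-1."""
    v = np.zeros(MAX_LEN)
    v[length - 1] = 1.0
    return v


def bit_embedding_input(length: int) -> np.ndarray:
    """Bit embedding input: the length in binary (LSB first), a multi-hot vector."""
    n_bits = int(np.ceil(np.log2(MAX_LEN + 1)))
    return np.array([(length >> i) & 1 for i in range(n_bits)], dtype=float)


def ordinal_embedding_input(length: int) -> np.ndarray:
    """Ordinal embedding input (ordinal-regression style): first `length` entries are 1."""
    v = np.zeros(MAX_LEN)
    v[:length] = 1.0
    return v


def length_mlp(x: np.ndarray, w1, b1, w2, b2) -> np.ndarray:
    """Nonlinear MLP mapping a multi-hot length vector to a length embedding."""
    h = np.maximum(0.0, x @ w1 + b1)  # ReLU hidden layer
    return h @ w2 + b2
```

For example, with `MAX_LEN = 8`, `ordinal_embedding_input(3)` gives `[1, 1, 1, 0, 0, 0, 0, 0]` and `bit_embedding_input(5)` gives `[1, 0, 1, 0]` (5 = 101 in binary, least-significant bit first). The nonlinearity matters here: a linear map of a multi-hot vector would just sum per-position embeddings, whereas the MLP can learn interactions between the bits.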
Pages: 189667-189688
Page count: 22