IcoCap: Improving Video Captioning by Compounding Images

Cited by: 9
Authors
Liang, Yuanzhi [1 ]
Zhu, Linchao [2 ]
Wang, Xiaohan [2 ]
Yang, Yi [2 ]
Affiliations
[1] Univ Technol Sydney, Australian Artificial Intelligence Inst, Sydney, NSW 2007, Australia
[2] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
Funding
Australian Research Council;
Keywords
Multi-modal understanding; representation learning; video captioning; representation; CLIP
DOI
10.1109/TMM.2023.3322329
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Video captioning is a more challenging task than image captioning, primarily due to differences in content density. Video data contains redundant visual content, making it difficult for captioners to generalize over diverse content and avoid being misled by irrelevant elements. Moreover, the redundant content is not well trimmed to match the corresponding visual semantics in the ground truth, further increasing the difficulty of video captioning. Current research in video captioning predominantly focuses on captioner design, neglecting the impact of content density on captioner performance. Given the differences between videos and images, there is another route to improving video captioning: leveraging concise, easily learned image samples to further diversify video samples. This modification of content density compels the captioner to learn more robustly against redundancy and ambiguity. In this article, we propose a novel approach called Image-Compounded learning for video Captioners (IcoCap) to facilitate better learning of complex video semantics. IcoCap comprises two components: the Image-Video Compounding Strategy (ICS) and Visual-Semantic Guided Captioning (VGC). ICS compounds easily learned image semantics into video semantics, further diversifying the video content and prompting the network to generalize across more diverse samples. In addition, by learning from samples compounded with image content, the captioner is compelled to better extract valuable video cues in the presence of straightforward image semantics. This helps the captioner focus on relevant information while filtering out extraneous content. VGC then guides the network in flexibly learning the ground-truth captions based on the compounded samples, helping to mitigate the mismatch between the ground truth and the ambiguous semantics in video samples. Our experimental results demonstrate the effectiveness of IcoCap in improving the learning of video captioners. Applied to the widely used MSVD, MSR-VTT, and VATEX datasets, our approach achieves competitive or superior results compared with state-of-the-art methods, illustrating its capacity to handle redundant and ambiguous video data.
Pages: 4389-4400
Page count: 12
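The abstract describes the Image-Video Compounding Strategy (ICS) only at a high level, so the sketch below illustrates the general idea rather than the paper's actual procedure: an image's features are compounded into a video clip's per-frame features by overwriting a random subset of frames, yielding a more diverse training sample. The function name compound_image_into_video, the mixing ratio rho, and the CLIP-style feature shapes are all assumptions, not details taken from the paper.

```python
import torch

def compound_image_into_video(video_feats, image_feats, rho=0.25):
    """Hypothetical sketch of an image-video compounding step.

    Replaces a random subset of per-frame video features with the
    features of an easily learned image, producing a compounded
    sample that mixes concise image semantics into the video.

    video_feats: (T, D) per-frame features of one video clip.
    image_feats: (D,)   features of one image sample.
    rho:         fraction of frames to overwrite (assumed hyperparameter).
    """
    T, _ = video_feats.shape
    num_mix = max(1, int(rho * T))
    idx = torch.randperm(T)[:num_mix]   # frames chosen for compounding
    compounded = video_feats.clone()
    compounded[idx] = image_feats       # inject image semantics into those frames
    return compounded, idx

# Usage: 16 CLIP-style frame features (512-d) compounded with one image.
video = torch.randn(16, 512)
image = torch.randn(512)
mixed, mixed_idx = compound_image_into_video(video, image, rho=0.25)
print(mixed.shape, mixed_idx.tolist())
```

Under this reading, the captioner would be trained on the compounded features, with VGC weighting the ground-truth caption against the injected image semantics; the record does not specify how that guidance is computed, so it is omitted here.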