Video description: A comprehensive survey of deep learning approaches

Cited by: 0
Authors
Ghazala Rafiq
Muhammad Rafiq
Gyu Sang Choi
Affiliations
[1] Yeungnam University, Department of Information & Communication Engineering
[2] Keimyung University, Department of Game & Mobile Engineering
Source
Artificial Intelligence Review | 2023, Vol. 56
Keywords
Deep learning; Encoder–Decoder architecture; Text description; Video captioning techniques; Video description approaches; Video captioning; Vision to text
DOI: not available
Abstract
Video description is the task of understanding visual content and transforming that understanding into automatic textual narration. It bridges the key AI fields of computer vision and natural language processing and connects directly to real-time, practical applications. Deep learning-based approaches to video description have demonstrated better results than conventional approaches, yet the current literature lacks a thorough interpretation of the recently developed sequence-to-sequence techniques for the task. This paper fills that gap by focusing mainly on deep learning-enabled approaches to automatic caption generation. Sequence-to-sequence models follow an Encoder–Decoder architecture, employing a specific composition of a CNN, an RNN, or the LSTM or GRU variants as the encoder and decoder blocks. This standard architecture can be fused with an attention mechanism that focuses on the most distinctive visual features, achieving high-quality results. Reinforcement learning employed within the Encoder–Decoder structure can progressively deliver state-of-the-art captions by following exploration and exploitation strategies. The transformer is a modern and efficient transductive architecture for robust output: free from recurrence and based solely on self-attention, it allows parallelization and training on massive amounts of data, fully exploiting the available GPUs for most NLP tasks. With the recent emergence of several transformer variants, handling long-term dependencies is no longer an obstacle for researchers engaged in video processing for summarization and description, or for autonomous-vehicle, surveillance, and instructional purposes; they can find promising directions in this survey.
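To make the Encoder–Decoder plus attention pattern described in the abstract concrete, the following is a minimal, illustrative PyTorch sketch, not an implementation taken from the surveyed papers: a GRU encoder runs over pre-extracted per-frame CNN features, and a GRU decoder with soft attention over the encoder states emits caption tokens. The feature dimension (2048, as from a typical pretrained CNN), the hidden and vocabulary sizes, and the class name AttnCaptioner are illustrative assumptions.

    # Illustrative Encoder-Decoder-with-attention sketch for video captioning;
    # hyperparameters and names are assumptions, not values from the survey.
    import torch
    import torch.nn as nn

    class AttnCaptioner(nn.Module):
        def __init__(self, feat_dim=2048, hidden=512, vocab=10000, emb=300):
            super().__init__()
            self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)  # temporal encoder over frame features
            self.embed = nn.Embedding(vocab, emb)
            self.decoder = nn.GRUCell(emb + hidden, hidden)             # word + context -> next state
            self.attn = nn.Linear(2 * hidden, 1)                        # scores each encoder step against decoder state
            self.out = nn.Linear(hidden, vocab)

        def forward(self, frame_feats, captions):
            # frame_feats: (B, T_frames, feat_dim); captions: (B, T_words) teacher-forced token ids
            enc_out, h = self.encoder(frame_feats)                       # enc_out: (B, T_frames, hidden)
            h = h.squeeze(0)                                             # decoder state: (B, hidden)
            logits = []
            for t in range(captions.size(1)):
                # soft attention: weight encoder timesteps by relevance to the current decoder state
                scores = self.attn(torch.cat(
                    [enc_out, h.unsqueeze(1).expand_as(enc_out)], dim=-1)).squeeze(-1)
                ctx = (scores.softmax(dim=1).unsqueeze(-1) * enc_out).sum(dim=1)
                h = self.decoder(torch.cat([self.embed(captions[:, t]), ctx], dim=-1), h)
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)                            # (B, T_words, vocab)

    # Usage with random tensors standing in for CNN frame features and token ids:
    model = AttnCaptioner()
    feats = torch.randn(2, 30, 2048)                                     # 2 clips, 30 frames each
    caps = torch.randint(0, 10000, (2, 12))                              # 12 caption tokens per clip
    print(model(feats, caps).shape)                                      # torch.Size([2, 12, 10000])

A reinforcement-learning variant, as the abstract outlines, would keep the same model but replace the cross-entropy objective with a sentence-level reward optimized over sampled (explored) captions; a transformer variant would replace the recurrent encoder and decoder with self-attention layers.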
Pages: 13293–13372
Page count: 79