Evolution of automatic visual description techniques-a methodological survey

Cited by: 10
Authors
Bhowmik, Arka [1]
Kumar, Sanjay [1]
Bhat, Neeraj [1]
Affiliations
[1] Delhi Technol Univ, Dept Comp Sci & Engn, Main Bawana Rd, New Delhi 110042, India
Keywords
Image captioning; Video captioning; Activity recognition; Deep learning; Convolutional neural networks; Recurrent neural networks; IMAGE; ATTENTION
DOI
10.1007/s11042-021-10964-3
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Describing the contents and activities in an image or video in semantically and syntactically correct sentences is known as captioning. Automated captioning is one of the most actively researched topics today, with new, sophisticated models proposed regularly. Captioning models require intensive training and perform complex computations before generating a caption, and hence take a considerable amount of time even on high-specification machines. In this survey, we review recent state-of-the-art advances in automatic image and video description methodologies using deep neural networks and summarize the concepts inferred from them. The summary is based on a systematic, detailed, and critical analysis of the latest methodologies published in high-impact proceedings and journals. Our investigation focuses on techniques that can optimize existing concepts and incorporate new methods of visual attention for generating captions. This survey emphasizes the applicability and effectiveness of existing works in real-life applications and highlights computationally feasible, optimized techniques that can be supported on multiple devices, including lightweight devices such as smartphones. Furthermore, we propose possible improvements and a model architecture to support online video captioning.
Pages: 28015-28059
Page count: 45