Evolution of automatic visual description techniques-a methodological survey

Cited by: 10
Authors
Bhowmik, Arka [1 ]
Kumar, Sanjay [1 ]
Bhat, Neeraj [1 ]
Affiliations
[1] Delhi Technol Univ, Dept Comp Sci & Engn, Main Bawana Rd, New Delhi 110042, India
Keywords
Image captioning; Video captioning; Activity recognition; Deep learning; Convolutional neural networks; Recurrent neural networks; IMAGE; ATTENTION;
DOI
10.1007/s11042-021-10964-3
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Describing the contents and activities in an image or video in semantically and syntactically correct sentences is known as captioning. Automated captioning is one of the most actively researched topics today, with new, sophisticated models appearing regularly. Captioning models require intensive training and perform complex computations before generating a caption, and hence take a considerable amount of time even on high-specification machines. In this survey, we review recent state-of-the-art advances in automatic image and video description methodologies using deep neural networks and summarize the concepts inferred from them. The summary is based on a systematic, detailed, and critical analysis of the latest methodologies published in high-impact proceedings and journals. Our investigation focuses on techniques that can optimize existing concepts and incorporate new methods of visual attention for generating captions. This survey emphasizes the applicability and effectiveness of existing works in real-life applications and highlights computationally feasible, optimized techniques that can be supported on multiple devices, including lightweight devices such as smartphones. Furthermore, we propose possible improvements and a model architecture to support online video captioning.
Pages: 28015-28059
Page count: 45