Attention-Based Multimodal Deep Learning on Vision-Language Data: Models, Datasets, Tasks, Evaluation Metrics and Applications

Cited by: 3
Authors
Bose, Priyankar [1 ]
Rana, Pratip [1 ,2 ]
Ghosh, Preetam [1 ]
Affiliations
[1] Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA 23284 USA
[2] Bennett Aerosp, Raleigh, NC 27603 USA
Keywords
Task analysis; Data models; Deep learning; Transformers; Visualization; Training; Surveys; Question answering (information retrieval); Image segmentation; Image texture analysis; Attention mechanism; data fusion; multimodal learning; vision-language classification; vision-language question-answering; vision-language segmentation; IMAGE; NETWORK
DOI
10.1109/ACCESS.2023.3299877
Chinese Library Classification
TP [Automation Technology; Computer Technology]
Subject Classification Code
0812
Abstract
Multimodal learning has gained immense popularity due to the explosive growth in the volume of image and textual data across various domains. Heterogeneous vision-language multimodal data have been used to solve a variety of tasks, including classification, image segmentation, image captioning, and question answering. Consequently, several attention-based deep learning approaches have been proposed for image-text multimodal data. In this paper, we highlight the current status of attention-based deep learning approaches on vision-language multimodal data by presenting a detailed description of the existing models, their performance, and the variety of evaluation metrics used therein. We revisit the various attention mechanisms applied to image-text multimodal data from their inception in 2015 through 2022, considering a total of 75 articles for this survey. Our discussion also encompasses the current tasks, datasets, application areas, and future directions in this domain. To the best of our knowledge, this is the first attempt to survey the full scope of attention-based deep learning mechanisms on image-text multimodal data.
Pages: 80624-80646
Number of pages: 23
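
As an illustrative aside, the sketch below shows the basic cross-modal attention pattern (text tokens attending over image regions) that the surveyed models build on; the module name, layer sizes, and tensor shapes are illustrative assumptions only, not any specific architecture from the paper.

```python
# Minimal sketch of cross-modal attention: text tokens attend over image
# regions. All dimensions and shapes here are illustrative assumptions.
import torch
import torch.nn.functional as F
from torch import nn

class CrossModalAttention(nn.Module):
    def __init__(self, text_dim: int, image_dim: int, attn_dim: int):
        super().__init__()
        self.q = nn.Linear(text_dim, attn_dim)   # queries from text tokens
        self.k = nn.Linear(image_dim, attn_dim)  # keys from image regions
        self.v = nn.Linear(image_dim, attn_dim)  # values from image regions

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, n_tokens, text_dim)
        # image_feats: (batch, n_regions, image_dim)
        q, k, v = self.q(text_feats), self.k(image_feats), self.v(image_feats)
        # Scaled dot-product scores: each text token scores every image region.
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        weights = F.softmax(scores, dim=-1)  # attention weights over regions
        return weights @ v                   # (batch, n_tokens, attn_dim)

# Dummy usage: 32 text tokens attending over 49 image-region features.
attn = CrossModalAttention(text_dim=768, image_dim=2048, attn_dim=512)
fused = attn(torch.randn(2, 32, 768), torch.randn(2, 49, 2048))
print(fused.shape)  # torch.Size([2, 32, 512])
```

The fused output can then feed a task head (e.g., a classifier or answer decoder), which is the general fusion recipe behind the vision-language tasks the survey covers.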