Attention-Based Multimodal Deep Learning on Vision-Language Data: Models, Datasets, Tasks, Evaluation Metrics and Applications

Cited by: 3
Authors
Bose, Priyankar [1 ]
Rana, Pratip [1 ,2 ]
Ghosh, Preetam [1 ]
Affiliations
[1] Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA 23284 USA
[2] Bennett Aerosp, Raleigh, NC 27603 USA
Keywords
Task analysis; Data models; Deep learning; Transformers; Visualization; Training; Surveys; Question answering (information retrieval); Image segmentation; Image texture analysis; Attention mechanism; data fusion; multimodal learning; vision-language classification; vision-language question-answering; vision-language segmentation; IMAGE; NETWORK
DOI
10.1109/ACCESS.2023.3299877
Chinese Library Classification
TP [Automation Technology; Computer Technology]
Subject Classification Code
0812
Abstract
Multimodal learning has gained immense popularity due to the explosive growth in the volume of image and textual data across various domains. Heterogeneous vision-language multimodal data have been used to solve a variety of tasks, including classification, image segmentation, image captioning, and question answering. Consequently, several attention-based deep learning approaches have been proposed for image-text multimodal data. In this paper, we highlight the current status of attention-based deep learning approaches on vision-language multimodal data by presenting a detailed description of the existing models, their performance, and the variety of evaluation metrics used therein. We revisit the various attention mechanisms applied to image-text multimodal data from their inception in 2015 through 2022, considering a total of 75 articles for this survey. Our discussion also encompasses the current tasks, datasets, application areas, and future directions in this domain. To the best of our knowledge, this is the first attempt to survey the full scope of attention-based deep learning mechanisms on image-text multimodal data.
Pages: 80624-80646
Number of pages: 23
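
As an illustrative aside, the sketch below shows the basic cross-modal attention pattern (text tokens attending over image regions) that the surveyed models build on; the module name, layer sizes, and tensor shapes are illustrative assumptions only, not any specific architecture from the paper.

```python
# Minimal sketch of cross-modal attention: text tokens attend over image
# regions. All dimensions and shapes here are illustrative assumptions.
import torch
import torch.nn.functional as F
from torch import nn

class CrossModalAttention(nn.Module):
    def __init__(self, text_dim: int, image_dim: int, attn_dim: int):
        super().__init__()
        self.q = nn.Linear(text_dim, attn_dim)   # queries from text tokens
        self.k = nn.Linear(image_dim, attn_dim)  # keys from image regions
        self.v = nn.Linear(image_dim, attn_dim)  # values from image regions

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, n_tokens, text_dim)
        # image_feats: (batch, n_regions, image_dim)
        q, k, v = self.q(text_feats), self.k(image_feats), self.v(image_feats)
        # Scaled dot-product scores: each text token scores every image region.
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        weights = F.softmax(scores, dim=-1)  # attention weights over regions
        return weights @ v                   # (batch, n_tokens, attn_dim)

# Dummy usage: 32 text tokens attending over 49 image-region features.
attn = CrossModalAttention(text_dim=768, image_dim=2048, attn_dim=512)
fused = attn(torch.randn(2, 32, 768), torch.randn(2, 49, 2048))
print(fused.shape)  # torch.Size([2, 32, 512])
```

The fused output can then feed a task head (e.g., a classifier or answer decoder), which is the general fusion recipe behind the vision-language tasks the survey covers.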