Attention-Based Multimodal Deep Learning on Vision-Language Data: Models, Datasets, Tasks, Evaluation Metrics and Applications

Cited by: 3
Authors
Bose, Priyankar [1]
Rana, Pratip [1, 2]
Ghosh, Preetam [1]
Affiliations
[1] Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA 23284 USA
[2] Bennett Aerosp, Raleigh, NC 27603 USA
Keywords
Task analysis; Data models; Deep learning; Transformers; Visualization; Training; Surveys; Question answering (information retrieval); Image segmentation; Image texture analysis; Attention mechanism; data fusion; multimodal learning; vision-language classification; vision-language question-answering; vision-language segmentation; IMAGE; NETWORK;
DOI
10.1109/ACCESS.2023.3299877
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Multimodal learning has gained immense popularity due to the explosive growth in the volume of image and textual data in various domains. Heterogeneous vision-language multimodal data has been used to solve a variety of tasks, including classification, image segmentation, image captioning, and question answering. Consequently, several attention-based deep learning approaches have been proposed for image-text multimodal data. In this paper, we highlight the current status of attention-based deep learning approaches on vision-language multimodal data by presenting a detailed description of the existing models, their performance, and the variety of evaluation metrics used therein. We revisit the various attention mechanisms applied to image-text multimodal data from their inception in 2015 through 2022, considering a total of 75 articles for the survey. Our comprehensive discussion also encompasses the current tasks, datasets, application areas, and future directions in this domain. This is the very first attempt to discuss the vast scope of attention-based deep learning mechanisms on image-text multimodal data.
Pages: 80624-80646
Number of pages: 23