Attention-Based Multimodal Deep Learning on Vision-Language Data: Models, Datasets, Tasks, Evaluation Metrics and Applications

Cited by: 3

Authors:

Bose, Priyankar [1]

Rana, Pratip [1,2]

Ghosh, Preetam [1]

Affiliations:

[1] Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA 23284 USA

[2] Bennett Aerosp, Raleigh, NC 27603 USA

Source:

IEEE ACCESS | 2023 / Vol. 11

Keywords:

Task analysis; Data models; Deep learning; Transformers; Visualization; Training; Surveys; Question answering (information retrieval); Image segmentation; Image texture analysis; Attention mechanism; data fusion; multimodal learning; vision-language classification; vision-language question-answering; vision-language segmentation; IMAGE; NETWORK;

DOI:

10.1109/ACCESS.2023.3299877

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Multimodal learning has gained immense popularity due to the explosive growth in the volume of image and textual data in various domains. Heterogeneous vision-language multimodal data has been used to solve a variety of tasks, including classification, image segmentation, image captioning, and question answering. Consequently, several attention-based deep learning approaches have been proposed for image-text multimodal data. In this paper, we highlight the current status of attention-based deep learning approaches on vision-language multimodal data by presenting a detailed description of the existing models, their performance, and the variety of evaluation metrics used therein. We revisited the various attention mechanisms on image-text multimodal data from their inception in 2015 through 2022 and considered a total of 75 articles for the survey. Our comprehensive discussion also encompasses the current tasks, datasets, application areas, and future directions in this domain. This is the very first attempt to discuss the vast scope of attention-based deep learning mechanisms on image-text multimodal data.

Pages: 80624-80646

Page count: 23

