A Review on Methods and Applications in Multimodal Deep Learning

Citations: 64
Authors
Jabeen, Summaira [1 ]
Li, Xi [1 ,2 ,3 ,4 ]
Amin, Muhammad Shoib [5 ]
Bourahla, Omar [1 ]
Li, Songyuan [1 ]
Jabbar, Abdul [1 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
[2] Zhejiang Univ, Shanghai Inst Adv Study, Shanghai 201203, Peoples R China
[3] Zhejiang Singapore Innovat & AI Joint Res Lab, Shanghai 201203, Peoples R China
[4] Shanghai AI Lab, Shanghai 201203, Peoples R China
[5] East China Normal Univ, Sch Software Engn, 3663 North Zhongshan Rd, Shanghai, Peoples R China
Funding
National Science Foundation (USA); National Natural Science Foundation of China;
Keywords
Deep learning; multimedia; multimodal learning; datasets; neural networks; survey; IMAGE CAPTION GENERATION; SEMANTIC ATTENTION; EMOTION DETECTION; VISUAL-ATTENTION; REPRESENTATION; DATABASE; NETWORK; FUSION; MODELS; CORPUS;
DOI
10.1145/3545572
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Deep learning has been applied to a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information from various modalities. Despite the extensive development of unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning helps to better understand and analyze information when various senses are engaged in its processing. This article focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological signals, flow, RGB, pose, depth, mesh, and point cloud. A detailed analysis of the baseline approaches and an in-depth study of recent advancements in multimodal deep learning applications during the past five years (2017 to 2021) are provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in greater depth. Finally, the main issues of each domain are highlighted separately, along with their possible future research directions.
Pages: 41
Related Papers
172 in total
[1]   Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning [J].
Aafaq, Nayyer ;
Akhtar, Naveed ;
Liu, Wei ;
Gilani, Syed Zulqarnain ;
Mian, Ajmal .
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, :12479-12488
[2]   Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering [J].
Agrawal, Aishwarya ;
Batra, Dhruv ;
Parikh, Devi ;
Kembhavi, Aniruddha .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :4971-4980
[3]   Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering [J].
Anderson, Peter ;
He, Xiaodong ;
Buehler, Chris ;
Teney, Damien ;
Johnson, Mark ;
Gould, Stephen ;
Zhang, Lei .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :6077-6086
[4]  
[Anonymous], 2011, NEURAL INFORM PROCES
[5]  
[Anonymous], 2014, T ASSOC COMPUT LING
[6]  
[Anonymous], 2011, ACL
[7]  
[Anonymous], 2006, 22 INT C DAT ENG WOR, DOI 10.1109/ICDEW.2006.145
[8]   VQA: Visual Question Answering [J].
Antol, Stanislaw ;
Agrawal, Aishwarya ;
Lu, Jiasen ;
Mitchell, Margaret ;
Batra, Dhruv ;
Zitnick, C. Lawrence ;
Parikh, Devi .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :2425-2433
[9]  
Arik SÖ, 2017, ADV NEUR IN, V30
[10]  
Arik SO, 2017, PR MACH LEARN RES, V70