Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Cited by: 231
Authors
Zhang, Chao [1 ,2 ]
Yang, Zichao [3 ]
He, Xiaodong [2 ]
Deng, Li [4 ]
Affiliations
[1] Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England
[2] JD Com Inc, JD AI Res, Beijing 100101, Peoples R China
[3] Citadel LLC, Chicago, IL 60603 USA
[4] Citadel Amer, Seattle, WA 98121 USA
Keywords
Task analysis; Visualization; Machine learning; Training; Semantics; Natural language processing; Multimodality; representation; multimodal fusion; deep learning; embedding; speech; vision; natural language; caption generation; text-to-image generation; visual question answering; visual reasoning; END FACTOR-ANALYSIS; NEURAL-NETWORKS; MEMORY NETWORKS; SPEAKER; GENERATION; PRIVACY; TEXT
DOI
10.1109/JSTSP.2020.2987728
CLC Classification Numbers
TM [Electrical Engineering]; TN [Electronics and Communications Technology]
Subject Classification Codes
0808; 0809
Abstract
Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in its input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. The main focus of this review is the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent works on multimodal deep learning from three perspectives: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. Regarding multimodal representation learning, we review the key concepts of embedding, which unify multimodal signals into a single vector space and thereby enable cross-modality signal processing. We also review the properties of many types of embeddings that are constructed and learned for general downstream tasks. Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task. Regarding applications, selected areas of broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. We believe that this review will facilitate future studies in the emerging field of multimodal intelligence for related communities.
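
As a concrete illustration of the embedding and fusion ideas summarized in the abstract (this sketch is not taken from the paper itself), the following minimal Python/NumPy example projects hypothetical image and text feature vectors into a shared embedding space, scores their cross-modal similarity, and concatenates the two embeddings as a simple fusion baseline. All dimensions, matrices, and variable names are illustrative assumptions; in practice the projections are learned encoders trained end to end, for example with ranking or contrastive objectives.

    # Illustrative sketch: shared embedding space and simple fusion of two modalities.
    import numpy as np

    rng = np.random.default_rng(0)
    d_img, d_txt, d_joint = 2048, 768, 256   # hypothetical feature sizes

    # Stand-ins for learned projection matrices (trained jointly in real systems).
    W_img = rng.standard_normal((d_img, d_joint)) / np.sqrt(d_img)
    W_txt = rng.standard_normal((d_txt, d_joint)) / np.sqrt(d_txt)

    def embed(x, W):
        # Project a unimodal feature vector into the joint space and L2-normalize it.
        z = x @ W
        return z / np.linalg.norm(z)

    image_feat = rng.standard_normal(d_img)   # e.g., pooled CNN features
    text_feat = rng.standard_normal(d_txt)    # e.g., sentence-encoder output

    z_img = embed(image_feat, W_img)
    z_txt = embed(text_feat, W_txt)

    # Cross-modal similarity: a cosine score usable for retrieval or ranking losses.
    similarity = float(z_img @ z_txt)

    # Simple fusion baseline: concatenate the joint-space embeddings so a
    # downstream head (e.g., a VQA classifier) can consume a single vector.
    fused = np.concatenate([z_img, z_txt])
    print(similarity, fused.shape)

The concatenation step stands in for the more specialized fusion architectures (attention-based, bilinear, etc.) that the review surveys; the shared-space projection stands in for the multimodal embedding methods it discusses.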
Citation
Pages: 478-493
Number of pages: 16