Visual language integration: A survey and open challenges

Cited by: 6
Authors
Park, Sang-Min [1]
Kim, Young-Gab [2,3]
Affiliations
[1] Korea Univ, Dept Comp Sci & Engn, Seoul 02841, South Korea
[2] Sejong Univ, Dept Comp & Informat Secur, Seoul 05006, South Korea
[3] Sejong Univ, Convergence Engn Intelligent Drone, Seoul 05006, South Korea
Funding
National Research Foundation, Singapore
Keywords
Multimodal learning; Multi-task learning; End-to-end learning; Embodiment; Visual language interaction; Intrinsic motivation; Episodic memory; Fusion; Attention; Framework; Network; Level
DOI
10.1016/j.cosrev.2023.100548
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
With the recent development of deep learning technology, artificial intelligence (AI) models have come into wide use across various domains. AI performs well on definite-purpose tasks such as image recognition and text classification, and recognition on individual tasks has become more accurate than feature-engineering-based approaches, enabling work that could not be done before. In addition, with the development of generation technology (e.g., GPT-3), AI models show stable performance on both recognition and generation tasks. However, few studies have focused on how to integrate these models efficiently to achieve comprehensive human interaction. Each model grows in size as its performance improves, consequently requiring more computing power and more complicated training designs than before. This requirement increases the complexity of each model and demands more paired data, making model integration difficult. This study surveys visual language integration with a hierarchical approach, reviewing recent trends in AI models that research communities have already explored as interaction components. We also compare the strengths of existing AI models and integration approaches and the limitations they face, and we discuss current related issues and the research needed for visual language integration. More specifically, we identify four aspects of visual language integration models: multimodal learning, multi-task learning, end-to-end learning, and embodiment for embodied visual language interaction. Finally, we discuss some open issues and challenges and conclude the survey with possible future directions. (c) 2023 Elsevier Inc. All rights reserved.
Pages: 28