Visual language integration: A survey and open challenges

Cited by: 6
Authors
Park, Sang-Min [1]
Kim, Young-Gab [2,3]
Affiliations
[1] Korea Univ, Dept Comp Sci & Engn, Seoul 02841, South Korea
[2] Sejong Univ, Dept Comp & Informat Secur, Seoul 05006, South Korea
[3] Sejong Univ, Convergence Engn Intelligent Drone, Seoul 05006, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Multimodal learning; Multi-task learning; End-to-end learning; Embodiment; Visual language interaction; Intrinsic motivation; Episodic memory; Fusion; Attention; Framework; Network; Level;
DOI
10.1016/j.cosrev.2023.100548
CLC Number
TP [Automation Technology, Computer Technology];
Subject Classification Number
0812;
Abstract
With the recent development of deep learning technology comes the wide use of artificial intelligence (AI) models in various domains. AI shows good performance on definite-purpose tasks, such as image recognition and text classification. Recognition performance on individual tasks has surpassed that of hand-crafted feature engineering, enabling work that could not be done before. In addition, with the development of generation technology (e.g., GPT-3), AI models show stable performance on both recognition and generation tasks. However, few studies have focused on how to integrate these models efficiently to achieve comprehensive human interaction. Each model grows in size as its performance improves, consequently requiring more computing power and a more complicated training design than before. This requirement increases the complexity of each model and demands more paired data, making model integration difficult. This study surveys visual language integration, taking a hierarchical approach to review recent trends in AI models studied by research communities as interaction components. We also compare the strengths of existing AI models and integration approaches and the limitations they face. Furthermore, we discuss current related issues and the research needed for visual language integration. More specifically, we identify four aspects of visual language integration models: multimodal learning, multi-task learning, end-to-end learning, and embodiment for embodied visual language interaction. Finally, we discuss some current open issues and challenges and conclude our survey with possible future directions. (c) 2023 Elsevier Inc. All rights reserved.
Pages: 28
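To make the "multimodal learning" aspect named in the abstract concrete, below is a minimal sketch of visual-language fusion in PyTorch. It assumes late fusion by concatenation of pre-extracted image and text features; the class name, feature dimensions, and fusion strategy are illustrative assumptions, not the method of the surveyed paper.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy visual-language model: project image and text features into a
    shared space, concatenate them, and classify. All sizes are illustrative."""

    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, num_classes=10):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)  # e.g., pooled CNN features
        self.txt_proj = nn.Linear(txt_dim, hidden)  # e.g., transformer [CLS] vector
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, num_classes),  # classify the fused representation
        )

    def forward(self, img_feat, txt_feat):
        # Late fusion: concatenate the two projected modalities.
        fused = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.head(fused)

# Usage with random stand-in features; in practice these would come from
# pretrained vision and language encoders.
model = LateFusionClassifier()
img = torch.randn(4, 2048)
txt = torch.randn(4, 768)
logits = model(img, txt)  # shape: (4, 10)
```

Concatenation is only one of the fusion strategies the survey's "multimodal learning" category covers; attention-based cross-modal fusion is a common alternative when token-level alignment between the modalities matters.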