Visual language integration: A survey and open challenges

Cited by: 6
Authors
Park, Sang-Min [1]
Kim, Young-Gab [2,3]
Affiliations
[1] Korea Univ, Dept Comp Sci & Engn, Seoul 02841, South Korea
[2] Sejong Univ, Dept Comp & Informat Secur, Seoul 05006, South Korea
[3] Sejong Univ, Convergence Engn Intelligent Drone, Seoul 05006, South Korea
Funding
National Research Foundation, Singapore
Keywords
Multimodal learning; Multi-task learning; End-to-end learning; Embodiment; Visual language interaction; Intrinsic motivation; Episodic memory; Fusion; Attention; Framework; Network; Level
DOI
10.1016/j.cosrev.2023.100548
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
With the recent development of deep learning technology, artificial intelligence (AI) models have come into wide use across various domains. AI performs well on definite-purpose tasks such as image recognition and text classification, and recognition on individual tasks has become more accurate than feature-engineering-based approaches, enabling work that could not be done before. In addition, with the development of generation technology (e.g., GPT-3), AI models show stable performance on both recognition and generation tasks. However, few studies have focused on how to integrate these models efficiently to achieve comprehensive human interaction. Each model grows in size as its performance improves, consequently requiring more computing power and more complicated training designs than before. This requirement increases the complexity of each model and demands more paired data, making model integration difficult. This study surveys visual language integration with a hierarchical approach, reviewing recent trends in AI models that research communities have already explored as interaction components. We also compare the strengths of existing AI models and integration approaches and the limitations they face, and we discuss current related issues and the research needed for visual language integration. More specifically, we identify four aspects of visual language integration models: multimodal learning, multi-task learning, end-to-end learning, and embodiment for embodied visual language interaction. Finally, we discuss some open issues and challenges and conclude the survey with possible future directions. (c) 2023 Elsevier Inc. All rights reserved.
Pages: 28