Evolution and Prospects of Foundation Models: From Large Language Models to Large Multimodal Models

Cited by: 16

Authors:

Chen, Zheyi [1]

Xu, Liuchang [1]

Zheng, Hongting [1]

Chen, Luyao [1]

Tolba, Amr [2,3]

Zhao, Liang [4]

Yu, Keping [5]

Feng, Hailin [1]

Affiliations:

[1] Zhejiang A&F Univ, Coll Math & Comp Sci, Hangzhou 311300, Peoples R China

[2] King Saud Univ, Community Coll, Comp Sci Dept, Riyadh 11437, Saudi Arabia

[3] Menoufia Univ, Fac Sci, Math & Comp Sci Dept, Shibin Al Kawm 32511, Menoufia Govern, Egypt

[4] Shenyang Aerosp Univ, Sch Comp Sci, Shenyang 110136, Peoples R China

[5] Hosei Univ, Grad Sch Sci & Engn, Tokyo 1848584, Japan

Source:

CMC-COMPUTERS MATERIALS & CONTINUA | 2024 / Vol. 80 / No. 02

Keywords:

Artificial intelligence; large language models; large multimodal models; foundation models

DOI:

10.32604/cmc.2024.052618

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Since the 1950s, when the Turing Test was introduced, there has been notable progress in machine language intelligence. Language modeling, crucial for AI development, has evolved from statistical to neural models over the last two decades. Recently, transformer-based Pre-trained Language Models (PLMs) have excelled in Natural Language Processing (NLP) tasks by leveraging large-scale training corpora. Increasing the scale of these models enhances performance significantly and introduces abilities, such as in-context learning, that smaller models lack. The advancement of Large Language Models (LLMs), exemplified by the development of ChatGPT, has had significant academic and industrial impact and has captured widespread societal interest. This survey provides an overview of the development and prospects of foundation models, from LLMs to Large Multimodal Models (LMMs). It first discusses the contributions and technological advancements of LLMs in natural language processing, especially in text generation and language understanding. It then turns to LMMs, which integrate data modalities such as text, images, and sound, demonstrating advanced capabilities in understanding and generating cross-modal content and paving new pathways for the adaptability and flexibility of AI systems. Finally, the survey highlights the prospects of LMMs in terms of technological development and application potential, while also pointing out challenges in data integration and cross-modal understanding accuracy, providing a comprehensive perspective on the latest developments in this field.

Citation

Pages: 1753-1808

Page count: 56
