TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model

Cited by: 4
Authors
Chen, Yunkai [1 ]
Wang, Qimeng [2 ]
Wu, Shiwei [1 ]
Gao, Yan [2 ]
Xu, Tong [1 ]
Hu, Yao [2 ]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
[2] Xiaohongshu Inc, Beijing, Peoples R China
Keywords
Multi-modal; large language model; text-only training
DOI
10.1145/3654674
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Multi-modal large language models (MLLMs), such as GPT-4, exhibit great comprehension capabilities on human instructions, as well as zero-shot ability on new downstream multi-modal tasks. To integrate the different modalities within a unified embedding space, previous MLLMs attempted to conduct visual instruction tuning with massive, high-quality image-text pair data, which requires substantial costs in data collection and training resources. In this article, we propose TOMGPT (Text-Only training Multi-modal GPT), a cost-effective MLLM tuned solely on easily accessible text data with much fewer resources. Building on a pre-trained visual-linguistic coupled modality space (e.g., the CLIP and ALIGN models), a text-only training strategy is devised to further project the aligned multi-modal latent space to that of the LLM, endowing the LLM with visual comprehension capabilities in an efficient manner. Instead of the enormous image-text training data required by previous MLLMs, we find that TOMGPT can be well tuned with a smaller yet diverse set of GPT-generated free-form text data, as we establish the semantic connection between the LLM and the pre-trained vision-language model. A quantitative evaluation is conducted on both MME and LVLM, two recently released and extensively utilized MLLM benchmarks. The experiments reveal that TOMGPT achieves reliable performance compared to numerous models trained on large amounts of image-text pair data. Case studies are also presented, demonstrating TOMGPT's broad understanding and dialogue capabilities across diverse image categories.
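The text-only strategy summarized in the abstract can be illustrated with a minimal PyTorch sketch: a small projector is trained to map CLIP/ALIGN text embeddings into the LLM's token-embedding space using a next-token loss on the caption itself, and at inference the image tower of the same coupled space replaces the text tower. All module names, dimensions, and the prefix-token design below are illustrative assumptions, not the authors' released code.

    import torch
    import torch.nn as nn

    class ModalityProjector(nn.Module):
        """Maps a CLIP/ALIGN joint-space embedding to a short sequence of
        pseudo-token embeddings in the LLM's input space (dims assumed)."""
        def __init__(self, clip_dim=768, llm_dim=4096, prefix_len=8):
            super().__init__()
            self.prefix_len, self.llm_dim = prefix_len, llm_dim
            self.proj = nn.Sequential(
                nn.Linear(clip_dim, llm_dim * prefix_len),
                nn.GELU(),
            )

        def forward(self, joint_emb):                  # (batch, clip_dim)
            out = self.proj(joint_emb)                 # (batch, llm_dim * prefix_len)
            return out.view(-1, self.prefix_len, self.llm_dim)

    # Training (text only): encode each GPT-generated caption with the frozen
    # *text* tower of CLIP/ALIGN, project it, prepend the result to the LLM's
    # token embeddings, and optimize only the projector with the usual
    # next-token loss on that same caption.
    # Inference: encode the image with the frozen *image* tower instead; since
    # the two towers share a coupled space, the trained projector transfers.
    projector = ModalityProjector()
    caption_emb = torch.randn(4, 768)   # stand-in for clip_model.encode_text(...)
    prefix = projector(caption_emb)     # (4, 8, 4096), ready to prepend
    print(prefix.shape)

Freezing both the vision-language model and the LLM and updating only a projection module of this kind is what would make such an approach cheap in data and compute, consistent with the cost-effectiveness claim in the abstract.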
Pages: 19
  • [10] Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171