Self-Improving Teacher Cultivates Better Student: Distillation Calibration for Multimodal Large Language Models

Cited: 0
Authors
Li, Xinwei [1 ]
Lin, Li [1 ]
Wang, Shuai [1 ]
Qian, Chen [2 ]
Affiliations
[1] Southeast Univ, Nanjing, Peoples R China
[2] Tsinghua Univ, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024 | 2024
Keywords
multimodal reasoning; knowledge distillation; large language models;
DOI
10.1145/3626772.3657692
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal content generation, which leverages visual information to enhance cross-modal understanding, plays a critical role in Multimodal Information Retrieval. With the development of large language models (LLMs), recent research has adopted visual instruction tuning to inject the knowledge of LLMs into downstream multimodal tasks. Their high complexity and heavy resource demands urge researchers to study efficient distillation solutions that transfer knowledge from pre-trained multimodal models (teachers) to more compact student models. However, instruction tuning for knowledge distillation in multimodal LLMs is resource-intensive and capability-restricted, and the student's comprehension remains highly reliant on the teacher model. To address this issue, we propose a novel Multimodal Distillation Calibration framework (MmDC). The main idea is to generate high-quality training instances that challenge the student's comprehension and prompt the teacher to calibrate the knowledge transferred to the student, ultimately cultivating a better student model on downstream tasks. The framework comprises two stages: (1) multimodal alignment and (2) knowledge distillation calibration. In the first stage, parameter-efficient fine-tuning is used to enhance feature alignment between different modalities. In the second stage, we develop a calibration strategy that assesses the student model's capability and generates high-quality instances to calibrate knowledge distillation from teacher to student. Experiments on diverse datasets show that our framework efficiently improves the student model's capabilities. Our 7B-size student model, after three iterations of distillation calibration, outperforms the current state-of-the-art LLaVA-13B model on the ScienceQA and LLaVA Test datasets and also exceeds other strong baselines in a zero-shot setting.
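The abstract describes the distillation-calibration loop only at a high level. The sketch below is a minimal, hypothetical illustration of stage (2), assuming the student's own confidence scores are what flag "challenging" instances and that the teacher regenerates targets for exactly those instances; all class and function names (Model, Instance, hard_instances, calibrate, distillation_calibration) are placeholders and not the authors' implementation.

```python
# Hypothetical sketch of the iterative distillation-calibration loop
# (stage 2 in the abstract). Placeholder names throughout; a real system
# would plug in multimodal LLMs and parameter-efficient fine-tuning.

import random
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    image: str        # identifier of the visual input
    question: str     # textual instruction / query
    answer: str = ""  # target response, filled in by the teacher


class Model:
    """Stand-in for a multimodal LLM (teacher or student)."""

    def answer(self, inst: Instance) -> str:
        # A real model would run multimodal inference here.
        return "stub answer"

    def confidence(self, inst: Instance) -> float:
        # A real model might return mean token log-likelihood; stubbed here.
        return random.random()

    def finetune(self, data: List[Instance]) -> None:
        # A real implementation would run parameter-efficient fine-tuning.
        pass


def hard_instances(student: Model, pool: List[Instance], threshold: float) -> List[Instance]:
    """Select instances the student is least confident about (the challenging cases)."""
    return [inst for inst in pool if student.confidence(inst) < threshold]


def calibrate(teacher: Model, instances: List[Instance]) -> List[Instance]:
    """Have the teacher regenerate (calibrate) targets for the selected instances."""
    return [Instance(i.image, i.question, teacher.answer(i)) for i in instances]


def distillation_calibration(teacher: Model, student: Model,
                             pool: List[Instance],
                             iterations: int = 3,
                             threshold: float = 0.5) -> Model:
    for _ in range(iterations):
        hard = hard_instances(student, pool, threshold)  # assess student capability
        calibrated = calibrate(teacher, hard)            # teacher calibrates the transfer set
        student.finetune(calibrated)                     # distill on calibrated instances
    return student


if __name__ == "__main__":
    candidate_pool = [Instance(f"img_{k}.jpg", f"question {k}") for k in range(10)]
    distillation_calibration(Model(), Model(), candidate_pool)
```

The three-iteration default mirrors the abstract's statement that the 7B student is improved over three rounds of distillation calibration; the confidence-based selection rule is an assumption for illustration.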
Pages: 882-892
Number of pages: 11