A Method for Efficient Structured Data Generation with Large Language Models

Authors
Hou, Zongzhi [1]
Zhao, Ruohan [1]
Li, Zhongyang [1]
Wang, Zheng [1]
Wu, Yizhen [1]
Gou, Junwei [1]
Zhu, Zhifeng [1]
Affiliations
[1] Huawei, Shanghai, People's Republic of China
Source
PROCEEDINGS OF THE 2ND WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM(CUBE)A 2024 | 2024
Keywords
Multi-modality; Data Generation; Artificial Intelligence; Large Language Model
DOI
10.1145/3688866.3689127
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the rapid advancement of large language model technology, the data used to train these models has become increasingly important. Text samples produced by large unsupervised models are often of inadequate quality, which leads to unsatisfactory results. This inadequacy arises from the model's limited capacity to capture the underlying structure of the data without direct supervision, so its outputs may lack fidelity and relevance to the authentic data distribution. To overcome these shortcomings in training data generation for specific language generation tasks, this paper proposes the Fast Data Generation System (FDGS), which handles multi-modal and structured data generation. FDGS uses clustering abstraction to handle multiple input data types through templates, which allows quick data generation and reduces resource consumption. FDGS is robust, delivering stable and reliable performance under various conditions, and it uses fewer tokens than traditional methods that work on a per-instance basis without templates. By abstracting and clustering different input types, FDGS can efficiently generate data from large models. The system is highly adaptable, which makes it well suited to multi-modal data generation tasks. It relies on the basic functions of general-purpose large language models and employs a query-answer bidirectional generation mechanism to achieve fast data amplification.
Pages: 36-44
Number of pages: 9