A Method for Efficient Structured Data Generation with Large Language Models

Cited by: 0

Authors:

Hou, Zongzhi [1]

Zhao, Ruohan [1]

Li, Zhongyang [1]

Wang, Zheng [1]

Wu, Yizhen [1]

Gou, Junwei [1]

Zhu, Zhifeng [1]

Affiliation:

[1] Huawei, Shanghai, People's Republic of China

Source:

PROCEEDINGS OF THE 2ND WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM(CUBE)A 2024 | 2024

Keywords:

Multi-modality; Data Generation; Artificial Intelligence; Large Language Model;

DOI:

10.1145/3688866.3689127

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

With the rapid advancement of large language model technology, the data used to train these models has become increasingly important. Text samples produced by large unsupervised models are often of inadequate quality, leading to poor results. This inadequacy arises from the model's limited capacity to emulate the underlying structure of the data without direct supervision, so its outputs may lack fidelity and relevance to the authentic data distribution. To overcome these shortcomings in training data generation for specific language generation tasks, this paper proposes the Fast Data Generation System (FDGS), which can handle multi-modal and structured data generation. FDGS uses clustering abstraction to handle multiple input data types through templates, which allows quick data generation and reduces consumption. FDGS is robust, giving stable and reliable performance under various conditions. It uses fewer tokens, and is therefore more cost-effective, than traditional methods that work on a per-instance basis without templates. By abstracting and clustering different input types, FDGS can efficiently generate data from large models. The system is highly adaptable, which makes it well suited to multi-modal data generation tasks. It relies on the basic functions of general-purpose large language models and employs a query-answer bidirectional generation mechanism to achieve fast data amplification.


Pages: 36-44

Page count: 9
