A Method for Efficient Structured Data Generation with Large Language Models

Cited by: 0
Authors
Hou, Zongzhi [1 ]
Zhao, Ruohan [1 ]
Li, Zhongyang [1 ]
Wang, Zheng [1 ]
Wu, Yizhen [1 ]
Gou, Junwei [1 ]
Zhu, Zhifeng [1 ]
Affiliations
[1] Huawei, Shanghai, People's Republic of China
Source
PROCEEDINGS OF THE 2ND WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM³A 2024 | 2024
Keywords
Multi-modality; Data Generation; Artificial Intelligence; Large Language Model
DOI
10.1145/3688866.3689127
CLC Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
With the rapid advancement of large language model technology, the data used to train these models has become increasingly important. Text samples produced by large models without supervision are often of inadequate quality: lacking direct supervision, a model cannot precisely reproduce the underlying structure of the data, so its outputs may lack fidelity and relevance to the authentic data distribution. To overcome these shortcomings in training-data generation for specific language generation tasks, this paper proposes a Fast Data Generation System (FDGS) that handles multi-modal and structured data generation. FDGS uses clustering abstraction to handle multiple input types through shared templates, which speeds up generation and reduces token consumption. It is robust, delivering stable and reliable performance under varied conditions, and is more token-efficient than traditional per-instance methods that do not use templates. By abstracting and clustering different input types, FDGS can generate data from large models efficiently, and its adaptability makes it well suited to multi-modal data generation tasks. It relies only on the basic capabilities of general-purpose large language models and employs a query-answer bidirectional generation mechanism to achieve rapid data amplification.
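The paper itself publishes no code, so the following Python sketch is only an illustrative reconstruction, under assumptions, of the two mechanisms the abstract names: one prompt template shared per input-type cluster (the claimed token saving over per-instance prompting) and query-answer bidirectional generation (source to question, then question to answer). All identifiers here (TEMPLATES, call_llm, amplify) are hypothetical, not FDGS's actual API.

from collections import defaultdict

# Hypothetical templates: one shared prompt per input-type cluster, so the
# instructions are written once per cluster rather than once per instance.
TEMPLATES = {
    "table": "Given the table:\n{sample}\nWrite one question this table answers.",
    "text": "Given the passage:\n{sample}\nWrite one question this passage answers.",
}
ANSWER_TEMPLATE = (
    "Answer the question using only the source.\n"
    "Source:\n{sample}\nQuestion: {question}\nAnswer:"
)

def call_llm(prompt: str) -> str:
    """Stand-in for any general-purpose LLM API; replace with a real client."""
    raise NotImplementedError("plug in an LLM client here")

def amplify(samples):
    """Cluster (modality, sample) pairs by modality, then run the
    bidirectional step per cluster: source -> question, question -> answer."""
    clusters = defaultdict(list)
    for modality, sample in samples:
        clusters[modality].append(sample)

    pairs = []
    for modality, items in clusters.items():
        template = TEMPLATES[modality]  # one shared template per cluster
        for sample in items:
            question = call_llm(template.format(sample=sample))
            answer = call_llm(ANSWER_TEMPLATE.format(sample=sample, question=question))
            pairs.append({"modality": modality, "question": question,
                          "answer": answer, "source": sample})
    return pairs

Each source sample thus yields a (question, answer) training pair at the cost of two model calls plus one short template per cluster, rather than a hand-written prompt per instance.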
Pages: 36-44
Number of pages: 9