Synthetic Data Generation Using Large Language Models: Advances in Text and Code

Times Cited: 0
Authors
Nadas, Mihai [1]
Diosan, Laura [1]
Tomescu, Andreea [2]
Affiliations
[1] Babes Bolyai Univ, Fac Math & Comp Sci, Dept Comp Sci, Cluj Napoca 400347, Romania
[2] KlusAI Labs, Cluj Napoca 400577, Romania
Keywords
Codes; Synthetic data; Surveys; Data models; Translation; Reviews; Tuning; Training; Synthetic data generation; large language models (LLMs); text data augmentation; code data synthesis; prompt engineering; instruction tuning; machine learning training data; natural language processing (NLP); code generation; reinforcement learning for code; automated data annotation; bias and fairness in synthetic data; retrieval-augmented generation (RAG); evaluation of synthetic data; model collapse in LLMs
DOI
10.1109/ACCESS.2025.3589503
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both the natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment or even substitute for real-world datasets, particularly when labeled data is scarce, expensive, or sensitive. This paper surveys recent advances in leveraging LLMs to create synthetic text and code, highlighting key techniques such as prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. We examine how these methods can enrich low-resource tasks (e.g., classification, question answering) and facilitate code-centric applications (e.g., instruction tuning, code translation, bug repair) through automated verification of functional correctness. Alongside potential benefits, including cost-effectiveness, broad coverage, and controllable diversity, we discuss the accompanying challenges: factual inaccuracies in generated text, insufficient stylistic or distributional realism, and risks of bias amplification. Proposed mitigation strategies range from filtering and weighting synthetic outputs to reinforcement learning with execution feedback in code domains. We conclude by outlining open research directions, such as automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks, underscoring the growing importance of LLM-generated synthetic data in accelerating AI development while emphasizing ethical and quality safeguards.
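To make the generation-and-verification loop described in the abstract concrete, the Python sketch below illustrates prompt-based synthesis of code samples filtered by automated execution checks. It is a minimal illustration under stated assumptions, not the surveyed papers' implementation: llm_complete() is a hypothetical stand-in for any completion API, and the task-specification format (name, description, tests) is invented for the example.

# Minimal sketch (not the paper's implementation) of prompt-based synthetic code
# generation with execution-based filtering. llm_complete() is a hypothetical
# placeholder for a real completion API.
import textwrap

def llm_complete(prompt: str) -> str:
    """Hypothetical LLM call; replace with an actual API client."""
    raise NotImplementedError("wire up a real model here")

PROMPT_TEMPLATE = textwrap.dedent("""\
    Write a Python function named {name} that {description}.
    Return only the function definition, with no explanation.
    """)

def verify_by_execution(candidate_src: str, name: str, tests: list) -> bool:
    """Keep a candidate only if it defines `name` and passes all I/O checks."""
    namespace = {}
    try:
        exec(candidate_src, namespace)          # run the generated definition
        fn = namespace[name]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False                            # any failure discards the sample

def synthesize_dataset(specs: list, samples_per_spec: int = 4) -> list:
    """Sample several candidates per task and retain only verified ones."""
    dataset = []
    for spec in specs:
        prompt = PROMPT_TEMPLATE.format(name=spec["name"],
                                        description=spec["description"])
        for _ in range(samples_per_spec):
            candidate = llm_complete(prompt)
            if verify_by_execution(candidate, spec["name"], spec["tests"]):
                dataset.append({"prompt": prompt, "completion": candidate})
    return dataset

# Hypothetical task specification: executable tests act as the correctness filter.
example_specs = [{
    "name": "add",
    "description": "returns the sum of two integers",
    "tests": [((1, 2), 3), ((-1, 1), 0)],
}]

Filtering by execution feedback in this way mirrors the correctness-based selection the survey highlights for code data; the weighting and reinforcement-learning variants it discusses replace this binary keep/discard rule with softer reward signals.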
Pages: 134615-134633
Number of Pages: 19