Synthetic Text as Data: On Usefulness and Limitations

Cited by: 0
Authors
Choi, Seoyeon [1]
Sim, Jaein [2]
Choi, Guebin [2]
Affiliations
[1] Nanum Space Co Ltd, Div Res & Dev, Jeonju 54907, South Korea
[2] Jeonbuk Natl Univ, Inst Appl Stat, Dept Stat, Jeonju 54896, South Korea
Source
APPLIED SCIENCES-BASEL | 2025 / Vol. 15 / Issue 10
Funding
National Research Foundation, Singapore
Keywords
GPT-generated text; synthetic data; data augmentation; MBTI classification; class imbalance; few-shot learning; fine-grained classification; data granularity; large language models;
DOI
10.3390/app15105460
Chinese Library Classification (CLC) number
O6 [Chemistry]
Discipline classification code
0703
Abstract
This study investigates the utility of GPT-generated text as a training resource in supervised learning, focusing on two perspectives: its effectiveness as an augmentation tool in data-scarce or class-imbalanced settings and its potential as a substitute for human-written data. Using MBTI personality classification as a benchmark task, we conducted controlled experiments under both class imbalance and few-shot learning conditions. Results showed that GPT-generated text could improve classification performance when used to supplement underrepresented classes. However, when synthetic data fully replace real data, performance declines significantly, particularly in tasks requiring fine-grained semantic distinctions. Further analysis reveals that GPT outputs often capture only partial personality traits, enabling coarse-level classification but falling short in nuanced cases. These findings suggest that GPT-generated text can function as a conditional training resource, with its effectiveness closely tied to the granularity of the classification task.
Pages: 17