Predicting Missing Values in Survey Data Using Prompt Engineering for Addressing Item Non-Response

Cited: 0
Authors
Ji, Junyung [1 ]
Kim, Jiwoo [1 ]
Kim, Younghoon [1 ]
Affiliations
[1] Hanyang Univ Ansan, Dept Appl Artificial Intelligence, Ansan 15588, South Korea
Keywords
survey data; item non-response; large language models; prompt engineering; imputation
DOI
10.3390/fi16100351
CLC number
TP [automation technology, computer technology];
Discipline code
0812;
Abstract
Survey data play a crucial role in various research fields, including economics, education, and healthcare, by providing insights into human behavior and opinions. However, item non-response, where respondents fail to answer specific questions, presents a significant challenge by creating incomplete datasets that undermine data integrity and can hinder or even prevent accurate analysis. Traditional methods for addressing missing data, such as statistical imputation techniques and deep learning models, often fall short when dealing with the rich linguistic content of survey data. These approaches are also hampered by high time complexity for training and the need for extensive preprocessing or feature selection. In this paper, we introduce an approach that leverages Large Language Models (LLMs) through prompt engineering for predicting item non-responses in survey data. Our method combines the strengths of both traditional imputation techniques and deep learning methods with the advanced linguistic understanding of LLMs. By integrating respondent similarities, question relevance, and linguistic semantics, our approach enhances the accuracy and comprehensiveness of survey data analysis. The proposed method bypasses the need for complex preprocessing and additional training, making it adaptable, scalable, and capable of generating explainable predictions in natural language. We evaluated the effectiveness of our LLM-based approach through a series of experiments, demonstrating its competitive performance against established methods such as Multivariate Imputation by Chained Equations (MICE), MissForest, and deep learning models like TabTransformer. The results show that our approach not only matches but, in some cases, exceeds the performance of these methods while significantly reducing the time required for data processing.
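As a rough illustration of the prompt-engineering idea the abstract describes (combining respondent similarity with the question to be answered), a missing survey item can be framed as a fill-in prompt built from the most similar complete respondents. This is a minimal sketch, not the authors' actual pipeline: the similarity measure, field names, and prompt wording are all illustrative assumptions.

```python
# Sketch: compose an LLM imputation prompt for one missing survey item.
# The similarity metric and prompt wording are illustrative assumptions,
# not the method from the paper.

def overlap_similarity(a: dict, b: dict) -> int:
    """Count identically answered questions shared by two respondents."""
    return sum(1 for q in a if a[q] is not None and b.get(q) == a[q])

def build_imputation_prompt(target: dict, others: list, missing_q: str, k: int = 2) -> str:
    """Build a prompt from the k most similar respondents who answered missing_q."""
    neighbors = sorted(
        (r for r in others if r.get(missing_q) is not None),
        key=lambda r: overlap_similarity(target, r),
        reverse=True,
    )[:k]
    lines = ["Predict the missing survey answer.", "", "Similar respondents:"]
    for i, r in enumerate(neighbors, 1):
        context = "; ".join(f"{q}={v}" for q, v in r.items() if q != missing_q)
        lines.append(f"{i}. {missing_q}: {r[missing_q]} (other answers: {context})")
    known = "; ".join(f"{q}={v}" for q, v in target.items() if v is not None)
    lines += ["", f"Target respondent's known answers: {known}",
              f"Question to answer: {missing_q}"]
    return "\n".join(lines)
```

The resulting string would be sent to an LLM, whose free-text completion serves as the imputed value; because the prompt is plain natural language, the model can also be asked to justify its prediction, matching the explainability claim in the abstract.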
Pages: 19