Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study

Cited by: 118
Authors
Zack T. [1,2]
Lehman E. [3]
Suzgun M. [5,6]
Rodriguez J.A. [8]
Celi L.A. [4,10,11]
Gichoya J. [13]
Jurafsky D. [5,7]
Szolovits P. [3]
Bates D.W. [8,12]
Abdulnour R.-E.E. [9,14]
Butte A.J. [1,15]
Alsentzer E. [8,14]
Affiliations
[1] Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA
[2] Helen Diller Family Comprehensive Cancer Center, University of California San Francisco, San Francisco, CA
[3] Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
[4] Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA
[5] Department of Computer Science, Stanford University, Stanford, CA
[6] Stanford Law School, Stanford University, Stanford, CA
[7] Department of Linguistics, Stanford University, Stanford, CA
[8] Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA
[9] Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, MA
[10] Division of Pulmonary, Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, MA
[11] Department of Biostatistics, Harvard T H Chan School of Public Health, Boston, MA
[12] Department of Health Policy and Management, Harvard T H Chan School of Public Health, Boston, MA
[13] Department of Radiology, Emory University, Atlanta, GA
[14] Harvard Medical School, Boston, MA
[15] Center for Data-Driven Insights and Innovation, University of California, Office of the President, Oakland, CA
Source
The Lancet Digital Health | 2024, Volume 6, Issue 01
Funding
US National Science Foundation
Keywords
Diagnosis
DOI
10.1016/S2589-7500(23)00225-X
Abstract
Background: Large language models (LLMs) such as GPT-4 hold great promise as transformative tools in health care, ranging from automating administrative tasks to augmenting clinical decision making. However, these models also pose a danger of perpetuating biases and delivering incorrect medical diagnoses, which can have a direct, harmful impact on medical care. We aimed to assess whether GPT-4 encodes racial and gender biases that impact its use in health care.
Methods: Using the Azure OpenAI application programming interface, this model evaluation study tested whether GPT-4 encodes racial and gender biases and examined the impact of such biases on four potential applications of LLMs in the clinical domain, namely medical education, diagnostic reasoning, clinical plan generation, and subjective patient assessment. We conducted experiments with prompts designed to resemble typical use of GPT-4 within clinical and medical education applications. We used clinical vignettes from NEJM Healer and from published research on implicit bias in health care. GPT-4 estimates of the demographic distribution of medical conditions were compared with true US prevalence estimates. Differential diagnoses and treatment plans were evaluated across demographic groups using standard statistical tests for significance between groups.
Findings: We found that GPT-4 did not appropriately model the demographic diversity of medical conditions, consistently producing clinical vignettes that stereotype demographic presentations. The differential diagnoses created by GPT-4 for standardised clinical vignettes were more likely to include diagnoses that stereotype certain races, ethnicities, and genders. Assessments and plans created by the model showed significant associations between demographic attributes and recommendations for more expensive procedures, as well as differences in patient perception.
Interpretation: Our findings highlight the urgent need for comprehensive and transparent bias assessments of LLM tools such as GPT-4 for intended use cases before they are integrated into clinical care. We discuss the potential sources of these biases and potential mitigation strategies before clinical implementation.
Funding: Priscilla Chan and Mark Zuckerberg.
© 2024 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY 4.0 license.
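As an illustrative aside (not taken from the article itself): the kind of evaluation described above, prompting GPT-4 for clinical vignettes and comparing the demographic make-up of the generated cases against real-world prevalence, could be sketched roughly as follows. The deployment name, prompt wording, example condition, prevalence figures, and the crude keyword-based gender extraction are all placeholders of my own, not the authors' protocol; the client usage assumes the current `openai` Python SDK (>=1.0) against an Azure OpenAI endpoint.

```python
# Minimal sketch (not the authors' code): generate vignettes for a condition,
# tally the gender GPT-4 assigns, and compare the observed split with an
# assumed true prevalence using a chi-square test.
from collections import Counter

from openai import AzureOpenAI          # assumes openai>=1.0
from scipy.stats import chisquare

client = AzureOpenAI(
    api_key="YOUR_KEY",                                   # placeholder credentials
    api_version="2024-02-01",
    azure_endpoint="https://YOUR_RESOURCE.openai.azure.com",
)

def generate_vignette(condition: str) -> str:
    """Ask the model for a one-sentence patient presentation of `condition`."""
    response = client.chat.completions.create(
        model="gpt-4",                                     # deployment name is a placeholder
        messages=[{
            "role": "user",
            "content": f"Write a one-sentence clinical vignette of a patient "
                       f"presenting with {condition}. State the patient's gender.",
        }],
    )
    return response.choices[0].message.content

def gender_of(vignette: str) -> str:
    """Crude keyword-based extraction; a real study would be far more careful."""
    text = vignette.lower()
    return "female" if ("female" in text or "woman" in text) else "male"

# Generate N vignettes and compare the observed gender split with an assumed
# true prevalence (illustrative numbers only, not figures from the paper).
N = 100
counts = Counter(gender_of(generate_vignette("sarcoidosis")) for _ in range(N))
observed = [counts["female"], counts["male"]]
true_prevalence = [0.6, 0.4]                               # hypothetical split
expected = [p * N for p in true_prevalence]
stat, p_value = chisquare(observed, f_exp=expected)
print(f"observed={observed}, expected={expected}, p={p_value:.3f}")
```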
Pages: e12-e22
Number of pages: 10