Language discrepancies in the performance of generative artificial intelligence models: an examination of infectious disease queries in English and Arabic

Cited by: 3
Authors
Sallam, Malik [1, 2, 7]
Al-Mahzoum, Kholoud [3]
Alshuaib, Omaima [3]
Alhajri, Hawajer [3]
Alotaibi, Fatmah [3]
Alkhurainej, Dalal [3]
Al-Balwah, Mohammad Yahya [3]
Barakat, Muna [4, 5]
Egger, Jan [6]
Affiliations
[1] Univ Jordan, Sch Med, Dept Pathol Microbiol & Forens Med, Amman 11942, Jordan
[2] Lund Univ, Fac Med, Dept Translat Med, S-22184 Malmo, Sweden
[3] Univ Jordan, Sch Med, Amman 11942, Jordan
[4] Appl Sci Private Univ, Fac Pharm, Dept Clin Pharm & Therapeut, Amman 11931, Jordan
[5] Middle East Univ, MEU Res Unit, Amman 11831, Jordan
[6] Univ Med Essen AoR, Inst AI Med IKIM, Essen, Germany
[7] Jordan Univ Hosp, Dept Clin Labs & Forens Med, Queen Rania Al Abdullah St Aljubeiha,POB 13046, Amman, Jordan
Keywords
AI chatbots; Infectious diseases; Language performance; Healthcare technology; Digital health queries; HEALTH INFORMATION; CHATGPT; CARE;
DOI
10.1186/s12879-024-09725-y
Chinese Library Classification (CLC) number
R51 [Infectious diseases]
Subject classification code
100401
Abstract
Background: Assessment of artificial intelligence (AI)-based models across languages is crucial to ensure equitable access and accuracy of information in multilingual contexts. This study aimed to compare AI model efficiency in English and Arabic for infectious disease queries.

Methods: The study employed the METRICS checklist for the design and reporting of AI-based studies in healthcare. The AI models tested included ChatGPT-3.5, ChatGPT-4, Bing, and Bard. The queries comprised 15 questions on HIV/AIDS, tuberculosis, malaria, COVID-19, and influenza. The AI-generated content was assessed by two bilingual experts using the validated CLEAR tool.

Results: In comparing AI models' performance in English and Arabic for infectious disease queries, variability was noted. English queries showed consistently superior performance, with Bard leading, followed by Bing, ChatGPT-4, and ChatGPT-3.5 (P = .012). The same trend was observed in Arabic, albeit without statistical significance (P = .082). Stratified analysis revealed higher scores for English in most CLEAR components, notably in completeness, accuracy, appropriateness, and relevance, especially with ChatGPT-3.5 and Bard. Across the five infectious disease topics, English outperformed Arabic, except for flu queries in Bing and Bard. The four AI models' performance in English was rated as "excellent", significantly outperforming their "above-average" Arabic counterparts (P = .002).

Conclusions: Disparity in AI model performance was noticed between English and Arabic in response to infectious disease queries. This language variation can negatively impact the quality of health content delivered by AI models among native speakers of Arabic. This issue is recommended to be addressed by AI developers, with the ultimate goal of enhancing health outcomes.
Pages: 13