Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data

Cited by: 28
Authors
Xu, Xuhai [1,2]
Yao, Bingsheng [3]
Dong, Yuanzhe [4]
Gabriel, Saadia [1]
Yu, Hong [5]
Hendler, James [3]
Ghassemi, Marzyeh [1]
Dey, Anind K.
Wang, Dakuo [2,6]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
[2] Univ Washington, Seattle, WA 98195 USA
[3] Rensselaer Polytechn Inst, Rensselaer, NY USA
[4] Stanford Univ, Stanford, CA USA
[5] Univ Massachusetts Lowell, Lowell, MA USA
[6] Northeastern Univ, Boston, MA USA
Source
PROCEEDINGS OF THE ACM ON INTERACTIVE MOBILE WEARABLE AND UBIQUITOUS TECHNOLOGIES-IMWUT | 2024, Vol. 8, No. 1
Funding
U.S. National Institutes of Health (NIH)
Keywords
Mental Health; Large Language Model; Instruction Finetuning; Social Media; Depression
DOI
10.1145/3643540
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology]
Discipline Classification Code
0812
Abstract
Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant research gap in understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4, on various mental health prediction tasks via online text data. We conduct a broad range of experiments covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction fine-tuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best fine-tuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger, respectively) by 10.9% on balanced accuracy, and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability for mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines on potential methods to enhance LLMs' capability for mental health tasks. At the same time, we emphasize important limitations that must be addressed before such models are deployable in real-world mental health settings, such as known racial and gender biases, and we highlight the ethical risks accompanying this line of research.
Pages: 32
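
To make the evaluated methodology concrete, below is a minimal sketch of the zero-shot and few-shot prompting setup the abstract describes, applied to a binary depression-detection task on an online post. The prompt template, the 'yes'/'no' label set, the example posts, and the google/flan-t5-base checkpoint are illustrative assumptions for this sketch, not the paper's exact templates or model sizes.

```python
# Minimal sketch of zero-/few-shot prompting with FLAN-T5 (one of the
# evaluated model families) via Hugging Face transformers. All prompt
# wording, labels, and example posts below are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

def build_prompt(post, examples=None):
    """Compose a zero-shot (examples=None) or few-shot prompt for binary
    depression classification of an online post."""
    header = ("Decide whether the author of the following social media post "
              "shows signs of depression. Answer 'yes' or 'no'.\n\n")
    shots = ""
    if examples:  # few-shot: prepend labeled demonstrations
        shots = "".join(f"Post: {p}\nAnswer: {a}\n\n" for p, a in examples)
    return header + shots + f"Post: {post}\nAnswer:"

post = "I haven't slept properly in weeks and nothing feels worth doing."

# Zero-shot prediction.
print(generator(build_prompt(post), max_new_tokens=4)[0]["generated_text"])

# Few-shot prediction with two hypothetical labeled demonstrations.
demos = [
    ("Just got a promotion at work, feeling great!", "no"),
    ("I can't stop crying and I feel empty every single day.", "yes"),
]
print(generator(build_prompt(post, demos), max_new_tokens=4)[0]["generated_text"])

# For instruction fine-tuning, each training case is typically serialized as
# an instruction/input/output record; this Alpaca-style schema is an
# assumption, not the paper's published format:
train_example = {
    "instruction": ("Decide whether the post's author shows signs of "
                    "depression. Answer 'yes' or 'no'."),
    "input": post,
    "output": "yes",
}
```

The instruction fine-tuning step reported in the paper (yielding Mental-Alpaca and Mental-FLAN-T5) would then train on many such instruction/input/output records across multiple mental health datasets, rather than relying on prompting alone.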