Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data

Cited by: 28
Authors
Xu, Xuhai [1,2]
Yao, Bingsheng [3]
Dong, Yuanzhe [4]
Gabriel, Saadia [1]
Yu, Hong [5]
Hendler, James [3]
Ghassemi, Marzyeh [1]
Dey, Anind K.
Wang, Dakuo [2,6]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
[2] Univ Washington, Seattle, WA 98195 USA
[3] Rensselaer Polytechn Inst, Rensselaer, NY USA
[4] Stanford Univ, Stanford, CA USA
[5] Univ Massachusetts Lowell, Lowell, MA USA
[6] Northeastern Univ, Boston, MA USA
Source
PROCEEDINGS OF THE ACM ON INTERACTIVE MOBILE WEARABLE AND UBIQUITOUS TECHNOLOGIES-IMWUT | 2024, Vol. 8, No. 1
Funding
U.S. National Institutes of Health (NIH)
Keywords
Mental Health; Large Language Model; Instruction Finetuning; Social Media; Depression
DOI
10.1145/3643540
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology]
Discipline Classification Code
0812
Abstract
Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant research gap in understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4, on various mental health prediction tasks via online text data. We conduct a broad range of experiments covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction fine-tuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best fine-tuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger, respectively) by 10.9% on balanced accuracy, and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability for mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines on potential methods to enhance LLMs' capability for mental health tasks. At the same time, we emphasize important limitations that must be addressed before such models are deployable in real-world mental health settings, such as known racial and gender biases, and we highlight the ethical risks accompanying this line of research.
Pages: 32
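
To make the evaluated methodology concrete, below is a minimal sketch of the zero-shot and few-shot prompting setup the abstract describes, applied to a binary depression-detection task on an online post. The prompt template, the 'yes'/'no' label set, the example posts, and the google/flan-t5-base checkpoint are illustrative assumptions for this sketch, not the paper's exact templates or model sizes.

```python
# Minimal sketch of zero-/few-shot prompting with FLAN-T5 (one of the
# evaluated model families) via Hugging Face transformers. All prompt
# wording, labels, and example posts below are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

def build_prompt(post, examples=None):
    """Compose a zero-shot (examples=None) or few-shot prompt for binary
    depression classification of an online post."""
    header = ("Decide whether the author of the following social media post "
              "shows signs of depression. Answer 'yes' or 'no'.\n\n")
    shots = ""
    if examples:  # few-shot: prepend labeled demonstrations
        shots = "".join(f"Post: {p}\nAnswer: {a}\n\n" for p, a in examples)
    return header + shots + f"Post: {post}\nAnswer:"

post = "I haven't slept properly in weeks and nothing feels worth doing."

# Zero-shot prediction.
print(generator(build_prompt(post), max_new_tokens=4)[0]["generated_text"])

# Few-shot prediction with two hypothetical labeled demonstrations.
demos = [
    ("Just got a promotion at work, feeling great!", "no"),
    ("I can't stop crying and I feel empty every single day.", "yes"),
]
print(generator(build_prompt(post, demos), max_new_tokens=4)[0]["generated_text"])

# For instruction fine-tuning, each training case is typically serialized as
# an instruction/input/output record; this Alpaca-style schema is an
# assumption, not the paper's published format:
train_example = {
    "instruction": ("Decide whether the post's author shows signs of "
                    "depression. Answer 'yes' or 'no'."),
    "input": post,
    "output": "yes",
}
```

The instruction fine-tuning step reported in the paper (yielding Mental-Alpaca and Mental-FLAN-T5) would then train on many such instruction/input/output records across multiple mental health datasets, rather than relying on prompting alone.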