WavFace: A Multimodal Transformer-Based Model for Depression Screening

Cited by: 0
Authors
Flores, Ricardo [1,2]
Tlachac, M. L. [3,4]
Shrestha, Avantika [2]
Rundensteiner, Elke A. [2]
Affiliations
[1] Univ Concepcion, Dept Comp Sci, Concepcion, Chile
[2] Worcester Polytech Inst, Dept Data Sci, Worcester, MA 01609 USA
[3] Bryant Univ, Dept Informat Syst & Analyt, Smithfield, RI 01609 USA
[4] Bryant Univ, Ctr Hlth & Behav Sci, Smithfield, RI 01609 USA
Keywords
Depression; Interviews; Deep learning; Facial features; Videos; Transformers; Computational modeling; Telemedicine; Mental health; Bidirectional long short-term memory; Digital health; digital biomarkers; time series classification; transfer learning; alignment; fusion; UNITED-STATES; DEEP; NETWORKS; HEALTH; TRENDS
DOI
10.1109/JBHI.2025.3529348
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Depression, a prevalent mental health disorder with severe health and economic consequences, can be costly and difficult to detect. To alleviate this burden, recent research has explored the depression screening capabilities of deep learning (DL) models trained on videos of clinical interviews conducted by a virtual agent. Such DL models must address the challenges of modality representation, alignment, and fusion, as well as small sample sizes. To address these challenges, we propose WavFace, a multimodal deep learning model that takes audio and temporal facial features as input. WavFace adds an encoder-transformer layer over pre-trained models to improve the unimodal representations. It also applies an explicit alignment method to both modalities and then uses sequential and spatial self-attention over the aligned representations. Finally, WavFace fuses the sequential and spatial self-attention outputs across the two modality embeddings, inspired by how mental health professionals simultaneously observe visual and vocal presentation during clinical interviews. By leveraging sequential and spatial self-attention, WavFace outperforms pre-trained unimodal and multimodal models from the literature. With a single interview question, WavFace screened for depression with a balanced accuracy of 0.81. This presents a valuable modeling approach for audio-visual mental health screening.
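The sequential and spatial self-attention fusion described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the single-head attention, the function names, and the concatenation-based fusion are assumptions for clarity; in the paper the attention operates over transformer embeddings of pre-trained audio and facial-feature models after explicit alignment.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention over rows of x: (tokens, dim)
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def sequential_spatial_fusion(audio, face):
    # audio, face: (T, d) time-aligned embeddings (explicit alignment assumed done).
    # Sequential attention: tokens are time steps, attending across time.
    seq = np.concatenate([self_attention(audio), self_attention(face)], axis=-1)
    # Spatial attention: transpose so tokens are feature dimensions, attending across features.
    spa = np.concatenate([self_attention(audio.T).T, self_attention(face.T).T], axis=-1)
    # Fuse both views of both modalities: (T, 4d)
    return np.concatenate([seq, spa], axis=-1)

T, d = 8, 16
rng = np.random.default_rng(0)
fused = sequential_spatial_fusion(rng.standard_normal((T, d)),
                                  rng.standard_normal((T, d)))
print(fused.shape)  # (8, 64)
```

The fused representation would then feed a classification head; the split into time-wise ("sequential") and feature-wise ("spatial") attention mirrors the paper's idea of jointly modeling how signals evolve over the interview and how features co-occur at each moment.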
Pages: 3632-3641
Page count: 10