DepITCM: an audio-visual method for detecting depression

Cited by: 0
Authors
Zhang, Lishan [1 ,2 ]
Liu, Zhenhua [3 ]
Wan, Yumei [3 ]
Fan, Yunli [3 ]
Chen, Diancai [3 ]
Wang, Qingxiang [3 ]
Zhang, Kaihong [3 ]
Zheng, Yunshao [3 ]
Affiliations
[1] Qilu Univ Technol, Shandong Acad Sci, Key Lab Comp Power Network & Informat Secur, Minist Educ, Shandong Comp Sci Ctr, Jinan, Peoples R China
[2] Shandong Fundamental Res Ctr Comp Sci, Shandong Prov Key Lab Comp Networks, Jinan, Peoples R China
[3] Shandong Univ, Shandong Mental Hlth Ctr, Jinan, Peoples R China
Source
FRONTIERS IN PSYCHIATRY | 2025, Vol. 15
Keywords
depression detection; multimodal; feature extraction; multi-task learning; DepITCM;
DOI
10.3389/fpsyt.2024.1466507
Chinese Library Classification (CLC)
R749 [Psychiatry]
Discipline Code
100205
Abstract
Introduction: Depression is a prevalent mental disorder, and early screening and treatment are crucial. However, current deep models based on audio-video data still have limitations: it is difficult to effectively extract and select useful multimodal information and features from audio-video data, and very few depression-detection studies have attended to all three dimensions of information (time, channel, and space) at once. In addition, there are challenges in exploiting auxiliary tasks to improve prediction accuracy. Resolving these issues is crucial for building depression-detection models.

Methods: In this paper, we propose DepITCM, a multi-task representation-learning model for depression detection based on vision and audio. The model comprises three main modules: a data preprocessing module, the Inception-Temporal-Channel Principal Component Analysis module (ITCM Encoder), and a multi-task learning module. To efficiently extract rich feature representations from audio and video data, the ITCM Encoder employs a staged feature-extraction strategy that proceeds from global to local features. This approach captures global features while emphasizing the finer-grained fusion of temporal, channel, and spatial information. Furthermore, inspired by multi-task learning strategies, we enhance the primary task of depression classification with a secondary regression task to improve overall performance.

Results: We conducted experiments on the AVEC2017 and AVEC2019 datasets. In the classification task, our method achieved an F1 score of 0.823 and a classification accuracy of 0.823 on AVEC2017, and an F1 score of 0.816 and a classification accuracy of 0.810 on AVEC2019. In the regression task, the RMSE was 6.10 (AVEC2017) and 4.89 (AVEC2019). These results show that our method outperforms most existing methods on both tasks, and that the proposed model effectively improves depression-detection performance when multi-task learning is used.

Discussion: Although multimodal depression detection has shown good results in previous studies, multi-task learning can additionally exploit the complementary information between tasks. Our work therefore combines multimodal and multi-task learning to improve detection accuracy. Previous studies have also mostly focused on extracting global features while overlooking local features. Addressing these shortcomings, we provide a more comprehensive and effective solution for depression detection.
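The record gives no implementation details, but the two mechanisms named in the Methods paragraph (a shared audio-visual encoder that fuses temporal, channel, and spatial information, and a primary classification task supported by an auxiliary regression task) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration, not the authors' code: the names TemporalChannelSpatialBlock and MultiTaskDepressionModel, the squeeze-and-excitation-style channel gate, the 0.5 loss weight, and the 0-24 severity scale (PHQ-8, as used in AVEC2017) are all hypothetical choices.

import torch
import torch.nn as nn

class TemporalChannelSpatialBlock(nn.Module):
    """Hypothetical block fusing channel and temporal cues; the paper's
    actual ITCM Encoder is not described in this record."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention (squeeze-and-excitation style gate).
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv1d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Local temporal mixing over the sequence axis.
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        x = x * self.channel_gate(x)           # re-weight channels globally
        return torch.relu(self.temporal(x))    # then refine local temporal detail

class MultiTaskDepressionModel(nn.Module):
    """Shared encoder feeding a classification head (primary task)
    and a regression head (auxiliary task)."""
    def __init__(self, in_dim: int, hidden: int = 128):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, hidden, kernel_size=1)
        self.encoder = TemporalChannelSpatialBlock(hidden)
        self.cls_head = nn.Linear(hidden, 2)   # depressed / not depressed
        self.reg_head = nn.Linear(hidden, 1)   # severity-score regression

    def forward(self, x: torch.Tensor):
        # x: (batch, time, in_dim) fused audio-visual features
        h = self.encoder(self.proj(x.transpose(1, 2))).mean(dim=-1)
        return self.cls_head(h), self.reg_head(h).squeeze(-1)

# Joint objective: the auxiliary regression loss acts as extra supervision.
model = MultiTaskDepressionModel(in_dim=64)
x = torch.randn(8, 100, 64)             # 8 clips, 100 frames, 64-dim features
labels = torch.randint(0, 2, (8,))      # binary depression labels
scores = torch.rand(8) * 24             # assumed 0-24 severity targets (PHQ-8)
logits, preds = model(x)
loss = nn.functional.cross_entropy(logits, labels) \
     + 0.5 * nn.functional.mse_loss(preds, scores)  # 0.5 weight is illustrative
loss.backward()

Because both heads share the encoder, gradients from the severity target regularize the representation used by the classifier, which is the point of the auxiliary task; the loss weight (0.5 here) would need tuning, and the real ITCM Encoder presumably adds the Inception-style and spatial components this sketch omits.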
Pages: 11