DepITCM: an audio-visual method for detecting depression

Cited by: 0
Authors
Zhang, Lishan [1 ,2 ]
Liu, Zhenhua [3 ]
Wan, Yumei [3 ]
Fan, Yunli [3 ]
Chen, Diancai [3 ]
Wang, Qingxiang [3 ]
Zhang, Kaihong [3 ]
Zheng, Yunshao [3 ]
Affiliations
[1] Qilu Univ Technol, Shandong Acad Sci, Key Lab Comp Power Network & Informat Secur, Minist Educ, Shandong Comp Sci Ctr, Jinan, Peoples R China
[2] Shandong Fundamental Res Ctr Comp Sci, Shandong Prov Key Lab Comp Networks, Jinan, Peoples R China
[3] Shandong Univ, Shandong Mental Hlth Ctr, Jinan, Peoples R China
Source
FRONTIERS IN PSYCHIATRY | 2025, Vol. 15
Keywords
depression detection; multimodal; feature extraction; multi-task learning; DepITCM;
DOI
10.3389/fpsyt.2024.1466507
Chinese Library Classification (CLC)
R749 [Psychiatry]
Discipline Code
100205
Abstract
Introduction: Depression is a prevalent mental disorder, and early screening and treatment are crucial. However, current deep models based on audio-video data still have limitations: it is difficult to effectively extract and select useful multimodal information and features from audio-video data, and very few depression-detection studies have attended to all three dimensions of information (time, channel, and space) at once. In addition, there are challenges in exploiting auxiliary tasks to improve prediction accuracy. Resolving these issues is crucial for building depression-detection models.

Methods: In this paper, we propose DepITCM, a multi-task representation-learning model for depression detection based on vision and audio. The model comprises three main modules: a data preprocessing module, the Inception-Temporal-Channel Principal Component Analysis module (ITCM Encoder), and a multi-task learning module. To efficiently extract rich feature representations from audio and video data, the ITCM Encoder employs a staged feature-extraction strategy that proceeds from global to local features. This approach captures global features while emphasizing the finer-grained fusion of temporal, channel, and spatial information. Furthermore, inspired by multi-task learning strategies, we enhance the primary task of depression classification with a secondary regression task to improve overall performance.

Results: We conducted experiments on the AVEC2017 and AVEC2019 datasets. In the classification task, our method achieved an F1 score of 0.823 and a classification accuracy of 0.823 on AVEC2017, and an F1 score of 0.816 and a classification accuracy of 0.810 on AVEC2019. In the regression task, the RMSE was 6.10 (AVEC2017) and 4.89 (AVEC2019). These results show that our method outperforms most existing methods on both tasks, and that the proposed model effectively improves depression-detection performance when multi-task learning is used.

Discussion: Although multimodal depression detection has shown good results in previous studies, multi-task learning can additionally exploit the complementary information between tasks. Our work therefore combines multimodal and multi-task learning to improve detection accuracy. Previous studies have also mostly focused on extracting global features while overlooking local features. Addressing these shortcomings, we provide a more comprehensive and effective solution for depression detection.
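The record gives no implementation details, but the two mechanisms named in the Methods paragraph (a shared audio-visual encoder that fuses temporal, channel, and spatial information, and a primary classification task supported by an auxiliary regression task) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration, not the authors' code: the names TemporalChannelSpatialBlock and MultiTaskDepressionModel, the squeeze-and-excitation-style channel gate, the 0.5 loss weight, and the 0-24 severity scale (PHQ-8, as used in AVEC2017) are all hypothetical choices.

import torch
import torch.nn as nn

class TemporalChannelSpatialBlock(nn.Module):
    """Hypothetical block fusing channel and temporal cues; the paper's
    actual ITCM Encoder is not described in this record."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention (squeeze-and-excitation style gate).
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv1d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Local temporal mixing over the sequence axis.
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        x = x * self.channel_gate(x)           # re-weight channels globally
        return torch.relu(self.temporal(x))    # then refine local temporal detail

class MultiTaskDepressionModel(nn.Module):
    """Shared encoder feeding a classification head (primary task)
    and a regression head (auxiliary task)."""
    def __init__(self, in_dim: int, hidden: int = 128):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, hidden, kernel_size=1)
        self.encoder = TemporalChannelSpatialBlock(hidden)
        self.cls_head = nn.Linear(hidden, 2)   # depressed / not depressed
        self.reg_head = nn.Linear(hidden, 1)   # severity-score regression

    def forward(self, x: torch.Tensor):
        # x: (batch, time, in_dim) fused audio-visual features
        h = self.encoder(self.proj(x.transpose(1, 2))).mean(dim=-1)
        return self.cls_head(h), self.reg_head(h).squeeze(-1)

# Joint objective: the auxiliary regression loss acts as extra supervision.
model = MultiTaskDepressionModel(in_dim=64)
x = torch.randn(8, 100, 64)             # 8 clips, 100 frames, 64-dim features
labels = torch.randint(0, 2, (8,))      # binary depression labels
scores = torch.rand(8) * 24             # assumed 0-24 severity targets (PHQ-8)
logits, preds = model(x)
loss = nn.functional.cross_entropy(logits, labels) \
     + 0.5 * nn.functional.mse_loss(preds, scores)  # 0.5 weight is illustrative
loss.backward()

Because both heads share the encoder, gradients from the severity target regularize the representation used by the classifier, which is the point of the auxiliary task; the loss weight (0.5 here) would need tuning, and the real ITCM Encoder presumably adds the Inception-style and spatial components this sketch omits.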
Pages: 11