Learning Better Representations for Audio-Visual Emotion Recognition with Common Information

Cited by: 18
Authors
Ma, Fei [1 ]
Zhang, Wei [1 ]
Li, Yang [1 ]
Huang, Shao-Lun [1 ]
Zhang, Lin [1 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Berkeley Shenzhen Inst, Shenzhen 518055, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2020 / Volume 10 / Issue 20
Keywords
audio-visual emotion recognition; common information; HGR maximal correlation; semi-supervised learning; FEATURES; CLASSIFICATION; FRAMEWORK;
DOI
10.3390/app10207239
Chinese Library Classification
O6 [Chemistry];
Discipline Classification Code
0703;
Abstract
Audio-visual emotion recognition aims to distinguish human emotional states by integrating the audio and visual data acquired during the expression of emotions. It is crucial for affect-related human-machine interaction systems, enabling machines to respond intelligently to human emotions. One challenge of this problem is how to efficiently extract feature representations from the audio and visual modalities. Although progress has been made in previous work, most approaches ignore the common information between audio and visual data during feature learning, which may limit performance since the two modalities are highly correlated in their emotional content. To address this issue, we propose a deep learning approach that efficiently exploits common information for audio-visual emotion recognition through correlation analysis. Specifically, we design an audio network and a visual network to extract feature representations from the audio and visual data, respectively, and then employ a fusion network to combine the extracted features for emotion prediction. These networks are trained with a joint loss that combines: (i) a correlation loss based on the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation, which extracts common information between the audio data, the visual data, and the corresponding emotion labels, and (ii) a classification loss, which extracts discriminative information from each modality for emotion prediction. We further generalize our architecture to the semi-supervised learning scenario. Experimental results on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets show that common information can significantly enhance the stability of the features learned from the different modalities and improve emotion recognition performance.
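Illustration (not from the record): the abstract describes a joint loss pairing an HGR maximal-correlation term with a classification term. A minimal PyTorch-style sketch of such a loss is shown below, assuming a soft-HGR formulation of the correlation term and a cross-entropy classification loss; the function names, the weighting factor alpha, and the exact formulation are assumptions for illustration, not the authors' implementation.

# Illustrative sketch only: a soft HGR-style correlation term combined with
# a cross-entropy classification loss (assumed formulation, not the paper's code).
import torch
import torch.nn.functional as F

def soft_hgr_correlation(f, g):
    # f, g: (batch_size, feature_dim) outputs of the audio and visual networks.
    # Returns a scalar that is larger when the two feature sets are more correlated.
    f = f - f.mean(dim=0, keepdim=True)      # zero-mean each feature dimension
    g = g - g.mean(dim=0, keepdim=True)
    n = f.size(0)
    inner = (f * g).sum() / (n - 1)          # empirical E[f(X)^T g(Y)]
    cov_f = f.t() @ f / (n - 1)              # feature covariance of f
    cov_g = g.t() @ g / (n - 1)              # feature covariance of g
    return inner - 0.5 * torch.trace(cov_f @ cov_g)

def joint_loss(audio_feat, visual_feat, logits, labels, alpha=0.1):
    # Classification loss minus a weighted correlation term: minimizing this
    # fits the emotion labels while maximizing audio-visual correlation.
    cls_loss = F.cross_entropy(logits, labels)
    corr = soft_hgr_correlation(audio_feat, visual_feat)
    return cls_loss - alpha * corr

In practice the three networks (audio, visual, fusion) would be optimized end to end on this joint objective, with alpha controlling the trade-off between the correlation and classification terms.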
Pages: 1-23
Number of pages: 23