Spatio-Temporal Encoder-Decoder Fully Convolutional Network for Video-Based Dimensional Emotion Recognition

Cited by: 20
Authors
Du, Zhengyin [1]
Wu, Suowei [2]
Huang, Di [1]
Li, Weixin [3]
Wang, Yunhong [3]
Affiliations
[1] Beihang Univ, Beijing Adv Innovat Ctr Big Data & Brain Comp, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
[2] Beihang Univ, Beijing Adv Innovat Ctr Big Data & Brain Comp, Sino French Engineer Sch, Beijing 100191, Peoples R China
[3] Beihang Univ, Beijing Adv Innovat Ctr Big Data & Brain Comp, Beijing 100191, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Emotion recognition; Convolution; Decoding; Feature extraction; Videos; Visualization; Task analysis; Dimensional emotion recognition; spatio-temporal fully convolutional network; temporal hourglass CNN; temporal intermediate supervision; EXPRESSION RECOGNITION;
DOI
10.1109/TAFFC.2019.2940224
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Video-based dimensional emotion recognition aims to map human affect into a dimensional emotion space from visual signals; it is a fundamental challenge in affective computing and human-computer interaction. In this paper, we present a novel encoder-decoder framework to tackle this problem. It adopts a fully convolutional design that cascades a 2D-convolution-based spatial encoder and a 1D-convolution-based temporal encoder-decoder for joint spatio-temporal modeling. In particular, to address the key issue of capturing discriminative long-term dynamic dependencies, our temporal model, referred to as the Temporal Hourglass Convolutional Neural Network (TH-CNN), extracts contextual relationships by integrating both low-level encoded and high-level decoded clues. Temporal Intermediate Supervision (TIS) is then introduced to enhance the affective representations generated by TH-CNN under a multi-resolution strategy, which guides TH-CNN to learn the macroscopic long-term trend and the refined short-term fluctuations progressively. Furthermore, thanks to TH-CNN and TIS, the knowledge learnt from the intermediate layers also makes it possible to offer customized solutions for different applications by adjusting the decoder depth. Extensive experiments on three benchmark databases (RECOLA, SEWA and OMG) show superior results compared with state-of-the-art methods, indicating the effectiveness of the proposed approach.
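The multi-resolution idea behind TIS can be illustrated with a small sketch. This is not the authors' implementation: the pooling factors, the average-pooling downsampler, the mean-squared-error loss, and all function names below are assumptions chosen only to show how coarse targets capture the long-term trend while finer targets add short-term fluctuations.

```python
from typing import List, Sequence


def avg_pool(seq: Sequence[float], factor: int) -> List[float]:
    """Average non-overlapping windows of `factor` frames (coarser resolution)."""
    pooled = []
    for start in range(0, len(seq), factor):
        window = seq[start:start + factor]
        pooled.append(sum(window) / len(window))
    return pooled


def multi_resolution_targets(labels: Sequence[float],
                             factors: Sequence[int] = (4, 2, 1)) -> List[List[float]]:
    """Coarse-to-fine copies of one label sequence, one per (hypothetical) decoder stage."""
    return [avg_pool(labels, f) for f in factors]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error; an assumed stand-in for whatever loss the paper uses."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def intermediate_supervision_loss(stage_outputs: Sequence[Sequence[float]],
                                  labels: Sequence[float],
                                  factors: Sequence[int] = (4, 2, 1)) -> float:
    """Sum the per-stage losses so every decoder depth receives a training signal."""
    targets = multi_resolution_targets(labels, factors)
    return sum(mse(out, tgt) for out, tgt in zip(stage_outputs, targets))


# Example: an 8-frame valence trace supervised at 2, 4 and 8 time steps.
labels = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 0.8, 0.6]
targets = multi_resolution_targets(labels)
print([len(t) for t in targets])
```

Under these assumptions, the coarsest stage sees only the overall rise of the trace, and each later stage is asked to reproduce progressively finer detail. Truncating the list of stages corresponds loosely to using a shallower decoder.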
Pages: 565-578
Number of pages: 14
Related papers
50 in total
  • [21] Adaptive Encoder-Decoder Model Considering Spatio-Temporal Features for Short-Term Power Prediction of Distributed Photovoltaic Station
    Dou, Xun
    Deng, Yehang
    Wang, Shunjiang
    Chu, Tianfeng
    Li, Jiacheng
    Luo, Haifeng
    IEEE TRANSACTIONS ON INDUSTRY APPLICATIONS, 2025, 61 (01) : 1363 - 1373
  • [22] Multi-Attention Fusion Network for Video-based Emotion Recognition
    Wang, Yanan
    Wu, Jianming
    Hoashi, Keiichiro
    ICMI'19: PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2019, : 595 - 601
  • [23] HASTF: a hybrid attention spatio-temporal feature fusion network for EEG emotion recognition
    Hu, Fangzhou
    Wang, Fei
    Bi, Jinying
    An, Zida
    Chen, Chao
    Qu, Gangguo
    Han, Shuai
    FRONTIERS IN NEUROSCIENCE, 2024, 18
  • [24] EEG-GCN: Spatio-Temporal and Self-Adaptive Graph Convolutional Networks for Single and Multi-View EEG-Based Emotion Recognition
    Gao, Yue
    Fu, Xiangling
    Ouyang, Tianxiong
    Wang, Yi
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 1574 - 1578
  • [25] Spatio-temporal deep forest for emotion recognition based on facial electromyography signals
    Xu, Muhua
    Cheng, Juan
    Li, Chang
    Liu, Yu
    Chen, Xun
    COMPUTERS IN BIOLOGY AND MEDICINE, 2023, 156
  • [26] Spatio-Temporal Image-Based Encoded Atlases for EEG Emotion Recognition
    Avola, Danilo
    Cinque, Luigi
    Mambro, Angelo Di
    Fagioli, Alessio
    Marini, Marco Raoul
    Pannone, Daniele
    Fanini, Bruno
    Foresti, Gian Luca
    INTERNATIONAL JOURNAL OF NEURAL SYSTEMS, 2024, 34 (05)
  • [27] Pedestrian Trajectory Prediction in Heterogeneous Traffic Using Pose Keypoints-Based Convolutional Encoder-Decoder Network
    Chen, Kai
    Song, Xiao
    Ren, Xiaoxiang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (05) : 1764 - 1775
  • [28] Video Fingerprint Algorithm Based on Spatio-Temporal Deep Neural Network
    Wang Dongdong
    Li Yuenan
    LASER & OPTOELECTRONICS PROGRESS, 2018, 55 (01)
  • [29] Facial Expression Recognition Based on the Fusion of Spatio-temporal Features in Video Sequences
    Wang Xiaohua
    Xia Chen
    Hu Min
    Ren Fuji
    JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, 2018, 40 (03) : 626 - 632
  • [30] Micro-Expression Recognition Based on Spatio-Temporal Capsule Network
    Shang, Ziyang
    Liu, Jie
    Li, Xinfu
    IEEE ACCESS, 2023, 11 : 13704 - 13713