STT-Net: Simplified Temporal Transformer for Emotion Recognition

Cited by: 6
Authors
Khan, Mustaqeem [1 ]
El Saddik, Abdulmotaleb [1 ,2 ]
Deriche, Mohamed [3 ]
Gueaieb, Wail [2 ]
Affiliations
[1] Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, United Arab Emirates
[2] University of Ottawa, School of Electrical Engineering and Computer Science, Ottawa, ON K1N 6N5, Canada
[3] Ajman University, Artificial Intelligence Research Center (AIRC), Ajman, United Arab Emirates
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Attention mechanism; deep learning; end-to-end architecture; multi-head self/cross-attention; emotion recognition; facial expression recognition
DOI
10.1109/ACCESS.2024.3413136
CLC Classification Number
TP [Automation Technology; Computer Technology]
Subject Classification Code
0812
Abstract
Emotion recognition, i.e., recognizing human emotional expressions from faces, is a crucial topic in computer vision. Transformers have recently emerged as a robust architecture, and many vision-transformer models for emotion recognition have been proposed. The major drawback of such models is the high computational cost of computing space-time attention. To avoid this burden, we study temporal feature shifting for frame-wise deep learning models. In this work, we propose a novel temporal shifting approach for a frame-wise transformer-based model, replacing multi-head self-attention (MSA) with multi-head self/cross-attention (MSCA) to model temporal interactions between tokens at no additional cost. The proposed MSCA encodes contextual connections within and across channels and over time, improving the recognition rate and reducing latency for real-world applications. We extensively evaluated our system on the CK+ (Cohn-Kanade) and FER-2013plus (Facial Emotion Recognition) benchmark datasets, using geometric-transform-based augmentation to address class imbalance in the data. According to the results, the proposed MSCA either outperforms or closely matches state-of-the-art (SOTA) techniques. In addition, we conducted an ablation study on the challenging FER-2013plus dataset to demonstrate the significance and potential of our model for complex emotion recognition tasks.
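The abstract does not give implementation details of MSCA, so the PyTorch sketch below only illustrates the general idea it describes: shift a fraction of token channels to neighboring frames (in the style of temporal shift modules) before attention, so that ordinary frame-wise attention queries attend to keys/values carrying adjacent-frame features, adding temporal interaction at no extra attention cost. The class name, fold ratio, and tensor layout are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ShiftedSelfCrossAttention(nn.Module):
    """Hypothetical sketch of shift-based self/cross-attention (not the paper's code).

    A fraction of each token's channels is copied from the previous/next
    frame before attention, so frame-wise attention implicitly mixes
    information across time without computing full space-time attention.
    """

    def __init__(self, dim: int, num_heads: int = 8, shift_ratio: float = 0.25):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fold = int(dim * shift_ratio) // 2  # channels shifted each direction (assumed ratio)

    def temporal_shift(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, tokens, dim); the first `fold` channels come from
        # the previous frame, the next `fold` from the following frame.
        out = x.clone()
        f = self.fold
        out[:, 1:, :, :f] = x[:, :-1, :, :f]                # forward shift in time
        out[:, :-1, :, f:2 * f] = x[:, 1:, :, f:2 * f]      # backward shift in time
        return out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, d = x.shape
        q = x.reshape(b * t, n, d)                          # queries: current frame only
        kv = self.temporal_shift(x).reshape(b * t, n, d)    # keys/values: carry neighbor-frame channels
        out, _ = self.attn(q, kv, kv)
        return out.reshape(b, t, n, d)


# Toy usage: 2 clips, 8 frames, 7x7 patch tokens, 256-dim embeddings.
x = torch.randn(2, 8, 49, 256)
y = ShiftedSelfCrossAttention(dim=256)(x)
print(y.shape)  # torch.Size([2, 8, 49, 256])
```

Because the shift is a plain channel copy, the attention itself remains frame-wise (quadratic only in the spatial token count), which matches the kind of cost-free temporal modeling the abstract credits for the reduced latency.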
Pages: 86220 - 86231
Page count: 12