Audio-Visual Group-based Emotion Recognition using Local and Global Feature Aggregation based Multi-Task Learning

Times Cited: 0
Authors
Li, Sunan [1 ]
Lian, Hailun [1 ]
Lu, Cheng [2 ]
Zhao, Yan [1 ,2 ]
Tang, Chuangao [2 ]
Zong, Yuan [2 ]
Zheng, Wenming [3 ]
Affiliations
[1] Southeast Univ, Sch Informat Sci & Engn, Nanjing, Peoples R China
[2] Southeast Univ, Sch Biol Sci & Med Engn, Nanjing, Peoples R China
[3] Southeast Univ, Key Lab Child Dev & Learning Sci, Minist Educ, Nanjing, Peoples R China
Source
PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023 | 2023
Keywords
Multimodal emotion recognition; neural networks; modality robustness; feature fusion; label revision;
DOI
10.1145/3577190.3616544
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Audio-video group emotion recognition is a challenging task that has attracted increasing attention in recent decades. Deep learning models have recently shown tremendous advances in analyzing human emotion. The task nevertheless remains difficult: it is hard to gather a broad range of potential information to obtain meaningful emotional representations, and hard to associate implicit contextual knowledge the way humans do. To tackle these problems, in this paper we propose the Local and Global Feature Aggregation based Multi-Task Learning (LGFAM) method for the Group Emotion Recognition problem. The framework consists of three parallel feature extraction networks that were verified in previous work. An attention network with an MLP backbone and specially designed loss functions is then used to fuse the features from the different modalities. In the experiment section, we report performance on the EmotiW2023 Audio-Visual Group-based Emotion Recognition sub-challenge, which aims to classify a video into one of three emotions. According to the feedback results, the best submission achieved 70.63% WAR and 70.38% UAR on the test set. This improvement demonstrates the effectiveness of our method.
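The abstract describes an attention network with an MLP backbone that fuses features produced by three parallel modality branches. As a rough illustration only (not the authors' code), the following minimal PyTorch sketch shows one way such MLP-based attention fusion over per-modality feature vectors could be structured; the feature dimension, module names, and the weighted-sum fusion rule are assumptions made for illustration.

    # Hypothetical sketch of MLP-based attention fusion over three modality features.
    # Dimensions and names are illustrative assumptions, not the paper's exact design.
    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        """Fuse per-modality features using weights produced by an MLP attention head."""

        def __init__(self, feat_dim: int = 512, num_classes: int = 3):
            super().__init__()
            # MLP backbone that scores each modality's contribution.
            self.attn_mlp = nn.Sequential(
                nn.Linear(feat_dim, 128),
                nn.ReLU(),
                nn.Linear(128, 1),
            )
            self.classifier = nn.Linear(feat_dim, num_classes)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (batch, num_modalities, feat_dim), one vector per modality branch.
            scores = self.attn_mlp(feats)              # (batch, num_modalities, 1)
            weights = torch.softmax(scores, dim=1)     # attention weights over modalities
            fused = (weights * feats).sum(dim=1)       # weighted sum -> (batch, feat_dim)
            return self.classifier(fused)              # logits over the three group emotions

    # Example: audio, face, and scene features from three parallel extractors.
    x = torch.randn(4, 3, 512)
    logits = AttentionFusion()(x)
    print(logits.shape)  # torch.Size([4, 3])

In such a design, the multi-task and label-revision losses mentioned in the keywords would be added on top of the fused representation; they are not shown here.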
Pages: 741-745
Number of pages: 5