Audio-Visual Group-based Emotion Recognition using Local and Global Feature Aggregation based Multi-Task Learning

Times Cited: 0
Authors
Li, Sunan [1 ]
Lian, Hailun [1 ]
Lu, Cheng [2 ]
Zhao, Yan [1 ,2 ]
Tang, Chuangao [2 ]
Zong, Yuan [2 ]
Zheng, Wenming [3 ]
Affiliations
[1] Southeast Univ, Sch Informat Sci & Engn, Nanjing, Peoples R China
[2] Southeast Univ, Sch Biol Sci & Med Engn, Nanjing, Peoples R China
[3] Southeast Univ, Key Lab Child Dev & Learning Sci, Minist Educ, Nanjing, Peoples R China
Source
PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023 | 2023
Keywords
Multimodal emotion recognition; neural networks; modality robustness; feature fusion; label revision;
DOI
10.1145/3577190.3616544
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Audio-video group emotion recognition is a challenging task that has attracted increasing attention in recent decades. Deep learning models have recently shown tremendous advances in analyzing human emotion. The task nevertheless remains difficult: it is hard to gather a broad range of potential information to obtain meaningful emotional representations, and hard to associate implicit contextual knowledge the way humans do. To tackle these problems, in this paper we propose the Local and Global Feature Aggregation based Multi-Task Learning (LGFAM) method for the Group Emotion Recognition problem. The framework consists of three parallel feature extraction networks that were verified in previous work. An attention network with an MLP backbone and specially designed loss functions is then used to fuse the features from the different modalities. In the experiment section, we report performance on the EmotiW2023 Audio-Visual Group-based Emotion Recognition sub-challenge, which aims to classify a video into one of three emotions. According to the feedback results, the best submission achieved 70.63% WAR and 70.38% UAR on the test set. This improvement demonstrates the effectiveness of our method.
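The abstract describes an attention network with an MLP backbone that fuses features produced by three parallel modality branches. As a rough illustration only (not the authors' code), the following minimal PyTorch sketch shows one way such MLP-based attention fusion over per-modality feature vectors could be structured; the feature dimension, module names, and the weighted-sum fusion rule are assumptions made for illustration.

    # Hypothetical sketch of MLP-based attention fusion over three modality features.
    # Dimensions and names are illustrative assumptions, not the paper's exact design.
    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        """Fuse per-modality features using weights produced by an MLP attention head."""

        def __init__(self, feat_dim: int = 512, num_classes: int = 3):
            super().__init__()
            # MLP backbone that scores each modality's contribution.
            self.attn_mlp = nn.Sequential(
                nn.Linear(feat_dim, 128),
                nn.ReLU(),
                nn.Linear(128, 1),
            )
            self.classifier = nn.Linear(feat_dim, num_classes)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (batch, num_modalities, feat_dim), one vector per modality branch.
            scores = self.attn_mlp(feats)              # (batch, num_modalities, 1)
            weights = torch.softmax(scores, dim=1)     # attention weights over modalities
            fused = (weights * feats).sum(dim=1)       # weighted sum -> (batch, feat_dim)
            return self.classifier(fused)              # logits over the three group emotions

    # Example: audio, face, and scene features from three parallel extractors.
    x = torch.randn(4, 3, 512)
    logits = AttentionFusion()(x)
    print(logits.shape)  # torch.Size([4, 3])

In such a design, the multi-task and label-revision losses mentioned in the keywords would be added on top of the fused representation; they are not shown here.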
Pages: 741-745
Number of pages: 5