A Neural Network Architecture for Children's Audio-Visual Emotion Recognition

Cited by: 1

Authors:

Matveev, Anton [1]

Matveev, Yuri [1]

Frolova, Olga [1]

Nikolaev, Aleksandr [1]

Lyakso, Elena [1]

Affiliations:

[1] St Petersburg Univ, Dept Higher Nervous Act & Psychophysiol, Child Speech Res Grp, St Petersburg 199034, Russia

Source:

MATHEMATICS | 2023 / Vol. 11 / Issue 22

Funding:

Russian Science Foundation;

Keywords:

audio-visual speech; emotion recognition; children; MULTIMODAL FUSION; SPEECH; AGE;

DOI:

10.3390/math11224573

Chinese Library Classification:

O1 [Mathematics];

Discipline classification code:

0701 ; 070101 ;

Abstract:

Detecting and understanding emotions are critical for our daily activities. As emotion recognition (ER) systems develop, we start looking at more difficult cases than just acted adult audio-visual speech. In this work, we investigate the automatic classification of the audio-visual emotional speech of children, which presents several challenges, including the lack of publicly available annotated datasets and the low performance of state-of-the-art audio-visual ER systems. We present a new corpus of children's audio-visual emotional speech that we collected. We then propose a neural network solution that improves the utilization of the temporal relationships between the audio and video modalities in cross-modal fusion for children's audio-visual emotion recognition. We select a state-of-the-art neural network architecture as a baseline and present several modifications focused on deeper learning of the cross-modal temporal relationships using attention. In experiments comparing our proposed approach with the selected baseline model, we observe a relative improvement in performance of 2%. Finally, we conclude that focusing more on the cross-modal temporal relationships may be beneficial for building ER systems for child-machine communication and for environments where qualified professionals work with children.
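
The abstract refers to attention-based fusion of the temporal relationships between audio and video. As a rough illustration of that general idea only, the following is a minimal sketch of single-head scaled dot-product cross-modal attention, in which frames of one modality attend over frames of the other. It is not the architecture from the paper: the function names, the absence of learned projections, and the toy feature values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys_values):
    """Scaled dot-product attention where one modality queries the other.

    queries:     T_q feature vectors (e.g. audio frames), each of length d
    keys_values: T_k feature vectors (e.g. video frames), each of length d
    Returns (attended, weights). attended holds T_q vectors of length d;
    weights holds T_q rows of T_k attention weights, each row summing to 1.
    Hypothetical simplification: single head, no learned projections.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    attended, weights = [], []
    for q in queries:
        # Similarity of this query frame to every frame of the other modality.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in keys_values]
        w = softmax(scores)
        # Weighted sum of the other modality's frames.
        out = [sum(w[t] * keys_values[t][j] for t in range(len(keys_values)))
               for j in range(d)]
        attended.append(out)
        weights.append(w)
    return attended, weights


# Toy inputs, made up for illustration (not data from the paper).
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 audio frames
video = [[1.0, 0.0], [0.0, 1.0]]              # 2 video frames

audio_given_video, attn = cross_modal_attention(audio, video)
```

Each row of `attn` shows how strongly one audio frame attends to each video frame, so the two sequences may have different lengths. Swapping the arguments gives the video-attends-to-audio direction.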

Pages: 17
