Fusion of deep learning features with mixture of brain emotional learning for audio-visual emotion recognition

Cited by: 27

Authors:

Farhoudi, Zeinab [1]

Setayeshi, Saeed [2]

Affiliations:

[1] Islamic Azad Univ, Dept Comp Engn, Sci & Res Branch, Tehran, Iran

[2] Amirkabir Univ Technol, Dept Energy Engn & Phys, Tehran, Iran

Source:

SPEECH COMMUNICATION | 2021 / Vol. 127

Keywords:

Audio-Visual emotion recognition; Brain emotional learning; Deep learning; Convolutional neural networks; Mixture of network; Multimodal fusion; MODEL;

DOI:

10.1016/j.specom.2020.12.001

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Multimodal emotion recognition is a challenging task because emotions are expressed through different modalities over a specific time span in video clips. Considering the spatial-temporal correlation that exists in video, we propose an audio-visual fusion model that combines deep learning features with a Mixture of Brain Emotional Learning (MoBEL) model inspired by the brain's limbic system. The proposed model is composed of two stages. First, deep learning methods, specifically a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN), are applied to extract highly abstract feature representations. Second, the fusion model, MoBEL, is designed to learn the joined audio-visual features simultaneously. For the visual modality, a 3D-CNN model is used to learn the spatial-temporal features of visual expression. For the auditory modality, the Mel-spectrograms of speech signals are fed into a CNN-RNN for spatial-temporal feature extraction. A high-level feature fusion approach with the MoBEL network is presented to exploit the correlation between the visual and auditory modalities and thereby improve emotion recognition performance. Experimental results on the eNTERFACE'05 database demonstrate that the proposed method outperforms hand-crafted features and other state-of-the-art information fusion models in video emotion recognition.
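The second stage described in the abstract (joining high-level audio and visual features, then learning them with a mixture of networks) can be illustrated with a minimal sketch. This is not the MoBEL model itself: the experts below are plain linear-softmax classifiers combined by a softmax gate, standing in for the brain emotional learning networks, and the feature vectors are random placeholders for the 3D-CNN and CNN-RNN outputs. All function names, dimensions, the number of experts, and the number of classes are hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def fuse_features(audio_feat, visual_feat):
    """Feature-level fusion by concatenation (an assumed fusion rule)."""
    return list(audio_feat) + list(visual_feat)


def mixture_predict(fused, expert_weights, gate_weights):
    """Gate-weighted average of the experts' class probabilities."""
    gate = softmax(linear(gate_weights, fused))  # one weight per expert
    expert_probs = [softmax(linear(w, fused)) for w in expert_weights]
    n_classes = len(expert_probs[0])
    return [
        sum(g * probs[c] for g, probs in zip(gate, expert_probs))
        for c in range(n_classes)
    ]


def random_matrix(rows, cols, rng):
    """Untrained placeholder weights drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
audio_feat = [rng.random() for _ in range(4)]   # placeholder for CNN-RNN output
visual_feat = [rng.random() for _ in range(6)]  # placeholder for 3D-CNN output
fused = fuse_features(audio_feat, visual_feat)

n_experts, n_classes = 3, 6  # arbitrary illustrative sizes
expert_weights = [random_matrix(n_classes, len(fused), rng) for _ in range(n_experts)]
gate_weights = random_matrix(n_experts, len(fused), rng)

probs = mixture_predict(fused, expert_weights, gate_weights)
predicted = max(range(n_classes), key=probs.__getitem__)
```

Because the gate weights sum to one and each expert emits a probability distribution, `probs` is itself a valid distribution over the classes. In a trained system the gate and experts would be learned jointly on the fused features rather than drawn at random.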

Citation

Pages: 92-103

Page count: 12
