RobinNet: A Multimodal Speech Emotion Recognition System With Speaker Recognition for Social Interactions

Cited: 17
Authors
Khurana, Yash [1]
Gupta, Swamita [1]
Sathyaraj, R. [1]
Raja, S. P. [1]
Affiliation
[1] Vellore Institute of Technology, School of Computer Science and Engineering, Vellore 632014, India
Keywords
Emotion recognition; speech recognition; task analysis; feature extraction; computational modeling; bit error rate; transfer learning; Inception-ResNetV2; intermediate fusion; multimodal emotion recognition (MER); RoBERTa; speaker recognition
DOI
10.1109/TCSS.2022.3228649
CLC number
TP3 [computing technology; computer technology]
Discipline code
0812
Abstract
It is essential to understand the underlying emotions imparted through speech in order to study social communication and to enable seamless human-computer interaction. Speech emotion recognition (SER) is a considerably challenging task due to the scarcity of data and the complex interdependence between phrases, their context, and the emotions they imply. This article presents RobinNet, a novel multimodal network for SER based on RoBERTa and Inception-ResNetV2. The model employs transfer learning to build two unimodal systems, one for textual and one for acoustic features, and then combines them into a single classifier through intermediate fusion. The design follows a careful analysis of various top-performing unimodal systems: a fine-tuned RoBERTa-based model represents the textual features, while an Inception-ResNetV2 network pretrained for speaker identification is transferred, with spectrogram augmentation, to the task of emotion recognition from speech. The proposed multimodal system combines the two modalities through intermediate fusion and achieves a weighted accuracy (WA) of 72.8% when evaluated on the interactive emotional dyadic motion capture (IEMOCAP) dataset. Experimental results reveal that the proposed system outperforms state-of-the-art (SOTA) solutions on the benchmark datasets IEMOCAP, the multimodal emotion lines dataset (MELD), and CMU-MOSEI. Unlike its predecessors, which perform late fusion after significant independent processing, the proposed model uses intermediate fusion, thereby improving the resulting multimodal representations.
Pages: 478-487
Page count: 10
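
The abstract describes an intermediate-fusion architecture that joins a RoBERTa text encoder with an Inception-ResNetV2 spectrogram encoder. The paper's exact layer sizes, fusion head, and pretrained checkpoints are not given in this record, so the following is only a minimal sketch of the idea: the fusion head (concatenation followed by a small MLP) and the specific checkpoints ("roberta-base", timm's "inception_resnet_v2") are assumptions, not the authors' implementation.

    # Minimal sketch of intermediate fusion for SER, assuming a concatenation +
    # MLP fusion head; not the paper's exact architecture.
    import torch
    import torch.nn as nn
    from transformers import RobertaModel
    import timm  # provides a pretrained Inception-ResNetV2 backbone

    class IntermediateFusionSER(nn.Module):
        def __init__(self, num_emotions: int = 4):
            super().__init__()
            # Text branch: pretrained RoBERTa encoder (transfer learning).
            self.text_encoder = RobertaModel.from_pretrained("roberta-base")
            # Audio branch: Inception-ResNetV2 applied to 1-channel spectrograms;
            # num_classes=0 makes timm return pooled features instead of logits.
            self.audio_encoder = timm.create_model(
                "inception_resnet_v2", pretrained=True, num_classes=0, in_chans=1
            )
            text_dim = self.text_encoder.config.hidden_size   # 768 for roberta-base
            audio_dim = self.audio_encoder.num_features       # 1536 for this backbone
            # Intermediate fusion: join the two intermediate representations and
            # classify jointly, rather than averaging two unimodal predictions.
            self.classifier = nn.Sequential(
                nn.Linear(text_dim + audio_dim, 512),
                nn.ReLU(),
                nn.Dropout(0.3),
                nn.Linear(512, num_emotions),
            )

        def forward(self, input_ids, attention_mask, spectrogram):
            # Sentence-level text embedding taken at the <s> (CLS) position.
            text_feat = self.text_encoder(
                input_ids=input_ids, attention_mask=attention_mask
            ).last_hidden_state[:, 0]
            # Pooled CNN features from the spectrogram branch.
            audio_feat = self.audio_encoder(spectrogram)
            fused = torch.cat([text_feat, audio_feat], dim=-1)
            return self.classifier(fused)

The design choice this illustrates is the one the abstract emphasizes: both branches contribute intermediate features to a single joint classifier, instead of each branch producing its own prediction that is merged only at the end (late fusion).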
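
The abstract also mentions spectrogram augmentation when transferring the speaker-recognition network to SER. Below is a small sketch in the spirit of SpecAugment-style time and frequency masking; the masking parameters, mel settings, and the file name "utterance.wav" are illustrative assumptions, not values from the paper.

    # Sketch of spectrogram augmentation for the audio branch, assuming
    # SpecAugment-style masking; parameters below are illustrative only.
    import torch
    import torchaudio
    import torchaudio.transforms as T

    waveform, sample_rate = torchaudio.load("utterance.wav")  # hypothetical file
    mel = T.MelSpectrogram(sample_rate=sample_rate, n_mels=128)(waveform)

    augment = torch.nn.Sequential(
        T.FrequencyMasking(freq_mask_param=15),  # mask a band of mel bins
        T.TimeMasking(time_mask_param=35),       # mask a span of time frames
    )
    augmented = augment(mel)  # extra training views for the CNN audio encoder

Masked spectrograms act as additional training views, which helps the pretrained CNN adapt to emotion recognition despite the limited SER data the abstract notes.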