AMGCN: An adaptive multi-graph convolutional network for speech emotion recognition

Times Cited: 0
Authors
Lian, Hailun [1,2]
Lu, Cheng [1,3]
Chang, Hongli [1,2]
Zhao, Yan [1,2]
Li, Sunan [1,2]
Li, Yang [1,4]
Zong, Yuan [1,3]
Affiliations
[1] Southeast Univ, Key Lab Child Dev & Learning Sci, Minist Educ, Nanjing 210096, Peoples R China
[2] Southeast Univ, Sch Informat Sci & Engn, Nanjing 210096, Peoples R China
[3] Southeast Univ, Sch Biol Sci & Med Engn, Nanjing 210096, Peoples R China
[4] Xidian Univ, Sch Artificial Intelligence, Xian 710071, Peoples R China
Funding
China Postdoctoral Science Foundation;
Keywords
Multi-graph convolutional network; Speech emotion recognition; Time-frequency domain; NEURAL-NETWORKS; DEEP; FEATURES;
DOI
10.1016/j.specom.2024.103184
CLC Classification Number
O42 [Acoustics];
Discipline Code
070206; 082403;
Abstract
Speech contains rich emotional information, especially in its time and frequency domains. Extracting emotional information from these domains to model a global emotional representation of speech has therefore proven successful in Speech Emotion Recognition (SER). However, this global modeling, particularly in the frequency domain, mainly captures emotional correlations between frequency bands while neglecting the dynamic changes in frequency bins within local frequency bands. Related studies indicate that the energy distribution within local frequency bands carries important emotional cues, so relying solely on global modeling may fail to capture these critical local cues. To address this issue, we introduce the Adaptive Multi-Graph Convolutional Network (AMGCN) for SER, which integrates local and global analysis to capture emotional information from speech more comprehensively. The AMGCN comprises two core components: the Local Multi-Graph Convolutional Network (Local Multi-GCN) and the Global Multi-Graph Convolutional Network (Global Multi-GCN). Specifically, the Local Multi-GCN models the dynamic changes in frequency bins within local frequency bands; each frequency band has its own independent graph convolutional network, thereby capturing local frequency-domain contextual information and avoiding the loss of emotional cues in the local frequency domain. The Global Multi-GCN then combines two distinct graph convolutions to model global time-frequency patterns: one extends from the Local Multi-GCN to further model emotional correlations between frequency bands, and the other connects to the initial features to exploit global temporal contextual information. By combining local and global modeling, AMGCN leverages complementary information from both levels to obtain a more discriminative and robust emotional representation. The effectiveness of AMGCN is validated on three benchmark datasets, IEMOCAP, CASIA, and ABC, where it achieves accuracies of 74.25%, 49.67%, and 70.93%, respectively, surpassing existing state-of-the-art SER methods.
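To make the local/global structure described in the abstract concrete, the following is a minimal PyTorch sketch of the general idea only: independent graph convolutions over the frequency bins of each band (local), followed by a graph convolution whose nodes are the bands themselves (global). All tensor shapes, layer sizes, the softmax-normalized learnable adjacency, the mean pooling, and the omission of the temporal graph over the initial features are simplifying assumptions for illustration; this is not the authors' AMGCN implementation.

# Minimal sketch of per-band (local) and cross-band (global) graph convolutions.
# Shapes, layer sizes, and the adaptive adjacency are illustrative assumptions.
import torch
import torch.nn as nn


class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: X' = ReLU(A_hat X W), with a learnable adjacency."""

    def __init__(self, num_nodes: int, in_dim: int, out_dim: int):
        super().__init__()
        # Adaptively learned adjacency, initialized near the identity.
        self.adj = nn.Parameter(torch.eye(num_nodes) + 0.01 * torch.randn(num_nodes, num_nodes))
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_nodes, in_dim)
        a_hat = torch.softmax(self.adj, dim=-1)  # row-normalized adjacency
        return torch.relu(self.linear(a_hat @ x))


class LocalGlobalGCN(nn.Module):
    """Toy model: a GCN per frequency band over its bins (local), then a GCN over bands (global)."""

    def __init__(self, num_bands: int = 4, bins_per_band: int = 16,
                 feat_dim: int = 32, num_classes: int = 4):
        super().__init__()
        # Local: an independent graph over the frequency bins of each band.
        self.local_gcns = nn.ModuleList(
            SimpleGCNLayer(bins_per_band, feat_dim, feat_dim) for _ in range(num_bands)
        )
        # Global: a graph whose nodes are the frequency bands.
        self.global_gcn = SimpleGCNLayer(num_bands, feat_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, num_bands, bins_per_band, feat_dim), e.g. frame features grouped by band
        band_embeddings = []
        for b, gcn in enumerate(self.local_gcns):
            local = gcn(spec[:, b])                    # (batch, bins_per_band, feat_dim)
            band_embeddings.append(local.mean(dim=1))  # pool bins into one node per band
        bands = torch.stack(band_embeddings, dim=1)    # (batch, num_bands, feat_dim)
        global_repr = self.global_gcn(bands).mean(dim=1)
        return self.classifier(global_repr)


if __name__ == "__main__":
    model = LocalGlobalGCN()
    dummy = torch.randn(2, 4, 16, 32)  # two utterance-level feature maps
    print(model(dummy).shape)          # torch.Size([2, 4])

In this reading, the per-band GCNs play the role of local frequency-domain modeling and the band-level GCN the role of global correlation modeling; the paper's second global branch, a temporal graph built from the initial features, is left out of the sketch for brevity.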
Pages: 11