AMGCN: An adaptive multi-graph convolutional network for speech emotion recognition

Times Cited: 0
Authors
Lian, Hailun [1 ,2 ]
Lu, Cheng [1 ,3 ]
Chang, Hongli [1 ,2 ]
Zhao, Yan [1 ,2 ]
Li, Sunan [1 ,2 ]
Li, Yang [1 ,4 ]
Zong, Yuan [1 ,3 ]
Affiliations
[1] Southeast Univ, Key Lab Child Dev & Learning Sci, Minist Educ, Nanjing 210096, Peoples R China
[2] Southeast Univ, Sch Informat Sci & Engn, Nanjing 210096, Peoples R China
[3] Southeast Univ, Sch Biol Sci & Med Engn, Nanjing 210096, Peoples R China
[4] Xidian Univ, Sch Artificial Intelligence, Xian 710071, Peoples R China
Funding
China Postdoctoral Science Foundation;
Keywords
Multi-graph convolutional network; Speech emotion recognition; Time-frequency domain; NEURAL-NETWORKS; DEEP; FEATURES;
DOI
10.1016/j.specom.2024.103184
CLC Number
O42 [Acoustics];
Discipline Classification Codes
070206; 082403
Abstract
Speech contains rich emotional information, especially in its time and frequency domains. Extracting emotional information from these domains to model the global emotional representation of speech has therefore been successful in Speech Emotion Recognition (SER). However, this global emotion modeling, particularly in the frequency domain, mainly focuses on the emotional correlations between frequency bands while neglecting the dynamic changes in frequency bins within local frequency bands. Related studies indicate that the energy distribution within local frequency bands contains important emotional cues, so relying solely on global modeling may fail to capture these critical local cues. To address this issue, we introduce the Adaptive Multi-Graph Convolutional Network (AMGCN) for SER, which integrates local and global analysis to capture emotional information from speech more comprehensively. The AMGCN comprises two core components: the Local Multi-Graph Convolutional Network (Local Multi-GCN) and the Global Multi-Graph Convolutional Network (Global Multi-GCN). Specifically, the Local Multi-GCN focuses on modeling dynamic changes in frequency bins within local frequency bands: each frequency band has its own independent graph convolutional network, which captures local frequency-domain contextual information and avoids losing emotional information in the local frequency domain. The Global Multi-GCN then combines two distinct graph convolutions to model global time-frequency patterns: one extends the Local Multi-GCN to model emotional correlations between frequency bands, and the other connects to the initial features to exploit global temporal contextual information. By combining local and global modeling, AMGCN leverages complementary information from both levels to obtain a more discriminative and robust emotional representation. The effectiveness of AMGCN is validated on three benchmark datasets, IEMOCAP, CASIA, and ABC, where it achieves accuracies of 74.25%, 49.67%, and 70.93%, respectively, surpassing existing state-of-the-art SER methods.
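To make the two-level design described in the abstract concrete, the following is a minimal sketch of how a local branch (one graph per frequency band, nodes = frequency bins) and a global branch (one graph over frequency bands plus one graph over time frames of the initial features) could be wired together. It is not the authors' implementation: the class names (SimpleGCN, LocalMultiGCN, GlobalMultiGCN, AMGCNSketch), the learnable dense adjacencies, the band split, and all tensor shapes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGCN(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_norm @ H @ W), with a learnable
    dense adjacency over a fixed number of nodes (a simplifying assumption)."""

    def __init__(self, num_nodes: int, in_dim: int, out_dim: int):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(num_nodes) + 0.01 * torch.randn(num_nodes, num_nodes))
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:   # h: (batch, num_nodes, in_dim)
        a = torch.softmax(self.adj, dim=-1)                # row-normalize the learned adjacency
        return F.relu(a @ self.proj(h))


class LocalMultiGCN(nn.Module):
    """One independent GCN per local frequency band; nodes are the frequency bins
    inside that band, so bin-level dynamics are modeled separately per band."""

    def __init__(self, num_bins: int, num_bands: int, in_dim: int, out_dim: int):
        super().__init__()
        assert num_bins % num_bands == 0
        self.band_size = num_bins // num_bands
        self.band_gcns = nn.ModuleList(
            [SimpleGCN(self.band_size, in_dim, out_dim) for _ in range(num_bands)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, num_bins, feat_dim)
        bands = x.split(self.band_size, dim=1)             # split the frequency axis into bands
        return torch.cat([gcn(b) for gcn, b in zip(self.band_gcns, bands)], dim=1)


class GlobalMultiGCN(nn.Module):
    """Two global graphs: one over frequency bands (pooled from the local branch)
    and one over time frames of the initial features."""

    def __init__(self, num_bands: int, num_frames: int, band_dim: int, time_dim: int, out_dim: int):
        super().__init__()
        self.band_gcn = SimpleGCN(num_bands, band_dim, out_dim)    # inter-band correlations
        self.time_gcn = SimpleGCN(num_frames, time_dim, out_dim)   # global temporal context

    def forward(self, band_feat: torch.Tensor, init_feat: torch.Tensor) -> torch.Tensor:
        # band_feat: (batch, num_bands, band_dim); init_feat: (batch, num_frames, time_dim)
        g_freq = self.band_gcn(band_feat).mean(dim=1)       # pool band nodes -> (batch, out_dim)
        g_time = self.time_gcn(init_feat).mean(dim=1)       # pool time nodes -> (batch, out_dim)
        return g_freq + g_time                              # fuse the two global views


class AMGCNSketch(nn.Module):
    """Local branch -> pooled band features -> global branch -> emotion logits."""

    def __init__(self, num_bins=64, num_bands=8, num_frames=300, feat_dim=32, hidden=64, num_classes=4):
        super().__init__()
        self.local = LocalMultiGCN(num_bins, num_bands, feat_dim, hidden)
        self.global_branch = GlobalMultiGCN(num_bands, num_frames, hidden, num_bins, hidden)
        self.num_bands = num_bands
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, spec: torch.Tensor, frame_feat: torch.Tensor) -> torch.Tensor:
        # spec: per-bin features (batch, num_bins, feat_dim);
        # frame_feat: initial frame-level features (batch, num_frames, num_bins)
        local_out = self.local(spec)                        # (batch, num_bins, hidden)
        band_feat = local_out.view(
            spec.size(0), self.num_bands, -1, local_out.size(-1)
        ).mean(dim=2)                                       # pool bins within each band
        return self.classifier(self.global_branch(band_feat, frame_feat))


if __name__ == "__main__":
    model = AMGCNSketch()
    spec = torch.randn(2, 64, 32)      # toy per-bin spectral features
    frames = torch.randn(2, 300, 64)   # toy frame-level features (e.g., log-Mel frames)
    print(model(spec, frames).shape)   # expected: torch.Size([2, 4])

Usage note: in this sketch the per-band GCNs never share parameters, which mirrors the abstract's claim that each band has its own independent graph; how the real AMGCN builds its adjacencies and fuses the two global graphs is not specified here and is left as an assumption.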
Pages: 11