AMGCN: An adaptive multi-graph convolutional network for speech emotion recognition

Times Cited: 0
Authors
Lian, Hailun [1,2]
Lu, Cheng [1,3]
Chang, Hongli [1,2]
Zhao, Yan [1,2]
Li, Sunan [1,2]
Li, Yang [1,4]
Zong, Yuan [1,3]
Affiliations
[1] Southeast Univ, Key Lab Child Dev & Learning Sci, Minist Educ, Nanjing 210096, Peoples R China
[2] Southeast Univ, Sch Informat Sci & Engn, Nanjing 210096, Peoples R China
[3] Southeast Univ, Sch Biol Sci & Med Engn, Nanjing 210096, Peoples R China
[4] Xidian Univ, Sch Artificial Intelligence, Xian 710071, Peoples R China
Funding
China Postdoctoral Science Foundation;
Keywords
Multi-graph convolutional network; Speech emotion recognition; Time-frequency domain; NEURAL-NETWORKS; DEEP; FEATURES;
DOI
10.1016/j.specom.2024.103184
CLC Classification Number
O42 [Acoustics];
Discipline Code
070206; 082403;
Abstract
Speech contains rich emotional information, especially in its time and frequency domains. Extracting emotional information from these domains to model a global emotional representation of speech has therefore proven successful in Speech Emotion Recognition (SER). However, this global modeling, particularly in the frequency domain, mainly captures emotional correlations between frequency bands while neglecting the dynamic changes in frequency bins within local frequency bands. Related studies indicate that the energy distribution within local frequency bands carries important emotional cues, so relying solely on global modeling may fail to capture these critical local cues. To address this issue, we introduce the Adaptive Multi-Graph Convolutional Network (AMGCN) for SER, which integrates local and global analysis to capture emotional information from speech more comprehensively. The AMGCN comprises two core components: the Local Multi-Graph Convolutional Network (Local Multi-GCN) and the Global Multi-Graph Convolutional Network (Global Multi-GCN). Specifically, the Local Multi-GCN models the dynamic changes in frequency bins within local frequency bands; each frequency band has its own independent graph convolutional network, thereby capturing local frequency-domain contextual information and avoiding the loss of emotional cues in the local frequency domain. The Global Multi-GCN then combines two distinct graph convolutions to model global time-frequency patterns: one extends from the Local Multi-GCN to further model emotional correlations between frequency bands, and the other connects to the initial features to exploit global temporal contextual information. By combining local and global modeling, AMGCN leverages complementary information from both levels to obtain a more discriminative and robust emotional representation. The effectiveness of AMGCN is validated on three benchmark datasets, IEMOCAP, CASIA, and ABC, where it achieves accuracies of 74.25%, 49.67%, and 70.93%, respectively, surpassing existing state-of-the-art SER methods.
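To make the local/global structure described in the abstract concrete, the following is a minimal PyTorch sketch of the general idea only: independent graph convolutions over the frequency bins of each band (local), followed by a graph convolution whose nodes are the bands themselves (global). All tensor shapes, layer sizes, the softmax-normalized learnable adjacency, the mean pooling, and the omission of the temporal graph over the initial features are simplifying assumptions for illustration; this is not the authors' AMGCN implementation.

# Minimal sketch of per-band (local) and cross-band (global) graph convolutions.
# Shapes, layer sizes, and the adaptive adjacency are illustrative assumptions.
import torch
import torch.nn as nn


class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: X' = ReLU(A_hat X W), with a learnable adjacency."""

    def __init__(self, num_nodes: int, in_dim: int, out_dim: int):
        super().__init__()
        # Adaptively learned adjacency, initialized near the identity.
        self.adj = nn.Parameter(torch.eye(num_nodes) + 0.01 * torch.randn(num_nodes, num_nodes))
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_nodes, in_dim)
        a_hat = torch.softmax(self.adj, dim=-1)  # row-normalized adjacency
        return torch.relu(self.linear(a_hat @ x))


class LocalGlobalGCN(nn.Module):
    """Toy model: a GCN per frequency band over its bins (local), then a GCN over bands (global)."""

    def __init__(self, num_bands: int = 4, bins_per_band: int = 16,
                 feat_dim: int = 32, num_classes: int = 4):
        super().__init__()
        # Local: an independent graph over the frequency bins of each band.
        self.local_gcns = nn.ModuleList(
            SimpleGCNLayer(bins_per_band, feat_dim, feat_dim) for _ in range(num_bands)
        )
        # Global: a graph whose nodes are the frequency bands.
        self.global_gcn = SimpleGCNLayer(num_bands, feat_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, num_bands, bins_per_band, feat_dim), e.g. frame features grouped by band
        band_embeddings = []
        for b, gcn in enumerate(self.local_gcns):
            local = gcn(spec[:, b])                    # (batch, bins_per_band, feat_dim)
            band_embeddings.append(local.mean(dim=1))  # pool bins into one node per band
        bands = torch.stack(band_embeddings, dim=1)    # (batch, num_bands, feat_dim)
        global_repr = self.global_gcn(bands).mean(dim=1)
        return self.classifier(global_repr)


if __name__ == "__main__":
    model = LocalGlobalGCN()
    dummy = torch.randn(2, 4, 16, 32)  # two utterance-level feature maps
    print(model(dummy).shape)          # torch.Size([2, 4])

In this reading, the per-band GCNs play the role of local frequency-domain modeling and the band-level GCN the role of global correlation modeling; the paper's second global branch, a temporal graph built from the initial features, is left out of the sketch for brevity.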
Pages: 11