AMGCN: An adaptive multi-graph convolutional network for speech emotion recognition

Times Cited: 0
Authors
Lian, Hailun [1 ,2 ]
Lu, Cheng [1 ,3 ]
Chang, Hongli [1 ,2 ]
Zhao, Yan [1 ,2 ]
Li, Sunan [1 ,2 ]
Li, Yang [1 ,4 ]
Zong, Yuan [1 ,3 ]
Affiliations
[1] Southeast Univ, Key Lab Child Dev & Learning Sci, Minist Educ, Nanjing 210096, Peoples R China
[2] Southeast Univ, Sch Informat Sci & Engn, Nanjing 210096, Peoples R China
[3] Southeast Univ, Sch Biol Sci & Med Engn, Nanjing 210096, Peoples R China
[4] Xidian Univ, Sch Artificial Intelligence, Xian 710071, Peoples R China
Funding
China Postdoctoral Science Foundation;
Keywords
Multi-graph convolutional network; Speech emotion recognition; Time-frequency domain; NEURAL-NETWORKS; DEEP; FEATURES;
DOI
10.1016/j.specom.2024.103184
CLC Number
O42 [Acoustics];
Discipline Classification Codes
070206; 082403
Abstract
Speech contains rich emotional information, especially in its time and frequency domains. Extracting emotional information from these domains to model the global emotional representation of speech has therefore been successful in Speech Emotion Recognition (SER). However, this global emotion modeling, particularly in the frequency domain, mainly focuses on the emotional correlations between frequency bands while neglecting the dynamic changes in frequency bins within local frequency bands. Related studies indicate that the energy distribution within local frequency bands contains important emotional cues, so relying solely on global modeling may fail to capture these critical local cues. To address this issue, we introduce the Adaptive Multi-Graph Convolutional Network (AMGCN) for SER, which integrates local and global analysis to capture emotional information from speech more comprehensively. The AMGCN comprises two core components: the Local Multi-Graph Convolutional Network (Local Multi-GCN) and the Global Multi-Graph Convolutional Network (Global Multi-GCN). Specifically, the Local Multi-GCN focuses on modeling dynamic changes in frequency bins within local frequency bands: each frequency band has its own independent graph convolutional network, which captures local frequency-domain contextual information and avoids losing emotional information in the local frequency domain. The Global Multi-GCN then combines two distinct graph convolutions to model global time-frequency patterns: one extends the Local Multi-GCN to model emotional correlations between frequency bands, and the other connects to the initial features to exploit global temporal contextual information. By combining local and global modeling, AMGCN leverages complementary information from both levels to obtain a more discriminative and robust emotional representation. The effectiveness of AMGCN is validated on three benchmark datasets, IEMOCAP, CASIA, and ABC, where it achieves accuracies of 74.25%, 49.67%, and 70.93%, respectively, surpassing existing state-of-the-art SER methods.
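To make the two-level design described in the abstract concrete, the following is a minimal sketch of how a local branch (one graph per frequency band, nodes = frequency bins) and a global branch (one graph over frequency bands plus one graph over time frames of the initial features) could be wired together. It is not the authors' implementation: the class names (SimpleGCN, LocalMultiGCN, GlobalMultiGCN, AMGCNSketch), the learnable dense adjacencies, the band split, and all tensor shapes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGCN(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_norm @ H @ W), with a learnable
    dense adjacency over a fixed number of nodes (a simplifying assumption)."""

    def __init__(self, num_nodes: int, in_dim: int, out_dim: int):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(num_nodes) + 0.01 * torch.randn(num_nodes, num_nodes))
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:   # h: (batch, num_nodes, in_dim)
        a = torch.softmax(self.adj, dim=-1)                # row-normalize the learned adjacency
        return F.relu(a @ self.proj(h))


class LocalMultiGCN(nn.Module):
    """One independent GCN per local frequency band; nodes are the frequency bins
    inside that band, so bin-level dynamics are modeled separately per band."""

    def __init__(self, num_bins: int, num_bands: int, in_dim: int, out_dim: int):
        super().__init__()
        assert num_bins % num_bands == 0
        self.band_size = num_bins // num_bands
        self.band_gcns = nn.ModuleList(
            [SimpleGCN(self.band_size, in_dim, out_dim) for _ in range(num_bands)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, num_bins, feat_dim)
        bands = x.split(self.band_size, dim=1)             # split the frequency axis into bands
        return torch.cat([gcn(b) for gcn, b in zip(self.band_gcns, bands)], dim=1)


class GlobalMultiGCN(nn.Module):
    """Two global graphs: one over frequency bands (pooled from the local branch)
    and one over time frames of the initial features."""

    def __init__(self, num_bands: int, num_frames: int, band_dim: int, time_dim: int, out_dim: int):
        super().__init__()
        self.band_gcn = SimpleGCN(num_bands, band_dim, out_dim)    # inter-band correlations
        self.time_gcn = SimpleGCN(num_frames, time_dim, out_dim)   # global temporal context

    def forward(self, band_feat: torch.Tensor, init_feat: torch.Tensor) -> torch.Tensor:
        # band_feat: (batch, num_bands, band_dim); init_feat: (batch, num_frames, time_dim)
        g_freq = self.band_gcn(band_feat).mean(dim=1)       # pool band nodes -> (batch, out_dim)
        g_time = self.time_gcn(init_feat).mean(dim=1)       # pool time nodes -> (batch, out_dim)
        return g_freq + g_time                              # fuse the two global views


class AMGCNSketch(nn.Module):
    """Local branch -> pooled band features -> global branch -> emotion logits."""

    def __init__(self, num_bins=64, num_bands=8, num_frames=300, feat_dim=32, hidden=64, num_classes=4):
        super().__init__()
        self.local = LocalMultiGCN(num_bins, num_bands, feat_dim, hidden)
        self.global_branch = GlobalMultiGCN(num_bands, num_frames, hidden, num_bins, hidden)
        self.num_bands = num_bands
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, spec: torch.Tensor, frame_feat: torch.Tensor) -> torch.Tensor:
        # spec: per-bin features (batch, num_bins, feat_dim);
        # frame_feat: initial frame-level features (batch, num_frames, num_bins)
        local_out = self.local(spec)                        # (batch, num_bins, hidden)
        band_feat = local_out.view(
            spec.size(0), self.num_bands, -1, local_out.size(-1)
        ).mean(dim=2)                                       # pool bins within each band
        return self.classifier(self.global_branch(band_feat, frame_feat))


if __name__ == "__main__":
    model = AMGCNSketch()
    spec = torch.randn(2, 64, 32)      # toy per-bin spectral features
    frames = torch.randn(2, 300, 64)   # toy frame-level features (e.g., log-Mel frames)
    print(model(spec, frames).shape)   # expected: torch.Size([2, 4])

Usage note: in this sketch the per-band GCNs never share parameters, which mirrors the abstract's claim that each band has its own independent graph; how the real AMGCN builds its adjacencies and fuses the two global graphs is not specified here and is left as an assumption.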
Pages: 11