Speech emotion recognition based on bi-directional acoustic-articulatory conversion

Cited by: 3
Authors
Li, Haifeng [1]
Zhang, Xueying [1]
Duan, Shufei [1]
Liang, Huizhi [2]
Affiliations
[1] Taiyuan Univ Technol, Coll Elect Informat & Opt Engn, Taiyuan 030024, Shanxi, Peoples R China
[2] Newcastle Univ, Sch Comp, Newcastle Upon Tyne NE1 7RU, England
Keywords
Speech emotion recognition; Acoustic and articulatory conversions; Cycle-consistent generative adversarial networks; Temporal convolutional network; Contrastive learning; FEATURES; FUSION; NETWORK; LSTM
DOI
10.1016/j.knosys.2024.112123
CLC classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Acoustic and articulatory signals are naturally coupled and complementary. The difficulty of acquiring articulatory data and the nonlinear ill-posedness of acoustic-articulatory conversion have led previous studies on speech emotion recognition (SER) to rely primarily on unidirectional acoustic-articulatory conversion, ignoring the potential benefits of bi-directional conversion. Addressing the nonlinear ill-posedness and effectively extracting and utilizing these two modal features in SER remain open research questions. To bridge this gap, this study proposes the Bi-A2CEmo framework, which exploits bi-directional acoustic-articulatory conversion for SER. The framework comprises three components: Bi-MGAN, which addresses the nonlinear ill-posedness problem; KCLNet, which enhances the emotional attributes of the mapped features; and ResTCN-FDA, which fully exploits the emotional attributes of the features. Another challenge is the absence of a parallel acoustic-articulatory emotion database. To overcome this, the study uses electromagnetic articulography (EMA) to create STEM-E²VA, a multi-modal acoustic-articulatory emotion database for Mandarin Chinese. A comparative analysis is then conducted between the proposed method and state-of-the-art models to evaluate the effectiveness of the framework. Bi-A2CEmo achieves an accuracy of 89.04% in SER, an improvement of 5.27% over using the actual acoustic and articulatory features recorded by EMA. The results on the STEM-E²VA dataset show that Bi-MGAN achieves higher accuracy in mapping and inversion than conventional conversion networks. Visualization of the mapped features before and after enhancement reveals that KCLNet reduces the intra-class spacing while increasing the inter-class spacing of the features. ResTCN-FDA demonstrates high recognition accuracy on three publicly available datasets. The experimental results show that the proposed bi-directional acoustic-articulatory conversion framework can significantly improve SER performance.
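The cycle-consistency idea behind the Bi-MGAN component can be illustrated with a minimal sketch: two mappers are trained jointly so that converting acoustic features to articulatory features and back (and vice versa) reconstructs the input, which is the standard CycleGAN-style way to constrain an ill-posed bi-directional mapping. This is an assumption-laden illustration only; the feature dimensions, the simple MLP mappers, and the loss weight below are hypothetical and are not taken from the paper, which additionally uses adversarial and other losses.

# Illustrative sketch (assumed setup, not the paper's implementation)
import torch
import torch.nn as nn

ACOUSTIC_DIM = 39       # assumed acoustic feature size per frame (hypothetical)
ARTICULATORY_DIM = 18   # assumed EMA articulatory feature size (hypothetical)

def mlp(in_dim, out_dim, hidden=256):
    """Small feed-forward mapper standing in for a generator network."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

G = mlp(ACOUSTIC_DIM, ARTICULATORY_DIM)   # forward mapping: acoustic -> articulatory
F = mlp(ARTICULATORY_DIM, ACOUSTIC_DIM)   # inverse mapping: articulatory -> acoustic

l1 = nn.L1Loss()

def cycle_consistency_loss(acoustic, articulatory, lambda_cyc=10.0):
    """Constrain both conversion directions at once:
    acoustic -> articulatory -> acoustic and articulatory -> acoustic -> articulatory.
    In a full CycleGAN-style setup this term is added to the adversarial losses."""
    rec_acoustic = F(G(acoustic))           # round trip through the forward mapper
    rec_articulatory = G(F(articulatory))   # round trip through the inverse mapper
    return lambda_cyc * (l1(rec_acoustic, acoustic) + l1(rec_articulatory, articulatory))

# Toy usage with random frames (batch of 8 feature vectors per modality)
acoustic = torch.randn(8, ACOUSTIC_DIM)
articulatory = torch.randn(8, ARTICULATORY_DIM)
print(cycle_consistency_loss(acoustic, articulatory).item())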
Pages: 17