Speech emotion recognition based on bi-directional acoustic-articulatory conversion

Cited by: 3
Authors
Li, Haifeng [1]
Zhang, Xueying [1]
Duan, Shufei [1]
Liang, Huizhi [2]
Affiliations
[1] Taiyuan Univ Technol, Coll Elect Informat & Opt Engn, Taiyuan 030024, Shanxi, Peoples R China
[2] Newcastle Univ, Sch Comp, Newcastle Upon Tyne NE1 7RU, England
Keywords
Speech emotion recognition; Acoustic and articulatory conversions; Cycle-consistent generative adversarial networks; Temporal convolutional network; Contrastive learning; FEATURES; FUSION; NETWORK; LSTM
DOI
10.1016/j.knosys.2024.112123
CLC classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Acoustic and articulatory signals are naturally coupled and complementary. The difficulty of acquiring articulatory data and the nonlinear ill-posedness of acoustic-articulatory conversion have led previous studies on speech emotion recognition (SER) to rely primarily on unidirectional acoustic-articulatory conversion, ignoring the potential benefits of bi-directional conversion. Addressing the nonlinear ill-posedness and effectively extracting and utilizing these two modal features in SER remain open research questions. To bridge this gap, this study proposes the Bi-A2CEmo framework, which exploits bi-directional acoustic-articulatory conversion for SER. The framework comprises three components: Bi-MGAN, which addresses the nonlinear ill-posedness problem; KCLNet, which enhances the emotional attributes of the mapped features; and ResTCN-FDA, which fully exploits the emotional attributes of the features. Another challenge is the absence of a parallel acoustic-articulatory emotion database. To overcome this, the study uses electromagnetic articulography (EMA) to create STEM-E²VA, a multi-modal acoustic-articulatory emotion database for Mandarin Chinese. A comparative analysis is then conducted between the proposed method and state-of-the-art models to evaluate the effectiveness of the framework. Bi-A2CEmo achieves an accuracy of 89.04% in SER, an improvement of 5.27% over using the actual acoustic and articulatory features recorded by EMA. The results on the STEM-E²VA dataset show that Bi-MGAN achieves higher accuracy in mapping and inversion than conventional conversion networks. Visualization of the mapped features before and after enhancement reveals that KCLNet reduces the intra-class spacing while increasing the inter-class spacing of the features. ResTCN-FDA demonstrates high recognition accuracy on three publicly available datasets. The experimental results show that the proposed bi-directional acoustic-articulatory conversion framework can significantly improve SER performance.
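The cycle-consistency idea behind the Bi-MGAN component can be illustrated with a minimal sketch: two mappers are trained jointly so that converting acoustic features to articulatory features and back (and vice versa) reconstructs the input, which is the standard CycleGAN-style way to constrain an ill-posed bi-directional mapping. This is an assumption-laden illustration only; the feature dimensions, the simple MLP mappers, and the loss weight below are hypothetical and are not taken from the paper, which additionally uses adversarial and other losses.

# Illustrative sketch (assumed setup, not the paper's implementation)
import torch
import torch.nn as nn

ACOUSTIC_DIM = 39       # assumed acoustic feature size per frame (hypothetical)
ARTICULATORY_DIM = 18   # assumed EMA articulatory feature size (hypothetical)

def mlp(in_dim, out_dim, hidden=256):
    """Small feed-forward mapper standing in for a generator network."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

G = mlp(ACOUSTIC_DIM, ARTICULATORY_DIM)   # forward mapping: acoustic -> articulatory
F = mlp(ARTICULATORY_DIM, ACOUSTIC_DIM)   # inverse mapping: articulatory -> acoustic

l1 = nn.L1Loss()

def cycle_consistency_loss(acoustic, articulatory, lambda_cyc=10.0):
    """Constrain both conversion directions at once:
    acoustic -> articulatory -> acoustic and articulatory -> acoustic -> articulatory.
    In a full CycleGAN-style setup this term is added to the adversarial losses."""
    rec_acoustic = F(G(acoustic))           # round trip through the forward mapper
    rec_articulatory = G(F(articulatory))   # round trip through the inverse mapper
    return lambda_cyc * (l1(rec_acoustic, acoustic) + l1(rec_articulatory, articulatory))

# Toy usage with random frames (batch of 8 feature vectors per modality)
acoustic = torch.randn(8, ACOUSTIC_DIM)
articulatory = torch.randn(8, ARTICULATORY_DIM)
print(cycle_consistency_loss(acoustic, articulatory).item())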
Pages: 17