Speech emotion recognition based on bi-directional acoustic-articulatory conversion

Cited by: 1
Authors
Li, Haifeng [1]
Zhang, Xueying [1]
Duan, Shufei [1]
Liang, Huizhi [2]
Affiliations
[1] Taiyuan Univ Technol, Coll Elect Informat & Opt Engn, Taiyuan 030024, Shanxi, Peoples R China
[2] Newcastle Univ, Sch Comp, Newcastle Upon Tyne NE1 7RU, England
Keywords
Speech emotion recognition; Acoustic and articulatory conversions; Cycle-consistent generative adversarial networks; Temporal convolutional network; Contrastive learning; FEATURES; FUSION; NETWORK; LSTM
DOI
10.1016/j.knosys.2024.112123
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Acoustic and articulatory signals are naturally coupled and complementary. The difficulty of acquiring articulatory data and the nonlinear ill-posedness of acoustic-articulatory conversion have led previous studies on speech emotion recognition (SER) to rely primarily on unidirectional acoustic-to-articulatory conversion, ignoring the potential benefits of bi-directional conversion. Addressing the nonlinear ill-posedness and effectively extracting and exploiting these two modal feature sets for SER remain open research questions. To bridge this gap, this study proposes Bi-A2CEmo, a framework that performs bi-directional acoustic-articulatory conversion for SER. The framework comprises three components: Bi-MGAN, which addresses the nonlinear ill-posedness problem; KCLNet, which enhances the emotional attributes of the mapped features; and ResTCN-FDA, which fully exploits the emotional attributes of the features. A further challenge is the absence of a parallel acoustic-articulatory emotion database. To overcome it, this study uses electromagnetic articulography (EMA) to create STEM-E2VA, a multi-modal acoustic-articulatory emotion database for Mandarin Chinese. A comparative analysis against state-of-the-art models is then conducted to evaluate the effectiveness of the framework. Bi-A2CEmo achieves an accuracy of 89.04% in SER, an improvement of 5.27% over using the actual acoustic and articulatory features recorded by EMA. Results on STEM-E2VA show that Bi-MGAN achieves higher accuracy in both mapping and inversion than conventional conversion networks. Visualization of the mapped features before and after enhancement reveals that KCLNet reduces intra-class spacing while increasing inter-class spacing. ResTCN-FDA demonstrates high recognition accuracy on three publicly available datasets. The experimental results show that the proposed bi-directional acoustic-articulatory conversion framework significantly improves SER performance.
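The abstract gives no implementation details, but the core mechanism behind a bi-directional conversion network like Bi-MGAN is a cycle-consistent adversarial setup: two generators map between the acoustic and articulatory domains, and a round-trip reconstruction loss constrains the otherwise ill-posed inverse mapping. The PyTorch sketch below illustrates only that cycle-consistency term (the adversarial discriminators are omitted); the MLP generators, the feature dimensions (e.g. MFCCs vs. EMA sensor coordinates), and the names G_a2v/G_v2a are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the cycle-consistency idea behind bi-directional
# acoustic <-> articulatory conversion. All shapes and names are assumptions.
import torch
import torch.nn as nn

class Mapper(nn.Module):
    """Simple MLP standing in for one direction of the conversion."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
    def forward(self, x):
        return self.net(x)

acoustic_dim, artic_dim = 40, 18          # assumed: MFCCs vs. EMA coordinates
G_a2v = Mapper(acoustic_dim, artic_dim)   # acoustic -> articulatory
G_v2a = Mapper(artic_dim, acoustic_dim)   # articulatory -> acoustic
l1 = nn.L1Loss()

acoustic = torch.randn(8, acoustic_dim)   # dummy batch
artic = torch.randn(8, artic_dim)

# Round trips in both directions; constraining both round trips is what
# regularizes the ill-posed inverse mapping.
cycle_loss = l1(G_v2a(G_a2v(acoustic)), acoustic) + \
             l1(G_a2v(G_v2a(artic)), artic)
cycle_loss.backward()
```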
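KCLNet relies on contrastive learning to sharpen the emotional attributes of the mapped features. A generic supervised-contrastive objective of the kind such a network could use is sketched below: embeddings with the same emotion label are pulled together and all others pushed apart. This is the standard SupCon-style loss, assumed here as a stand-in for the paper's exact formulation.

```python
# Generic supervised contrastive loss (SupCon-style); not KCLNet's exact loss.
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, tau=0.1):
    """z: (N, D) embeddings; labels: (N,) integer emotion ids."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                      # cosine similarity / temperature
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float('-inf'))   # never contrast with self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # mean log-probability of same-emotion pairs, per anchor
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    return (-pos_log_prob.sum(dim=1) / pos_count).mean()

# toy usage: 16 embeddings over 4 emotion classes
z = torch.randn(16, 128, requires_grad=True)
labels = torch.randint(0, 4, (16,))
supcon_loss(z, labels).backward()
```

Minimizing this objective shrinks intra-class distances and widens inter-class distances, which matches the visualization result reported in the abstract.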
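ResTCN-FDA builds on residual temporal convolutional networks. The sketch below shows the generic TCN building block, a residual block of dilated causal 1-D convolutions whose receptive field grows with the dilation rate; the frequency-dimension attention part of ResTCN-FDA is omitted, and all hyperparameters are assumptions.

```python
# Generic residual TCN block (dilated causal convolutions); the FDA attention
# module of ResTCN-FDA is not reproduced here.
import torch
import torch.nn as nn

class ResTCNBlock(nn.Module):
    """One residual block of dilated causal 1-D convolutions."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation   # left-pad amount for causality
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def _causal(self, conv, x):
        # pad only on the left so outputs never depend on future frames
        return conv(nn.functional.pad(x, (self.pad, 0)))

    def forward(self, x):                         # x: (batch, channels, time)
        y = torch.relu(self._causal(self.conv1, x))
        y = self._causal(self.conv2, y)
        return torch.relu(x + y)                  # residual connection

# toy usage: stack blocks with doubling dilation for a growing receptive field
tcn = nn.Sequential(*[ResTCNBlock(64, dilation=2 ** i) for i in range(4)])
out = tcn(torch.randn(2, 64, 100))               # -> (2, 64, 100)
```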
Pages: 17