Speech emotion recognition based on bi-directional acoustic-articulatory conversion

Cited by: 1
Authors
Li, Haifeng [1]
Zhang, Xueying [1]
Duan, Shufei [1]
Liang, Huizhi [2]
Affiliations
[1] Taiyuan Univ Technol, Coll Elect Informat & Opt Engn, Taiyuan 030024, Shanxi, Peoples R China
[2] Newcastle Univ, Sch Comp, Newcastle Upon Tyne NE1 7RU, England
Keywords
Speech emotion recognition; Acoustic and articulatory conversions; Cycle-consistent generative adversarial networks; Temporal convolutional network; Contrastive learning; FEATURES; FUSION; NETWORK; LSTM
DOI
10.1016/j.knosys.2024.112123
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Acoustic and articulatory signals are naturally coupled and complementary. The difficulty of acquiring articulatory data and the nonlinear ill-posedness of acoustic-articulatory conversion have led previous studies on speech emotion recognition (SER) to rely primarily on unidirectional acoustic-to-articulatory conversion, ignoring the potential benefits of bi-directional conversion. Addressing the nonlinear ill-posedness and effectively extracting and exploiting these two modal feature sets for SER remain open research questions. To bridge this gap, this study proposes Bi-A2CEmo, a framework that performs bi-directional acoustic-articulatory conversion for SER. The framework comprises three components: Bi-MGAN, which addresses the nonlinear ill-posedness problem; KCLNet, which enhances the emotional attributes of the mapped features; and ResTCN-FDA, which fully exploits the emotional attributes of the features. A further challenge is the absence of a parallel acoustic-articulatory emotion database. To overcome it, this study uses electromagnetic articulography (EMA) to create STEM-E2VA, a multi-modal acoustic-articulatory emotion database for Mandarin Chinese. A comparative analysis against state-of-the-art models is then conducted to evaluate the effectiveness of the framework. Bi-A2CEmo achieves an accuracy of 89.04% in SER, an improvement of 5.27% over using the actual acoustic and articulatory features recorded by EMA. Results on STEM-E2VA show that Bi-MGAN achieves higher accuracy in both mapping and inversion than conventional conversion networks. Visualization of the mapped features before and after enhancement reveals that KCLNet reduces intra-class spacing while increasing inter-class spacing. ResTCN-FDA demonstrates high recognition accuracy on three publicly available datasets. The experimental results show that the proposed bi-directional acoustic-articulatory conversion framework significantly improves SER performance.
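The abstract gives no implementation details, but the core mechanism behind a bi-directional conversion network like Bi-MGAN is a cycle-consistent adversarial setup: two generators map between the acoustic and articulatory domains, and a round-trip reconstruction loss constrains the otherwise ill-posed inverse mapping. The PyTorch sketch below illustrates only that cycle-consistency term (the adversarial discriminators are omitted); the MLP generators, the feature dimensions (e.g. MFCCs vs. EMA sensor coordinates), and the names G_a2v/G_v2a are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the cycle-consistency idea behind bi-directional
# acoustic <-> articulatory conversion. All shapes and names are assumptions.
import torch
import torch.nn as nn

class Mapper(nn.Module):
    """Simple MLP standing in for one direction of the conversion."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
    def forward(self, x):
        return self.net(x)

acoustic_dim, artic_dim = 40, 18          # assumed: MFCCs vs. EMA coordinates
G_a2v = Mapper(acoustic_dim, artic_dim)   # acoustic -> articulatory
G_v2a = Mapper(artic_dim, acoustic_dim)   # articulatory -> acoustic
l1 = nn.L1Loss()

acoustic = torch.randn(8, acoustic_dim)   # dummy batch
artic = torch.randn(8, artic_dim)

# Round trips in both directions; constraining both round trips is what
# regularizes the ill-posed inverse mapping.
cycle_loss = l1(G_v2a(G_a2v(acoustic)), acoustic) + \
             l1(G_a2v(G_v2a(artic)), artic)
cycle_loss.backward()
```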
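KCLNet relies on contrastive learning to sharpen the emotional attributes of the mapped features. A generic supervised-contrastive objective of the kind such a network could use is sketched below: embeddings with the same emotion label are pulled together and all others pushed apart. This is the standard SupCon-style loss, assumed here as a stand-in for the paper's exact formulation.

```python
# Generic supervised contrastive loss (SupCon-style); not KCLNet's exact loss.
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, tau=0.1):
    """z: (N, D) embeddings; labels: (N,) integer emotion ids."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                      # cosine similarity / temperature
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float('-inf'))   # never contrast with self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # mean log-probability of same-emotion pairs, per anchor
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    return (-pos_log_prob.sum(dim=1) / pos_count).mean()

# toy usage: 16 embeddings over 4 emotion classes
z = torch.randn(16, 128, requires_grad=True)
labels = torch.randint(0, 4, (16,))
supcon_loss(z, labels).backward()
```

Minimizing this objective shrinks intra-class distances and widens inter-class distances, which matches the visualization result reported in the abstract.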
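ResTCN-FDA builds on residual temporal convolutional networks. The sketch below shows the generic TCN building block, a residual block of dilated causal 1-D convolutions whose receptive field grows with the dilation rate; the frequency-dimension attention part of ResTCN-FDA is omitted, and all hyperparameters are assumptions.

```python
# Generic residual TCN block (dilated causal convolutions); the FDA attention
# module of ResTCN-FDA is not reproduced here.
import torch
import torch.nn as nn

class ResTCNBlock(nn.Module):
    """One residual block of dilated causal 1-D convolutions."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation   # left-pad amount for causality
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def _causal(self, conv, x):
        # pad only on the left so outputs never depend on future frames
        return conv(nn.functional.pad(x, (self.pad, 0)))

    def forward(self, x):                         # x: (batch, channels, time)
        y = torch.relu(self._causal(self.conv1, x))
        y = self._causal(self.conv2, y)
        return torch.relu(x + y)                  # residual connection

# toy usage: stack blocks with doubling dilation for a growing receptive field
tcn = nn.Sequential(*[ResTCNBlock(64, dilation=2 ** i) for i in range(4)])
out = tcn(torch.randn(2, 64, 100))               # -> (2, 64, 100)
```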
Pages: 17