Speech Emotion Recognition Using Deep Learning Transfer Models and Explainable Techniques

Cited by: 4
Authors
Kim, Tae-Wan [1 ]
Kwak, Keun-Chang [1 ]
Affiliations
[1] Chosun Univ, Dept Elect Engn, Interdisciplinary Program IT Bio Convergence Syst, Gwangju 61452, South Korea
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, Issue 04
Keywords
speech emotion recognition; explainable model; deep learning; YAMNet; VGGish; audible feature
DOI
10.3390/app14041553
Abstract
This study aims to achieve greater reliability than conventional speech emotion recognition (SER) studies through three means: preprocessing that reduces sources of uncertainty, a model that combines the structural strengths of its constituent networks, and the application of several explainability techniques. Interpretation becomes more trustworthy when uncertain training data are filtered out, data from different recording environments are included, and techniques that explain the reasoning behind the results are applied. We designed a generalized model using three different datasets, converting each utterance into a spectrogram image via short-time Fourier transform (STFT) preprocessing. The spectrogram is divided along the time axis into overlapping segments that match the input size of the model. Each segment is modeled as a Gaussian distribution, and data quality is assessed by the correlation coefficient between these distributions; as a result, the volume of data is reduced and uncertainty is minimized. VGGish and YAMNet are among the most widely used pretrained deep networks for speech processing, and combining them is often more effective than using either alone, which motivates the construction of our ensemble deep network. Finally, several explainability techniques (Grad-CAM, LIME, and occlusion sensitivity) are used to analyze the classification results. The model adapts to voices recorded in various environments and achieves a classification accuracy of 87%, surpassing the individual models. In addition, the classification outputs are examined with the explainable models to extract the salient emotional regions, which are converted back into audio files for auditory analysis using Grad-CAM in the time domain. Through this study, we reduce the uncertainty of the activation regions generated by Grad-CAM by building on the interpretability of previous studies together with effective preprocessing and a fusion model, and the other explainability techniques allow the results to be analyzed from more diverse perspectives.
Pages: 23
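As a rough illustration of the preprocessing pipeline the abstract describes, the sketch below converts a waveform to a log-magnitude STFT spectrogram, splits it into overlapping time segments, summarizes each segment as a Gaussian (per-frequency mean and standard deviation), and filters segments by the correlation of those statistics. The sample rate, FFT settings, segment length, and the 0.9 correlation threshold are illustrative assumptions, not the authors' reported settings.

```python
import numpy as np
import librosa

def spectrogram_segments(path, seg_frames=96, overlap=0.5):
    """Load speech, compute a log-magnitude STFT spectrogram, and split it
    into overlapping time segments sized to the downstream model input."""
    y, sr = librosa.load(path, sr=16000, mono=True)
    spec = np.abs(librosa.stft(y, n_fft=512, hop_length=160))    # (freq, time)
    log_spec = librosa.amplitude_to_db(spec, ref=np.max)
    step = max(1, int(seg_frames * (1.0 - overlap)))
    return [log_spec[:, i:i + seg_frames]
            for i in range(0, log_spec.shape[1] - seg_frames + 1, step)]

def filter_segments(segments, threshold=0.9):
    """Describe each segment by a Gaussian over frequency bins (mean, std)
    and keep segments whose statistics correlate strongly with the
    utterance-level average, discarding uncertain data."""
    feats = np.array([np.concatenate([s.mean(axis=1), s.std(axis=1)])
                      for s in segments])                        # (n, 2*freq)
    ref = feats.mean(axis=0)
    return [s for s, f in zip(segments, feats)
            if np.corrcoef(f, ref)[0, 1] >= threshold]
```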
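The abstract fuses VGGish and YAMNet rather than using either alone. Both backbones are published on TensorFlow Hub and accept 16 kHz mono waveforms; one plausible fusion, sketched below under that assumption, concatenates their clip-level embeddings before a small dense classification head. The head's layer sizes and the seven-class output are illustrative assumptions, not the paper's confirmed architecture.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Pretrained audio backbones from TensorFlow Hub (both expect 16 kHz mono
# float32 waveforms in [-1, 1]).
yamnet = hub.load('https://tfhub.dev/google/yamnet/1')
vggish = hub.load('https://tfhub.dev/google/vggish/1')

def fused_embedding(waveform):
    """Concatenate clip-level YAMNet (1024-d) and VGGish (128-d) embeddings."""
    _, yam_emb, _ = yamnet(waveform)                  # (frames, 1024)
    vgg_emb = vggish(waveform)                        # (examples, 128)
    return tf.concat([tf.reduce_mean(yam_emb, axis=0),
                      tf.reduce_mean(vgg_emb, axis=0)], axis=0)  # (1152,)

# Hypothetical classification head on the fused embedding; the layer sizes
# and the number of emotion classes are assumptions for illustration.
head = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1152,)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(7, activation='softmax'),
])
```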
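For the auditory analysis step, the abstract converts Grad-CAM activations back into sound. A minimal sketch, assuming a precomputed heatmap `cam` already resized to the spectrogram's (frequency, time) shape with values in [0, 1]: the complex STFT is weighted by the map and inverted, so only the regions the model attended to remain audible.

```python
import numpy as np
import librosa
import soundfile as sf

def cam_to_audio(y, cam, sr=16000, n_fft=512, hop=160, out='cam_region.wav'):
    """Weight the complex STFT by a Grad-CAM heatmap and invert it, writing
    an audio file that retains only the highly activated regions."""
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    mask = np.clip(cam, 0.0, 1.0)        # assumed shape == stft.shape
    y_masked = librosa.istft(stft * mask, hop_length=hop, length=len(y))
    sf.write(out, y_masked, sr)
    return y_masked
```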