Speech emotion recognition approaches: A systematic review

Cited by: 19
Authors
Hashem, Ahlam [1]
Arif, Muhammad [1]
Alghamdi, Manal [1]
Affiliation
[1] Umm Al Qura Univ, Dept Comp Sci, Al Abdiyah, Makkah, Saudi Arabia
Keywords
Speech emotion recognition; Emotional speech database; Classification of emotion; Speech features; Systematic review; TIME-COURSE; NEURAL-NETWORK; FEATURES; SELECTION; DOMAIN; REPRESENTATIONS; CLASSIFICATION; CLASSIFIERS; INFORMATION; PERFORMANCE;
DOI
10.1016/j.specom.2023.102974
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Speech emotion recognition (SER) has been an active research field since it became a crucial component of advanced Human-Computer Interaction (HCI), and it is used in a wide range of real-life applications. In recent years, researchers have examined many aspects of SER systems, including the availability of appropriate emotional databases, the selection of robust features, and the application of suitable Machine Learning (ML) and Deep Learning (DL) classifiers. Deep models have proved more accurate for SER than conventional ML techniques. Nevertheless, SER remains a challenging classification task: separating similar emotional patterns requires a highly discriminative feature representation. To this end, this survey critically analyzes prior studies that recognize emotions from speech audio and reviews the current state of SER using DL. Through a systematic literature search of selected keywords over 2012-2022, 96 papers were extracted, covering the most recent findings and directions. Specifically, we cover emotional speech databases (acted, evoked, and natural), speech features (prosodic, spectral, voice quality, and Teager energy operator), and the necessary preprocessing steps. Furthermore, different DL models and their performance are examined in depth. Based on our review, we also suggest SER aspects that could be considered in future work.
Pages: 29
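
As a concrete illustration of the feature categories named in the abstract, the sketch below extracts spectral (MFCC) and prosodic (F0, frame energy) descriptors from one utterance and mean-pools them into an utterance-level vector; voice-quality and Teager-energy features would follow the same frame-then-pool pattern. This is a minimal sketch assuming the librosa library; the file name "utterance.wav", the 16 kHz sampling rate, and the 13-coefficient setting are illustrative choices, not values taken from the reviewed paper.

    import librosa
    import numpy as np

    # Preprocessing: load the utterance as 16 kHz mono, a common SER choice.
    signal, sr = librosa.load("utterance.wav", sr=16000)

    # Spectral features: 13 Mel-frequency cepstral coefficients per frame.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

    # Prosodic features: fundamental-frequency (F0) contour via pYIN;
    # unvoiced frames come back as NaN.
    f0, voiced_flag, voiced_prob = librosa.pyin(
        signal, fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"), sr=sr)

    # Frame-level RMS energy, often paired with F0 as a prosodic cue.
    energy = librosa.feature.rms(y=signal)

    # A common baseline representation: mean-pool frame features into a
    # single utterance-level vector before feeding an ML or DL classifier.
    feature_vector = np.concatenate(
        [mfcc.mean(axis=1), [np.nanmean(f0)], energy.mean(axis=1)])

Mean-pooling is only one pooling strategy; many of the DL models surveyed instead consume the frame-level sequences directly (e.g., recurrent or attention-based architectures).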