Speech emotion recognition using multi resolution Hilbert transform based spectral and entropy features

Cited by: 0
Authors
Mishra, Siba Prasad [1]
Warule, Pankaj [1]
Deb, Suman [1]
Affiliations
[1] Sardar Vallabhbhai Natl Inst Technol, Surat, Gujarat, India
Keywords
Deep neural network; Speech emotion recognition; Mel frequency cepstral coefficient; MRHT; MRHAE; MRHPE; MRHIE; MRHSE; MRHSME; PERMUTATION ENTROPY; APPROXIMATE ENTROPY; CLASSIFICATION; DIAGNOSIS; DISEASE;
DOI
10.1016/j.apacoust.2024.110403
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Speech emotion recognition (SER) is essential for addressing many personal and professional challenges in everyday life. SER has shown potential in domains such as medical intervention, security systems, online marketing and educational platforms, personal communication, and human-device interaction. Because of this wide range of applications, the subject has attracted researchers' attention for more than three decades. SER performance can be improved by choosing a suitable method for extracting features and using them to classify speech emotion. In this study, we use a novel technique, the multi-resolution Hilbert transform (MRHT), to extract speech features. A multi-resolution signal decomposition (MRSD) method breaks each speech signal frame (SSF) into a number of sub-frequency-band signals, called modes or intrinsic mode functions (IMFs). The Hilbert transform (HT) is then applied to each IMF to obtain the MRHT-based instantaneous amplitude (MRHIA) and MRHT-based instantaneous frequency (MRHIF) signal vectors. From each MRHIA and MRHIF vector we compute five features: MRHT-based approximate entropy (MRHAE), MRHT-based permutation entropy (MRHPE), MRHT-based increment entropy (MRHIE), MRHT-based spectral entropy (MRHSE), and MRHT-based sample entropy (MRHSME). The mel frequency cepstral coefficient (MFCC) feature is extracted from the speech signals. The combination of the proposed MRHT-based features (MRHAE + MRHPE + MRHIE + MRHSE + MRHSME) is referred to as the MRHT-based entropy feature (MRHEF). The MRHEF and MFCC features are then used both alone and in combination to classify speech emotion with a deep neural network (DNN) classifier. This yields emotion classification accuracies of 89.67%, 85.42%, and 83.48% on the EMO-DB, EMOVO, and SAVEE datasets, respectively. Compared with other approaches, the proposed feature combination (MFCC + MRHEF) with a DNN classifier outperformed state-of-the-art SER methods.
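The Hilbert-transform and entropy steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the MRSD decomposition, so the sketch takes a single, already-separated mode as input and uses a synthetic 50 Hz tone as a stand-in. The function names, the sampling rate, and the permutation-entropy parameters (`order=3`, `delay=1`) are assumptions made for illustration only.

```python
import math

import numpy as np


def analytic_signal(x):
    """Analytic signal via the FFT: zero negative frequencies, double positive ones."""
    x = np.asarray(x, dtype=float)
    n = x.size
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0  # keep the DC component unchanged
    if n % 2 == 0:
        gain[n // 2] = 1.0  # keep the Nyquist component unchanged
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)


def instantaneous_amplitude_frequency(mode, fs):
    """Instantaneous amplitude and frequency (Hz) of one mode sampled at fs Hz."""
    z = analytic_signal(mode)
    amplitude = np.abs(z)
    phase = np.unwrap(np.angle(z))
    # The phase derivative gives the frequency; the result has one sample fewer.
    frequency = np.diff(phase) * fs / (2.0 * np.pi)
    return amplitude, frequency


def permutation_entropy(x, order=3, delay=1):
    """Normalized permutation entropy in [0, 1] (order and delay are assumed values)."""
    x = np.asarray(x, dtype=float)
    n_vectors = x.size - (order - 1) * delay
    counts = {}
    for i in range(n_vectors):
        window = x[i:i + order * delay:delay]
        pattern = tuple(np.argsort(window, kind="stable"))
        counts[pattern] = counts.get(pattern, 0) + 1
    probs = np.array(list(counts.values()), dtype=float) / n_vectors
    return float(-np.sum(probs * np.log2(probs)) / math.log2(math.factorial(order)))


# Hypothetical stand-in for one mode: a 50 Hz tone, 1 s long, sampled at 1 kHz.
fs = 1000.0
t = np.arange(1000) / fs
mode = np.sin(2.0 * np.pi * 50.0 * t)

amplitude, frequency = instantaneous_amplitude_frequency(mode, fs)
amplitude_entropy = permutation_entropy(amplitude)
frequency_entropy = permutation_entropy(frequency)
```

For this tone, which spans a whole number of periods, the instantaneous amplitude stays close to 1 and the instantaneous frequency close to 50 Hz. In a fuller pipeline the same entropy function would be applied to the amplitude and frequency vectors of every mode, and the resulting values would be concatenated into a feature vector.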
Pages: 15