Analyzing the influence of different speech data corpora and speech features on speech emotion recognition: A review

Cited by: 1
Authors
Rathi, Tarun [1 ]
Tripathy, Manoj [1 ]
Affiliations
[1] Indian Inst Technol, Dept Elect Engn, Roorkee 247667, India
Keywords
Speech emotion recognition; Speech emotional data corpus; Speech features; Mel-frequency cepstral coefficients; Deep neural network; Convolutional neural network
DOI
10.1016/j.specom.2024.103102
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Emotion recognition from speech has become crucial in human-computer interaction and affective computing applications. This review paper examines the complex relationship between two critical factors, the selection of speech data corpora and the extraction of speech features, and their effect on speech emotion classification accuracy. Through an extensive analysis of literature from 2014 to 2023, publicly available speech datasets are explored and categorized based on their diversity, scale, linguistic attributes, and emotional classifications. The importance of various speech features, from basic spectral features to sophisticated prosodic cues, and their influence on emotion recognition accuracy is analyzed. In the context of speech data corpora, this review unveils trends and insights from comparative studies exploring the repercussions of dataset choice on recognition efficacy. Datasets such as IEMOCAP, EMODB, and MSP-IMPROV are scrutinized in terms of their influence on the classification accuracy of speech emotion recognition (SER) systems, and potential challenges associated with dataset limitations are also examined. Notable features such as Mel-frequency cepstral coefficients, pitch, intensity, and prosodic patterns are evaluated for their contributions to emotion recognition, and advanced feature extraction methods are explored for their potential to capture intricate emotional dynamics. Moreover, this review offers insights into the methodological aspects of emotion recognition, shedding light on the diverse machine learning and deep learning approaches employed. Through a holistic synthesis of research findings, this review identifies connections between the choice of speech data corpus, the selection of speech features, and the resulting emotion recognition accuracy. As the field continues to evolve, avenues for future research are proposed, ranging from enhanced feature extraction techniques to the development of standardized benchmark datasets. In essence, this review serves as a compass guiding researchers and practitioners through the intricate landscape of speech emotion recognition, offering a nuanced understanding of the factors that shape its recognition accuracy.
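To make the feature terminology in the abstract concrete, the following is a minimal illustrative sketch (not code from the reviewed paper) of how the spectral and prosodic features it names, MFCCs, pitch, and intensity, might be extracted with librosa and pooled into an utterance-level vector for an SER classifier. The file path, sampling rate, number of coefficients, and mean/std pooling are all assumptions made for the example.

```python
# Illustrative SER feature extraction sketch; parameter choices are assumptions.
import numpy as np
import librosa

def extract_features(wav_path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=sr)

    # Spectral features: Mel-frequency cepstral coefficients (n_mfcc x frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Prosodic cue: fundamental frequency (pitch) estimated with YIN
    f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"),
                     fmax=librosa.note_to_hz("C7"), sr=sr)

    # Intensity proxy: per-frame root-mean-square energy
    rms = librosa.feature.rms(y=y)[0]

    # Simple utterance-level representation: mean/std pooling over frames
    stats = lambda x: np.hstack([np.mean(x, axis=-1), np.std(x, axis=-1)])
    return np.hstack([stats(mfcc), stats(np.nan_to_num(f0)), stats(rms)])

# Hypothetical usage with a RAVDESS-style file name:
# features = extract_features("Actor_01/03-01-05-01-02-01-01.wav")
# print(features.shape)  # fixed-length vector, e.g. 30 values for n_mfcc=13
```

Such fixed-length vectors are a common baseline input for the machine learning classifiers the review discusses; the deep learning approaches it covers (CNNs, recurrent networks) typically consume the frame-level MFCC or log-Mel matrices directly instead of pooled statistics.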
Pages: 24