ISNet: Individual Standardization Network for Speech Emotion Recognition

Cited by: 23
Authors
Fan, Weiquan [1 ]
Xu, Xiangmin [1 ]
Cai, Bolun [1 ]
Xing, Xiaofen [1 ]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat, Guangzhou 510640, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech recognition; Emotion recognition; Feature extraction; Benchmark testing; Standardization; Speech processing; Task analysis; Individual standardization network (ISNet); speech emotion recognition; individual differences; metric; dataset; CLASSIFICATION; ATTENTION; FEATURES; VOICE;
DOI
10.1109/TASLP.2022.3171965
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206 ; 082403 ;
Abstract
Speech emotion recognition plays an essential role in human-computer interaction. However, learning cross-individual representations and building individual-agnostic systems is challenging because individual differences cause distribution deviations. Most existing approaches use speaker recognition as an auxiliary task to eliminate individual differences. Although these methods can reduce inter-individual voiceprint differences, they struggle to dissociate inter-individual expression differences, since each individual has their own expression habits. In this paper, we propose an individual standardization network (ISNet) for speech emotion recognition that alleviates the inter-individual emotion confusion caused by individual differences. Specifically, we model each individual's benchmark as a representation of their non-emotional neutral speech, and ISNet performs individual standardization against this automatically generated benchmark, which improves the robustness of individual-agnostic emotion representations. To account for individual differences, we also propose more comprehensive and meaningful individual-level evaluation metrics. In addition, continuing our previous work, we construct a challenging large-scale speech emotion dataset (LSSED) and propose a more reasonable training/testing split that prevents individual information leakage. Experimental results on both small- and large-scale datasets demonstrate the effectiveness of ISNet, which achieves new state-of-the-art performance on IEMOCAP and LSSED under the same experimental conditions.
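The core operation described in the abstract (standardizing emotion features against a per-speaker benchmark modeled as non-emotional neutral speech) can be illustrated with a minimal sketch. In the paper the benchmark is generated automatically by the network; the sketch below instead assumes a simple per-speaker running average of neutral-utterance embeddings as a stand-in, and all names (IndividualStandardizer, update_benchmark, the 128-dimensional embedding) are hypothetical, not the authors' implementation.

```python
# Illustrative sketch of individual standardization, not the authors' code.
import torch
import torch.nn as nn


class IndividualStandardizer(nn.Module):
    """Keeps a per-speaker neutral-speech benchmark and subtracts it from
    utterance embeddings to reduce inter-individual differences."""

    def __init__(self, num_speakers: int, embed_dim: int, momentum: float = 0.9):
        super().__init__()
        self.momentum = momentum
        # One benchmark vector per speaker, updated from neutral utterances.
        self.register_buffer("benchmarks", torch.zeros(num_speakers, embed_dim))

    @torch.no_grad()
    def update_benchmark(self, speaker_ids: torch.Tensor, neutral_embeds: torch.Tensor):
        # Exponential moving average of each speaker's neutral-speech embedding
        # (a stand-in for the automatically generated benchmark in the paper).
        for sid, emb in zip(speaker_ids.tolist(), neutral_embeds):
            self.benchmarks[sid] = (
                self.momentum * self.benchmarks[sid] + (1.0 - self.momentum) * emb
            )

    def forward(self, speaker_ids: torch.Tensor, embeds: torch.Tensor) -> torch.Tensor:
        # Individual standardization: remove the speaker-specific benchmark so
        # the downstream classifier sees individual-agnostic emotion cues.
        return embeds - self.benchmarks[speaker_ids]


if __name__ == "__main__":
    std = IndividualStandardizer(num_speakers=10, embed_dim=128)
    ids = torch.tensor([0, 3])
    neutral = torch.randn(2, 128)          # embeddings of neutral utterances
    std.update_benchmark(ids, neutral)     # build the per-speaker benchmarks
    emo = torch.randn(2, 128)              # embeddings of emotional utterances
    print(std(ids, emo).shape)             # -> torch.Size([2, 128])
```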
Pages: 1803-1814
Page count: 12