Speaker-Aware Speech Emotion Recognition by Fusing Amplitude and Phase Information

Cited by: 1
Authors
Guo, Lili [1 ]
Wang, Longbiao [1 ]
Dang, Jianwu [1 ,2 ]
Liu, Zhilei [1 ]
Guan, Haotian [3 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin, Peoples R China
[2] Japan Adv Inst Sci & Technol, Nomi, Ishikawa, Japan
[3] Huiyan Technol Tianjin Co Ltd, Tianjin, Peoples R China
Source
MULTIMEDIA MODELING (MMM 2020), PT I | 2020 / Vol. 11961
Funding
National Natural Science Foundation of China;
关键词
Speech emotion recognition; Amplitude spectrogram; Phase information; Modified group delay; Speaker information; CLASSIFICATION; FEATURES;
DOI
10.1007/978-3-030-37731-1_2
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The use of a convolutional neural network (CNN) to extract deep acoustic features from spectrograms has become one of the most common approaches to speech emotion recognition. In those studies, however, the plain amplitude spectrogram is typically chosen as input, with no special attention to phase-related or speaker-related information. In this paper, we propose a multi-channel method that employs both amplitude and phase channels for speech emotion recognition. Two separate CNN channels are adopted to extract deep features from amplitude spectrograms and modified group delay (MGD) spectrograms, and a concatenation layer is then used to fuse the features. Furthermore, to obtain more robust features, speaker information is incorporated at the emotional feature extraction stage. Finally, the fused features, which account for speaker-related information, are fed into an extreme learning machine (ELM) to distinguish emotions. Experiments conducted on the Emo-DB database show that the proposed model achieves an average F1 of 94.82%, significantly outperforming the baseline CNN-ELM model based on amplitude-only spectrograms with a 39.27% relative error reduction.
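The phase channel of the model builds on modified group delay (MGD) spectrograms. A minimal per-frame sketch of the standard MGD computation is shown below; note that it is an assumption-laden simplification of the general MGD definition, not the paper's exact implementation. In particular, the cepstrally smoothed magnitude spectrum normally used in the denominator is replaced here by the raw magnitude, and the exponents `alpha` and `gamma` are illustrative values, not the ones used in the paper.

```python
import numpy as np

def modified_group_delay(frame, alpha=0.4, gamma=0.9, n_fft=512):
    """Per-frame modified group delay spectrum (simplified sketch).

    Assumptions: raw |X| stands in for the cepstrally smoothed
    spectrum of the full MGD definition; alpha/gamma are illustrative.
    """
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)          # spectrum of x(n)
    Y = np.fft.rfft(n * frame, n_fft)      # spectrum of n * x(n)
    S = np.abs(X) + 1e-10                  # magnitude, floored for stability
    # Group delay via the FFT identity, with gamma-compressed denominator
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma))
    # alpha-compression preserving sign
    return np.sign(tau) * (np.abs(tau) ** alpha)
```

Stacking this function's output over successive windowed frames yields the MGD spectrogram that feeds the phase-channel CNN, analogous to how the log-amplitude spectrogram feeds the amplitude channel.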
Pages: 14-25
Page count: 12