Emotion Recognition of Affective Speech Based on Multiple Classifiers Using Acoustic-Prosodic Information and Semantic Labels

Cited by: 183
Authors
Wu, Chung-Hsien [1 ]
Liang, Wei-Bin [1 ]
Affiliations
[1] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 701, Taiwan
Keywords
Emotion recognition; acoustic-prosodic features; semantic labels; meta decision trees; personality trait
DOI
10.1109/T-AFFC.2010.16
CLC number (Chinese Library Classification)
TP18 [Theory of Artificial Intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This work presents an approach to emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information (AP) and semantic labels (SLs). For AP-based recognition, acoustic and prosodic features, including spectrum-, formant-, and pitch-related features, are extracted from the detected emotionally salient segments of the input speech. Three types of models, GMMs, SVMs, and MLPs, are adopted as base-level classifiers. A Meta Decision Tree (MDT) is then employed for classifier fusion to obtain the AP-based emotion recognition confidence. For SL-based recognition, semantic labels derived from HowNet, an existing Chinese knowledge base, are used to automatically extract Emotion Association Rules (EARs) from the recognized word sequence of the affective speech. A maximum entropy model (MaxEnt) then characterizes the relationship between emotional states and EARs for emotion recognition. Finally, a weighted product fusion method integrates the AP-based and SL-based recognition results for the final emotion decision. For evaluation, 2,033 utterances covering four emotional states (Neutral, Happy, Angry, and Sad) were collected. Speaker-independent experiments show that MDT-based fusion achieves 80.00 percent accuracy, outperforming each individual classifier, while SL-based recognition attains an average accuracy of 80.92 percent. Combining acoustic-prosodic information and semantic labels raises accuracy to 83.55 percent, superior to either the AP-based or the SL-based approach alone. Moreover, when the individual personality trait is considered for personalized application, the recognition accuracy of the proposed approach further improves to 85.79 percent.
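The abstract's final fusion step combines the AP-based (MDT) and SL-based (MaxEnt) confidences by a weighted product before taking the final decision. The following Python sketch illustrates that step only; the function name fuse_weighted_product, the weight parameter alpha, and the toy confidence values are illustrative assumptions, not details taken from the paper.

    # Minimal sketch of weighted product fusion over per-emotion confidences.
    # `alpha`, the function name, and the toy scores are assumptions for
    # illustration; they do not come from the paper.

    EMOTIONS = ["Neutral", "Happy", "Angry", "Sad"]

    def fuse_weighted_product(ap_conf, sl_conf, alpha=0.5):
        """Fuse AP-based and SL-based confidences per emotion.

        fused(e) = ap_conf[e]**alpha * sl_conf[e]**(1 - alpha)
        """
        fused = {e: (ap_conf[e] ** alpha) * (sl_conf[e] ** (1.0 - alpha))
                 for e in EMOTIONS}
        # Normalize so the fused scores form a distribution over emotions.
        total = sum(fused.values())
        return {e: s / total for e, s in fused.items()}

    # Toy usage with made-up confidence scores:
    ap = {"Neutral": 0.10, "Happy": 0.55, "Angry": 0.25, "Sad": 0.10}
    sl = {"Neutral": 0.05, "Happy": 0.70, "Angry": 0.15, "Sad": 0.10}
    fused = fuse_weighted_product(ap, sl, alpha=0.6)
    print(max(fused, key=fused.get))  # -> "Happy"

A product rule penalizes any emotion that either information stream rates as unlikely, which suits two complementary evidence sources; the weight alpha trades off trust between the acoustic-prosodic and semantic streams and would typically be tuned on held-out data.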
Pages: 10-21
Number of pages: 12