Significance of Phonological Features in Speech Emotion Recognition

Cited by: 8
Authors
Wang, Wei [1 ]
Watters, Paul A. [2 ]
Cao, Xinyi [1 ]
Shen, Lingjie [1 ]
Li, Bo [3 ]
Affiliations
[1] Nanjing Normal Univ, Sch Educ Sci, Nanjing 210097, JS, Peoples R China
[2] La Trobe Univ, Dept Comp Sci & Informat Technol, Melbourne, Vic 3350, Australia
[3] Univ Southern Mississippi, Sch Comp Sci & Comp Engn, 730 East Beach Blvd, Long Beach, MS 39560 USA
Funding
National Natural Science Foundation of China;
Keywords
Speech emotion recognition; Phonological features; Feature analysis; Acoustic features;
DOI
10.1007/s10772-020-09734-7
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Classification Codes
0808 ; 0809 ;
Abstract
A novel Speech Emotion Recognition (SER) method based on phonological features is proposed in this paper. Intuitively, phonological features, as expert knowledge derived from linguistics, should correlate with emotion; in practice, however, they are seldom used to improve SER. Motivated by this gap, we aim to use phonological features to further improve SER accuracy, since they provide information complementary to acoustic features, and we also explore the relationship between phonological features and emotions. First, rather than relying on acoustic features alone, we devise a new SER approach that fuses phonological representations with acoustic features. A significant improvement in SER performance is demonstrated on the publicly available Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. Second, the experimental results show that the top-performing method for categorical emotion recognition is a deep learning-based classifier, which achieves an unweighted average recall (UAR) of 60.02%. Finally, we investigate the most discriminative features and identify patterns of emotional rhyme based on the phonological representations.
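The abstract reports results as unweighted average recall (UAR), the mean of per-class recalls, which weights each emotion class equally regardless of how many utterances it contains. A minimal sketch of that metric (the labels below are illustrative toy data, not from the paper):

```python
# Sketch of unweighted average recall (UAR), the metric reported in the
# abstract (60.02% for the best deep classifier on IEMOCAP).
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls: each class contributes equally,
    regardless of how many samples it has."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy 4-class example; IEMOCAP's categorical setup commonly uses
# angry/happy/neutral/sad (class names here are hypothetical placeholders).
y_true = ["ang", "ang", "hap", "neu", "neu", "neu", "sad", "sad"]
y_pred = ["ang", "hap", "hap", "neu", "neu", "sad", "sad", "sad"]
print(round(unweighted_average_recall(y_true, y_pred), 3))  # -> 0.792
```

UAR is preferred over plain accuracy on IEMOCAP-style data because the emotion classes are imbalanced: a classifier that ignores rare classes is penalized under UAR but not under overall accuracy.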
Pages: 633-642
Page count: 10