Towards Discriminative Representation Learning for Speech Emotion Recognition

Cited: 0
Authors
Li, Runnan [1 ,2 ]
Wu, Zhiyong [1 ,2 ]
Jia, Jia [1 ,2 ]
Bu, Yaohua [2 ]
Zhao, Sheng [3 ]
Meng, Helen [4 ]
Affiliations
[1] Tsinghua Univ, Grad Sch Shenzhen, Shenzhen, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[3] Microsoft, Search Technol Ctr Asia STCA, Beijing, Peoples R China
[4] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Peoples R China
Source
PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE | 2019
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
In intelligent speech interaction, automatic speech emotion recognition (SER) plays an important role in understanding user intention. Because emotional speech varies in speaker characteristics while sharing similar acoustic attributes, a vital challenge in SER is learning robust and discriminative representations for emotion inference. In this paper, inspired by human emotion perception, we propose a novel representation learning component (RLC) for SER systems, constructed from Multi-head Self-attention and a Global Context-aware Attention Long Short-Term Memory recurrent neural network (GCA-LSTM). By modeling element-wise correlative dependencies with the Multi-head Self-attention mechanism, the RLC can exploit the common patterns of emotional speech features and emphasize emotion-salient information during representation learning. By employing GCA-LSTM, the RLC can selectively attend to emotion-salient factors in light of the entire utterance context and gradually produce discriminative representations for emotion inference. Experiments on the public emotional benchmark database IEMOCAP and a large-scale realistic interaction database demonstrate the superiority of the proposed SER framework, which achieves 6.6% to 26.7% relative improvement in unweighted accuracy over state-of-the-art techniques.
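The two ingredients the abstract names can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the multi-head self-attention below is the standard scaled dot-product formulation, and the GCA-LSTM is simplified here to a global-context gate without the recurrent cell; all shapes, dimensions, and function names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, n_heads, rng):
    """Standard scaled dot-product self-attention over frame features.
    x: (T, d) sequence of frame-level acoustic features."""
    T, d = x.shape
    d_k = d // n_heads
    heads = []
    for _ in range(n_heads):
        # Random projections stand in for learned weights (assumption).
        Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
        Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
        Wv = rng.standard_normal((d, d_k)) / np.sqrt(d)
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T) frame-to-frame weights
        heads.append(A @ V)                  # (T, d_k) per-head output
    return np.concatenate(heads, axis=-1)    # (T, d) attended features

def global_context_gate(h):
    """Simplified stand-in for GCA-LSTM: weight each frame by its
    relevance to an utterance-level global context, then pool."""
    g = h.mean(axis=0, keepdims=True)        # (1, d) global context vector
    gate = 1.0 / (1.0 + np.exp(-(h * g)))    # sigmoid relevance per element
    return (gate * h).sum(axis=0)            # (d,) utterance representation

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 64))           # 100 frames, 64-dim features
attended = multi_head_self_attention(x, n_heads=4, rng=rng)
utt_repr = global_context_gate(attended)
print(attended.shape, utt_repr.shape)        # (100, 64) (64,)
```

The intent mirrors the abstract's description: self-attention relates every frame to every other frame, and the global-context step focuses the pooled representation on emotion-salient frames before classification.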
Pages: 5060-5066
Page count: 7