Inferring Emotions From Large-Scale Internet Voice Data

Cited by: 13
Authors
Jia, Jia [1 ]
Zhou, Suping [1 ]
Yin, Yufeng [1 ]
Wu, Boya [1 ]
Chen, Wei [2 ]
Meng, Fanbo [2 ]
Wang, Yanfeng [2 ]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing Natl Res Ctr Informat Sci & Technol, Key Lab Pervas Comp,Minist Educ, Beijing 100084, Peoples R China
[2] Sogou Corp, Beijing 100084, Peoples R China
Keywords
Emotion; Internet voice data; deep sparse neural network; long short-term memory; recognition; features; mood
DOI
10.1109/TMM.2018.2887016
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology]
Discipline Code
0812
Abstract
As voice dialog applications (VDAs, e.g., Siri, Cortana, and Google Now) grow in popularity, inferring emotions from the large-scale internet voice data they generate can help them give more reasonable and humane responses. However, the tremendous number of users behind large-scale internet voice data leads to great diversity in accents and expression patterns. Traditional speech emotion recognition methods, which mainly target acted corpora, therefore cannot effectively handle this massive and diverse internet voice data. To address this issue, we carry out a series of observations, identify emotion categories suitable for large-scale internet voice data, and verify that social attributes (query time, query topic, and user location) serve as indicators for emotion inference. Based on these observations, we employ two strategies. First, we propose a deep sparse neural network that takes acoustic information, textual information, and three indicators (a temporal indicator, a descriptive indicator, and a geo-social indicator) as input. Second, to capture contextual information, we propose a hybrid emotion inference model that uses long short-term memory (LSTM) to capture acoustic features and latent Dirichlet allocation (LDA) to extract textual features. Experiments on 93,000 utterances collected from the Sogou Voice Assistant (a Chinese counterpart of Siri) validate the effectiveness of the proposed methodologies. Furthermore, we compare the two methodologies and discuss their respective advantages and disadvantages.
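The second strategy lends itself to a concrete sketch. Below is a minimal, illustrative PyTorch implementation (not the authors' code) of the hybrid idea described in the abstract: an LSTM encodes the per-utterance acoustic frame sequence, LDA topic proportions summarize the query transcript, and the two representations are fused for emotion classification. All names and dimensions here (HybridEmotionModel, acoustic_dim=40, num_topics=50, six emotion classes) are assumptions for illustration; the paper's actual architecture and hyperparameters may differ.

```python
# A minimal sketch of an LSTM + LDA hybrid emotion classifier.
# Assumptions: 40-dim acoustic frame features (e.g., MFCCs), 50 LDA
# topics, and 6 emotion classes -- all illustrative, not from the paper.
import torch
import torch.nn as nn

class HybridEmotionModel(nn.Module):
    def __init__(self, acoustic_dim=40, lstm_hidden=128,
                 num_topics=50, num_emotions=6):
        super().__init__()
        # LSTM over the sequence of per-frame acoustic features.
        self.lstm = nn.LSTM(acoustic_dim, lstm_hidden, batch_first=True)
        # Fuse the final LSTM state with the LDA topic vector.
        self.classifier = nn.Sequential(
            nn.Linear(lstm_hidden + num_topics, 64),
            nn.ReLU(),
            nn.Linear(64, num_emotions),
        )

    def forward(self, acoustic_frames, topic_vector):
        # acoustic_frames: (batch, time, acoustic_dim)
        # topic_vector:    (batch, num_topics), e.g., from a fitted
        #                  sklearn LatentDirichletAllocation model
        _, (h_n, _) = self.lstm(acoustic_frames)
        fused = torch.cat([h_n[-1], topic_vector], dim=1)
        return self.classifier(fused)

# Example usage with random stand-in data.
model = HybridEmotionModel()
frames = torch.randn(8, 200, 40)                 # 8 utterances, 200 frames
topics = torch.softmax(torch.randn(8, 50), dim=1)  # topic proportions
logits = model(frames, topics)                   # (8, 6) emotion scores
```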
Pages: 1853-1866
Number of pages: 14