Articulation constrained learning with application to speech emotion recognition

Cited by: 4
Authors
Shah, Mohit [1 ]
Tu, Ming [2 ]
Berisha, Visar [1 ,2 ]
Chakrabarti, Chaitali [1 ]
Spanias, Andreas [1 ]
Affiliations
[1] Arizona State University, School of Electrical, Computer and Energy Engineering, Tempe, AZ 85281, USA
[2] Arizona State University, Department of Speech and Hearing Science, Tempe, AZ 85281, USA
Keywords
Emotion recognition; Articulation; Constrained optimization; Cross-corpus; Classification; Features
DOI
10.1186/s13636-019-0157-9
CLC Number
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Speech emotion recognition methods that combine articulatory information with acoustic features have previously been shown to improve recognition performance. However, collecting articulatory data at scale is often infeasible, which restricts the applicability of such methods. In this paper, a discriminative learning method for emotion recognition that uses both articulatory and acoustic information is proposed. A traditional ℓ1-regularized logistic regression cost function is extended with additional constraints that require the model to reconstruct articulatory data, yielding sparse, interpretable representations jointly optimized for both tasks. Furthermore, the model requires articulatory features only during training; inference on out-of-sample data uses acoustic features alone. Experiments evaluate emotion recognition performance over the vowels /AA/, /AE/, /IY/, and /UW/ and over complete utterances. Incorporating articulatory information significantly improves performance for valence-based classification. Results for within-corpus and cross-corpus categorical emotion recognition indicate that the proposed method is more effective at distinguishing happiness from other emotions.
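The record does not give the paper's exact constraint formulation, but as a rough sketch of the idea in the abstract, the Python snippet below couples a sparse logistic classifier with a linear articulatory reconstruction term through a shared row-sparse weight matrix, so articulatory data shapes the model during training but is not needed at test time. The group-lasso coupling, the trade-off weights lam and gamma, the linear reconstruction map, and the proximal-gradient optimizer are all illustrative assumptions, not the authors' method.

```python
# Minimal sketch (NOT the paper's exact formulation): a sparse logistic
# classifier and a linear articulatory reconstruction that share one
# row-sparse weight matrix, so both tasks select the same acoustic features.
# lam, gamma, the linear map, and the optimizer are assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_articulation_constrained(X, y, A, lam=0.05, gamma=1.0,
                                 lr=0.1, epochs=1000, seed=0):
    """Minimize, by proximal gradient descent,
        (1/n) * logistic_loss(y, X @ w)
      + (gamma / (2n)) * ||X @ V - A||_F^2
      + lam * sum_j ||(w_j, V_j)||_2        (row-wise group lasso)
    X: (n, d) acoustic features; y: (n,) labels in {0, 1};
    A: (n, k) articulatory features, used during training only."""
    n, d = X.shape
    k = A.shape[1]
    W = np.random.default_rng(seed).normal(scale=0.01, size=(d, 1 + k))

    for _ in range(epochs):
        w, V = W[:, 0], W[:, 1:]
        grad = np.empty_like(W)
        grad[:, 0] = X.T @ (sigmoid(X @ w) - y) / n       # logistic loss
        grad[:, 1:] = gamma * (X.T @ (X @ V - A)) / n     # reconstruction
        W = W - lr * grad
        # Proximal step: group soft-thresholding of each feature's row,
        # which zeroes out features that help neither task (sparsity).
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        W *= np.maximum(0.0, 1.0 - lr * lam / np.maximum(norms, 1e-12))
    return W[:, 0], W[:, 1:]                              # w, V

def predict(X, w):
    # Inference uses acoustic features alone; articulatory data is unneeded.
    return (sigmoid(X @ w) >= 0.5).astype(int)
```

At test time only the classifier weights w are used, matching the abstract's claim that articulatory features are required only during training; the shared group penalty is one plausible way to realize a sparse representation "jointly optimized for both tasks."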
Pages: 17