Interpretability of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings

Cited by: 3
Authors
Girish, K. V. Vijay [1 ]
Konjeti, Srikanth [1 ]
Vepa, Jithendra [1 ]
Affiliation
[1] Observe AI, San Francisco, CA 94105 USA
Source
INTERSPEECH 2022 | 2022
Keywords
interpretability; emotion recognition; human-computer interaction; computational paralinguistics;
DOI
10.21437/Interspeech.2022-10685
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206 ; 082403 ;
Abstract
Speech emotion recognition (SER) is useful in many applications and has been approached with signal processing techniques in the past and with deep learning techniques more recently. Human emotions are complex in nature and can vary widely within an utterance. SER accuracy has improved through various multimodal techniques, but there is still a gap in understanding model behaviour and in expressing these complex emotions in a human-interpretable form. In this work, we propose and define interpretability measures, represented as a Human Level Indicator Matrix for an utterance, and showcase their effectiveness in both qualitative and quantitative terms. Word-level interpretability is presented using attention-based sequence modelling of self-supervised speech and text pre-trained embeddings. Prosody features are also combined with the proposed model to assess its efficacy at the word and utterance levels. We provide insights into sub-utterance-level emotion predictions for complex utterances where the emotion class changes within the utterance. We evaluate the model and provide the interpretations on the publicly available IEMOCAP dataset.
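The record contains no code, so the following is a minimal PyTorch sketch of the fusion-plus-attention idea the abstract describes: word-aligned self-supervised speech embeddings, text embeddings, and prosody features are fused, and a self-attention layer over the word sequence yields both per-word and utterance-level emotion logits, with the attention weights readable as a word-level interpretability signal. The class name, the 768-dimensional embeddings, the 4-dimensional prosody vector, the 4-class label set, and the mean-pooling scheme are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn


class WordLevelSER(nn.Module):
    """Hypothetical word-level SER model sketching the abstract's architecture."""

    def __init__(self, speech_dim=768, text_dim=768, prosody_dim=4,
                 hidden_dim=256, num_classes=4):
        super().__init__()
        # Project each word's concatenated multimodal features into a shared space.
        self.fuse = nn.Linear(speech_dim + text_dim + prosody_dim, hidden_dim)
        # Self-attention over the word sequence; its weights serve as
        # word-level interpretability scores.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=1,
                                          batch_first=True)
        self.word_head = nn.Linear(hidden_dim, num_classes)  # per-word emotion
        self.utt_head = nn.Linear(hidden_dim, num_classes)   # utterance emotion

    def forward(self, speech_emb, text_emb, prosody):
        # Inputs are word-aligned features of shape (batch, num_words, dim).
        x = self.fuse(torch.cat([speech_emb, text_emb, prosody], dim=-1))
        ctx, attn_weights = self.attn(x, x, x)        # weights: (B, W, W)
        word_logits = self.word_head(ctx)             # word-level predictions
        utt_logits = self.utt_head(ctx.mean(dim=1))   # mean-pool for the utterance
        return utt_logits, word_logits, attn_weights


# Toy usage with random stand-ins for pre-extracted speech/text embeddings.
model = WordLevelSER()
B, W = 2, 12  # batch of 2 utterances, 12 words each
utt, word, attn = model(torch.randn(B, W, 768),
                        torch.randn(B, W, 768),
                        torch.randn(B, W, 4))
print(utt.shape, word.shape, attn.shape)  # (2, 4) (2, 12, 4) (2, 12, 12)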
Pages: 4496-4500
Page count: 5