Temporal Relation Inference Network for Multimodal Speech Emotion Recognition

Cited by: 13
Authors
Dong, Guan-Nan [1 ]
Pun, Chi-Man [1 ]
Zhang, Zheng [1 ,2 ]
Affiliations
[1] Univ Macau, Dept Comp & Informat Sci, Macau, Peoples R China
[2] Harbin Inst Technol, Sch Comp Sci & Technol, Shenzhen 150001, Peoples R China
Keywords
Feature extraction; Emotion recognition; Speech recognition; Cognition; Hidden Markov models; Correlation; Task analysis; Speech emotion recognition; multi-modal learning; temporal learning; relation inference network; SENTIMENT ANALYSIS; MODEL; FEATURES
DOI
10.1109/TCSVT.2022.3163445
CLC Number
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Code
0808; 0809
Abstract
Speech emotion recognition (SER) is non-trivial even for humans, and automatic SER remains challenging due to linguistic complexity and contextual distortion. Notably, previous automatic SER systems have treated the multi-modal information and the temporal relations of speech as two independent tasks, ignoring their association. We argue that valid semantic features and temporal relations of speech both encode meaningful event relationships. This paper proposes a novel temporal relation inference network (TRIN) for multi-modal SER that fully considers the underlying hierarchy of phonetic structure and its associations across modalities under sequential temporal guidance. Specifically, we design a temporal reasoning calibration module to imitate rich, realistic contextual conditions. Unlike previous works, which assume that all modalities are related, it infers the dependency relationships among semantic information at the temporal level and learns to handle the multi-modal interaction sequence in a flexible order. To enhance the feature representation, an innovative temporal attentive fusion unit is developed to magnify the details embedded in each single modality at the semantic level. Meanwhile, an adaptive feature fusion mechanism aggregates representations from both the temporal and semantic levels, selectively collecting implicit complementary information to strengthen the dependencies between different information subspaces and preserve the integrity of the fused representation. Extensive experiments on two benchmark datasets demonstrate the superiority of TRIN over state-of-the-art SER methods.
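The abstract describes its two fusion mechanisms in prose only. As a rough illustration, the sketch below gives one plausible reading of a temporal attentive fusion unit paired with an adaptive feature fusion mechanism: temporal attention summarizes each modality's frame sequence, and a learned sigmoid gate selectively mixes the two summaries. The class name, dimensions, and gating design are assumptions for illustration, not the authors' TRIN implementation (PyTorch).

    # Illustrative sketch only: temporal attention per modality plus a
    # gated adaptive fusion of the resulting summaries. Module name,
    # dimensions, and gating design are assumptions, not TRIN's code.
    import torch
    import torch.nn as nn

    class TemporalAttentiveFusion(nn.Module):
        def __init__(self, dim: int = 256):
            super().__init__()
            self.score = nn.Linear(dim, 1)       # scores each time step
            self.gate = nn.Linear(2 * dim, dim)  # per-feature mixing gate

        def attend(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, dim) -> (batch, dim) attention-weighted summary
            w = torch.softmax(self.score(x), dim=1)
            return (w * x).sum(dim=1)

        def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
            a, t = self.attend(audio), self.attend(text)
            # The sigmoid gate decides, per feature dimension, how much of
            # each modality's summary to keep, rather than assuming the two
            # modalities always contribute equally.
            g = torch.sigmoid(self.gate(torch.cat([a, t], dim=-1)))
            return g * a + (1.0 - g) * t

    fusion = TemporalAttentiveFusion(dim=256)
    audio_seq = torch.randn(4, 120, 256)   # e.g., 120 acoustic frames
    text_seq = torch.randn(4, 30, 256)     # e.g., 30 word embeddings
    fused = fusion(audio_seq, text_seq)    # (4, 256) fused emotion embedding

The gate is what makes such a fusion "adaptive" in the abstract's sense: instead of concatenating modalities or assuming they are always related, it learns where one modality's summary should dominate the other.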
Pages: 6472-6485
Page count: 14