Multi-Fusion Residual Memory Network for Multimodal Human Sentiment Comprehension

Cited by: 40
Authors
Mai, Sijie [1]
Hu, Haifeng [1]
Xu, Jia [1]
Xing, Songlong [1]
Affiliations
[1] Sun Yat-sen University, School of Electronics and Information Technology, Guangzhou 510275, Guangdong, People's Republic of China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
Keywords
Sentiment analysis; emotion intensity attention; time-step level fusion; residual memory network; REPRESENTATIONS; SPEECH
DOI
10.1109/TAFFC.2020.3000510
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Multimodal human sentiment comprehension refers to recognizing human affect from multiple modalities. Two key issues arise for this problem. First, it is difficult to explore time-dependent interactions between modalities and to focus on the important time steps. Second, processing the long fused sequence of an utterance is susceptible to the forgetting problem caused by long-term temporal dependencies. In this article, we introduce a hierarchical learning architecture to classify utterance-level sentiment. To address the first issue, we perform time-step level fusion to generate fused features for each time step, which explicitly models time-restricted interactions by incorporating information across modalities at the same time step. Furthermore, based on the assumption that acoustic features directly reflect emotional intensity, we introduce emotion intensity attention to focus on the time steps where emotion changes or intense affect occurs. To handle the second issue, we propose the Residual Memory Network (RMN) to process the fused sequence. RMN passes the previous state directly into the next time step, among other techniques, which helps retain information from many time steps ago. We show that our method achieves state-of-the-art performance on multiple datasets, and the results also suggest that RMN yields competitive performance on general sequence-modeling tasks.
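The abstract describes three mechanisms: fusing the modalities at each time step, weighting time steps by an acoustic intensity score, and a recurrent step with a residual connection from the previous state. The following is a minimal PyTorch-style sketch of those ideas only; the class name, feature dimensions, and the GRU cell used as the recurrent core are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class TimeStepFusionRMN(nn.Module):
    # Illustrative sketch only: per-time-step fusion, an acoustic-driven
    # attention weight, and a recurrent step with a residual connection
    # from the previous hidden state. Names and dimensions are assumptions.
    def __init__(self, d_text, d_audio, d_visual, d_hidden):
        super().__init__()
        self.fuse = nn.Linear(d_text + d_audio + d_visual, d_hidden)  # time-step level fusion
        self.intensity = nn.Linear(d_audio, 1)                        # intensity score from acoustics
        self.cell = nn.GRUCell(d_hidden, d_hidden)                    # stand-in recurrent core
        self.classify = nn.Linear(d_hidden, 1)                        # utterance-level sentiment score

    def forward(self, text, audio, visual):
        # Each input: (batch, steps, d_modality), aligned per time step.
        fused = torch.tanh(self.fuse(torch.cat([text, audio, visual], dim=-1)))
        # Attention over time steps, driven only by the acoustic stream.
        alpha = torch.softmax(self.intensity(audio).squeeze(-1), dim=1)   # (batch, steps)
        fused = fused * alpha.unsqueeze(-1)
        h = fused.new_zeros(fused.size(0), fused.size(-1))
        for t in range(fused.size(1)):
            # Residual memory step: the previous state is added back directly,
            # so information from many steps ago is easier to retain.
            h = self.cell(fused[:, t], h) + h
        return self.classify(h)

# Hypothetical usage with made-up feature sizes (batch=4, 20 time steps).
model = TimeStepFusionRMN(d_text=300, d_audio=74, d_visual=35, d_hidden=128)
score = model(torch.randn(4, 20, 300), torch.randn(4, 20, 74), torch.randn(4, 20, 35))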
Pages: 320-334
Page count: 15