Target and source modality co-reinforcement for emotion understanding from asynchronous multimodal sequences

Citations: 28
Authors
Yang, Dingkang [1]
Liu, Yang [1]
Huang, Can [2]
Li, Mingcheng [1]
Zhao, Xiao [1]
Wang, Yuzheng [1]
Yang, Kun [1]
Wang, Yan [1]
Zhai, Peng [1]
Zhang, Lihua [1,3,4]
Affiliations
[1] Fudan Univ, Acad Engn & Technol, Shanghai, Peoples R China
[2] Fudan Univ, Sch Journalism, Shanghai, Peoples R China
[3] Minist Educ, Engn Res Ctr AI & Robot, Shanghai, Peoples R China
[4] Jilin Prov Key Lab Intelligence Sci & Engn, Changchun, Peoples R China
Funding
China Postdoctoral Science Foundation; National Key Research and Development Program of China
Keywords
Emotion understanding; Knowledge exchange; Multimodal fusion; Crossmodal interaction; Modality co-reinforcement; RECOGNITION;
DOI
10.1016/j.knosys.2023.110370
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Perceiving human emotions from a multimodal perspective has received significant attention in knowledge engineering communities. Because sequences from different modalities are received at variable frequencies, multimodal streams pose an inherent asynchrony challenge. Most previous methods performed manual sequence alignment before multimodal fusion, which ignored long-range dependencies among modalities and failed to learn reliable crossmodal element correlations. Inspired by the human perception paradigm, we propose a target and source Modality Co-Reinforcement (MCR) approach to achieve sufficient crossmodal interaction and fusion at different granularities. Specifically, MCR introduces two types of target modality reinforcement units to jointly reinforce the multimodal representations. These target units effectively enhance emotion-related knowledge exchange in fine-grained interactions and capture the emotionally expressive crossmodal elements in mixed-grained interactions. Moreover, a source modality update module is presented to provide meaningful features for the crossmodal fusion of target modalities. Eventually, the multimodal representations are progressively reinforced and improved via the above components. Comprehensive experiments are conducted on three multimodal emotion understanding benchmarks. Quantitative results show that MCR significantly outperforms previous state-of-the-art methods in both word-aligned and unaligned settings. Additionally, qualitative analysis and visualization fully demonstrate the superiority of the proposed modules. © 2023 Elsevier B.V. All rights reserved.
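The abstract gives no implementation details, but the reinforcement units it describes belong to the family of crossmodal attention, in which a target modality queries an unaligned source modality so that no manual sequence alignment is needed. The following minimal PyTorch sketch illustrates that general idea under that assumption; the class name CrossmodalReinforcementUnit, the dimensions, and the residual design are illustrative choices, not the authors' exact MCR modules.

import torch
import torch.nn as nn

class CrossmodalReinforcementUnit(nn.Module):
    # Hypothetical sketch: the target sequence (e.g., language) attends to a
    # source sequence (e.g., audio) of a different length, so the two streams
    # need no word-level alignment beforehand.
    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, T_t, dim); source: (batch, T_s, dim); T_t may differ from T_s.
        reinforced, _ = self.attn(query=target, key=source, value=source)
        # The residual connection keeps the target's own content while mixing
        # in cues gathered from the source modality.
        return self.norm(target + reinforced)

# Usage: a 50-step language stream is reinforced by a 375-step audio stream.
unit = CrossmodalReinforcementUnit()
language = torch.randn(2, 50, 64)   # (batch, T_t, dim)
audio = torch.randn(2, 375, 64)     # (batch, T_s, dim)
print(unit(language, audio).shape)  # torch.Size([2, 50, 64])

Stacking such units per modality pair, with the source features updated between layers, would approximate the progressive reinforcement the abstract describes; the paper itself should be consulted for the actual unit and module definitions.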
Pages: 11