Dynamically Adjust Word Representations Using Unaligned Multimodal Information

Cited by: 33
Authors
Guo, Jiwei [1 ]
Tang, Jiajia [1 ]
Dai, Weichen [1 ]
Ding, Yu [2 ]
Kong, Wanzeng [1 ]
Affiliations
[1] Hangzhou Dianzi University, Hangzhou, People's Republic of China
[2] Netease Fuxi AI Lab, Hangzhou, People's Republic of China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
multimodal sentiment analysis; multimodal fusion; multimodal representations;
DOI
10.1145/3503161.3548137
Chinese Library Classification
TP39 [Applications of Computers];
Discipline Classification Codes
081203; 0835;
Abstract
Multimodal sentiment analysis is a promising research area for modeling multiple heterogeneous modalities. Two major challenges exist in this area: a) multimodal data is unaligned in nature due to the different sampling rates of each modality, and b) long-range dependencies exist between elements across modalities. These challenges make efficient multimodal fusion difficult. In this work, we propose a novel end-to-end network named the Cross Hyper-modality Fusion Network (CHFN). CHFN is an interpretable Transformer-based neural model that provides an efficient framework for fusing unaligned multimodal sequences. The heart of our model is the dynamic adjustment of word representations in different non-verbal contexts using unaligned multimodal sequences. It captures the influence of non-verbal behavioral information at the scale of the entire utterance and then integrates this influence into the verbal expression. We conducted experiments on the publicly available multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experimental results demonstrate that our model surpasses state-of-the-art models. In addition, we visualize the learned interactions between the language modality and non-verbal behavioral information and explore the underlying dynamics of multimodal language data.
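The abstract describes a Transformer-based mechanism in which word representations are adjusted by attending over unaligned non-verbal sequences. The paper's actual CHFN architecture is not reproduced below; what follows is only a minimal PyTorch sketch of that core idea, with all module names, dimensions, and feature sizes chosen for illustration (e.g., 300-d word vectors, 74-d acoustic frames are assumptions, not values from the paper).

# Minimal sketch (illustrative, not the authors' code): cross-modal attention
# that adjusts each word representation using an unaligned non-verbal sequence.
import torch
import torch.nn as nn

class CrossModalAdjust(nn.Module):
    """Words act as queries; audio/visual frames act as keys/values, so the
    two sequences may have different lengths and need no forced alignment."""

    def __init__(self, d_text, d_nonverbal, d_model=128, n_heads=4):
        super().__init__()
        self.proj_text = nn.Linear(d_text, d_model)            # project word vectors
        self.proj_nonverbal = nn.Linear(d_nonverbal, d_model)  # project non-verbal frames
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, words, nonverbal):
        # words:     (batch, n_words, d_text)
        # nonverbal: (batch, n_frames, d_nonverbal); n_frames != n_words is fine
        q = self.proj_text(words)
        kv = self.proj_nonverbal(nonverbal)
        context, _ = self.attn(q, kv, kv)  # per-word non-verbal context
        return self.norm(q + context)      # residual: word meaning is "adjusted"

# Usage with deliberately unaligned lengths: 20 words vs. 187 acoustic frames.
words = torch.randn(2, 20, 300)   # hypothetical 300-d word embeddings
audio = torch.randn(2, 187, 74)   # hypothetical 74-d acoustic features
adjusted = CrossModalAdjust(d_text=300, d_nonverbal=74)(words, audio)
print(adjusted.shape)             # torch.Size([2, 20, 128])

The residual connection keeps the original verbal meaning intact while folding in utterance-level non-verbal context, which mirrors the abstract's description of integrating non-verbal influence into the verbal expression.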
Pages: 3394-3402
Number of pages: 9