MAVEN: A Memory Augmented Recurrent Approach for Multimodal Fusion

Cited by: 4
Authors
Islam, Md Mofijul [1 ]
Yasar, Mohammad Samin [1 ]
Iqbal, Tariq [1 ]
Affiliations
[1] Univ Virginia, Sch Engn & Appl Sci, Charlottesville, VA 22903 USA
Keywords
Feature extraction; Visualization; Task analysis; Fuses; Sensors; Data mining; Noise measurement; Deep learning; human activity recognition; multimodal learning; REPRESENTATION
DOI
10.1109/TMM.2022.3164261
CLC classification
TP [Automation and Computer Technology]
Discipline code
0812
Abstract
Multisensory systems provide complementary information that helps many machine learning approaches perceive the environment comprehensively. These systems consist of heterogeneous modalities with disparate characteristics and feature distributions, so extracting, aligning, and fusing complementary representations from heterogeneous modalities (e.g., visual, skeleton, and physical sensors) remains challenging. To address these challenges, we draw on insights from several neuroscience studies of animal multisensory systems to develop MAVEN, a memory-augmented recurrent approach for multimodal fusion. MAVEN generates unimodal memory banks comprising spatial-temporal features and uses our proposed recurrent representation alignment approach to iteratively align and refine unimodal representations. MAVEN then applies a multimodal variational attention-based fusion approach to produce a robust multimodal representation from the aligned unimodal features. Our extensive experimental evaluations on three multimodal datasets suggest that MAVEN outperforms state-of-the-art multimodal learning approaches on the challenging human activity recognition task across all evaluation conditions (cross-subject, leave-one-subject-out, and cross-session). Additionally, our extensive ablation studies suggest that MAVEN significantly outperforms feed-forward fusion-based learning models (p < 0.05). Finally, the robust performance of MAVEN in extracting complementary multimodal representations from occluded and noisy data suggests its applicability to real-world datasets.
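The abstract names three components: per-modality memory banks of spatial-temporal features, a recurrent representation alignment step, and variational attention-based fusion. The following is a minimal, hypothetical PyTorch sketch of how such a pipeline could be wired together; the class names, the temporal mean-pooled memory read-out, the mean-of-states alignment context, and all dimensions are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class VariationalAttentionFusion(nn.Module):
    # Attention logits are sampled from a learned Gaussian per modality
    # (reparameterization trick), then softmax-normalized across modalities.
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Linear(dim, 1)       # mean of each modality's logit
        self.logvar = nn.Linear(dim, 1)   # log-variance of each logit

    def forward(self, feats):
        # feats: (batch, modalities, dim) aligned unimodal features
        mu, logvar = self.mu(feats), self.logvar(feats)
        logits = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        attn = torch.softmax(logits, dim=1)
        return (attn * feats).sum(dim=1)  # (batch, dim) fused representation

class MavenStyleFusion(nn.Module):
    def __init__(self, dim=128, num_modalities=3, align_steps=3):
        super().__init__()
        self.align_steps = align_steps
        # One GRU cell per modality: each step refines a unimodal state
        # conditioned on a shared cross-modal context (assumed here to be
        # the mean of all modality states).
        self.align = nn.ModuleList(
            nn.GRUCell(dim, dim) for _ in range(num_modalities)
        )
        self.fuse = VariationalAttentionFusion(dim)

    def forward(self, memory_banks):
        # memory_banks[i]: (batch, time, dim) spatial-temporal features of
        # modality i; temporal mean-pooling stands in for the paper's
        # memory read-out, which is not specified in this record.
        states = [bank.mean(dim=1) for bank in memory_banks]
        for _ in range(self.align_steps):  # recurrent representation alignment
            context = torch.stack(states, dim=1).mean(dim=1)
            states = [cell(context, h) for cell, h in zip(self.align, states)]
        return self.fuse(torch.stack(states, dim=1))

# Three modalities (e.g., visual, skeleton, wearable sensor), batch of 4:
model = MavenStyleFusion()
fused = model([torch.randn(4, 10, 128) for _ in range(3)])  # -> (4, 128)

Sampling the attention logits rather than computing them deterministically is what makes the fusion "variational" in this sketch: the reparameterization keeps the sampling step differentiable, so the fusion weights can be trained end to end while remaining stochastic.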
Pages: 3694-3708
Number of pages: 15