Can audio-visual integration strengthen robustness under multimodal attacks?

Cited by: 14

Authors:

Tian, Yapeng [1]

Xu, Chenliang [1]

Affiliations:

[1] Univ Rochester, Rochester, NY 14627 USA

Source:

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021

Funding:

U.S. National Science Foundation;

Keywords:

SPARSE; SOUND; AUDIO;

DOI:

10.1109/CVPR46437.2021.00555

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper, we propose to make a systematic study on machines' multisensory perception under attacks. We use the audio-visual event recognition task against multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning. We attack audio, visual, and both modalities to explore whether audio-visual integration still strengthens perception and how different fusion mechanisms affect the robustness of audio-visual models. For interpreting the multimodal interactions under attacks, we learn a weakly-supervised sound source visual localization model to localize sounding regions in videos. To mitigate multimodal attacks, we propose an audio-visual defense approach based on an audio-visual dissimilarity constraint and external feature memory banks. Extensive experiments demonstrate that audio-visual models are susceptible to multimodal adversarial attacks; audio-visual integration could decrease the model robustness rather than strengthen it under multimodal attacks; even a weakly-supervised sound source visual localization model can be successfully fooled; our defense method can improve the invulnerability of audio-visual networks without significantly sacrificing clean model performance. The source code and pre-trained models are released at https://github.com/YapengTian/AV-Robustness-CVPR21.


Pages: 5597-5607

Page count: 11

References: 92 in total
