Can audio-visual integration strengthen robustness under multimodal attacks?

Cited by: 14

Authors:

Tian, Yapeng [1]

Xu, Chenliang [1]

Affiliations:

[1] Univ Rochester, Rochester, NY 14627 USA

Source:

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021

Funding:

National Science Foundation (USA);

Keywords:

SPARSE; SOUND; AUDIO;

DOI:

10.1109/CVPR46437.2021.00555

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper, we propose to make a systematic study on machines' multisensory perception under attacks. We use the audio-visual event recognition task against multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning. We attack audio, visual, and both modalities to explore whether audio-visual integration still strengthens perception and how different fusion mechanisms affect the robustness of audio-visual models. For interpreting the multimodal interactions under attacks, we learn a weakly-supervised sound source visual localization model to localize sounding regions in videos. To mitigate multimodal attacks, we propose an audio-visual defense approach based on an audio-visual dissimilarity constraint and external feature memory banks. Extensive experiments demonstrate that audio-visual models are susceptible to multimodal adversarial attacks; audio-visual integration could decrease the model robustness rather than strengthen under multimodal attacks; even a weakly-supervised sound source visual localization model can be successfully fooled; our defense method can improve the invulnerability of audio-visual networks without significantly sacrificing clean model performance. The source code and pre-trained models are released in https://github.com/YapengTian/AV-Robustness-CVPR21.


Pages: 5597-5607

Page count: 11
