Can audio-visual integration strengthen robustness under multimodal attacks?

Cited by: 14
Authors
Tian, Yapeng [1]
Xu, Chenliang [1]
Affiliations
[1] Univ Rochester, Rochester, NY 14627 USA
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
Funding
U.S. National Science Foundation
Keywords
SPARSE; SOUND; AUDIO;
DOI
10.1109/CVPR46437.2021.00555
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we make a systematic study of machines' multisensory perception under attacks. We use the audio-visual event recognition task against multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning. We attack the audio modality, the visual modality, and both together to explore whether audio-visual integration still strengthens perception and how different fusion mechanisms affect the robustness of audio-visual models. To interpret the multimodal interactions under attacks, we learn a weakly-supervised sound source visual localization model to localize sounding regions in videos. To mitigate multimodal attacks, we propose an audio-visual defense approach based on an audio-visual dissimilarity constraint and external feature memory banks. Extensive experiments demonstrate that (1) audio-visual models are susceptible to multimodal adversarial attacks; (2) audio-visual integration can decrease model robustness rather than strengthen it under multimodal attacks; (3) even a weakly-supervised sound source visual localization model can be successfully fooled; and (4) our defense method can improve the invulnerability of audio-visual networks without significantly sacrificing clean model performance. The source code and pre-trained models are released at https://github.com/YapengTian/AV-Robustness-CVPR21.
Pages: 5597-5607
Page count: 11