Can audio-visual integration strengthen robustness under multimodal attacks?

Cited by: 14
Authors
Tian, Yapeng [1]
Xu, Chenliang [1]
Affiliations
[1] Univ Rochester, Rochester, NY 14627 USA
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
Funding
U.S. National Science Foundation;
Keywords
SPARSE; SOUND; AUDIO;
DOI
10.1109/CVPR46437.2021.00555
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we present a systematic study of machines' multisensory perception under attacks. We use the audio-visual event recognition task against multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning. We attack the audio modality, the visual modality, and both together to explore whether audio-visual integration still strengthens perception and how different fusion mechanisms affect the robustness of audio-visual models. To interpret the multimodal interactions under attacks, we learn a weakly-supervised sound source visual localization model to localize sounding regions in videos. To mitigate multimodal attacks, we propose an audio-visual defense approach based on an audio-visual dissimilarity constraint and external feature memory banks. Extensive experiments demonstrate that audio-visual models are susceptible to multimodal adversarial attacks; audio-visual integration could decrease model robustness rather than strengthen it under multimodal attacks; even a weakly-supervised sound source visual localization model can be successfully fooled; and our defense method can improve the invulnerability of audio-visual networks without significantly sacrificing clean model performance. The source code and pre-trained models are released at https://github.com/YapengTian/AV-Robustness-CVPR21.
Pages: 5597-5607
Page count: 11