Self-supervised Learning of Audio-Visual Objects from Video

Cited by: 156
Authors
Afouras, Triantafyllos [1]
Owens, Andrew [2]
Chung, Joon Son [1, 3]
Zisserman, Andrew [1]
Affiliations
[1] Univ Oxford, Oxford, England
[2] Univ Michigan, Ann Arbor, MI USA
[3] Naver Corp, Seongnam Si, South Korea
Source
COMPUTER VISION - ECCV 2020, PT XVIII | 2020 / Vol. 12363
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
DOI
10.1007/978-3-030-58523-5_13
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks: (a) multi-speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection. Using our representation, these tasks can be solved entirely by training on unlabeled video, without the aid of object detectors. We also demonstrate the generality of our method by applying it to non-human speakers, including cartoons and puppets. Our model significantly outperforms other self-supervised approaches, and obtains performance competitive with methods that use supervised face detection.
Pages: 208-224
Page count: 17