Self-supervised Learning of Audio-Visual Objects from Video

Cited by: 156
Authors
Afouras, Triantafyllos [1]
Owens, Andrew [2]
Chung, Joon Son [1, 3]
Zisserman, Andrew [1]
Affiliations
[1] Univ Oxford, Oxford, England
[2] Univ Michigan, Ann Arbor, MI USA
[3] Naver Corp, Seongnam Si, South Korea
Source
COMPUTER VISION - ECCV 2020, PT XVIII | 2020 / Vol. 12363
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
DOI
10.1007/978-3-030-58523-5_13
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks: (a) multi-speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection. Using our representation, these tasks can be solved entirely by training on unlabeled video, without the aid of object detectors. We also demonstrate the generality of our method by applying it to non-human speakers, including cartoons and puppets. Our model significantly outperforms other self-supervised approaches, and obtains performance competitive with methods that use supervised face detection.
Pages: 208-224
Number of pages: 17