Self-supervised Learning of Audio-Visual Objects from Video

Cited by: 156

Authors:

Afouras, Triantafyllos [1]

Owens, Andrew [2]

Chung, Joon Son [1,3]

Zisserman, Andrew [1]

Affiliations:

[1] Univ Oxford, Oxford, England

[2] Univ Michigan, Ann Arbor, MI USA

[3] Naver Corp, Seongnam Si, South Korea

Source:

COMPUTER VISION - ECCV 2020, PT XVIII | 2020 / Vol. 12363

Funding:

UK Engineering and Physical Sciences Research Council (EPSRC)

DOI:

10.1007/978-3-030-58523-5_13

Chinese Library Classification (CLC):

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks: (a) multi-speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection. Using our representation, these tasks can be solved entirely by training on unlabeled video, without the aid of object detectors. We also demonstrate the generality of our method by applying it to non-human speakers, including cartoons and puppets. Our model significantly outperforms other self-supervised approaches, and obtains performance competitive with methods that use supervised face detection.

Pages: 208-224

Page count: 17

References (58 in total):

[21] Fevotte C., 2005, IRISA Technical Report 1706

[22] Gabbay A, 2018, 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P3051, DOI 10.1109/ICASSP.2018.8462527

[23] Gadde, Raghudeep; Jampani, Varun; Gehler, Peter V. Semantic Video CNNs through Representation Warping. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017: 4463-4472

[24] Gan, Chuang; Zhao, Hang; Chen, Peihao; Cox, David; Torralba, Antonio. Self-supervised Moving Vehicle Tracking with Stereo Sound. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 7052-7061

[25] Gao RH, 2019, arXiv, DOI arXiv:1904.07750

[26] Gao, Ruohan; Grauman, Kristen. 2.5D Visual Sound. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019: 324-333

[27] Gao, Ruohan; Feris, Rogerio; Grauman, Kristen. Learning to Separate Object Sounds by Watching Unlabeled Video. COMPUTER VISION - ECCV 2018, PT III, 2018, 11207: 36-54

[28] Han, Tengda; Xie, Weidi; Zisserman, Andrew. Memory-Augmented Dense Predictive Coding for Video Representation Learning. COMPUTER VISION - ECCV 2020, PT III, 2020, 12348: 312-329

[29] Han, Tengda; Xie, Weidi; Zisserman, Andrew. Video Representation Learning by Dense Predictive Coding. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019: 1483-1492

[30] Harwath, David; Recasens, Adria; Suris, Didac; Chuang, Galen; Torralba, Antonio; Glass, James. Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input. COMPUTER VISION - ECCV 2018, PT VI, 2018, 11210: 659-677
