Deep Multimodal Clustering for Unsupervised Audiovisual Learning

Cited by: 157
Authors
Hu, Di
Nie, Feiping
Li, Xuelong [1]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
Source
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019) | 2019
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR.2019.00947
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The seen birds twitter, the running cars accompany with noise, etc. These naturally audiovisual correspondences provide the possibilities to explore and understand the outside world. However, the mixed multiple objects and sounds make it intractable to perform efficient matching in the unconstrained environment. To settle this problem, we propose to adequately excavate audio and visual components and perform elaborate correspondence learning among them. Concretely, a novel unsupervised audiovisual learning model is proposed, named as Deep Multimodal Clustering (DMC), that synchronously performs sets of clustering with multimodal vectors of convolutional maps in different shared spaces for capturing multiple audiovisual correspondences. And such integrated multimodal clustering network can be effectively trained with max-margin loss in the end-to-end fashion. Amounts of experiments in feature evaluation and audiovisual tasks are performed. The results demonstrate that DMC can learn effective unimodal representation, with which the classifier can even outperform human performance. Further, DMC shows noticeable performance in sound localization, multisource detection, and audiovisual understanding.
Pages: 9240-9249
Page count: 10