Perfect Match: Self-Supervised Embeddings for Cross-Modal Retrieval

Cited by: 13
Authors
Chung, Soo-Whan [1]
Chung, Joon Son [2]
Kang, Hong-Goo [1]
Affiliations
[1] Yonsei Univ, Dept Elect & Elect Engn, Seoul 03722, South Korea
[2] Naver Corp, Seongnam Si 13561, Gyeonggi Do, South Korea
Keywords
Task analysis; Training; Synchronization; Visualization; Streaming media; Feature extraction; Cross-modal; multi-modal; self-supervision; embedding; retrieval
DOI
10.1109/JSTSP.2020.2987720
Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809
Abstract
This paper proposes a new strategy for learning effective cross-modal joint embeddings using self-supervision. We set up the problem as one of cross-modal retrieval, where the objective is to find the most relevant data in one domain given input in another. The method builds on recent advances in learning representations from cross-modal self-supervision using contrastive or binary cross-entropy loss functions. To investigate the robustness of the proposed learning strategy across multi-modal applications, we perform experiments for two applications: audio-visual synchronisation and cross-modal biometrics. The audio-visual synchronisation task requires temporal correspondence between modalities to obtain a joint representation of phonemes and visemes, and the cross-modal biometrics task requires common speaker representations given their face images and audio tracks. Experiments show that the performance of systems trained using the proposed method far exceeds that of existing methods on both tasks, whilst allowing significantly faster training.
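The retrieval setup described in the abstract can be illustrated with a minimal sketch. It assumes that both modalities have already been mapped into a shared embedding space and that relevance is scored by cosine similarity; the vectors, the function names (`cosine_similarity`, `retrieve`) and the choice of cosine similarity are all illustrative assumptions, not the authors' implementation or training objective.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy embeddings: one audio query, three video candidates.
audio_query = [1.0, 0.0, 0.5]
video_candidates = [
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [0.9, 0.1, 0.4],    # close to the query
    [-1.0, 0.0, -0.5],  # opposite to the query
]

ranking = retrieve(audio_query, video_candidates)
print(ranking)
```

In this toy case the second candidate ranks first because it points in nearly the same direction as the query. A trained system would obtain the embeddings from learned audio and video encoders rather than from hand-written vectors.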
Pages: 568-576
Number of pages: 9