Seeing voices and hearing voices: Learning discriminative embeddings using cross-modal self-supervision

Cited by: 16
Authors
Chung, Soo-Whan [1]
Kang, Hong-Goo [1]
Chung, Joon Son [2]
Affiliations
[1] Yonsei Univ, Dept Elect & Elect Engn, Seoul, South Korea
[2] Naver Corp, Seongnam, South Korea
Source
INTERSPEECH 2020 | 2020
Keywords
self-supervised learning; metric learning; cross-modal; speaker recognition; lip reading
DOI
10.21437/Interspeech.2020-1113
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
The goal of this work is to train discriminative cross-modal embeddings without access to manually annotated data. Recent advances in self-supervised learning have shown that effective representations can be learnt from natural cross-modal synchrony. We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks. To this end, we propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each of the modalities. The effectiveness of the method is demonstrated on two downstream tasks: lip reading using the features trained on audio-visual synchronisation, and speaker recognition using the features trained for cross-modal biometric matching. The proposed method outperforms state-of-the-art self-supervised baselines by a significant margin.
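The training idea in the abstract (a cross-modal matching objective plus a term that separates features within each modality) can be illustrated with a minimal NumPy sketch. This is not the authors' formulation: the function names `matching_loss` and `combined_loss`, the use of squared Euclidean distance, and the `weight` parameter are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def matching_loss(queries, candidates):
    """Multi-way matching loss (hypothetical helper).

    Row i of `queries` should be closest to row i of `candidates`.
    Negative squared distances act as logits for a softmax cross-entropy.
    """
    diff = queries[:, None, :] - candidates[None, :, :]
    dist = (diff ** 2).sum(axis=-1)           # shape (N, N)
    logp = log_softmax(-dist, axis=1)
    return -np.mean(np.diag(logp))


def combined_loss(video, audio, weight=1.0):
    """Cross-modal term plus intra-modal separation terms (assumed form).

    `weight` is an illustrative hyperparameter, not taken from the paper.
    """
    cross = matching_loss(video, audio)
    # Matching a modality against itself penalises distinct items
    # that lie close together, so it pushes them apart.
    intra = matching_loss(video, video) + matching_loss(audio, audio)
    return cross + weight * intra
```

Under these assumptions, the intra-modal term equals log N when all N embeddings of a modality collapse to one point and approaches zero as they spread apart, which is one simple way to encourage separation within a modality alongside the cross-modal objective.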
Pages: 3486-3490
Number of pages: 5