Seeing voices and hearing voices: Learning discriminative embeddings using cross-modal self-supervision

Cited by: 16

Authors:

Chung, Soo-Whan [1]

Kang, Hong-Goo [1]

Chung, Joon Son [2]

Affiliations:

[1] Yonsei Univ, Dept Elect & Elect Engn, Seoul, South Korea

[2] Naver Corp, Seongnam, South Korea

Source:

INTERSPEECH 2020 | 2020

Keywords:

self-supervised learning; metric learning; cross-modal; speaker recognition; lip reading;

DOI:

10.21437/Interspeech.2020-1113

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification code:

100104 ; 100213 ;

Abstract:

The goal of this work is to train discriminative cross-modal embeddings without access to manually annotated data. Recent advances in self-supervised learning have shown that effective representations can be learnt from natural cross-modal synchrony. We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks. To this end, we propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each of the modalities. The effectiveness of the method is demonstrated on two downstream tasks: lip reading using the features trained on audio-visual synchronisation, and speaker recognition using the features trained for cross-modal biometric matching. The proposed method outperforms state-of-the-art self-supervised baselines by a significant margin.

Pages: 3486-3490

Page count: 5
