Conditional Independence for Pretext Task Selection in Self-Supervised Speech Representation Learning

Cited by: 2
Authors
Zaiem, Salah [1 ,2 ]
Parcollet, Titouan [2 ]
Essid, Slim [1 ]
Affiliations
[1] Inst Polytech Paris, Telecom Paris, LTCI, Palaiseau, France
[2] Avignon Univ, LIA, Avignon, France
Source
INTERSPEECH 2021 | 2021
Keywords
Self-Supervised Learning; Speech Representation Learning
DOI
10.21437/Interspeech.2021-1027
CLC classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline codes
100104; 100213
Abstract
Through solving pretext tasks, self-supervised learning (SSL) leverages unlabeled data to extract useful latent representations that replace traditional input features in the downstream task. A common pretext task consists in pretraining an SSL model on pseudo-labels derived from the original signal. This technique is particularly relevant for speech data, where various meaningful signal-processing features may serve as pseudo-labels. However, the process of selecting pseudo-labels, for speech or other types of data, remains mostly unexplored and currently relies on observing the results on the final downstream task. This methodology is not sustainable at scale due to substantial computational (hence carbon) costs. Thus, this paper introduces a practical and theoretical framework for selecting pseudo-labels that are relevant to a given downstream task. More precisely, we propose a functional estimator of pseudo-label utility grounded in conditional independence theory, which does not require any training. Experiments conducted on speaker recognition and automatic speech recognition validate our estimator, showing a significant correlation between the performance observed on the downstream task and the utility estimates obtained with our approach, thereby facilitating the prospection of relevant pseudo-labels for self-supervised speech representation learning.
Pages: 2851 / 2855
Page count: 5