Non-Contrastive Self-Supervised Learning for Utterance-Level Information Extraction From Speech

Cited by: 6
Authors
Cho, Jaejin [1 ,2 ]
Villalba, Jesus [1 ,2 ,3 ]
Moro-Velazquez, Laureano [1 ,2 ]
Dehak, Najim [1 ,2 ,3 ]
Affiliations
[1] Johns Hopkins Univ, Dept Elect & Comp Engn, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc CLSP, Baltimore, MD 21218 USA
[3] Johns Hopkins Univ, Human Language Technol Ctr Excellence, Baltimore, MD 21218 USA
Keywords
Data models; Task analysis; Speech processing; Feature extraction; Adaptation models; Training; Emotion recognition; Self-supervised learning; transfer learning; speaker verification; Alzheimer's disease; distillation; non-contrastive; representation
DOI
10.1109/JSTSP.2022.3197315
Chinese Library Classification
TM (Electrical Technology); TN (Electronic Technology, Communication Technology)
Discipline Codes
0808 ; 0809 ;
Abstract
In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning of utterance-level speech representations can serve speech applications that require a discriminative representation of attributes that are consistent within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representations, e.g., wav2vec, can be pooled into utterance-level representations, but the models are usually large. There are also self-supervised techniques that learn utterance-level representations directly. One of the most successful is a contrastive method, which requires negative sampling: selecting alternative samples to contrast with the current sample (anchor). However, without labels there is no guarantee that all the negative samples belong to classes different from the anchor's. This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings. We adapted DIstillation with NO labels (DINO) from computer vision to speech. Unlike contrastive methods, DINO does not require negative sampling. We compared DINO to an x-vector network trained in a supervised manner. When transferred to speaker verification, speech emotion recognition (SER), and Alzheimer's disease detection, DINO outperformed the x-vector. We studied the influence of several aspects of transfer learning, such as dividing the fine-tuning process into steps, chunk lengths, and augmentation. Fine-tuning the last affine layers first and then the whole network surpassed fine-tuning everything at once. Using shorter chunks, although they generate more diverse inputs, did not necessarily improve performance, implying that speech segments of at least a certain length are required for good performance in each application. Augmentation was helpful in SER.
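The core idea of the non-contrastive DINO objective described above can be illustrated with a minimal sketch: a student distribution is matched to a centered, sharpened teacher distribution, so no negative samples are needed. The function name, temperatures, and toy dimensions below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student output distributions.

    Collapse is avoided without negative sampling: the teacher outputs
    are centered (subtracting a running mean) and sharpened with a
    lower temperature than the student's.
    """
    # Teacher: center, sharpen, and detach so no gradient flows to it.
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
    # Student: log-probabilities at a higher (softer) temperature.
    s = F.log_softmax(student_out / tau_s, dim=-1)
    # Per-sample cross-entropy, averaged over the batch.
    return -(t * s).sum(dim=-1).mean()

# Toy usage: random projection-head outputs for a batch of 8 utterances.
K = 16  # illustrative output dimension
student = torch.randn(8, K)
teacher = torch.randn(8, K)
center = torch.zeros(K)  # in practice an EMA of teacher outputs
loss = dino_loss(student, teacher, center)
```

In the full method, the teacher's weights are an exponential moving average of the student's, and the two networks see different augmented views (chunks) of the same utterance.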
Pages: 1284-1295
Page count: 12
References
44 items
  • [1] Baevski, A., 2020, Advances in Neural Information Processing Systems, Vol. 33
  • [2] Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., Narayanan, S. S., "IEMOCAP: interactive emotional dyadic motion capture database," Language Resources and Evaluation, 2008, 42(4): 335-359
  • [3] Caron, M., Touvron, H., Misra, I., Jegou, H., Mairal, J., Bojanowski, P., Joulin, A., "Emerging Properties in Self-Supervised Vision Transformers," 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021: 9630-9640
  • [4] Chen, T., 2020, Advances in Neural Information Processing Systems, Vol. 33, p. 22243
  • [5] Cho, J., Zelasko, P., Villalba, J., Watanabe, S., Dehak, N., "Learning Speaker Embedding from Text-to-Speech," Interspeech 2020, 2020: 3256-3260
  • [6] Chung, J. S., 2020, WORKSHOP SELF SUPERV
  • [7] Chung, J. S., 2018, Interspeech, p. 1086
  • [8] Chung, Y.-A., Hsu, W.-N., Tang, H., Glass, J., "An Unsupervised Autoregressive Model for Speech Representation Learning," Interspeech 2019, 2019: 146-150
  • [9] Deng, J., Guo, J., Xue, N., Zafeiriou, S., "ArcFace: Additive Angular Margin Loss for Deep Face Recognition," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 4685-4694
  • [10] Devlin, J., 2019, Proceedings of NAACL-HLT 2019, Vol. 1, p. 4171