Non-Contrastive Self-Supervised Learning for Utterance-Level Information Extraction From Speech

Cited by: 6
Authors
Cho, Jaejin [1 ,2 ]
Villalba, Jesus [1 ,2 ,3 ]
Moro-Velazquez, Laureano [1 ,2 ]
Dehak, Najim [1 ,2 ,3 ]
Affiliations
[1] Johns Hopkins Univ, Dept Elect & Comp Engn, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc CLSP, Baltimore, MD 21218 USA
[3] Johns Hopkins Univ, Human Language Technol Ctr Excellence, Baltimore, MD 21218 USA
Keywords
Data models; Task analysis; Speech processing; Feature extraction; Adaptation models; Training; Emotion recognition; Self-supervised learning; transfer learning; speaker verification; Alzheimer's disease; distillation; non-contrastive; representation
DOI
10.1109/JSTSP.2022.3197315
Chinese Library Classification
TM (Electrical Technology); TN (Electronic Technology, Communication Technology)
Discipline Codes
0808 ; 0809 ;
Abstract
In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning of utterance-level speech representations can serve speech applications that require a discriminative representation of attributes that are consistent within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representations, e.g., wav2vec, can be pooled into utterance-level representations, but the models are usually large. There are also self-supervised techniques that learn utterance-level representations directly. One of the most successful is a contrastive method, which requires negative sampling: selecting alternative samples to contrast with the current sample (anchor). However, without labels there is no guarantee that all the negative samples belong to classes different from the anchor's. This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings. We adapted DIstillation with NO labels (DINO) from computer vision to speech. Unlike contrastive methods, DINO does not require negative sampling. We compared DINO to an x-vector network trained in a supervised manner. When transferred to speaker verification, speech emotion recognition (SER), and Alzheimer's disease detection, DINO outperformed the x-vector. We studied the influence of several aspects of transfer learning, such as dividing the fine-tuning process into steps, chunk lengths, and augmentation. Fine-tuning the last affine layers first and then the whole network surpassed fine-tuning everything at once. Using shorter chunks, although they generate more diverse inputs, did not necessarily improve performance, implying that speech segments of at least a certain length are required for good performance in each application. Augmentation was helpful in SER.
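The core idea of the non-contrastive DINO objective described above can be illustrated with a minimal sketch: a student distribution is matched to a centered, sharpened teacher distribution, so no negative samples are needed. The function name, temperatures, and toy dimensions below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student output distributions.

    Collapse is avoided without negative sampling: the teacher outputs
    are centered (subtracting a running mean) and sharpened with a
    lower temperature than the student's.
    """
    # Teacher: center, sharpen, and detach so no gradient flows to it.
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
    # Student: log-probabilities at a higher (softer) temperature.
    s = F.log_softmax(student_out / tau_s, dim=-1)
    # Per-sample cross-entropy, averaged over the batch.
    return -(t * s).sum(dim=-1).mean()

# Toy usage: random projection-head outputs for a batch of 8 utterances.
K = 16  # illustrative output dimension
student = torch.randn(8, K)
teacher = torch.randn(8, K)
center = torch.zeros(K)  # in practice an EMA of teacher outputs
loss = dino_loss(student, teacher, center)
```

In the full method, the teacher's weights are an exponential moving average of the student's, and the two networks see different augmented views (chunks) of the same utterance.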
Pages: 1284-1295
Page count: 12
References
44 items
  • [1] Baevski, A., 2020, Advances in Neural Information Processing Systems, Vol. 33
  • [2] Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., Narayanan, S. S., "IEMOCAP: interactive emotional dyadic motion capture database," Language Resources and Evaluation, 2008, 42(4): 335-359
  • [3] Caron, M., Touvron, H., Misra, I., Jegou, H., Mairal, J., Bojanowski, P., Joulin, A., "Emerging Properties in Self-Supervised Vision Transformers," 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021: 9630-9640
  • [4] Chen, T., 2020, Advances in Neural Information Processing Systems, Vol. 33, p. 22243
  • [5] Cho, J., Zelasko, P., Villalba, J., Watanabe, S., Dehak, N., "Learning Speaker Embedding from Text-to-Speech," Interspeech 2020, 2020: 3256-3260
  • [6] Chung, J. S., 2020, WORKSHOP SELF SUPERV
  • [7] Chung, J. S., 2018, Interspeech, p. 1086
  • [8] Chung, Y.-A., Hsu, W.-N., Tang, H., Glass, J., "An Unsupervised Autoregressive Model for Speech Representation Learning," Interspeech 2019, 2019: 146-150
  • [9] Deng, J., Guo, J., Xue, N., Zafeiriou, S., "ArcFace: Additive Angular Margin Loss for Deep Face Recognition," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 4685-4694
  • [10] Devlin, J., 2019, Proceedings of NAACL-HLT 2019, Vol. 1, p. 4171