Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

Cited by: 126
Authors
Pascual, Santiago [1 ]
Ravanelli, Mirco [2 ]
Serra, Joan [3 ]
Bonafonte, Antonio [1 ,5 ]
Bengio, Yoshua [2 ,4 ]
Affiliations
[1] Univ Politecn Cataluna, Barcelona, Spain
[2] Univ Montreal, Mila, Montreal, PQ, Canada
[3] Tele Res, Barcelona, Spain
[4] CIFAR, Toronto, ON, Canada
[5] Amazon Res, Cambridge, England
Source
INTERSPEECH 2019 | 2019
Keywords
speech representation; speech classification; transfer learning; self-supervised learning; recognition
DOI
10.21437/Interspeech.2019-2605
Chinese Library Classification: R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline codes: 100104; 100213
Abstract
Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This paper proposes an improved self-supervised method, where a single neural encoder is followed by multiple workers that jointly solve different self-supervised tasks. The needed consensus across different tasks naturally imposes meaningful constraints to the encoder, contributing to discover general representations and to minimize the risk of learning superficial ones. Experiments show that the proposed approach can learn transferable, robust, and problem-agnostic features that carry on relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues. In addition, a number of design choices make the encoder easily exportable, facilitating its direct usage or adaptation to different problems.
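The abstract's core idea, a single shared encoder whose embeddings must simultaneously satisfy several self-supervised worker heads, can be illustrated with a minimal forward-pass sketch. Everything below is an illustrative assumption (frame size, embedding dimension, the two example regression targets, and all variable names), not the paper's actual configuration; it only shows how a summed multi-task loss ties several workers to one encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: 1 s of 16 kHz "audio", 160-sample frames,
# 100-dim embeddings. None of these come from the paper.
wave = rng.standard_normal(16000)
frame_len, emb_dim = 160, 100

# --- Shared encoder: frame the waveform, project each frame ---
frames = wave.reshape(-1, frame_len)               # (100, 160)
W_enc = rng.standard_normal((frame_len, emb_dim)) * 0.01
z = np.tanh(frames @ W_enc)                        # (100, 100) embeddings

# --- Workers: one small head per self-supervised task ---
# Targets are illustrative stand-ins: raw-waveform reconstruction and a
# spectral-magnitude-like feature. The paper uses several such tasks.
targets = {
    "waveform": frames,                                  # reconstruct input
    "spectral": np.abs(np.fft.rfft(frames))[:, :40],     # 40 "spectral" bins
}
workers = {name: rng.standard_normal((emb_dim, t.shape[1])) * 0.01
           for name, t in targets.items()}

# --- Joint objective: the per-worker losses are summed, so gradient
# updates (omitted here) would push the single encoder toward a
# representation that serves every task at once.
losses = {name: float(np.mean((z @ W - targets[name]) ** 2))
          for name, W in workers.items()}
total_loss = sum(losses.values())
print(sorted(losses), total_loss > 0)
```

The "consensus across different tasks" mentioned in the abstract is exactly this shared `z`: no single worker can pull the encoder toward a shortcut feature without increasing the other workers' losses.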
Pages: 161-165
Page count: 5