AutoSpeech: Neural Architecture Search for Speaker Recognition

Cited by: 24
Authors
Ding, Shaojin [1]
Chen, Tianlong [2]
Gong, Xinyu [1, 2]
Zha, Weiwei [3]
Wang, Zhangyang [2]
Affiliations
[1] Texas A&M Univ, Dept Comp Sci & Engn, College Stn, TX 77843 USA
[2] Univ Texas Austin, Dept Elect & Comp Engn, Austin, TX 78712 USA
[3] Univ Sci & Technol China, Sch Software Engn, Beijing, Peoples R China
Source
INTERSPEECH 2020 | 2020
Keywords
speaker recognition; neural architecture search
DOI
10.21437/Interspeech.2020-1258
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
Speaker recognition systems based on Convolutional Neural Networks (CNNs) are often built with off-the-shelf backbones such as VGG-Net or ResNet. However, these backbones were originally proposed for image classification and therefore may not be a natural fit for speaker recognition. Because manually exploring the design space is prohibitively complex, we propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech. Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times. The final speaker recognition model is obtained by training the derived CNN model with the standard scheme. To evaluate the proposed approach, we conduct experiments on both speaker identification and speaker verification tasks using the VoxCeleb1 dataset. Results demonstrate that the CNN architectures derived by the proposed approach significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while having lower model complexity.
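The two steps named in the abstract (choose an operation combination for one cell, then stack that cell) can be illustrated with a toy sketch. Everything concrete here is an assumption for illustration only: the candidate operation names, the edge scores, the cell count, and the use of a per-edge argmax are hypothetical and are not taken from this record, which does not describe the search space or the search procedure.

```python
# Hypothetical candidate operations; the record does not list the search space.
CANDIDATE_OPS = ["conv_3x3", "conv_5x5", "max_pool_3x3", "skip_connect"]


def derive_cell(edge_scores):
    """Pick the highest-scoring candidate operation on every edge of the cell."""
    cell = {}
    for edge, scores in edge_scores.items():
        best = max(range(len(scores)), key=scores.__getitem__)
        cell[edge] = CANDIDATE_OPS[best]
    return cell


def stack_cells(cell, num_cells):
    """Describe the derived network as the same cell repeated num_cells times."""
    return [dict(cell) for _ in range(num_cells)]


# Made-up scores standing in for the outcome of a search phase.
# Keys are (source node, target node) edges inside the cell.
edge_scores = {
    (0, 2): [0.1, 0.6, 0.2, 0.1],
    (1, 2): [0.3, 0.1, 0.1, 0.5],
    (0, 3): [0.7, 0.1, 0.1, 0.1],
}

cell = derive_cell(edge_scores)
network = stack_cells(cell, num_cells=8)  # 8 is an arbitrary illustrative depth
```

In this sketch the network is only a list of cell descriptions; a real system would instantiate each operation as a layer and then train the stacked model, as the abstract states.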
Pages: 916-920
Number of pages: 5