ON USING BACKPROPAGATION FOR SPEECH TEXTURE GENERATION AND VOICE CONVERSION

Cited by: 0
Authors
Chorowski, Jan [1]
Weiss, Ron J. [1]
Saurous, Rif A. [1]
Bengio, Samy [1]
Affiliations
[1] Google Brain, Mountain View, CA 94043 USA
Source
2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2018
Keywords
Texture synthesis; voice conversion; style transfer; deep neural networks; convolutional networks; CTC;
DOI
Not available
Chinese Library Classification
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
Inspired by recent work on neural network image generation that relies on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and matching the statistics of neuron activations between source and target utterances. As in image texture synthesis and neural style transfer, the system works by optimizing a cost function with respect to the input waveform samples. To this end, we use a differentiable mel-filterbank feature extraction pipeline and train a convolutional CTC speech recognition network. Our system can extract speaker characteristics from very limited amounts of target speaker data, as little as a few seconds, and can be used to generate realistic speech babble or to reconstruct an utterance in a different voice.
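The core idea in the abstract, optimizing the input samples so that activation statistics match those of a target, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' system: a small bank of random 1-D filters with a ReLU stands in for the trained CTC recognition network and the mel-filterbank front end, the matched statistics are time-averaged channel correlations (a Gram matrix, as in image texture synthesis), and plain gradient descent with backtracking stands in for whatever optimizer the paper uses. The function names (`features`, `gram`, `loss_and_grad`, `synthesize`) are hypothetical.

```python
import math
import random


def features(x, filters):
    """ReLU activations of a 1-D convolution: one row per filter (stand-in network)."""
    width = len(filters[0])
    steps = len(x) - width + 1
    return [[max(0.0, sum(f[j] * x[t + j] for j in range(width)))
             for t in range(steps)] for f in filters]


def gram(acts):
    """Time-averaged channel correlations, used here as the texture statistics."""
    steps = len(acts[0])
    return [[sum(a * b for a, b in zip(ra, rb)) / steps for rb in acts]
            for ra in acts]


def loss_and_grad(x, filters, target_gram):
    """Squared Gram mismatch and its gradient with respect to the input samples."""
    acts = features(x, filters)
    g = gram(acts)
    k, steps, width = len(filters), len(acts[0]), len(filters[0])
    diff = [[g[a][b] - target_gram[a][b] for b in range(k)] for a in range(k)]
    loss = sum(d * d for row in diff for d in row)
    grad = [0.0] * len(x)
    for a in range(k):
        for t in range(steps):
            if acts[a][t] <= 0.0:  # ReLU passes no gradient where it is inactive
                continue
            # dL/dA[a][t] = (4 / steps) * sum_b diff[a][b] * A[b][t] (diff is symmetric)
            d_act = 4.0 / steps * sum(diff[a][b] * acts[b][t] for b in range(k))
            for j in range(width):  # backpropagate through the convolution to x
                grad[t + j] += d_act * filters[a][j]
    return loss, grad


def synthesize(target, filters, iterations=30, seed=0):
    """Start from noise and descend on the samples; returns (signal, loss history)."""
    rng = random.Random(seed)
    target_gram = gram(features(target, filters))
    x = [rng.gauss(0.0, 0.1) for _ in target]
    loss, grad = loss_and_grad(x, filters, target_gram)
    history = [loss]
    lr = 1.0
    for _ in range(iterations):
        while lr > 1e-12:  # backtracking: shrink the step until the loss drops
            cand = [xi - lr * gi for xi, gi in zip(x, grad)]
            cand_loss, cand_grad = loss_and_grad(cand, filters, target_gram)
            if cand_loss < loss:
                x, loss, grad = cand, cand_loss, cand_grad
                lr *= 2.0
                break
            lr *= 0.5
        history.append(loss)
    return x, history


if __name__ == "__main__":
    rng = random.Random(1)
    filters = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
    target = [math.sin(2 * math.pi * 5 * t / 200)
              + 0.5 * math.sin(2 * math.pi * 12 * t / 200) for t in range(200)]
    _, history = synthesize(target, filters)
    print("initial loss:", history[0], "final loss:", history[-1])
```

Because only time-averaged statistics are matched, the result is not expected to reproduce the target sample by sample, only to share its activation correlations under the chosen filters. That is the sense in which the abstract speaks of "texture"; reconstructing specific content would additionally require the representation-inversion term the abstract mentions, which this sketch leaves out.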
Pages: 2256-2260
Page count: 5