Voice Conversion System Based on Deep Neural Network Capable of Parallel Computation

Cited by: 0
Authors
Sato, Kunihiko [1 ]
Rekimoto, Jun [2 ]
Affiliations
[1] Univ Tokyo, Tokyo, Japan
[2] Univ Tokyo, Sony Comp Sci Lab, Tokyo, Japan
Source
25th IEEE Conference on Virtual Reality and 3D User Interfaces (VR) | 2018
Keywords
Voice conversion; Voice avatars; Deep learning
DOI
Not available
Chinese Library Classification
TP31 [Computer Software]
Discipline Classification Code
081202; 0835
Abstract
Voice conversion (VC) algorithms modify the speech of a particular speaker to resemble that of another speaker. Many existing virtual reality (VR) and augmented reality (AR) systems allow users to change their appearance; adding VC would let them change their voice as well. State-of-the-art VC methods employ recurrent neural networks (RNNs), including long short-term memory (LSTM) networks, to generate converted speech. However, RNNs are difficult to parallelize because the computation at each timestep depends on the result of the previous timestep, which prevents real-time operation. In contrast, we propose a novel VC approach based on a dilated convolutional neural network (Dilated CNN), a deep neural network model that allows for parallel computation. We adapted the Dilated CNN to perform convolutions in both the forward and reverse directions to ensure successful learning. In addition, to ensure the model can be parallelized during both training and inference, we developed a model architecture that predicts all output values directly from the input speech and does not feed predicted values back as the next input. The results demonstrate that the proposed VC approach converts speech faster than state-of-the-art methods while slightly improving speech quality and maintaining speaker similarity.
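The core idea in the abstract can be illustrated with a minimal sketch: a non-causal dilated 1-D convolution whose output at every timestep depends only on the input sequence (never on previously predicted outputs), so all frames can be computed in parallel, plus a forward-and-reverse combination in the spirit of the bidirectional convolutions the authors describe. This is a toy illustration under assumed details (kernel size, padding, and the way the two directions are merged are not specified in the abstract), not the paper's actual architecture.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Non-causal 1-D dilated convolution with zero padding.

    x: (T,) input sequence; w: (k,) kernel with odd k.
    Each output frame depends only on a fixed window of INPUT frames,
    so every frame is independent of the others and the loop below
    could run fully in parallel (unlike an RNN's timestep recurrence).
    """
    k = len(w)
    pad = (k - 1) * dilation // 2          # center the receptive field
    xp = np.pad(x, pad)
    T = len(x)
    return np.array([
        sum(w[i] * xp[t + i * dilation] for i in range(k))
        for t in range(T)
    ])

def bidirectional_dilated(x, w, dilation):
    """Combine forward and time-reversed dilated convolutions
    (a simple sum here; the paper's merge rule is an assumption)."""
    fwd = dilated_conv1d(x, w, dilation)
    bwd = dilated_conv1d(x[::-1], w, dilation)[::-1]
    return fwd + bwd
```

With the identity kernel `w = [0, 1, 0]`, `dilated_conv1d` returns the input unchanged and the bidirectional variant returns twice the input, which makes the receptive-field arithmetic easy to check; in a real model the kernels would be learned and many such layers stacked with growing dilation rates.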
Pages: 677-678
Page count: 2