WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN

Cited by: 54
Authors
Chandna, Pritish [1 ]
Blaauw, Merlijn [1 ]
Bonada, Jordi [1 ]
Gomez, Emilia [1 ,2 ]
Affiliations
[1] Univ Pompeu Fabra, Mus Technol Grp, Barcelona, Spain
[2] European Commiss, Joint Res Ctr, Seville, Spain
Source
2019 27TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO) | 2019
Funding
European Union Horizon 2020;
Keywords
Wasserstein-GAN; DCGAN; WORLD vocoder; Singing Voice Synthesis; Block-wise Predictions;
DOI
10.23919/eusipco.2019.8903099
Chinese Library Classification
TM [Electrical technology]; TN [Electronic technology, communication technology];
Discipline codes
0808; 0809;
Abstract
We present a deep neural network based singing voice synthesizer, inspired by the Deep Convolutional Generative Adversarial Networks (DCGAN) architecture and optimized using the Wasserstein-GAN algorithm. We use vocoder parameters for acoustic modelling to separate the influence of pitch and timbre, which facilitates modelling the large pitch variability of the singing voice. Our network takes a block of consecutive frame-wise linguistic and fundamental frequency features, along with a global singer identity, as input and outputs the vocoder features corresponding to that block. This block-wise approach, together with the training methodology, allows us to model temporal dependencies within the features of the input block. For inference, sequential blocks are concatenated using an overlap-add procedure. Using objective metrics and a subjective listening test, we show that the performance of our model is competitive with the state of the art and with the original samples. We also provide synthesis examples on a supplementary website and the source code on GitHub.
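The overlap-add concatenation mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the block length, hop size, and triangular cross-fade window are assumptions chosen for the example; the paper's vocoder feature dimensions and windowing may differ.

```python
import numpy as np

def overlap_add(blocks, hop):
    """Cross-fade overlapping feature blocks (each [block_len, n_feats])
    into one continuous feature sequence via weighted overlap-add."""
    block_len, n_feats = blocks[0].shape
    total_len = hop * (len(blocks) - 1) + block_len
    out = np.zeros((total_len, n_feats))
    weight = np.zeros((total_len, 1))
    window = np.bartlett(block_len)[:, None]  # triangular cross-fade (illustrative choice)
    for i, block in enumerate(blocks):
        start = i * hop
        out[start:start + block_len] += window * block      # windowed contribution
        weight[start:start + block_len] += window           # accumulated window mass
    return out / np.maximum(weight, 1e-8)                   # normalize overlapping regions

# Toy usage: constant-valued blocks should reconstruct a (near-)constant sequence.
blocks = [np.full((8, 2), 3.0) for _ in range(4)]
features = overlap_add(blocks, hop=4)
```

The window normalization makes adjacent blocks blend smoothly where they overlap instead of producing discontinuities at block boundaries.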
Pages: 5