MASKCYCLEGAN-VC: LEARNING NON-PARALLEL VOICE CONVERSION WITH FILLING IN FRAMES

Cited by: 37
Authors
Kaneko, Takuhiro [1 ]
Kameoka, Hirokazu [1 ]
Tanaka, Kou [1 ]
Hojo, Nobukatsu [1 ]
Affiliations
[1] NTT Corp, NTT Commun Sci Labs, Tokyo, Japan
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
Voice conversion (VC); non-parallel VC; generative adversarial networks (GANs); CycleGAN-VC; mel-spectrogram conversion; neural networks
DOI
10.1109/ICASSP39728.2021.9414851
CLC number
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However, owing to their insufficient ability to grasp time-frequency structures, their application is limited to mel-cepstrum conversion and not mel-spectrogram conversion despite recent advances in mel-spectrogram vocoders. To overcome this, CycleGAN-VC3, an improved variant of CycleGAN-VC2 that incorporates an additional module called time-frequency adaptive normalization (TFAN), has been proposed. However, this imposes an increase in the number of learned parameters. As an alternative, we propose MaskCycleGAN-VC, another extension of CycleGAN-VC2, which is trained using a novel auxiliary task called filling in frames (FIF). With FIF, we apply a temporal mask to the input mel-spectrogram and encourage the converter to fill in the missing frames based on the surrounding frames. This task allows the converter to learn time-frequency structures in a self-supervised manner and eliminates the need for an additional module such as TFAN. A subjective evaluation of naturalness and speaker similarity showed that MaskCycleGAN-VC outperformed both CycleGAN-VC2 and CycleGAN-VC3 with a model size similar to that of CycleGAN-VC2.
Pages: 5919-5923 (5 pages)