Singing voice synthesis based on deep neural networks

Cited by: 55
Authors
Nishimura, Masanari [1 ]
Hashimoto, Kei [1 ]
Oura, Keiichiro [1 ]
Nankaku, Yoshihiko [1 ]
Tokuda, Keiichi [1 ]
Affiliations
[1] Nagoya Inst Technol, Dept Sci & Engn Simulat, Nagoya, Aichi, Japan
Source
17th Annual Conference of the International Speech Communication Association (Interspeech 2016), Vols. 1-5: Understanding Speech Processing in Humans and Machines | 2016
Funding
Japan Science and Technology Agency (JST);
Keywords
Singing voice synthesis; Neural network; DNN; Acoustic model; HMM;
DOI
10.21437/Interspeech.2016-1027
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
Singing voice synthesis techniques based on hidden Markov models (HMMs) have been proposed. In these approaches, the spectrum, excitation, and duration of singing voices are simultaneously modeled with context-dependent HMMs, and waveforms are generated from the HMMs themselves. However, the quality of the synthesized singing voices still has not reached that of natural singing voices. Deep neural networks (DNNs) have substantially improved on conventional approaches in various research areas, including speech recognition, image recognition, and speech synthesis, and DNN-based text-to-speech (TTS) systems can synthesize high-quality speech. In a DNN-based TTS system, a DNN is trained to represent the mapping from contextual features to acoustic features, which is modeled by decision-tree-clustered context-dependent HMMs in an HMM-based TTS system. In this paper, we propose DNN-based singing voice synthesis and evaluate its effectiveness. The relationship between a musical score and its acoustic features is modeled frame by frame with a DNN. To cope with the sparseness of pitch contexts in the database, musical-note-level pitch normalization and linear-interpolation techniques are used to prepare the excitation features. Subjective experimental results show that the DNN-based system outperformed the HMM-based system in terms of naturalness.
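The pitch-preparation step described in the abstract can be sketched in a few lines: the observed log F0 contour is first made continuous by linear interpolation through unvoiced frames, and the log F0 of the note written in the score is then subtracted, so the DNN models relative rather than absolute pitch and is less affected by pitch contexts that are sparse in the database. The sketch below is a minimal illustration under our own assumptions (frame-level F0 in Hz, score notes as MIDI numbers); the function names are illustrative and not taken from the paper's implementation.

```python
import numpy as np


def midi_to_log_f0(midi_note: np.ndarray) -> np.ndarray:
    # Convert MIDI note numbers to log fundamental frequency in Hz
    # (equal temperament, A4 = MIDI 69 = 440 Hz).
    return np.log(440.0 * 2.0 ** ((midi_note - 69) / 12.0))


def interpolate_unvoiced(log_f0: np.ndarray, voiced: np.ndarray) -> np.ndarray:
    # Make the log F0 contour continuous by linearly interpolating
    # through unvoiced frames, as the abstract describes.
    frames = np.arange(len(log_f0))
    return np.interp(frames, frames[voiced], log_f0[voiced])


def note_level_pitch_normalization(log_f0: np.ndarray,
                                   voiced: np.ndarray,
                                   note_midi: np.ndarray) -> np.ndarray:
    # Subtract the written note's log F0 from the interpolated observed
    # log F0, yielding a relative pitch contour that can generalize to
    # note heights that are rare or absent in the training data.
    continuous_log_f0 = interpolate_unvoiced(log_f0, voiced)
    return continuous_log_f0 - midi_to_log_f0(note_midi)


# Toy example: five frames sung on A4 (MIDI 69); frame 2 is unvoiced.
observed_f0 = np.array([438.0, 441.0, 1.0, 442.0, 440.0])  # Hz
voiced_mask = np.array([True, True, False, True, True])
score_notes = np.full(5, 69)                               # MIDI numbers

relative = note_level_pitch_normalization(np.log(observed_f0),
                                          voiced_mask, score_notes)
print(relative)  # values near 0: the singer stays close to the score pitch
```

At synthesis time the same normalization would be inverted: the DNN predicts the relative contour, and the note's log F0 is added back to recover absolute pitch.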
Pages: 2478-2482
Page count: 5
Related papers
50 items in total (items [41]-[50] shown below)
  • [41] "LiteSing: Towards Fast, Lightweight and Expressive Singing Voice Synthesis." Zhuang, Xiaobin; Jiang, Tao; Chou, Szu-Yu; Wu, Bin; Hu, Peng; Lui, Simon. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), 2021: 7078-7082.
  • [42] "Voice Conversion Using Deep Neural Networks With Layer-Wise Generative Training." Chen, Ling-Hui; Ling, Zhen-Hua; Liu, Li-Juan; Dai, Li-Rong. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(12): 1859-1872.
  • [43] "Learn2Sing: Target Speaker Singing Voice Synthesis by Learning from a Singing Teacher." Xue, Heyang; Yang, Shan; Lei, Yi; Xie, Lei; Li, Xiulin. 2021 IEEE Spoken Language Technology Workshop (SLT), 2021: 522-529.
  • [44] "Speaking Rate Estimation Based on Deep Neural Networks." Tomashenko, Natalia; Khokhlov, Yuri. Speech and Computer, 2014, 8773: 418-424.
  • [45] "Skin Lesion Detection Based on Deep Neural Networks." Choudhary, Priya; Singhai, Jyoti; Yadav, J. S. Chemometrics and Intelligent Laboratory Systems, 2022, 230.
  • [46] "WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN." Chandna, Pritish; Blaauw, Merlijn; Bonada, Jordi; Gomez, Emilia. 2019 27th European Signal Processing Conference (EUSIPCO), 2019.
  • [47] "WeSinger: Data-Augmented Singing Voice Synthesis with Auxiliary Losses." Zhang, Zewang; Zheng, Yibin; Li, Xinhui; Lu, Li. Interspeech 2022, 2022: 4252-4256.
  • [48] "MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis." Tae, Jaesung; Kim, Hyeongju; Lee, Younggun. 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), 2021.
  • [49] "DSUSing: Dual Scale U-Nets for Singing Voice Synthesis." Park, Hyunju; Woo, Jihwan. 2024 IEEE International Conference on Big Data and Smart Computing (BigComp 2024), 2024: 201-206.
  • [50] "XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System." Lu, Peiling; Wu, Jie; Luan, Jian; Tan, Xu; Zhou, Li. Interspeech 2020, 2020: 1306-1310.