Singing voice synthesis based on deep neural networks

Cited by: 55

Authors:

Nishimura, Masanari [1]

Hashimoto, Kei [1]

Oura, Keiichiro [1]

Nankaku, Yoshihiko [1]

Tokuda, Keiichi [1]

Affiliations:

[1] Nagoya Inst Technol, Dept Sci & Engn Simulat, Nagoya, Aichi, Japan

Source:

17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016

Funding:

Japan Science and Technology Agency (JST);

Keywords:

Singing voice synthesis; Neural network; DNN; Acoustic model; HMM;

DOI:

10.21437/Interspeech.2016-1027

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206; 082403;

Abstract:

Singing voice synthesis techniques have been proposed based on a hidden Markov model (HMM). In these approaches, the spectrum, excitation, and duration of singing voices are simultaneously modeled with context-dependent HMMs, and waveforms are generated from the HMMs themselves. However, the quality of the synthesized singing voices still has not reached that of natural singing voices. Deep neural networks (DNNs) have largely improved on conventional approaches in various research areas, including speech recognition, image recognition, and speech synthesis. DNN-based text-to-speech (TTS) synthesis can synthesize high-quality speech. In a DNN-based TTS system, a DNN is trained to represent the mapping function from contextual features to acoustic features, which is modeled by decision-tree-clustered, context-dependent HMMs in an HMM-based TTS system. In this paper, we propose singing voice synthesis based on a DNN and evaluate its effectiveness. The relationship between the musical score and its acoustic features is modeled frame by frame by a DNN. To cope with the sparseness of pitch contexts in the database, musical-note-level pitch normalization and linear interpolation are used to prepare the excitation features. Subjective experimental results show that the DNN-based system outperformed the HMM-based system in terms of naturalness.
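
The abstract names two steps for preparing excitation features, note-level pitch normalization and linear interpolation, without giving formulas. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that normalization means subtracting the log frequency of the score note from the observed log F0, that unvoiced frames are marked by an F0 of 0, and that notes are given as MIDI numbers; the function names `midi_to_hz`, `interpolate_unvoiced`, and `normalize_by_note` are hypothetical.

```python
import math


def midi_to_hz(note):
    """Convert a MIDI note number to Hz (assumes A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def interpolate_unvoiced(log_f0):
    """Fill unvoiced frames (None) by linear interpolation between the
    nearest voiced neighbours; edge gaps copy the nearest voiced value."""
    voiced = [i for i, v in enumerate(log_f0) if v is not None]
    if not voiced:
        raise ValueError("no voiced frames to interpolate from")
    out = list(log_f0)
    # Leading and trailing unvoiced runs: hold the nearest voiced value.
    for i in range(voiced[0]):
        out[i] = log_f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(out)):
        out[i] = log_f0[voiced[-1]]
    # Interior gaps: straight line between the surrounding voiced frames.
    for a, b in zip(voiced, voiced[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            out[i] = (1 - t) * log_f0[a] + t * log_f0[b]
    return out


def normalize_by_note(f0_hz, frame_notes):
    """Return per-frame log F0 relative to the score note (assumed form).

    f0_hz: per-frame F0 in Hz, with 0.0 marking unvoiced frames.
    frame_notes: per-frame MIDI note number taken from the score.
    """
    log_f0 = [math.log(f) if f > 0 else None for f in f0_hz]
    continuous = interpolate_unvoiced(log_f0)
    return [lf - math.log(midi_to_hz(n))
            for lf, n in zip(continuous, frame_notes)]


if __name__ == "__main__":
    # Four frames: A4, unvoiced, A4, A5, sung exactly on the score pitch.
    print(normalize_by_note([440.0, 0.0, 440.0, 880.0], [69, 69, 69, 81]))
```

Under these assumptions, the target becomes a small deviation around the written note rather than an absolute pitch, so a network would not need training examples at every pitch in the score. Whether interpolation is applied before or after normalization is a design choice the abstract does not specify.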

Pages: 2478-2482

Page count: 5

