LITESING: TOWARDS FAST, LIGHTWEIGHT AND EXPRESSIVE SINGING VOICE SYNTHESIS

Cited by: 8

Authors:

Zhuang, Xiaobin [1]

Jiang, Tao [1]

Chou, Szu-Yu [1]

Wu, Bin [1]

Hu, Peng [1]

Lui, Simon [1]

Affiliations:

[1] Tencent Music Entertainment, Shenzhen, People's Republic of China

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021

Keywords:

singing voice synthesis; non-autoregressive model; generative adversarial network; lightweight; expressive;

DOI:

10.1109/ICASSP39728.2021.9414043

Chinese Library Classification number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

LiteSing, proposed in this paper, is a high-quality singing voice synthesis (SVS) system that is fast, lightweight and expressive. The model stacks several non-autoregressive WaveNet blocks in the encoder and decoder under a generative adversarial architecture, predicts full conditions from the musical score, and generates acoustic features from these conditions. The full conditions consist of dynamic spectrogram energy, the voiced/unvoiced (V/UV) decision and the dynamic pitch curve, which are shown to be related to expressiveness. We predict the pitch and the timbre features separately, avoiding interdependence between these two features. Instead of a neural network vocoder, a parametric WORLD vocoder is employed for pitch curve consistency. Experimental results show that, compared with a baseline model using a feed-forward Transformer, LiteSing is 1.386 times faster in inference speed and 15 times smaller in number of training parameters, while achieving a similar MOS on sound quality. In an A/B test, LiteSing achieves a 67.3% preference rate over the baseline in pitch curve and dynamic spectrogram energy prediction, which demonstrates the advantage of LiteSing over the other compared models.
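
To make the three "full conditions" named in the abstract concrete, the following is a minimal sketch of how such frame-level features could be derived from an analysed recording. It is not the authors' method: the abstract does not define the features precisely, so the function name `full_conditions`, the L2-norm definition of energy, and the linear interpolation of log-F0 across unvoiced frames are all assumptions made for illustration. The only external convention relied on is that WORLD-style F0 tracks mark unvoiced frames with 0.

```python
import math

def full_conditions(spectrogram, f0):
    """Return per-frame (energy, vuv, log_f0).

    Assumptions (illustrative, not taken from the paper):
      - energy is the L2 norm of each magnitude-spectrogram frame;
      - a frame is unvoiced when its F0 value is 0;
      - log-F0 is linearly interpolated across unvoiced frames so the
        pitch curve is continuous.
    """
    if len(spectrogram) != len(f0):
        raise ValueError("spectrogram and f0 must have the same number of frames")

    energy = [math.sqrt(sum(x * x for x in frame)) for frame in spectrogram]
    vuv = [1 if hz > 0 else 0 for hz in f0]

    voiced = [i for i, v in enumerate(vuv) if v]
    if not voiced:
        # No voiced frame at all: there is no pitch to interpolate.
        return energy, vuv, [0.0] * len(f0)

    log_f0 = []
    for i, hz in enumerate(f0):
        if hz > 0:
            log_f0.append(math.log(hz))
            continue
        prev = max((j for j in voiced if j < i), default=None)
        nxt = min((j for j in voiced if j > i), default=None)
        if prev is None:
            log_f0.append(math.log(f0[nxt]))    # leading unvoiced frames
        elif nxt is None:
            log_f0.append(math.log(f0[prev]))   # trailing unvoiced frames
        else:
            w = (i - prev) / (nxt - prev)
            log_f0.append((1 - w) * math.log(f0[prev]) + w * math.log(f0[nxt]))
    return energy, vuv, log_f0
```

For example, `full_conditions([[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]], [220.0, 0.0, 440.0])` marks the middle frame as unvoiced and fills its log-F0 with the midpoint of its two voiced neighbours. Keeping the V/UV flag separate from a continuous pitch curve is one common way to let a model predict pitch without discontinuities at unvoiced frames.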


Pages: 7078 - 7082

Page count: 5

50 records in total

[1] MLP SINGER: TOWARDS RAPID PARALLEL KOREAN SINGING VOICE SYNTHESIS
Tae, Jaesung
Kim, Hyeongju
Lee, Younggun
2021 IEEE 31ST INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2021
[2] Expressive Singing Synthesis based on Unit Selection for the Singing Synthesis Challenge 2016
Bonada, Jordi
Umbert, Marti
Blaauw, Merlijn
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 1230 - 1234
[3] FGP-GAN: Fine-Grained Perception Integrated Generative Adversarial Network for Expressive Mandarin Singing Voice Synthesis
Liu, Xin
Zhang, Weiwei
Zheng, Zhaohui
Pan, Mingyang
Wang, Rong
IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2024, 70 (03) : 6054 - 6063
[4] Expressive control of singing voice synthesis using musical contexts and a parametric F0 model
Ardaillon, Luc
Chabot-Canet, Celine
Roebel, Axel
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 1250 - 1254
[5] Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information
Zhou, Shaohuan
Lei, Shun
You, Weiya
Tuo, Deyi
You, Yuren
Wu, Zhiyong
Kang, Shiyin
Meng, Helen
INTERSPEECH 2022, 2022, : 4292 - 4296
[6] SINGING VOICE SYNTHESIS BASED ON GENERATIVE ADVERSARIAL NETWORKS
Hono, Yukiya
Hashimoto, Kei
Oura, Keiichiro
Nankaku, Yoshihiko
Tokuda, Keiichi
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6955 - 6959
[7] SinTechSVS: A Singing Technique Controllable Singing Voice Synthesis System
Zhao, Junchuan
Chetwin, Low Qi Hong
Wang, Ye
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 2641 - 2653
[8] FAST AND HIGH-QUALITY SINGING VOICE SYNTHESIS SYSTEM BASED ON CONVOLUTIONAL NEURAL NETWORKS
Nakamura, Kazuhiro
Takaki, Shinji
Hashimoto, Kei
Oura, Keiichiro
Nankaku, Yoshihiko
Tokuda, Keiichi
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7239 - 7243
[9] Singing Voice Synthesis System for Carnatic Music
Rajan, Ragesh M.
2018 5TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND INTEGRATED NETWORKS (SPIN), 2018, : 831 - 835
[10] MusicFace: Music-driven expressive singing face synthesis
Liu, Pengfei
Deng, Wenjin
Li, Hengda
Wang, Jintai
Zheng, Yinglin
Ding, Yiwei
Guo, Xiaohu
Zeng, Ming
COMPUTATIONAL VISUAL MEDIA, 2024, 10 (01) : 119 - 136
