HMM-Based Voice Conversion Using Quantized F0 Context

Citations: 8
Authors
Nose, Takashi [1]
Ota, Yuhei [1]
Kobayashi, Takao [1]
Affiliation
[1] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan
Keywords
voice conversion; HMM-based speech synthesis; F0; quantization; prosodic context; nonparallel data
DOI
10.1587/transinf.E93.D.2483
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We propose a segment-based voice conversion technique using hidden Markov model (HMM)-based speech synthesis with nonparallel training data. In the proposed technique, the phoneme information with durations and a quantized F0 contour are extracted from the input speech of a source speaker and transmitted to a synthesis part. In the synthesis part, the quantized F0 symbols are used as prosodic context. A phonetically and prosodically context-dependent label sequence is generated from the transmitted phonemes and F0 symbols. Converted speech is then generated from the label sequence with durations using the target speaker's pre-trained context-dependent HMMs. In model training, the models of the source and target speakers can be trained separately, so there is no need to prepare parallel speech data of the source and target speakers. Objective and subjective experimental results show that segment-based voice conversion with phonetic and prosodic contexts works effectively even when parallel speech data are not available.
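The labeling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the F0 contour is quantized or how labels are formatted, so the sketch assumes a per-phoneme mean of log F0, uniform quantization between the utterance's minimum and maximum voiced log F0, and a hypothetical label format `phoneme/F0:symbol`. The function name `quantize_f0_context` and all parameters are illustrative.

```python
import math


def quantize_f0_context(segments, f0_track, frame_shift=0.005, levels=4):
    """Build (label, n_frames) pairs from phoneme segments and an F0 track.

    segments: list of (phoneme, start_sec, end_sec) tuples.
    f0_track: per-frame F0 in Hz, with 0 marking unvoiced frames.
    Assumption: uniform quantization of the segment-mean log F0.
    """
    voiced = [math.log(f) for f in f0_track if f > 0]
    if voiced:
        lo, hi = min(voiced), max(voiced)
        # Fall back to a unit step when the contour is flat.
        step = (hi - lo) / levels or 1.0
    labels = []
    for phoneme, start, end in segments:
        first = int(round(start / frame_shift))
        last = int(round(end / frame_shift))
        seg = [math.log(f) for f in f0_track[first:last] if f > 0]
        if not seg:
            symbol = "x"  # hypothetical symbol for a fully unvoiced segment
        else:
            mean = sum(seg) / len(seg)
            # Clamp so the maximum value falls into the top level.
            symbol = str(min(int((mean - lo) / step), levels - 1))
        # Hypothetical label format: phoneme plus quantized F0 context.
        labels.append((f"{phoneme}/F0:{symbol}", last - first))
    return labels


if __name__ == "__main__":
    track = [0, 0, 100, 100, 150, 150, 400, 400]
    segs = [("s", 0.0, 0.02), ("a", 0.02, 0.04),
            ("k", 0.04, 0.06), ("i", 0.06, 0.08)]
    for label, n_frames in quantize_f0_context(segs, track, frame_shift=0.01):
        print(label, n_frames)
```

Only the symbol sequence and the durations would need to reach the synthesis part in such a scheme, where each label would select a context-dependent model of the target speaker.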
Pages: 2483-2490
Page count: 8