A Revisit to Feature Handling for High-quality Voice Conversion Based on Gaussian Mixture Model

Times cited: 0
Authors
Suda, Hitoshi [1 ]
Kotani, Gaku [1 ]
Takamichi, Shinnosuke [2 ]
Saito, Daisuke [1 ]
Affiliations
[1] Univ Tokyo, Grad Sch Engn, Tokyo, Japan
[2] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan
Source
2018 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) | 2018
DOI: none
Chinese Library Classification (CLC): TM (Electrical engineering); TN (Electronic and communication technology)
Discipline codes: 0808; 0809
Abstract
This paper discusses the influence of acoustic feature handling on the quality of generated sounds in voice conversion (VC) systems based on Gaussian mixture models (GMMs). In the context of improving VC quality, mapping models, which are used to convert acoustic features, have been widely discussed, whereas the components other than the mapping models have rarely been studied. The experimental results show that VC quality depends not only on the mapping models but also on the methods used to analyze and synthesize utterances. This paper also investigates filtering methods for synthesis. To avoid the buzzy sounds generated by vocoders, differential-spectrum compensation is applied as an alternative method of synthesizing waveforms. Although mel log spectral approximation (MLSA) filtering is traditionally used for differential-spectrum compensation, the experimental results indicate that the approximation in MLSA filtering degrades the quality of the synthesized speech. To avoid this approximation, this paper introduces an alternative filtering method, named SP-WORLD, inspired by the WORLD vocoder framework. The subjective experiments demonstrate that SP-WORLD is comparable to MLSA filtering and outperforms it in some cases.
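To make the idea of differential-spectrum compensation concrete, the sketch below filters the input waveform directly with the per-frame ratio between the converted and source spectral envelopes, instead of resynthesizing speech through a vocoder. This is a minimal frequency-domain illustration of the general technique only: the paper's actual MLSA and SP-WORLD filters work differently, and the function name, frame parameters, and envelope arrays here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def differential_spectrum_filter(x, src_env, tgt_env, frame_len=1024, hop=256):
    """Illustrative differential-spectrum compensation.

    Rather than vocoding, each windowed frame of the input waveform x is
    multiplied in the frequency domain by the spectral differential
    (target envelope / source envelope) and reassembled by overlap-add,
    so the natural excitation of x is preserved.

    src_env, tgt_env: (n_frames, frame_len // 2 + 1) magnitude envelopes.
    """
    window = np.hanning(frame_len)
    y = np.zeros(len(x))
    for i in range(src_env.shape[0]):
        start = i * hop
        frame = x[start:start + frame_len]
        if len(frame) < frame_len:
            break  # drop the trailing partial frame
        spec = np.fft.rfft(frame * window)
        # Spectral differential; the floor avoids division by zero.
        gain = tgt_env[i] / np.maximum(src_env[i], 1e-12)
        y[start:start + frame_len] += np.fft.irfft(spec * gain) * window
    return y
```

Because the whole pipeline is linear in the spectral gain, doubling the target envelope simply doubles the output, which makes the behavior easy to sanity-check on synthetic envelopes.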
Pages: 816-822
Page count: 7
References
18 total (10 shown below)
  • [1] Abe M., 1988, ICASSP 88: 1988 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.88CH2561-9), P655, DOI 10.1109/ICASSP.1988.196671
  • [2] Aihara R, 2015, INT CONF ACOUST SPEE, P4899, DOI 10.1109/ICASSP.2015.7178902
  • [3] [Anonymous], 2008, Springer handbook of speech processing
  • [4] [Anonymous], 2014, INT SPEECH COMMUNICA
  • [5] Desai S., Raghavendra E. V., Yegnanarayana B., Black A. W., Prahallad K., 2009, Voice conversion using artificial neural networks, 2009 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), P3893+
  • [6] Kain A, 1998, INT CONF ACOUST SPEE, P285, DOI 10.1109/ICASSP.1998.674423
  • [7] Kawahara H., Masuda-Katsuse I., de Cheveigné A., 1999, Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds, Speech Communication, 27(3-4), P187-207
  • [8] Kurematsu A., Takeda K., Sagisaka Y., Katagiri S., Kuwabara H., Shikano K., 1990, ATR Japanese speech database as a tool of speech recognition and synthesis, Speech Communication, 9(4), P357-363
  • [9] Lee CH, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2254
  • [10] Morise M., Yokomori F., Ozawa K., 2016, WORLD: a vocoder-based high-quality speech synthesis system for real-time applications, IEICE Transactions on Information and Systems, E99-D(7), P1877-1884