Mandarin-Tibetan Cross-Lingual Voice Conversion System Based on Deep Neural Network

Cited by: 1
Authors
Gan, Zhenye [1 ,2 ]
Xing, Xiaotian [1 ]
Yang, Hongwu [1 ,2 ]
Zhao, Guangying [1 ]
Affiliations
[1] Northwest Normal Univ, Coll Phys & Elect Engn, Lanzhou 730000, Gansu, Peoples R China
[2] Engn Res Ctr Gansu Prov Intelligent Informat Tech, Lanzhou 730000, Gansu, Peoples R China
Source
PROCEEDINGS OF 2018 THE 2ND INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE (CSAI 2018) / 2018 THE 10TH INTERNATIONAL CONFERENCE ON INFORMATION AND MULTIMEDIA TECHNOLOGY (ICIMT 2018) | 2018
Funding
National Natural Science Foundation of China;
Keywords
Cross-lingual voice conversion; speech recognition; speech synthesis; DNN;
DOI
10.1145/3297156.3297221
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents a Mandarin-Tibetan cross-lingual voice conversion system that addresses the communication barrier between Mandarin and Tibetan speakers. Mandarin speech recognition and Tibetan speech synthesis techniques based on a deep neural network (DNN) are adopted to convert Mandarin speech into Tibetan speech, which avoids the need to build a large parallel corpus and complex conversion rules. The converted Tibetan speech features are then modified so that the output is perceived as a sentence uttered by the original Mandarin speaker. Experimental results show a Mean Opinion Score (MOS) of 3.26 for the converted Tibetan speech and a Degradation Mean Opinion Score (DMOS) of 3.07 for the timbre similarity between the converted Tibetan speech and the original Mandarin speech.
Pages: 67-71
Page count: 5