A continuous vocoder for statistical parametric speech synthesis and its evaluation using an audio-visual phonetically annotated Arabic corpus

Cited by: 6
Authors
Al-Radhi, Mohammed Salah [1 ]
Abdo, Omnia [2 ]
Csapo, Tamas Gabor [1 ,4 ]
Abdou, Sherif [3 ]
Nemeth, Geza [1 ]
Fashal, Mervat [2 ]
Affiliations
[1] Budapest Univ Technol & Econ, Dept Telecommun & Media Informat, Budapest, Hungary
[2] Alexandria Univ, Dept Phonet & Linguist, Alexandria, Egypt
[3] Cairo Univ, Fac Comp & Informat, Giza, Egypt
[4] MTA ELTE Lendulet Lingual Articulat Res Grp, Budapest, Hungary
Keywords
Speech synthesis; Continuous vocoder; Envelope; Arabic; PLUS NOISE MODEL; ENVELOPE; INTELLIGIBILITY; EXTRACTION; HMM;
DOI
10.1016/j.csl.2019.101025
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we present an extension of a novel continuous residual-based vocoder for statistical parametric speech synthesis, addressing two objectives. First, because the noise component is often not accurately modelled in modern vocoders (e.g. STRAIGHT), we propose a new technique for modelling unvoiced sounds: a time-domain envelope is applied to the unvoiced segments to avoid residual buzziness. Four time-domain envelopes (Amplitude, Hilbert, Triangular and True) are investigated, enhanced, and then applied to the noise component of the excitation in our continuous vocoder, i.e. one in which all parameters are continuous. Second, with the future aim of producing high-quality Arabic speech synthesis, we apply this vocoder to a Modern Standard Arabic audio-visual corpus that is annotated both phonetically and visually and is dedicated to emotional speech processing studies. In an objective experiment, we measured Phase Distortion Deviation, while a MUSHRA-type subjective listening test compared natural and vocoded speech samples. Both experiments show that the proposed noise modelling yields satisfactory naturalness and intelligibility, outperforming STRAIGHT and earlier residual-based approaches. (C) 2019 Elsevier Ltd. All rights reserved.
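As a rough illustration of the envelope-modulated noise idea summarised in the abstract, the Hilbert-envelope variant can be sketched as follows. This is a minimal sketch only: the function name and the toy residual frame are illustrative, not from the paper, and the paper's adaptive enhancement of the estimated envelopes is omitted. It assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy.signal import hilbert

def shape_noise_with_hilbert_envelope(residual_frame, rng=None):
    """Shape white noise with the Hilbert (analytic-signal) envelope of a
    residual frame, so the unvoiced excitation follows the frame's
    time-domain amplitude contour instead of being flat, buzzy noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    # instantaneous amplitude = magnitude of the analytic signal
    envelope = np.abs(hilbert(residual_frame))
    noise = rng.standard_normal(len(residual_frame))
    return envelope * noise

# toy residual frame: an exponentially decaying burst
frame = np.exp(-np.linspace(0, 5, 200)) * np.sin(np.linspace(0, 40, 200))
shaped = shape_noise_with_hilbert_envelope(frame)
print(shaped.shape)  # (200,)
```

The Amplitude, Triangular, and True envelopes mentioned in the abstract would replace the `np.abs(hilbert(...))` step with their respective envelope estimators; the noise-modulation step stays the same.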
Pages: 15