A Hybrid HMM-Waveglow based Text-to-speech Synthesizer using Histogram Equalization for Low resource Indian Languages

Cited by: 1

Authors:

Kumar, Mano Ranjith M. [1]

Srivastava, Sudhanshu [1]

Prakash, Anusha [1]

Murthy, Hema A. [1]

Affiliations:

[1] Indian Inst Technol, Madras, Tamil Nadu, India

Source:

INTERSPEECH 2020 | 2020

Keywords:

Speech synthesis; Histogram equalization; HMM-based speech synthesis; Waveglow; Hybrid systems

DOI:

10.21437/Interspeech.2020-3180

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104; 100213

Abstract:

Conventional text-to-speech (TTS) synthesis requires extensive linguistic processing to produce quality output. The advent of end-to-end (E2E) systems has caused a paradigm shift, yielding better synthesized voices. However, hidden Markov model (HMM) based systems remain popular because of their fast synthesis time, robustness to limited training data, and flexible adaptation of voice characteristics, speaking styles, and emotions. This paper proposes a technique that combines the classical parametric HMM-based TTS framework (HTS) with the neural-network-based Waveglow vocoder using histogram equalization (HEQ) in a low-resource environment. The two paradigms are combined by performing HEQ between mel-spectrograms extracted from HTS-generated audio and the source spectra of the training data. During testing, the synthesized mel-spectrograms are mapped to the source spectrograms using the learned HEQ. Experiments are carried out on the Hindi male and female datasets of the Indic TTS database. Systems are evaluated using degradation mean opinion scores (DMOS). Results indicate that the synthesis quality of the hybrid system is better than that of the conventional HTS system. These results are promising, as they pave the way to good-quality TTS systems that need less data than E2E systems.
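The abstract does not give implementation details for the HEQ step, so the following is only a minimal sketch of one common way to realize histogram equalization: quantile (CDF) matching applied independently to each mel bin. The function names `fit_heq` and `apply_heq`, the per-bin treatment, the quantile count, and the random stand-in data are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def fit_heq(synth_mels, natural_mels, n_quantiles=100):
    """Learn a per-mel-bin quantile table (a table-based form of HEQ).

    Both inputs have shape (frames, n_mels). The per-bin treatment and
    the quantile count are illustrative assumptions, not the paper's.
    """
    probs = np.linspace(0.0, 1.0, n_quantiles)
    src_q = np.quantile(synth_mels, probs, axis=0)    # (n_quantiles, n_mels)
    tgt_q = np.quantile(natural_mels, probs, axis=0)  # (n_quantiles, n_mels)
    return src_q, tgt_q

def apply_heq(mel, src_q, tgt_q):
    """Map each bin of `mel` from the synthetic to the natural distribution."""
    out = np.empty(mel.shape, dtype=float)
    for b in range(mel.shape[1]):
        # Piecewise-linear lookup: a value at the p-th synthetic quantile
        # is sent to the p-th natural quantile of the same bin.
        out[:, b] = np.interp(mel[:, b], src_q[:, b], tgt_q[:, b])
    return out

# Hypothetical stand-in data: random values in place of real mel-spectrograms.
rng = np.random.default_rng(0)
synth = rng.normal(-6.0, 0.5, size=(2000, 80))    # "HTS-generated" features
natural = rng.normal(-4.0, 1.5, size=(2000, 80))  # "training data" features
src_q, tgt_q = fit_heq(synth, natural)
mapped = apply_heq(synth, src_q, tgt_q)
```

In a hybrid pipeline of the kind the abstract describes, the mapped mel-spectrogram would then be passed to the neural vocoder; that stage is omitted here.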


Pages: 2037-2041

Page count: 5

References (25 in total):

[1] [Anonymous], 2013, SSW

[2] [Anonymous], 2017, The LJ Speech Dataset

[3] Baby A., 2016, CBBLR-Community-Based Building of Language Resources, P37

[4] Deep Learning Techniques in Tandem with Signal Processing Cues for Phonetic Segmentation for Text to Speech Synthesis in Indian Languages [J]. Baby, Arun; Prakash, Jeena J.; Vignesh, Rupak; Murthy, Hema A. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017: 3817-3821

[5] A Unified Parser for Developing Indian Language Text to Speech Synthesizers [J]. Baby, Arun; Nishanthi, N. L.; Thomas, Anju Leela; Murthy, Hema A. TEXT, SPEECH, AND DIALOGUE, 2016, 9924: 514-521

[6] Black A., 1998, The Festival Speech Synthesis System

[7] Black A W., 1994, Proceedings of International Conference on Computational Linguistics (COLING'94), V2, P983

[8] Bulut M, 2010, HUMAN-CENTRIC INTERFACES FOR AMBIENT INTELLIGENCE, P255, DOI 10.1016/B978-0-12-374708-2.00010-3

[9] de la Torre A., 2002, P IEEE INT C AC SPEE, V1

[10] Class-Based Parametric Approximation to Histogram Equalization for ASR [J]. Garcia, Luz; Benitez Ortuzar, Carmen; De la Torre, Angel; Segura, Jose C. IEEE SIGNAL PROCESSING LETTERS, 2012, 19(07): 415-418
