A Syllable-Based Framework for Unit Selection Synthesis in 13 Indian Languages

Cited by: 0
Authors
Patil, Hemant A. [1 ]
Patel, Tanvina B. [1 ]
Shah, Nirmesh J. [1 ]
Sailor, Hardik B. [1 ]
Krishnan, Raghava [2 ]
Kasthuri, G. R. [2 ]
Nagarajan, T. [3 ]
Christina, Lilly [3 ]
Kumar, Naresh [4 ]
Raghavendra, Veera [4 ]
Kishore, S. P. [4 ]
Prasanna, S. R. M. [5 ]
Adiga, Nagaraj [5 ]
Singh, Sanasam Ranbir [5 ]
Anand, Konjengbam [5 ]
Kumar, Pranaw [6 ]
Singh, Bira Chandra [6 ]
Kumar, S. L. Binil [7 ]
Bhadran, T. G. [7 ]
Sajini, T. [7 ]
Saha, Arup [8 ]
Basu, Tulika [8 ]
Rao, K. Sreenivasa [9 ]
Narendra, N. P. [9 ]
Sao, Anil Kumar [10 ]
Kumar, Rakesh [10 ]
Talukdar, Pranhari [11 ]
Acharyaa, Purnendu [11 ]
Chandra, Somnath [12 ]
Lata, Swaran [12 ]
Murthy, Hema A. [2 ]
Affiliations
[1] Dhirubhai Ambani Inst Informat & Commun Technol D, Gandhinagar, India
[2] IIT Madras, Dept Comp Sci & Engn, Madras, Tamil Nadu, India
[3] SSN Coll Engn, Kalavakka, Tamil Nadu, India
[4] Int Inst Informat Technol, Hyderabad, Andhra Pradesh, India
[5] Indian Inst Technol, Gauhati, India
[6] CDAC, Bombay, Maharashtra, India
[7] CDAC, Trivandrum, Kerala, India
[8] CDAC, Kolkata, India
[9] Indian Inst Technol, Kharagpur, W Bengal, India
[10] Indian Inst Technol, Mandi, India
[11] Univ Guwahati, Gauhati, India
[12] Minist Informat Technol Govt India, Technol Dev Indian Languages, New Delhi, India
Source
2013 INTERNATIONAL CONFERENCE ORIENTAL COCOSDA HELD JOINTLY WITH 2013 CONFERENCE ON ASIAN SPOKEN LANGUAGE RESEARCH AND EVALUATION (O-COCOSDA/CASLRE) | 2013
Keywords
Indian languages; Text-to-Speech (TTS); text optimization; speaker selection; recording; labeling; pronunciation dictionary; SPEECH; QUALITY; MOS;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we discuss a consortium effort on building text-to-speech (TTS) systems for 13 Indian languages. There are about 1652 Indian languages. A unified framework is therefore required for building TTS systems for Indian languages. As Indian languages are syllable-timed, a syllable-based framework is developed. As quality of speech synthesis is of paramount interest, unit-selection synthesizers are built. Building TTS systems for low-resource languages requires that the data be carefully collected and annotated, as the database has to be built from scratch. Various criteria have to be addressed while building the database, namely speaker selection, pronunciation variation, optimal text selection, handling of out-of-vocabulary words, and so on. The various characteristics of the voice that affect speech synthesis quality are first analysed. Next, the design of the corpus for each of the Indian languages is tabulated. The collected data is labeled at the syllable level using a semiautomatic labeling tool. Text-to-speech synthesizers are built for all 13 languages, namely Hindi, Tamil, Marathi, Bengali, Malayalam, Telugu, Kannada, Gujarati, Rajasthani, Assamese, Manipuri, Odia and Bodo, using the same common framework. The TTS systems are evaluated using Degradation Mean Opinion Score (DMOS) and Word Error Rate (WER). An average DMOS of approximately 3.0 and an average WER of about 20% are observed across all the languages.
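The abstract reports WER but this record does not say how it was computed. As a minimal sketch, assuming the standard definition (substitutions + deletions + insertions, divided by the number of reference words) and simple whitespace tokenization, WER for one sentence could be computed as follows. The function name `word_error_rate` and the example strings are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are separated by whitespace. In a TTS listening
    test, `hypothesis` would be what a listener wrote down after hearing
    the synthesized version of `reference`.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word and one dropped word out of five reference words.
print(word_error_rate("a b c d e", "a x c d"))
```

Under this definition, a WER of about 20% corresponds to roughly one word-level error for every five reference words.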
Pages: 8