PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH

被引:10
作者
Karlapati, Sri [1 ]
Abbas, Ammar [1 ]
Hodari, Zack [2 ]
Moinet, Alexis [1 ]
Joly, Arnaud [1 ]
Karanasou, Penny [1 ]
Drugman, Thomas [1 ]
机构
[1] Amazon Res, Cambridge, England
[2] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
来源
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021年
关键词
TTS; prosody modelling; contextual prosody;
D O I
10.1109/ICASSP39728.2021.9413696
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
引用
收藏
页码:6573 / 6577
页数:5
相关论文
共 50 条
  • [1] ON GRANULARITY OF PROSODIC REPRESENTATIONS IN EXPRESSIVE TEXT-TO-SPEECH
    Babianski, Mikolaj
    Pokora, Kamil
    Shah, Raahil
    Sienkiewicz, Rafal
    Korzekwa, Daniel
    Klimkov, Viacheslav
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 892 - 899
  • [2] Expressive Text-to-Speech with Contextual Background for ICAGC 2024
    Jiang, Yu
    Wang, Tianrui
    Wang, Haoyu
    Gong, Cheng
    Liu, Qiuyu
    Huang, Zikang
    Wang, Longbiao
    Dang, Jianwu
    2024 IEEE 14TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, ISCSLP 2024, 2024, : 611 - 615
  • [3] Decoding Knowledge Transfer for Neural Text-to-Speech Training
    Liu, Rui
    Sisman, Berrak
    Gao, Guanglai
    Li, Haizhou
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1789 - 1802
  • [4] FastTalker: A neural text-to-speech architecture with shallow and group autoregression
    Liu, Rui
    Sisman, Berrak
    Lin, Yixing
    Li, Haizhou
    NEURAL NETWORKS, 2021, 141 : 306 - 314
  • [5] GRAPHTTS: GRAPH-TO-SEQUENCE MODELLING IN NEURAL TEXT-TO-SPEECH
    Sun, Aolan
    Wang, Jianzong
    Cheng, Ning
    Peng, Huayi
    Zeng, Zhen
    Xiao, Jing
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6719 - 6723
  • [6] PROSODYSPEECH: TOWARDS ADVANCED PROSODY MODEL FOR NEURAL TEXT-TO-SPEECH
    Yi, Yuanhao
    He, Lei
    Pan, Shifeng
    Wang, Xi
    Xiao, Yujia
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7582 - 7586
  • [7] UNSUPERVISED POLYGLOT TEXT-TO-SPEECH
    Nachmani, Eliya
    Wolf, Lior
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7055 - 7059
  • [8] PARAMETER GENERATION ALGORITHMS FOR TEXT-TO-SPEECH SYNTHESIS WITH RECURRENT NEURAL NETWORKS
    Klimkov, Viacheslav
    Moinet, Alexis
    Nadolski, Adam
    Drugman, Thomas
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 626 - 631
  • [9] Cross-Dialect Adaptation Framework for Constructing Prosodic Models for Chinese Dialect Text-to-Speech Systems
    Chiang, Chen-Yu
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2018, 26 (01) : 108 - 121
  • [10] A Comparative Study of Text-to-Speech Systems in LabVIEW
    Panoiu, Manuela
    Rat, Cezara-Liliana
    Panoiu, Caius
    SOFT COMPUTING APPLICATIONS, (SOFA 2014), VOL 1, 2016, 356 : 3 - 11