CONTROLLING EMOTION STRENGTH WITH RELATIVE ATTRIBUTE FOR END-TO-END SPEECH SYNTHESIS

Cited by: 0
Authors
Zhu, Xiaolian [1 ,2 ]
Yang, Shan [1 ]
Yang, Geng [1 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian, Peoples R China
[2] Hebei Univ Econ & Business, Publ Comp Educ Ctr, Shijiazhuang, Hebei, Peoples R China
Source
2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019) | 2019
Keywords
Emotion strength; relative attributes; speech synthesis; text-to-speech; end-to-end;
DOI
10.1109/asru46091.2019.9003829
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, attention-based end-to-end speech synthesis has achieved superior performance compared to traditional speech synthesis models, and several approaches, such as global style tokens, have been proposed to explore the style controllability of the end-to-end model. Although the existing methods show good performance in style disentanglement and transfer, they are still unable to explicitly control the emotion of the generated speech. In this paper, we focus on the subtle control of expressive speech synthesis, where the emotion category and strength can be easily controlled with a discrete emotional vector and a simple continuous scalar, respectively. The continuous strength controller is learned by a ranking function according to the relative attribute measured on an emotion dataset. Our method automatically learns the relationship between low-level acoustic features and high-level subtle emotion strength. Experiments show that our method can effectively improve the controllability of an expressive end-to-end model.
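The ranking function the abstract mentions follows the relative-attribute idea: given pairs of utterances ordered by emotion strength, learn a weight vector whose inner product with acoustic features serves as a continuous strength score. The following is a minimal, hypothetical sketch of that pairwise-ranking step; the feature dimensions, data, and function name are illustrative, not the paper's actual implementation.

```python
import numpy as np

def learn_ranking_weight(pairs, dim, lr=0.1, epochs=200, margin=1.0):
    """Learn w so that w @ x_strong >= w @ x_weak + margin for each
    ordered pair, via a perceptron-style update on the pairwise
    hinge condition (a simplified stand-in for a ranking SVM)."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_strong, x_weak in pairs:
            diff = x_strong - x_weak
            if w @ diff < margin:   # margin violated -> move w toward diff
                w += lr * diff
    return w

# Toy acoustic-like features: dimension 0 loosely tracks "strength".
rng = np.random.default_rng(0)
weak = rng.normal(0.0, 0.1, size=(20, 3))
strong = rng.normal(0.0, 0.1, size=(20, 3))
strong[:, 0] += 1.0

w = learn_ranking_weight(list(zip(strong, weak)), dim=3)
scores_weak = weak @ w
scores_strong = strong @ w
print(scores_strong.mean() > scores_weak.mean())  # stronger set scores higher
```

Once such a weight is learned, the scalar score `w @ x` can be normalized and fed to the synthesis model as the continuous strength control the abstract describes.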
Pages: 192-199
Number of pages: 8
Related papers
50 records in total
[31]   IMPROVING MANDARIN END-TO-END SPEECH SYNTHESIS BY SELF-ATTENTION AND LEARNABLE GAUSSIAN BIAS [J].
Yang, Fengyu ;
Yang, Shan ;
Zhu, Pengcheng ;
Yan, Pengju ;
Xie, Lei .
2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, :208-213
[32]   End-to-end text-to-speech synthesis with unaligned multiple language units based on attention [J].
Aso, Masashi ;
Takamichi, Shinnosuke ;
Saruwatari, Hiroshi .
INTERSPEECH 2020, 2020, :4009-4013
[33]   ESPnet: End-to-End Speech Processing Toolkit [J].
Watanabe, Shinji ;
Hori, Takaaki ;
Karita, Shigeki ;
Hayashi, Tomoki ;
Nishitoba, Jiro ;
Unno, Yuya ;
Soplin, Nelson Enrique Yalta ;
Heymann, Jahn ;
Wiesner, Matthew ;
Chen, Nanxin ;
Renduchintala, Adithya ;
Ochiai, Tsubasa .
19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, :2207-2211
[34]   Review of End-to-End Streaming Speech Recognition [J].
Wang, Aohui ;
Zhang, Long ;
Song, Wenyu ;
Meng, Jie .
Computer Engineering and Applications, 2024, 59 (02) :22-33
[35]   Neurorecognition visualization in multitask end-to-end speech [J].
Mamyrbayev, Orken ;
Pavlov, Sergii ;
Bekarystankyzy, Akbayan ;
Oralbekova, Dina ;
Zhumazhanov, Bagashar ;
Azarova, Larysa ;
Mussayeva, Dinara ;
Koval, Tetiana ;
Gromaszek, Konrad ;
Issimov, Nurdaulet ;
Shiyapov, Kadrzhan .
OPTICAL FIBERS AND THEIR APPLICATIONS 2023, 2024, 12985
[36]   An Overview of End-to-End Automatic Speech Recognition [J].
Wang, Dong ;
Wang, Xiaodong ;
Lv, Shaohe .
SYMMETRY-BASEL, 2019, 11 (08)
[37]   End-to-End Localization and Ranking for Relative Attributes [J].
Singh, Krishna Kumar ;
Lee, Yong Jae .
COMPUTER VISION - ECCV 2016, PT VI, 2016, 9910 :753-769
[38]   On the localness modeling for the self-attention based end-to-end speech synthesis [J].
Yang, Shan ;
Lu, Heng ;
Kang, Shiyin ;
Xue, Liumeng ;
Xiao, Jinba ;
Su, Dan ;
Xie, Lei ;
Yu, Dong .
NEURAL NETWORKS, 2020, 125 :121-130
[39]   USING SPEECH SYNTHESIS TO TRAIN END-TO-END SPOKEN LANGUAGE UNDERSTANDING MODELS [J].
Lugosch, Loren ;
Meyer, Brett H. ;
Nowrouzezahrai, Derek ;
Ravanelli, Mirco .
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, :8499-8503
[40]   LEARNING LATENT REPRESENTATIONS FOR STYLE CONTROL AND TRANSFER IN END-TO-END SPEECH SYNTHESIS [J].
Zhang, Ya-Jie ;
Pan, Shifeng ;
He, Lei ;
Ling, Zhen-Hua .
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, :6945-6949