Sequence-to-Sequence Models for Emphasis Speech Translation

Cited by: 11
Authors
Quoc Truong Do [1]
Sakti, Sakriani [1,2]
Nakamura, Satoshi [1,2]
Affiliations
[1] Nara Inst Sci & Technol, Ikoma 6300192, Japan
[2] RIKEN Ctr Adv Intelligence Project AIP, Ikoma 6300192, Japan
Keywords
Emphasis estimation; emphasis translation; speech-to-speech translation (S2ST); joint optimization of words and emphasis; GENERATION
DOI
10.1109/TASLP.2018.2846402
Chinese Library Classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Speech-to-speech translation (S2ST) systems can break language barriers in cross-lingual communication by translating speech across languages. Recent studies have introduced many improvements that allow existing S2ST systems to handle not only linguistic meaning but also paralinguistic information such as emphasis, by adding emphasis estimation and emphasis translation components. However, the approach used for emphasis translation is not optimal for sequence translation tasks and cannot easily handle the long-term dependencies between words and emphasis levels. It also requires quantizing emphasis levels and treats them as discrete labels rather than continuous values. Moreover, the whole translation pipeline is fairly complex and slow because all components are trained separately without joint optimization. In this paper, we make two contributions: 1) we propose an approach based on sequence-to-sequence models that can handle continuous emphasis levels, and 2) we combine machine translation and emphasis translation into a single model, which greatly simplifies the translation pipeline and makes it easier to perform joint optimization. Our results on an emphasis translation task indicate that our translation models outperform previous models by a large margin in both objective and subjective tests. Experiments on a joint translation model also show that our models can jointly translate words and emphasis with one-word delays instead of full-sentence delays while preserving the translation performance of both tasks.
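As a rough illustration of the joint approach described in the abstract (a single sequence-to-sequence model that predicts target words as discrete labels and emphasis as continuous values), the sketch below shows one possible encoder-decoder LSTM with two output heads and a combined loss. It is an assumption-laden sketch, not the authors' implementation: the class name JointEmphasisSeq2Seq, the layer sizes, and the loss weight alpha are all illustrative choices.

```python
# Minimal sketch (assumed architecture, not the paper's exact model) of joint
# word-and-emphasis sequence-to-sequence translation with continuous emphasis.
import torch
import torch.nn as nn

class JointEmphasisSeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        # Encoder input: word embedding concatenated with its continuous emphasis level (+1 dim).
        self.encoder = nn.LSTM(emb_dim + 1, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.word_head = nn.Linear(hid_dim, tgt_vocab)  # discrete target-word prediction
        self.emph_head = nn.Linear(hid_dim, 1)          # continuous target-emphasis regression

    def forward(self, src_words, src_emph, tgt_words):
        # src_words: (B, S) int, src_emph: (B, S) float in [0, 1], tgt_words: (B, T) int.
        enc_in = torch.cat([self.src_emb(src_words), src_emph.unsqueeze(-1)], dim=-1)
        _, state = self.encoder(enc_in)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_words), state)
        return self.word_head(dec_out), self.emph_head(dec_out).squeeze(-1)

def joint_loss(word_logits, emph_pred, tgt_words, tgt_emph, alpha=0.5):
    # Joint optimization: cross-entropy over words plus MSE over continuous emphasis.
    ce = nn.functional.cross_entropy(word_logits.transpose(1, 2), tgt_words)
    mse = nn.functional.mse_loss(emph_pred, tgt_emph)
    return ce + alpha * mse
```

Training both heads against a single weighted loss is what allows the joint optimization mentioned in the abstract; the one-word-delay behaviour would additionally require an incremental decoding scheme, which is omitted here for brevity.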
Pages: 1873 - 1883
Page count: 11
Related papers
50 items in total
  • [1] Jia, Ye; Weiss, Ron J.; Biadsy, Fadi; Macherey, Wolfgang; Johnson, Melvin; Chen, Zhifeng; Wu, Yonghui. Direct speech-to-speech translation with a sequence-to-sequence model. INTERSPEECH 2019, 2019: 1123-1127.
  • [2] Quoc Truong Do; Sakti, Sakriani; Nakamura, Satoshi. Toward Expressive Speech Translation: A Unified Sequence-to-Sequence LSTMs Approach for Translating Words and Emphasis. INTERSPEECH 2017, 2017: 2640-2644.
  • [3] Prabhavalkar, Rohit; Rao, Kanishka; Sainath, Tara N.; Li, Bo; Johnson, Leif; Jaitly, Navdeep. A Comparison of Sequence-to-Sequence Models for Speech Recognition. INTERSPEECH 2017, 2017: 939-943.
  • [4] Yang, Gene-Ping; Tang, Hao. Supervised Attention in Sequence-to-Sequence Models for Speech Recognition. IEEE ICASSP 2022, 2022: 7222-7226.
  • [5] Weiss, Ron J.; Chorowski, Jan; Jaitly, Navdeep; Wu, Yonghui; Chen, Zhifeng. Sequence-to-Sequence Models Can Directly Translate Foreign Speech. INTERSPEECH 2017, 2017: 2625-2629.
  • [6] Chiu, Chung-Cheng; Sainath, Tara N.; Wu, Yonghui; Prabhavalkar, Rohit; Nguyen, Patrick; Chen, Zhifeng; Kannan, Anjuli; Weiss, Ron J.; Rao, Kanishka; Gonina, Ekaterina; Jaitly, Navdeep; Li, Bo; Chorowski, Jan; Bacchiani, Michiel. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. IEEE ICASSP 2018, 2018: 4774-4778.
  • [7] Unni, Vinit; Joshi, Nitish; Jyothi, Preethi. Coupled Training of Sequence-to-Sequence Models for Accented Speech Recognition. IEEE ICASSP 2020, 2020: 8254-8258.
  • [8] Ueno, Sei; Mimura, Masato; Sakai, Shinsuke; Kawahara, Tatsuya. Synthesizing waveform sequence-to-sequence to augment training data for sequence-to-sequence speech recognition. Acoustical Science and Technology, 2021, 42(6): 333-343.
  • [9] Peters, Ben; Niculae, Vlad; Martins, Andre F. T. Sparse Sequence-to-Sequence Models. ACL 2019, 2019: 1504-1519.
  • [10] Bahar, Parnia; Zeyer, Albert; Schlueter, Ralf; Ney, Hermann. On Using 2D Sequence-to-Sequence Models for Speech Recognition. IEEE ICASSP 2019, 2019: 5671-5675.