TRANSFORMER-BASED DIRECT SPEECH-TO-SPEECH TRANSLATION WITH TRANSCODER

Cited by: 31
Authors
Kano, Takatomo [1 ]
Sakti, Sakriani [1 ,2 ]
Nakamura, Satoshi [1 ,2 ]
Affiliations
[1] Nara Inst Sci & Technol, Ikoma, Japan
[2] RIKEN, Ctr Adv Intelligence Project AIP, Wako, Saitama, Japan
Source
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021
Keywords
speech-to-speech translation; Transcoder; Transformer; sequence-to-sequence model; multitask learning
DOI
10.1109/SLT48900.2021.9383496
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Traditional speech translation systems use a cascaded approach that concatenates automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis to translate speech from one language to another step by step. Unfortunately, because these components are trained separately, the MT component often struggles to handle ASR errors, resulting in unnatural translations. Recently, one work attempted to construct direct speech translation within a single model. That model used a multi-task scheme that learns to predict not only the target speech spectrograms directly but also the source and target phoneme transcriptions as auxiliary tasks. However, that work was evaluated only on Spanish-English, a language pair with similar syntax and word order. For syntactically distant language pairs, translation requires long-distance word reordering, which makes direct speech frame-to-frame alignment difficult. Another direction was to construct a single deep-learning framework while keeping the step-by-step translation process; however, such studies focused only on speech-to-text translation. Furthermore, all of these works were based on recurrent neural network (RNN) models. In this work, we extend the step-by-step scheme to complete end-to-end speech-to-speech translation and propose a Transformer-based speech translation model using a Transcoder. We compare our proposed model with the multi-task model on syntactically similar and distant language pairs.
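The multi-task scheme described in the abstract combines a main loss on the predicted target spectrograms with auxiliary losses on source and target phoneme transcriptions. A minimal sketch of how such losses are typically combined is shown below; the function name and the auxiliary weights `w_src`/`w_tgt` are illustrative assumptions, since the abstract does not give the weighting.

```python
def multitask_loss(spec_loss, src_phone_loss, tgt_phone_loss,
                   w_src=0.1, w_tgt=0.1):
    """Combine the main spectrogram loss with auxiliary phoneme losses.

    spec_loss       -- loss on the predicted target speech spectrograms
    src_phone_loss  -- auxiliary loss on source phoneme transcription
    tgt_phone_loss  -- auxiliary loss on target phoneme transcription
    w_src, w_tgt    -- auxiliary task weights (illustrative values,
                       not taken from the paper)
    """
    return spec_loss + w_src * src_phone_loss + w_tgt * tgt_phone_loss


# Example: with zero auxiliary weight contributions the total
# reduces to the main spectrogram loss.
total = multitask_loss(1.0, 2.0, 3.0)
```

With the default weights, `multitask_loss(1.0, 2.0, 3.0)` evaluates to 1.0 + 0.1*2.0 + 0.1*3.0 = 1.5, i.e. the auxiliary tasks contribute a small regularizing term on top of the main objective.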
Pages: 958-965 (8 pages)