End-to-End Speech Translation With Transcoding by Multi-Task Learning for Distant Language Pairs

Cited by: 14
Authors
Kano, Takatomo [1 ]
Sakti, Sakriani [2 ,3 ]
Nakamura, Satoshi [3 ,4 ]
Affiliations
[1] Nara Inst Sci & Technol, Ikoma 6300192, Japan
[2] Nara Inst Sci & Technol, Grad Sch Informat Sci, Ikoma 6300192, Japan
[3] RIKEN Ctr Adv Intelligence Project AIP, Chuo Ku, Tokyo 1030027, Japan
[4] Nara Inst Sci & Technol, Grad Sch Informat Sci, Ikoma 6300192, Japan
Keywords
Task analysis; Decoding; Speech processing; Recurrent neural networks; Training; Adaptation models; End-to-end speech-to-text translation; automatic speech recognition; machine translation (MT); multi-task learning
DOI
10.1109/TASLP.2020.2986886
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Directly translating spoken utterances from a source language to a target language is challenging because it requires a fundamental transformation of both linguistic and para/non-linguistic features. Traditional speech-to-speech translation approaches cascade automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech synthesis (TTS), passing text between the components. The current state-of-the-art models for ASR, MT, and TTS have mainly been built with deep neural networks, in particular attention-based encoder-decoder networks. Recently, several works have constructed end-to-end direct speech-to-text translation by combining ASR and MT into a single model. However, the usefulness of these models has only been investigated on language pairs with similar syntax and word order (e.g., English-French or English-Spanish). For syntactically distant language pairs (e.g., English-Japanese), speech translation requires long-distance word reordering. Furthermore, parallel texts with corresponding speech utterances suitable for training end-to-end speech translation are generally unavailable, and collecting such corpora is time-consuming and expensive. This article presents the first attempt to build an end-to-end direct speech-to-text translation system for syntactically distant language pairs that suffer from long-distance reordering. We train the model on an English (subject-verb-object (SVO) word order) and Japanese (SOV word order) language pair. To guide the attention-based encoder-decoder model through this difficult problem, we construct end-to-end speech translation with transcoding and utilize curriculum learning (CL) strategies that gradually train the network for the end-to-end speech translation task by adapting the decoder or encoder parts. We also use TTS for data augmentation, generating corresponding speech utterances from existing parallel text data.
Our experimental results show that the proposed approach yields significant improvements over conventional cascade models and over a direct speech translation approach that uses a single model without transcoding and CL strategies.
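The curriculum strategy summarized above (gradually shifting training from a sub-task toward the full end-to-end speech-translation task) can be illustrated with a minimal, hypothetical sketch. The function names, the linear warm-up schedule, and the two-way loss interpolation below are illustrative assumptions, not the paper's actual training recipe:

```python
# Hypothetical sketch of a curriculum-style multi-task loss schedule:
# early epochs emphasize the ASR (transcription) sub-task, and the weight
# shifts linearly toward the end-to-end speech-translation (ST) loss.
def curriculum_weights(epoch, warmup_epochs=10):
    """Return (asr_weight, st_weight) for a given training epoch."""
    progress = min(epoch / warmup_epochs, 1.0)
    return 1.0 - progress, progress

def multi_task_loss(asr_loss, st_loss, epoch, warmup_epochs=10):
    """Interpolate the two task losses according to the curriculum."""
    w_asr, w_st = curriculum_weights(epoch, warmup_epochs)
    return w_asr * asr_loss + w_st * st_loss

# At epoch 0 only the ASR sub-task contributes; after the warm-up,
# only the speech-translation loss does.
print(curriculum_weights(0))   # (1.0, 0.0)
print(curriculum_weights(10))  # (0.0, 1.0)
```

In this sketch, freezing versus adapting the encoder or decoder parts (as the abstract describes) would be handled separately by the optimizer configuration; the schedule only governs how the two loss terms are mixed.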
Pages: 1342-1355
Page count: 14