An end-to-end model for cross-lingual transformation of paralinguistic information

Cited by: 3

Authors:

Kano, Takatomo [1]

Takamichi, Shinnosuke [1]

Sakti, Sakriani [1]

Neubig, Graham [1]

Toda, Tomoki [1]

Nakamura, Satoshi [1]

Affiliation:

[1] Nara Inst Sci & Technol, Grad Sch Informat Sci, Kansai Sci City, Japan

Source:

MACHINE TRANSLATION | 2018 / Vol. 32 / Issue 04

Funding:

Japan Society for the Promotion of Science;

Keywords:

Paralinguistic information; Speech to speech translation; Automatic speech recognition; Machine translation; Text to speech synthesis;

DOI:

10.1007/s10590-018-9217-7

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Speech translation is a technology that helps people communicate across different languages. The most commonly used speech translation model is composed of automatic speech recognition, machine translation and text-to-speech synthesis components, which share information only at the text level. However, spoken communication is different from written communication in that it uses rich acoustic cues such as prosody in order to transmit more information through non-verbal channels. This paper is concerned with speech-to-speech translation that is sensitive to this paralinguistic information. Our long-term goal is to make a system that allows users to speak a foreign language with the same expressiveness as if they were speaking in their own language. Our method works by reconstructing input acoustic features in the target language. From the many different possible paralinguistic features to handle, in this paper we choose duration and power as a first step, proposing a method that can translate these features from input speech to the output speech in continuous space. This is done in a simple and language-independent fashion by training an end-to-end model that maps source-language duration and power information into the target language. Two approaches are investigated: linear regression and neural network models. We evaluate the proposed methods and show that paralinguistic information in the input speech of the source language can be reflected in the output speech of the target language.
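The linear regression approach named in the abstract can be illustrated with a minimal sketch. The abstract does not give the feature layout, training data, or fitting procedure, so everything below is an assumption made for illustration: each row holds one hypothetical (duration, power) pair, the data are synthetic, and the function names fit_linear_map and apply_linear_map are invented here. It shows only the general idea of learning an affine map from source-language features to target-language features by ordinary least squares.

```python
import numpy as np


def fit_linear_map(X, Y):
    """Fit Y ~ [X, 1] @ W by ordinary least squares (bias in the last row of W)."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
    return W


def apply_linear_map(W, X):
    """Map source-side features to predicted target-side features."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    return X_aug @ W


# Synthetic stand-in data (assumption): columns are source duration and power.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Hypothetical "true" affine relation used only to generate target features.
true_A = np.array([[1.2, 0.1], [-0.3, 0.8]])
true_b = np.array([0.5, -1.0])
Y = X @ true_A + true_b

W = fit_linear_map(X, Y)
Y_hat = apply_linear_map(W, X)
```

Because the synthetic targets are noise-free, the fitted W recovers true_A and true_b up to floating-point error; with real speech features the fit would only be approximate, which is one plausible motivation for the neural network alternative the abstract also mentions.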


Pages: 353-368

Page count: 16
