DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation

Cited by: 0
Authors
Fang, Qingkai [1,2]
Zhou, Yan [1,2]
Feng, Yang [1,2]
Affiliations
[1] Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS), Beijing, China
[2] University of Chinese Academy of Sciences, Beijing, China
Source
Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model. However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, posing challenges to achieving both high-quality translations and fast decoding speeds for S2ST models. In this paper, we propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST. To better capture the complex distribution of the target speech, DASpeech adopts a two-pass architecture that decomposes generation into two steps: a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech from the hidden states of the linguistic decoder. Specifically, we use the decoder of DA-Transformer as the linguistic decoder and FastSpeech 2 as the acoustic decoder. DA-Transformer models translations with a directed acyclic graph (DAG). To consider all potential paths in the DAG during training, we compute the expected hidden state for each target token via dynamic programming and feed it into the acoustic decoder to predict the target mel-spectrogram. During inference, we select the most probable path and take the hidden states on that path as input to the acoustic decoder. Experiments on the CVSS Fr→En benchmark demonstrate that DASpeech achieves comparable or even better performance than the state-of-the-art S2ST model Translatotron 2, while providing up to 18.53× speedup over the autoregressive baseline. Compared with the previous non-autoregressive S2ST model, DASpeech relies on neither knowledge distillation nor iterative decoding, achieving significant improvements in both translation quality and decoding speed. Furthermore, DASpeech shows the ability to preserve the source speaker's voice during translation.
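The sketch below illustrates the training-time marginalization described in the abstract: a forward-backward dynamic program over the DAG that yields, for each target token position, an expected linguistic-decoder hidden state to be fed to the acoustic decoder. It is a minimal illustration under assumptions; the function name, array shapes, and the use of NumPy in probability space (rather than log space) are illustrative choices, not the authors' implementation.

```python
import numpy as np

def expected_hidden_states(hidden, trans, emit, target):
    """Expected per-position hidden states, marginalized over DAG paths (sketch).

    hidden : (L, D) hidden state of each DAG vertex (linguistic decoder output)
    trans  : (L, L) upper-triangular vertex transition probabilities (rows sum to 1)
    emit   : (L, V) per-vertex token distributions
    target : (T,)   target token ids; paths run from vertex 0 to vertex L-1
    returns: (T, D) expected hidden state for each target position
    """
    L, _ = hidden.shape
    T = len(target)

    # Forward pass: alpha[t, i] = P(target[:t+1] emitted on a path ending at vertex i)
    alpha = np.zeros((T, L))
    alpha[0, 0] = emit[0, target[0]]          # every path starts at vertex 0
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, target[t]]

    # Backward pass: beta[t, i] = P(target[t+1:] emitted | vertex i at position t)
    beta = np.zeros((T, L))
    beta[T - 1, L - 1] = 1.0                  # every path ends at the last vertex
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (beta[t + 1] * emit[:, target[t + 1]])

    # Posterior vertex occupancy per target position, then the expectation
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True) + 1e-12
    return gamma @ hidden                     # (T, D), input to the acoustic decoder
```

At inference, as the abstract notes, a single highest-probability path is decoded instead, and the hidden states of the vertices on that path are passed to the FastSpeech 2 acoustic decoder.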
Pages: 20