Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation

Times cited: 3
Authors
Dong, Qianqian [1]
Yue, Fengpeng [1,2]
Ko, Tom [1]
Wang, Mingxuan [1]
Bai, Qibing [1,2]
Zhang, Yu [2,3]
Affiliations
[1] ByteDance AI Lab, Beijing, People's Republic of China
[2] Southern University of Science and Technology, Department of Computer Science and Engineering, Shenzhen, People's Republic of China
[3] Peng Cheng Laboratory, Shenzhen, People's Republic of China
Source
INTERSPEECH 2022 | 2022
Keywords
speech translation; speech-to-speech translation; pseudo-labeling
DOI
10.21437/Interspeech.2022-10011
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Direct speech-to-speech translation (S2ST) has drawn increasing attention recently. The task is very challenging due to data scarcity and the complexity of the speech-to-speech mapping. In this paper, we report our recent achievements in S2ST. First, we build an S2ST Transformer baseline that outperforms the original Translatotron. Second, we leverage external data through pseudo-labeling and obtain a new state-of-the-art result on the Fisher English-to-Spanish test set. Specifically, we exploit the pseudo-labeled data with a combination of popular techniques whose application to S2ST is non-trivial. Moreover, we evaluate our approach on both syntactically similar (Spanish-English) and distant (English-Chinese) language pairs. Our implementation is available at https://github.com/fengpeng-yue/speech-to-speech-translation.
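This record does not spell out how the pseudo-labeled data are produced. As a rough illustration only, the sketch assumes one common recipe: start from an ASR corpus of source speech with transcripts, translate the transcripts with a machine translation (MT) teacher, synthesize target speech with a text-to-speech (TTS) teacher, drop low-confidence outputs, and mix the resulting pairs with the genuine S2ST data. The names `pseudo_label`, `mix_training_data`, and `S2STPair`, the confidence threshold, and the toy teachers are all hypothetical; this is not the authors' implementation, which lives in the linked repository.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Toy stand-in for an audio signal; a real system would use arrays or features.
Waveform = List[float]


@dataclass
class S2STPair:
    source_audio: Waveform
    target_audio: Waveform
    is_pseudo: bool  # True when the target side was produced by teacher models


def pseudo_label(
    asr_data: Iterable[Tuple[Waveform, str]],
    translate: Callable[[str], Tuple[str, float]],
    synthesize: Callable[[str], Waveform],
    min_score: float = 0.5,
) -> List[S2STPair]:
    """Turn (source speech, source transcript) pairs into pseudo S2ST pairs.

    translate:  hypothetical MT teacher, returns (target text, confidence).
    synthesize: hypothetical TTS teacher, maps target text to a waveform.
    Pairs whose MT confidence is below min_score are discarded.
    """
    pairs: List[S2STPair] = []
    for audio, transcript in asr_data:
        target_text, score = translate(transcript)
        if score < min_score:
            continue  # drop low-confidence pseudo labels
        pairs.append(S2STPair(audio, synthesize(target_text), is_pseudo=True))
    return pairs


def mix_training_data(real: List[S2STPair], pseudo: List[S2STPair]) -> List[S2STPair]:
    """Concatenate genuine and pseudo-labeled pairs into one training set."""
    return list(real) + list(pseudo)


# Toy usage: a dictionary plays the MT teacher, and the "TTS" emits the text length.
toy_mt = {"hola": ("hello", 0.9), "adios": ("bye", 0.2)}
asr_data = [([0.1, 0.2], "hola"), ([0.3], "adios")]
pseudo = pseudo_label(asr_data, lambda t: toy_mt[t], lambda s: [float(len(s))])
real = [S2STPair([0.0], [1.0], is_pseudo=False)]
train_set = mix_training_data(real, pseudo)
print(len(pseudo), len(train_set))  # 1 2
```

Filtering by teacher confidence is one common way to limit noise from pseudo labels; whether and how the paper filters its pseudo data is not stated in this record.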
Pages: 1781-1785
Number of pages: 5
References (29 in total)
  • [1] Baevski A, 2020, Advances in Neural Information Processing Systems, V33, P12449, DOI 10.48550/ARXIV.2006.11477
  • [2] Bu H, 2017, 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), P58, DOI 10.1109/ICSDA.2017.8384449
  • [3] Chen G, 2021, Proc. Interspeech 2021
  • [4] Duquenne PA, 2021, Advances in Neural Information Processing Systems, V34
  • [5] Gulati A, Qin J, Chiu CC, Parmar N, Zhang Y, Yu J, Han W, Wang S, Zhang Z, Wu Y, Pang R, 2020, Conformer: Convolution-augmented Transformer for Speech Recognition, INTERSPEECH 2020, P5036-5040
  • [6] Inaguma H, 2020, 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020): System Demonstrations, P302
  • [7] Jia Y, 2021, arXiv:2107.08661
  • [8] Jia Y, Weiss RJ, Biadsy F, Macherey W, Johnson M, Chen Z, Wu Y, 2019, Direct speech-to-speech translation with a sequence-to-sequence model, INTERSPEECH 2019, P1123-1127
  • [9] Jia Y, 2019, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), P7180, DOI 10.1109/ICASSP.2019.8683343
  • [10] Kahn J, 2020, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), P7669, DOI 10.1109/ICASSP40776.2020.9052942