SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

Cited by: 10
Authors
Tsiamas, Ioannis [1 ]
Gallego, Gerard I. [1 ]
Fonollosa, Jose A. R. [1 ]
Costa-jussa, Marta R. [1 ]
Affiliations
[1] Univ Politecn Cataluna, TALP Res Ctr, Barcelona, Spain
Source
INTERSPEECH 2022 | 2022
Keywords
speech translation; audio segmentation;
DOI
10.21437/Interspeech.2022-59
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Speech translation models are unable to directly process long audio recordings, such as TED talks, which must be split into shorter segments. Speech translation datasets provide manual segmentations of their audio, which are not available in real-world scenarios, and existing segmentation methods usually significantly reduce translation quality at inference time. To bridge the gap between the manual segmentation used in training and the automatic one at inference, we propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus. First, we train a classifier to identify the frames included in a segmentation, using speech representations from a pre-trained wav2vec 2.0. The optimal splitting points are then found by a probabilistic Divide-and-Conquer algorithm that progressively splits at the frame of lowest probability until all segments are below a pre-specified length. Experiments on MuST-C and mTEDx show that the translation of the segments produced by our method approaches the quality of the manual segmentation on five language pairs. Namely, SHAS retains 95-98% of the manual segmentation's BLEU score, compared to the 87-93% of the best existing methods. Our method is additionally generalizable to different domains and achieves high zero-shot performance on unseen languages.
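The Divide-and-Conquer step described in the abstract can be sketched as follows: given per-frame inclusion probabilities (e.g. produced by a classifier over wav2vec 2.0 representations), each over-long span is recursively split at its lowest-probability frame until every segment fits under a maximum length. This is a minimal illustrative sketch, not the authors' implementation; the function names and the boundary-clamping detail are assumptions.

```python
def split_segment(probs, start, end, max_len, segments):
    """Recursively split probs[start:end] at the minimum-probability frame
    until every resulting segment has at most max_len frames."""
    if end - start <= max_len:
        segments.append((start, end))
        return
    # Split at the frame least likely to belong inside a segment.
    split = min(range(start, end), key=lambda i: probs[i])
    # Clamp so neither half is empty (an assumption for this sketch).
    split = max(start + 1, min(split, end - 1))
    split_segment(probs, start, split, max_len, segments)
    split_segment(probs, split, end, max_len, segments)


def divide_and_conquer(probs, max_len):
    """Return a list of (start, end) frame ranges covering all of probs."""
    segments = []
    split_segment(probs, 0, len(probs), max_len, segments)
    return segments
```

The resulting segments are contiguous, cover the whole input, and each respects the pre-specified length cap, with cuts preferentially placed at low-probability (likely non-speech) frames.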
Pages: 106-110
Page count: 5
Related Papers (50 in total)
  • [41] IMPROVING UNSUPERVISED STYLE TRANSFER IN END-TO-END SPEECH SYNTHESIS WITH END-TO-END SPEECH RECOGNITION
    Liu, Da-Rong
    Yang, Chi-Yu
    Wu, Szu-Lin
    Lee, Hung-Yi
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 640 - 647
  • [42] CKDST: Comprehensively and Effectively Distill Knowledge from Machine Translation to End-to-End Speech Translation
    Lei, Yikun
    Xue, Zhengshan
    Sun, Haoran
    Zhao, Xiaohu
    Zhu, Shaolin
    Lin, Xiaodong
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 3123 - 3137
  • [43] Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation
    Wang, Changhan
    Pino, Juan
    Gu, Jiatao
    INTERSPEECH 2020, 2020, : 4731 - 4735
  • [44] CCSRD: Content-Centric Speech Representation Disentanglement Learning for End-to-End Speech Translation
    Zhao, Xiaohu
    Sun, Haoran
    Lei, Yikun
    Zhu, Shaolin
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 5920 - 5932
  • [45] ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation
    Le, Chenyang
    Qian, Yao
    Zhou, Long
    Liu, Shujie
    Qian, Yanmin
    Zeng, Michael
    Huang, Xuedong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [46] The NiuTrans End-to-End Speech Translation System for IWSLT 2021 Offline Task
    Xu, Chen
    Liu, Xiaoqian
    Liu, Xiaowen
    Wang, Laohu
    Huang, Canan
    Xiao, Tong
    Zhu, Jingbo
    IWSLT 2021: THE 18TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE TRANSLATION, 2021, : 92 - 99
  • [47] Transformer-Based End-to-End Speech Translation With Rotary Position Embedding
    Li, Xueqing
    Li, Shengqiang
    Zhang, Xiao-Lei
    Rahardja, Susanto
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 371 - 375
  • [48] Large-Scale Streaming End-to-End Speech Translation with Neural Transducers
    Xue, Jian
    Wang, Peidong
    Li, Jinyu
    Post, Matt
    Gaur, Yashesh
    INTERSPEECH 2022, 2022, : 3263 - 3267
  • [49] Beyond Sentence-Level End-to-End Speech Translation: Context Helps
    Zhang, Biao
    Titov, Ivan
    Haddow, Barry
    Sennrich, Rico
    59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1 (ACL-IJCNLP 2021), 2021, : 2566 - 2578
  • [50] Optimally Encoding Inductive Biases into the Transformer Improves End-to-End Speech Translation
    Vyas, Piyush
    Kuznetsova, Anastasia
    Williamson, Donald S.
    INTERSPEECH 2021, 2021, : 2287 - 2291