Speaker voice normalization for end-to-end speech translation

被引:1
|
作者
Xue, Zhengshan [1 ]
Shi, Tingxun
Zhang, Xiaolei
Xiong, Deyi [1 ]
机构
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
关键词
Machine translation; Speech translation; Speaker normalization;
D O I
10.1016/j.eswa.2024.123317
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Speaker voices exhibit acoustic variation. Our preliminary experiments reveal that normalized voice can significantly improve end -to -end speech translation. To mitigate the negative impact of acoustic voice variation across speakers on speech translation, we propose SVN-ST, a Speaker -Voice -Normalized end -to -end Speech Translation framework. In SVN-ST, we use synthetic speech inputs generated from a Text -to -Speech system to complement raw speech inputs. In order to explore synthetic speech inputs, we introduce two essential components for SVN-ST: an alignment adapter at the encoder side and a normalized speech knowledge distillation module at the decoder side. The former forces the representations of raw speech inputs to be close to those of synthetic (normalized) speech inputs while the latter attempts to guide the translations of raw speech inputs with those yielded from synthetic speech inputs. Two additional losses are also defined to equip with the two components. Experimental results on the MuST-C benchmark dataset demonstrate that SVN-ST outperforms previous state-of-the-art end -to -end non -normalized speech translation systems by 0.4 BLEU and cascaded speech translation systems by 2.3 BLEU. On the Covost 2 testset, SVN-ST also outperforms other normalized speech methods on robustness. Further analyses suggest that our model effectively aligns speech representations from different speakers, enhances robustness, and significantly improves sentence -level translation quality.
引用
收藏
页数:11
相关论文
共 50 条
  • [31] Curriculum Pre-training for End-to-End Speech Translation
    Wang, Chengyi
    Wu, Yu
    Liu, Shujie
    Zhou, Ming
    Yang, Zhenglu
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 3728 - 3738
  • [32] Mutual-Learning Improves End-to-End Speech Translation
    Zhao, Jiawei
    Luo, Wei
    Chen, Boxing
    Gilman, Andrew
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 3989 - 3994
  • [33] Improving End-to-End Speech Translation with Progressive Dual Encoding
    Zhang, Runlai
    Chen, Saihan
    Zhang, Yuhao
    Du, Yangfan
    Chen, Hao
    Xiao, Tong
    Zhu, Jingbo
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT III, NLPCC 2024, 2025, 15361 : 199 - 212
  • [34] TIGHT INTEGRATED END-TO-END TRAINING FOR CASCADED SPEECH TRANSLATION
    Bahar, Parnia
    Bieschke, Tobias
    Schlueter, Ralf
    Ney, Hermann
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 950 - 957
  • [35] Towards a Deep Understanding of Multilingual End-to-End Speech Translation
    Sun, Haoran
    Zhao, Xiaohu
    Lei, Yikun
    Zhu, Shaolin
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 14332 - 14348
  • [36] Regularizing End-to-End Speech Translation with Triangular Decomposition Agreement
    Du, Yichao
    Zhang, Zhirui
    Wang, Weizhi
    Chen, Boxing
    Xie, Jun
    Xu, Tong
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 10590 - 10598
  • [37] Knowledge Distillation on Joint Task End-to-End Speech Translation
    Nayem, Khandokar Md
    Xue, Ran
    Chang, Ching-Yun
    Shanbhogue, Akshaya Vishnu Kudlu
    INTERSPEECH 2023, 2023, : 1493 - 1497
  • [38] SHAS: Approaching optimal Segmentation for End-to-End Speech Translation
    Tsiamas, Ioannis
    Gallego, Gerard I.
    Fonollosa, Jose A. R.
    Costa-jussa, Marta R.
    INTERSPEECH 2022, 2022, : 106 - 110
  • [39] PromptST: Abstract Prompt Learning for End-to-End Speech Translation
    Yu, Tengfei
    Ding, Liang
    Liu, Xuebo
    Chen, Kehai
    Zhang, Meishan
    Tao, Dacheng
    Zhang, Min
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 10140 - 10154
  • [40] ONE-TO-MANY MULTILINGUAL END-TO-END SPEECH TRANSLATION
    Di Gangi, Mattia A.
    Negri, Matteo
    Turchi, Marco
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 585 - 592