Speaker voice normalization for end-to-end speech translation

Cited by: 1

Authors:

Xue, Zhengshan [1]

Shi, Tingxun

Zhang, Xiaolei

Xiong, Deyi [1]

Affiliations:

[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2024 / Vol. 248

Keywords:

Machine translation; Speech translation; Speaker normalization

DOI:

10.1016/j.eswa.2024.123317

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Speaker voices exhibit acoustic variation. Our preliminary experiments reveal that normalized voice can significantly improve end-to-end speech translation. To mitigate the negative impact of acoustic voice variation across speakers on speech translation, we propose SVN-ST, a Speaker-Voice-Normalized end-to-end Speech Translation framework. In SVN-ST, we use synthetic speech inputs generated by a Text-to-Speech system to complement raw speech inputs. To exploit the synthetic speech inputs, we introduce two essential components for SVN-ST: an alignment adapter on the encoder side and a normalized speech knowledge distillation module on the decoder side. The former forces the representations of raw speech inputs to be close to those of synthetic (normalized) speech inputs, while the latter guides the translations of raw speech inputs with those produced from synthetic speech inputs. Two additional losses are defined to train these two components. Experimental results on the MuST-C benchmark dataset demonstrate that SVN-ST outperforms previous state-of-the-art end-to-end non-normalized speech translation systems by 0.4 BLEU and cascaded speech translation systems by 2.3 BLEU. On the CoVoST 2 test set, SVN-ST also outperforms other normalized speech methods in robustness. Further analyses suggest that our model effectively aligns speech representations from different speakers, enhances robustness, and significantly improves sentence-level translation quality.
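
The abstract names two auxiliary losses but does not give their form. The following is only a minimal sketch of how such a combined objective might look, assuming a mean-squared-error alignment term between raw and synthetic encoder representations and a KL-divergence distillation term with the synthetic-speech branch as teacher. The function names, the choice of MSE and KL, and the weights alpha and beta are all illustrative assumptions, not the authors' definitions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(raw_repr, synth_repr):
    """Assumed form: mean squared error between the raw-speech and
    synthetic-speech encoder representations (equal-length vectors)."""
    assert len(raw_repr) == len(synth_repr)
    return sum((r - s) ** 2 for r, s in zip(raw_repr, synth_repr)) / len(raw_repr)


def distillation_loss(raw_logits, synth_logits):
    """Assumed form: KL(teacher || student), where the teacher distribution
    comes from synthetic speech and the student from raw speech."""
    teacher = softmax(synth_logits)
    student = softmax(raw_logits)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)


def total_loss(translation_loss, raw_repr, synth_repr,
               raw_logits, synth_logits, alpha=1.0, beta=1.0):
    """Hypothetical combined objective: translation loss plus the two
    weighted auxiliary terms (alpha and beta are illustrative weights)."""
    return (translation_loss
            + alpha * alignment_loss(raw_repr, synth_repr)
            + beta * distillation_loss(raw_logits, synth_logits))
```

When the raw and synthetic branches agree exactly, both auxiliary terms vanish and the objective reduces to the translation loss alone, which is the behavior one would expect from terms meant to pull the raw branch toward the normalized one.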


Pages: 11

50 entries in total

[41] Dynamic Speaker Representations Adjustment and Decoder Factorization for Speaker Adaptation in End-to-End Speech Synthesis
Fu, Ruibo
Tao, Jianhua
Wen, Zhengqi
Yi, Jiangyan
Wang, Tao
Qiang, Chunyu
INTERSPEECH 2020, 2020, : 4701 - 4705
[42] End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning
Denisov, Pavel
Ngoc Thang Vu
INTERSPEECH 2019, 2019, : 4425 - 4429
[43] INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION
Wang, Weiqing
Li, Ming
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8362 - 8366
[44] Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation
Salesky, Elizabeth
Sperber, Matthias
Black, Alan W.
57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 1835 - 1841
[45] Improving End-to-End Speech Translation by Leveraging Auxiliary Speech and Text Data
Zhang, Yuhao
Xu, Chen
Hu, Bojie
Zhang, Chunliang
Xiao, Tong
Zhu, Jingbo
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11, 2023, : 13984 - 13992
[46] ATTENTION-BASED END-TO-END SPEECH RECOGNITION ON VOICE SEARCH
Shan, Changhao
Zhang, Junbo
Wang, Yujun
Xie, Lei
2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4764 - 4768
[47] Cross-Speaker Emotion Disentangling and Transfer for End-to-End Speech Synthesis
Li, Tao
Wang, Xinsheng
Xie, Qicong
Wang, Zhichao
Xie, Lei
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1448 - 1460
[48] NEURAL NOISE EMBEDDING FOR END-TO-END SPEECH ENHANCEMENT WITH CONDITIONAL LAYER NORMALIZATION
Zhang, Zhihui
Li, Xiaoqi
Li, Yaxing
Dong, Yuanjie
Wang, Dan
Xiong, Shengwu
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7113 - 7117
[49] Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation
Fukuda, Ryo
Sudoh, Katsuhito
Nakamura, Satoshi
INTERSPEECH 2022, 2022, : 121 - 125
[50] End-to-End Chinese Speaker Identification
Yu, Dian
Zhou, Ben
Yu, Dong
NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 2274 - 2285
