SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition

Times Cited: 0
Authors
Xue, Hongfei [1]
Shao, Qijie [2]
Huang, Kaixun [1]
Chen, Peikun [2]
Liu, Jie [3]
Xie, Lei [2]
Affiliations
[1] Northwestern Polytechnical University, School of Software, Xi'an, People's Republic of China
[2] Northwestern Polytechnical University, School of Computer Science, Xi'an, People's Republic of China
[3] Huawei, Huawei Cloud, Xi'an, People's Republic of China
Source
2024 IEEE International Conference on Multimedia and Expo (ICME 2024), 2024
Keywords
Multilingual ASR; self-supervised learning; representation analysis; low-resource ASR
DOI
10.1109/ICME57554.2024.10687681
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Multilingual automatic speech recognition (ASR) systems have garnered attention for their potential to extend language coverage globally. While self-supervised learning (SSL) models such as MMS have demonstrated their effectiveness in multilingual ASR, the representations of their different layers contain distinct information that has not been fully leveraged. In this study, we propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune the MMS model. We first analyze the different layers of MMS and show that the middle layers capture language-related information, while the high layers encode content-related information, which gradually decreases in the final layers. We then extract a language-related frame from the correlated middle layers and use it to guide language-specific content extraction through the self-attention mechanism. Additionally, we steer the model toward acquiring more content-related information in the final layers with our proposed Cross-CTC. We evaluate SSHR on two multilingual datasets, Common Voice and ML-SUPERB, and the experimental results demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance.
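The layer-wise recipe described above can be illustrated with a compact sketch. The following PyTorch snippet is not the authors' implementation: a toy transformer stack stands in for the MMS encoder, a frame pooled from an assumed language-related middle layer is prepended so that later self-attention can condition on it, and an auxiliary CTC head on a high layer (a stand-in for the Cross-CTC idea, whose exact formulation is not reproduced here) is combined with the final CTC head in the loss. All module names, layer indices, and the 0.3 auxiliary weight are illustrative assumptions.

# Minimal PyTorch sketch of the fine-tuning idea described in the abstract.
# Hypothetical modules only; the real SSHR recipe fine-tunes the pretrained MMS encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyHierarchicalEncoder(nn.Module):
    """Stand-in for an SSL encoder stack exposing per-layer representations."""

    def __init__(self, dim=256, num_layers=12, num_heads=4, vocab_size=100,
                 lang_layer=6, aux_ctc_layer=10):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.lang_layer = lang_layer        # middle layer assumed to be language-related
        self.aux_ctc_layer = aux_ctc_layer  # high layer used for the auxiliary CTC head
        self.lang_proj = nn.Linear(dim, dim)             # maps pooled middle layer to a language frame
        self.ctc_head_aux = nn.Linear(dim, vocab_size)   # auxiliary (Cross-CTC-style) head
        self.ctc_head_final = nn.Linear(dim, vocab_size)  # main CTC head

    def forward(self, feats):
        # feats: (batch, time, dim) acoustic features from a front-end (not modeled here)
        x = feats
        aux_logits = None
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == self.lang_layer:
                # Pool the middle-layer representation into one language-related frame
                # and prepend it, so the remaining self-attention layers can condition on it.
                lang_frame = self.lang_proj(x.mean(dim=1, keepdim=True))
                x = torch.cat([lang_frame, x], dim=1)
            if i == self.aux_ctc_layer:
                # Auxiliary CTC logits from a high layer, encouraging the final layers
                # to retain content-related information; drop the prepended language frame.
                aux_logits = self.ctc_head_aux(x[:, 1:])
        final_logits = self.ctc_head_final(x[:, 1:])
        return final_logits, aux_logits


def ctc_loss(logits, targets, input_lens, target_lens):
    # logits: (batch, time, vocab); F.ctc_loss expects (time, batch, vocab) log-probs.
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)
    return F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)


if __name__ == "__main__":
    model = ToyHierarchicalEncoder()
    feats = torch.randn(2, 50, 256)                      # dummy batch of 2 utterances
    targets = torch.randint(1, 100, (2, 10))             # dummy label sequences
    in_lens = torch.full((2,), 50, dtype=torch.long)
    tgt_lens = torch.full((2,), 10, dtype=torch.long)
    final_logits, aux_logits = model(feats)
    loss = ctc_loss(final_logits, targets, in_lens, tgt_lens) \
        + 0.3 * ctc_loss(aux_logits, targets, in_lens, tgt_lens)
    loss.backward()
    print(float(loss))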
Pages: 6
References (27 in total)
[1] Ardila, R., et al. Common Voice: A Massively-Multilingual Speech Corpus. Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), 2020: 4218.
[2] Babu, Arun; Wang, Changhan; Tjandra, Andros; Lakhotia, Kushal; Xu, Qiantong; Goyal, Naman; Singh, Kritika; von Platen, Patrick; Saraf, Yatharth; Pino, Juan; Baevski, Alexei; Conneau, Alexis; Auli, Michael. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale. INTERSPEECH 2022, 2022: 2278-2282.
[3] Baevski, Alexei, et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Advances in Neural Information Processing Systems (NeurIPS), 2020.
[4] Bai, Junwen; Li, Bo; Zhang, Yu; Bapna, Ankur; Siddhartha, Nikhil; Sim, Khe Chai; Sainath, Tara N. Joint Unsupervised and Supervised Training for Multilingual ASR. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 6402-6406.
[5] Chen, William, et al. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023: 1. DOI: 10.1109/ICASSP49357.2023.10095326.
[6] Conneau, Alexis; Baevski, Alexei; Collobert, Ronan; Mohamed, Abdelrahman; Auli, Michael. Unsupervised Cross-lingual Representation Learning for Speech Recognition. INTERSPEECH 2021, 2021: 2426-2430.
[7] Graves, Alex, et al. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006: 369-376. DOI: 10.1145/1143844.1143891.
[8] Hsu, Wei-Ning; Bolte, Benjamin; Tsai, Yao-Hung Hubert; Lakhotia, Kushal; Salakhutdinov, Ruslan; Mohamed, Abdelrahman. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3451-3460.
[9] Kannan, Anjuli; Datta, Arindrima; Sainath, Tara N.; Weinstein, Eugene; Ramabhadran, Bhuvana; Wu, Yonghui; Bapna, Ankur; Chen, Zhifeng; Lee, Seungji. Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model. INTERSPEECH 2019, 2019: 2130-2134.
[10] Kingma, Diederik P.; Ba, Jimmy. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (ICLR), 2015. DOI: 10.48550/ARXIV.1412.6980.