Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-Resource Speech Recognition

Cited by: 28
Authors
Yi, Cheng [1 ,2 ]
Zhou, Shiyu [1 ]
Xu, Bo [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100190, Peoples R China
Keywords
Acoustics; Bit error rate; Linguistics; Task analysis; Training; Decoding; Data models; BERT; end-to-end modeling; low-resource ASR; pre-training; wav2vec; CTC; ASR
DOI
10.1109/LSP.2021.3071668
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline Codes
0808; 0809
Abstract
End-to-end models have achieved impressive results on automatic speech recognition (ASR). For low-resource ASR tasks, however, the available labeled data can hardly satisfy the demands of end-to-end models. Self-supervised acoustic pre-training has already demonstrated impressive ASR performance, but the limited transcriptions remain inadequate for learning language modeling within end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into a single end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The length mismatch between the two modalities is resolved by a monotonic attention mechanism that introduces no additional parameters, and a fully connected layer provides the hidden-state mapping between modalities. We further propose a scheduled fine-tuning strategy to preserve and exploit the text-context modeling ability of the pre-trained linguistic encoder. Experiments show that our model utilizes the pre-trained modules effectively and achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
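The fusion described above can be made concrete with a minimal sketch. This is not the authors' code: the abstract only states that frame-level acoustic states are collapsed to token-level states by a parameter-free monotonic attention and then mapped by a fully connected layer, so the CIF-style accumulate-and-fire weighting, the function name monotonic_aggregate, and the 768-dimensional sizes below are assumptions for illustration.

```python
import torch
import torch.nn as nn

def monotonic_aggregate(h, alpha, threshold=1.0):
    """Collapse frame-level states h (T, d) into token-level states (U, d).

    alpha (T,) holds non-negative per-frame weights; a token 'fires' each
    time the accumulated weight crosses `threshold`, so the alignment is
    monotonic and the mechanism adds no trainable parameters.
    """
    tokens, acc_w = [], 0.0
    acc_state = torch.zeros(h.size(1))
    for t in range(h.size(0)):
        w = float(alpha[t])
        if acc_w + w < threshold:            # keep integrating this token
            acc_w += w
            acc_state = acc_state + w * h[t]
        else:                                # boundary crossed: emit a token
            need = threshold - acc_w         # weight that completes the token
            tokens.append(acc_state + need * h[t])
            acc_w = w - need                 # leftover weight opens the next token
            acc_state = acc_w * h[t]
    return torch.stack(tokens) if tokens else h.new_zeros(0, h.size(1))

# Fully connected hidden mapping between modalities (assumed sizes: 768-dim
# wav2vec2.0 states mapped into a 768-dim BERT-compatible hidden space).
acoustic_to_linguistic = nn.Linear(768, 768)

h = torch.randn(200, 768)                    # frame-level acoustic states
alpha = torch.rand(200) * 0.2                # assumed per-frame weights
token_states = monotonic_aggregate(h, alpha)
linguistic_in = acoustic_to_linguistic(token_states)
print(linguistic_in.shape)                   # roughly (floor(alpha.sum()), 768)
```

The scheduled fine-tuning strategy is likewise only named in the abstract; one plausible reading, sketched below under that assumption, is to keep the linguistic encoder frozen at first so its text-context modeling survives early, noisy gradients, and to unfreeze it later for joint tuning.

```python
def set_trainable(module, trainable):
    # Toggle gradient flow for every parameter of a pre-trained module.
    for p in module.parameters():
        p.requires_grad = trainable

# Hypothetical two-stage schedule (bert = a loaded pre-trained encoder):
# set_trainable(bert, False)   # stage 1: adapt only the acoustic side
# set_trainable(bert, True)    # stage 2: fine-tune jointly at a low LR
```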
Pages: 788-792 (5 pages)