INJECTING TEXT IN SELF-SUPERVISED SPEECH PRETRAINING

Cited by: 10
Authors
Chen, Zhehuai [1]
Zhang, Yu [1]
Rosenberg, Andrew [1]
Ramabhadran, Bhuvana [1]
Wang, Gary [1]
Moreno, Pedro [1]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
Keywords
Speech Recognition; Speech Synthesis; Self-supervised; Representation learning
DOI
10.1109/ASRU51503.2021.9688018
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text. The proposed method, tts4pretrain, complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speech, effectively learning from untranscribed speech and unspoken text. Lexical learning in the speech encoder is enforced through an additional sequence loss term that is coupled with the contrastive loss during pretraining. We demonstrate that this novel pretraining method yields Word Error Rate (WER) reductions of 10% relative over a state-of-the-art baseline pretrained with wav2vec2.0 only on the well-benchmarked Librispeech task. The proposed method also serves as an effective strategy to compensate for the lack of transcribed speech, matching the performance obtained with 5000 hours of transcribed speech using just 100 hours on the AMI meeting transcription task. Finally, we demonstrate WER reductions of up to 15% on an in-house Voice Search task over traditional pretraining. Incorporating text into encoder pretraining is complementary to rescoring with a larger or in-domain language model, resulting in an additional 6% relative reduction in WER.
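
To make the coupling of the two objectives concrete, the sketch below shows one way a wav2vec 2.0-style contrastive (InfoNCE) term on masked speech frames could be combined with a sequence loss that ties encoder outputs to text rendered through TTS. All function names, tensor shapes, the CTC instantiation of the sequence loss, and the weight alpha are illustrative assumptions, not the paper's exact implementation.

# Minimal sketch of the joint pretraining objective described in the abstract.
# Shapes, names, and the CTC choice for the sequence term are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, masked_idx, distractor_idx, temperature=0.1):
    """InfoNCE over masked frames: each masked context vector must pick out its
    own quantized target from a set of sampled distractors.
    context, quantized: (N, D) frame vectors flattened over batch and time;
    masked_idx: (M,) indices of masked frames; distractor_idx: (M, K) indices."""
    c = context[masked_idx]                                   # (M, D)
    positives = quantized[masked_idx].unsqueeze(1)            # (M, 1, D)
    negatives = quantized[distractor_idx]                     # (M, K, D)
    candidates = torch.cat([positives, negatives], dim=1)     # (M, K+1, D)
    sims = F.cosine_similarity(c.unsqueeze(1), candidates, dim=-1) / temperature
    targets = torch.zeros(sims.size(0), dtype=torch.long)     # positive sits at index 0
    return F.cross_entropy(sims, targets)

def sequence_loss(logits, input_lengths, text_targets, target_lengths):
    """Lexical term on synthesized speech: encoder logits are scored against the
    unspoken text, here with CTC as one plausible instantiation."""
    log_probs = logits.log_softmax(dim=-1).transpose(0, 1)    # (T, B, vocab)
    return F.ctc_loss(log_probs, text_targets, input_lengths, target_lengths)

def tts4pretrain_loss(speech_batch, tts_batch, alpha=1.0):
    """Couple both terms in one pretraining objective; alpha is a hypothetical
    interpolation weight, not a value from the paper."""
    return contrastive_loss(*speech_batch) + alpha * sequence_loss(*tts_batch)

The intent of coupling the two terms is that gradients from the sequence loss flow into the same speech encoder the contrastive loss trains, so unspoken text can shape the learned representations without paired transcriptions.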
Pages: 251-258
Number of pages: 8
Related Papers
50 items in total
  • [41] Hierarchically Contrastive Hard Sample Mining for Graph Self-Supervised Pretraining. Tu, Wenxuan; Zhou, Sihang; Liu, Xinwang; Ge, Chunpeng; Cai, Zhiping; Liu, Yue. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35(11): 16748-16761.
  • [42] Self-Supervised Pretraining via Multimodality Images With Transformer for Change Detection. Zhang, Yuxiang; Zhao, Yang; Dong, Yanni; Du, Bo. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61.
  • [43] SELF-SUPERVISED PRETRAINING FOR DEEP HASH-BASED IMAGE RETRIEVAL. Yang, Haeyoon; Jang, Young Kyun; Kang, Isaac; Cho, Nam Ik. 2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022: 3813-3817.
  • [44] Self-Supervised Pretraining for RGB-D Salient Object Detection. Zhao, Xiaoqi; Pang, Youwei; Zhang, Lihe; Lu, Huchuan; Ruan, Xiang. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022: 3463-3471.
  • [45] Self-Supervised Pretraining for Point Cloud Object Detection in Autonomous Driving. Shi, Weijing; Rajkumar, Ragunathan. 2022 IEEE 25TH INTERNATIONAL CONFERENCE ON INTELLIGENT TRANSPORTATION SYSTEMS (ITSC), 2022: 4341-4348.
  • [46] AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining. Liu, Haohe; Yuan, Yi; Liu, Xubo; Mei, Xinhao; Kong, Qiuqiang; Tian, Qiao; Wang, Yuping; Wang, Wenwu; Wang, Yuxuan; Plumbley, Mark D. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32: 2871-2883.
  • [47] Self-supervised pretraining for transferable quantitative phase image cell segmentation. Vicar, Tomas; Chemelik, Jiri; Jakubicek, Roman; Chmelikova, Larisa; Gumulec, Jaromir; Balvan, Jan; Provaznik, Ivo; Kolar, Radim. BIOMEDICAL OPTICS EXPRESS, 2021, 12(10): 6514-6528.
  • [48] Self-Supervised Speech Representation Learning: A Review. Mohamed, Abdelrahman; Lee, Hung-yi; Borgholt, Lasse; Havtorn, Jakob D.; Edin, Joakim; Igel, Christian; Kirchhoff, Katrin; Li, Shang-Wen; Livescu, Karen; Maaloe, Lars; Sainath, Tara N.; Watanabe, Shinji. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16(06): 1179-1210.
  • [49] PROPERTY NEURONS IN SELF-SUPERVISED SPEECH TRANSFORMERS. Lin, Tzu-Quan; Lin, Guan-Ting; Lee, Hung-Yi; Tang, Hao. arXiv preprint.
  • [50] Boosting Self-Supervised Embeddings for Speech Enhancement. Hung, Kuo-Hsuan; Fu, Szu-Wei; Tseng, Huan-Hsin; Chiang, Hsin-Tien; Tsao, Yu; Lin, Chii-Wann. INTERSPEECH 2022, 2022: 186-190.