TWO-STAGE PRE-TRAINING FOR SEQUENCE TO SEQUENCE SPEECH RECOGNITION

Cited by: 0
Authors
Fan, Zhiyun [1 ,2 ]
Zhou, Shiyu [1 ]
Xu, Bo [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
Keywords
pre-training; speech recognition; encoder-decoder; sequence-to-sequence
DOI
10.1109/IJCNN52387.2021.9534170
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The attention-based encoder-decoder structure is popular in automatic speech recognition (ASR). However, it relies heavily on transcribed data. In this paper, we propose a novel pre-training strategy for the encoder-decoder sequence-to-sequence (seq2seq) model that utilizes unpaired speech and transcripts. The pre-training process consists of two stages: acoustic pre-training and linguistic pre-training. In the acoustic pre-training stage, we use a large amount of untranscribed speech to pre-train the encoder by predicting masked speech feature chunks from their contexts. In the linguistic pre-training stage, we first generate synthesized speech from a large number of transcripts using a text-to-speech (TTS) system and then use the synthesized paired data to pre-train the decoder. The two-stage pre-training is conducted on the AISHELL-2 dataset, and the pre-trained model is then post-trained on multiple subsets of AISHELL-1 and HKUST. As the size of the subset increases, the relative character error rate reduction (CERR) ranges from 38.24% down to 7.88% on AISHELL-1 and from 12.00% down to 1.20% on HKUST.
Pages: 6
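The acoustic pre-training stage described in the abstract masks chunks of speech features and trains the encoder to predict the masked frames from their surrounding context. Below is a minimal sketch of such a masked-chunk reconstruction objective in PyTorch; the feature dimension, chunk length, mask ratio, Transformer configuration, and L1 reconstruction loss are all illustrative assumptions, not the paper's actual setup.

# Sketch of masked speech-feature-chunk prediction for encoder pre-training.
# Hyperparameters and architecture below are assumptions for illustration only.
import torch
import torch.nn as nn

FEAT_DIM = 80      # e.g. log-Mel filterbank dimension (assumption)
CHUNK_LEN = 10     # frames per masked chunk (assumption)
MASK_RATIO = 0.15  # fraction of frames to mask (assumption)

class MaskedChunkEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=4, num_layers=6):
        super().__init__()
        self.in_proj = nn.Linear(FEAT_DIM, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out_proj = nn.Linear(d_model, FEAT_DIM)  # reconstruction head

    def forward(self, feats):
        # feats: (batch, time, FEAT_DIM) -> reconstructed features
        return self.out_proj(self.encoder(self.in_proj(feats)))

def mask_chunks(feats):
    """Zero out random contiguous chunks; return masked input and frame mask."""
    masked = feats.clone()
    mask = torch.zeros(feats.shape[:2], dtype=torch.bool)
    n_chunks = max(int(feats.size(1) * MASK_RATIO / CHUNK_LEN), 1)
    for b in range(feats.size(0)):
        for _ in range(n_chunks):
            start = torch.randint(0, feats.size(1) - CHUNK_LEN, (1,)).item()
            masked[b, start:start + CHUNK_LEN] = 0.0
            mask[b, start:start + CHUNK_LEN] = True
    return masked, mask

# One pre-training step on unlabeled speech: predict the masked frames and
# compute the reconstruction loss only over the masked positions.
model = MaskedChunkEncoder()
feats = torch.randn(4, 200, FEAT_DIM)        # a batch of untranscribed speech features
masked_feats, mask = mask_chunks(feats)
pred = model(masked_feats)
loss = nn.functional.l1_loss(pred[mask], feats[mask])
loss.backward()

In the subsequent linguistic pre-training stage, the paper instead pairs TTS-synthesized speech with its source transcripts and uses that synthetic paired data to pre-train the decoder, so the encoder pre-trained as above and the decoder are both initialized before post-training on the paired ASR subsets.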