Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning

Cited by: 2
Authors
Chang, Xuankai [1 ]
Yan, Brian [1 ]
Fujita, Yuya [2 ]
Maekaku, Takashi [2 ]
Watanabe, Shinji [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Yahoo Japan Corp, Nagoya, Aichi, Japan
Source
INTERSPEECH 2023
Funding
U.S. National Science Foundation
Keywords
self-supervised learning; discrete tokens; discretized input; speech recognition;
DOI
10.21437/Interspeech.2023-2051
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Self-supervised learning (SSL) of speech has shown impressive results on speech-related tasks, particularly automatic speech recognition (ASR). While most methods feed the real-valued output of intermediate SSL layers to downstream tasks as features, an alternative is to use discretized token sequences, which offer lower storage requirements and the ability to apply techniques from natural language processing. In this paper, we propose a new protocol for using discretized token sequences in ASR tasks: it applies de-duplication and sub-word modeling to the input sequence, shortening it and thereby reducing computational cost. Our experiments on the LibriSpeech dataset demonstrate that the proposed protocol performs competitively with conventional ASR systems using continuous input features, while reducing computational and storage costs.
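The pipeline the abstract describes (discrete SSL tokens, then de-duplication, then sub-word modeling) can be illustrated with a minimal Python sketch. Everything below is an assumption for illustration only: the integer tokens stand in for cluster indices from a quantizer over SSL features, and the toy greedy pair-merging stands in for a real sub-word model such as BPE; it is not the authors' implementation.

from itertools import groupby
from collections import Counter

def deduplicate(tokens):
    # Collapse runs of repeated tokens, e.g. [5, 5, 5, 9, 9] -> [5, 9].
    return [t for t, _ in groupby(tokens)]

def learn_merges(sequences, num_merges):
    # Greedily merge the most frequent adjacent token pair, BPE-style.
    # Composite tokens are represented as nested tuples for simplicity.
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for s in seqs:
            pair_counts.update(zip(s, s[1:]))
        if not pair_counts:
            break
        best_pair, _ = pair_counts.most_common(1)[0]
        merges.append(best_pair)
        seqs = [apply_merge(s, best_pair) for s in seqs]
    return merges, seqs

def apply_merge(seq, pair):
    # Replace every occurrence of the adjacent pair with one composite token.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(pair)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Hypothetical frame-level token sequence from an SSL model's quantizer.
raw = [5, 5, 5, 9, 9, 5, 5, 3, 3, 3, 9, 9]
dedup = deduplicate(raw)                      # 12 frames -> 5 tokens
_, merged = learn_merges([dedup], num_merges=2)
print(len(raw), len(dedup), len(merged[0]))   # 12 5 3

On this toy input, de-duplication alone shrinks 12 frames to 5 tokens, and each learned merge can shorten the sequence further; this sequence-length reduction is the source of the compute and storage savings the abstract claims.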
Pages: 1399-1403
Number of pages: 5