A Purely End-to-end System for Multi-speaker Speech Recognition

Cited by: 0
Authors
Seki, Hiroshi [1, 2]
Hori, Takaaki [1]
Watanabe, Shinji [3]
Le Roux, Jonathan [1]
Hershey, John R. [1]
Affiliations
[1] MERL, Cambridge, MA 02139 USA
[2] Toyohashi Univ Technol, Toyohashi, Aichi, Japan
[3] Johns Hopkins Univ, Baltimore, MD 21218 USA
Keywords
SEPARATION;
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.
Pages: 2620-2630
Page count: 11
Related papers
50 in total
  • [31] Language and Speaker-Independent Feature Transformation for End-to-End Multilingual Speech Recognition
    Hayakawa, Tomoaki
    Leow, Chee Siang
    Kobayashi, Akio
    Utsuro, Takehito
    Nishizaki, Hiromitsu
    INTERSPEECH 2021, 2021, : 2431 - 2435
  • [32] Personality-aware Training based Speaker Adaptation for End-to-end Speech Recognition
    Gu, Yue
    Du, Zhihao
    Zhang, Shiliang
    Chen, Qian
    Han, Jiqing
    INTERSPEECH 2023, 2023, : 1249 - 1253
  • [33] Speaker voice normalization for end-to-end speech translation
    Xue, Zhengshan
    Shi, Tingxun
    Zhang, Xiaolei
    Xiong, Deyi
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 248
  • [34] A Lightweight End-to-End Speech Recognition System on Embedded Devices
    Wang, Yu
    Nishizaki, Hiromitsu
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2023, E106D (07) : 1230 - 1239
  • [35] End-to-End Audiovisual Speech Recognition System With Multitask Learning
    Tao, Fei
    Busso, Carlos
    IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 1 - 11
  • [36] IMPROVING UNSUPERVISED STYLE TRANSFER IN END-TO-END SPEECH SYNTHESIS WITH END-TO-END SPEECH RECOGNITION
    Liu, Da-Rong
    Yang, Chi-Yu
    Wu, Szu-Lin
    Lee, Hung-Yi
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 640 - 647
  • [37] Towards End-to-End Private Automatic Speaker Recognition
    Teixeira, Francisco
    Abad, Alberto
    Raj, Bhiksha
    Trancoso, Isabel
    INTERSPEECH 2022, 2022, : 2798 - 2802
  • [38] END-TO-END MULTI-MODAL SPEECH RECOGNITION WITH AIR AND BONE CONDUCTED SPEECH
    Chen, Junqi
    Wang, Mou
    Zhang, Xiao-Lei
    Huang, Zhiyong
    Rahardja, Susanto
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6052 - 6056
  • [39] Gammatonegram representation for end-to-end dysarthric speech processing tasks: speech recognition, speaker identification, and intelligibility assessment
    Aref Farhadipour
    Hadi Veisi
    Iran Journal of Computer Science, 2024, 7 (2) : 311 - 324
  • [40] SYNCHRONOUS TRANSFORMERS FOR END-TO-END SPEECH RECOGNITION
    Tian, Zhengkun
    Yi, Jiangyan
    Bai, Ye
    Tao, Jianhua
    Zhang, Shuai
    Wen, Zhengqi
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7884 - 7888