Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction

Authors
Makishima, Naoki [1]
Suzuki, Keita [1]
Suzuki, Satoshi [1]
Ando, Atsushi [1]
Masumura, Ryo [1]
Affiliations
[1] NTT Corp, NTT Comp & Data Sci Labs, Tokyo, Japan
Source
Keywords
multi-talker automatic speech recognition; timestamp prediction; autoregressive modeling; SEPARATION
DOI
10.21437/Interspeech.2023-564
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper proposes joint autoregressive modeling of multi-talker automatic speech recognition (ASR) and timestamp prediction. Autoregressive modeling of multi-talker ASR is a simple and promising approach. However, it does not predict utterance timestamps, even though this information is important in practice. To address this problem, our key idea is to extend autoregressive multi-talker ASR to predict quantized timestamp tokens representing the start and end times of each utterance. Our method estimates the transcription and utterance-level timestamp tokens of multiple speakers one after another. This enables joint modeling of multi-talker ASR and timestamp prediction without changing the simple autoregressive formulation of conventional multi-talker ASR. Experimental results show that our method achieves better ASR performance than conventional autoregressive multi-talker ASR without timestamp prediction, and also achieves promising timestamp prediction accuracy.
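The idea of quantized timestamp tokens can be illustrated with a minimal Python sketch. It only shows how start and end times might be mapped to discrete tokens and interleaved with the transcriptions of several speakers in one target sequence. The abstract does not give these details, so the 0.1 s resolution, the 30 s limit, the `<t_N>` token format, the `<sc>` separator, the ordering by start time, and all function names are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

# All constants below are assumed values, not taken from the paper.
RESOLUTION_SEC = 0.1   # assumed width of one timestamp bin
MAX_SEC = 30.0         # assumed longest recording the token vocabulary covers
SEP = "<sc>"           # hypothetical separator token between utterances


@dataclass
class Utterance:
    start: float        # utterance start time in seconds
    end: float          # utterance end time in seconds
    words: List[str]    # transcription tokens


def time_to_token(t: float) -> str:
    """Quantize a time in seconds into a discrete timestamp token."""
    clipped = min(max(t, 0.0), MAX_SEC)
    index = int(round(clipped / RESOLUTION_SEC))
    return f"<t_{index}>"


def token_to_time(token: str) -> float:
    """Recover the (quantized) time in seconds from a timestamp token."""
    index = int(token[3:-1])  # strip the leading "<t_" and trailing ">"
    return index * RESOLUTION_SEC


def serialize(utterances: List[Utterance]) -> List[str]:
    """Build one target sequence: utterances one after another,
    each wrapped in its start and end timestamp tokens."""
    tokens: List[str] = []
    # Assumed ordering: earliest start time first.
    for i, utt in enumerate(sorted(utterances, key=lambda u: u.start)):
        if i > 0:
            tokens.append(SEP)
        tokens.append(time_to_token(utt.start))
        tokens.extend(utt.words)
        tokens.append(time_to_token(utt.end))
    return tokens
```

Under these assumptions, two overlapping utterances such as "hello" (0.0 s to 1.2 s) and "how are you" (0.8 s to 2.5 s) become a single token sequence that an ordinary autoregressive decoder could be trained to emit, which is why no change to the decoding procedure is needed.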
Pages: 2913-2917
Page count: 5