Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction

Cited by: 0
Authors
Makishima, Naoki [1 ]
Suzuki, Keita [1 ]
Suzuki, Satoshi [1 ]
Ando, Atsushi [1 ]
Masumura, Ryo [1 ]
Affiliation
[1] NTT Corp, NTT Comp & Data Sci Labs, Tokyo, Japan
Source
INTERSPEECH 2023
Keywords
multi-talker automatic speech recognition; timestamp prediction; autoregressive modeling; separation
DOI
10.21437/Interspeech.2023-564
CLC number
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
This paper proposes autoregressive modeling of joint multi-talker automatic speech recognition (ASR) and timestamp prediction. Autoregressive modeling of multi-talker ASR is a simple and promising approach; however, it does not predict utterance timestamp information, even though such information is important in practice. To address this problem, our key idea is to extend autoregressive-modeling-based multi-talker ASR to predict quantized timestamp tokens representing the start and end time of each utterance. Our method estimates the transcription and utterance-level timestamp tokens of multiple speakers one after another. This enables joint modeling of multi-talker ASR and timestamp prediction without changing the simple autoregressive modeling of conventional multi-talker ASR. Experimental results show that our method outperforms conventional autoregressive multi-talker ASR without timestamp prediction in terms of ASR performance and achieves promising timestamp prediction accuracy.
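The abstract describes serializing each speaker's transcription together with quantized start and end timestamp tokens into a single token sequence that one autoregressive decoder predicts token by token. The following minimal Python sketch illustrates only that serialization step; the token names (<t0>, <t1>, ..., <sc>, <eos>), the 0.1 s quantization step, and the ordering by start time are assumptions made for illustration, not the authors' exact formulation.

# Minimal illustrative sketch (not the authors' exact implementation): build a
# single autoregressive target sequence that interleaves quantized
# utterance-level timestamp tokens with each speaker's transcription.
# The token vocabulary (<t0>..<tN>, <sc>, <eos>), the 0.1 s quantization step,
# and the ordering by start time are assumptions made for illustration.

TIME_STEP = 0.1   # assumed quantization resolution in seconds
MAX_TIME = 30.0   # assumed maximum audio length covered by timestamp tokens


def quantize_time(seconds):
    """Map a continuous time in seconds to a discrete timestamp token,
    e.g. 2.10 s -> <t21> with a 0.1 s step."""
    index = min(int(round(seconds / TIME_STEP)), int(MAX_TIME / TIME_STEP))
    return f"<t{index}>"


def build_target_sequence(utterances):
    """Serialize multi-talker transcripts into one token sequence.

    `utterances` is a list of (start_sec, end_sec, word_list), one entry per
    utterance. Each utterance contributes
        <start-time token> <end-time token> word ... word <sc>
    and the whole sequence ends with <eos>, so a single autoregressive decoder
    can predict transcriptions and utterance-level timestamps jointly.
    """
    tokens = []
    for start, end, words in sorted(utterances, key=lambda u: u[0]):
        tokens.append(quantize_time(start))   # quantized start time
        tokens.append(quantize_time(end))     # quantized end time
        tokens.extend(words)                  # transcription of this utterance
        tokens.append("<sc>")                 # assumed utterance-change token
    tokens.append("<eos>")
    return tokens


if __name__ == "__main__":
    # Two overlapping utterances from different speakers.
    example = [
        (0.32, 2.10, ["hello", "how", "are", "you"]),
        (1.05, 3.40, ["fine", "thank", "you"]),
    ]
    print(" ".join(build_target_sequence(example)))
    # -> <t3> <t21> hello how are you <sc> <t10> <t34> fine thank you <sc> <eos>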
Pages: 2913-2917
Number of pages: 5