Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction

Cited by: 0

Authors:

Makishima, Naoki [1]

Suzuki, Keita [1]

Suzuki, Satoshi [1]

Ando, Atsushi [1]

Masumura, Ryo [1]

Affiliations:

[1] NTT Corp, NTT Comp & Data Sci Labs, Tokyo, Japan

Source:

INTERSPEECH 2023 | 2023

Keywords:

multi-talker automatic speech recognition; timestamp prediction; autoregressive modeling; SEPARATION;

DOI:

10.21437/Interspeech.2023-564

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

This paper proposes joint autoregressive modeling of multi-talker automatic speech recognition (ASR) and timestamp prediction. Autoregressive modeling of multi-talker ASR is a simple and promising approach. However, it does not predict utterance timestamps, even though this information is important in practice. To address this problem, our key idea is to extend autoregressive multi-talker ASR to predict quantized timestamp tokens representing the start and end times of each utterance. Our method estimates the transcription and utterance-level timestamp tokens of multiple speakers one after another. This enables joint modeling of multi-talker ASR and timestamp prediction without changing the simple autoregressive formulation of conventional multi-talker ASR. Experimental results show that our method achieves better ASR performance than conventional autoregressive multi-talker ASR without timestamp prediction, and promising timestamp prediction accuracy.
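
To make the idea of quantized timestamp tokens concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not give the quantization step, the token format, or the ordering of utterances, so the 0.1 s resolution, the `<t_N>` token naming, the start/text/end layout, and sorting by start time are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Assumed quantization step in seconds; the abstract does not state one.
RESOLUTION_SEC = 0.1

# (start time in seconds, end time in seconds, transcription)
Utterance = Tuple[float, float, str]


def time_to_token(t: float, resolution: float = RESOLUTION_SEC) -> str:
    """Map a time in seconds to a discrete token such as '<t_15>' (hypothetical format)."""
    return f"<t_{round(t / resolution)}>"


def token_to_time(token: str, resolution: float = RESOLUTION_SEC) -> float:
    """Recover the quantized time in seconds from a '<t_N>' token."""
    index = int(token[len("<t_"):-1])
    return index * resolution


def serialize(utterances: List[Utterance]) -> List[str]:
    """Flatten overlapping utterances into one target token sequence.

    Assumption: utterances are emitted one after another in order of
    start time, each wrapped by its start and end timestamp tokens.
    """
    tokens: List[str] = []
    for start, end, text in sorted(utterances, key=lambda u: u[0]):
        tokens.append(time_to_token(start))
        tokens.extend(text.split())
        tokens.append(time_to_token(end))
    return tokens


if __name__ == "__main__":
    mixture = [(0.8, 2.3, "how are you"), (0.0, 1.5, "hello there")]
    print(serialize(mixture))
```

Because timestamps become ordinary vocabulary items in this sketch, a single autoregressive decoder could in principle emit them alongside words with no change to its output layer beyond a larger vocabulary, which is the property the abstract emphasizes.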


Pages: 2913-2917

Page count: 5
