Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction

Cited: 0
Authors
Makishima, Naoki [1 ]
Suzuki, Keita [1 ]
Suzuki, Satoshi [1 ]
Ando, Atsushi [1 ]
Masumura, Ryo [1 ]
Affiliations
[1] NTT Corp, NTT Comp & Data Sci Labs, Tokyo, Japan
Source
INTERSPEECH 2023
Keywords
multi-talker automatic speech recognition; timestamp prediction; autoregressive modeling; separation
DOI
10.21437/Interspeech.2023-564
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
This paper proposes joint autoregressive modeling of multi-talker automatic speech recognition (ASR) and utterance-level timestamp prediction. Autoregressive modeling is a simple and promising approach to multi-talker ASR, but it does not predict utterance timestamps, even though this information is important in practice. To address this problem, our key idea is to extend autoregressive-modeling-based multi-talker ASR to predict quantized timestamp tokens representing the start and end times of each utterance. Our method estimates the transcriptions and utterance-level timestamp tokens of multiple speakers one after another, which enables joint modeling of multi-talker ASR and timestamp prediction without changing the simple autoregressive formulation of conventional multi-talker ASR. Experimental results show that our method outperforms conventional autoregressive multi-talker ASR without timestamp prediction in recognition accuracy and achieves promising timestamp prediction accuracy.
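To make the key idea concrete, the sketch below shows one plausible way to build serialized targets in which each speaker's transcription is wrapped in quantized start- and end-time tokens. This is an illustrative assumption rather than the paper's exact recipe: the token names (<t_*>, <sc>, <eos>), the 0.1 s quantization resolution, the ordering of speakers by start time, and the helper functions quantize_time and build_target are all hypothetical choices introduced here.

```python
# Minimal sketch (assumed formulation, not the paper's exact tokenization):
# serialize multi-talker targets as [start-token, words..., end-token, <sc>] per speaker.

def quantize_time(t_sec: float, resolution: float = 0.1, max_sec: float = 30.0) -> str:
    """Map a time in seconds to a discrete timestamp token, e.g. 1.2 s -> <t_12>."""
    idx = min(int(round(t_sec / resolution)), int(max_sec / resolution))
    return f"<t_{idx}>"

def build_target(utterances):
    """utterances: list of (start_sec, end_sec, transcription) tuples, one per speaker.
    Speakers are emitted one after another, here ordered by start time (an assumption)."""
    tokens = []
    for start, end, text in sorted(utterances, key=lambda u: u[0]):
        tokens.append(quantize_time(start))   # quantized start-time token
        tokens.extend(text.split())           # word tokens (subwords in practice)
        tokens.append(quantize_time(end))     # quantized end-time token
        tokens.append("<sc>")                 # speaker-change separator
    tokens.append("<eos>")
    return tokens

# Example: two overlapping speakers in one mixture.
print(build_target([(0.0, 2.3, "hello there"), (1.1, 3.0, "good morning")]))
```

A decoder trained on such sequences predicts timestamps and transcriptions within a single autoregressive pass, which is how the joint modeling avoids any architectural change to the conventional multi-talker ASR decoder; the specific separator and timestamp vocabulary above are only one possible design.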
Pages: 2913-2917
Page count: 5