INTEGRATION OF SPEECH SEPARATION, DIARIZATION, AND RECOGNITION FOR MULTI-SPEAKER MEETINGS: SYSTEM DESCRIPTION, COMPARISON, AND ANALYSIS

Cited by: 38
Authors
Raj, Desh [1]
Denisov, Pavel [2]
Chen, Zhuo [3]
Erdogan, Hakan [4]
Huang, Zili [1]
He, Maokui [5]
Watanabe, Shinji [1]
Du, Jun [5]
Yoshioka, Takuya [3]
Luo, Yi [6]
Kanda, Naoyuki [3]
Li, Jinyu [3]
Wisdom, Scott [4]
Hershey, John R. [4]
Affiliations
[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[2] Univ Stuttgart, Inst Nat Language Proc, Stuttgart, Germany
[3] Microsoft Corp, Redmond, WA 98052 USA
[4] Google Res, Cambridge, MA USA
[5] Univ Sci & Technol China, Hefei, Peoples R China
[6] Columbia Univ, Dept Elect Engn, New York, NY 10027 USA
Keywords
Speech separation; diarization; speech recognition; multi-speaker
DOI
10.1109/SLT48900.2021.9383556
CLC number
TP18 [Theory of artificial intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics such as SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated in the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR system.
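The abstract describes a modular pipeline that applies independently trained separation, diarization, and recognition components in sequence. The sketch below illustrates only that interface-level composition; the Separator, Diarizer, and Recognizer classes and the transcribe_meeting function are hypothetical stand-ins chosen for illustration and do not reflect the authors' actual models or code.

```python
# Minimal sketch of a modular "separate -> diarize -> recognize" pipeline,
# in the order described in the abstract. All component classes and their
# interfaces are hypothetical placeholders, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float        # seconds
    end: float          # seconds
    text: str = ""


class Separator:
    """Placeholder continuous speech separation module."""
    def separate(self, mixture):
        # Stub: return two pass-through output streams.
        return [mixture, mixture]


class Diarizer:
    """Placeholder diarization module applied to each separated stream."""
    def diarize(self, stream):
        # Stub: one speaker-labelled segment spanning the whole stream (16 kHz).
        return [Segment(speaker="spk0", start=0.0, end=len(stream) / 16000.0)]


class Recognizer:
    """Placeholder ASR module run on each diarized segment."""
    def transcribe(self, stream, segment):
        return "..."  # stub transcription


def transcribe_meeting(mixture, separator, diarizer, recognizer):
    """Compose the independently trained modules: separation, then
    diarization, then recognition, yielding a speaker-attributed transcript."""
    results = []
    for stream in separator.separate(mixture):
        for seg in diarizer.diarize(stream):
            seg.text = recognizer.transcribe(stream, seg)
            results.append(seg)
    return sorted(results, key=lambda s: s.start)


if __name__ == "__main__":
    mixture = [0.0] * 16000  # one second of dummy audio at 16 kHz
    for seg in transcribe_meeting(mixture, Separator(), Diarizer(), Recognizer()):
        print(f"{seg.speaker} [{seg.start:.2f}-{seg.end:.2f}]: {seg.text}")
```

In such a design, task-specific metrics would be computed at the corresponding stage (SDR on the separator outputs, DER on the diarization segments) and WER on the final speaker-attributed transcript.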
Pages: 897-904
Page count: 8
Related papers (50 in total)
  • [41] A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings
    Shi, Mohan
    Zhang, Jie
    Du, Zhihao
    Yu, Fan
    Chen, Qian
    Zhang, Shiliang
    Dai, Li-Rong
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1943 - 1948
  • [42] Analysis of the Effect of Speech-Laugh on Speaker Recognition System
    Dumpala, Sri Harsha
    Panda, Ashish
    Kopparapu, Sunil Kumar
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 1751 - 1755
  • [43] Analysis of Compressed Speech Signals in an Automatic Speaker Recognition System
    Metzger, Richard A.
    Doherty, John F.
    Jenkins, David M.
    2015 49TH ANNUAL CONFERENCE ON INFORMATION SCIENCES AND SYSTEMS (CISS), 2015,
  • [44] MSDTRON: A HIGH-CAPABILITY MULTI-SPEAKER SPEECH SYNTHESIS SYSTEM FOR DIVERSE DATA USING CHARACTERISTIC INFORMATION
    Wu, Qinghua
    Shen, Quanbo
    Luan, Jian
    Wang, Yujun
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6327 - 6331
  • [45] Optimal scale-invariant signal-to-noise ratio and curriculum learning for monaural multi-speaker speech separation in noisy environment
    Ma, Chao
    Li, Dongmei
    Jia, Xupeng
    2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020, : 711 - 715
  • [46] Separate-to-Recognize: Joint Multi-target Speech Separation and Speech Recognition for Speaker-attributed ASR
    Lin, Yuxiao
    Du, Zhihao
    Zhang, Shiliang
    Yu, Fan
    Zhao, Zhou
    Wu, Fei
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 150 - 154
  • [47] Holonic multi-agent system model for fuzzy automatic speech/speaker recognition
    Valencia-Jimenez, J. J.
    Fernandez-Caballero, Antonio
    AGENT AND MULTI-AGENT SYSTEMS: TECHNOLOGIES AND APPLICATIONS, PROCEEDINGS, 2008, 4953 : 73 - 82
  • [48] Integration of fixed and multiple resolution analysis in a speech recognition system
    Gemello, R
    Albesano, D
    Moisa, L
    De Mori, R
    2001 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2001, : 121 - 124
  • [49] A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition
    Tu, Yan-Hui
    Du, Jun
    Dai, Li-Rong
    Lee, Chin-Hui
    2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,
  • [50] Speech Recognition System of the Punjabi Language for Multi-Resolution Speech Analysis
    Guglani, Jyoti
    Mishra, A.N.
    SSRN (preprint)