INTEGRATION OF SPEECH SEPARATION, DIARIZATION, AND RECOGNITION FOR MULTI-SPEAKER MEETINGS: SYSTEM DESCRIPTION, COMPARISON, AND ANALYSIS

Cited by: 38
Authors
Raj, Desh [1]
Denisov, Pavel [2]
Chen, Zhuo [3]
Erdogan, Hakan [4]
Huang, Zili [1]
He, Maokui [5]
Watanabe, Shinji [1]
Du, Jun [5]
Yoshioka, Takuya [3]
Luo, Yi [6]
Kanda, Naoyuki [3]
Li, Jinyu [3]
Wisdom, Scott [4]
Hershey, John R. [4]
Affiliations
[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[2] Univ Stuttgart, Inst Nat Language Proc, Stuttgart, Germany
[3] Microsoft Corp, Redmond, WA 98052 USA
[4] Google Res, Cambridge, MA USA
[5] Univ Sci & Technol China, Hefei, Peoples R China
[6] Columbia Univ, Dept Elect Engn, New York, NY 10027 USA
Keywords
Speech separation; diarization; speech recognition; multi-speaker
DOI
10.1109/SLT48900.2021.9383556
CLC number
TP18 [Theory of artificial intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics such as SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated in the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR system.
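The abstract describes a modular pipeline that applies independently trained separation, diarization, and recognition components in sequence. The sketch below illustrates only that interface-level composition; the Separator, Diarizer, and Recognizer classes and the transcribe_meeting function are hypothetical stand-ins chosen for illustration and do not reflect the authors' actual models or code.

```python
# Minimal sketch of a modular "separate -> diarize -> recognize" pipeline,
# in the order described in the abstract. All component classes and their
# interfaces are hypothetical placeholders, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float        # seconds
    end: float          # seconds
    text: str = ""


class Separator:
    """Placeholder continuous speech separation module."""
    def separate(self, mixture):
        # Stub: return two pass-through output streams.
        return [mixture, mixture]


class Diarizer:
    """Placeholder diarization module applied to each separated stream."""
    def diarize(self, stream):
        # Stub: one speaker-labelled segment spanning the whole stream (16 kHz).
        return [Segment(speaker="spk0", start=0.0, end=len(stream) / 16000.0)]


class Recognizer:
    """Placeholder ASR module run on each diarized segment."""
    def transcribe(self, stream, segment):
        return "..."  # stub transcription


def transcribe_meeting(mixture, separator, diarizer, recognizer):
    """Compose the independently trained modules: separation, then
    diarization, then recognition, yielding a speaker-attributed transcript."""
    results = []
    for stream in separator.separate(mixture):
        for seg in diarizer.diarize(stream):
            seg.text = recognizer.transcribe(stream, seg)
            results.append(seg)
    return sorted(results, key=lambda s: s.start)


if __name__ == "__main__":
    mixture = [0.0] * 16000  # one second of dummy audio at 16 kHz
    for seg in transcribe_meeting(mixture, Separator(), Diarizer(), Recognizer()):
        print(f"{seg.speaker} [{seg.start:.2f}-{seg.end:.2f}]: {seg.text}")
```

In such a design, task-specific metrics would be computed at the corresponding stage (SDR on the separator outputs, DER on the diarization segments) and WER on the final speaker-attributed transcript.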
Pages: 897-904
Page count: 8
Related papers (50 in total)
  • [41] A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings
    Shi, Mohan
    Zhang, Jie
    Du, Zhihao
    Yu, Fan
    Chen, Qian
    Zhang, Shiliang
    Dai, Li-Rong
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1943 - 1948
  • [42] Analysis of the Effect of Speech-Laugh on Speaker Recognition System
    Dumpala, Sri Harsha
    Panda, Ashish
    Kopparapu, Sunil Kumar
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 1751 - 1755
  • [43] Analysis of Compressed Speech Signals in an Automatic Speaker Recognition System
    Metzger, Richard A.
    Doherty, John F.
    Jenkins, David M.
    2015 49TH ANNUAL CONFERENCE ON INFORMATION SCIENCES AND SYSTEMS (CISS), 2015,
  • [44] MSDTRON: A HIGH-CAPABILITY MULTI-SPEAKER SPEECH SYNTHESIS SYSTEM FOR DIVERSE DATA USING CHARACTERISTIC INFORMATION
    Wu, Qinghua
    Shen, Quanbo
    Luan, Jian
    Wang, Yujun
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6327 - 6331
  • [45] Optimal scale-invariant signal-to-noise ratio and curriculum learning for monaural multi-speaker speech separation in noisy environment
    Ma, Chao
    Li, Dongmei
    Jia, Xupeng
    2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020, : 711 - 715
  • [46] Separate-to-Recognize: Joint Multi-target Speech Separation and Speech Recognition for Speaker-attributed ASR
    Lin, Yuxiao
    Du, Zhihao
    Zhang, Shiliang
    Yu, Fan
    Zhao, Zhou
    Wu, Fei
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 150 - 154
  • [47] Holonic multi-agent system model for fuzzy automatic speech/speaker recognition
    Valencia-Jimenez, J. J.
    Fernandez-Caballero, Antonio
    AGENT AND MULTI-AGENT SYSTEMS: TECHNOLOGIES AND APPLICATIONS, PROCEEDINGS, 2008, 4953 : 73 - 82
  • [48] Integration of fixed and multiple resolution analysis in a speech recognition system
    Gemello, R
    Albesano, D
    Moisa, L
    De Mori, R
    2001 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2001, : 121 - 124
  • [49] A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition
    Tu, Yan-Hui
    Du, Jun
    Dai, Li-Rong
    Lee, Chin-Hui
    2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,
  • [50] Speech Recognition System of the Punjabi Language for Multi-Resolution Speech Analysis
    Guglani, Jyoti
    Mishra, A.N.
    SSRN (preprint)