INTEGRATION OF SPEECH SEPARATION, DIARIZATION, AND RECOGNITION FOR MULTI-SPEAKER MEETINGS: SYSTEM DESCRIPTION, COMPARISON, AND ANALYSIS

Cited by: 38

Authors:

Raj, Desh [1]

Denisov, Pavel [2]

Chen, Zhuo [3]

Erdogan, Hakan [4]

Huang, Zili [1]

He, Maokui [5,6]

Watanabe, Shinji [1]

Du, Jun [5,6]

Yoshioka, Takuya [3]

Luo, Yi

Kanda, Naoyuki [3]

Li, Jinyu [3]

Wisdom, Scott [4]

Hershey, John R. [4]

Affiliations:

[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA

[2] Univ Stuttgart, Inst Nat Language Proc, Stuttgart, Germany

[3] Microsoft Corp, Redmond, WA 98052 USA

[4] Google Res, Cambridge, MA USA

[5] Univ Sci & Technol China, Hefei, Peoples R China

[6] Columbia Univ, Dept Elect Engn, New York, NY 10027 USA

Source:

2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021

Keywords:

Speech separation; diarization; speech recognition; multi-speaker

DOI:

10.1109/SLT48900.2021.9383556

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR.
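
The abstract reports downstream word error rate (WER) as its headline metric. As a reading aid, the following is a minimal sketch of plain WER computed as word-level edit distance divided by the reference length. It is not the authors' scoring procedure: the function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and the speaker-attributed WER quoted above additionally depends on assigning words to speakers, which this sketch does not model.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Illustrative only: assumes whitespace tokenization and ignores speaker labels.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sat on a mat" against the reference "the cat sat on the mat" counts one substitution over six reference words.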


Pages: 897-904

Page count: 8
