THE ROYALFLUSH AUTOMATIC SPEECH DIARIZATION AND RECOGNITION SYSTEM FOR IN-CAR MULTI-CHANNEL AUTOMATIC SPEECH RECOGNITION CHALLENGE

Cited by: 1
Authors
Tian, Jingguang [1]
Ye, Shuaishuai [1]
Chen, Shunfei [1]
Xiang, Yang [1]
Yin, Zhaohui [1]
Hu, Xinhui [1]
Xu, Xinkang [1]
Affiliations
[1] Hithink RoyalFlush AI Res Inst, Hangzhou, Zhejiang, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024 | 2024
Keywords
ICMC-ASR; ASDR; TS-VAD; speaker diarization; speech recognition;
DOI
10.1109/ICASSPW62465.2024.10626136
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. To address these challenges, we develop end-to-end speaker diarization models that notably decrease the diarization error rate (DER) by 49.58% compared to the official baseline on the development set. For speech recognition, we utilize self-supervised learning representations to train end-to-end ASR models. By integrating these models, we achieve a character error rate (CER) of 16.93% on the track 1 evaluation set, and a concatenated minimum permutation character error rate (cpCER) of 25.88% on the track 2 evaluation set.
Pages: 1 / 2
Page count: 2