Target-speaker Voice Activity Detection with Improved I-Vector Estimation for Unknown Number of Speaker

Citations: 13
Authors
He, Maokui [1]
Raj, Desh [2]
Huang, Zili [2]
Du, Jun [1]
Chen, Zhuo [3]
Watanabe, Shinji [2]
Affiliations
[1] Univ Sci & Technol China, Hefei, Peoples R China
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD USA
[3] Microsoft Corp, Redmond, WA 98052 USA
Source
INTERSPEECH 2021 | 2021
Keywords
Speaker diarization; multi-speaker; TS-VAD; overlap; DIARIZATION; SPEECH
DOI
10.21437/Interspeech.2021-750
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
Target-speaker voice activity detection (TS-VAD) has recently shown promising results for speaker diarization on highly overlapped speech. However, the original model requires a fixed (and known) number of speakers, which limits its application to real conversations. In this paper, we extend TS-VAD to speaker diarization with unknown numbers of speakers. This is achieved by two steps: first, an initial diarization system is applied for speaker number estimation, followed by TS-VAD network output masking according to this estimate. We further investigate different diarization methods, including clustering-based and region proposal networks, for estimating the initial i-vectors. Since these systems have complementary strengths, we propose a fusion-based method to combine frame-level decisions from the systems for an improved initialization. We demonstrate through experiments on variants of the LibriCSS meeting corpus that our proposed approach can improve the DER by up to 50% relative across varying numbers of speakers. This improvement also results in better downstream ASR performance approaching that using oracle segments.
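The output-masking step described in the abstract can be illustrated with a minimal sketch. It assumes a TS-VAD network with a fixed number of output heads and a speaker count supplied by an initial diarization system; the function names, the list-of-lists layout `probs[s][t]`, and the 0.5 threshold are hypothetical choices for illustration, not details taken from the paper.

```python
from typing import List


def mask_tsvad_outputs(probs: List[List[float]], num_estimated: int) -> List[List[float]]:
    """Zero the output heads beyond the estimated speaker count.

    probs[s][t] is the (hypothetical) speech probability that output
    head s assigns to frame t. Heads with index >= num_estimated are
    assumed to correspond to no real speaker and are masked to 0.0.
    """
    if not 0 <= num_estimated <= len(probs):
        raise ValueError("num_estimated must lie between 0 and the number of heads")
    return [
        list(row) if s < num_estimated else [0.0] * len(row)
        for s, row in enumerate(probs)
    ]


def to_decisions(probs: List[List[float]], threshold: float = 0.5) -> List[List[bool]]:
    """Turn per-frame probabilities into per-speaker speech/non-speech decisions."""
    return [[p > threshold for p in row] for row in probs]


# Three output heads, two frames; the initial diarization estimated two speakers.
raw = [[0.9, 0.2], [0.1, 0.8], [0.6, 0.7]]
masked = mask_tsvad_outputs(raw, num_estimated=2)
decisions = to_decisions(masked)
```

Here the third head would otherwise report speech in both frames; masking suppresses it because the estimate says only two speakers are present.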
Pages: 3555-3559
Page count: 5