INTEGRATING END-TO-END NEURAL AND CLUSTERING-BASED DIARIZATION: GETTING THE BEST OF BOTH WORLDS

Cited by: 32
Authors
Kinoshita, Keisuke [1 ]
Delcroix, Marc [1 ]
Tawara, Naohiro [1 ]
Affiliations
[1] NTT Corp, Tokyo, Japan
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
Speaker diarization; neural networks
DOI
10.1109/ICASSP39728.2021.9414333
CLC number
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Recent diarization technologies fall into two categories, i.e., clustering-based and end-to-end neural approaches, each with different pros and cons. Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors. While they represent the current state of the art, working on various challenging data with reasonable robustness and accuracy, they have a critical disadvantage: they cannot handle overlapped speech, which is inevitable in natural conversational data. In contrast, end-to-end neural diarization (EEND), which directly predicts diarization labels with a neural network, was devised to handle overlapped speech. While EEND, which can easily incorporate emerging deep-learning technologies, has started outperforming x-vector clustering on some realistic databases, it is difficult to apply to long recordings (e.g., recordings longer than 10 minutes) because of, e.g., its huge memory consumption. Block-wise independent processing is also difficult because it poses an inter-block label permutation problem, i.e., an ambiguity in the speaker label assignments between blocks. In this paper, we propose a simple but effective hybrid diarization framework that works with overlapped speech and with long recordings containing an arbitrary number of speakers. It modifies the conventional EEND framework to output global speaker embeddings, so that speaker clustering can be performed across blocks with a constrained clustering algorithm that solves the permutation problem. In experiments on simulated noisy reverberant 2-speaker meeting-like data, we show that the proposed framework works significantly better than the original EEND, especially when the input recordings are long.
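The inter-block permutation fix described in the abstract can be sketched roughly as follows. This is a hypothetical simplification, not the authors' implementation: it assumes each block yields one embedding per active speaker, and imposes cannot-link constraints between embeddings from the same block (they belong to different speakers by construction) during a COP-k-means-style clustering.

```python
import numpy as np

def constrained_kmeans(embs, block_ids, n_speakers, n_iter=20, seed=0):
    """COP-k-means-style clustering with cannot-link constraints:
    two embeddings from the same block must get different speaker labels."""
    embs = np.asarray(embs, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = embs[rng.choice(len(embs), n_speakers, replace=False)].copy()
    labels = np.zeros(len(embs), dtype=int)
    for _ in range(n_iter):
        taken = set()  # (block_id, label) pairs already claimed this pass
        for i, e in enumerate(embs):
            # try centroids from nearest to farthest; skip labels already
            # claimed by another embedding of the same block
            order = np.argsort(np.linalg.norm(centroids - e, axis=1))
            feasible = [k for k in order if (block_ids[i], k) not in taken]
            labels[i] = feasible[0] if feasible else order[0]
            taken.add((block_ids[i], labels[i]))
        for k in range(n_speakers):
            if np.any(labels == k):
                centroids[k] = embs[labels == k].mean(axis=0)
    return labels

# toy example: two blocks, each containing embeddings of two speakers
embs = [[0.0, 0.0], [10.0, 10.0], [0.1, 0.1], [9.9, 9.9]]
blocks = [0, 0, 1, 1]
labels = constrained_kmeans(embs, blocks, n_speakers=2)
# same speaker across blocks shares one label, resolving the permutation
```

The cannot-link constraints are what distinguish this from plain k-means: within a block the two embeddings can never collapse into one cluster, so the global labels stay consistent across blocks.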
Pages: 7198-7202
Page count: 5