DOVER-LAP: A METHOD FOR COMBINING OVERLAP-AWARE DIARIZATION OUTPUTS

Cited by: 26

Authors:

Raj, Desh ^{[1]}

Garcia-Perera, Leibny Paola ^{[1,2]}

Huang, Zili ^{[1]}

Watanabe, Shinji ^{[1]}

Povey, Daniel ^{[3]}

Stolcke, Andreas ^{[4]}

Khudanpur, Sanjeev ^{[1,2]}

Affiliations:

[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA

[2] Johns Hopkins Univ, Human Language Technol Ctr Excellence, Baltimore, MD 21218 USA

[3] Xiaomi Corp, Beijing, Peoples R China

[4] Amazon Alexa Speech, Sunnyvale, CA USA

Source:

2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021

Keywords:

overlapped speaker diarization; voting-based methods; multichannel diarization; speaker diarization

DOI:

10.1109/SLT48900.2021.9383490

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Several advances have recently been made in handling overlapping speech for speaker diarization. Since speech and natural language tasks often benefit from ensemble techniques, we propose an algorithm for combining the outputs of such diarization systems through majority voting. Our method, DOVER-Lap, is inspired by the recently proposed DOVER algorithm, but is designed to handle overlapping segments in diarization outputs. We also modify the pair-wise incremental label mapping strategy used in DOVER, and propose an approximation algorithm based on weighted k-partite graph matching, which performs this mapping using a global cost tensor. We demonstrate the strength of our method by combining outputs from diverse systems (clustering-based, region proposal networks, and target-speaker voice activity detection) on the AMI and LibriCSS datasets, where it consistently outperforms the single best system. Additionally, we show that DOVER-Lap can be used for late fusion in multichannel diarization, where it compares favorably with early fusion methods such as beamforming.

Pages: 881-888

Page count: 8
