Speech refinement using Bi-LSTM and improved spectral clustering in speaker diarization

Cited by: 1

Authors:

Gupta, Aishwarya [1]

Purwar, Archana [1]

Affiliations:

[1] Jaypee Inst Informat Technol, Comp Sci & Engn & Informat Technol, Noida, Uttar Pradesh, India

Source:

MULTIMEDIA TOOLS AND APPLICATIONS | 2023 / Vol. 83 / Issue 18

Keywords:

Speaker Diarization; Speech Refinement; Bi-directional Long Short-Term Memory (Bi-LSTM); Skip U-Net Connections; Singular Value Decomposition; Spectral clustering; MEAN SHIFT; ENHANCEMENT;

DOI:

10.1007/s11042-023-17017-x

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

In today's digitally driven culture, demand for diarizing online meetings, classes, conferences, and medical diagnoses has increased considerably. Speaker Diarization, a sub-domain of Speaker Recognition, has grown with the advent of neural networks in the last decade. Diarization generally refers to obtaining the speaking duration of each individual speaker in an event. Researchers have suggested various approaches for multiple-speaker diarization. However, it still suffers from environmental noise and non-speech sounds in the datasets, such as laughter, murmuring, and clapping. Hence, this paper proposes an improved speaker diarization pipeline to deal with the noise present in a dataset with multiple speakers. The improved pipeline uses a Bi-directional Long Short-Term Memory (Bi-LSTM) based speech refinement pre-processing module and Modified Spectral Clustering with Symmetrized Singular Value Decomposition (MSC-SSVD). MSC-SSVD is used to address the problem of spectral clustering in large datasets. The proposed diarization pipeline is evaluated on the publicly available VoxConverse dataset. The Diarization Error Rate (DER) values obtained are 37.2%, 37.1%, and 43.3%, respectively, for the three batches of the dataset under study. The results are also compared with the baseline system, and a significant change in DER of 6.1%, 4.7%, and 7%, respectively, is observed for the three batches.
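
For readers unfamiliar with the metric, the commonly used definition of DER is sketched below. The symbols are notation chosen here for illustration and do not come from the record; the exact scoring settings used in the paper (for example, any forgiveness collar or the handling of overlapped speech) are not stated in the abstract.

```latex
% Commonly used definition of Diarization Error Rate (notation assumed here):
%   T_FA   : duration of non-speech wrongly labelled as speech (false alarm)
%   T_miss : duration of reference speech that was not detected (missed detection)
%   T_conf : duration of speech attributed to the wrong speaker (confusion)
%   T_ref  : total duration of speech in the reference annotation
\[
  \mathrm{DER} \;=\; \frac{T_{\mathrm{FA}} + T_{\mathrm{miss}} + T_{\mathrm{conf}}}{T_{\mathrm{ref}}}
\]
```

Under this definition a lower DER is better, and the value can exceed 100% when false alarms are large relative to the reference speech.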

Pages: 54433 - 54448

Page count: 16
