Speech refinement using Bi-LSTM and improved spectral clustering in speaker diarization

Cited by: 1
Authors
Gupta, Aishwarya [1]
Purwar, Archana [1]
Affiliations
[1] Jaypee Inst Informat Technol, Comp Sci & Engn & Informat Technol, Noida, Uttar Pradesh, India
Keywords
Speaker Diarization; Speech Refinement; Bi-directional Long Short-Term Memory (Bi-LSTM); Skip U-Net Connections; Singular Value Decomposition; Spectral Clustering; Mean Shift; Enhancement
DOI
10.1007/s11042-023-17017-x
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In today's digitally driven culture, demand for diarizing online meetings, classes, conferences, and medical diagnoses has grown considerably. Speaker diarization, a sub-domain of speaker recognition, has advanced with the rise of neural networks over the last decade. Diarization generally refers to determining the speaking duration of each individual speaker in an event. Researchers have proposed various approaches for multi-speaker diarization. However, these approaches still suffer from environmental noise and non-speech sounds in the datasets, such as laughter, murmuring, and clapping. Hence, this paper proposes an improved speaker diarization pipeline to deal with the noise present in a multi-speaker dataset. The pipeline uses a Bi-directional Long Short-Term Memory (Bi-LSTM) based speech refinement pre-processing module and Modified Spectral Clustering with Symmetrized Singular Value Decomposition (MSC-SSVD). MSC-SSVD addresses the difficulty of applying spectral clustering to large datasets. The proposed pipeline is evaluated on the publicly available VoxConverse dataset. The Diarization Error Rates (DER) obtained are 37.2%, 37.1%, and 43.3% for the three batches of the dataset under study, respectively. Compared with the baseline system, a significant change in DER of 6.1%, 4.7%, and 7%, respectively, is observed for the three batches.
Pages: 54433-54448
Number of pages: 16