Directed Speech Separation for Automatic Speech Recognition of Long-form Conversational Speech

Cited by: 5

Authors:

Paturi, Rohit [1]

Srinivasan, Sundararajan [1]

Kirchhoff, Katrin [1]

Romero, Daniel Garcia [1]

Affiliations:

[1] Amazon AWS AI, Washington, DC 20052 USA

Source:

INTERSPEECH 2022 | 2022

Keywords:

Speech Separation; Speaker embeddings; Spectral clustering; ASR; deep learning

DOI:

10.21437/Interspeech.2022-10843

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

Many of the recent advances in speech separation are primarily aimed at synthetic mixtures of short audio utterances with high degrees of overlap. Most of these approaches need an additional step to stitch the separated speech chunks together for long-form audio. Since most of the approaches involve Permutation Invariant Training (PIT), the order of separated speech chunks is nondeterministic and leads to difficulty in accurately stitching homogeneous speaker chunks for downstream tasks like Automatic Speech Recognition (ASR). Also, most of these models are trained with synthetic mixtures and do not generalize to real conversational data. In this paper, we propose a speaker-conditioned separator trained on speaker embeddings extracted directly from the mixed signal using an over-clustering based approach. This model naturally regulates the order of the separated chunks without the need for an additional stitching step. We also introduce a data sampling strategy with real and synthetic mixtures which generalizes well to real conversational speech. With this model and data sampling technique, we show significant improvements in speaker-attributed word error rate (SA-WER) on Hub5 data.
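
The ordering problem the abstract attributes to PIT can be illustrated with a small sketch. This is not the authors' implementation: it is a generic utterance-level PIT loss over toy number sequences, and the names `mse`, `pit_loss`, `chunk1`, and `chunk2` are hypothetical. It shows that two chunks can be separated equally well while their output channels come out in different orders, which is why a stitching step is normally needed.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, references):
    """Return (minimum loss, best permutation) over all output-to-speaker assignments.

    best_perm[i] is the index of the reference matched to estimates[i].
    """
    n = len(estimates)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        loss = sum(mse(estimates[i], references[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy "signals" for two speakers (hypothetical values, not real audio).
speaker_a = [1.0, 1.0, 1.0, 1.0]
speaker_b = [-1.0, -1.0, -1.0, -1.0]
references = [speaker_a, speaker_b]

# Chunk 1: the separator emits speaker A on output 0 and speaker B on output 1.
chunk1 = [[0.9, 1.1, 1.0, 1.0], [-1.0, -0.9, -1.1, -1.0]]
# Chunk 2: equally good separation, but the output channels are swapped.
chunk2 = [[-1.0, -1.0, -0.9, -1.1], [1.0, 0.9, 1.1, 1.0]]

loss1, perm1 = pit_loss(chunk1, references)
loss2, perm2 = pit_loss(chunk2, references)
print(perm1, perm2)
```

Both chunks reach essentially the same loss under different permutations, so the training objective alone gives no reason for output 0 to stay with the same speaker from one chunk to the next.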


Pages: 5388-5392

Page count: 5
