Directed Speech Separation for Automatic Speech Recognition of Long-form Conversational Speech

Cited by: 4

Authors:

Paturi, Rohit [1]

Srinivasan, Sundararajan [1]

Kirchhoff, Katrin [1]

Romero, Daniel Garcia [1]

Affiliation:

[1] Amazon AWS AI, Washington, DC 20052 USA

Source:

INTERSPEECH 2022 | 2022

Keywords:

Speech Separation; Speaker embeddings; Spectral clustering; ASR; deep learning;

DOI:

10.21437/Interspeech.2022-10843

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Many recent advances in speech separation target synthetic mixtures of short utterances with high degrees of overlap. For long-form audio, most of these approaches need an additional step to stitch the separated speech chunks together. Because most of them rely on permutation invariant training (PIT), the order of the separated chunks is nondeterministic, which makes it difficult to accurately stitch chunks from the same speaker for downstream tasks such as automatic speech recognition (ASR). In addition, most of these models are trained on synthetic mixtures and do not generalize to real conversational data. In this paper, we propose a speaker-conditioned separator trained on speaker embeddings extracted directly from the mixed signal using an over-clustering-based approach. This model naturally regulates the order of the separated chunks without an additional stitching step. We also introduce a data sampling strategy that combines real and synthetic mixtures and generalizes well to real conversational speech. With this model and data sampling technique, we show significant improvements in speaker-attributed word error rate (SA-WER) on Hub5 data.

Pages: 5388-5392

Page count: 5
