CONTINUOUS SPEECH SEPARATION WITH RECURRENT SELECTIVE ATTENTION NETWORK

Cited: 4
Authors
Zhang, Yixuan [1 ,2 ]
Chen, Zhuo [1 ]
Wu, Jian [1 ]
Yoshioka, Takuya [1 ]
Wang, Peidong [1 ]
Meng, Zhong [1 ]
Li, Jinyu [1 ]
Affiliations
[1] Microsoft, Redmond, WA 98052 USA
[2] Ohio State Univ, Columbus, OH 43210 USA
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
Continuous speech separation; recurrent selective attention network; meeting transcription; ENHANCEMENT;
DOI
10.1109/ICASSP43922.2022.9746394
CLC number
O42 [Acoustics];
Discipline codes
070206 ; 082403 ;
Abstract
While permutation invariant training (PIT) based continuous speech separation (CSS) significantly improves conversation transcription accuracy, it often suffers from speech leakage and separation failures in "hot spot" regions because it has a fixed number of output channels. In this paper, we propose to apply a recurrent selective attention network (RSAN) to CSS, which generates a variable number of output channels based on active speaker counting. In addition, we propose a novel block-wise dependency extension of RSAN by introducing dependencies between adjacent processing blocks in the CSS framework. This enables the network to utilize the separation results from previous blocks to facilitate the current block's processing. Experimental results on the LibriCSS dataset show that the RSAN-based CSS (RSAN-CSS) network consistently improves speech recognition accuracy over PIT-based models. The proposed block-wise dependency modeling further boosts the performance of RSAN-CSS.
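The abstract's key mechanism is that RSAN extracts speakers one at a time, peeling each estimated source off a residual until little energy remains, which is how a variable number of output channels arises. The following is a minimal illustrative sketch of that loop, not the authors' implementation; `estimate_mask`, `max_speakers`, and `residual_floor` are hypothetical names introduced here for illustration.

```python
import numpy as np

def rsan_style_separation(mixture_mag, estimate_mask,
                          max_speakers=4, residual_floor=0.05):
    """Sketch of RSAN-style iterative extraction (illustrative only).

    Each step estimates a mask for one speaker conditioned on the
    remaining residual mask; extraction stops when residual energy
    drops below a floor, yielding a variable number of outputs.
    """
    residual = np.ones_like(mixture_mag)  # residual mask, starts as all-ones
    outputs = []
    for _ in range(max_speakers):
        # hypothetical network call: per-speaker mask in [0, 1]
        mask = estimate_mask(mixture_mag, residual)
        outputs.append(mask * mixture_mag)
        # subtract the extracted speaker's mask from the residual
        residual = np.clip(residual - mask, 0.0, 1.0)
        # stop when little residual remains (no more active speakers)
        if residual.mean() < residual_floor:
            break
    return outputs

# Toy usage: a stand-in "network" that claims half the residual each pass.
mix = np.ones((4, 8))  # dummy magnitude spectrogram (freq x time)
sources = rsan_style_separation(mix, lambda m, r: np.minimum(r, 0.5))
```

With this toy mask estimator, the loop extracts two sources and then stops, rather than always producing a fixed number of channels as in PIT-based CSS. The block-wise dependency extension described in the abstract would additionally condition `estimate_mask` on the previous block's separation results.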
Pages: 6017-6021
Page count: 5