DUAL-PATH RNN FOR LONG RECORDING SPEECH SEPARATION

Cited by: 20
Authors
Li, Chenda [1]
Luo, Yi [2]
Han, Cong [2]
Li, Jinyu [3]
Yoshioka, Takuya [3]
Zhou, Tianyan [3]
Delcroix, Marc [4]
Kinoshita, Keisuke [4]
Boeddeker, Christoph [5]
Qian, Yanmin [1]
Watanabe, Shinji [6]
Chen, Zhuo [3]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[2] Columbia Univ, New York, NY 10027 USA
[3] Microsoft Corp, Redmond, WA 98052 USA
[4] NTT Corp, Chiyoda City, Tokyo, Japan
[5] Paderborn Univ, Paderborn, Germany
[6] Johns Hopkins Univ, Baltimore, MD 21218 USA
Source
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021
Keywords
Continuous speech separation; long recording speech separation; dual-path RNN
DOI
10.1109/SLT48900.2021.9383514
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Continuous speech separation (CSS) is an emerging task in speech separation that aims to separate overlap-free targets from a long, partially overlapped recording. A straightforward extension of previously proposed sentence-level separation models to this task is to segment the long recording into fixed-length blocks and perform separation on each block independently. However, such a simple extension does not fully address cross-block dependencies, and the separation performance may be unsatisfactory. In this paper, we focus on improving block-level separation performance by exploring methods that utilize cross-block information. Building on the recently proposed dual-path RNN (DPRNN) architecture, we investigate how DPRNN's interleaved intra- and inter-block modules can help block-level separation. Experimental results show that DPRNN significantly outperforms the baseline block-level model in both offline and block-online configurations under certain settings.
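For readers unfamiliar with the dual-path scheme the abstract describes, the following PyTorch sketch illustrates one interleaved intra-/inter-block stage: an intra-block BLSTM models frames inside each fixed-length block, and an inter-block BLSTM models dependencies across blocks. All layer choices (BLSTM sizes, LayerNorm, residual connections) and tensor shapes are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of one dual-path stage, assuming the input has already
# been segmented into fixed-length blocks. Layer sizes are hypothetical.
import torch
import torch.nn as nn


class DualPathBlock(nn.Module):
    """One interleaved intra-/inter-block stage (illustrative sizes)."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        # Intra-block path: bidirectional RNN over frames within each block.
        self.intra_rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                                 bidirectional=True)
        self.intra_proj = nn.Linear(2 * hidden_dim, feat_dim)
        self.intra_norm = nn.LayerNorm(feat_dim)
        # Inter-block path: RNN along the block axis, carrying the
        # cross-block context that independent block processing misses.
        self.inter_rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                                 bidirectional=True)
        self.inter_proj = nn.Linear(2 * hidden_dim, feat_dim)
        self.inter_norm = nn.LayerNorm(feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_blocks, block_len, feat_dim)
        b, n, k, d = x.shape
        # Intra-block pass: fold blocks into the batch axis so the RNN
        # runs over the frames of each block independently.
        intra = x.reshape(b * n, k, d)
        intra, _ = self.intra_rnn(intra)
        intra = self.intra_proj(intra).reshape(b, n, k, d)
        x = self.intra_norm(x + intra)  # residual connection
        # Inter-block pass: fold frame positions into the batch axis so
        # the RNN runs along the block axis for each within-block position.
        inter = x.transpose(1, 2).reshape(b * k, n, d)
        inter, _ = self.inter_rnn(inter)
        inter = self.inter_proj(inter).reshape(b, k, n, d).transpose(1, 2)
        return self.inter_norm(x + inter)  # residual connection


if __name__ == "__main__":
    # Toy input: 2 recordings, 12 blocks of 50 frames, 64-dim features.
    feats = torch.randn(2, 12, 50, 64)
    stage = DualPathBlock(feat_dim=64, hidden_dim=128)
    print(stage(feats).shape)  # torch.Size([2, 12, 50, 64])
```

A block-online variant would replace the bidirectional inter-block RNN with a unidirectional one, so that each block only attends to past blocks; the bidirectional version shown here corresponds to the offline configuration.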
Pages: 865-872
Number of pages: 8