DE-DPCTnet: Deep Encoder Dual-path Convolutional Transformer Network for Multi-channel Speech Separation

Cited by: 0
Authors
Wang, Zhenyu [1,2,4]
Zhou, Yi [1,2]
Gan, Lu [3,4]
Chen, Rilin
Tang, Xinyu [1,2]
Liu, Hongqing [1,2]
Affiliations
[1] Chongqing Univ Posts & Telecommun, Chongqing 400065, Peoples R China
[2] Chongqing Key Lab Signal & Informat Proc, Chongqing 400065, Peoples R China
[3] Brunel Univ, Coll Engn Design & Phys Sci, London UB8 3PH, England
[4] Tencent AI Lab, Beijing, Peoples R China
Source
2022 IEEE WORKSHOP ON SIGNAL PROCESSING SYSTEMS (SIPS) | 2022
Keywords
Speech separation; multi-channel; deep encoder; improved transformer; beamforming; TASNET;
DOI
10.1109/SIPS55645.2022.9919247
Chinese Library Classification (CLC): TP [Automation Technology, Computer Technology]
Discipline Code: 0812
Abstract
In recent years, beamforming has been extensively investigated for the multi-channel speech separation task. In this paper, we propose a deep encoder dual-path convolutional transformer network (DE-DPCTnet), which directly estimates the beamforming filters for speech separation in the time domain. To learn signal representations accurately, a nonlinear deep encoder module is proposed to replace the traditional linear one. An improved transformer is also developed that utilizes convolutions to capture long speech sequences. Ablation studies demonstrate that the deep encoder and the improved transformer indeed benefit separation performance. Comparisons show that DE-DPCTnet outperforms the state-of-the-art filter-and-sum network with transform-average-concatenate module (FaSNet-TAC), even with lower computational complexity.
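A minimal, illustrative sketch (not the authors' released code) of the two ideas highlighted in the abstract: a nonlinear deep encoder built from stacked Conv1d + PReLU blocks in place of the single linear Conv1d encoder used in TasNet-style models, and a time-domain filter-and-sum step that applies the estimated per-microphone beamforming filters. All layer sizes, the names DeepEncoder and filter_and_sum, and the PyTorch framing are assumptions chosen for illustration, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepEncoder(nn.Module):
    # Nonlinear deep encoder (assumed structure): stacked small convolutions with
    # PReLU activations, replacing a single linear Conv1d waveform encoder.
    def __init__(self, in_ch=1, feat_dim=64, kernel=16, stride=8, depth=3):
        super().__init__()
        layers = [nn.Conv1d(in_ch, feat_dim, kernel, stride=stride), nn.PReLU()]
        for _ in range(depth - 1):
            layers += [nn.Conv1d(feat_dim, feat_dim, 3, padding=1), nn.PReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):        # x: (batch, channels, samples)
        return self.net(x)       # -> (batch, feat_dim, frames)

def filter_and_sum(mic_sigs, filters):
    # Time-domain filter-and-sum beamforming: convolve each microphone signal
    # with its estimated filter and sum the results across microphones.
    # mic_sigs: (batch, mics, samples), filters: (batch, mics, taps)
    batch, mics, samples = mic_sigs.shape
    out = torch.zeros(batch, samples, device=mic_sigs.device)
    for b in range(batch):
        for m in range(mics):
            w = filters[b, m].flip(0).view(1, 1, -1)   # flip -> true convolution
            x = mic_sigs[b, m].view(1, 1, -1)
            y = F.conv1d(x, w, padding=w.shape[-1] // 2)
            out[b] += y[0, 0, :samples]
    return out

# Example usage with random data; in the model, the filters would be predicted
# by the dual-path convolutional transformer rather than drawn at random.
enc = DeepEncoder()
mics = torch.randn(1, 4, 16000)                 # 1-second, 4-microphone mixture
features = enc(mics[:, :1])                     # encode the reference channel
beamformed = filter_and_sum(mics, torch.randn(1, 4, 32))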
Pages: 180-184 (5 pages)