Focus the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

Cited by: 3

Authors:

Lin, Jiuxin [1]

Wang, Peng [2]

Dinkel, Heinrich [2]

Chen, Jun [1]

Wu, Zhiyong [1]

Wang, Yongqing [2]

Yan, Zhiyong [2]

Zhang, Junbo [2]

Wang, Yujun [2]

Affiliations:

[1] Tsinghua University, Shenzhen International Graduate School, Shenzhen, People's Republic of China

[2] Xiaomi Inc., Beijing, People's Republic of China

Source:

INTERSPEECH 2023 | 2023

Funding:

National Natural Science Foundation of China

Keywords:

target speaker extraction; distance-based sound separation; speech

DOI:

10.21437/Interspeech.2023-218

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Target speaker extraction (TSE) has yielded outstanding performance in certain application scenarios for speech enhancement and source separation. However, obtaining auxiliary speaker-related information is still challenging in noisy environments with significant reverberation. Inspired by the recently proposed distance-based sound separation, we propose the near sound (NS) extractor, which leverages distance information for TSE. It reliably extracts speaker information without requiring prior speaker enrollment, an approach we call speaker embedding self-enrollment (SESE). Full- and sub-band modeling is introduced to improve the NS-Extractor's adaptability to environments with significant reverberation. Experimental results on several cross-dataset evaluations demonstrate the effectiveness of our improvements and the excellent performance of the proposed NS-Extractor in different application scenarios.


Pages: 2488-2492

Page count: 5

