Focus the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

Times cited: 3
Authors
Lin, Jiuxin [1]
Wang, Peng [2]
Dinkel, Heinrich [2]
Chen, Jun [1]
Wu, Zhiyong [1]
Wang, Yongqing [2]
Yan, Zhiyong [2]
Zhang, Junbo [2]
Wang, Yujun [2]
Affiliations
[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Xiaomi Inc, Beijing, Peoples R China
Source
INTERSPEECH 2023 | 2023
Funding
National Natural Science Foundation of China;
Keywords
target speaker extraction; distance-based sound separation; SPEECH;
DOI
10.21437/Interspeech.2023-218
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Target Speaker Extraction (TSE) has previously yielded outstanding performance in certain application scenarios for speech enhancement and source separation. However, obtaining auxiliary speaker-related information remains challenging in noisy environments with significant reverberation. Inspired by the recently proposed distance-based sound separation, we propose the near sound (NS) extractor, which leverages distance information for TSE to reliably extract speaker information without requiring prior speaker enrollment, an approach called speaker embedding self-enrollment (SESE). Full- and sub-band modeling is introduced to improve the NS-Extractor's adaptability to environments with significant reverberation. Experimental results on several cross-datasets demonstrate the effectiveness of our improvements and the excellent performance of the proposed NS-Extractor in different application scenarios.
Pages: 2488-2492
Number of pages: 5