Speaker Extraction with Detection of Presence and Absence of Target Speakers

Cited by: 1
Authors
Zhang, Ke [1 ,2 ]
Borsdorf, Marvin [3 ]
Pan, Zexu [2 ]
Li, Haizhou [2 ,3 ,4 ]
Wei, Yangjie [1 ]
Wang, Yi [1 ]
Affiliations
[1] Northeastern Univ, Key Lab Intelligent Comp Med Image, Shenyang, Liaoning, Peoples R China
[2] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore, Singapore
[3] Univ Bremen, Machine Listening Lab MLL, Bremen, Germany
[4] Chinese Univ Hong Kong, SDS, SRIBD, Shenzhen, Peoples R China
Source
INTERSPEECH 2023 | 2023
Funding
National Natural Science Foundation of China;
Keywords
cocktail party problem; target speaker extraction; speaker detection; selective auditory attention; absent speaker; SPEECH; VERIFICATION; ATTENTION; SINGLE;
DOI
10.21437/Interspeech.2023-655
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Target speaker extraction extracts a target voice from a given cocktail party mixture signal. Most studies are restricted to conditions in which the target speaker is present in the mixture (PT), and the resulting models often fail when the target speaker is absent (AT). Training on both PT and AT situations helps, but degrades the PT performance as the model intrinsically tries to detect the target presence. We propose a new model, called TSEJoint, that jointly performs target speaker detection and extraction. Both tasks share the low-level modules, which allows the detection branch to use a pre-separated signal and keeps the overall processing pipeline length similar, while at the high level they have different branches to ensure the performance of each task. We evaluate our proposed methods under PT and AT conditions comprising one and two talkers. The TSEJoint model shows better extraction performance under the PT condition and better detection performance under all conditions compared with the baseline.
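The shared-then-branched layout described in the abstract can be illustrated with a toy sketch. This is not the TSEJoint architecture itself, whose layers and losses are not given in this record. It is a minimal NumPy illustration, assuming random untrained weights, a single shared layer, and hypothetical names (`JointSketch`, `w_shared`, `w_pre`, `w_ext`, `w_det`), of how a detection branch and an extraction branch might both consume a pre-separated signal produced by shared low-level processing.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class JointSketch:
    """Toy two-branch model (hypothetical, untrained): one shared low-level
    layer followed by separate extraction and detection heads."""

    def __init__(self, n_feat=16, n_emb=8, n_hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        # Shared low-level weights, used by both tasks.
        self.w_shared = rng.standard_normal((n_feat + n_emb, n_hidden)) * 0.1
        self.w_pre = rng.standard_normal((n_hidden, n_feat)) * 0.1
        # Task-specific high-level weights.
        self.w_ext = rng.standard_normal((n_hidden + n_feat, n_feat)) * 0.1
        self.w_det = rng.standard_normal((n_feat, 1)) * 0.1

    def forward(self, mixture, speaker_emb):
        # mixture: (frames, n_feat) non-negative features
        # speaker_emb: (n_emb,) reference embedding of the target speaker
        frames = mixture.shape[0]
        cond = np.concatenate(
            [mixture, np.tile(speaker_emb, (frames, 1))], axis=1
        )
        shared = relu(cond @ self.w_shared)
        # Pre-separated signal produced by the shared modules.
        pre_separated = sigmoid(shared @ self.w_pre) * mixture

        # Extraction branch: refine a mask and apply it to the mixture.
        ext_in = np.concatenate([shared, pre_separated], axis=1)
        extracted = sigmoid(ext_in @ self.w_ext) * mixture

        # Detection branch: pool the pre-separated signal over time and
        # map it to a presence probability in (0, 1).
        pooled = pre_separated.mean(axis=0)
        presence = float(sigmoid(pooled @ self.w_det)[0])
        return extracted, presence


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    mixture = np.abs(rng.standard_normal((50, 16)))
    speaker_emb = rng.standard_normal(8)
    extracted, presence = JointSketch().forward(mixture, speaker_emb)
    print(extracted.shape, presence)
```

In a trained system the presence probability could gate the extracted output when the target is judged absent; that gating step is likewise an assumption and is not stated in the abstract.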
Pages: 3714-3718
Number of pages: 5