SPEAKER-CONDITIONING SINGLE-CHANNEL TARGET SPEAKER EXTRACTION USING CONFORMER-BASED ARCHITECTURES

Authors
Sinha, Ragini [1]
Tammen, Marvin [2, 3]
Rollwage, Christian [1]
Doclo, Simon [1, 2, 3]
Affiliations
[1] Fraunhofer Inst Digital Media Technol IDMT, Oldenburg Branch Hearing Speech & Audio Technol H, Ilmenau, Germany
[2] Carl von Ossietzky Univ Oldenburg, Dept Med Phys & Acoust, Oldenburg, Germany
[3] Carl von Ossietzky Univ Oldenburg, Cluster Excellence Hearing4all, Oldenburg, Germany
Source
2022 INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT (IWAENC 2022) | 2022
Keywords
target speaker extraction; multi-task learning; TCN; attention; conformer
DOI
10.1109/IWAENC53105.2022.9914691
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Target speaker extraction aims at extracting the target speaker from a mixture of multiple speakers by exploiting auxiliary information about the target speaker. In this paper, we consider a complete time-domain target speaker extraction system consisting of a speaker embedder network and a speaker separator network, which are jointly trained in an end-to-end learning process. We propose two different architectures for the speaker separator network, both based on the convolution-augmented transformer (conformer). The first architecture uses stacks of conformer and external feed-forward blocks (Conformer-FFN), while the second architecture uses stacks of temporal convolutional network (TCN) and conformer blocks (TCN-Conformer). Experimental results for 2-speaker mixtures, 3-speaker mixtures, and noisy 2-speaker mixtures show that, among the proposed separator networks, the TCN-Conformer significantly improves the target speaker extraction performance compared to the Conformer-FFN and a TCN-based baseline system.
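The data flow described in the abstract can be written compactly as follows. This is only a notational sketch, not the paper's own formulation: the symbols y, s, v_k, n, a, e, f_emb, f_sep, and the generic training loss L are assumptions introduced here for illustration, and the abstract does not specify the loss or the form of the auxiliary information.

```latex
\begin{align}
  % Single-channel mixture: target s, K-1 interfering speakers v_k, optional noise n
  y(t) &= s(t) + \sum_{k=1}^{K-1} v_k(t) + n(t), \\
  % Speaker embedder: maps auxiliary information a about the target to an embedding e
  \mathbf{e} &= f_{\mathrm{emb}}\!\left(a;\, \theta_{\mathrm{emb}}\right), \\
  % Speaker separator (Conformer-FFN or TCN-Conformer): conditioned on e
  \hat{s}(t) &= f_{\mathrm{sep}}\!\left(y, \mathbf{e};\, \theta_{\mathrm{sep}}\right), \\
  % Joint end-to-end training of both networks with an unspecified loss L
  \left(\theta_{\mathrm{emb}}^{\ast}, \theta_{\mathrm{sep}}^{\ast}\right)
    &= \arg\min_{\theta_{\mathrm{emb}},\, \theta_{\mathrm{sep}}}
       \mathcal{L}\!\left(\hat{s}, s\right).
\end{align}
```

Under this notation, the evaluated conditions correspond to K = 2 or K = 3 with n(t) = 0 for the clean mixtures, and K = 2 with n(t) nonzero for the noisy mixtures.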
Pages: 5