ON PERMUTATION INVARIANT TRAINING FOR SPEECH SOURCE SEPARATION

Cited by: 0
Authors
Liu, Xiaoyu [1 ]
Pons, Jordi [1 ]
Affiliations
[1] Dolby Labs, San Francisco, CA 94103 USA
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021
Keywords
Speech source separation; permutation invariant training; waveform-based models; spectrogram-based models; filterbank
DOI
10.1109/ICASSP39728.2021.9413559
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
We study permutation invariant training (PIT), which targets the permutation ambiguity problem in speaker-independent source separation models. We extend two state-of-the-art PIT strategies. First, we look at the two-stage speaker separation and tracking algorithm based on frame-level PIT (tPIT) and clustering, which was originally proposed for the STFT domain, and we adapt it to work with waveforms and over a learned latent space. Further, we propose an efficient clustering loss that scales to waveform models. Second, we extend a recently proposed auxiliary speaker-ID loss with a deep feature loss based on "problem agnostic speech features" to reduce the local permutation errors made by utterance-level PIT (uPIT). Our results show that the proposed extensions help reduce permutation ambiguity. However, we also note that the studied STFT-based models are more effective at reducing permutation errors than waveform-based models, a perspective overlooked in recent studies.
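To make the uPIT objective mentioned in the abstract concrete, the following Python sketch evaluates a separation loss under every speaker permutation over the whole utterance and keeps the best one. It is only an illustration: the negative SI-SDR criterion, the NumPy implementation, and the function names are assumptions for this example, not the authors' code or exact setup.

# Minimal sketch of an utterance-level PIT (uPIT) loss, for illustration only.
import itertools
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for one source (higher is better)."""
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

def upit_loss(estimates, references):
    """Negative SI-SDR averaged over sources, minimized over all
    speaker permutations applied to the whole utterance."""
    n_src = len(estimates)
    best = np.inf
    for perm in itertools.permutations(range(n_src)):
        # perm[r] is the estimate assigned to reference r under this permutation.
        loss = -np.mean([si_sdr(estimates[p], references[r])
                         for r, p in enumerate(perm)])
        best = min(best, loss)
    return best

# Toy usage with two synthetic "speakers"; the estimates are deliberately swapped.
rng = np.random.default_rng(0)
refs = [rng.standard_normal(16000) for _ in range(2)]
ests = [refs[1] + 0.1 * rng.standard_normal(16000),
        refs[0] + 0.1 * rng.standard_normal(16000)]
print(upit_loss(ests, refs))  # the swapped permutation is selected automatically

Because the minimum is taken once over the entire utterance, uPIT enforces a single speaker-to-output assignment per utterance, which is exactly where the local permutation errors discussed in the abstract can arise within an utterance.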
Pages: 6-10
Number of pages: 5