wTIMIT2mix: A Cocktail Party Mixtures Database to Study Target Speaker Extraction for Normal and Whispered Speech

Cited by: 0
Authors
Borsdorf, Marvin [1 ]
Pan, Zexu [2 ]
Li, Haizhou [1 ,3 ,4 ]
Schultz, Tanja [5 ]
Affiliations
[1] Univ Bremen, Machine Listening Lab MLL, Bremen, Germany
[2] Alibaba Grp, Singapore, Singapore
[3] Chinese Univ Hong Kong, SRIBD, SDS, Shenzhen, Peoples R China
[4] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore, Singapore
[5] Univ Bremen, Cognit Syst Lab CSL, Bremen, Germany
Source
INTERSPEECH 2024 | 2024
Keywords
Speaker extraction; speech separation; cocktail party problem; speech mode; whispered speech; SEPARATION; RECOGNITION;
DOI
10.21437/Interspeech.2024-1172
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Target speaker extraction (TSE) seeks to single out a target speaker's voice from a speech mixture signal with the help of a target reference signal. TSE enables novel speech applications such as smart hearing aids. A TSE system has to work reliably in everyday conversational situations, which may include speakers who switch naturally between normal and whispered speech modes. This work represents the first attempt to perform TSE for whispered speech. To this end, we construct a new, first-of-its-kind database, called wTIMIT2mix, which comprises two-speaker speech mixtures and target speaker reference signals in both normal and whispered speech modes. Our TSE results show that, if these conditions are included in training, a model can be equipped to work under all closed-set conditions.
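The abstract describes two-speaker cocktail-party mixtures paired with a target reference signal, but this record does not state the actual mixing protocol. The sketch below shows one common way such a (mixture, target, reference) triple is assembled from individual utterances; the function name make_mixture, the SNR scaling, and the fully-overlapped truncation are illustrative assumptions, not the authors' wTIMIT2mix recipe.

```python
# Minimal sketch of building a two-speaker mixture plus a target
# reference signal for TSE training. SNR handling, overlap strategy,
# and file layout are assumptions, not the wTIMIT2mix protocol.
import numpy as np
import soundfile as sf


def make_mixture(target_wav, interferer_wav, reference_wav, snr_db=0.0):
    """Mix a target and an interferer utterance at a given SNR and
    return (mixture, target, reference) waveforms."""
    target, sr = sf.read(target_wav)
    interferer, sr_i = sf.read(interferer_wav)
    reference, _ = sf.read(reference_wav)
    assert sr == sr_i, "both utterances must share one sampling rate"

    # Truncate to the shorter utterance (fully overlapped mixture).
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]

    # Scale the interferer so the target-to-interferer ratio equals snr_db.
    t_pow = np.mean(target ** 2) + 1e-8
    i_pow = np.mean(interferer ** 2) + 1e-8
    gain = np.sqrt(t_pow / (i_pow * 10 ** (snr_db / 10)))
    mixture = target + gain * interferer
    return mixture, target, reference


# Example use (hypothetical file names): the reference comes from a
# different utterance of the target speaker, in either speech mode.
# mix, tgt, ref = make_mixture("spk1_normal.wav", "spk2_whisper.wav",
#                              "spk1_enroll.wav", snr_db=2.5)
```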
Pages: 5038-5042
Page count: 5