LEARNING SPEAKER REPRESENTATION FOR NEURAL NETWORK BASED MULTICHANNEL SPEAKER EXTRACTION

Cited by: 0
Authors
Zmolikova, Katerina [1 ,2 ]
Delcroix, Marc [1 ]
Kinoshita, Keisuke [1 ]
Higuchi, Takuya [1 ]
Ogawa, Atsunori [1 ]
Nakatani, Tomohiro [1 ]
Affiliations
[1] NTT Corp, NTT Commun Sci Labs, Kyoto, Japan
[2] Brno Univ Technol, Speech FIT, Brno, Czech Republic
Keywords
speaker extraction; speaker adaptive neural network; multi-speaker speech recognition; speaker representation learning; beamforming; SOURCE SEPARATION; SPEECH;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Recently, schemes employing deep neural networks (DNNs) to extract speech from noisy observations have demonstrated great potential for noise-robust automatic speech recognition. However, these schemes are not well suited when the interfering noise is another speaker. To enable extraction of a target speaker from a mixture of speakers, we recently proposed informing the neural network with speaker information extracted from an adaptation utterance of the same speaker. In our previous work, we explored ways to inform the network about the speaker and found a speaker adaptive layer approach to be suitable for this task. In those experiments, we used speaker features designed for speaker recognition tasks as the additional speaker information, which may not be optimal for the speaker extraction task. In this paper, we propose using a sequence summarizing scheme that enables learning the speaker representation jointly with the network. Furthermore, we extend our previous experiments to demonstrate the potential of the proposed method as a front-end for speech recognition and explore the effect of additional noise on its performance.
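To make the jointly learned speaker representation concrete, below is a minimal PyTorch-style sketch of the idea described in the abstract, not the authors' implementation: an auxiliary sequence-summarizing network averages frame-level outputs of an adaptation utterance into a speaker embedding, and that embedding modulates a speaker-adaptive layer inside a mask-estimation network whose output mask could then drive a beamforming front-end. Module names (SequenceSummarizer, SpeakerAdaptiveMaskNet), feature and embedding dimensions, and the element-wise form of the adaptive layer are illustrative assumptions.

```python
# Hedged sketch: sequence-summarizing speaker embedding + speaker-adaptive mask estimator.
# All sizes and module names are assumptions for illustration, not the paper's exact design.
import torch
import torch.nn as nn


class SequenceSummarizer(nn.Module):
    """Maps an adaptation utterance (frames of spectral features) to one speaker embedding."""

    def __init__(self, feat_dim=257, emb_dim=30):
        super().__init__()
        self.frame_net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                       nn.Linear(256, emb_dim))

    def forward(self, adapt_feats):                       # (T_adapt, feat_dim)
        # Average frame-level outputs over time -> a single (emb_dim,) embedding.
        return self.frame_net(adapt_feats).mean(dim=0)


class SpeakerAdaptiveMaskNet(nn.Module):
    """Mask estimator whose middle layer is scaled element-wise by the speaker embedding."""

    def __init__(self, feat_dim=257, hidden=512, emb_dim=30):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.adapt = nn.Linear(hidden, emb_dim)            # activations modulated by the embedding
        self.post = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, feat_dim), nn.Sigmoid())

    def forward(self, mix_feats, spk_emb):                 # (T, feat_dim), (emb_dim,)
        h = self.adapt(self.pre(mix_feats)) * spk_emb      # speaker-adaptive layer
        return self.post(h)                                # time-frequency mask in [0, 1]


# Both networks are trained jointly, so the speaker representation is learned for the
# extraction task rather than taken from a separate speaker-recognition system.
summarizer = SequenceSummarizer()
mask_net = SpeakerAdaptiveMaskNet()
adapt_utt = torch.randn(200, 257)     # adaptation utterance of the target speaker
mixture = torch.randn(300, 257)       # features of the observed mixture (one channel)
mask = mask_net(mixture, summarizer(adapt_utt))
# In a multichannel setup, such a mask would feed spatial covariance estimation for a
# beamformer used as a front-end for speech recognition.
```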
Pages: 8-15
Number of pages: 8