Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments

Cited by: 2
Authors
Wang, Jing [1]
Luo, Yiyu [1]
Yi, Weiming [2]
Xie, Xiang [1]
Affiliations
[1] Beijing Inst Technol, Sch Informat & Elect, Beijing 100081, Peoples R China
[2] Beijing Inst Technol, Sch Foreign Languages, Key Lab Language Cognit & Computat, Minist Ind & Informat Technol, Beijing 100081, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
audio-visual speech separation; multi-talker transformer; multi-head attention; lip embedding; time-frequency mask; ENHANCEMENT; INTELLIGIBILITY; QUALITY;
DOI
10.1587/transinf.2021EDP7020
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Speech separation is the task of extracting target speech while suppressing background interference. In applications such as video telephony, visual information about the target speaker is available and can be leveraged for multi-speaker speech separation. Most previous multi-speaker separation methods are based on convolutional or recurrent neural networks. Recently, Transformer-based sequence-to-sequence (Seq2Seq) models have achieved state-of-the-art performance in various tasks, such as neural machine translation (NMT) and automatic speech recognition (ASR). The Transformer has shown an advantage in modeling audio-visual temporal context with multi-head attention blocks that explicitly assign attention weights. In addition, the Transformer contains no recurrent sub-networks and therefore supports parallel sequence computation. In this paper, we propose a novel speaker-independent audio-visual speech separation method based on the Transformer, which can be flexibly applied to mixtures with an unknown number and identity of speakers. The model receives both audio and visual streams, namely the noisy spectrogram and the target speaker's lip embeddings, and predicts a complex time-frequency mask for that speaker. The model consists of three main components: an audio encoder, a visual encoder, and a Transformer-based mask generator. Two encoder structures, ResNet-based and Transformer-based, are investigated and compared. The performance of the proposed method is evaluated in terms of source separation and speech quality metrics. Experimental results on the benchmark GRID dataset show the effectiveness of the method on the speaker-independent separation task in multi-talker environments. The model generalizes well to unseen speaker identities and noise types. Although trained only on 2-speaker mixtures, the model achieves reasonable performance when tested on both 2-speaker and 3-speaker mixtures. Moreover, the model shows an advantage over previous audio-visual speech separation works.
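The abstract describes a mask-estimation pipeline: an audio encoder for the noisy spectrogram, a visual encoder for the lip embeddings, and a Transformer-based generator that predicts a complex time-frequency mask for the target speaker. The sketch below illustrates that structure in PyTorch; the layer sizes, the additive audio-visual fusion, and the AudioVisualMaskEstimator class name are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of the audio-visual mask estimator described in the abstract.
# Hyperparameters and the fusion scheme are assumptions for illustration only.
import torch
import torch.nn as nn


class AudioVisualMaskEstimator(nn.Module):
    def __init__(self, n_freq=257, lip_dim=512, d_model=256,
                 n_heads=8, n_layers=4):
        super().__init__()
        # Audio encoder: project each noisy spectrogram frame (real + imaginary parts).
        self.audio_enc = nn.Linear(2 * n_freq, d_model)
        # Visual encoder: project per-frame lip embeddings to the same model width.
        self.visual_enc = nn.Linear(lip_dim, d_model)
        # Transformer-based mask generator over the fused audio-visual sequence.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Output head: complex mask (real and imaginary parts per frequency bin).
        self.mask_head = nn.Linear(d_model, 2 * n_freq)

    def forward(self, noisy_spec, lip_emb):
        # noisy_spec: (batch, time, 2*n_freq); lip_emb: (batch, time, lip_dim).
        # The video stream is assumed upsampled to the STFT frame rate.
        fused = self.audio_enc(noisy_spec) + self.visual_enc(lip_emb)
        hidden = self.fusion(fused)
        mask = self.mask_head(hidden)                 # (batch, time, 2*n_freq)
        mask_re, mask_im = mask.chunk(2, dim=-1)
        spec_re, spec_im = noisy_spec.chunk(2, dim=-1)
        # Apply the complex mask by complex multiplication with the noisy spectrogram.
        est_re = mask_re * spec_re - mask_im * spec_im
        est_im = mask_re * spec_im + mask_im * spec_re
        return torch.cat([est_re, est_im], dim=-1)    # estimated target spectrogram


# Example shapes: ~3 s of 16 kHz audio with a 512-point STFT (257 bins).
model = AudioVisualMaskEstimator()
noisy = torch.randn(1, 300, 2 * 257)   # (batch, frames, real+imag bins)
lips = torch.randn(1, 300, 512)        # lip embeddings aligned to audio frames
est = model(noisy, lips)               # (1, 300, 514)
```

In this sketch the two streams are fused by simple addition after projection to a shared dimension; the paper compares ResNet-based and Transformer-based encoders, which would replace the single linear projections used here.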
Pages: 766-777
Number of pages: 12