Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments

Cited: 2
Authors
Wang, Jing [1 ]
Luo, Yiyu [1 ]
Yi, Weiming [2 ]
Xie, Xiang [1 ]
Affiliations
[1] Beijing Institute of Technology, School of Information and Electronics, Beijing 100081, People's Republic of China
[2] Beijing Institute of Technology, School of Foreign Languages, Key Laboratory of Language Cognition and Computation, Ministry of Industry and Information Technology, Beijing 100081, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
audio-visual speech separation; multi-talker transformer; multi-head attention; lip embedding; time-frequency mask; ENHANCEMENT; INTELLIGIBILITY; QUALITY;
DOI
10.1587/transinf.2021EDP7020
Chinese Library Classification (CLC) Number
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Speech separation is the task of extracting target speech while suppressing background interference. In applications such as video telephony, visual information about the target speaker is available and can be leveraged for multi-speaker speech separation. Most previous multi-speaker separation methods are based on convolutional or recurrent neural networks. Recently, Transformer-based sequence-to-sequence (Seq2Seq) models have achieved state-of-the-art performance in various tasks such as neural machine translation (NMT) and automatic speech recognition (ASR). The Transformer has shown an advantage in modeling audio-visual temporal context with multi-head attention blocks that explicitly assign attention weights. Moreover, the Transformer has no recurrent sub-networks and therefore supports parallel sequence computation. In this paper, we propose a novel speaker-independent audio-visual speech separation method based on the Transformer, which can be flexibly applied to an unknown number of speakers with unknown identities. The model receives both audio and visual streams, namely the noisy spectrogram and the target speaker's lip embeddings, and predicts a complex time-frequency mask for that speaker. The model consists of three main components: an audio encoder, a visual encoder, and a Transformer-based mask generator. Two encoder structures, ResNet-based and Transformer-based, are investigated and compared. The performance of the proposed method is evaluated with source separation and speech quality metrics. Experimental results on the benchmark GRID dataset show the effectiveness of the method on the speaker-independent separation task in multi-talker environments. The model generalizes well to unseen speaker identities and noise types. Although trained only on 2-speaker mixtures, the model achieves reasonable performance when tested on both 2-speaker and 3-speaker mixtures. Moreover, the model still shows an advantage over previous audio-visual speech separation works.
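To make the abstract's architecture description concrete, below is a minimal PyTorch-style sketch of an audio-visual separator with an audio encoder over noisy spectrogram frames, a visual encoder over lip embeddings, and a Transformer-based generator that predicts the real and imaginary parts of a complex time-frequency mask for the target speaker. All layer sizes, the concatenation-based audio-visual fusion, and the class and parameter names (AudioVisualSeparator, n_freq, lip_dim, etc.) are illustrative assumptions, not the authors' exact configuration.

    # Hypothetical sketch of the three-component model described in the abstract:
    # audio encoder + visual encoder + Transformer-based complex-mask generator.
    import torch
    import torch.nn as nn

    class AudioVisualSeparator(nn.Module):
        def __init__(self, n_freq=257, lip_dim=512, d_model=256,
                     n_heads=8, n_layers=4):
            super().__init__()
            # Audio encoder: project each noisy spectrogram frame to d_model.
            self.audio_enc = nn.Sequential(nn.Linear(n_freq, d_model), nn.ReLU())
            # Visual encoder: project per-frame lip embeddings to d_model.
            self.visual_enc = nn.Sequential(nn.Linear(lip_dim, d_model), nn.ReLU())
            # Transformer-based mask generator over the fused audio-visual sequence.
            layer = nn.TransformerEncoderLayer(d_model=2 * d_model, nhead=n_heads,
                                               batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
            # Output head: real and imaginary parts of the complex T-F mask.
            self.mask_head = nn.Linear(2 * d_model, 2 * n_freq)

        def forward(self, noisy_spec, lip_emb):
            # noisy_spec: (batch, time, n_freq) spectrogram frames
            # lip_emb:    (batch, time, lip_dim) lip embeddings, assumed upsampled
            #             to the audio frame rate
            fused = torch.cat([self.audio_enc(noisy_spec),
                               self.visual_enc(lip_emb)], dim=-1)
            hidden = self.backbone(fused)
            mask = self.mask_head(hidden)        # (batch, time, 2 * n_freq)
            real, imag = mask.chunk(2, dim=-1)   # complex mask components
            return real, imag

    # Example usage with random inputs of matching frame length.
    model = AudioVisualSeparator()
    spec = torch.randn(1, 100, 257)   # 100 frames of a noisy spectrogram
    lips = torch.randn(1, 100, 512)   # matching lip embeddings
    real_mask, imag_mask = model(spec, lips)

In such a setup, applying the predicted complex mask to the noisy spectrogram and inverting the STFT would yield the estimated target speech; the multi-head attention layers let every output frame attend over the full fused audio-visual context in parallel, which is the advantage over recurrent models that the abstract highlights.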
Pages: 766-777
Page count: 12
Related Papers
50 records in total
  • [31] Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model
    Martel, Hector
    Richter, Julius
    Li, Kai
    Hu, Xiaolin
    Gerkmann, Timo
    INTERSPEECH 2023, 2023, : 1673 - 1677
  • [32] A microphone array beamforming-based system for multi-talker speech separation
    Hidri, Adel
    Amiri, Hamid
    INTERNATIONAL JOURNAL OF SIGNAL AND IMAGING SYSTEMS ENGINEERING, 2016, 9 (4-5) : 209 - 217
  • [33] Boosting Spatial Information for Deep Learning Based Multichannel Speaker-Independent Speech Separation In Reverberant Environments
    Yang, Ziye
    Zhang, Xiao-Lei
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 1506 - 1510
  • [34] Speaker-Independent Speech Recognition using Visual Features
    Pooventhiran, G.
    Sandeep, A.
    Manthiravalli, K.
    Harish, D.
    Renuka, Karthika D.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (11) : 616 - 620
  • [35] Audio-Visual Speech Recognition in Noisy Audio Environments
    Palecek, Karel
    Chaloupka, Josef
    2013 36TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS AND SIGNAL PROCESSING (TSP), 2013, : 484 - 487
  • [36] Audio-Visual Speech Recognition in the Presence of a Competing Speaker
    Shao, Xu
    Barker, Jon
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1292 - 1295
  • [37] Audio-Visual Multilevel Fusion for Speech and Speaker Recognition
    Chetty, Girija
    Wagner, Michael
    INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 379 - 382
  • [38] A Visual-Pilot Deep Fusion for Target Speech Separation in Multi-Talker Noisy Environment
    Li, Yun
    Liu, Zhang
    Na, Yueyue
    Wang, Ziteng
    Tian, Biao
    Fu, Qiang
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 4442 - 4446
  • [39] Target Speaker Verification With Selective Auditory Attention for Single and Multi-Talker Speech
    Xu, Chenglin
    Rao, Wei
    Wu, Jibin
    Li, Haizhou
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 2696 - 2709
  • [40] The effect of nearby maskers on speech intelligibility in reverberant, multi-talker environments
    Westermann, Adam
    Buchholz, Joerg M.
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2017, 141 (03): : 2214 - 2223