Using Visual Speech Information in Masking Methods for Audio Speaker Separation

Cited by: 7
Authors
Khan, Faheem Ullah [1]
Milner, Ben P. [1]
Le Cornu, Thomas [1]
Affiliation
[1] Univ East Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England
Keywords
Speaker separation; audio-visual processing; binary masks; ratio mask; enhancement; noise; intelligibility; segregation; prediction; frequency; tracking
DOI
10.1109/TASLP.2018.2835719
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
This paper examines whether visual speech information can be used effectively within audio-masking-based speaker separation to improve the quality and intelligibility of the target speech. First, two visual-only methods of generating an audio mask for speaker separation are developed. These use a deep neural network to map visual speech features to an audio feature space, from which both visually derived binary masks and visually derived ratio masks are estimated and then applied to the speech mixture. Second, an audio-only ratio masking method serves as a baseline for speaker separation and is extended to exploit visual speech information, forming audio-visual ratio masks. Speech quality and intelligibility tests are carried out on the visual-only, audio-only, and audio-visual masking methods at mixing levels from -10 to +10 dB. These reveal substantial improvements in the target speech with both the visual-only and the audio-only masks, with the highest performance obtained when audio and visual information are combined to create audio-visual masks.
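The abstract contrasts binary and ratio masks applied to a speech mixture at controlled mixing levels. The sketch below, using only the Python standard library, illustrates the textbook "ideal" (oracle) binary and ratio mask definitions and the mixing of two signals at a chosen SNR. It is not the authors' implementation: in the paper the masks are estimated from DNN outputs driven by visual, audio, or audio-visual features, whereas here they are computed from known target and interferer magnitudes purely to make the terms concrete. The function names, the local criterion `lc_db = 0` dB, and the ratio-mask exponent `beta = 0.5` are illustrative assumptions.

```python
import math


def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so the target-to-interferer power ratio equals snr_db, then add.

    Signals are equal-length lists of samples. Returns (mixture, scaled_interferer).
    """
    p_target = sum(x * x for x in target) / len(target)
    p_interf = sum(x * x for x in interferer) / len(interferer)
    # Choose gain so that p_interf * gain**2 == p_target / 10**(snr_db / 10).
    gain = math.sqrt(p_target / (p_interf * 10 ** (snr_db / 10)))
    scaled = [gain * x for x in interferer]
    mixture = [t + i for t, i in zip(target, scaled)]
    return mixture, scaled


def ideal_binary_mask(s_mag, n_mag, lc_db=0.0):
    """1.0 where the local target-to-interferer ratio exceeds lc_db, else 0.0.

    s_mag, n_mag: magnitude spectrograms as [frame][bin] nested lists.
    Powers are compared directly, so zero-magnitude bins need no logarithm.
    """
    thresh = 10 ** (lc_db / 10)
    return [[1.0 if s * s > thresh * n * n else 0.0 for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(s_mag, n_mag)]


def ideal_ratio_mask(s_mag, n_mag, beta=0.5):
    """(|S|^2 / (|S|^2 + |N|^2)) ** beta per time-frequency bin, in [0, 1]."""
    def cell(s, n):
        total = s * s + n * n
        return (s * s / total) ** beta if total > 0 else 0.0
    return [[cell(s, n) for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(s_mag, n_mag)]


def apply_mask(mix_mag, mask):
    """Element-wise product of the mixture magnitude and a mask (target estimate)."""
    return [[m * g for m, g in zip(m_row, g_row)]
            for m_row, g_row in zip(mix_mag, mask)]


if __name__ == "__main__":
    s = [[3.0, 0.0], [1.0, 2.0]]  # hypothetical target magnitudes
    n = [[1.0, 2.0], [1.0, 0.0]]  # hypothetical interferer magnitudes
    print(ideal_binary_mask(s, n))            # [[1.0, 0.0], [0.0, 1.0]]
    print(ideal_ratio_mask(s, n, beta=1.0))   # [[0.9, 0.0], [0.5, 1.0]]
```

A binary mask keeps or discards each time-frequency bin outright, while a ratio mask attenuates each bin in proportion to the estimated target share, which is why the two can differ in quality and intelligibility. In a full system the masked magnitude would typically be combined with the mixture phase and inverted with an inverse STFT; that step is omitted here.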
Pages: 1742-1754
Number of pages: 13