Using Visual Speech Information in Masking Methods for Audio Speaker Separation

Cited by: 7
Authors
Khan, Faheem Ullah [1]
Milner, Ben P. [1]
Le Cornu, Thomas [1]
Affiliations
[1] Univ East Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England
Keywords
Speaker separation; audio-visual processing; binary masks; ratio mask; ENHANCEMENT; NOISE; INTELLIGIBILITY; SEGREGATION; PREDICTION; FREQUENCY; TRACKING
DOI
10.1109/TASLP.2018.2835719
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper examines whether visual speech information can be effective within audio-masking-based speaker separation to improve the quality and intelligibility of the target speech. Two visual-only methods of generating an audio mask for speaker separation are first developed. These use a deep neural network to map the visual speech features to an audio feature space from which both visually derived binary masks and visually derived ratio masks are estimated, before application to the speech mixture. Second, an audio ratio masking method forms a baseline approach for speaker separation which is extended to exploit visual speech information to form audio-visual ratio masks. Speech quality and intelligibility tests are carried out on the visual-only, audio-only, and audio-visual masking methods of speaker separation at mixing levels from -10 to +10 dB. These reveal substantial improvements in the target speech when applying the visual-only and audio-only masks, but with highest performance occurring when combining audio and visual information to create the audio-visual masks.
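The difference between a binary mask and a ratio mask, as named in the abstract, can be illustrated with a minimal sketch. This is not the authors' method: the paper estimates masks from visual (and audio) features with a deep neural network, whereas the sketch below computes "ideal" masks directly from known target and interferer magnitudes. The function names, the 0 dB threshold, the power-ratio form of the ratio mask, and the toy random spectrograms are all assumptions made for illustration.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log of zero


def binary_mask(target_mag, interferer_mag, threshold_db=0.0):
    """Hypothetical helper: 1 where the local target-to-interferer
    ratio exceeds threshold_db, 0 elsewhere."""
    local_ratio_db = 20.0 * np.log10((target_mag + EPS) / (interferer_mag + EPS))
    return (local_ratio_db > threshold_db).astype(float)


def ratio_mask(target_mag, interferer_mag):
    """Hypothetical helper: soft mask in [0, 1] given by the target's
    share of the local power (one common variant; others take a root)."""
    target_pow = target_mag ** 2
    interferer_pow = interferer_mag ** 2
    return target_pow / (target_pow + interferer_pow + EPS)


# Toy magnitude "spectrograms": 4 frequency bins x 6 time frames.
rng = np.random.default_rng(0)
target = rng.random((4, 6))
interferer = rng.random((4, 6))

# Assumption: magnitudes add in the mixture (phase is ignored here).
mixture = target + interferer

bm = binary_mask(target, interferer)
rm = ratio_mask(target, interferer)

# A mask is applied by element-wise multiplication with the mixture.
separated_bm = bm * mixture
separated_rm = rm * mixture
```

The binary mask keeps or discards each time-frequency cell outright, while the ratio mask attenuates each cell in proportion to the target's estimated share of the energy.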
Pages: 1742-1754
Page count: 13