Using Visual Speech Information in Masking Methods for Audio Speaker Separation

Cited by: 7
Authors
Khan, Faheem Ullah [1 ]
Milner, Ben P. [1 ]
Le Cornu, Thomas [1 ]
Affiliations
[1] Univ East Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England
Keywords
Speaker separation; audio-visual processing; binary masks; ratio mask; enhancement; noise; intelligibility; segregation; prediction; frequency; tracking
DOI
10.1109/TASLP.2018.2835719
Chinese Library Classification
O42 [Acoustics]
Subject Classification
070206; 082403
Abstract
This paper examines whether visual speech information can be effective within audio-masking-based speaker separation to improve the quality and intelligibility of the target speech. Two visual-only methods of generating an audio mask for speaker separation are first developed. These use a deep neural network to map the visual speech features to an audio feature space from which both visually derived binary masks and visually derived ratio masks are estimated, before application to the speech mixture. Second, an audio ratio masking method forms a baseline approach for speaker separation which is extended to exploit visual speech information to form audio-visual ratio masks. Speech quality and intelligibility tests are carried out on the visual-only, audio-only, and audio-visual masking methods of speaker separation at mixing levels from -10 to +10 dB. These reveal substantial improvements in the target speech when applying the visual-only and audio-only masks, but with highest performance occurring when combining audio and visual information to create the audio-visual masks.
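The abstract contrasts binary masks (hard time-frequency decisions) with ratio masks (soft gains) applied to a speech mixture. As a rough illustration only, not the paper's implementation, the sketch below constructs both mask types from oracle target and interferer magnitude spectrograms (here just random arrays; the variable names `T` and `I` and the additive-mixture assumption are this sketch's own):

```python
import numpy as np

# Hypothetical oracle magnitudes for a target and an interfering speaker
# (random stand-ins for real magnitude spectrograms; shapes are arbitrary).
rng = np.random.default_rng(0)
T = rng.random((4, 5)) + 1e-8   # target speaker magnitude spectrogram
I = rng.random((4, 5)) + 1e-8   # interfering speaker magnitude spectrogram
mixture = T + I                 # simplifying assumption: additive mixture

# Binary mask: 1 where the target dominates the interferer, else 0.
binary_mask = (T > I).astype(float)

# Ratio mask: soft gain in [0, 1], target energy over total energy.
ratio_mask = T / (T + I)

# Apply each mask to the mixture to estimate the target.
est_binary = binary_mask * mixture
est_ratio = ratio_mask * mixture
```

With oracle magnitudes and an additive mixture, the ratio mask recovers the target exactly, while the binary mask keeps only the time-frequency cells the target dominates; the paper's contribution is estimating such masks from visual and audio-visual features rather than from oracle knowledge.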
Pages: 1742-1754
Page count: 13
Related Papers
50 records in total
  • [31] Detection of Ball Hits in a Tennis Game Using Audio and Visual Information
    Huang, Qiang
    Cox, Stephen
    Zhou, Xiangzeng
    Xie, Lei
    2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2012,
  • [32] The Role of Visual Speech Information in Supporting Perceptual Learning of Degraded Speech
    Wayne, Rachel V.
    Johnsrude, Ingrid S.
    JOURNAL OF EXPERIMENTAL PSYCHOLOGY-APPLIED, 2012, 18 (04) : 419 - 435
  • [33] Semantic Cues Modulate Children's and Adults' Processing of Audio-Visual Face Mask Speech
    Schwarz, Julia
    Li, Katrina Kechun
    Sim, Jasper Hong
    Zhang, Yixin
    Buchanan-Worster, Elizabeth
    Post, Brechtje
    Gibson, Jenny Louise
    McDougall, Kirsty
    FRONTIERS IN PSYCHOLOGY, 2022, 13
  • [34] Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion
    Gebru, Israel D.
    Ba, Sileye
    Li, Xiaofei
    Horaud, Radu
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2018, 40 (05) : 1086 - 1099
  • [35] Improved Lite Audio-Visual Speech Enhancement
    Chuang, Shang-Yi
    Wang, Hsin-Min
    Tsao, Yu
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1345 - 1359
  • [36] Audio-visual speech in noise perception in dyslexia
    van Laarhoven, Thijs
    Keetels, Mirjam
    Schakel, Lemmy
    Vroomen, Jean
    DEVELOPMENTAL SCIENCE, 2018, 21 (01)
  • [37] Somatosensory contribution to audio-visual speech processing
    Ito, Takayuki
    Ohashi, Hiroki
    Gracco, Vincent L.
    CORTEX, 2021, 143 : 195 - 204
  • [38] Complementary models for audio-visual speech classification
    Sad, Gonzalo D.
    Terissi, Lucas D.
    Gomez, Juan C.
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2022, 25 (01) : 231 - 249
  • [39] Deep neural networks based binary classification for single channel speaker independent multi-talker speech separation
    Saleem, Nasir
    Khattak, Muhammad Irfan
    APPLIED ACOUSTICS, 2020, 167
  • [40] Audio-Visual Speech Enhancement Using Conditional Variational Auto-Encoders
    Sadeghi, Mostafa
    Leglaive, Simon
    Alameda-Pineda, Xavier
    Girin, Laurent
    Horaud, Radu
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 : 1788 - 1800