Using Visual Speech Information in Masking Methods for Audio Speaker Separation

Cited by: 7
Authors
Khan, Faheem Ullah [1]
Milner, Ben P. [1]
Le Cornu, Thomas [1]
Affiliation
[1] Univ East Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England
Keywords
Speaker separation; audio-visual processing; binary masks; ratio mask; ENHANCEMENT; NOISE; INTELLIGIBILITY; SEGREGATION; PREDICTION; FREQUENCY; TRACKING;
DOI
10.1109/TASLP.2018.2835719
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper examines whether visual speech information can be effective within audio-masking-based speaker separation to improve the quality and intelligibility of the target speech. Two visual-only methods of generating an audio mask for speaker separation are first developed. These use a deep neural network to map the visual speech features to an audio feature space from which both visually derived binary masks and visually derived ratio masks are estimated, before application to the speech mixture. Second, an audio ratio masking method forms a baseline approach for speaker separation which is extended to exploit visual speech information to form audio-visual ratio masks. Speech quality and intelligibility tests are carried out on the visual-only, audio-only, and audio-visual masking methods of speaker separation at mixing levels from -10 to +10 dB. These reveal substantial improvements in the target speech when applying the visual-only and audio-only masks, but with highest performance occurring when combining audio and visual information to create the audio-visual masks.
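As a rough illustration of what "binary mask" and "ratio mask" mean when applied to a speech mixture, the sketch below computes both from toy power spectrograms. It is not the paper's method: the abstract describes masks estimated from visual features by a deep neural network, whereas this example assumes the target and interferer powers are known (an oracle setting). The mask formulas are common textbook definitions, and the function names, the 0 dB threshold, and the numeric values are all assumptions made for illustration.

```python
import math

EPS = 1e-12  # small constant to avoid division by zero and log of zero


def binary_mask(target, interferer, threshold_db=0.0):
    """1.0 where the local target-to-interferer ratio (in dB) exceeds the threshold, else 0.0."""
    return [
        [1.0 if 10.0 * math.log10((t + EPS) / (n + EPS)) > threshold_db else 0.0
         for t, n in zip(t_row, n_row)]
        for t_row, n_row in zip(target, interferer)
    ]


def ratio_mask(target, interferer):
    """Soft mask in [0, 1]: fraction of power in each bin attributed to the target."""
    return [
        [t / (t + n + EPS) for t, n in zip(t_row, n_row)]
        for t_row, n_row in zip(target, interferer)
    ]


def apply_mask(mask, mixture):
    """Element-wise product of a mask with the mixture spectrogram."""
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, mixture)]


# Toy power spectrograms: 2 time frames x 3 frequency bins (made-up values).
target_pow = [[4.0, 1.0, 0.0], [9.0, 0.5, 2.0]]
interferer_pow = [[1.0, 1.0, 3.0], [1.0, 2.0, 2.0]]

# Simplifying assumption: powers of the two speakers add in the mixture.
mixture_pow = [[t + n for t, n in zip(t_row, n_row)]
               for t_row, n_row in zip(target_pow, interferer_pow)]

bm = binary_mask(target_pow, interferer_pow)
rm = ratio_mask(target_pow, interferer_pow)
separated_bm = apply_mask(bm, mixture_pow)
separated_rm = apply_mask(rm, mixture_pow)
```

The binary mask keeps or discards each time-frequency bin outright, while the ratio mask scales each bin by the estimated share of target energy. Under the additive-power assumption used here, applying the oracle ratio mask to the mixture recovers the target power in each bin.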
Pages: 1742-1754
Page count: 13