Using Visual Speech Information in Masking Methods for Audio Speaker Separation

Cited by: 7

Authors:

Khan, Faheem Ullah [1]

Milner, Ben P. [1]

Le Cornu, Thomas [1]

Affiliations:

[1] Univ East Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2018 / Vol. 26 / No. 10

Keywords:

Speaker separation; audio-visual processing; binary masks; ratio mask; ENHANCEMENT; NOISE; INTELLIGIBILITY; SEGREGATION; PREDICTION; FREQUENCY; TRACKING;

DOI:

10.1109/TASLP.2018.2835719

Chinese Library Classification number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

This paper examines whether visual speech information can be effective within audio-masking-based speaker separation to improve the quality and intelligibility of the target speech. First, two visual-only methods of generating an audio mask for speaker separation are developed. These use a deep neural network to map the visual speech features to an audio feature space, from which both visually derived binary masks and visually derived ratio masks are estimated and then applied to the speech mixture. Second, an audio ratio masking method forms a baseline approach for speaker separation, which is extended to exploit visual speech information to form audio-visual ratio masks. Speech quality and intelligibility tests are carried out on the visual-only, audio-only, and audio-visual masking methods of speaker separation at mixing levels from -10 dB to +10 dB. These reveal substantial improvements in the target speech when applying the visual-only and audio-only masks, with the highest performance occurring when audio and visual information are combined to create the audio-visual masks.
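As a rough illustration of the two mask types named in the abstract, the sketch below builds a binary mask and a ratio mask over a magnitude spectrogram and applies them to a mixture. It is not the paper's method: the abstract does not give the mask definitions, so the 0 dB binary threshold, the power-based ratio formula, and the function names `binary_mask`, `ratio_mask`, and `apply_mask` are all assumptions. The target and interferer magnitudes are passed in directly here, whereas in the paper they would come from estimates (for example, audio features predicted from visual features by a deep neural network).

```python
from typing import List

Spectrogram = List[List[float]]  # magnitudes indexed as [frame][frequency bin]


def binary_mask(target: Spectrogram, interferer: Spectrogram,
                threshold_db: float = 0.0) -> Spectrogram:
    """Return 1.0 where the target exceeds the interferer by threshold_db, else 0.0.

    The 0 dB default threshold is an assumption, not taken from the paper.
    """
    gain = 10.0 ** (threshold_db / 20.0)  # dB threshold as a magnitude ratio
    return [[1.0 if t > gain * i else 0.0 for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(target, interferer)]


def ratio_mask(target: Spectrogram, interferer: Spectrogram,
               eps: float = 1e-12) -> Spectrogram:
    """Return the fraction of power in each bin attributed to the target.

    Uses |T|^2 / (|T|^2 + |I|^2), one common ratio-mask definition (assumed here).
    eps avoids division by zero in silent bins.
    """
    return [[(t * t) / (t * t + i * i + eps) for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(target, interferer)]


def apply_mask(mask: Spectrogram, mixture: Spectrogram) -> Spectrogram:
    """Scale each time-frequency bin of the mixture by the mask value."""
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, mixture)]


# Toy magnitudes: 2 frames x 2 frequency bins (made-up values).
target_mag = [[2.0, 0.5], [1.0, 1.0]]
interferer_mag = [[1.0, 1.0], [0.0, 3.0]]
mixture_mag = [[3.0, 1.5], [1.0, 4.0]]

separated_binary = apply_mask(binary_mask(target_mag, interferer_mag), mixture_mag)
separated_ratio = apply_mask(ratio_mask(target_mag, interferer_mag), mixture_mag)
```

The binary mask keeps or discards each bin outright, while the ratio mask attenuates bins in proportion to the estimated target share, which is why the two are usually evaluated separately for quality and intelligibility.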

Pages: 1742-1754

Number of pages: 13

50 items in total

[41] Audio-Visual Automatic Speech Recognition Using PZM, MFCC and Statistical Analysis
Debnath, Saswati
Roy, Pinki
INTERNATIONAL JOURNAL OF INTERACTIVE MULTIMEDIA AND ARTIFICIAL INTELLIGENCE, 2021, 7 (02): 121-133
[42] Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks
Hou, Jen-Cheng
Wang, Syu-Siang
Lai, Ying-Hui
Tsao, Yu
Chang, Hsiu-Wen
Wang, Hsin-Min
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2018, 2 (02): 117-128
[43] Do gender differences in audio-visual benefit and visual influence in audio-visual speech perception emerge with age?
Alm, Magnus
Behne, Dawn
FRONTIERS IN PSYCHOLOGY, 2015, 6
[44] On Learning Spectral Masking for Single Channel Speech Enhancement Using Feedforward and Recurrent Neural Networks
Saleem, Nasir
Khattak, Muhammad Irfan
Al-Hasan, Muath
Qazi, Abdul Baseer
IEEE ACCESS, 2020, 8: 160581-160595
[45] Speech Separation Using Deep Learning
Nandal, P.
SUSTAINABLE COMMUNICATION NETWORKS AND APPLICATION, ICSCN 2019, 2020, 39: 319-326
[46] The Effect of Spatial Separation of Sound Masking and Distracting Speech Sounds on Working Memory Performance and Annoyance
Renz, Tobias
Leistner, Philip
Liebl, Andreas
ACTA ACUSTICA UNITED WITH ACUSTICA, 2018, 104 (04): 611-622
[47] Audio-Visual Cross-Attention Network for Robotic Speaker Tracking
Qian, Xinyuan
Wang, Zhengdong
Wang, Jiadong
Guan, Guohui
Li, Haizhou
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31: 550-562
[48] A new feature set for masking-based monaural speech separation
Pirhosseinloo, Shadi
Brumberg, Jonathan S.
2018 CONFERENCE RECORD OF 52ND ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS, AND COMPUTERS, 2018: 828-832
[49] Brain-informed speech separation (BISS) for enhancement of target speaker in multitalker speech perception
Ceolini, Enea
Hjortkjaer, Jens
De Wong, Daniel
O'Sullivan, James
Raghavan, Vinay S.
Herrero, Jose
Mehta, Ashesh D.
Liu, Shih-Chii
Mesgarani, Nima
NEUROIMAGE, 2020, 223
[50] Dynamic Stream Weight Estimation in Coupled-HMM-based Audio-visual Speech Recognition Using Multilayer Perceptrons
Abdelaziz, Ahmed Hussen
Kolossa, Dorothea
15TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2014), VOLS 1-4, 2014: 1144-1148