Using Visual Speech Information in Masking Methods for Audio Speaker Separation

Cited by: 7
Authors
Khan, Faheem Ullah [1 ]
Milner, Ben P. [1 ]
Le Cornu, Thomas [1 ]
Affiliations
[1] Univ East Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England
Keywords
Speaker separation; audio-visual processing; binary masks; ratio mask; enhancement; noise; intelligibility; segregation; prediction; frequency; tracking
DOI
10.1109/TASLP.2018.2835719
Chinese Library Classification
O42 [Acoustics]
Subject Classification
070206; 082403
Abstract
This paper examines whether visual speech information can be effective within audio-masking-based speaker separation to improve the quality and intelligibility of the target speech. Two visual-only methods of generating an audio mask for speaker separation are first developed. These use a deep neural network to map the visual speech features to an audio feature space from which both visually derived binary masks and visually derived ratio masks are estimated, before application to the speech mixture. Second, an audio ratio masking method forms a baseline approach for speaker separation which is extended to exploit visual speech information to form audio-visual ratio masks. Speech quality and intelligibility tests are carried out on the visual-only, audio-only, and audio-visual masking methods of speaker separation at mixing levels from -10 to +10 dB. These reveal substantial improvements in the target speech when applying the visual-only and audio-only masks, but with highest performance occurring when combining audio and visual information to create the audio-visual masks.
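The abstract contrasts binary masks (hard time-frequency decisions) with ratio masks (soft gains) applied to a speech mixture. As a rough illustration only, not the paper's implementation, the sketch below constructs both mask types from oracle target and interferer magnitude spectrograms (here just random arrays; the variable names `T` and `I` and the additive-mixture assumption are this sketch's own):

```python
import numpy as np

# Hypothetical oracle magnitudes for a target and an interfering speaker
# (random stand-ins for real magnitude spectrograms; shapes are arbitrary).
rng = np.random.default_rng(0)
T = rng.random((4, 5)) + 1e-8   # target speaker magnitude spectrogram
I = rng.random((4, 5)) + 1e-8   # interfering speaker magnitude spectrogram
mixture = T + I                 # simplifying assumption: additive mixture

# Binary mask: 1 where the target dominates the interferer, else 0.
binary_mask = (T > I).astype(float)

# Ratio mask: soft gain in [0, 1], target energy over total energy.
ratio_mask = T / (T + I)

# Apply each mask to the mixture to estimate the target.
est_binary = binary_mask * mixture
est_ratio = ratio_mask * mixture
```

With oracle magnitudes and an additive mixture, the ratio mask recovers the target exactly, while the binary mask keeps only the time-frequency cells the target dominates; the paper's contribution is estimating such masks from visual and audio-visual features rather than from oracle knowledge.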
Pages: 1742-1754
Page count: 13
Related Papers
50 records in total
  • [31] Detection of Ball Hits in a Tennis Game Using Audio and Visual Information
    Huang, Qiang
    Cox, Stephen
    Zhou, Xiangzeng
    Xie, Lei
    2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2012,
  • [32] The Role of Visual Speech Information in Supporting Perceptual Learning of Degraded Speech
    Wayne, Rachel V.
    Johnsrude, Ingrid S.
    JOURNAL OF EXPERIMENTAL PSYCHOLOGY-APPLIED, 2012, 18 (04) : 419 - 435
  • [33] Semantic Cues Modulate Children's and Adults' Processing of Audio-Visual Face Mask Speech
    Schwarz, Julia
    Li, Katrina Kechun
    Sim, Jasper Hong
    Zhang, Yixin
    Buchanan-Worster, Elizabeth
    Post, Brechtje
    Gibson, Jenny Louise
    McDougall, Kirsty
    FRONTIERS IN PSYCHOLOGY, 2022, 13
  • [34] Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion
    Gebru, Israel D.
    Ba, Sileye
    Li, Xiaofei
    Horaud, Radu
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2018, 40 (05) : 1086 - 1099
  • [35] Improved Lite Audio-Visual Speech Enhancement
    Chuang, Shang-Yi
    Wang, Hsin-Min
    Tsao, Yu
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1345 - 1359
  • [36] Audio-visual speech in noise perception in dyslexia
    van Laarhoven, Thijs
    Keetels, Mirjam
    Schakel, Lemmy
    Vroomen, Jean
    DEVELOPMENTAL SCIENCE, 2018, 21 (01)
  • [37] Somatosensory contribution to audio-visual speech processing
    Ito, Takayuki
    Ohashi, Hiroki
    Gracco, Vincent L.
    CORTEX, 2021, 143 : 195 - 204
  • [38] Complementary models for audio-visual speech classification
    Sad, Gonzalo D.
    Terissi, Lucas D.
    Gomez, Juan C.
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2022, 25 (01) : 231 - 249
  • [39] Deep neural networks based binary classification for single channel speaker independent multi-talker speech separation
    Saleem, Nasir
    Khattak, Muhammad Irfan
    APPLIED ACOUSTICS, 2020, 167
  • [40] Audio-Visual Speech Enhancement Using Conditional Variational Auto-Encoders
    Sadeghi, Mostafa
    Leglaive, Simon
    Alameda-Pineda, Xavier
    Girin, Laurent
    Horaud, Radu
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 : 1788 - 1800