AUDIO-VISUAL SPEECH INPAINTING WITH DEEP LEARNING

Cited by: 16

Authors:

Morrone, Giovanni [1]

Michelsanti, Daniel [2]

Tan, Zheng-Hua [2]

Jensen, Jesper [2,3]

Affiliations:

[1] Univ Modena & Reggio Emilia, Dept Engn Enzo Ferrari, Modena, Italy

[2] Aalborg Univ, Dept Elect Syst, Aalborg, Denmark

[3] Oticon AS, Copenhagen, Denmark

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021

Keywords:

speech inpainting; audio-visual; deep learning; face-landmarks; multi-task learning

DOI:

10.1109/ICASSP39728.2021.9413488

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, which show highly different structure than speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different duration. We also experiment with a multi-task learning approach where a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large, while the proposed audio-visual approach is able to plausibly restore missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.


Pages: 6653-6657

Page count: 5
