Large-Scale Unsupervised Audio Pre-Training for Video-to-Speech Synthesis

Cited by: 0
Authors
Kefalas, Triantafyllos [1 ]
Panagakis, Yannis [2 ,3 ]
Pantic, Maja [1 ]
Affiliations
[1] Imperial Coll London, Dept Comp, London SW7 2AZ, England
[2] Natl & Kapodistrian Univ Athens, Dept Informat & Telecommun, Athens 16122, Greece
[3] Archimedes Res Unit, Maroussi 15125, Greece
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Video-to-speech; speech synthesis; generative adversarial networks (GANs); conformer; pre-training; RECOGNITION;
DOI
10.1109/TASLP.2024.3382500
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline Classification Codes
070206; 082403;
Abstract
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Previous approaches train almost exclusively on audio-visual datasets, i.e., datasets in which every audio sample has a corresponding video sample. This precludes the use of abundant audio-only corpora that lack a visual modality, such as audiobooks, radio podcasts, and speech recognition datasets. In this paper, we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz, and then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task. The pre-training step uses only audio samples and requires neither labels nor corresponding samples from other modalities (visual, text). We demonstrate that this improves the reconstructed speech and that it is a previously unexplored way to improve the quality of the generator in a cross-modal task while requiring samples from only one of the modalities. We conduct experiments using both raw audio and mel spectrograms as target outputs and benchmark our models against existing work.
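For illustration, the decoder-transfer idea described in the abstract can be sketched as a two-stage recipe. This is a minimal sketch, not the authors' implementation: the paper's models use conformer blocks and GAN training (per the keywords above), whereas the module definitions, shapes, and the checkpoint name `audio_decoder.pt` below are hypothetical stand-ins chosen only to show where the pre-trained weights flow.

```python
# Sketch of the two-stage training scheme: (1) unsupervised audio-only
# pre-training of an encoder-decoder, (2) reusing the pre-trained audio
# decoder to initialize the video-to-speech model. All names/shapes are
# hypothetical; the paper's actual architectures are more involved.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Encodes a raw waveform into a latent frame sequence."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=400, stride=200), nn.ReLU())

    def forward(self, wav):           # wav: (batch, 1, samples)
        return self.net(wav)          # -> (batch, dim, frames)

class AudioDecoder(nn.Module):
    """Reconstructs the waveform from the latent frame sequence."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.ConvTranspose1d(dim, 1, kernel_size=400, stride=200)

    def forward(self, z):
        return self.net(z)

# Stage 1: audio-only pre-training (no labels, no video required).
encoder, decoder = AudioEncoder(), AudioDecoder()
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
wav = torch.randn(8, 1, 24000)        # stand-in batch: 1 s of audio at 24 kHz
recon = decoder(encoder(wav))         # kernel/stride chosen so shapes match
loss = nn.functional.l1_loss(recon, wav)
opt.zero_grad()
loss.backward()
opt.step()
torch.save(decoder.state_dict(), "audio_decoder.pt")  # hypothetical checkpoint

# Stage 2: video-to-speech training, initializing the audio decoder from
# the pre-trained weights instead of training it from scratch.
class VideoEncoder(nn.Module):
    """Maps silent lip-video frames into the same latent space."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Linear(96 * 96, dim)        # toy per-frame embedding

    def forward(self, frames):                    # frames: (batch, T, 96*96)
        return self.net(frames).transpose(1, 2)   # -> (batch, dim, T)

video_encoder = VideoEncoder()
v2s_decoder = AudioDecoder()
v2s_decoder.load_state_dict(torch.load("audio_decoder.pt"))  # the transfer step
```

The key point the sketch captures is that only the decoder crosses over: the video encoder and any adversarial losses are then trained on paired audio-visual data as usual, with the decoder starting from weights learned on audio-only corpora.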
Pages: 2255-2268
Number of pages: 14