BRAVEN: IMPROVING SELF-SUPERVISED PRE-TRAINING FOR VISUAL AND AUDITORY SPEECH RECOGNITION

Cited by: 5

Authors:

Haliassos, Alexandros [1]

Zinonos, Andreas [1]

Mira, Rodrigo [1]

Petridis, Stavros [1]

Pantic, Maja [1]

Affiliations:

[1] Imperial College London, London, England

Source:

2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024) | 2024

Keywords:

visual / auditory speech recognition; self-supervised learning; multi-modal learning

DOI:

10.1109/ICASSP48485.2024.10448473

Chinese Library Classification (CLC):

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

Self-supervision has recently shown great promise for learning visual and auditory speech representations from unlabelled data. In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods in various settings. Moreover, we observe favourable scaling behaviour by increasing the amount of unlabelled data well beyond other self-supervised works. In particular, we achieve 20.0% / 1.7% word error rate for VSR / ASR on the LRS3 test set, with only 30 hours of labelled data and no external ASR models. Our results suggest that readily available unlabelled audiovisual data can largely replace costly transcribed data. Code at https://github.com/ahaliassos/raven.


Pages: 11431-11435

Page count: 5

