Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data

Cited by: 0
Authors
Kang, Yu [1]
Liu, Tianqiao [1]
Li, Hang [1]
Hao, Yang [1]
Ding, Wenbiao [1,2]
Affiliations
[1] TAL Educ Grp, Beijing, Peoples R China
[2] Tencent, Beijing, Peoples R China
Source
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2022
Funding
National Key R&D Program of China
Keywords
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal pre-training for audio-and-text has recently been shown to be effective and has significantly improved the performance of many downstream speech understanding tasks. However, these state-of-the-art pre-trained audio-text models work well only when provided with large amounts of parallel audio-and-text data, which poses challenges for many languages that are rich in unimodal corpora but lack parallel cross-modal corpora. In this paper, we investigate whether it is possible to pre-train an audio-text multimodal model with extremely low-resource parallel data and extra non-parallel unimodal data. Our pre-training framework consists of the following components: (1) Intra-modal Denoising Auto-Encoding (IDAE), which reconstructs input text (audio) representations from a noisy version of themselves. (2) Cross-modal Denoising Auto-Encoding (CDAE), which is pre-trained to reconstruct the input text (audio) given both a noisy version of the input text (audio) and the corresponding translated noisy audio features (text embeddings). (3) Iterative Denoising Process (IDP), which iteratively translates raw audio (text) and the corresponding text embeddings (audio features) translated in the previous iteration into new, less noisy text embeddings (audio features). We adapt a dual cross-modal Transformer as our backbone model, which consists of two unimodal encoders for IDAE and two cross-modal encoders for CDAE and IDP. Our method achieves performance on multiple downstream speech understanding tasks comparable to that of a model pre-trained on fully parallel data, demonstrating the great potential of the proposed method.
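To make the two denoising objectives concrete, below is a minimal PyTorch sketch of what IDAE and CDAE losses could look like: each modality is corrupted by random masking, a unimodal encoder reconstructs its own clean input, and a cross-modal encoder reconstructs one modality from its noisy version fused with the paired noisy other modality. All module names, dimensions, the masking-based noising scheme, and the sequence-concatenation fusion are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of IDAE / CDAE-style denoising objectives (assumptions, not the paper's code).
import torch
import torch.nn as nn


def add_noise(x: torch.Tensor, drop_prob: float = 0.15) -> torch.Tensor:
    """Corrupt a sequence of embeddings by randomly zeroing out positions (an assumed noising scheme)."""
    keep = (torch.rand(x.shape[:2], device=x.device) > drop_prob).float().unsqueeze(-1)
    return x * keep


class UnimodalEncoder(nn.Module):
    """Transformer encoder over one modality; reconstructs the clean representation (IDAE)."""
    def __init__(self, dim: int = 256, layers: int = 4, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(x))


class CrossModalEncoder(nn.Module):
    """Reconstructs one modality given its noisy version plus the paired noisy other modality (CDAE)."""
    def __init__(self, dim: int = 256, layers: int = 4, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, noisy_target: torch.Tensor, noisy_source: torch.Tensor) -> torch.Tensor:
        # Concatenate along the sequence axis so self-attention mixes both modalities,
        # then keep only the target positions for reconstruction.
        fused = torch.cat([noisy_target, noisy_source], dim=1)
        out = self.encoder(fused)
        return self.head(out[:, : noisy_target.size(1)])


if __name__ == "__main__":
    mse = nn.MSELoss()
    text = torch.randn(2, 20, 256)   # placeholder text embeddings (batch, tokens, dim)
    audio = torch.randn(2, 50, 256)  # placeholder audio features projected to the same dim

    # IDAE: reconstruct clean text from its noisy version (the same idea applies to audio).
    text_idae = UnimodalEncoder()
    loss_idae = mse(text_idae(add_noise(text)), text)

    # CDAE: reconstruct clean text from noisy text plus the paired noisy audio.
    text_cdae = CrossModalEncoder()
    loss_cdae = mse(text_cdae(add_noise(text), add_noise(audio)), text)

    (loss_idae + loss_cdae).backward()
    print(float(loss_idae), float(loss_cdae))
```

In this reading, the IDP stage would simply re-apply the cross-modal encoder iteratively, feeding each iteration's output back in as a progressively less noisy conditioning signal; that loop is omitted here for brevity.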
Pages: 10875-10883
Page count: 9