Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data

Cited by: 0
Authors
Kang, Yu [1]
Liu, Tianqiao [1]
Li, Hang [1]
Hao, Yang [1]
Ding, Wenbiao [1,2]
Affiliations
[1] TAL Educ Grp, Beijing, Peoples R China
[2] Tencent, Beijing, Peoples R China
Source
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2022
Funding
National Key R&D Program of China
Keywords
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal pre-training for audio-and-text has recently been shown to be effective and has significantly improved the performance of many downstream speech understanding tasks. However, these state-of-the-art pre-trained audio-text models work well only when provided with large amounts of parallel audio-and-text data, which poses challenges for many languages that are rich in unimodal corpora but lack parallel cross-modal corpora. In this paper, we investigate whether it is possible to pre-train an audio-text multimodal model with extremely low-resource parallel data and extra non-parallel unimodal data. Our pre-training framework consists of the following components: (1) Intra-modal Denoising Auto-Encoding (IDAE), which reconstructs input text (audio) representations from a noisy version of themselves. (2) Cross-modal Denoising Auto-Encoding (CDAE), which is pre-trained to reconstruct the input text (audio) given both a noisy version of the input text (audio) and the corresponding translated noisy audio features (text embeddings). (3) Iterative Denoising Process (IDP), which iteratively translates raw audio (text) and the corresponding text embeddings (audio features) translated in the previous iteration into new, less noisy text embeddings (audio features). We adapt a dual cross-modal Transformer as our backbone model, which consists of two unimodal encoders for IDAE and two cross-modal encoders for CDAE and IDP. Our method achieves performance on multiple downstream speech understanding tasks comparable to that of a model pre-trained on fully parallel data, demonstrating the great potential of the proposed method.
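To make the two denoising objectives concrete, below is a minimal PyTorch sketch of what IDAE and CDAE losses could look like: each modality is corrupted by random masking, a unimodal encoder reconstructs its own clean input, and a cross-modal encoder reconstructs one modality from its noisy version fused with the paired noisy other modality. All module names, dimensions, the masking-based noising scheme, and the sequence-concatenation fusion are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of IDAE / CDAE-style denoising objectives (assumptions, not the paper's code).
import torch
import torch.nn as nn


def add_noise(x: torch.Tensor, drop_prob: float = 0.15) -> torch.Tensor:
    """Corrupt a sequence of embeddings by randomly zeroing out positions (an assumed noising scheme)."""
    keep = (torch.rand(x.shape[:2], device=x.device) > drop_prob).float().unsqueeze(-1)
    return x * keep


class UnimodalEncoder(nn.Module):
    """Transformer encoder over one modality; reconstructs the clean representation (IDAE)."""
    def __init__(self, dim: int = 256, layers: int = 4, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(x))


class CrossModalEncoder(nn.Module):
    """Reconstructs one modality given its noisy version plus the paired noisy other modality (CDAE)."""
    def __init__(self, dim: int = 256, layers: int = 4, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, noisy_target: torch.Tensor, noisy_source: torch.Tensor) -> torch.Tensor:
        # Concatenate along the sequence axis so self-attention mixes both modalities,
        # then keep only the target positions for reconstruction.
        fused = torch.cat([noisy_target, noisy_source], dim=1)
        out = self.encoder(fused)
        return self.head(out[:, : noisy_target.size(1)])


if __name__ == "__main__":
    mse = nn.MSELoss()
    text = torch.randn(2, 20, 256)   # placeholder text embeddings (batch, tokens, dim)
    audio = torch.randn(2, 50, 256)  # placeholder audio features projected to the same dim

    # IDAE: reconstruct clean text from its noisy version (the same idea applies to audio).
    text_idae = UnimodalEncoder()
    loss_idae = mse(text_idae(add_noise(text)), text)

    # CDAE: reconstruct clean text from noisy text plus the paired noisy audio.
    text_cdae = CrossModalEncoder()
    loss_cdae = mse(text_cdae(add_noise(text), add_noise(audio)), text)

    (loss_idae + loss_cdae).backward()
    print(float(loss_idae), float(loss_cdae))
```

In this reading, the IDP stage would simply re-apply the cross-modal encoder iteratively, feeding each iteration's output back in as a progressively less noisy conditioning signal; that loop is omitted here for brevity.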
Pages: 10875-10883
Page count: 9