Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval

Cited by: 2
Authors
Luo, Kaiyi [1 ,2 ]
Zhang, Xulong [1 ]
Wang, Jianzong [1 ]
Li, Huaxiong [2 ]
Cheng, Ning [1 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China
[2] Nanjing Univ, Dept Control Sci & Intelligent Engn, Nanjing, Peoples R China
Source
2023 IEEE 35TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI | 2023
Keywords
Cross-modal Retrieval; Data Reconstruction; Contrastive Learning;
DOI
10.1109/ICTAI59109.2023.00137
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, remains challenging because of the difficulty of uncovering discriminative features from audio clips and texts. Existing studies are restricted in the following two ways: 1) Most researchers utilize contrastive learning to construct a common subspace where similarities among data can be measured. However, they consider only cross-modal transformation and neglect intra-modal separability. In addition, the temperature parameter is not adaptively adjusted along with semantic guidance, which degrades performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment. This paper introduces a novel audio-text oriented CMR approach, termed Contrastive Latent Space Reconstruction Learning (CLSR). CLSR improves contrastive representation learning by taking intra-modal separability into account and adopting an adaptive temperature control strategy. Moreover, latent representation reconstruction modules are embedded into the CMR framework, which improves modal interaction. Comparative experiments against state-of-the-art methods on two audio-text datasets validate the superiority of CLSR.
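For readers unfamiliar with the contrastive setup the abstract refers to, the following is a minimal sketch of a generic symmetric cross-modal contrastive (InfoNCE-style) loss with a fixed temperature. It is not the CLSR formulation: the abstract does not specify the intra-modal term, the adaptive temperature schedule, or the reconstruction modules, so none of them is modeled here. The function names, the toy embeddings, and the default temperature of 0.07 are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy where keys[i] is the positive for queries[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(queries)


def symmetric_contrastive_loss(audio, text, temperature=0.07):
    """Average of the audio-to-text and text-to-audio losses."""
    a2t = info_nce(audio, text, temperature)
    t2a = info_nce(text, audio, temperature)
    return 0.5 * (a2t + t2a)


# Toy (hypothetical) embeddings: row i of each list forms a matched pair.
audio_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.9, 0.1], [0.1, 0.9]]

aligned = symmetric_contrastive_loss(audio_emb, text_emb)
shuffled = symmetric_contrastive_loss(audio_emb, text_emb[::-1])
print(aligned < shuffled)
```

The temperature scales the similarity logits before the softmax, which is why the abstract treats it as a quantity worth adapting: a smaller value sharpens the distribution over candidates, while a larger value flattens it.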
Pages: 913-917
Page count: 5