CROSS MODAL AUDIO SEARCH AND RETRIEVAL WITH JOINT EMBEDDINGS BASED ON TEXT AND AUDIO

Cited by: 0
Authors
Elizalde, Benjamin [1, 2]
Zarar, Shuayb [1]
Raj, Bhiksha [2]
Affiliations
[1] Microsoft Res, Redmond, WA 98052 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Source
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019
Keywords
Joint Audio-Text Embedding; Cross Modal Retrieval; Audio Search Engine; Content-Based Audio Retrieval; Query by Example; Siamese Neural Network;
DOI
Not available
Chinese Library Classification
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
Existing audio search engines use one of two approaches: matching text-text or audio-audio pairs. In the former, text queries are matched to semantically similar words in an index of audio metadata to retrieve corresponding audio clips or segments, while in the latter, audio signals are directly used to retrieve acoustically similar recordings from an audio database. However, independent treatment of text and audio has precluded information exchange between the two modalities. This is a problem because similarity in language does not always imply similarity in acoustics, and vice versa. Moreover, independent modeling can be error-prone, especially for ad hoc, user-generated recordings, which are noisy in both audio and their associated textual labels. To overcome this limitation, we propose a framework that learns joint embeddings from a shared lexico-acoustic space, where vectors from either modality can be mapped together and compared directly. Thus, we improve semantic knowledge and enable the use of either text or audio queries to search and retrieve audio. Our results break new ground for a cross-modal audio search engine, and further exploration of lexico-acoustic spaces.
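The retrieval step the abstract describes, comparing vectors from either modality directly in one shared space, can be illustrated with a minimal sketch. This is not the authors' implementation: the three-dimensional vectors, the clip names, and the functions `cosine` and `retrieve` are hypothetical stand-ins, and cosine similarity is an assumed choice of comparison measure. In a real system the vectors would come from trained text and audio encoders (the keywords mention a Siamese neural network), which are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_vec, index):
    """Rank (name, vector) entries from most to least similar to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in index]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical audio clips, already embedded in the shared space.
audio_index = [
    ("dog_bark.wav", [0.9, 0.1, 0.0]),
    ("rain.wav", [0.0, 1.0, 0.2]),
    ("siren.wav", [0.1, 0.0, 1.0]),
]

# Hypothetical query embeddings: one from a text query, one from an audio example.
text_query = [0.1, 0.9, 0.1]
audio_query = [0.8, 0.2, 0.1]

text_hits = retrieve(text_query, audio_index)
audio_hits = retrieve(audio_query, audio_index)

print("text query  ->", [name for name, _ in text_hits])
print("audio query ->", [name for name, _ in audio_hits])
```

Because both query types land in the same space, the same `retrieve` function serves text-to-audio and audio-to-audio search; only the encoder that produces the query vector would differ.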
Pages: 4095-4099
Page count: 5