CROSS MODAL AUDIO SEARCH AND RETRIEVAL WITH JOINT EMBEDDINGS BASED ON TEXT AND AUDIO

Cited by: 0

Authors:

Elizalde, Benjamin [1,2]

Zarar, Shuayb [1]

Raj, Bhiksha [2]

Affiliations:

[1] Microsoft Res, Redmond, WA 98052 USA

[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

Source:

2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019

Keywords:

Joint Audio-Text Embedding; Cross Modal Retrieval; Audio Search Engine; Content-Based Audio Retrieval; Query by Example; Siamese Neural Network;

DOI:

Not available

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Existing audio search engines use one of two approaches: matching text-text or audio-audio pairs. In the former, text queries are matched to semantically similar words in an index of audio metadata to retrieve corresponding audio clips or segments, while in the latter, audio signals are directly used to retrieve acoustically similar recordings from an audio database. However, independent treatment of text and audio has precluded information exchange between the two modalities. This is a problem because similarity in language does not always imply similarity in acoustics, and vice versa. Moreover, independent modeling can be error-prone, especially for ad hoc, user-generated recordings, which are noisy in both audio and their associated textual labels. To overcome this limitation, we propose a framework that learns joint embeddings from a shared lexico-acoustic space, where vectors from either modality can be mapped together and compared directly. Thus, we improve semantic knowledge and enable the use of either text or audio queries to search and retrieve audio. Our results break new ground for a cross-modal audio search engine and for further exploration of lexico-acoustic spaces.
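
The retrieval step the abstract describes, comparing vectors from either modality directly in one shared space, can be illustrated with a minimal sketch. It assumes the joint embeddings have already been learned; the three-dimensional vectors, the file names, and the helper functions `cosine_similarity` and `retrieve` are all hypothetical and are not taken from the paper, and cosine similarity is only one plausible choice of comparison.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, top_k=2):
    """Rank indexed clips by similarity to the query, best first."""
    scored = [(clip_id, cosine_similarity(query_vec, vec))
              for clip_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical joint embeddings of indexed audio clips (toy values).
AUDIO_INDEX = {
    "dog_bark.wav": [0.9, 0.1, 0.0],
    "rain.wav": [0.0, 0.2, 0.9],
    "siren.wav": [0.1, 0.9, 0.2],
}

# Assumed outputs of the text branch and the audio branch, respectively.
text_query = [0.8, 0.2, 0.1]    # e.g. embedding of the words "dog barking"
audio_query = [0.1, 0.1, 0.95]  # e.g. embedding of a recording of rainfall

# Both query types go through the same search function and the same index.
text_hits = retrieve(text_query, AUDIO_INDEX)
audio_hits = retrieve(audio_query, AUDIO_INDEX)
print(text_hits[0][0], audio_hits[0][0])
```

Because both queries live in the same space as the indexed clips, the search code does not need to know which modality produced the query vector.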


Pages: 4095-4099

Page count: 5

