Content-Based Music-Image Retrieval Using Self- and Cross-Modal Feature Embedding Memory

Cited by: 6
Authors
Nakatsuka, Takayuki [1]
Hamasaki, Masahiro [1]
Goto, Masataka [1]
Affiliations
[1] Natl Inst Adv Ind Sci & Technol, Tokyo, Japan
Source
2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV) | 2023
DOI
10.1109/WACV56688.2023.00221
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper describes a method based on deep metric learning for content-based cross-modal retrieval of a piece of music and its representative image (i.e., a music audio signal and its cover art image). We train music and image encoders so that the embeddings of a positive music-image pair lie close to each other, while those of a random pair lie far from each other, in a shared embedding space. Furthermore, we propose a mechanism called self- and cross-modal feature embedding memory, which stores both the music and image embeddings of any previous iterations in memory and enables the encoders to mine informative pairs for training. To perform such training, we constructed a dataset containing 78,325 music-image pairs. We demonstrate the effectiveness of the proposed mechanism on this dataset: specifically, our mechanism outperforms baseline methods by ×1.93–3.38 for the mean reciprocal rank, ×2.19–3.56 for recall@50, and 528–891 ranks for the median rank.
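The three metrics named in the abstract (mean reciprocal rank, recall@50, and median rank) are standard retrieval measures. The following is a minimal sketch of how they are conventionally computed from per-query ranks; it is not the authors' evaluation code. The function names `rank_of_match` and `retrieval_metrics` are hypothetical, and the tie-handling rule (ties resolved in favor of the true item) is an assumption.

```python
from statistics import median


def rank_of_match(scores, target_index):
    """Return the 1-based rank of the true item among all candidates.

    `scores` holds one similarity value per candidate (higher = more similar).
    Assumption: ties are resolved in favor of the true item.
    """
    target = scores[target_index]
    # Every candidate scored strictly higher than the true item pushes it down.
    return 1 + sum(1 for s in scores if s > target)


def retrieval_metrics(ranks, k=50):
    """Aggregate 1-based ranks (one per query) into the three metrics."""
    n = len(ranks)
    return {
        # Mean reciprocal rank: average of 1/rank, higher is better.
        "mrr": sum(1.0 / r for r in ranks) / n,
        # Recall@k: fraction of queries whose true item is in the top k.
        "recall_at_k": sum(1 for r in ranks if r <= k) / n,
        # Median rank: lower is better.
        "median_rank": median(ranks),
    }
```

In a music-to-image setting, `scores` would be the similarities between one music embedding and every candidate image embedding, with `target_index` pointing at the paired cover art. Because lower is better for median rank, the abstract states that improvement as a difference in ranks, whereas the gains in the other two metrics are stated as multiplicative factors.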
Pages: 2173-2183
Page count: 11