Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval

Cited by: 65
Authors
Yu, Yi [1 ]
Tang, Suhua [2 ]
Raposo, Francisco [3 ]
Chen, Lei [4 ]
Affiliations
[1] Natl Inst Informat, Chiyoda Ku, 2-1-2 Hitotsubashi, Tokyo 1018430, Japan
[2] Univ Electrocommun, 1-5-1 Chofugaoka, Chofu, Tokyo 1828585, Japan
[3] Univ Lisbon, INESC ID Lisboa R Alves Redol 9, P-1000029 Lisbon, Portugal
[4] Hong Kong Univ Sci & Technol, Kowloon, Clear Water Bay, Hong Kong, Peoples R China
Keywords
Convolutional neural networks; deep cross-modal models; correlation learning between audio and lyrics; cross-modal music retrieval; music knowledge discovery; CLASSIFICATION; FEATURES
DOI
10.1145/3281746
Chinese Library Classification
TP [Automation and computer technology]
Subject classification code
0812
Abstract
Deep cross-modal learning has demonstrated excellent performance in cross-modal multimedia retrieval, with the aim of learning joint representations across different data modalities. Unfortunately, little research has focused on cross-modal correlation learning in which the temporal structures of different data modalities, such as audio and lyrics, are taken into account. Motivated by the inherently temporal structure of music, we learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture consisting of two-branch deep neural networks for the audio modality and the text modality (lyrics). Data from both modalities are projected into a shared canonical space, where intermodal canonical correlation analysis serves as the objective function measuring the similarity of temporal structures. This is the first study to use deep architectures for learning the temporal correlation between audio and lyrics. A pretrained Doc2Vec model followed by fully connected layers represents the lyrics. Two significant contributions are made in the audio branch: (i) We propose an end-to-end network that learns the cross-modal correlation between audio and lyrics, in which feature extraction and correlation learning are performed simultaneously and a joint representation is learned with temporal structures taken into account. (ii) For feature extraction, we further represent an audio signal as a short sequence of local summaries (VGG16 features) and apply a recurrent neural network to compute a compact feature that better captures the temporal structure of music audio. Experimental results on using audio to retrieve lyrics and using lyrics to retrieve audio verify the effectiveness of the proposed deep correlation learning architectures for cross-modal music retrieval.
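The objective described in the abstract — intermodal canonical correlation analysis between the two branch outputs — can be illustrated with a minimal numpy sketch. This is a generic DCCA-style correlation computation under our own assumptions (ridge regularization `reg`, embeddings as row-per-sample matrices), not the authors' actual implementation; in training, the negative of this quantity would serve as the loss on the audio and lyrics branch outputs.

```python
import numpy as np

def cca_correlation(H1, H2, reg=1e-4):
    """Sum of canonical correlations between two views.

    H1, H2: (n_samples, dim) embeddings, e.g. outputs of the audio
            and lyrics branches for a batch of paired songs.
    reg:    small ridge term added to the covariances for stability.
    """
    n = H1.shape[0]
    H1c = H1 - H1.mean(axis=0)           # center each view
    H2c = H2 - H2.mean(axis=0)
    S11 = H1c.T @ H1c / (n - 1) + reg * np.eye(H1.shape[1])
    S22 = H2c.T @ H2c / (n - 1) + reg * np.eye(H2.shape[1])
    S12 = H1c.T @ H2c / (n - 1)          # cross-covariance

    def inv_sqrt(S):
        # inverse matrix square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Canonical correlations are the singular values of the
    # whitened cross-covariance matrix T.
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False).sum()
```

For two views related by an invertible linear map, the sum approaches the embedding dimension (every canonical correlation near 1); for independent views, it stays small — which is why maximizing it pulls paired audio and lyrics embeddings together.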
Pages: 16
Related papers (50 records)
  • [1] Deep Semantic Correlation with Adversarial Learning for Cross-Modal Retrieval
    Hua, Yan
    Du, Jianhe
    PROCEEDINGS OF 2019 IEEE 9TH INTERNATIONAL CONFERENCE ON ELECTRONICS INFORMATION AND EMERGENCY COMMUNICATION (ICEIEC 2019), 2019, : 252 - 255
  • [2] Cross-Modal Interaction via Reinforcement Feedback for Audio-Lyrics Retrieval
    Zhou, Dong
    Lei, Fang
    Li, Lin
    Zhou, Yongmei
    Yang, Aimin
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1248 - 1260
  • [3] Cross-Modal Retrieval Using Deep Learning
    Malik, Shaily
    Bhardwaj, Nikhil
    Bhardwaj, Rahul
    Kumar, Saurabh
    PROCEEDINGS OF THIRD DOCTORAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE, DOSCI 2022, 2023, 479 : 725 - 734
  • [4] Deep Cross-Modal Retrieval for Remote Sensing Image and Audio
    Guo Mao
    Yuan Yuan
    Lu Xiaoqiang
    2018 10TH IAPR WORKSHOP ON PATTERN RECOGNITION IN REMOTE SENSING (PRRS), 2018,
  • [5] Incomplete Cross-Modal Retrieval with Deep Correlation Transfer
    Shi, Dan
    Zhu, Lei
    Li, Jingjing
    Dong, Guohua
    Zhang, Huaxiang
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (05)
  • [6] On Metric Learning for Audio-Text Cross-Modal Retrieval
    Mei, Xinhao
    Liu, Xubo
    Sun, Jianyuan
    Plumbley, Mark
    Wang, Wenwu
    INTERSPEECH 2022, 2022, : 4142 - 4146
  • [7] Deep canonical correlation analysis with progressive and hypergraph learning for cross-modal retrieval
    Shao, Jie
    Wang, Leiquan
    Zhao, Zhicheng
    Su, Fei
    Cai, Anni
    NEUROCOMPUTING, 2016, 214 : 618 - 628
  • [8] Deep Semantic Correlation Learning based Hashing for Multimedia Cross-Modal Retrieval
    Gong, Xiaolong
    Huang, Linpeng
    Wang, Fuwei
    2018 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2018, : 117 - 126
  • [9] Deep Multimodal Transfer Learning for Cross-Modal Retrieval
    Zhen, Liangli
    Hu, Peng
    Peng, Xi
    Goh, Rick Siow Mong
    Zhou, Joey Tianyi
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (02) : 798 - 810
  • [10] Variational Deep Representation Learning for Cross-Modal Retrieval
    Yang, Chen
    Deng, Zongyong
    Li, Tianyu
    Liu, Hao
    Liu, Libo
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2021, PT II, 2021, 13020 : 498 - 510