Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval

Cited by: 65
Authors
Yu, Yi [1 ]
Tang, Suhua [2 ]
Raposo, Francisco [3 ]
Chen, Lei [4 ]
Affiliations
[1] Natl Inst Informat, Chiyoda Ku, 2-1-2 Hitotsubashi, Tokyo 1018430, Japan
[2] Univ Electrocommun, 1-5-1 Chofugaoka, Chofu, Tokyo 1828585, Japan
[3] Univ Lisbon, INESC ID Lisboa R Alves Redol 9, P-1000029 Lisbon, Portugal
[4] Hong Kong Univ Sci & Technol, Kowloon, Clear Water Bay, Hong Kong, Peoples R China
Keywords
Convolutional neural networks; deep cross-modal models; correlation learning between audio and lyrics; cross-modal music retrieval; music knowledge discovery; CLASSIFICATION; FEATURES
DOI
10.1145/3281746
Chinese Library Classification
TP [Automation and computer technology]
Subject classification code
0812
Abstract
Deep cross-modal learning has demonstrated excellent performance in cross-modal multimedia retrieval, with the aim of learning joint representations across different data modalities. Unfortunately, little research has focused on cross-modal correlation learning in which the temporal structures of different data modalities, such as audio and lyrics, are taken into account. Motivated by the inherently temporal structure of music, we learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture consisting of two-branch deep neural networks for the audio modality and the text modality (lyrics). Data from both modalities are projected into a shared canonical space, where intermodal canonical correlation analysis serves as the objective function measuring the similarity of temporal structures. This is the first study to use deep architectures for learning the temporal correlation between audio and lyrics. A pretrained Doc2Vec model followed by fully connected layers represents the lyrics. Two significant contributions are made in the audio branch: (i) We propose an end-to-end network that learns the cross-modal correlation between audio and lyrics, in which feature extraction and correlation learning are performed simultaneously and a joint representation is learned with temporal structures taken into account. (ii) For feature extraction, we further represent an audio signal as a short sequence of local summaries (VGG16 features) and apply a recurrent neural network to compute a compact feature that better captures the temporal structure of music audio. Experimental results on using audio to retrieve lyrics and using lyrics to retrieve audio verify the effectiveness of the proposed deep correlation learning architectures for cross-modal music retrieval.
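The objective described in the abstract — intermodal canonical correlation analysis between the two branch outputs — can be illustrated with a minimal numpy sketch. This is a generic DCCA-style correlation computation under our own assumptions (ridge regularization `reg`, embeddings as row-per-sample matrices), not the authors' actual implementation; in training, the negative of this quantity would serve as the loss on the audio and lyrics branch outputs.

```python
import numpy as np

def cca_correlation(H1, H2, reg=1e-4):
    """Sum of canonical correlations between two views.

    H1, H2: (n_samples, dim) embeddings, e.g. outputs of the audio
            and lyrics branches for a batch of paired songs.
    reg:    small ridge term added to the covariances for stability.
    """
    n = H1.shape[0]
    H1c = H1 - H1.mean(axis=0)           # center each view
    H2c = H2 - H2.mean(axis=0)
    S11 = H1c.T @ H1c / (n - 1) + reg * np.eye(H1.shape[1])
    S22 = H2c.T @ H2c / (n - 1) + reg * np.eye(H2.shape[1])
    S12 = H1c.T @ H2c / (n - 1)          # cross-covariance

    def inv_sqrt(S):
        # inverse matrix square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Canonical correlations are the singular values of the
    # whitened cross-covariance matrix T.
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False).sum()
```

For two views related by an invertible linear map, the sum approaches the embedding dimension (every canonical correlation near 1); for independent views, it stays small — which is why maximizing it pulls paired audio and lyrics embeddings together.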
Pages: 16
Related papers (50 records)
  • [1] Deep Semantic Correlation with Adversarial Learning for Cross-Modal Retrieval
    Hua, Yan
    Du, Jianhe
    PROCEEDINGS OF 2019 IEEE 9TH INTERNATIONAL CONFERENCE ON ELECTRONICS INFORMATION AND EMERGENCY COMMUNICATION (ICEIEC 2019), 2019, : 252 - 255
  • [2] Cross-Modal Interaction via Reinforcement Feedback for Audio-Lyrics Retrieval
    Zhou, Dong
    Lei, Fang
    Li, Lin
    Zhou, Yongmei
    Yang, Aimin
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1248 - 1260
  • [3] Cross-Modal Retrieval Using Deep Learning
    Malik, Shaily
    Bhardwaj, Nikhil
    Bhardwaj, Rahul
    Kumar, Saurabh
    PROCEEDINGS OF THIRD DOCTORAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE, DOSCI 2022, 2023, 479 : 725 - 734
  • [4] Deep Cross-Modal Retrieval for Remote Sensing Image and Audio
    Guo Mao
    Yuan Yuan
    Lu Xiaoqiang
    2018 10TH IAPR WORKSHOP ON PATTERN RECOGNITION IN REMOTE SENSING (PRRS), 2018,
  • [5] Incomplete Cross-Modal Retrieval with Deep Correlation Transfer
    Shi, Dan
    Zhu, Lei
    Li, Jingjing
    Dong, Guohua
    Zhang, Huaxiang
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (05)
  • [6] On Metric Learning for Audio-Text Cross-Modal Retrieval
    Mei, Xinhao
    Liu, Xubo
    Sun, Jianyuan
    Plumbley, Mark
    Wang, Wenwu
    INTERSPEECH 2022, 2022, : 4142 - 4146
  • [7] Deep canonical correlation analysis with progressive and hypergraph learning for cross-modal retrieval
    Shao, Jie
    Wang, Leiquan
    Zhao, Zhicheng
    Su, Fei
    Cai, Anni
    NEUROCOMPUTING, 2016, 214 : 618 - 628
  • [8] Deep Semantic Correlation Learning based Hashing for Multimedia Cross-Modal Retrieval
    Gong, Xiaolong
    Huang, Linpeng
    Wang, Fuwei
    2018 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2018, : 117 - 126
  • [9] Deep Multimodal Transfer Learning for Cross-Modal Retrieval
    Zhen, Liangli
    Hu, Peng
    Peng, Xi
    Goh, Rick Siow Mong
    Zhou, Joey Tianyi
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (02) : 798 - 810
  • [10] Variational Deep Representation Learning for Cross-Modal Retrieval
    Yang, Chen
    Deng, Zongyong
    Li, Tianyu
    Liu, Hao
    Liu, Libo
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2021, PT II, 2021, 13020 : 498 - 510