YuYin: a multi-task learning model of multi-modal e-commerce background music recommendation

Cited by: 1

Authors:

Ma, Le [1]

Wu, Xinda [1]

Tang, Ruiyuan [1]

Zhong, Chongjun [1]

Zhang, Kejun [1,2]

Affiliations:

[1] Zhejiang Univ, Hangzhou, Peoples R China

[2] Innovat Ctr Yangtze River Delta, Shanghai, Peoples R China

Source:

EURASIP Journal on Audio, Speech, and Music Processing | 2023 / Vol. 2023 / Issue 01

Funding:

National Natural Science Foundation of China

Keywords:

Cross-modal retrieval; Multi-modal; Music recommendation; Canonical correlation analysis

DOI:

10.1186/s13636-023-00306-6

Chinese Library Classification:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

Appropriate background music in e-commerce advertisements can help stimulate consumption and build product image. However, many factors, such as emotion and product category, must be taken into account, which makes manual music selection time-consuming and dependent on professional knowledge, so automatically recommending music for video becomes crucial. Because no e-commerce advertisement dataset exists, we first establish a large-scale dataset, Commercial-98K, which covers major e-commerce categories. We then propose a video-music retrieval model, YuYin, to learn the correlation between video and music. We introduce a weighted fusion module (WFM) that fuses emotion features and audio features from music to obtain a more fine-grained music representation. Considering the similarity of music within the same product category, YuYin is trained with multi-task learning: it explores the correlation between video and music by cross-matching video, music, and tag, alongside a category prediction task. Extensive experiments show that YuYin achieves a remarkable improvement in video-music retrieval on Commercial-98K.
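The abstract does not specify how the weighted fusion module or the retrieval step is implemented. As a rough illustration of the general idea only, the sketch below fuses an emotion feature vector and an audio feature vector using softmax-normalized weights, then ranks candidate tracks by cosine similarity to a video embedding. All names (`weighted_fusion`, `gate_scores`, `rank_music`), the fixed gate scores, and the toy vectors are hypothetical; in a trained model the weights and embeddings would be learned, and YuYin's actual design may differ.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(emotion_feat, audio_feat, gate_scores):
    """Fuse two equal-length feature vectors with softmax-normalized weights.

    gate_scores is a hypothetical pair of raw scores (emotion, audio);
    a real model would learn them rather than take them as input.
    """
    if len(emotion_feat) != len(audio_feat):
        raise ValueError("feature vectors must have the same length")
    w_emotion, w_audio = softmax(gate_scores)
    return [w_emotion * e + w_audio * a for e, a in zip(emotion_feat, audio_feat)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_music(video_emb, music_embs):
    """Return track names sorted by descending similarity to the video."""
    scored = {name: cosine(video_emb, emb) for name, emb in music_embs.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Toy data: equal gate scores give each modality a weight of 0.5.
tracks = {
    "upbeat": weighted_fusion([1.0, 0.0], [0.8, 0.2], [0.0, 0.0]),
    "mellow": weighted_fusion([0.0, 1.0], [0.2, 0.8], [0.0, 0.0]),
}
ranking = rank_music([1.0, 0.1], tracks)
print(ranking)
```

With these toy values the fused "upbeat" vector points in nearly the same direction as the video embedding, so it ranks first. The multi-task part of the abstract (cross-matching video, music, and tag plus category prediction) is not modeled here.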

Pages: 13
