CoverHunter: Cover Song Identification with Refined Attention and Alignments

Cited by: 1
Authors
Liu, Feng [1]
Tuo, Deyi [1]
Xu, Yinan [1]
Han, Xintong [1]
Affiliations
[1] Huya Inc, Intelligent Media Technol Dept, Guangzhou, Peoples R China
Source
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME | 2023
Keywords
Cover Song Identification; Contrastive Learning; Chunk Alignment; Conformer; Coarse-to-Fine Training;
DOI
10.1109/ICME55011.2023.00189
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cover Song Identification (CSI) focuses on finding different versions of the same music among reference anchors, given a query track. In this paper, we propose a novel system named CoverHunter that overcomes the shortcomings of existing detection schemes by exploring richer features with refined attention and alignments. CoverHunter contains three key modules: 1) A convolution-augmented transformer (i.e., Conformer) structure that captures both local and global feature interactions, in contrast to previous methods that rely mainly on convolutional neural networks; 2) An attention-based time pooling module that further exploits attention in the time dimension; 3) A novel coarse-to-fine training scheme that first trains a network to roughly align the song chunks and then refines the network by training on the aligned chunks. We also summarize some important training tricks used in our system to achieve better results. Experiments on several standard CSI datasets show that our method significantly improves over state-of-the-art methods with an embedding size of 128 (2.3% on SHS100K-TEST and 17.7% on DaTacos).
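As a rough illustration of the second module, attention-based pooling over time can be sketched as a softmax-weighted average of frame features. This is a generic sketch, not the paper's implementation: the function name `attention_time_pooling`, the scoring vector `w`, and the bias `b` are hypothetical, and the abstract does not specify how CoverHunter computes its attention scores.

```python
import math

def attention_time_pooling(frames, w, b=0.0):
    """Pool a sequence of frame features into a single vector.

    frames: list of T feature vectors, each a list of D floats.
    w: hypothetical learned scoring vector of length D (an assumption).
    b: hypothetical scalar bias (an assumption).
    Returns (pooled, weights): a length-D vector and the T attention weights.
    """
    # One scalar relevance score per time step.
    scores = [sum(x * wi for x, wi in zip(frame, w)) + b for frame in frames]
    # Softmax over the time dimension, shifted by the max for stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the frames, dimension by dimension.
    dim = len(frames[0])
    pooled = [sum(a * frame[d] for a, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights

# Toy input: three time steps, two feature dimensions.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_time_pooling(frames, w=[0.0, 0.0])
```

With an all-zero scoring vector every frame receives the same weight, so the result reduces to plain average pooling; a trained scoring vector would instead emphasize the time steps it scores highly.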
Pages: 1080-1085
Number of pages: 6
Related papers
23 in total
  • [1] CALCULATION OF A CONSTANT-Q SPECTRAL TRANSFORM
    BROWN, JC
    [J]. JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 1991, 89 (01) : 425 - 434
  • [2] BYTECOVER2: TOWARDS DIMENSIONALITY REDUCTION OF LATENT EMBEDDING FOR EFFICIENT COVER SONG IDENTIFICATION
    Du, Xingjian
    Chen, Ke
    Wang, Zijie
    Zhu, Bilei
    Ma, Zejun
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 616 - 620
  • [3] BYTECOVER: COVER SONG IDENTIFICATION VIA MULTI-LOSS TRAINING
    Du, Xingjian
    Yu, Zhesong
    Zhu, Bilei
    Chen, Xiaoou
    Ma, Zejun
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 551 - 555
  • [4] Ellis DPW, 2007, INT CONF ACOUST SPEE, P1429
  • [5] Conformer: Convolution-augmented Transformer for Speech Recognition
    Gulati, Anmol
    Qin, James
    Chiu, Chung-Cheng
    Parmar, Niki
    Zhang, Yu
    Yu, Jiahui
    Han, Wei
    Wang, Shibo
    Zhang, Zhengdong
    Wu, Yonghui
    Pang, Ruoming
    [J]. INTERSPEECH 2020, 2020, : 5036 - 5040
  • [6] Guo RQ, 2020, PR MACH LEARN RES, V119
  • [7] Hu S., 2022, Interspeech
  • [8] Focal Loss for Dense Object Detection
    Lin, Tsung-Yi
    Goyal, Priya
    Girshick, Ross
    He, Kaiming
    Dollar, Piotr
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 2999 - 3007
  • [9] Bag of Tricks and A Strong Baseline for Deep Person Re-identification
    Luo, Hao
    Gu, Youzhi
    Liao, Xingyu
    Lai, Shenqi
    Jiang, Wei
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2019), 2019, : 1487 - 1495
  • [10] Marolt M., 2006, ISMIR