Late multimodal fusion for image and audio music transcription

Cited by: 8
Authors
Alfaro-Contreras, Maria [1]
Valero-Mas, Jose J. [1]
Inesta, Jose M. [1]
Calvo-Zaragoza, Jorge [1]
Affiliations
[1] Univ Alicante, Univ Inst Comp Res, Carretera San Vicente Raspeig S-N, Alicante 03690, Spain
Keywords
Optical Music Recognition; Automatic Music Transcription; Multimodality; Deep learning; Connectionist Temporal Classification; Sequence labeling; Word graphs
DOI
10.1016/j.eswa.2022.119491
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Music transcription, which deals with the conversion of music sources into a structured digital format, is a key problem for Music Information Retrieval (MIR). When addressing this challenge in computational terms, the MIR community follows two lines of research depending on the input: music documents, which is the case of Optical Music Recognition (OMR), or audio recordings, which is the case of Automatic Music Transcription (AMT). The different nature of these input data has led the two fields to develop modality-specific frameworks. However, their recent formulation as sequence labeling tasks leads to a common output representation, which enables research on a combined paradigm. In this respect, multimodal image and audio music transcription comprises the challenge of effectively combining the information conveyed by the image and audio modalities. In this work, we explore this question at a late-fusion level: we study four combination approaches in order to merge, for the first time, the hypotheses of end-to-end OMR and AMT systems in a lattice-based search space. The results obtained for a series of performance scenarios, in which the corresponding single-modality models yield different error rates, show interesting benefits of these approaches. In addition, two of the four strategies considered significantly improve the corresponding unimodal standard recognition frameworks.
Pages: 10
Related papers
  • [1] Multimodal image and audio music transcription
    de la Fuente, Carlos
    Valero-Mas, Jose J.
    Castellanos, Francisco J.
    Calvo-Zaragoza, Jorge
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2022, 11 (01) : 77 - 84
  • [2] Multimodal image and audio music transcription
    Carlos de la Fuente
    Jose J. Valero-Mas
    Francisco J. Castellanos
    Jorge Calvo-Zaragoza
    International Journal of Multimedia Information Retrieval, 2022, 11 : 77 - 84
  • [3] MIDALF-multimodal image and audio late fusion for malware detection
    Ismail, Setia Juli Irzal
    Rahardjo, Budi
    Juhana, Tutun
    Musashi, Yasuo
    EURASIP JOURNAL ON INFORMATION SECURITY, 2025, 2025 (01):
  • [4] A Multimodal Approach for Percussion Music Transcription from Audio and Video
    Marenco, Bernardo
    Fuentes, Magdalena
    Lanzaro, Florencia
    Rocamora, Martin
    Gomez, Alvaro
    PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, CIARP 2015, 2015, 9423 : 92 - 99
  • [5] Multimodal Fusion Remote Sensing Image-Audio Retrieval
    Yang, Rui
    Wang, Shuang
    Sun, Yingzhi
    Zhang, Huan
    Liao, Yu
    Gu, Yu
    Hou, Biao
    Jiao, Licheng
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2022, 15 : 6220 - 6235
  • [6] Multimodal fusion for audio-image and video action recognition
    Muhammad Bilal Shaikh
    Douglas Chai
    Syed Mohammed Shamsul Islam
    Naveed Akhtar
    Neural Computing and Applications, 2024, 36 : 5499 - 5513
  • [7] Multimodal fusion for audio-image and video action recognition
    Shaikh, Muhammad Bilal
    Chai, Douglas
    Islam, Syed Mohammed Shamsul
    Akhtar, Naveed
    NEURAL COMPUTING & APPLICATIONS, 2024, 36 (10) : 5499 - 5513
  • [8] Automatic transcription of piano music using audio-vision fusion
    Wan, Yulong
    Wu, Zhigang
    Zhou, Ruohua
    Yan, Yonghong
    MEASUREMENT TECHNOLOGY AND ENGINEERING RESEARCHES IN INDUSTRY, PTS 1-3, 2013, 333-335 : 742 - +
  • [9] INTERACTIVE MULTIMODAL MUSIC TRANSCRIPTION
    Inesta, Jose M.
    Perez-Sancho, Carlos
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013 : 211 - 215
  • [10] A MULTIMODAL APPROACH TO MUSIC TRANSCRIPTION
    Paleari, Marco
    Huet, Benoit
    Schutz, Antony
    Slock, Dirk
    2008 15TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOLS 1-5, 2008 : 93 - 96