Late multimodal fusion for image and audio music transcription

Cited by: 8
Authors
Alfaro-Contreras, Maria [1]
Valero-Mas, Jose J. [1]
Inesta, Jose M. [1]
Calvo-Zaragoza, Jorge [1]
Affiliation
[1] Univ Alicante, Univ Inst Comp Res, Carretera San Vicente Raspeig S-N, Alicante 03690, Spain
Keywords
Optical Music Recognition; Automatic Music Transcription; Multimodality; Deep learning; Connectionist Temporal Classification; Sequence labeling; Word graphs
DOI
10.1016/j.eswa.2022.119491
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Music transcription, which deals with the conversion of music sources into a structured digital format, is a key problem for Music Information Retrieval (MIR). When addressing this challenge in computational terms, the MIR community follows two lines of research depending on the input: music documents, which is the case of Optical Music Recognition (OMR), or audio recordings, which is the case of Automatic Music Transcription (AMT). The different nature of these input data has conditioned the two fields to develop modality-specific frameworks. However, their recent definition in terms of sequence labeling tasks leads to a common output representation, which enables research on a combined paradigm. In this respect, multimodal image and audio music transcription comprises the challenge of effectively combining the information conveyed by the image and audio modalities. In this work, we explore this question at a late-fusion level: we study four combination approaches in order to merge, for the first time, the hypotheses of end-to-end OMR and AMT systems in a lattice-based search space. The results obtained for a series of performance scenarios, in which the corresponding single-modality models yield different error rates, showed interesting benefits of these approaches. In addition, two of the four strategies considered significantly improve the corresponding unimodal standard recognition frameworks.
Pages: 10