Late multimodal fusion for image and audio music transcription

Cited by: 8

Authors:

Alfaro-Contreras, Maria [1]

Valero-Mas, Jose J. [1]

Inesta, Jose M. [1]

Calvo-Zaragoza, Jorge [1]

Affiliations:

[1] Univ Alicante, Univ Inst Comp Res, Carretera San Vicente Raspeig S-N, Alicante 03690, Spain

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2023 / Vol. 216

Keywords:

Optical Music Recognition; Automatic Music Transcription; Multimodality; Deep learning; Connectionist Temporal Classification; Sequence labeling; Word graphs;

DOI:

10.1016/j.eswa.2022.119491

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Music transcription, which deals with the conversion of music sources into a structured digital format, is a key problem for Music Information Retrieval (MIR). When addressing this challenge in computational terms, the MIR community follows two lines of research depending on the input: music documents, which is the case of Optical Music Recognition (OMR), or audio recordings, which is the case of Automatic Music Transcription (AMT). The different nature of these input data has led the two fields to develop modality-specific frameworks. However, their recent definition in terms of sequence labeling tasks leads to a common output representation, which enables research on a combined paradigm. In this respect, multimodal image and audio music transcription comprises the challenge of effectively combining the information conveyed by image and audio modalities. In this work, we explore this question at a late-fusion level: we study four combination approaches in order to merge, for the first time, the hypotheses of end-to-end OMR and AMT systems in a lattice-based search space. The results obtained for a series of performance scenarios (in which the corresponding single-modality models yield different error rates) showed interesting benefits of these approaches. In addition, two of the four strategies considered significantly improve the corresponding unimodal standard recognition frameworks.


Pages: 10

Related records: 50 in total

[31] Multimodal Structure Segmentation and Analysis of Music Using Audio and Textual Information
Cheng, Heng-Tze
Yang, Yi-Hsuan
Lin, Yu-Ching
Chen, Homer H.
ISCAS: 2009 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOLS 1-5, 2009, : 1677 - 1680
[32] Multimodal Music and Lyrics Fusion Classifier for Artist Identification
Aryafar, Kamelia
Shokoufandeh, Ali
2014 13TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2014, : 506 - 509
[33] Lyrics-based audio retrieval and multimodal navigation in music collections
Mueller, Meinard
Kurth, Frank
Damm, David
Fremerey, Christian
Clausen, Michael
RESEARCH AND ADVANCED TECHNOLOGY FOR DIGITAL LIBRARIES, PROCEEDINGS, 2007, 4675 : 112 - +
[34] A Multimodal Fusion Online Music Education System for Universities
Liu, Peng
Cao, Yixiao
Wang, Lei
COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2022, 2022
[35] MAiVAR: Multimodal Audio-Image and Video Action Recognizer
Shaikh, Muhammad Bilal
Chai, Douglas
Islam, Syed Mohammed Shamsul
Akhtar, Naveed
2022 IEEE INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2022,
[36] Multimodal deep fusion for image question answering
Zhang, Weifeng
Yu, Jing
Wang, Yuxia
Wang, Wei
KNOWLEDGE-BASED SYSTEMS, 2021, 212
[37] Ornament Image Retrieval Using Multimodal Fusion
Islam S.M.
Joardar S.
Dogra D.P.
Sekh A.A.
SN Computer Science, 2021, 2 (4)
[38] A novel approach for multimodal medical image fusion
Liu, Zhaodong
Yin, Hongpeng
Chai, Yi
Yang, Simon X.
EXPERT SYSTEMS WITH APPLICATIONS, 2014, 41 (16) : 7425 - 7435
[39] Multimodal Image Fusion Method Based on Multiscale Image Matting
Maqsood, Sarmad
Damasevicius, Robertas
Silka, Jakub
Wozniak, Marcin
ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING (ICAISC 2021), PT II, 2021, 12855 : 57 - 68
[40] Laplacian Redecomposition for Multimodal Medical Image Fusion
Li, Xiaoxiao
Guo, Xiaopeng
Han, Pengfei
Wang, Xiang
Li, Huaguang
Luo, Tao
IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2020, 69 (09) : 6880 - 6890
