MusicYOLO: A Vision-Based Framework for Automatic Singing Transcription

Times Cited: 0
Authors
Wang, Xianke [1 ,2 ]
Tian, Bowen [1 ,2 ]
Yang, Weiming [1 ,2 ]
Xu, Wei [1 ,2 ]
Cheng, Wenqing [1 ,2 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Hubei Key Lab Smart Internet Technol, Wuhan 430074, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan 430074, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Feature extraction; Labeling; Event detection; Spectrogram; Estimation; Deep learning; Object detection; AST; Note object detection; Spectrogram peak search; Sound event classification; Neural network; Image feature; Pitch; Robust; Speech; Recognition; Estimator; Features
DOI
10.1109/TASLP.2022.3221005
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Automatic singing transcription (AST), the task of inferring note onsets, offsets, and pitches from singing audio, is of great significance in music information retrieval. Most AST models use a convolutional neural network to extract spectral features and predict onset and offset moments separately: frame-level probabilities are inferred first, and note-level transcription results are then obtained through post-processing. In this paper, a new AST framework called MusicYOLO is proposed that obtains note-level transcription results directly. Onset/offset detection is based on the object detection model YOLOX, and pitch labeling is completed by a spectrogram peak search. Because MusicYOLO detects whole note objects rather than isolated onset/offset moments, it greatly improves transcription performance over previous methods. On the sight-singing vocal dataset (SSVD) established in this paper, MusicYOLO achieves an 84.60% transcription F1-score, a state-of-the-art result.
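To make the pitch-labeling step concrete, the sketch below illustrates a spectrogram peak search of the kind the abstract describes, in Python with librosa. It is an illustration only, not the authors' implementation: the function name pitch_from_peak_search, the STFT settings, and the 80-1000 Hz search band are all assumptions. In MusicYOLO the note boundaries come from YOLOX bounding boxes on the spectrogram; here they are passed in directly as onset/offset times, and the note's pitch is taken as the median over per-frame spectral peaks.

    # Minimal sketch (not the authors' code): label the pitch of one detected
    # note by searching for the dominant spectral peak inside its time span.
    import numpy as np
    import librosa

    def pitch_from_peak_search(audio_path, onset_s, offset_s,
                               sr=22050, n_fft=2048, hop=256,
                               fmin=80.0, fmax=1000.0):
        """Return a MIDI pitch for the note spanning [onset_s, offset_s]."""
        # Load audio and compute a magnitude spectrogram.
        y, sr = librosa.load(audio_path, sr=sr)
        S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
        freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)

        # Restrict the search to the detected note's time span and a
        # plausible singing-voice frequency band (an assumption here).
        t0, t1 = librosa.time_to_frames([onset_s, offset_s],
                                        sr=sr, hop_length=hop)
        band = (freqs >= fmin) & (freqs <= fmax)
        patch = S[band][:, t0:max(t1, t0 + 1)]

        # Per-frame peak search: for clean monophonic singing, the strongest
        # bin in the band approximates the fundamental frequency.
        peak_freqs = freqs[band][patch.argmax(axis=0)]

        # One robust pitch label per note object: median over frames, in MIDI.
        return float(np.median(librosa.hz_to_midi(peak_freqs)))

Taking the median over frames, rather than a single frame's peak, makes the label robust to transient octave errors at the note boundaries.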
Pages: 229-241
Page Count: 13