MusicYOLO: A Vision-Based Framework for Automatic Singing Transcription

Times Cited: 0
Authors
Wang, Xianke [1 ,2 ]
Tian, Bowen [1 ,2 ]
Yang, Weiming [1 ,2 ]
Xu, Wei [1 ,2 ]
Cheng, Wenqing [1 ,2 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Hubei Key Lab Smart Internet Technol, Wuhan 430074, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan 430074, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Feature extraction; Labeling; Event detection; Spectrogram; Estimation; Deep learning; Object detection; AST; Note object detection; Spectrogram peak search; Sound event classification; Neural network; Image feature; Pitch; Robust; Speech; Recognition; Estimator; Features
DOI
10.1109/TASLP.2022.3221005
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Automatic singing transcription (AST), the task of inferring note onsets, offsets, and pitches from singing audio, is of great significance in music information retrieval. Most AST models use a convolutional neural network to extract spectral features and predict onset and offset moments separately: frame-level probabilities are inferred first, and note-level transcription results are then obtained through post-processing. In this paper, a new AST framework called MusicYOLO is proposed that obtains note-level transcription results directly. Onset/offset detection is based on the object detection model YOLOX, and pitch labeling is completed by a spectrogram peak search. Because MusicYOLO detects whole note objects rather than isolated onset/offset moments, it greatly improves transcription performance over previous methods. On the sight-singing vocal dataset (SSVD) established in this paper, MusicYOLO achieves an 84.60% transcription F1-score, a state-of-the-art result.
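To make the pitch-labeling step concrete, the sketch below illustrates a spectrogram peak search of the kind the abstract describes, in Python with librosa. It is an illustration only, not the authors' implementation: the function name pitch_from_peak_search, the STFT settings, and the 80-1000 Hz search band are all assumptions. In MusicYOLO the note boundaries come from YOLOX bounding boxes on the spectrogram; here they are passed in directly as onset/offset times, and the note's pitch is taken as the median over per-frame spectral peaks.

    # Minimal sketch (not the authors' code): label the pitch of one detected
    # note by searching for the dominant spectral peak inside its time span.
    import numpy as np
    import librosa

    def pitch_from_peak_search(audio_path, onset_s, offset_s,
                               sr=22050, n_fft=2048, hop=256,
                               fmin=80.0, fmax=1000.0):
        """Return a MIDI pitch for the note spanning [onset_s, offset_s]."""
        # Load audio and compute a magnitude spectrogram.
        y, sr = librosa.load(audio_path, sr=sr)
        S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
        freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)

        # Restrict the search to the detected note's time span and a
        # plausible singing-voice frequency band (an assumption here).
        t0, t1 = librosa.time_to_frames([onset_s, offset_s],
                                        sr=sr, hop_length=hop)
        band = (freqs >= fmin) & (freqs <= fmax)
        patch = S[band][:, t0:max(t1, t0 + 1)]

        # Per-frame peak search: for clean monophonic singing, the strongest
        # bin in the band approximates the fundamental frequency.
        peak_freqs = freqs[band][patch.argmax(axis=0)]

        # One robust pitch label per note object: median over frames, in MIDI.
        return float(np.median(librosa.hz_to_midi(peak_freqs)))

Taking the median over frames, rather than a single frame's peak, makes the label robust to transient octave errors at the note boundaries.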
Pages: 229-241
Page Count: 13