Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation

Cited by: 12
Authors
Martin-Morato, Irene [1 ]
Mesaros, Annamaria [1 ]
Affiliations
[1] Tampere Univ, Comp Sci, Tampere 33720, Finland
Funding
Academy of Finland
Keywords
Annotations; Task analysis; Crowdsourcing; Estimation; Reliability; Labeling; Speech processing; Strong labels; Sound event detection; Multi-annotator data; ALGORITHM; TRUTH;
DOI
10.1109/TASLP.2022.3233468
CLC classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Crowdsourcing is a popular tool for collecting large amounts of annotated data, but the specific format of the strong labels necessary for sound event detection is not easily obtainable through crowdsourcing. In this work, we propose a novel annotation workflow that leverages the efficiency of crowdsourcing weak labels and uses a high number of annotators to produce reliable and objective strong labels. The weak labels are collected in a highly redundant setup that allows reconstruction of the temporal information. To obtain reliable labels, the annotators' competence is estimated using MACE (Multi-Annotator Competence Estimation) and incorporated into the strong label estimation by weighting individual opinions. We show that the proposed method produces consistently reliable strong annotations not only for synthetic audio mixtures, but also for audio recordings of real everyday environments. While at most 80% coincidence with the complete and correct reference annotations was obtained for synthetic data, this result is explained by an extended study of how polyphony and SNR levels affect the annotators' identification rate of the sound events. On real data, even though the estimated annotator competence is significantly lower and the coincidence with the reference labels is under 69%, the proposed majority-opinion approach produces reliable aggregated strong labels compared with the more difficult task of directly crowdsourcing strong labels.
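The aggregation described in the abstract can be illustrated with a short sketch. The following Python snippet is not the authors' code: it assumes per-annotator competence scores (which the paper estimates with MACE) are already available as plain weights, assumes fixed-length segments as the unit of the redundant weak labels, and uses hypothetical names and parameters (votes, competence, seg_len, threshold) throughout. It performs competence-weighted majority voting per segment and merges consecutive active segments into strong (onset, offset) labels.

```python
# Illustrative sketch (not the authors' implementation): aggregating
# redundant segment-level weak labels into strong labels through
# competence-weighted majority voting. In the paper the competence
# scores come from MACE; here they are assumed given. All names,
# parameters, and the example data below are hypothetical.

from collections import defaultdict

def aggregate_strong_labels(votes, competence, seg_len=1.0, threshold=0.5):
    """votes: list of (annotator_id, class_label, segment_index) tuples,
    one per positive weak label; competence: annotator_id -> weight.
    Returns {class_label: [(onset, offset), ...]} in seconds."""
    total_weight = sum(competence.values())

    # Accumulate the competence-weighted vote mass per (class, segment).
    mass = defaultdict(float)
    for annotator, label, seg in votes:
        mass[(label, seg)] += competence[annotator]

    # A segment is active for a class if the weighted majority agrees.
    active = defaultdict(set)
    for (label, seg), w in mass.items():
        if w / total_weight >= threshold:
            active[label].add(seg)

    # Merge runs of consecutive active segments into (onset, offset)
    # pairs, reconstructing the temporal information of strong labels.
    events = {}
    for label, segs in active.items():
        spans, start, prev = [], None, None
        for seg in sorted(segs):
            if start is None:
                start = prev = seg
            elif seg == prev + 1:
                prev = seg
            else:
                spans.append((start * seg_len, (prev + 1) * seg_len))
                start = prev = seg
        spans.append((start * seg_len, (prev + 1) * seg_len))
        events[label] = spans
    return events

# Toy example: three annotators with unequal estimated competence.
competence = {"a1": 0.9, "a2": 0.6, "a3": 0.3}
votes = [("a1", "dog_bark", 2), ("a2", "dog_bark", 2),
         ("a1", "dog_bark", 3), ("a3", "dog_bark", 7)]
print(aggregate_strong_labels(votes, competence))
# -> {'dog_bark': [(2.0, 4.0)]}
```

In this toy run, the lone positive vote from the low-competence annotator a3 on segment 7 stays below the weighted-majority threshold and is discarded, while the agreeing votes on segments 2 and 3 merge into a single strong label spanning 2.0 to 4.0 seconds.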
Pages: 902-914 (13 pages)