Stratified Sampling for Extreme Multi-label Data

Times cited: 1
Authors
Merrillees, Maximillian [1]
Du, Lan [1]
Affiliations
[1] Monash Univ, Fac Informat Technol, Clayton, Vic 3800, Australia
Source
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2021, PT II | 2021 / Vol. 12713
Keywords
Extreme multi-label learning; XML; Stratified sampling;
DOI
10.1007/978-3-030-75765-6_27
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Extreme multi-label classification (XML) is becoming increasingly relevant in the era of big data. Yet there is no method for effectively generating stratified partitions of XML datasets. Instead, researchers typically rely on provided train-test splits that 1) aren't always representative of the entire dataset, and 2) are missing many of the labels. This can lead to poor generalization ability and unreliable performance estimates, as has been established in the binary and multiclass settings. As such, this paper presents a new and simple algorithm that can efficiently generate stratified partitions of XML datasets with millions of unique labels. We also examine the label distributions of prevailing benchmark splits, and investigate the issues that arise from using unrepresentative subsets of data for model development. The results highlight the difficulty of stratifying XML data, and demonstrate the importance of using stratified partitions for training and evaluation.
Pages: 334-345
Page count: 12
Related papers
50 items in total
  • [41] Stratified Inverse Sampling for Rare Populations
    Sangngam, Prayad
    Suwattee, Prachoom
    THAILAND STATISTICIAN, 2012, 10 (01): : 69 - 86
  • [42] The Concept of Stratified Sampling of Execution Traces
    Pirzadeh, Heidar
    Shanian, Sara
    Hamou-Lhadj, Abdelwahab
    Mehrabian, Ali
    2011 IEEE 19TH INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION (ICPC), 2011, : 225 - +
  • [43] The Advantage and Disadvantage of Implicitly Stratified Sampling
    Lynn, Peter
    METHODS DATA ANALYSES, 2019, 13 (02): : 253 - 266
  • [44] Local averaged stratified sampling method
    Valentini, Fernando
    Silva, Olavo M.
    Torii, Andre Jacomel
    Cardoso, Eduardo Lenz
    JOURNAL OF THE BRAZILIAN SOCIETY OF MECHANICAL SCIENCES AND ENGINEERING, 2022, 44 (07)
  • [45] Stratified Sampling for Even Workload Partitioning
    Paudel, Jeeva
    Amaral, Jose Nelson
    PROCEEDINGS OF THE 23RD INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT'14), 2014, : 503 - 504
  • [46] Local averaged stratified sampling method
    Fernando Valentini
    Olavo M. Silva
    André Jacomel Torii
    Eduardo Lenz Cardoso
    Journal of the Brazilian Society of Mechanical Sciences and Engineering, 2022, 44
  • [47] Stratified Sampling Voxel Classification for Segmentation of Intraretinal and Subretinal Fluid in Longitudinal Clinical OCT Data
    Xu, Xiayu
    Lee, Kyungmoo
    Zhang, Li
    Sonka, Milan
    Abramoff, Michael D.
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2015, 34 (07) : 1616 - 1623
  • [48] Adaptive Stratified Random Additive Sampling Algorithm
    Wang SuNan
    Luo XingGuoi
    Wang Shuai
    Wang Bin
    2011 3RD WORLD CONGRESS IN APPLIED COMPUTING, COMPUTER SCIENCE, AND COMPUTER ENGINEERING (ACC 2011), VOL 4, 2011, 4 : 210 - +
  • [49] Adaptive stratified sampling for structural reliability analysis
    Song, Chenxiao
    Kawai, Reiichiro
    STRUCTURAL SAFETY, 2023, 101
  • [50] Federated learning based on stratified sampling and regularization
    Lu, Chenyang
    Ma, Wubin
    Wang, Rui
    Deng, Su
    Wu, Yahui
    COMPLEX & INTELLIGENT SYSTEMS, 2023, 9 (02) : 2081 - 2099