Stratified Sampling for Extreme Multi-label Data

Cited by: 1
Authors
Merrillees, Maximillian [1]
Du, Lan [1]
Affiliations
[1] Monash Univ, Fac Informat Technol, Clayton, Vic 3800, Australia
Source
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2021, PT II | 2021 / Vol. 12713
Keywords
Extreme multi-label learning; XML; Stratified sampling
DOI
10.1007/978-3-030-75765-6_27
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Extreme multi-label classification (XML) is becoming increasingly relevant in the era of big data. Yet, there is no method for effectively generating stratified partitions of XML datasets. Instead, researchers typically rely on provided train-test splits that (1) are not always representative of the entire dataset and (2) are missing many of the labels. This can lead to poor generalization ability and unreliable performance estimates, as has been established in the binary and multiclass settings. As such, this paper presents a new and simple algorithm that can efficiently generate stratified partitions of XML datasets with millions of unique labels. We also examine the label distributions of prevailing benchmark splits, and investigate the issues that arise from using unrepresentative subsets of data for model development. The results highlight the difficulty of stratifying XML data, and demonstrate the importance of using stratified partitions for training and evaluation.
Pages: 334-345
Page count: 12
Related papers
50 in total
  • [31] Multi-objective optimization for optimum allocation in multivariate stratified sampling with quadratic cost
    Khowaja, Saman
    Ghufran, Shazia
    Ahsan, M. J.
    JOURNAL OF STATISTICAL COMPUTATION AND SIMULATION, 2012, 82 (12) : 1789 - 1798
  • [32] Development of novel separate ratio estimator in stratified random sampling with applications on real data
    Triveni, G. R. V.
    Danish, Faizan
    INTERNATIONAL JOURNAL OF APPLIED NONLINEAR SCIENCE, 2024, 4 (02)
  • [33] Euclidean distance stratified random sampling based clustering model for big data mining
    Pandey, Kamlesh Kumar
    Shukla, Diwakar
    COMPUTATIONAL AND MATHEMATICAL METHODS, 2021, 3 (06)
  • [34] Data splitting for artificial neural networks using SOM-based stratified sampling
    May, R. J.
    Maier, H. R.
    Dandy, G. C.
    NEURAL NETWORKS, 2010, 23 (02) : 283 - 294
  • [35] A Study on Sample Allocation for Stratified Sampling
    Lee, Ingue
    Park, Mingue
    KOREAN JOURNAL OF APPLIED STATISTICS, 2015, 28 (06) : 1047 - 1061
  • [36] Near Optimum Allocations in Stratified Sampling
    Rao, T. J.
    JOURNAL OF STATISTICAL THEORY AND PRACTICE, 2010, 4 (01) : 57 - 69
  • [37] Calibration approach estimators in stratified sampling
    Kim, Jong-Min
    Sungur, Engin A.
    Heo, Tae-Young
    STATISTICS & PROBABILITY LETTERS, 2007, 77 (01) : 99 - 103
  • [38] Estimation of software reliability by stratified sampling
    Podgurski, A
    Masri, W
    McCleese, Y
    Wolff, FG
    Yang, C
    ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY, 1999, 8 (03) : 263 - 283
  • [39] ADAPTIVE STRATIFIED SAMPLING FOR NONSMOOTH PROBLEMS
    Pettersson, Per
    Krumscheid, Sebastian
    INTERNATIONAL JOURNAL FOR UNCERTAINTY QUANTIFICATION, 2022, 12 (06) : 71 - 99
  • [40] The Concept of Stratified Sampling of Execution Traces
    Pirzadeh, Heidar
    Shanian, Sara
    Hamou-Lhadj, Abdelwahab
    Mehrabian, Ali
    2011 IEEE 19TH INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION (ICPC), 2011, : 225 - +