Stratified Sampling for Extreme Multi-label Data

Cited by: 1
Authors
Merrillees, Maximillian [1]
Du, Lan [1]
Affiliation
[1] Monash Univ, Fac Informat Technol, Clayton, Vic 3800, Australia
Source
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2021, PT II | 2021, Vol. 12713
Keywords
Extreme multi-label learning; XML; Stratified sampling;
DOI
10.1007/978-3-030-75765-6_27
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Extreme multi-label classification (XML) is becoming increasingly relevant in the era of big data. Yet there is no method for effectively generating stratified partitions of XML datasets. Instead, researchers typically rely on provided train-test splits that 1) aren't always representative of the entire dataset, and 2) are missing many of the labels. This can lead to poor generalization ability and unreliable performance estimates, as has been established in the binary and multiclass settings. As such, this paper presents a new and simple algorithm that can efficiently generate stratified partitions of XML datasets with millions of unique labels. We also examine the label distributions of prevailing benchmark splits, and investigate the issues that arise from using unrepresentative subsets of data for model development. The results highlight the difficulty of stratifying XML data, and demonstrate the importance of using stratified partitions for training and evaluation.
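The two problems the abstract names (splits that are unrepresentative and splits that are missing labels) can be checked with a few lines of code. The following is a minimal sketch only, not the algorithm presented in the paper: the function names `label_counts` and `split_report` and the toy data are hypothetical, and each example is assumed to be represented as a set of label identifiers.

```python
from collections import Counter


def label_counts(rows):
    """Count how many examples carry each label.

    `rows` is assumed to be an iterable of label collections, one per example.
    """
    counts = Counter()
    for labels in rows:
        counts.update(set(labels))  # count a label once per example
    return counts


def split_report(train_rows, test_rows):
    """Report labels absent from either split and each label's test share."""
    train = label_counts(train_rows)
    test = label_counts(test_rows)
    all_labels = set(train) | set(test)

    # Labels that a model could never learn, or could never be evaluated on.
    missing_from_train = sorted(l for l in all_labels if l not in train)
    missing_from_test = sorted(l for l in all_labels if l not in test)

    # Fraction of each label's examples that landed in the test split.
    # In a well-stratified split these fractions sit near the overall test ratio.
    test_share = {l: test[l] / (train[l] + test[l]) for l in all_labels}
    return missing_from_train, missing_from_test, test_share


# Toy data (hypothetical): four training examples, two test examples.
train_rows = [{"a", "b"}, {"a"}, {"a", "c"}, {"b"}]
test_rows = [{"a"}, {"d"}]

missing_train, missing_test, share = split_report(train_rows, test_rows)
print("missing from train:", missing_train)
print("missing from test:", missing_test)
print("test share per label:", share)
```

In this toy split, label `d` never appears in training and labels `b` and `c` never appear in testing, which is the kind of mismatch a stratified partition is meant to reduce.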
Pages: 334-345
Page count: 12