Stratified Sampling for Extreme Multi-label Data

Cited by: 1
Authors
Merrillees, Maximillian [1]
Du, Lan [1]
Affiliations
[1] Monash Univ, Fac Informat Technol, Clayton, Vic 3800, Australia
Source
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2021, PT II | 2021 / Vol. 12713
Keywords
Extreme multi-label learning; XML; Stratified sampling;
DOI
10.1007/978-3-030-75765-6_27
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Extreme multi-label classification (XML) is becoming increasingly relevant in the era of big data. Yet, there is no method for effectively generating stratified partitions of XML datasets. Instead, researchers typically rely on provided test-train splits that 1) aren't always representative of the entire dataset, and 2) are missing many of the labels. This can lead to poor generalization ability and unreliable performance estimates, as has been established in the binary and multiclass settings. As such, this paper presents a new and simple algorithm that can efficiently generate stratified partitions of XML datasets with millions of unique labels. We also examine the label distributions of prevailing benchmark splits, and investigate the issues that arise from using unrepresentative subsets of data for model development. The results highlight the difficulty of stratifying XML data, and demonstrate the importance of using stratified partitions for training and evaluation.
Pages: 334-345
Page count: 12
Related papers
50 in total
  • [21] Correlation Networks for Extreme Multi-label Text Classification
    Xun, Guangxu
    Jha, Kishlay
    Sun, Jianhui
    Zhang, Aidong
    KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020, : 1074 - 1082
  • [22] Sparse Local Embeddings for Extreme Multi-label Classification
    Bhatia, Kush
    Jain, Himanshu
    Kar, Purushottam
    Varma, Manik
    Jain, Prateek
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 28 (NIPS 2015), 2015, 28
  • [23] Matching Neural Network for Extreme Multi-Label Learning
    Zhao, Zhiyun
    Li, Fengzhi
    Zuo, Yuan
    Wu, Junjie
    4TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE APPLICATIONS AND TECHNOLOGIES (AIAAT 2020), 2020, 1642
  • [24] LAIM discretization for multi-label data
    Cano, Alberto
    Maria Luna, Jose
    Gibaja, Eva L.
    Ventura, Sebastian
    INFORMATION SCIENCES, 2016, 330 : 370 - 384
  • [25] LightXML: Transformer with Dynamic Negative Sampling for High-Performance Extreme Multi-label Text Classification
    Jiang, Ting
    Wang, Deqing
    Sun, Leilei
    Yang, Huayi
    Zhao, Zhengyang
    Zhuang, Fuzhen
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 7987 - 7994
  • [26] Extreme Multi-Label Classification with Label Masking for Product Attribute Value Extraction
    Chen, Wei-Te
    Xia, Yandi
    Shinzato, Keiji
    PROCEEDINGS OF THE 5TH WORKSHOP ON E-COMMERCE AND NLP (ECNLP 5), 2022, : 134 - 140
  • [27] Assessing the Multi-labelness of Multi-label Data
    Park, Laurence A. F.
    Guo, Yi
    Read, Jesse
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2019, PT II, 2020, 11907 : 164 - 179
  • [28] Cold Start Thread Recommendation as Extreme Multi-label Classification
    Halder, Kishaloy
    Poddar, Lahari
    Kan, Min-Yen
    COMPANION PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE 2018 (WWW 2018), 2018, : 1911 - 1918
  • [29] Multi-label dimensionality reduction and classification with extreme learning machines
    Lin Feng
    Jing Wang
    Shenglan Liu
    Yao Xiao
    JOURNAL OF SYSTEMS ENGINEERING AND ELECTRONICS, 2014, 25 (03) : 502 - 513
  • [30] Combining instance and feature neighbours for extreme multi-label classification
    Feremans, Len
    Cule, Boris
    Vens, Celine
    Goethals, Bart
    INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS, 2020, 10 (03) : 215 - 231