Combining instance and feature neighbours for extreme multi-label classification

Cited by: 0
Authors
Len Feremans
Boris Cule
Celine Vens
Bart Goethals
Affiliations
[1] Universiteit Antwerpen,Department of Computer Science
[2] Universiteit Antwerpen,Department of Accountancy and Finance
[3] Katholieke Universiteit Leuven,Faculty of Medicine
[4] ITEC
[5] imec research group at Katholieke Universiteit Leuven
[6] Monash University
Source
International Journal of Data Science and Analytics | 2020 / Volume 10
Keywords
Extreme multi-label classification; Item-based collaborative filtering; k-nearest neighbours; Top-k queries; Information retrieval
DOI
Not available
Abstract
Extreme multi-label classification problems occur in applications such as tag or advertisement prediction. We propose a new algorithm that predicts labels using a linear ensemble of labels from instance-based and feature-based nearest neighbours. In the feature-based nearest neighbours method, we precompute a matrix containing the similarities between each feature and each label. For the instance-based nearest neighbours, we design an algorithm that uses an inverted index to compute cosine similarity on sparse datasets efficiently. We extend this baseline with a new top-k query algorithm that combines term-at-a-time and document-at-a-time traversal with tighter pruning based on a partition of the dataset. On ten real-world datasets, we find that our method outperforms state-of-the-art methods such as multi-label k-nearest neighbours, instance-based logistic regression, binary relevance with support vector machines and FastXml on different evaluation metrics. We also find that our algorithm is orders of magnitude faster than these baseline algorithms on sparse datasets and requires less than 20 ms per instance to predict labels for extreme datasets, without the need for expensive hardware.
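The instance-based step described in the abstract, using an inverted index so that cosine similarity is computed only against instances sharing at least one feature with the query, can be sketched as follows. This is a minimal term-at-a-time illustration; the function names and data layout are assumptions, and the paper's combined term-at-a-time/document-at-a-time traversal with partition-based pruning is not shown.

```python
from collections import defaultdict
import math

def build_inverted_index(rows):
    """Map each feature id to the (row id, weight) pairs that contain it.

    Each row is a sparse vector given as a dict {feature_id: weight}.
    """
    index = defaultdict(list)
    for rid, feats in enumerate(rows):
        for f, w in feats.items():
            index[f].append((rid, w))
    return index

def topk_cosine(query, rows, index, k):
    """Return the k rows most cosine-similar to the sparse query vector.

    Term-at-a-time: accumulate dot products only for rows that appear in
    the posting lists of the query's features, then normalise and rank.
    """
    qnorm = math.sqrt(sum(w * w for w in query.values()))
    norms = [math.sqrt(sum(w * w for w in r.values())) for r in rows]
    scores = defaultdict(float)
    for f, qw in query.items():
        for rid, w in index.get(f, []):
            scores[rid] += qw * w
    ranked = sorted(
        ((dot / (qnorm * norms[rid]), rid) for rid, dot in scores.items()),
        reverse=True,
    )
    return ranked[:k]
```

The labels of the returned neighbours would then be aggregated (in the paper, combined linearly with the feature-based scores) to produce the final prediction; rows with no feature in common with the query are never touched, which is what makes this efficient on sparse data.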
Pages: 215–231 (16 pages)