Query-learning-based iterative feature-subset selection for learning from high-dimensional data sets

Cited by: 0
Authors
Hiroshi Mamitsuka
Affiliation
[1] Kyoto University,Institute for Chemical Research
Source
Knowledge and Information Systems | 2006 / Vol. 9
Keywords
Query learning; Feature-subset selection; High-dimensional data set; Uncertainty sampling; Drug design
DOI
Not available
Abstract
We propose a new data-mining method that is effective for learning from extremely high-dimensional data sets. The method selects a subset of features from a high-dimensional data set by iterative refinement. Each selection of a feature subset has two steps. The first step selects, from the data set, the subset of instances for which the predictions of previously obtained hypotheses are most unreliable. The second step selects the subset of features whose values in the selected instances vary the most from their values across all instances in the data set. We empirically evaluate the method by comparing its performance with that of four other methods, including one of the latest feature-subset selection methods. The evaluation was performed on a real-world data set with approximately 140,000 features. The results show that the proposed method outperforms the other methods in prediction accuracy, precision at a certain recall value, and computation time needed to reach a certain prediction accuracy. We also examined the effect of noise in the data and found that the advantage of the proposed method becomes more pronounced at higher noise levels. Extended abstracts of parts of this work have appeared in Mamitsuka [14] and Mamitsuka [15].
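The two selection steps described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: it assumes each hypothesis is a callable returning +1 or -1, measures how unreliable a prediction is by the size of the committee's vote margin, and scores a feature by the absolute difference between its mean over the selected instances and its mean over all instances. The function names (`vote_margin`, `select_uncertain_instances`, `select_features`) and the toy data are hypothetical.

```python
from statistics import mean


def vote_margin(hypotheses, x):
    """Absolute sum of +1/-1 votes; a small margin means an unreliable prediction."""
    return abs(sum(h(x) for h in hypotheses))


def select_uncertain_instances(hypotheses, X, n_instances):
    """Step 1 (assumed form): indices of the instances with the smallest vote margin."""
    ranked = sorted(range(len(X)), key=lambda i: vote_margin(hypotheses, X[i]))
    return ranked[:n_instances]


def select_features(X, instance_ids, n_features):
    """Step 2 (assumed form): features whose mean over the selected instances
    departs most from their mean over all instances."""
    def shift(j):
        overall = mean([row[j] for row in X])
        local = mean([X[i][j] for i in instance_ids])
        return abs(local - overall)

    ranked = sorted(range(len(X[0])), key=shift, reverse=True)
    return sorted(ranked[:n_features])


def stump(j):
    """Toy hypothesis: vote +1 if feature j is set, otherwise -1."""
    return lambda x: 1 if x[j] == 1 else -1


# Toy binary data: 5 instances, 4 features (illustrative only).
X = [
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
]
committee = [stump(0), stump(1), stump(2)]

uncertain = select_uncertain_instances(committee, X, n_instances=2)
features = select_features(X, uncertain, n_features=1)
print(uncertain, features)
```

In an iterative setting, a new hypothesis would presumably be trained on the selected instances restricted to the selected features and then added to the committee before the next round; that training step is left out here because the abstract does not specify the component learner.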
Pages: 91-108
Page count: 17
Related papers
22 in total
  • [1] Breiman L (1999) Pasting small votes for classification in large databases and on-line. Mach Learn 36:85-103
  • [2] Forman G (2003) An extensive empirical study of feature selection metrics for text classification. J Mach Learn Res 3:1289-1305
  • [3] Freund Y, Schapire R (1997) A decision theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55:119-139
  • [4] Freund Y, Seung H, Shamir E, Tishby N (1997) Selective sampling using the query by committee algorithm. Mach Learn 28:133-168
  • [5] Hagmann M (2000) Computers aid vaccine design. Science 290:80-82
  • [6] Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20:832-844
  • [7] Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97:273-324
  • [8] Kononenko I, Hong SJ (1997) Attribute selection for modelling. Future Gener Comput Syst 13:181-195
  • [9] Miller MA (2002) Chemical database techniques in drug discovery. Nat Rev Drug Discov 1:220-227
  • [10] Provost F, Kolluri V (1999) A survey of methods for scaling up inductive algorithms. Data Min Knowl Discov 3:131-169