Hierarchical decision tree induction in distributed genomic databases

被引:17
作者
Bar-Or, A [1 ]
Keren, D
Schuster, A
Wolff, R
机构
[1] HP Labs Cambridge, Cambridge Ctr 1, Cambridge, MA 02142 USA
[2] Univ Haifa, Dept Comp Sci, IL-31905 Haifa, Israel
[3] Technion Israel Inst Technol, Dept Comp Sci, IL-32000 Haifa, Israel
[4] Univ Maryland Baltimore Cty, Dept Comp Sci & Elect Engn, Baltimore, MD 21250 USA
关键词
data mining; distributed algorithms; decision trees; classification;
D O I
10.1109/TKDE.2005.129
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Classification based on decision trees is one of the important problems in data mining and has applications in many fields. In recent years, database systems have become highly distributed, and distributed system paradigms, such as federated and peer-to-peer databases, are being adopted. In this paper, we consider the problem of inducing decision trees in a large distributed network of genomic databases. Our work is motivated by the existence of distributed databases in healthcare and in bioinformatics, and by the emergence of systems which automatically analyze these databases, and by the expectancy that these databases will soon contain large amounts of highly dimensional genomic data. Current decision tree algorithms require high communication bandwidth when executed on such data, which are large-scale distributed systems. We present an algorithm that sharply reduces the communication overhead by sending just a fraction of the statistical data. A fraction which is nevertheless sufficient to derive the exact same decision tree learned by a sequential learner on all the data in the network. Extensive experiments using standard synthetic SNP data show that the algorithm utilizes the high dependency among attributes, typical to genomic data, to reduce communication overhead by up to 99 percent. Scalability tests show that the algorithm scales well with both the size of the data set, the dimensionality of the data, and the size of the distributed system.
引用
收藏
页码:1138 / 1151
页数:14
相关论文
共 28 条
  • [1] Alsabti K., 1998, Proceedings Fourth International Conference on Knowledge Discovery and Data Mining, P2
  • [2] Breiman L., 1998, CLASSIFICATION REGRE
  • [3] Caragea D., 2003, INT J HYBRID INTELLI
  • [4] CASTRO M, 2002, IEEE J SEL AREA COMM, V8, P20
  • [5] CATLETT J, 1991, THESIS U SYDNEY
  • [6] CHAN P, 1993, WORKING NOTES AAAI W, P227
  • [7] Domingos P., 2000, Proceedings. KDD-2000. Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, P71, DOI 10.1145/347090.347107
  • [8] Gehrke J., 1999, SIGMOD RECORD VOL 28
  • [9] GREENSPAN G, 2003, RECOMB, P131
  • [10] HALL LO, 1998, P DISTR DAT MIN WORK