DENCAST: distributed density-based clustering for multi-target regression

Citations: 36
Authors
Corizzo, Roberto [1, 2]
Pio, Gianvito [1, 2]
Ceci, Michelangelo [1, 2]
Malerba, Donato [1, 2]
Affiliations
[1] Univ Bari Aldo Moro, Dept Comp Sci, Via Orabona 4, Bari, Italy
[2] Natl Interuniv Consortium Informat CINI, Big Data Lab, Rome, Italy
Funding
European Union Horizon 2020
Keywords
Distributed clustering; Multi-target regression; Apache Spark; Spatial autocorrelation; Algorithm; Search; DBSCAN
DOI
10.1186/s40537-019-0207-2
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Recent developments in sensor networks and mobile computing have led to a huge increase in generated data that needs to be processed and analyzed efficiently. In this context, many distributed data mining algorithms have recently been proposed. Following this line of research, we propose the DENCAST system, a novel distributed algorithm implemented in Apache Spark, which performs density-based clustering and exploits the identified clusters to solve both single- and multi-target regression tasks (and thus solves complex tasks such as time series prediction). Contrary to existing distributed methods, DENCAST does not require a final merging step (usually performed on a single machine) and is able to handle large-scale, high-dimensional data by taking advantage of locality-sensitive hashing. Experiments show that DENCAST performs clustering more efficiently than a state-of-the-art distributed clustering algorithm, especially when the number of objects increases significantly. The quality of the extracted clusters is confirmed by the predictive capabilities of DENCAST on several datasets: it significantly outperforms (p-value < 0.05) state-of-the-art distributed regression methods in both single- and multi-target settings.
Pages: 27