Research on Exception Data Cleaning Method Based on Clustering in Hadoop Platform

Cited by: 1

Authors:

Guo, Aizhang ^{[1]}

Zhang, Ningning ^{[1]}

Sun, Tao ^{[1]}

Affiliations:

[1] Qilu Univ Technol, Sch Informat, Jinan, Shandong, Peoples R China

Source:

2017 10TH INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND DESIGN (ISCID), VOL 2 | 2017

Keywords:

Hadoop; data cleansing; data quality; MapReduce; Canopy-Kmeans;

DOI:

10.1109/ISCID.2017.191

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

With the explosive growth of data, data quality problems have become increasingly prominent and are now a major factor affecting data analysis and application. Obtaining high-quality data requires cleaning it. This paper addresses the cleaning of anomalous data. When the K-means clustering method is used for cleaning, the clustering result depends too heavily on the choice of the initial points. In addition, cleaning large data sets takes a long time because of the growing number of operations. To solve these problems, this paper presents a clustering-based method for cleaning anomalous data on the Hadoop platform. First, the method improves the K-means algorithm by using the Canopy algorithm, the "minimum maximum principle", and the weighted Euclidean distance. It then uses the MapReduce programming model to parallelize the method. The experimental results show that, in a big data environment, the method achieves good results for anomalous data cleaning in both accuracy and speedup.
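The abstract names the ingredients of the method but this record does not give the algorithm itself, so the following is only a minimal single-machine sketch of the general idea, not the authors' implementation. It assumes that the "minimum maximum principle" means farthest-first seeding (each new seed is the point whose distance to its nearest existing seed is largest), and that anomalies are points lying far from their cluster center or belonging to very small clusters. The function names and the `min_gap`, `radius`, and `min_size` parameters are hypothetical, and the Canopy thresholds and the MapReduce parallelization are omitted.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two equal-length points."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def nearest(point, centers, weights):
    """Return (index, distance) of the center closest to point."""
    dists = [weighted_distance(point, c, weights) for c in centers]
    idx = min(range(len(centers)), key=dists.__getitem__)
    return idx, dists[idx]

def max_min_seeds(points, weights, min_gap):
    """Farthest-first seeding (assumed reading of the max-min principle).

    Stops once no point is at least min_gap away from every seed, so the
    number of clusters does not have to be fixed in advance.
    """
    seeds = [points[0]]
    while True:
        candidate = max(points, key=lambda p: nearest(p, seeds, weights)[1])
        if nearest(candidate, seeds, weights)[1] < min_gap:
            return seeds
        seeds.append(candidate)

def kmeans(points, centers, weights, iterations=20):
    """Plain K-means using the weighted distance for assignment."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers, weights)[0]].append(p)
        # New center is the mean of its group; an empty group keeps its old center.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

def flag_anomalies(points, centers, weights, radius, min_size):
    """Flag points far from their center or sitting in a tiny cluster (assumed rule)."""
    assigned = [nearest(p, centers, weights) for p in points]
    sizes = [0] * len(centers)
    for idx, _ in assigned:
        sizes[idx] += 1
    return [
        p for p, (idx, dist) in zip(points, assigned)
        if dist > radius or sizes[idx] < min_size
    ]

# Toy data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
weights = (1.0, 1.0)  # equal feature weights, chosen only for illustration
centers = kmeans(points, max_min_seeds(points, weights, min_gap=5.0), weights)
anomalies = flag_anomalies(points, centers, weights, radius=3.0, min_size=2)
```

On this toy data the isolated point (50, 50) is picked as a seed, ends up alone in its own cluster, and is the only point flagged. That is why the sketch checks cluster size as well as distance to the center. In a MapReduce setting, the assignment loop would typically run in the map phase and the center recomputation in the reduce phase.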


Pages: 316-320

Page count: 5

References: 11 in total

[1]

Aggarwal C C, 2001, P ACM SIGMOD INT C M, V30, P37, DOI 10.1145/375663.375668

[2]

Fan Ming, 2015, DATA MINING CONCEPT, P293

[3]

Gogoi, Prasanta; Bhattacharyya, D. K.; Borah, B.; Kalita, Jugal K. A Survey of Outlier Detection Methods in Network Anomaly Identification [J]. COMPUTER JOURNAL, 2011, 54 (04): 570-588

[4]

McCallum A., 2000, Proceedings. KDD-2000. Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, P169, DOI 10.1145/347090.347123

[5]

Quan C, 2009, J COMPUTER APPL, V29, P2562

[6]

Rousseeuw P J, 1990, J AM STAT ASSOC, V85, P633

[7]

Sun Ji-Gui, 2008, Journal of Software, V19, P48, DOI 10.3724/SP.J.1001.2008.00048

[8]

Wang Yonggui, 2014, Computer Engineering, V40, P47

[9]

Wei Dai, 2016, Journal of Computing Science and Engineering, V10, P1, DOI 10.5626/JCSE.2016.10.1.1

[10]

Yuan Fuyong, 2011, COMPUTER APPL, V31, P1675
