Research on the Technology of Data Cleaning in Big Data

Cited by: 0
Authors
Feng, Fu-jun [1]
Yao, Jun-ping [1]
Li, Xiao-jun [1]
Affiliations
[1] Res Inst High Tech, Xian, Shaanxi, Peoples R China
Source
2018 2ND INTERNATIONAL CONFERENCE ON APPLIED MATHEMATICS, MODELING AND SIMULATION (AMMS 2018) | 2018 / Vol. 305
Keywords
Big data; Data quality; Data cleaning;
DOI
Not available
Chinese Library Classification number
TP301 [Theory, methods];
Subject classification code
081202;
Abstract
Dirty data inevitably exists in big data environments and seriously degrades data quality. Data cleaning is one of the most important methods for improving data quality, and research on data cleaning frameworks supports big data decision making. A general framework for data cleaning in big data is proposed. Its core cleaning module comprises three submodules: incomplete record cleaning, inconsistent data repair, and approximate duplicate record cleaning. The data cleaning processes are discussed in detail. Big data is characterized by volume, variety, value, velocity, and complexity, and the original information contains incomplete, incorrect, and duplicate dirty data, which makes big data uncontrollable and unusable [1-2]. The goal is to extract valuable information from massive data to provide a reference for decision makers. Because of errors in merging or migrating data sources, some redundant, incomplete, indeterminate, and inconsistent data is unavoidable. Such data is called dirty data, and it seriously reduces the efficiency of data use and the quality of decision making. Data cleaning is therefore particularly important for making data more accurate and consistent: it filters or corrects unnecessary data and outputs the required data. Some research on data cleaning for big data already exists [3-6]. Big data technology developed from traditional technology and inherits traditional concepts and analysis methods [7-8], such as data cleaning and data warehousing. Traditional data cleaning provides high-quality data and improves the efficiency and correctness of data analysis. In a big data environment, data cleaning is the foundation and first step of big data analysis, and it determines the quality of the results.
This paper discusses data cleaning technology for big data and proposes a general framework for data cleaning.
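As a rough illustration of the three submodules named in the abstract, the following Python sketch chains them in order. It is not the authors' implementation: the function names, the sample fields (`name`, `city`), the variant-spelling table, and the 0.9 similarity threshold are all assumptions made for this example. The pairwise comparison is quadratic, so it would not scale to real big data volumes without blocking or a distributed engine.

```python
from difflib import SequenceMatcher


def drop_incomplete(records, required):
    """Incomplete-record cleaning: keep records whose required fields are non-empty."""
    return [r for r in records
            if all(r.get(f) not in (None, "") for f in required)]


def repair_inconsistent(records, field, canonical):
    """Inconsistent-data repair: map known variant spellings to one canonical value."""
    return [{**r, field: canonical.get(r[field], r[field])} for r in records]


def drop_near_duplicates(records, field, threshold=0.9):
    """Approximate-duplicate cleaning: keep the first of any near-identical records."""
    kept = []
    for r in records:
        value = r[field].lower()
        is_duplicate = any(
            SequenceMatcher(None, value, k[field].lower()).ratio() >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(r)
    return kept


def clean(records):
    """Run the three (hypothetical) submodules in sequence."""
    records = drop_incomplete(records, required=("name", "city"))
    records = repair_inconsistent(
        records, "city", {"Xian": "Xi'an", "XIAN": "Xi'an"}
    )
    return drop_near_duplicates(records, "name")


# Made-up sample data for illustration only.
raw = [
    {"name": "Wang Xiaoming", "city": "Xian"},
    {"name": "Wang Xiao-ming", "city": "Xi'an"},  # near-duplicate of the first
    {"name": "Chen Li", "city": ""},              # incomplete: empty city
    {"name": "Zhao Lei", "city": "XIAN"},         # inconsistent spelling
]
cleaned = clean(raw)
```

Here the record with the empty `city` is dropped first, the city spellings are unified, and the hyphenated name variant is then removed as an approximate duplicate, leaving two records.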
Pages: 176-181
Page count: 6
Related papers
9 in total
[1] Donghua Y., 2016, J COMPUTERS, V39, P97
[2] Hui T., 2017, INFOR COM, V1, P238
[3] Kaihang M., 2015, COMPUTER ENG SOFTWAR, V36, P46
[4] Madden, Sam. From Databases to Big Data [J]. IEEE INTERNET COMPUTING, 2012, 16 (03): 4-6
[5] Shenbin H., 2015, INTELLIGENT COMPUTER, V5, P88
[6] Shumeng W., 2015, COMPUTER ENG SOFTWAR, V36, P108
[7] Xiaoting M., 2016, J MODEN INFO, V36, P107
[8] Yonghong C., 2017, MODERN COMPUTER, V1, P21
[9] Yuefeng Q., 2001, CHINESE J COMPUTERS, V24, P69