Automatically discovering of inconsistency among cross-source data based on Web big data

Cited by: 0
Authors
Yu, Wei [1]
Li, Shijun [1]
Yang, Sha [1, 2]
Hu, Yahui [1, 3]
Liu, Jing [1]
Ding, Yonggang [1]
Wang, Qian [1]
Affiliations
[1] Computer School, Wuhan University, Wuhan
[2] College of Computer Science and Technology, Hankou University, Wuhan
[3] Air Force Early Warning Academy, Wuhan
Source
Jisuanji Yanjiu yu Fazhan/Computer Research and Development | 2015 / Vol. 52 / No. 02
Keywords
Cross-source analysis; Data consistency; Data quality assessment; Web big data; Web data management; Web data mining
DOI
10.7544/issn1000-1239.2015.20140224
Abstract
Data inconsistency is pervasive on the Web and seriously degrades the quality of Web information. Existing research on data inconsistency has focused mainly on traditional database applications; consistency research on diverse, complicated, rapidly changing and abundant Web big data is lacking. To address multi-source heterogeneous Web data and the 5V features of big data, we present a unified data extraction algorithm and a Web object data model based on three aspects: website structure, characteristic data and knowledge rules. We study and classify the features of data inconsistency, and establish an inconsistency classifier model, an inconsistency constraint mechanism and an inconsistency inference algebra computing system. Then, based on this theoretical system for cross-source Web data consistency, we investigate methods for automatically discovering inconsistent Web data via constraint rule detection and statistical deviation analysis. Combining the characteristics of the two methods, we propose an algorithm for automatically discovering inconsistent Web data that uses hierarchical probabilistic judgment and is built on the Hadoop MapReduce architecture. The framework is applied to big data from multiple B2C e-commerce sites on the Hadoop platform and compared with a traditional architecture and other methods. The experimental results demonstrate the accuracy and efficiency of the method. © 2015, Science Press. All rights reserved.
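The abstract names two detection routes, constraint rule detection and statistical deviation analysis, combined in a MapReduce-style pipeline. The following is only a minimal single-machine sketch of that general idea, not the authors' algorithm: the record layout, the positive-price rule, the median-based deviation test and the 20% tolerance are all assumptions made for illustration, and the map and reduce steps are plain Python functions rather than Hadoop jobs.

```python
from collections import defaultdict
from statistics import median

# Hypothetical cross-source records: (source, product_id, price).
RECORDS = [
    ("shopA", "p1", 99.0),
    ("shopB", "p1", 101.0),
    ("shopC", "p1", 100.0),
    ("shopD", "p1", 250.0),
    ("shopA", "p2", 20.0),
    ("shopB", "p2", -5.0),
    ("shopC", "p2", 21.0),
]


def violates_constraint(price):
    """Illustrative constraint rule (an assumption): a price must be positive."""
    return price <= 0


def map_phase(records):
    """Map step: key each observation by the object it describes."""
    groups = defaultdict(list)
    for source, product_id, price in records:
        groups[product_id].append((source, price))
    return groups


def reduce_phase(product_id, observations, rel_tol=0.2):
    """Reduce step: apply the constraint rule first, then a deviation test."""
    flagged = []
    valid = []
    for source, price in observations:
        if violates_constraint(price):
            flagged.append((product_id, source, "constraint"))
        else:
            valid.append((source, price))
    if valid:
        # Compare each remaining value with the cross-source median.
        center = median(price for _, price in valid)
        for source, price in valid:
            if abs(price - center) > rel_tol * center:
                flagged.append((product_id, source, "deviation"))
    return flagged


def find_inconsistencies(records, rel_tol=0.2):
    """Run the map step, then the reduce step for each object key in order."""
    result = []
    for product_id, observations in sorted(map_phase(records).items()):
        result.extend(reduce_phase(product_id, observations, rel_tol))
    return result


if __name__ == "__main__":
    for item in find_inconsistencies(RECORDS):
        print(item)
```

Checking the rule before the deviation test keeps values that are invalid on their face from skewing the median used for the statistical comparison.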
Pages: 295-308
Number of pages: 13