Multi-Feature and DAG-Based Multi-Tree Matching Algorithm for Automatic Web Data Mining

Cited by: 4
Authors
Shi, Shengsheng [1]
Liu, Chengfei [2]
Yuan, Chunfeng [1]
Huang, Yihua [1]
Affiliations
[1] Nanjing Univ, Dept Comp Sci & Technol, Natl Key Lab Novel Software Technol, Nanjing 210008, Jiangsu, Peoples R China
[2] Swinburne Univ Technol, Informat & Commun Technol, Melbourne, Vic, Australia
Source
2014 IEEE/WIC/ACM INTERNATIONAL JOINT CONFERENCES ON WEB INTELLIGENCE (WI) AND INTELLIGENT AGENT TECHNOLOGIES (IAT), VOL 1 | 2014
Keywords
Web data mining; data item alignment; multi-tree matching; multiple features; directed acyclic graph; DATA EXTRACTION; INFORMATION EXTRACTION;
DOI
10.1109/WI-IAT.2014.24
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Web data extraction has received considerable attention and study in recent decades. To improve efficiency, many automatic Web data record mining approaches have been proposed. Among these approaches, each complete approach involves data record identification as well as data item alignment. In this paper, we propose a new multi-feature and DAG (Directed Acyclic Graph) based multi-tree matching algorithm for automatic data item alignment. Our algorithm improves alignment accuracy in two aspects. First, it combines multiple features to cope with the limitations of existing algorithms; second, it employs a DAG-based method to deduce the global alignment of data items with high accuracy. Experimental results show that our algorithm outperforms state-of-the-art data item alignment algorithms.
Pages: 118-125
Page count: 8