Automatic Web Information Extraction and Alignment using CTVS Technique

Cited by: 0
Authors
Pandarge, Sangmesh S. [1]
Chakkarwar, V. A. [1]
Affiliations
[1] Govt Coll Engn, Dept Comp Sci Engn, Aurangabad, Maharashtra, India
Source
2017 INTERNATIONAL CONFERENCE OF ELECTRONICS, COMMUNICATION AND AEROSPACE TECHNOLOGY (ICECA), VOL 2 | 2017
Keywords
Web page; Query result records (QRRs); Tag tree format; Data region; Record segmentation; Web data extraction and Data alignment;
DOI
Not available
Chinese Library Classification number
V [Aeronautics, Astronautics];
Discipline classification code
08 ; 0825 ;
Abstract
When a user submits a query through a web browser, a web database generates the results and returns them as a query result page. These results arrive as structured, semi-structured, or unstructured content in HTML pages. The main objective of this paper is to extract web data automatically and align it in tabular form. The extracted data is useful mainly for knowledge discovery and for applications such as comparison shopping. A web page can contain a large amount of data in regularly structured objects, called data records. This paper presents CTVS, a novel and improved technique for web information extraction and alignment that exploits both tag similarity and value similarity in a web page. The proposed approach extracts information from query result pages automatically by identifying query result records (QRRs), constructing a tag tree, and segmenting the QRRs in a query result page. The extracted data can be aligned using a pairwise or a holistic alignment technique. The segmented QRRs are arranged so that data values of the same attribute fall into the same column of a database table. The proposed technique handles both contiguous and non-contiguous data regions, because a result page may contain irrelevant data mixed in with the expected result data. According to the experimental results, the technique achieves good accuracy in less time and is highly effective at extracting web data and aligning structured data records.
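The pipeline the abstract outlines (tag tree construction, record identification in a possibly non-contiguous data region, and alignment into a table) can be illustrated with a small sketch. This is not the authors' CTVS implementation: it assumes well-formed HTML in which every tag is closed, it replaces tag and value similarity with exact equality of tag structure, and it replaces pairwise or holistic alignment with simple position-based alignment. All names (`Node`, `TagTreeBuilder`, `find_records`, `extract_table`, `SAMPLE_PAGE`) are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class Node:
    """One element of the tag tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []

    def signature(self):
        # Tag structure of the subtree; text values are ignored.
        return (self.tag, tuple(c.signature() for c in self.children))

    def values(self):
        # Own text first, then the values of each child subtree.
        out = list(self.text)
        for child in self.children:
            out.extend(child.values())
        return out

    def size(self):
        return 1 + sum(c.size() for c in self.children)


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree; assumes every opened tag is closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def find_records(root):
    """Return the sibling subtrees that look most like repeated records.

    Siblings with a different structure (ads, banners) are skipped, so the
    records do not have to be contiguous. Simplification: exact structural
    equality stands in for a similarity measure.
    """
    best_score, best = 0, []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        counts = Counter(c.signature() for c in node.children)
        if not counts:
            continue
        sig, n = counts.most_common(1)[0]
        if n < 2:
            continue
        group = [c for c in node.children if c.signature() == sig]
        score = n * group[0].size()  # favor many, larger repeated subtrees
        if score > best_score:
            best_score, best = score, group
    return best


def extract_table(html):
    """Segment records and align their values by position into rows."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    rows = [r.values() for r in find_records(builder.root)]
    width = max((len(r) for r in rows), default=0)
    return [r + [None] * (width - len(r)) for r in rows]


# Hypothetical result page: three records with an unrelated block between them.
SAMPLE_PAGE = """
<html><body>
<h1>Results</h1>
<div>
  <div><b>Book A</b><i>$10</i></div>
  <p>Sponsored link</p>
  <div><b>Book B</b><i>$12</i></div>
  <div><b>Book C</b><i>$9</i></div>
</div>
</body></html>
"""

table = extract_table(SAMPLE_PAGE)
```

On the sample page, `table` holds one row per record with the title in the first column and the price in the second, and the "Sponsored link" block is left out because its tag structure differs from that of the records. A closer approximation of the technique would score candidate records by similarity rather than exact equality and would align columns by comparing data values as well as tags.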
Pages: 94 - 99
Page count: 6