Unstructured data extraction of Chinese expert web page

Cited by: 3
Authors
Hong, Xudong [1]
Shen, Tao [2]
Shen, Longhua [3]
Yu, Zhengtao [1]
Guo, Jianyi [1]
Affiliations
[1] School of Information Engineering and Automation, Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming, Yunnan
[2] School of Material Science and Engineering, Kunming University of Science and Technology, Kunming, Yunnan
[3] China Research and Development Academy of Machinery Equipment, Beijing
Keywords
Expert web page; Lattice theory; Roadrunner introduction; Unstructured data; Unsupervised
DOI
10.1504/IJWMC.2014.059709
Abstract
Traditional methods for extracting unstructured data from expert web pages require substantial human intervention. To address this, the paper proposes a method that automatically detects the data template from the similarities and differences between HTML tags and strings, and uses lattice theory to locate the data grid region that stores the unstructured expert data, from which the data is then extracted. First, an expert web crawler, aided by a classifier for Chinese expert entity homepages, collects a large number of expert pages. Second, the expert pages are divided into two types, list pages and document pages, and unstructured data is extracted from each type separately. Finally, extraction experiments are conducted on the different page types using an improved version of the open-source Roadrunner code. The experimental results show that, in an unsupervised setting, the method extracts unstructured web data from Chinese expert pages effectively. © 2014 Inderscience Enterprises Ltd.
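
The template-detection idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation and not the Roadrunner algorithm itself: it assumes only two pages generated from one template, swaps in Python's difflib.SequenceMatcher for the alignment step, and does not model the lattice-based location of the data region. The names Tokenizer, tokenize and extract_fields, and both sample pages, are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Flatten an HTML page into a sequence of tag tokens and text tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored so that only the tag structure is compared.
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("text", text))


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def extract_fields(page_a, page_b):
    """Align two pages; tokens that match are treated as template,
    text tokens that differ are treated as candidate data fields."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    fields = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # shared tags and strings belong to the template
        texts_a = [value for kind, value in a[i1:i2] if kind == "text"]
        texts_b = [value for kind, value in b[j1:j2] if kind == "text"]
        if texts_a or texts_b:
            fields.append((texts_a, texts_b))
    return fields


# Two made-up expert pages that share one template.
page_a = ("<html><body><h1>Zhang Wei</h1>"
          "<p>Research area: <b>data mining</b></p></body></html>")
page_b = ("<html><body><h1>Li Na</h1>"
          "<p>Research area: <b>machine learning</b></p></body></html>")

fields = extract_fields(page_a, page_b)
print(fields)
```

In this sketch the string "Research area:" appears on both pages and is therefore classed as template, while the name and the research topic differ and are reported as data fields. A real system would need more than two pages and a way to handle optional and repeated elements, which this pairwise alignment does not attempt.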
Pages: 132-136
Page count: 4