Semi-automatic construction of metadata from a series of web documents

Authors
Hirokawa, S
Itoh, E
Miyahara, T
Affiliations
[1] Kyushu Univ, Comp & Computing Ctr, Higashi Ku, Fukuoka 8128581, Japan
[2] Hiroshima Univ, Fac Informat Sci, Asaminami Ku, Hiroshima 7313194, Japan
Source
AI 2003: ADVANCES IN ARTIFICIAL INTELLIGENCE | 2003 / Vol. 2903
Keywords
knowledge acquisition; knowledge engineering; knowledge discovery and data mining; machine learning; ontology;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Metadata plays an important role in discovering, collecting, extracting and aggregating Web data. This paper proposes a method of constructing metadata for a specific topic. The method uses Web pages that are located in one site and are linked from a listing page. Web pages for recipes, real estate, used cars, hotels and syllabi are typical examples. We call such pages a series of Web documents. The pages in a series have the same appearance when a user views them in a browser, because they are often written with the same tag pattern. The method uses this tag pattern as the common structure of the Web pages. The individual contents of a page appear as plain texts embedded between two consecutive tags. If we remove the tags, a page becomes a sequence of plain texts. The plain texts in the same relative position can be interpreted as attribute values if we presume that the pages represent records of the same kind. Most plain texts in the same position vary from page to page. However, the same text may show up at the same relative position in almost all pages. Such constant texts can be considered attribute names. "Location", "Rating" and "Travel from Airport" are examples of constant texts for pages of hotel information. If the frequency of a text is higher than a threshold, we accept it as a component of the metadata. If we mark a constant text with "N" and a variable text with "V", the sequence of plain texts forms a sequence of N's and V's. A page in a series contains two kinds of NV sequence pattern. The first pattern is (NV)^n, which we call vertical, where an attribute value immediately follows its attribute name. The second pattern is N^n V^n, which we call horizontal, where the names occur in the first row and the same number of values follow in the next row. Thus we can understand the meaning of the values and can construct records from a series of Web pages.
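The abstract gives no implementation details, so the following is only a minimal Python sketch of the idea it describes, not the authors' code. The function names (text_sequence, constant_texts, label_page, classify, to_record), the regular-expression tag stripping, the 0.8 frequency threshold and the sample hotel pages are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a simple regex is enough to split toy pages on tags.
TAG = re.compile(r"<[^>]+>")

def text_sequence(html):
    """Remove tags and keep the non-empty plain texts in document order."""
    return [t.strip() for t in TAG.split(html) if t.strip()]

def constant_texts(pages, threshold=0.8):
    """Return (position, text) pairs found in at least `threshold` of the pages."""
    counts = Counter()
    for texts in pages:
        counts.update(enumerate(texts))
    return {key for key, c in counts.items() if c / len(pages) >= threshold}

def label_page(texts, constants):
    """Mark each text 'N' (constant, attribute name) or 'V' (variable, value)."""
    return "".join("N" if (i, t) in constants else "V"
                   for i, t in enumerate(texts))

def classify(labels):
    """Vertical: name/value pairs alternate. Horizontal: n names, then n values."""
    if re.fullmatch(r"(NV)+", labels):
        return "vertical"
    m = re.fullmatch(r"(N+)(V+)", labels)
    if m and len(m.group(1)) == len(m.group(2)):
        return "horizontal"
    return "other"

def to_record(texts, pattern):
    """Pair attribute names with values according to the detected pattern."""
    if pattern == "vertical":
        return dict(zip(texts[0::2], texts[1::2]))
    if pattern == "horizontal":
        n = len(texts) // 2
        return dict(zip(texts[:n], texts[n:]))
    raise ValueError("unsupported pattern: " + pattern)

# Hypothetical series of hotel pages sharing one tag pattern.
pages_html = [
    "<tr><td>Location</td><td>Fukuoka</td></tr><tr><td>Rating</td><td>4 stars</td></tr>",
    "<tr><td>Location</td><td>Hiroshima</td></tr><tr><td>Rating</td><td>3 stars</td></tr>",
    "<tr><td>Location</td><td>Osaka</td></tr><tr><td>Rating</td><td>5 stars</td></tr>",
]

pages = [text_sequence(h) for h in pages_html]
constants = constant_texts(pages)
for texts in pages:
    pattern = classify(label_page(texts, constants))
    print(pattern, to_record(texts, pattern))
```

In this toy series, "Location" and "Rating" occur at the same position on every page and are treated as attribute names, while the city and star values vary and are treated as attribute values. Real pages would need a proper HTML parser and a tolerance for small positional shifts, which this sketch does not attempt.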
Pages: 942-953
Page count: 12