Semi-automatic construction of metadata from a series of web documents

Authors
Hirokawa, S
Itoh, E
Miyahara, T
Affiliations
[1] Kyushu Univ, Comp & Computing Ctr, Higashi Ku, Fukuoka 8128581, Japan
[2] Hiroshima Univ, Fac Informat Sci, Asaminami Ku, Hiroshima 7313194, Japan
Source
AI 2003: ADVANCES IN ARTIFICIAL INTELLIGENCE | 2003 / Vol. 2903
Keywords
knowledge acquisition; knowledge engineering; knowledge discovery and data mining; machine learning; ontology;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Metadata plays an important role in discovering, collecting, extracting and aggregating Web data. This paper proposes a method of constructing metadata for a specific topic. The method uses Web pages that are located in one site and are linked from a listing page. Web pages of recipes, real estate, used cars, hotels and syllabi are typical examples. We call such pages a series of Web documents. The pages in a series have the same appearance when a user views them in a browser, because they are often written with the same tag pattern. The method uses this tag pattern as the common structure of the Web pages. The individual contents of the pages appear as plain texts embedded between two consecutive tags. If we remove the tags, a page becomes a sequence of plain texts. The plain texts in the same relative position can be interpreted as attribute values if we presume that the pages represent records of the same kind. Most plain texts in the same position vary from page to page. However, the same text may show up at the same relative position in almost all pages. These constant texts can be considered attribute names. "Location", "Rating" and "Travel from Airport" are examples of such constant texts for pages of hotel information. If the frequency of a text is higher than a threshold, we accept it as a component of the metadata. If we mark a constant text with "N" and a variable text with "V", the sequence of plain texts forms a sequence of N's and V's. A page in a series contains two kinds of NV sequence pattern. The first pattern is $(NV)^n$, which we call vertical, where an attribute value immediately follows its attribute name. The second pattern is $N^n V^n$, which we call horizontal, where the names occur in the first row and the same number of values follow in the next row. Thus we can understand the meaning of the values and construct records from a series of Web pages.
Pages: 942 - 953
Page count: 12
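
The N/V labelling described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: it assumes every page in the series yields its plain texts in the same order, strips tags with a simple regular expression instead of matching the tag pattern, and uses a hypothetical threshold of 0.8. The names `plain_texts`, `label_positions` and `to_record`, and the sample hotel pages, are invented for illustration.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")  # crude tag matcher; an assumption, not a full HTML parser


def plain_texts(html):
    """Split a page at its tags and keep the non-empty text pieces in order."""
    return [t.strip() for t in TAG_RE.split(html) if t.strip()]


def label_positions(pages, threshold=0.8):
    """Label each relative position 'N' (constant text) or 'V' (variable text).

    Assumes a non-empty list of pages; only positions shared by all pages are compared.
    """
    seqs = [plain_texts(p) for p in pages]
    length = min(len(s) for s in seqs)
    labels, names = [], []
    for i in range(length):
        text, count = Counter(s[i] for s in seqs).most_common(1)[0]
        if count / len(seqs) > threshold:  # frequent enough: treat as attribute name
            labels.append("N")
            names.append(text)
        else:
            labels.append("V")
            names.append(None)
    return "".join(labels), names


def to_record(page, labels, names):
    """Pair names with values for each run of n N's followed by n V's.

    n = 1 repeated gives the vertical pattern; n > 1 gives the horizontal pattern.
    Runs whose name and value counts differ are skipped.
    """
    texts = plain_texts(page)
    record = {}
    for m in re.finditer(r"(N+)(V+)", labels):
        if len(m.group(1)) == len(m.group(2)):
            for k in range(len(m.group(1))):
                record[names[m.start(1) + k]] = texts[m.start(2) + k]
    return record


# Hypothetical series of hotel pages laid out vertically (name, then value).
PAGES = [
    "<h1>Hotel A</h1><b>Location</b><i>Fukuoka</i><b>Rating</b><i>4 stars</i>",
    "<h1>Hotel B</h1><b>Location</b><i>Hiroshima</i><b>Rating</b><i>3 stars</i>",
    "<h1>Hotel C</h1><b>Location</b><i>Osaka</i><b>Rating</b><i>5 stars</i>",
]

labels, names = label_positions(PAGES)
print(labels)                               # VNVNV
print(to_record(PAGES[0], labels, names))   # {'Location': 'Fukuoka', 'Rating': '4 stars'}
```

In this sketch the leading "V" (the hotel title) has no preceding name and is left out of the record. A horizontal layout, such as a header row followed by a value row, produces a label string like "NNVV" and is handled by the same run-matching step.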