Semi-automatic construction of metadata from a series of web documents

Cited by: 0
Authors
Hirokawa, S
Itoh, E
Miyahara, T
Affiliations
[1] Kyushu Univ, Comp & Computing Ctr, Higashi Ku, Fukuoka 8128581, Japan
[2] Hiroshima Univ, Fac Informat Sci, Asaminami Ku, Hiroshima 7313194, Japan
Source
AI 2003: ADVANCES IN ARTIFICIAL INTELLIGENCE | 2003 / Vol. 2903
Keywords
knowledge acquisition; knowledge engineering; knowledge discovery and data mining; machine learning; ontology;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Metadata plays an important role in discovering, collecting, extracting and aggregating Web data. This paper proposes a method of constructing metadata for a specific topic. The method uses Web pages that are located in one site and are linked from a listing page. Pages for recipes, real estate, used cars, hotels and syllabi are typical examples. We call such pages a series of Web documents. The pages in a series look alike when viewed in a browser, because they are usually written with the same tag pattern. The method uses this tag pattern as the common structure of the pages. The individual contents of a page appear as plain texts embedded between two consecutive tags. If we remove the tags, the page becomes a sequence of plain texts. The plain texts at the same relative position can be interpreted as attribute values if we presume that the pages represent records of the same kind. Most plain texts at a given position vary from page to page, but the same text may show up at the same relative position in almost all pages. Such constant texts can be considered attribute names. "Location", "Rating" and "Travel from Airport" are examples of constant texts for pages of hotel information. If the frequency of a text is higher than a threshold, we accept it as a component of the metadata. If we mark a constant text with "N" and a variable text with "V", the sequence of plain texts forms a string of N's and V's. A page in a series contains two kinds of NV sequence pattern. The first pattern is (NV)^n, which we call vertical, where an attribute value immediately follows its attribute name. The second pattern is N^n V^n, which we call horizontal, where the names occur in the first row and the same number of values follow in the next row. Using these patterns, we can understand the meaning of the values and construct records from a series of Web pages.
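
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the regex-based tag removal, the 0.8 threshold, the truncation to the shortest page, and the sample hotel pages are all assumptions made for illustration. The sketch strips tags, labels each relative position as N (constant) or V (variable) by frequency, and pairs names with values for both the vertical (NV) and the horizontal (N...N V...V) layout.

```python
import re
from collections import Counter

TAG = re.compile(r"<[^>]+>")  # assumption: a crude regex is enough to drop tags


def text_sequence(html):
    """Remove the tags and keep the non-empty plain texts in document order."""
    return [t.strip() for t in TAG.split(html) if t.strip()]


def label_positions(pages, threshold=0.8):
    """Label each relative position 'N' (constant text) or 'V' (variable text).

    A position is 'N' when its most common text occurs in more than
    `threshold` of the pages. Pages are truncated to the shortest
    sequence, which is a simplification made for this sketch.
    """
    seqs = [text_sequence(p) for p in pages]
    length = min(len(s) for s in seqs)
    labels, names = [], []
    for i in range(length):
        text, count = Counter(s[i] for s in seqs).most_common(1)[0]
        constant = count / len(seqs) > threshold
        labels.append("N" if constant else "V")
        names.append(text if constant else None)
    return "".join(labels), names, seqs


def extract_records(pages, threshold=0.8):
    """Build one {attribute name: value} record per page."""
    labels, names, seqs = label_positions(pages, threshold)
    pairs = []  # (name position, value position)
    for m in re.finditer(r"N+V+", labels):
        k = m.group().count("N")
        # k names followed by exactly k values: k == 1 is the vertical
        # case, k > 1 the horizontal case; other runs are skipped.
        if len(m.group()) == 2 * k:
            pairs += [(m.start() + j, m.start() + k + j) for j in range(k)]
    return [{names[n]: seq[v] for n, v in pairs} for seq in seqs]


def hotel(name, city, rating):
    """Hypothetical hotel page; every page shares the same tag pattern."""
    return (f"<html><h1>{name}</h1><table>"
            f"<tr><td>Location</td><td>{city}</td></tr>"
            f"<tr><td>Rating</td><td>{rating}</td></tr></table></html>")


pages = [
    hotel("Hotel A", "Fukuoka", "4 stars"),
    hotel("Hotel B", "Hiroshima", "3 stars"),
    hotel("Hotel C", "Osaka", "5 stars"),
]
records = extract_records(pages)
print(records[0])
```

In this sample the page title varies and is not preceded by a constant text, so it is labeled V and left out of the records; only "Location" and "Rating" are picked up as attribute names.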
Pages: 942 - 953
Number of pages: 12
Related papers
50 in total
  • [1] SEMI-AUTOMATIC METADATA ANNOTATION OF WEB OF THINGS WITH KNOWLEDGE BASE
    Yang, Yunong
    Wu, Zhenyu
    Zhu, Xinning
    PROCEEDINGS OF 2016 5TH IEEE INTERNATIONAL CONFERENCE ON NETWORK INFRASTRUCTURE AND DIGITAL CONTENT (IEEE IC-NIDC 2016), 2016, : 124 - 129
  • [2] From Web Resources to Agricultural Ontology: a Method for Semi-Automatic Construction
    Wei Yuan-yuan
    Wang Ru-jing
    Hu Yi-min
    Wang Xue
    JOURNAL OF INTEGRATIVE AGRICULTURE, 2012, 11 (05) : 775 - 783
  • [4] Semi-automatic ontology construction for improving comprehension of legal documents
    Cestnik, Bojan
    Kern, Alenka
    Modrijan, Helena
    ELECTRONIC GOVERNMENT, PROCEEDINGS, 2008, 5184 : 328 - +
  • [5] Semi-automatic metadata extraction from imagery and cartographic data
    Diaz, Laura
    Martin, Cristian
    Gould, Michael
    Granell, Carlos
    Manso, Miguel Angel
    IGARSS: 2007 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, VOLS 1-12: SENSING AND UNDERSTANDING OUR PLANET, 2007, : 3051 - +
  • [6] S-CREAM - Semi-automatic CREAtion of metadata
    Handschuh, S
    Staab, S
    Ciravegna, F
    KNOWLEDGE ENGINEERING AND KNOWLEDGE MANAGEMENT, PROCEEDINGS: ONTOLOGIES AND THE SEMANTIC WEB, 2002, 2473 : 358 - 372
  • [7] Semi-Automatic Generation of Metadata for Items in a Question Repository
    Ramesh, Rekha
    Mishra, Shitanshu
    Sasikumar, M.
    Iyer, Sridhar
    2014 IEEE SIXTH INTERNATIONAL CONFERENCE ON TECHNOLOGY FOR EDUCATION (T4E), 2014, : 222 - 228
  • [8] Semi-automatic indexing of documents with a multilingual thesaurus
    Schiel, U
    de Sousa, LMSF
    RIDE - MLIM 2003: THIRTEENTH INTERNATIONAL WORKSHOP ON RESEARCH ISSUES IN DATA ENGINEERING: MULTI-LINGUAL INFORMATION MANAGEMENT, PROCEEDINGS, 2003, : 31 - 38
  • [9] Semi-automatic System for Title Construction
    Duari, Swagata
    Bhatnagar, Vasudha
    INFORMATION, COMMUNICATION AND COMPUTING TECHNOLOGY (ICICCT 2019), 2019, 1025 : 216 - 227
  • [10] Semi-automatic construction of topic ontologies
    Fortuna, Blaz
    Mladenic, Dunja
    Grobelnik, Marko
    SEMANTICS, WEB AND MINING, 2006, 4289 : 121 - 131