Semi-automatic construction of metadata from a series of web documents

Cited by: 0

Authors:

Hirokawa, S

Itoh, E

Miyahara, T

Affiliations:

[1] Kyushu Univ, Comp & Computing Ctr, Higashi Ku, Fukuoka 8128581, Japan

[2] Hiroshima Univ, Fac Informat Sci, Asaminami Ku, Hiroshima 7313194, Japan

Source:

AI 2003: ADVANCES IN ARTIFICIAL INTELLIGENCE | 2003 / Volume 2903

Keywords:

knowledge acquisition; knowledge engineering; knowledge discovery and data mining; machine learning; ontology;

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Metadata plays an important role in discovering, collecting, extracting and aggregating Web data. This paper proposes a method of constructing metadata for a specific topic. The method uses Web pages that are located in one site and are linked from a listing page. Pages describing recipes, real estate, used cars, hotels and syllabi are typical examples. We call such pages a series of Web documents. The pages in a series look alike when viewed in a browser, because they are usually written with the same tag pattern. The method uses this tag pattern as the common structure of the pages. The individual contents of a page appear as plain texts embedded between two consecutive tags, so removing the tags turns a page into a sequence of plain texts. If we presume that the pages represent records of the same kind, the plain texts at the same relative position can be interpreted as attribute values. Most of the texts at a given position vary from page to page. However, the same text may show up at the same relative position in almost all pages. Such constant texts can be considered attribute names. "Location", "Rating" and "Travel from Airport" are examples of constant texts for pages of hotel information. If the frequency of a text is higher than a threshold, we accept it as a component of the metadata. If we mark a constant text with "N" and a variable text with "V", the sequence of plain texts forms a sequence of N's and V's. A page in a series contains two kinds of NV sequence patterns. The first pattern is (NV)^n, which we call vertical, where an attribute value immediately follows its attribute name. The second pattern is N^n V^n, which we call horizontal, where the names occur in the first row and the same number of values follow in the next row. Thus we can understand the meaning of the values and construct records from a series of Web pages.
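
The procedure in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the regex-based tag stripping, the 0.8 threshold, and the position-by-position alignment of pages are all assumptions made for illustration, and the sample hotel pages are invented.

```python
import re
from collections import Counter

# Assumption: a simple regex is enough to strip tags from the sample pages.
# Real-world HTML would need a proper parser.
TAG = re.compile(r"<[^>]+>")


def text_sequence(html):
    """Remove tags and return the non-empty plain texts in document order."""
    return [t.strip() for t in TAG.split(html) if t.strip()]


def label_positions(pages, threshold=0.8):
    """Mark each relative position 'N' (constant text) or 'V' (variable text).

    A position is 'N' when its most common text occurs in more than
    `threshold` of the pages (hypothetical default of 0.8).
    """
    seqs = [text_sequence(p) for p in pages]
    length = min(len(s) for s in seqs)  # simplification: compare the shared prefix only
    labels, names = [], []
    for i in range(length):
        text, count = Counter(s[i] for s in seqs).most_common(1)[0]
        if count / len(seqs) > threshold:
            labels.append("N")
            names.append(text)
        else:
            labels.append("V")
            names.append(None)
    return "".join(labels), names


def classify(labels):
    """Return 'vertical' for (NV)^n, 'horizontal' for N^n V^n, else 'unknown'."""
    if re.fullmatch(r"(NV)+", labels):
        return "vertical"
    m = re.fullmatch(r"(N+)(V+)", labels)
    if m and len(m.group(1)) == len(m.group(2)):
        return "horizontal"
    return "unknown"


def extract_records(pages, threshold=0.8):
    """Pair attribute names with values and build one record per page."""
    labels, names = label_positions(pages, threshold)
    kind = classify(labels)
    if kind == "unknown":
        return kind, []
    attrs = [n for n in names if n is not None]
    records = []
    for page in pages:
        seq = text_sequence(page)
        values = [seq[i] for i, lab in enumerate(labels) if lab == "V"]
        # In both patterns the k-th name corresponds to the k-th value.
        records.append(dict(zip(attrs, values)))
    return kind, records


# Invented sample pages in the vertical layout (name followed by its value).
vertical_pages = [
    "<tr><td>Location</td><td>Fukuoka</td></tr><tr><td>Rating</td><td>4</td></tr>",
    "<tr><td>Location</td><td>Hiroshima</td></tr><tr><td>Rating</td><td>3</td></tr>",
    "<tr><td>Location</td><td>Tokyo</td></tr><tr><td>Rating</td><td>5</td></tr>",
]

if __name__ == "__main__":
    print(extract_records(vertical_pages))
```

One limitation of this sketch is that a value which happens to be identical on almost every page would be labeled "N" and mistaken for an attribute name; the abstract does not say how the method handles that case.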

Pages: 942-953

Page count: 12

