Semi-automatic construction of metadata from a series of web documents

Cited by: 0

Authors:

Hirokawa, S

Itoh, E

Miyahara, T

Affiliations:

[1] Kyushu Univ, Comp & Computing Ctr, Higashi Ku, Fukuoka 8128581, Japan

[2] Hiroshima Univ, Fac Informat Sci, Asaminami Ku, Hiroshima 7313194, Japan

Source:

AI 2003: ADVANCES IN ARTIFICIAL INTELLIGENCE | 2003 / Vol. 2903

Keywords:

knowledge acquisition; knowledge engineering; knowledge discovery and data mining; machine learning; ontology;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Metadata plays an important role in discovering, collecting, extracting and aggregating Web data. This paper proposes a method of constructing metadata for a specific topic. The method uses Web pages that are located in one site and are linked from a listing page. Web pages of recipes, real estate, used cars, hotels and syllabi are typical examples. We call such pages a series of Web documents. The pages in a series have the same appearance when a user views them with a browser, because they are often written with the same tag pattern. The method uses this tag pattern as the common structure of the Web pages. The individual contents of a page appear as plain texts embedded between two consecutive tags. If we remove the tags, a page becomes a sequence of plain texts. The plain texts in the same relative position can be interpreted as attribute values if we presume that the pages represent records of the same kind. Most plain texts in a given position vary from page to page. However, the same text may show up at the same relative position in almost all pages. Such constant texts can be considered attribute names. "Location", "Rating" and "Travel from Airport" are examples of such constant texts for pages of hotel information. If the frequency of a text is higher than a threshold, we accept it as a component of the metadata. If we mark a constant text with "N" and a variable text with "V", the sequence of plain texts forms a sequence of N's and V's. A page in a series contains two kinds of NV sequence pattern. The first pattern is (NV)^n, which we call vertical, where an attribute value immediately follows its attribute name. The second pattern is N^n V^n, which we call horizontal, where names occur in the first row and the same number of values follow in the next row. Thus we can understand the meaning of the values and can construct records from a series of Web pages.
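The pipeline described in the abstract (strip tags, mark each text position as N or V by frequency, then read records off the vertical and horizontal patterns) might be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names, the 0.8 threshold, the sample hotel pages, the "Single"/"Double" column names, and the rule for skipping a lone constant such as a page title are all assumptions made for the example. It also assumes every page in the series yields a text sequence of the same length.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the non-empty plain texts found between consecutive tags."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def text_sequence(html):
    """Remove the tags from one page and return its sequence of plain texts."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts


def label_positions(sequences, threshold=0.8):
    """Mark a position N if one text fills it in at least `threshold` of pages."""
    labels = []
    for column in zip(*sequences):  # texts at the same relative position
        _, count = Counter(column).most_common(1)[0]
        labels.append("N" if count / len(column) >= threshold else "V")
    return "".join(labels)


def build_record(labels, texts):
    """Pair names with values using the vertical and horizontal patterns."""
    record = {}
    i = 0
    while i < len(labels):
        if labels[i] != "N":
            i += 1
            continue
        k = len(re.match(r"N+", labels[i:]).group())   # run of names
        v_run = re.match(r"V+", labels[i + k:])
        m = len(v_run.group()) if v_run else 0         # run of values after it
        if k == 1 and m >= 1:                          # vertical: N V
            record[texts[i]] = texts[i + 1]
            i += 2
        elif k >= 2 and m >= k:                        # horizontal: k names, k values
            for j in range(k):
                record[texts[i + j]] = texts[i + k + j]
            i += 2 * k
        else:                                          # lone constant, e.g. a title
            i += 1
    return record


# Hypothetical series of hotel pages sharing one tag pattern.
TEMPLATE = (
    "<h1>Hotel Info</h1>"
    "<p><b>Location</b><span>{loc}</span></p>"
    "<p><b>Rating</b><span>{rating}</span></p>"
    "<table><tr><th>Single</th><th>Double</th></tr>"
    "<tr><td>{single}</td><td>{double}</td></tr></table>"
)
pages = [
    TEMPLATE.format(loc="Tokyo", rating="4", single="80", double="120"),
    TEMPLATE.format(loc="Osaka", rating="3", single="95", double="140"),
    TEMPLATE.format(loc="Fukuoka", rating="5", single="60", double="90"),
]

sequences = [text_sequence(page) for page in pages]
labels = label_positions(sequences)
records = [build_record(labels, seq) for seq in sequences]
print(labels)      # NNVNVNNVV
print(records[0])
```

In this sample the page title is constant across pages and is therefore marked N, but no value follows it directly, so the sketch skips it rather than treating it as an attribute name. Real pages would likely need an alignment step before the position-wise comparison, since optional fields can shift the relative positions.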


Pages: 942-953

Number of pages: 12
