Semi-automatic construction of metadata from a series of web documents

Cited by: 0

Authors:

Hirokawa, S

Itoh, E

Miyahara, T

Affiliations:

[1] Kyushu Univ, Comp & Computing Ctr, Higashi Ku, Fukuoka 8128581, Japan

[2] Hiroshima Univ, Fac Informat Sci, Asaminami Ku, Hiroshima 7313194, Japan

Source:

AI 2003: ADVANCES IN ARTIFICIAL INTELLIGENCE | 2003 / Vol. 2903

Keywords:

knowledge acquisition; knowledge engineering; knowledge discovery and data mining; machine learning; ontology;

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Metadata plays an important role in discovering, collecting, extracting and aggregating Web data. This paper proposes a method of constructing metadata for a specific topic. The method uses Web pages that are located in one site and are linked from a listing page. Web pages for recipes, real estate, used cars, hotels and syllabi are typical examples. We call such pages a series of Web documents. The pages in a series have the same appearance when a user views them in a browser, because they are often written with the same tag pattern. The method uses this tag pattern as the common structure of the Web pages. The individual contents of a page appear as plain texts embedded between two consecutive tags. If we remove the tags, a page becomes a sequence of plain texts. The plain texts in the same relative position can be interpreted as attribute values if we presume that the pages represent records of the same kind. Most of the plain texts in a given position vary from page to page. However, the same text may show up at the same relative position in almost all pages. These constant texts can be considered attribute names. "Location", "Rating" and "Travel from Airport" are examples of such constant texts for pages of hotel information. If the frequency of a text is higher than a threshold, we accept it as a component of the metadata. If we mark a constant text with "N" and a variable text with "V", the sequence of plain texts forms a series of N's and V's. A page in a series contains two kinds of NV sequence pattern. The first pattern is (NV)^n, which we call vertical, where an attribute value immediately follows its attribute name. The second pattern is N^n V^n, which we call horizontal, where the names occur in the first row and the same number of values follow in the next row. Thus we can understand the meaning of the values and can construct records from a series of Web pages.
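The N/V labelling described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. It assumes that every page in the series yields its plain texts in the same absolute order (a simplification of "same relative position"), it handles only the vertical pattern, and the names TextCollector, plain_texts, label_positions, vertical_records and the 0.8 threshold are all hypothetical choices made for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty plain texts found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def plain_texts(html):
    """Drop the tags of one page and return its sequence of plain texts."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def label_positions(sequences, threshold=0.8):
    """Label each position 'N' (constant text) or 'V' (variable text).

    A position is 'N' when its most common text occurs in a fraction of
    pages higher than `threshold` (an assumed value, not from the paper).
    """
    length = min(len(seq) for seq in sequences)
    labels, names = [], []
    for i in range(length):
        text, count = Counter(seq[i] for seq in sequences).most_common(1)[0]
        if count / len(sequences) > threshold:
            labels.append("N")
            names.append(text)
        else:
            labels.append("V")
            names.append(None)
    return "".join(labels), names


def vertical_records(pages, threshold=0.8):
    """Build one record per page, pairing each 'N' with the 'V' right after it."""
    sequences = [plain_texts(page) for page in pages]
    labels, names = label_positions(sequences, threshold)
    records = []
    for seq in sequences:
        record = {}
        for i in range(len(labels) - 1):
            if labels[i] == "N" and labels[i + 1] == "V":
                record[names[i]] = seq[i + 1]
        records.append(record)
    return labels, records


if __name__ == "__main__":
    # Three made-up hotel pages sharing one tag pattern.
    sample_pages = [
        "<h1>Hotel A</h1><b>Location</b><i>Tokyo</i><b>Rating</b><i>4</i>",
        "<h1>Hotel B</h1><b>Location</b><i>Osaka</i><b>Rating</b><i>3</i>",
        "<h1>Hotel C</h1><b>Location</b><i>Fukuoka</i><b>Rating</b><i>5</i>",
    ]
    labels, records = vertical_records(sample_pages)
    print(labels)
    for record in records:
        print(record)
```

A horizontal N^n V^n layout would need a different pairing step, for example matching the k-th name in a run of N's with the k-th value in the following run of V's. That step is left out to keep the sketch short.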

Pages: 942-953

Number of pages: 12

