Semi-automatic construction of metadata from a series of web documents

Authors
Hirokawa, S
Itoh, E
Miyahara, T
Affiliations
[1] Kyushu Univ, Comp & Computing Ctr, Higashi Ku, Fukuoka 8128581, Japan
[2] Hiroshima Univ, Fac Informat Sci, Asaminami Ku, Hiroshima 7313194, Japan
Source
AI 2003: ADVANCES IN ARTIFICIAL INTELLIGENCE | 2003, Vol. 2903
Keywords
knowledge acquisition; knowledge engineering; knowledge discovery and data mining; machine learning; ontology;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Metadata plays an important role in discovering, collecting, extracting and aggregating Web data. This paper proposes a method of constructing metadata for a specific topic. The method uses Web pages that are located in one site and are linked from a listing page. Web pages of recipes, real estate, used cars, hotels and syllabi are typical examples. We call such pages a series of Web documents. The pages in a series look alike when a user views them in a browser, because they are often written with the same tag pattern. The method uses this tag pattern as the common structure of the Web pages. The individual contents of a page appear as plain texts embedded between two consecutive tags. If we remove the tags, a page becomes a sequence of plain texts. The plain texts at the same relative position can be interpreted as attribute values if we presume that the pages represent records of the same kind. Most plain texts at the same position vary from page to page. However, the same text may appear at the same relative position in almost all pages. Such constant texts can be considered attribute names. "Location", "Rating" and "Travel from Airport" are examples of constant texts for pages of hotel information. If the frequency of a text is higher than a threshold, we accept it as a component of the metadata. If we mark a constant text with "N" and a variable text with "V", the sequence of plain texts forms a series of N's and V's. A page in a series contains two kinds of NV sequence pattern. The first pattern is (NV)^n, which we call vertical, where an attribute value immediately follows its attribute name. The second pattern is N^n V^n, which we call horizontal, where the names occur in the first row and the same number of values follow in the next row. Thus we can understand the meaning of the values and can construct records from a series of Web pages.
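The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`plain_texts`, `label_positions`, `pair_fields`, `extract_records`), the regular expression used to strip tags, the threshold value of 0.8, and the sample hotel pages are all assumptions made for illustration. The sketch also assumes that the pages align position by position, and it compares only the positions that every page shares.

```python
import re
from collections import Counter

TAG = re.compile(r"<[^>]+>")  # assumption: a simple regex is enough to find tags


def plain_texts(html):
    """Return the non-empty plain texts found between consecutive tags."""
    return [t.strip() for t in TAG.split(html) if t.strip()]


def label_positions(seqs, threshold=0.8):
    """Mark each relative position 'N' (constant text) or 'V' (variable text)."""
    length = min(len(s) for s in seqs)  # assumption: use only shared positions
    pattern, names = "", []
    for i in range(length):
        text, count = Counter(s[i] for s in seqs).most_common(1)[0]
        is_name = count / len(seqs) > threshold
        pattern += "N" if is_name else "V"
        names.append(text if is_name else None)
    return pattern, names


def pair_fields(pattern, names):
    """Pair attribute names with value positions for both NV patterns."""
    fields, i = [], 0
    while i < len(pattern):
        j = i
        while j < len(pattern) and pattern[j] == "N":
            j += 1
        k = j
        while k < len(pattern) and pattern[k] == "V":
            k += 1
        n, v = j - i, k - j  # lengths of the N run and the V run after it
        if n and v >= n:
            # n == 1 is the vertical case; n > 1 is the horizontal case
            fields += [(names[i + m], j + m) for m in range(n)]
        elif n and v:
            # assumption: earlier constants are headings, keep the last name
            fields.append((names[j - 1], j))
        i = k
    return fields


def extract_records(pages, threshold=0.8):
    """Build one record (dict) per page from the shared NV pattern."""
    seqs = [plain_texts(p) for p in pages]
    pattern, names = label_positions(seqs, threshold)
    fields = pair_fields(pattern, names)
    return pattern, [{name: s[pos] for name, pos in fields} for s in seqs]


# Hypothetical pages of hotel information sharing one tag pattern.
HOTEL_PAGES = [
    "<h1>Hotel Sakura</h1><b>Location</b><p>Fukuoka</p><b>Rating</b><p>4</p>",
    "<h1>Hotel Momiji</h1><b>Location</b><p>Hiroshima</p><b>Rating</b><p>3</p>",
    "<h1>Hotel Ume</h1><b>Location</b><p>Tokyo</p><b>Rating</b><p>5</p>",
]

pattern, records = extract_records(HOTEL_PAGES)
print(pattern)
print(records)
```

In this sketch the hotel name varies on every page, so its position is labeled "V" and left without an attribute name, while "Location" and "Rating" are constant and become attribute names for the values that follow them. A run of several names followed by at least as many values is treated as the horizontal pattern, with names and values paired in order.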
Pages: 942-953
Page count: 12