Annotated Open Corpus Construction and BERT-Based Approach for Automatic Metadata Extraction From Korean Academic Papers

Times cited: 0
Authors
Kong, Hyesoo [1]
Yoon, Hwamook [1]
Seol, Jaewook [1]
Hyun, Mihwan [1]
Lee, Hyejin [1]
Kim, Soonyoung [1]
Choi, Wonjun [1]
Affiliations
[1] Korea Inst Sci & Technol Informat, Digital Curat Ctr, Daejeon 34141, South Korea
Keywords
BERT; corpus construction; metadata extraction; transfer learning; AGREEMENT
DOI
10.1109/ACCESS.2022.3233228
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
With the accelerating development of science and technology, the number of academic papers published in various fields is increasing rapidly. Academic papers, especially in science and technology fields, are a crucial medium for researchers who develop new technologies by identifying the latest technological trends and who conduct derivative studies. Therefore, the continual collection of academic papers at scale, the structuring of their metadata, and the construction of databases are significant tasks. However, research on automatic metadata extraction from Korean papers is not currently being actively conducted, owing to insufficient Korean training data. We automatically constructed the largest labeled corpus in South Korea to date from 315,320 PDF papers belonging to 503 Korean academic journals; this labeled corpus can be used to train models that automatically extract 12 metadata types from PDF papers. The labeled corpus is available at https://doi.org/10.23057/48. Moreover, we developed an inspection process and guidelines for the automatically constructed data and performed a full inspection of the validation and testing data. The reliability of the inspected data was verified through inter-annotator agreement measurement. Using our corpus, we trained and evaluated a BERT-based transfer learning model to verify the corpus's reliability. Furthermore, we proposed new training methods that can improve metadata extraction performance on Korean papers, and through these methods, we developed the KorSciBERT-ME-J and KorSciBERT-ME-J+C models. KorSciBERT-ME-J showed the highest performance, with an F1 score of 99.36%, as well as robust performance in automatic metadata extraction from Korean academic papers in various formats.
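The abstract does not say which inter-annotator agreement statistic was used. As a minimal sketch, assuming two annotators and Cohen's kappa (a common choice, not confirmed by this record), agreement on per-line metadata labels could be computed as follows. The function name, the label names, and the sample data are hypothetical and are not taken from the paper or its corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six text lines from a paper's first page.
annotator_1 = ["title", "author", "author", "abstract", "abstract", "keyword"]
annotator_2 = ["title", "author", "affiliation", "abstract", "abstract", "keyword"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few labels (such as abstract body lines) dominate the data.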
Pages: 825-838
Number of pages: 14