LAME: Layout-Aware Metadata Extraction Approach for Research Articles

Cited by: 1
Authors
Choi, Jongyun [1]
Kong, Hyesoo [2]
Yoon, Hwamook [2]
Oh, Heungseon [3]
Jung, Yuchul [1]
Affiliations
[1] Kumoh Natl Inst Technol KIT, Dept Comp Engn, Gumi, South Korea
[2] Korea Inst Sci & Technol Informat KISTI, Daejeon, South Korea
[3] Korea Univ Technol & Educ KOREATECH, Sch Comp Sci & Engn, Cheonan, South Korea
Source
CMC-COMPUTERS MATERIALS & CONTINUA | 2022 / Vol. 72 / No. 2
Funding
National Research Foundation of Singapore;
Keywords
Automatic layout analysis; layout-MetaBERT; metadata extraction; research article;
DOI
10.32604/cmc.2022.025711
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
The volume of academic literature, such as academic conference papers and journals, has increased rapidly worldwide, and research on metadata extraction is ongoing. However, high-performing metadata extraction remains challenging because layout formats vary from one journal publisher to another. To accommodate the diversity of academic journal layouts, we propose a novel LAyout-aware Metadata Extraction (LAME) framework with three characteristics: the design of an automatic layout analysis, the construction of a large metadata training set, and the implementation of a metadata extractor. In the framework, we designed an automatic layout analysis using PDFMiner. Based on the layout analysis, a large volume of metadata-separated training data, including the title, abstract, author name, author affiliated organization, and keywords, was automatically extracted. Moreover, we constructed a pre-trained model, Layout-MetaBERT, to extract the metadata from academic journals with varying layout formats. In our experiments, the metadata extractor exhibited robust performance (Macro-F1, 93.27%) on unseen journals with different layout formats.
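The abstract reports its result as Macro-F1, the unweighted mean of per-class F1 scores, so each metadata field counts equally however often it occurs. The following is a minimal sketch of that metric, not the authors' evaluation code; the function name `macro_f1` and the sample labels are hypothetical, and it assumes one gold label and one predicted label per text block.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            true_pos[g] += 1
        else:
            false_pos[p] += 1  # predicted label p where it was wrong
            false_neg[g] += 1  # missed an instance of gold label g

    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = true_pos[label]
        predicted = tp + false_pos[label]
        actual = tp + false_neg[label]
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)

    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels for four text blocks (illustration only).
gold = ["title", "author", "abstract", "title"]
pred = ["title", "author", "title", "title"]
print(round(macro_f1(gold, pred), 4))
```

In this toy case the per-class F1 scores are 0.8 for title, 1.0 for author, and 0.0 for abstract, giving a Macro-F1 of about 0.6. A single missed field pulls the score down sharply, which is why the metric suits evaluation across fields of unequal frequency.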
Pages: 4019-4037
Number of pages: 19