LAME: Layout-Aware Metadata Extraction Approach for Research Articles

Cited by: 1

Authors:

Choi, Jongyun [1]

Kong, Hyesoo [2]

Yoon, Hwamook [2]

Oh, Heungseon [3]

Jung, Yuchul [1]

Affiliations:

[1] Kumoh Natl Inst Technol KIT, Dept Comp Engn, Gumi, South Korea

[2] Korea Inst Sci & Technol Informat KISTI, Daejeon, South Korea

[3] Korea Univ Technol & Educ KOREATECH, Sch Comp Sci & Engn, Cheonan, South Korea

Source:

CMC-COMPUTERS MATERIALS & CONTINUA | 2022 / Vol. 72 / Issue 02

Funding:

National Research Foundation, Singapore;

Keywords:

Automatic layout analysis; layout-MetaBERT; metadata extraction; research article;

DOI:

10.32604/cmc.2022.025711

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

The volume of academic literature, such as academic conference papers and journals, has increased rapidly worldwide, and research on metadata extraction is ongoing. However, high-performing metadata extraction is still challenging due to diverse layout formats according to journal publishers. To accommodate the diversity of the layouts of academic journals, we propose a novel LAyout-aware Metadata Extraction (LAME) framework equipped with the three characteristics (e.g., design of automatic layout analysis, construction of a large metadata training set, and implementation of metadata extractor). In the framework, we designed an automatic layout analysis using PDFMiner. Based on the layout analysis, a large volume of metadata-separated training data, including the title, abstract, author name, author affiliated organization, and keywords, were automatically extracted. Moreover, we constructed a pre-trained model, Layout-MetaBERT, to extract the metadata from academic journals with varying layout formats. The experimental results with our metadata extractor exhibited robust performance (Macro-F1, 93.27%) in metadata extraction for unseen journals with different layout formats.

Pages: 4019-4037

Page count: 19
