Implicit Semantics Based Metadata Extraction and Matching of Scholarly Documents

Cited by: 5
Authors
Jiang, Congfeng [1]
Liu, Junming [1]
Ou, Dongyang [1]
Wang, Yumei [1]
Yu, Lifeng [2]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Zhejiang, Peoples R China
[2] Hithink RoyalFlush Informat Network Co Ltd, Hangzhou, Zhejiang, Peoples R China
Keywords
Formatting Semantics; Information Retrieval; Metadata Extraction; PDF Document; Template; Information Extraction; Automatic Extraction
DOI
10.4018/JDM.2018040101
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The authors propose using formatting templates and implicit formatting semantics for automatic metadata identification and segmentation. The plain text and its corresponding formatting information, including line height, font type, and font size, are recognized in parallel to guide metadata identification. To increase extraction accuracy, the authors use implicit formatting semantics, such as changes of formatting, formatting templates and implications, and explicit formatting layouts, as well as a predefined database of frequently occurring keywords. Unlike OCR-based approaches, the authors use the open source PDFBox package as the basic preprocessing tool to obtain the plain text and formatting values of the document contents. On top of PDFBox they built their own pipeline program, PAXAT, to implement their metadata extraction approach. In total, 10177 papers from arXiv, ACM, ACL, and other publicly accessible and institution-subscribed sources were tested. The overall extraction accuracies for title, authors, affiliations, and author-affiliation matching are 0.9798, 0.9425, 0.9298, and 0.9109, respectively.
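To illustrate the general idea of segmenting metadata by changes in formatting, here is a minimal Python sketch. It is not PAXAT and does not call PDFBox: the `Line` structure, the sample lines, and the rule that the largest-font block is the title are all assumptions made for illustration, standing in for whatever formatting values a PDF text extractor would supply.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Line:
    """One text line plus its formatting values (hypothetical structure)."""
    text: str
    font: str
    size: float


def segment_by_format(lines):
    """Group consecutive lines that share the same font name and size."""
    blocks = []
    for (font, size), group in groupby(lines, key=lambda ln: (ln.font, ln.size)):
        text = " ".join(ln.text for ln in group)
        blocks.append({"font": font, "size": size, "text": text})
    return blocks


def guess_title(blocks):
    """Assumed heuristic: the title is the block set in the largest font."""
    return max(blocks, key=lambda b: b["size"])["text"]


# Made-up first-page lines, for illustration only.
page = [
    Line("A Sample Paper Title", "Times-Bold", 18.0),
    Line("That Spans Two Lines", "Times-Bold", 18.0),
    Line("Alice Example, Bob Example", "Times-Roman", 12.0),
    Line("Example University", "Times-Italic", 10.0),
]

blocks = segment_by_format(page)
title = guess_title(blocks)
```

A formatting change between adjacent lines closes one block and opens the next, so the two title lines merge while the author and affiliation lines stay separate. A real pipeline would presumably combine such cues with templates and keyword lookups, as the abstract describes.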
Pages: 1-22
Page count: 22