Extracting Body Text from Academic PDF Documents for Text Mining

Cited by: 3
Authors
Yu, Changfeng [1]
Zhang, Cheng [1]
Wang, Jie [1]
Affiliation
[1] Univ Massachusetts, Dept Comp Sci, Lowell, MA 01854 USA
Source
PROCEEDINGS OF THE 12TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE DISCOVERY, KNOWLEDGE ENGINEERING AND KNOWLEDGE MANAGEMENT (KDIR), VOL 1 | 2020
Keywords
Body-text Extraction; HTML Replication of PDF; Line Sweeping; Backward Traversal
DOI
10.5220/0010131402350242
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Accurate extraction of body text from PDF-formatted academic documents is essential in text-mining applications for deeper semantic understanding. The objective is to extract complete sentences in the body text into a plain-text (.txt) file while preserving the original sentence flow and paragraph boundaries. Existing tools for extracting text from PDF documents often mix body and nonbody text. We devise and implement a system called PDFBoT that detects multiple-column layouts using a line-sweeping technique, removes nonbody text using computed text features and syntactic tagging in backward traversal, and aligns the remaining text back into sentences and paragraphs. We show that PDFBoT is highly accurate, with average F1 scores of 0.99 on extracting sentences, 0.96 on extracting paragraphs, and 0.98 on removing text in tables, figures, and charts, over a corpus of PDF documents randomly selected from arXiv.org across multiple academic disciplines.
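The abstract names line sweeping as the way multiple-column layouts are detected but gives no details. As a rough illustration only, the sketch below shows one generic way a vertical sweep line can find the empty gutter between columns. It is not PDFBoT's actual algorithm; the function name `find_gutters`, the `(x_left, x_right)` box representation, and the `min_gap` threshold are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical input: the horizontal extent (x_left, x_right) of each text line on a page.
Box = Tuple[float, float]


def find_gutters(boxes: List[Box], page_width: float,
                 min_gap: float = 10.0) -> List[Tuple[float, float]]:
    """Sweep a vertical line from left to right and return interior
    x-intervals, at least `min_gap` wide, that no text box covers."""
    # Each box contributes a "start" event (+1) and an "end" event (-1).
    events = []
    for left, right in boxes:
        events.append((left, 1))
        events.append((right, -1))
    # At equal x, handle starts before ends so touching boxes leave no gap.
    events.sort(key=lambda e: (e[0], -e[1]))

    gaps = []
    active = 0          # number of boxes the sweep line currently crosses
    gap_start = 0.0     # the sweep begins at the left page edge
    for x, delta in events:
        if active == 0 and delta == 1 and x - gap_start >= min_gap:
            gaps.append((gap_start, x))
        active += delta
        if active == 0:
            gap_start = x
    if page_width - gap_start >= min_gap:
        gaps.append((gap_start, page_width))

    # Keep only interior gaps; gaps touching a page edge are margins.
    return [g for g in gaps if g[0] > 0.0 and g[1] < page_width]


# Two columns of text lines on a 612-point-wide page (made-up coordinates).
two_column = [(50, 290), (55, 280), (320, 560), (325, 550)]
one_column = [(50, 560), (50, 540)]
gutters_two = find_gutters(two_column, page_width=612)
gutters_one = find_gutters(one_column, page_width=612)
print(gutters_two, gutters_one)
```

Under these assumptions, one interior gutter suggests a two-column page and no gutter suggests a single column. A real system would also have to handle full-width titles, figures, and footnotes that span the gutter, which this sketch ignores.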
Pages: 235-242
Number of pages: 8