A Highly Accurate PDF-To-Text Conversion System for Academic Papers Using Natural Language Processing Approach

Cited by: 0
Authors
Yong, Tien Fui [1]
Azad, Saiful [2, 3]
Rahman, Mohammed Mostafizur [4]
Zamli, Kamal Z. [2, 3]
Rabby, Gollam [2]
Affiliations
[1] Univ Tunku Abdul Rahman, Fac Informat & Commun Technol, Kampar 31900, Perak, Malaysia
[2] Univ Malaysia Pahang, Fac Comp Syst & Software Engn, Gambang 26300, Pahang, Malaysia
[3] UMP, IBM Ctr Excellence, Gambang, Malaysia
[4] Amer Int Univ Bangladesh, Dhaka, Bangladesh
Keywords
PDF-To-Text Conversion; Natural Language Processing; Edit Distance
DOI
10.1166/asl.2018.13029
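Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject classification code
07; 0710; 09
Abstract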
Extracting text from PDF documents is never an easy task when high accuracy and consistency are the two main criteria to be met. Although a considerable number of such systems exist, most of them fall short of the desired performance, especially where academic literature is concerned. Researchers who are heavily involved in text mining and project analysis need an accurate and consistent supporting tool for PDF-To-Text (PTT) conversion. Therefore, in this paper, we propose a Natural Language Processing based PDF-to-text (NLPDF) conversion system, which comprises two major steps, namely (i) reading the content from the PDF and (ii) reconstructing the text. The performance of the proposed system is evaluated via four metrics, namely Precision, Recall, F-Measure (AF), and standard deviation, and compared with eight other similar benchmark systems available on the market. The experimental results demonstrate the effectiveness of the proposed system.
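The abstract names Precision, Recall, and F-Measure as evaluation metrics and lists Edit Distance as a keyword, but this record does not say how the paper computes them. The sketch below is one common way to score extracted text against a reference transcription, assuming whitespace tokenization and multiset token matching; the function names `token_prf` and `edit_distance` are hypothetical and not taken from the paper.

```python
from collections import Counter


def token_prf(extracted: str, reference: str) -> tuple[float, float, float]:
    """Token-level precision, recall, and F-measure.

    Assumption: tokens are whitespace-separated and matched as multisets,
    so word order is ignored. The paper may define the metrics differently.
    """
    ext = Counter(extracted.split())
    ref = Counter(reference.split())
    # Multiset intersection: each token counts at most min(ext, ref) times.
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute; each costs 1)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

In a setup like this, precision penalizes spurious tokens the converter emits (for example, header or footer residue), recall penalizes tokens it drops, and the edit distance gives a character-level view of how far the reconstructed text is from the reference.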
Pages: 7844-7849
Number of pages: 6
Related papers
50 in total
  • [41] Mapping Free Text into MedDRA by Natural Language Processing: a Modular Approach in Designing and Evaluating Software Extensions
    Zorzi, Margherita
    Combi, Carlo
    Pozzani, Gabriele
    Moretti, Ugo
    ACM-BCB' 2017: PROCEEDINGS OF THE 8TH ACM INTERNATIONAL CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY, AND HEALTH INFORMATICS, 2017, : 27 - 35
  • [42] An Overview of Ways of Discovering Cause-Effect Relations in Text by Using Natural Language Processing
    Nazaruka, Erika
    EVALUATION OF NOVEL APPROACHES TO SOFTWARE ENGINEERING, 2020, 1172 : 22 - 38
  • [43] The Human Right to Water and Sanitation: Using Natural Language Processing to Uncover Patterns in Academic Publishing
    Faulkner, Christopher Michael
    Lambert, Joshua Earl
    Wilson, Bruce M.
    Faulkner, Matthew Steven
    WATER, 2021, 13 (24)
  • [44] The Text-Package: An R-Package for Analyzing and Visualizing Human Language Using Natural Language Processing and Transformers
    Kjell, Oscar
    Giorgi, Salvatore
    Schwartz, H. Andrew
    PSYCHOLOGICAL METHODS, 2023, 28 (06) : 1478 - 1498
  • [45] Telugu Movie Review Sentiment Analysis Using Natural Language Processing Approach
    Badugu, Srinivasu
    DATA ENGINEERING AND COMMUNICATION TECHNOLOGY, ICDECT-2K19, 2020, 1079 : 685 - 695
  • [46] A Novel Approach for Spam Detection Using Natural Language Processing With AMALS Models
    Agarwal, Ruchi
    Dhoot, Anshita
    Kant, Surya
    Singh Bisht, Vimal
    Malik, Hasmat
    Ansari, Md. Fahim
    Afthanorhan, Asyraf
    Hossaini, Mohammad Asef
    IEEE ACCESS, 2024, 12 : 124298 - 124313
  • [47] A Language Independent Decision Support System for Diagnosis and Treatment by Using Natural Language Processing Techniques
    Gokgol, Merve Kevser
    Orhan, Zeynep
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON MEDICAL AND BIOLOGICAL ENGINEERING, CMBEBIH 2019, 2020, 73 : 721 - 728
  • [48] Cyber Threat Analysis Using Natural Language Processing for a Secure Healthcare System
    Islam, Shareeful
    Papastergiou, Spyridon
    Silvestri, Stefano
    2022 27TH IEEE SYMPOSIUM ON COMPUTERS AND COMMUNICATIONS (IEEE ISCC 2022), 2022,
  • [49] Resume Classification System using Natural Language Processing and Machine Learning Techniques
    Ali, Irfan
    Mughal, Nimra
    Khand, Zahid Hussain
    Ahmed, Javed
    Mujtaba, Ghulam
    MEHRAN UNIVERSITY RESEARCH JOURNAL OF ENGINEERING AND TECHNOLOGY, 2022, 41 (01) : 65 - 79
  • [50] IMPROVING AN E-LEARNING SYSTEM USING TECHNOLOGIES FOR NATURAL LANGUAGE PROCESSING
    Dobre, Iuliana
    ADVANCED DISTRIBUTED LEARNING IN EDUCATION AND TRAINING TRANSFORMATION, 2010, : 209 - 218