A Highly Accurate PDF-To-Text Conversion System for Academic Papers Using Natural Language Processing Approach

Cited by: 0
Authors
Yong, Tien Fui [1]
Azad, Saiful [2, 3]
Rahman, Mohammed Mostafizur [4]
Zamli, Kamal Z. [2, 3]
Rabby, Gollam [2]
Affiliations
[1] Univ Tunku Abdul Rahman, Fac Informat & Commun Technol, Kampar 31900, Perak, Malaysia
[2] Univ Malaysia Pahang, Fac Comp Syst & Software Engn, Gambang 26300, Pahang, Malaysia
[3] UMP, IBM Ctr Excellence, Gambang, Malaysia
[4] Amer Int Univ Bangladesh, Dhaka, Bangladesh
Keywords
PDF-To-Text Conversion; Natural Language Processing; Edit Distance
DOI
10.1166/asl.2018.13029
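Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract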
Extracting text from PDF documents is never an easy task when high accuracy and consistency are the two main criteria to be met. Although a considerable number of such systems exist, most of them fall short of the desired performance, especially on academic literature. Researchers who are heavily involved in text mining and project analysis need an accurate and consistent supporting tool for PDF-To-Text (PTT) conversion. Therefore, in this paper, we propose a Natural Language Processing based PDF-to-text (NLPDF) conversion system, which comprises two major steps, namely (i) reading the contents of the PDF and (ii) reconstructing the text. The performance of the proposed system is evaluated via four metrics, namely Precision, Recall, F-Measure (AF), and standard deviation, and compared with eight other similar benchmark systems available on the market. The experimental results demonstrate the effectiveness of the proposed system.
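The abstract names Precision, Recall, F-Measure, and standard deviation as evaluation metrics and lists Edit Distance as a keyword, but it does not say how they are computed. The following is only a minimal sketch under assumed definitions: token-level overlap between extracted and reference text for the three scores, Levenshtein distance for edit distance, and population standard deviation across documents. The function names edit_distance and token_scores are hypothetical and are not taken from the paper.

```python
from collections import Counter
from statistics import pstdev
from typing import Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def token_scores(extracted: str, reference: str) -> Tuple[float, float, float]:
    """Assumed token-level precision, recall and F-measure (not the paper's definition)."""
    ext = Counter(extracted.split())
    ref = Counter(reference.split())
    overlap = sum((ext & ref).values())  # multiset intersection of tokens
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical per-document F-measures; the spread indicates consistency.
f_values = [token_scores("a b c d", "a b c e")[2],
            token_scores("a b", "a b")[2]]
spread = pstdev(f_values)
```

Under these assumptions, a lower standard deviation of the per-document scores would correspond to the consistency criterion the abstract emphasizes.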
Pages: 7844-7849
Number of pages: 6
Related papers (50 in total)
  • [21] STUDY OF ACADEMIC WRITING EVOLUTION IN GEOSPATIAL DOMAIN USING NATURAL LANGUAGE PROCESSING TECHNIQUES
    Barb, Adrian S.
    Chaudhary, Namrata
    IGARSS 2020 - 2020 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2020, : 565 - 568
  • [22] Construction site accident analysis using text mining and natural language processing techniques
    Zhang, Fan
    Fleyeh, Hasan
    Wang, Xinru
    Lu, Minghui
    AUTOMATION IN CONSTRUCTION, 2019, 99 : 238 - 248
  • [23] Novel Text Steganography Using Natural Language Processing and Part-of-Speech Tagging
    Banik, Barnali Gupta
    Bandyopadhyay, Samir Kumar
    IETE JOURNAL OF RESEARCH, 2020, 66 (03) : 384 - 395
  • [24] Web Document Text and Images Extraction using DOM Analysis and Natural Language Processing
    Joshi, Parag Mulendra
    Liu, Sam
    DOCENG'09: PROCEEDINGS OF THE 2009 ACM SYMPOSIUM ON DOCUMENT ENGINEERING, 2009, : 218 - 221
  • [25] Extraction of Disease Symptoms from Free Text Using Natural Language Processing Techniques
    Laabidi, Adil
    Aissaoui, Mohammed
    Madani, Mohamed Amine
    PROCEEDINGS OF NINTH INTERNATIONAL CONGRESS ON INFORMATION AND COMMUNICATION TECHNOLOGY, VOL 2, ICICT 2024, 2024, 1012 : 549 - 561
  • [26] An ensemble approach for healthcare application and diagnosis using natural language processing
    Badi Alekhya
    R. Sasikumar
    Cognitive Neurodynamics, 2022, 16 : 1203 - 1220
  • [27] Using a Natural Language Processing Approach to Support Rapid Knowledge Acquisition
    Koonce, Taneya Y.
    Giuse, Dario A.
    Williams, Annette M.
    Blasingame, Mallory N.
    Krump, Poppy A.
    Su, Jing
    Giuse, Nunzia B.
    JMIR MEDICAL INFORMATICS, 2024, 12
  • [28] An ensemble approach for healthcare application and diagnosis using natural language processing
    Alekhya, Badi
    Sasikumar, R.
    COGNITIVE NEURODYNAMICS, 2022, 16 (05) : 1203 - 1220
  • [29] Detecting Weak Signals of the Future: A System Implementation Based on Text Mining and Natural Language Processing
    Griol-Barres, Israel
    Milla, Sergio
    Cebrian, Antonio
    Fan, Huaan
    Millet, Jose
    SUSTAINABILITY, 2020, 12 (19)
  • [30] A Hybrid Knowledge Mining Approach to Develop a System Framework for Odia Language Text Processing
    Mishra, Brojo Kishore
    Sahoo, Rekhanjali
    MATERIALS TODAY-PROCEEDINGS, 2018, 5 (01) : 1335 - 1340