Machine Learning Methods for Extracting Newspaper Articles from PDF Files

Times cited: 0
Authors
Fatima, Peer [1]
Fathima, S. K. [2]
Al Khatatneh, Arwa Mahmoud [1]
Al Qudah, Mosab Kasim [1]
Affiliations
[1] Taibah Univ, Dept Comp Sci & Engn, Madinah, Saudi Arabia
[2] Sona Coll Technol, Dept Comp Sci & Engn, Salem 636005, Tamil Nadu, India
Source
2024 5TH INTERNATIONAL CONFERENCE ON INNOVATIVE TRENDS IN INFORMATION TECHNOLOGY, ICITIIT 2024 | 2024
Keywords
Automated classification; machine learning; CNN; Support Vector Machine; bibliographic data; SwePub; TEXT CLASSIFICATION
DOI
10.1109/CITIIT61487.2024.10580612
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
As digital newspaper archives continue to grow, automated methods for extracting and organizing information from PDF files become increasingly critical. This study addresses layout extraction for newspaper articles using machine learning techniques. The objective is a robust and scalable solution that accurately identifies and separates the distinct elements within the complex layout of a newspaper page. Extracting the layout of newspaper articles from PDF files is a crucial step in making the vast amount of historical and contemporary news content more accessible and searchable. Manual extraction is time-consuming and impractical for large datasets, which motivates automating the process with machine learning. To that end, the research designs and implements machine learning algorithms that accurately parse the layout of newspaper articles within PDF files, identifying and delineating text, images, captions, headlines, and other elements across diverse page layouts. The study combines image processing with machine learning. First, the PDF pages are converted into images and preprocessed to enhance the quality of the input data. Next, a machine learning model, such as a convolutional neural network (CNN), is trained to recognize and classify the different layout elements. Finally, post-processing steps refine the extraction results and improve the overall accuracy of layout segmentation.
Pages: 6
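
The abstract outlines a staged pipeline: convert pages to images, preprocess, classify layout elements, then post-process. The sketch below illustrates that flow on a tiny synthetic page and is not the authors' implementation. It assumes the PDF page has already been rendered to a grayscale pixel grid, uses simple thresholding and connected-component labeling to propose regions, and substitutes a hypothetical geometric rule (`classify_region`) for the trained CNN. All function names, thresholds, and the demo page are illustrative assumptions.

```python
from collections import deque


def binarize(page, threshold=128):
    """Preprocessing: map grayscale pixels (0-255) to 1 = ink, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in page]


def find_regions(binary):
    """Return bounding boxes (top, left, bottom, right) of 4-connected ink regions."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def classify_region(box, page_width):
    """Hypothetical stand-in for the trained CNN: label a region by its width only."""
    _, left, _, right = box
    width = right - left + 1
    return "headline" if width >= 0.8 * page_width else "body"


def extract_layout(page, min_area=2):
    """Run preprocessing, region proposal, classification, and noise filtering."""
    binary = binarize(page)
    page_width = len(page[0])
    layout = []
    for box in find_regions(binary):
        top, left, bottom, right = box
        area = (bottom - top + 1) * (right - left + 1)
        if area < min_area:
            continue  # Post-processing: discard specks too small to be content.
        layout.append((classify_region(box, page_width), box))
    return layout


def make_demo_page():
    """Build an 8x10 synthetic page: one headline, two columns, one noise speck."""
    white, ink = 255, 0
    page = [[white] * 10 for _ in range(8)]
    for x in range(10):
        page[1][x] = ink                      # full-width headline
    for y in range(3, 6):
        for x in list(range(0, 4)) + list(range(6, 10)):
            page[y][x] = ink                  # two body-text columns
    page[7][5] = ink                          # isolated noise pixel
    return page


if __name__ == "__main__":
    for label, box in extract_layout(make_demo_page()):
        print(label, box)
```

On the demo page this prints `headline (1, 0, 1, 9)`, `body (3, 0, 5, 3)`, and `body (3, 6, 5, 9)`; the single-pixel speck is filtered out. In a real system the rule in `classify_region` would be replaced by a model trained on labeled page crops, and the region proposal step would typically rely on an image-processing library rather than a hand-written flood fill.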