Machine Learning Methods for Extracting Newspaper Articles from PDF Files

Cited by: 0

Authors:

Fatima, Peer [1]

Fathima, S. K. [2]

Al Khatatneh, Arwa Mahmoud [1]

Al Qudah, Mosab Kasim [1]

Affiliations:

[1] Taibah Univ, Dept Comp Sci & Engn, Madinah, Saudi Arabia

[2] Sona Coll Technol, Dept Comp Sci & Engn, Salem 636005, Tamil Nadu, India

Source:

2024 5TH INTERNATIONAL CONFERENCE ON INNOVATIVE TRENDS IN INFORMATION TECHNOLOGY, ICITIIT 2024 | 2024

Keywords:

Automated classification; machine learning; CNN; Support Vector Machine; bibliographic data; SwePub; TEXT CLASSIFICATION;

DOI:

10.1109/CITIIT61487.2024.10580612

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

As digital archives of newspapers continue to grow, automated methods for extracting and organizing information from PDF files become increasingly critical. This study addresses the challenge of layout extraction for newspaper articles using machine learning techniques. The objective is to develop a robust and scalable solution that accurately identifies and separates distinct elements within the complex layout of newspaper pages. Layout extraction from PDF files is a crucial step in making the vast amount of historical and contemporary news content more accessible and searchable. Manual extraction is time-consuming and impractical for large datasets, which motivates the use of machine learning methods to automate the process. To that end, the research designs and implements machine learning algorithms that parse the layout of newspaper articles within PDF files, identifying and delineating text, images, captions, headlines, and other elements present in the diverse layouts of newspaper pages. The study combines image processing and machine learning techniques. First, the PDF pages are converted into images, and preprocessing methods are applied to enhance the quality of the input data. Next, a machine learning model, such as a Convolutional Neural Network (CNN), is trained to recognize and classify the different layout elements. Finally, post-processing steps refine the extraction results and improve the overall accuracy of layout segmentation.
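
The stages named in the abstract (preprocessing, classification of layout elements, post-processing) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it is hypothetical. It assumes the PDF page has already been rendered to a small grayscale grid (0 = black, 255 = white), and it replaces the trained CNN with a placeholder shape heuristic, since the record gives no details of the model.

```python
from collections import deque


def binarize(page, threshold=128):
    """Preprocessing: map grayscale pixels (0-255) to 1 (ink) or 0 (background)."""
    return [[1 if px < threshold else 0 for px in row] for row in page]


def find_regions(binary):
    """Segment the page into bounding boxes of 4-connected ink regions."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def classify_region(box):
    """ASSUMPTION: stand-in for the trained CNN, using box shape only."""
    top, left, bottom, right = box
    box_height, box_width = bottom - top + 1, right - left + 1
    return "headline" if box_width >= 3 * box_height else "text"


def drop_small(boxes, min_area=2):
    """Post-processing: discard regions too small to be a layout element."""
    return [b for b in boxes
            if (b[2] - b[0] + 1) * (b[3] - b[1] + 1) >= min_area]


def extract_layout(page):
    """Run preprocessing, segmentation, post-processing, then classification."""
    boxes = drop_small(find_regions(binarize(page)))
    return [(classify_region(b), b) for b in boxes]


W, B = 255, 0  # white and black pixels for the toy page
demo_page = [
    [W, W, W, W, W, W, W, W],
    [W, B, B, B, B, B, B, W],  # wide, short block
    [W, W, W, W, W, W, W, W],
    [W, B, B, W, W, W, W, B],  # square block, plus a one-pixel speck
    [W, B, B, W, W, W, W, W],
]
layout = extract_layout(demo_page)
```

On the toy page, `layout` holds two labelled boxes in `(top, left, bottom, right)` form, and the isolated speck is filtered out by `drop_small`. In a real system the heuristic in `classify_region` would be replaced by the trained model's prediction for the cropped region.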


Pages: 6

15 references in total

[1] Automated Text Classification of News Articles: A Practical Guide
Barbera, Pablo
Boydstun, Amber E.
Linn, Suzanna
McMahon, Ryan
Nagler, Jonathan
[J]. POLITICAL ANALYSIS, 2021, 29 (01) : 19 - 42
[2] Automated classification of social network messages into Smart Cities dimensions
Bencke, Luciana
Cechinel, Cristian
Munoz, Roberto
[J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2020, 109 : 218 - 237
[3] Using Word Order in Political Text Classification with Long Short-term Memory Models
Chang, Charles
Masterson, Michael
[J]. POLITICAL ANALYSIS, 2020, 28 (03) : 395 - 411
[4] Monitoring the public opinion about the vaccination topic from tweets analysis
D'Andrea, Eleonora
Ducange, Pietro
Bechini, Alessio
Renda, Alessandro
Marcelloni, Francesco
[J]. EXPERT SYSTEMS WITH APPLICATIONS, 2019, 116 : 209 - 226
[5] A gating context-aware text classification model with BERT and graph convolutional networks
Gao, Weiqi
Huang, Hao
[J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2021, 40 (03) : 4331 - 4343
[6] Using Social Media to Mine and Analyze Public Opinion Related to COVID-19 in China
Han, Xuehua
Wang, Juanle
Zhang, Min
Wang, Xiaojie
[J]. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH, 2020, 17 (08)
[7] Comparing automated text classification methods
Hartmann, Jochen
Huppertz, Juliana
Schamp, Christina
Heitmann, Mark
[J]. INTERNATIONAL JOURNAL OF RESEARCH IN MARKETING, 2019, 36 (01) : 20 - 38
[8] Distributed Framework for Automating Opinion Discretization From Text Corpora on Facebook
Hiep Xuan Huynh
Vu Tuan Nguyen
Nghia Duong-Trung
Van-Huy Pham
Cang Thuong Phan
[J]. IEEE ACCESS, 2019, 7 : 78675 - 78684
[9] GHS-NET a generic hybridized shallow neural network for multi-label biomedical text classification
Ibrahim, Muhammad Ali
Khan, Muhammad Usman Ghani
Mehmood, Faiza
Asim, Muhammad Nabeel
Mahmood, Waqar
[J]. JOURNAL OF BIOMEDICAL INFORMATICS, 2021, 116
[10] Bi-LSTM Model to Increase Accuracy in Text Classification: Combining Word2vec CNN and Attention Mechanism
Jang, Beakcheol
Kim, Myeonghwi
Harerimana, Gaspard
Kang, Sang-ug
Kim, Jong Wook
[J]. APPLIED SCIENCES-BASEL, 2020, 10 (17)
