MCNN-LSTM: Combining CNN and LSTM to Classify Multi-Class Text in Imbalanced News Data

Cited by: 17
Authors
Hasib, Khan Md [1]
Azam, Sami [2]
Karim, Asif [2]
Marouf, Ahmed Al [3]
Shamrat, F. M. Javed Mehedi [4]
Montaha, Sidratul [5]
Yeo, Kheng Cher [2]
Jonkman, Mirjam [2]
Alhajj, Reda [3, 6, 7]
Rokne, Jon G. [3]
Affiliations
[1] Bangladesh Univ Business & Technol, Dept Comp Sci & Engn, Dhaka 1216, Bangladesh
[2] Charles Darwin Univ, Fac Sci & Technol, Casuarina, NT 0810, Australia
[3] Univ Calgary, Dept Comp Sci, Calgary, AB T2N 1N4, Canada
[4] Univ Malaya, Dept Comp Syst & Technol, Kuala Lumpur 50603, Malaysia
[5] Daffodil Int Univ, Dept Comp Sci & Engn, Dhaka 1207, Bangladesh
[6] Istanbul Medipol Univ, Dept Comp Engn, TR-34810 Istanbul, Turkiye
[7] Univ Southern Denmark, Dept Health Informat, DK-5230 Odense, Denmark
Keywords
Big data; text classification; imbalanced data; machine learning; MCNN-LSTM; CLASSIFICATION; EMAILS;
DOI
10.1109/ACCESS.2023.3309697
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Searching, retrieving, and organizing text in ever-larger document collections requires more efficient information processing algorithms. Document categorization is a crucial component of many information processing systems built on supervised learning. As the quantity of documents grows, the performance of classic supervised classifiers deteriorates because of the growing number of document categories. Text classification, the task of assigning documents to a predetermined set of classes, is used extensively in a wide range of data-intensive applications. However, real-world implementations of these models still have shortcomings that call for further investigation; in particular, imbalanced datasets hinder the most prevalent high-performance algorithms. In this paper, we propose an approach named multi-class Convolutional Neural Network (MCNN)-Long Short-Term Memory (LSTM), which combines two deep learning techniques, the Convolutional Neural Network (CNN) and Long Short-Term Memory, for text classification in news data. CNNs serve as feature extractors for the LSTMs on text input data and capture the spatial structure of words in a sentence, paragraph, or document. Because the dataset is imbalanced, we balance it with the Tomek-Link algorithm and then apply our model, which achieves a better F1-score (98%) and accuracy (99.71%) than existing works. The combination of deep learning techniques used in our approach is well suited to classifying imbalanced datasets with underrepresented categories, and our method outperforms other machine learning algorithms in text classification by a large margin. We also compare our results with traditional machine learning algorithms on both the imbalanced and the balanced datasets.
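The abstract names the Tomek-Link algorithm as the balancing step but gives no detail. As a rough illustration only, and not the authors' implementation, the sketch below shows the usual definition: two samples form a Tomek link when they are each other's nearest neighbour and carry different labels, and the majority-class member of each link is removed. The 2-D toy points, the labels "sports" and "tech", the function names, and the rule of leaving tied classes untouched are all assumptions made for this example; in a real pipeline the points would be text feature vectors.

```python
from collections import Counter
from math import dist


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def remove_majority_from_links(points, labels):
    """Drop the majority-class member of every Tomek link.

    Assumption for this sketch: if both classes are equally frequent,
    neither sample is removed.
    """
    counts = Counter(labels)
    drop = set()
    for i, j in tomek_links(points, labels):
        if counts[labels[i]] > counts[labels[j]]:
            drop.add(i)
        elif counts[labels[j]] > counts[labels[i]]:
            drop.add(j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Hypothetical toy data: five "sports" samples (majority), two "tech" (minority).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.2),
          (5.0, 5.0), (6.0, 6.0)]
labels = ["sports", "sports", "sports", "sports", "sports", "tech", "tech"]

links = tomek_links(points, labels)
kept_points, kept_labels = remove_majority_from_links(points, labels)
print(links)
print(Counter(kept_labels))
```

In this toy set the "sports" point sitting next to the "tech" cluster is the only sample removed, which cleans the class boundary rather than equalizing class counts. For real data, the imbalanced-learn library provides a `TomekLinks` under-sampler in `imblearn.under_sampling`; whether the authors used it is not stated in the abstract.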
Pages: 93048-93063
Number of pages: 16