A Scalable Text Classification Using Naive Bayes with Hadoop Framework

Cited by: 2
Authors
Temesgen, Mulualem Mheretu [1]
Lemma, Dereje Teferi [2]
Affiliations
[1] Assosa Univ, College Comp & Informat, Assosa, Ethiopia
[2] Addis Ababa Univ, Sch Informat Sci, Addis Ababa, Ethiopia
Source
INFORMATION AND COMMUNICATION TECHNOLOGY FOR DEVELOPMENT FOR AFRICA (ICT4DA 2019) | 2019 / Vol. 1026
Keywords
Machine learning; Text classification; Naive Bayes; MapReduce; Hadoop
DOI
10.1007/978-3-030-26630-1_25
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Automated text classification is the assignment of documents to predefined class labels or categories using machine learning algorithms. It is an important machine learning task in which an algorithm assigns each document to its appropriate category or genre. For example, the documents might be news items and the categories might be business, sports, health, financial, and social news. Because of the volume of such textual data and its presumed exponential growth, classical data mining techniques may not perform efficiently. To address this, the scalable machine learning library Apache Mahout can be used with Hadoop to improve the algorithm's performance and computation time. In this study, the Naive Bayes classification algorithm is implemented on top of Hadoop to build an automatic document categorizer using the MapReduce programming model. Text documents from the Addis Ababa University institutional repository of electronic theses and dissertations are used as the training and evaluation dataset. The proposed model achieved an accuracy of 79.06%. The result shows that the system can categorize large thesis documents into their predefined classes with promising accuracy.
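To make the idea of training Naive Bayes with the MapReduce programming model concrete, the following is a minimal single-process sketch in plain Python. It is not the authors' implementation and does not use the Hadoop or Apache Mahout APIs. The function names (`map_doc`, `reduce_counts`, `train`, `classify`), the whitespace tokenizer, the multinomial model, and the add-one smoothing are all assumptions made for illustration. The map step emits count pairs per document, the reduce step sums them by key, and the summed counts are enough to score a new document.

```python
import math
from collections import Counter, defaultdict


def map_doc(label, text):
    """Map step (illustrative): emit one document count and one count per token."""
    yield ("doc", label), 1
    for token in text.lower().split():  # assumed tokenizer: lowercase + whitespace split
        yield ("term", label, token), 1


def reduce_counts(pairs):
    """Reduce step (illustrative): sum all values that share the same key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train(docs):
    """Run map and reduce over (label, text) pairs and collect the count tables."""
    pairs = (kv for label, text in docs for kv in map_doc(label, text))
    totals = reduce_counts(pairs)
    doc_counts = Counter()
    term_counts = defaultdict(Counter)
    for key, n in totals.items():
        if key[0] == "doc":
            doc_counts[key[1]] = n
        else:
            term_counts[key[1]][key[2]] = n
    vocab = {t for counts in term_counts.values() for t in counts}
    return doc_counts, term_counts, vocab


def classify(text, model):
    """Pick the label with the highest log posterior (add-one smoothing assumed)."""
    doc_counts, term_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_terms = sum(term_counts[label].values())
        score = math.log(doc_counts[label] / n_docs)  # log prior
        for token in text.lower().split():
            if token in vocab:  # tokens never seen in training are skipped
                count = term_counts[label][token]
                score += math.log((count + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Calling `train` on a list of `(label, text)` pairs and passing the result to `classify` labels a new document. Because the reduce step only sums counts per key, the same logic could in principle be split across many mappers and reducers, which is the property that makes Naive Bayes a natural fit for MapReduce.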
Pages: 291-300
Page count: 10