Improving random forests by neighborhood projection for effective text classification

Cited by: 36
Authors
Salles, Thiago [1]
Goncalves, Marcos [1]
Rodrigues, Victor [1]
Rocha, Leonardo [2]
Affiliations
[1] Univ Fed Minas Gerais, Comp Sci Dept, Belo Horizonte, MG, Brazil
[2] Univ Fed Sao Joao del Rei, Comp Sci Dept, Sao Joao Del Rei, Brazil
Keywords
Classification; Random forests; Lazy learning; Nearest neighbors; NEAREST; SELECTION;
DOI
10.1016/j.is.2018.05.006
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In this article, we propose a lazy version of the traditional random forest (RF) classifier (called LazyNN_RF), specially designed for high-dimensional, noisy classification tasks. The LazyNN_RF "localized" training projection is composed of the examples that most closely resemble the example to be classified, obtained through nearest neighborhood projection of the training set. This projection filters out irrelevant data, ultimately avoiding some of the drawbacks of traditional random forests, such as overfitting due to very complex trees, especially in high-dimensional noisy datasets. In sum, our main contributions are: (i) the proposal and implementation of a novel lazy learner, based on the random forest classifier and nearest neighborhood projection of the training set, that excels in automatic text classification tasks, and (ii) a thorough and detailed experimental analysis that sheds light on the behavior, effectiveness and feasibility of our solution. By means of an extensive experimental evaluation, performed considering two text classification domains and a large set of baseline algorithms, we show that our approach is highly effective and feasible, making it a strong candidate for automatic text classification tasks when compared to state-of-the-art classifiers. © 2018 Elsevier Ltd. All rights reserved.
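The lazy scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cosine similarity measure, the parameter defaults, and the use of randomized one-split stumps in place of full decision trees are all assumptions made to keep the example short and dependency-free. For each query, the training set is first projected onto its k nearest neighbors, and only then is a small bagged ensemble trained on that projection and asked to vote.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two dense vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def project_neighborhood(X, y, query, k):
    """Keep only the k training examples most similar to the query."""
    ranked = sorted(range(len(X)), key=lambda i: cosine(X[i], query), reverse=True)
    top = ranked[:k]
    return [X[i] for i in top], [y[i] for i in top]


def fit_stump(X, y, rng):
    """Fit a one-split tree on one randomly chosen feature.

    This is a simplified stand-in for a full randomized decision tree.
    Returns (feature, threshold, left_label, right_label).
    """
    f = rng.randrange(len(X[0]))
    values = sorted({row[f] for row in X})
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [lab for row, lab in zip(X, y) if row[f] <= t]
        right = [lab for row, lab in zip(X, y) if row[f] > t]
        # Errors made when each side predicts its own majority label.
        err = (len(left) - Counter(left).most_common(1)[0][1]
               + len(right) - Counter(right).most_common(1)[0][1])
        if best is None or err < best[0]:
            best = (err, t, majority(left), majority(right))
    if best is None:  # feature is constant in this sample: no split possible
        m = majority(y)
        return f, 0.0, m, m
    return f, best[1], best[2], best[3]


def lazy_nn_forest_predict(X, y, query, k=5, n_trees=25, seed=0):
    """Classify one query with an ensemble trained only on its neighborhood."""
    rng = random.Random(seed)
    Xk, yk = project_neighborhood(X, y, query, k)
    votes = Counter()
    for _ in range(n_trees):
        # Bootstrap sample of the projected (local) training set.
        idx = [rng.randrange(len(Xk)) for _ in Xk]
        f, t, left_label, right_label = fit_stump(
            [Xk[i] for i in idx], [yk[i] for i in idx], rng)
        votes[left_label if query[f] <= t else right_label] += 1
    return votes.most_common(1)[0][0]


# Toy term-count vectors; the labels are illustrative only.
X = [[3, 0, 1], [4, 1, 0], [5, 0, 0], [0, 3, 1], [1, 4, 0], [0, 5, 0]]
y = ["sports", "sports", "sports", "politics", "politics", "politics"]
print(lazy_nn_forest_predict(X, y, [4, 0, 0], k=3))
```

Because the model is built per query, the cost shifts from training time to prediction time, which is the usual trade-off of lazy learners.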
Pages: 1-21
Number of pages: 21