CatBoost for big data: an interdisciplinary review

Cited by: 701
Authors
Hancock, John T. [1]
Khoshgoftaar, Taghi M. [1]
Affiliations
[1] Florida Atlantic Univ, 777 Glades Rd, Boca Raton, FL 33431 USA
Keywords
CatBoost; Big data; Categorical variable encoding; Ensemble methods; Machine learning; Decision tree; Gene selection
DOI
10.1186/s40537-020-00369-8
Chinese Library Classification number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
Gradient Boosted Decision Trees (GBDTs) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current GBDT implementations in order to use them effectively and make successful contributions. CatBoost is a member of the GBDT family of machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and to learn best practices both from studies that cast CatBoost in a positive light and from studies where CatBoost does not outshine other techniques, since both kinds of study offer lessons. Furthermore, as a Decision Tree-based algorithm, CatBoost is well suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost's effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in the literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach that covers studies related to CatBoost in a single work. This gives researchers an in-depth understanding that helps clarify the proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.
Pages: 45