What is Machine Learning? A Primer for the Epidemiologist

Cited by: 365
Authors
Bi, Qifang [1]
Goodman, Katherine E. [1]
Kaminsky, Joshua [1]
Lessler, Justin [1]
Affiliations
[1] Johns Hopkins Univ, Bloomberg Sch Publ Hlth, Dept Epidemiol, Baltimore, MD USA
Keywords
Big Data; ensemble models; machine learning; SUPPORT VECTOR MACHINE; GENE-EXPRESSION; NEURAL-NETWORKS; LOGISTIC-REGRESSION; GLOBAL DISTRIBUTION; DECISION TREE; BRAIN-TUMORS; NAIVE BAYES; CLASSIFICATION; PREDICTION
DOI
10.1093/aje/kwz189
Chinese Library Classification number
R1 [Preventive Medicine, Hygiene]
Discipline classification code
1004; 120402
Abstract
Machine learning is a branch of computer science that has the potential to transform epidemiologic sciences. Amid a growing focus on "Big Data," it offers epidemiologists new tools to tackle problems for which classical methods are not well-suited. In order to critically evaluate the value of integrating machine learning algorithms and existing methods, however, it is essential to address language and technical barriers between the two fields that can make it difficult for epidemiologists to read and assess machine learning studies. Here, we provide an overview of the concepts and terminology used in machine learning literature, which encompasses a diverse set of tools with goals ranging from prediction to classification to clustering. We provide a brief introduction to 5 common machine learning algorithms and 4 ensemble-based approaches. We then summarize epidemiologic applications of machine learning techniques in the published literature. We recommend approaches to incorporate machine learning in epidemiologic research and discuss opportunities and challenges for integrating machine learning and existing epidemiologic research methods.
Pages: 2222-2239
Page count: 18