Predictive Modeling with Big Data: Is Bigger Really Better?

Cited by: 108
Authors
de Fortuny, Enric Junque [1]
Martens, David [1]
Provost, Foster [2]
Affiliations
[1] Univ Antwerp, Dept Engn Management, Appl Data Min Res Grp, B-2000 Antwerp, Belgium
[2] NYU, Leonard N Stern Sch Business, Dept Informat Operat & Management Sci, New York, NY USA
Keywords
DOI
10.1089/big.2013.0037
CLC Number
TP39 [Applications of Computers];
Subject Classification Codes
081203; 0835;
Abstract
With the increasingly widespread collection and processing of "big data," there is natural interest in using these data assets to improve decision making. One of the best understood ways to use data to improve decision making is via predictive analytics. An important, open question is: to what extent do larger data actually lead to better predictive models? In this article we empirically demonstrate that when predictive models are built from sparse, fine-grained data (such as data on low-level human behavior), we continue to see marginal increases in predictive performance even to very large scale. The empirical results are based on data drawn from nine different predictive modeling applications, from book reviews to banking transactions. This study provides a clear illustration that larger data indeed can be more valuable assets for predictive analytics. This implies that institutions with larger data assets, plus the skill to take advantage of them, potentially can obtain substantial competitive advantage over institutions without such access or skill. Moreover, the results suggest that it is worthwhile for companies with access to such fine-grained data, in the context of a key predictive task, to gather both more data instances and more possible data features. As an additional contribution, we introduce an implementation of the multivariate Bernoulli Naive Bayes algorithm that can scale to massive, sparse data.
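The abstract's closing contribution is a multivariate Bernoulli Naive Bayes implementation that scales to massive, sparse data. The sketch below is illustrative only and is not the authors' released code: it uses the standard sparse reformulation in which the per-class score is precomputed as if every feature were absent, and then corrected only for the features that are actually present, so scoring cost scales with the number of non-zero entries rather than with the full feature dimensionality. Python with NumPy/SciPy, the class name SparseBernoulliNB, and the Laplace smoothing default are assumptions made for illustration.

```python
# Illustrative sketch (not the authors' implementation): multivariate Bernoulli
# Naive Bayes arranged so that scoring only touches the non-zero features of a
# sparse binary instance matrix.
import numpy as np
from scipy.sparse import csr_matrix


class SparseBernoulliNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed default)

    def fit(self, X, y):
        """X: (n_samples, n_features) sparse 0/1 matrix; y: class labels."""
        X = csr_matrix(X)
        y = np.asarray(y)
        n, d = X.shape
        self.classes_ = np.unique(y)
        log_prior, feat_log_p, feat_log_q = [], [], []
        for c in self.classes_:
            mask = (y == c)
            n_c = mask.sum()
            # How often each feature is active within class c.
            counts = np.asarray(X[mask].sum(axis=0)).ravel()
            p = (counts + self.alpha) / (n_c + 2.0 * self.alpha)
            log_prior.append(np.log(n_c / n))
            feat_log_p.append(np.log(p))        # log P(feature present | c)
            feat_log_q.append(np.log1p(-p))     # log P(feature absent  | c)
        self.log_prior_ = np.array(log_prior)
        self.feat_log_p_ = np.vstack(feat_log_p)
        self.feat_log_q_ = np.vstack(feat_log_q)
        # Precompute the "all features absent" baseline per class; prediction
        # then only needs a correction for the features that are present.
        self.baseline_ = self.feat_log_q_.sum(axis=1)
        return self

    def decision_scores(self, X):
        X = csr_matrix(X)
        # Sparse matrix product: only non-zero entries contribute.
        correction = X @ (self.feat_log_p_ - self.feat_log_q_).T
        return self.log_prior_ + self.baseline_ + np.asarray(correction)

    def predict(self, X):
        return self.classes_[np.argmax(self.decision_scores(X), axis=1)]


# Hypothetical usage on a small random sparse binary matrix:
rng = np.random.default_rng(0)
X = csr_matrix((rng.random((1000, 5000)) < 0.01).astype(np.int8))
y = rng.integers(0, 2, size=1000)
model = SparseBernoulliNB().fit(X, y)
print(model.predict(X[:5]))
```

The design choice to fold the "all features absent" term into a per-class constant is what keeps prediction cost proportional to the number of active features per instance, which is the property needed for the sparse, fine-grained behavioral data the abstract describes.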
Pages: BD215 / +
Page count: 13