Identifying consistent statements about numerical data with dispersion-corrected subgroup discovery

Cited by: 30
Authors
Boley, Mario [1, 2, 3]
Goldsmith, Bryan R. [3]
Ghiringhelli, Luca M. [3]
Vreeken, Jilles [1, 2]
Affiliations
[1] Max Planck Inst Informat, Saarbrucken, Germany
[2] Saarland Univ, Saarbrucken, Germany
[3] Max Planck Gesell, Fritz Haber Inst, Berlin, Germany
Funding
EU Horizon 2020
Keywords
Subgroup discovery; Local pattern discovery; Branch-and-bound search
DOI
10.1007/s10618-017-0520-3
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Existing algorithms for subgroup discovery with numerical targets do not optimize the error or target variable dispersion of the groups they find. This often leads to unreliable or inconsistent statements about the data, rendering practical applications, especially in scientific domains, futile. Therefore, we here extend the optimistic estimator framework for optimal subgroup discovery to a new class of objective functions: we show how tight estimators can be computed efficiently for all functions that are determined by subgroup size (non-decreasing dependence), the subgroup median value, and a dispersion measure around the median (non-increasing dependence). In the important special case when dispersion is measured using the mean absolute deviation from the median, this novel approach yields a linear time algorithm. Empirical evaluation on a wide range of datasets shows that, when used within branch-and-bound search, this approach is highly efficient and indeed discovers subgroups with much smaller errors.
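As a rough illustration of the class of objective functions the abstract describes, the following minimal sketch scores a subgroup by its relative size, its median shift, and its mean absolute deviation from the median. The function names and the particular product form are assumptions made for this example and are not taken from the record; the sketch also assumes the population has a nonzero dispersion and a maximum above its median.

```python
from statistics import median


def mean_abs_dev_from_median(values):
    """Mean absolute deviation of `values` around their median."""
    med = median(values)
    return sum(abs(v - med) for v in values) / len(values)


def dispersion_corrected_score(subgroup, population):
    """Hypothetical objective (an assumption for illustration only).

    Non-decreasing in subgroup size, increasing in the subgroup median,
    and non-increasing in the dispersion around the median.
    Assumes max(population) > median(population) and a nonzero
    population dispersion, so neither division is by zero.
    """
    coverage = len(subgroup) / len(population)
    pop_med = median(population)
    # Median shift, normalized by the largest possible shift.
    med_shift = max(0.0, (median(subgroup) - pop_med) / (max(population) - pop_med))
    # Dispersion reduction relative to the whole population.
    pop_amd = mean_abs_dev_from_median(population)
    disp_gain = max(0.0, 1.0 - mean_abs_dev_from_median(subgroup) / pop_amd)
    return coverage * med_shift * disp_gain


population = [1, 2, 3, 4, 10, 11, 12, 13]
tight = [10, 11, 12, 13]   # high median, small spread
spread = [3, 4, 12, 13]    # slightly higher median, population-level spread
tight_score = dispersion_corrected_score(tight, population)
spread_score = dispersion_corrected_score(spread, population)
```

In this toy data the tightly clustered subgroup scores higher than the widely spread one, which is the behavior a dispersion-corrected objective is meant to reward.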
Pages: 1391-1418
Number of pages: 28