Summarizing categorical data by clustering attributes

Authors
Michael Mampaey
Jilles Vreeken
Affiliation
[1] University of Antwerp, Advanced Database Research and Modelling, Department of Mathematics and Computer Science
Source
Data Mining and Knowledge Discovery | 2013 / Vol. 26
Keywords
Attribute clustering; MDL; Summarization; Categorical data
DOI
Not available
Abstract
For a book, its title and abstract provide a good first impression of what to expect from it. For a database, obtaining a good first impression is typically not so straightforward. While low-order statistics only provide very limited insight, downright mining the data rapidly provides too much detail for such a quick glance. In this paper we propose a middle ground, and introduce a parameter-free method for constructing high-quality descriptive summaries of binary and categorical data. Our approach builds a summary by clustering attributes that strongly correlate, and uses the Minimum Description Length principle to identify the best clustering—without requiring a distance measure between attributes. Besides providing a practical overview of which attributes interact most strongly, these summaries can also be used as surrogates for the data, and can easily be queried. Extensive experimentation shows that our method discovers high-quality results: correlated attributes are correctly grouped, which is verified both objectively and subjectively. Our models can also be employed as surrogates for the data; as an example of this we show that we can quickly and accurately query the estimated supports of frequent generalized itemsets.
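The idea of grouping attributes by description length can be illustrated with a minimal sketch. This is not the authors' algorithm or encoding: the cost function below is a simplified two-part code (an assumed model cost per distinct value combination, plus the Shannon-optimal data cost under the empirical distribution), and the names `cluster_cost` and `greedy_attribute_clustering` are hypothetical. It only shows how a bottom-up merge can group correlated attributes without a distance measure or a user-set threshold.

```python
from collections import Counter
from itertools import combinations
from math import log2


def cluster_cost(rows, cluster):
    """Approximate number of bits to encode the columns in `cluster` (assumed cost)."""
    n = len(rows)
    counts = Counter(tuple(row[i] for i in cluster) for row in rows)
    # Data cost: optimal code lengths under the empirical joint distribution.
    data_bits = -sum(c * log2(c / n) for c in counts.values())
    # Model cost (an assumption of this sketch): every distinct value
    # combination stores one count plus one value per attribute.
    value_bits = sum(log2(len({row[i] for row in rows})) for i in cluster)
    model_bits = len(counts) * (log2(n + 1) + value_bits)
    return data_bits + model_bits


def greedy_attribute_clustering(rows):
    """Merge attribute clusters bottom-up while the total cost decreases."""
    clusters = [(i,) for i in range(len(rows[0]))]
    while len(clusters) > 1:
        best_gain, best_pair = 0.0, None
        for a, b in combinations(clusters, 2):
            gain = (cluster_cost(rows, a) + cluster_cost(rows, b)
                    - cluster_cost(rows, a + b))
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            break  # no merge shortens the description; stop without a threshold
        a, b = best_pair
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return sorted(clusters)


# Toy data: columns 0 and 1 are identical, column 2 is independent of both.
rows = [(a, a, b) for a in (0, 1) for b in (0, 1)] * 50
print(greedy_attribute_clustering(rows))  # → [(0, 1), (2,)]
```

In the toy data, merging the two identical columns saves bits because their joint distribution is no more complex than either column alone, while merging in the independent column only enlarges the model, so it stays separate.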
Pages: 130-173
Page count: 43