Non-redundant data clustering

Cited by: 27
Authors
Gondek, D [1]
Hofmann, T [1]
Affiliations
[1] Brown Univ, Dept Comp Sci, Providence, RI 02912 USA
Source
FOURTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS | 2004
DOI
10.1109/ICDM.2004.10104
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data clustering is a popular approach for automatically finding classes, concepts, or groups of patterns. In practice this discovery process should avoid redundancies with existing knowledge about class structures or groupings, and reveal novel, previously unknown aspects of the data. In order to deal with this problem, we present an extension of the information bottleneck framework, called coordinated conditional information bottleneck, which takes negative relevance information into account by maximizing a conditional mutual information score subject to constraints. Algorithmically, one can apply an alternating optimization scheme that can be used in conjunction with different types of numeric and non-numeric attributes. We present experimental results for applications in text mining and computer vision.
Pages: 75-82
Number of pages: 8