Robust simultaneous positive data clustering and unsupervised feature selection using generalized inverted Dirichlet mixture models

Cited by: 31
Authors
Al Mashrgy, Mohamed [1]
Bdiri, Taoufik [1]
Bouguila, Nizar [2]
Affiliations
[1] Concordia Univ, Dept Elect & Comp Engn, Montreal, PQ H3G 1T7, Canada
[2] Concordia Univ, CIISE, Montreal, PQ H3G 1T7, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC);
Keywords
Positive data; Generalized inverted Dirichlet; Finite mixture; Feature selection; Outliers; Model selection; Images clustering; VARIABLE SELECTION; REGRESSION;
DOI
10.1016/j.knosys.2014.01.007
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The discovery, extraction and analysis of knowledge from data rely generally upon the use of unsupervised learning methods, in particular clustering approaches. Much recent research in clustering and data engineering has focused on the consideration of finite mixture models which allow to reason in the face of uncertainty and to learn by example. The adoption of these models becomes a challenging task in the presence of outliers and in the case of high-dimensional data which necessitates the deployment of feature selection techniques. In this paper we tackle simultaneously the problems of cluster validation (i.e. model selection), feature selection and outliers rejection when clustering positive data. The proposed statistical framework is based on the generalized inverted Dirichlet distribution that offers a more practical and flexible alternative to the inverted Dirichlet which has a very restrictive covariance structure. The learning of the parameters of the resulting model is based on the minimization of a message length objective incorporating prior knowledge. We use synthetic data and real data generated from challenging applications, namely visual scenes and objects clustering, to demonstrate the feasibility and advantages of the proposed method. (C) 2014 Elsevier B.V. All rights reserved.
Pages: 182-195
Number of pages: 14