An Algorithm for Clustering Categorical Data With Set-Valued Features

被引:20
作者
Cao, Fuyuan [1 ]
Huang, Joshua Zhexue [2 ]
Liang, Jiye [1 ]
Zhao, Xingwang [1 ]
Meng, Yinfeng [1 ]
Feng, Kai [1 ]
Qian, Yuhua [1 ]
机构
[1] Shanxi Univ, Sch Comp & Informat Technol, Minist Educ, Key Lab Computat Intelligence & Chinese Informat, Taiyuan 030006, Shanxi, Peoples R China
[2] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen 518060, Peoples R China
基金
中国国家自然科学基金;
关键词
Categorical data set-valued feature; set-valued modes; SV-k-modes algorithm; IMPACT;
D O I
10.1109/TNNLS.2017.2770167
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
In data mining, objects are often represented by a set of features, where each feature of an object has only one value. However, in reality, some features can take on multiple values, for instance, a person with several job titles, hobbies, and email addresses. These features can be referred to as set-valued features and are often treated with dummy features when using existing data mining algorithms to analyze data with set-valued features. In this paper, we propose an SV-k-modes algorithm that clusters categorical data with set-valued features. In this algorithm, a distance function is defined between two objects with set-valued features, and a set-valued mode representation of cluster centers is proposed. We develop a heuristic method to update cluster centers in the iterative clustering process and an initialization algorithm to select the initial cluster centers. The convergence and complexity of the SV-k-modes algorithm are analyzed. Experiments are conducted on both synthetic data and real data from five different applications. The experimental results have shown that the SV-k-modes algorithm performs better when clustering real data than do three other categorical clustering algorithms and that the algorithm is scalable to large data.
引用
收藏
页码:4593 / 4606
页数:14
相关论文
共 28 条
[1]  
[Anonymous], 1901, Bull. Soc. Vaudoise Sci. Nat
[2]   The Impact of Cluster Representatives on the Convergence of the K-Modes Type Clustering [J].
Bai, Liang ;
Liang, Jiye ;
Dang, Chuangyin ;
Cao, Fuyuan .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2013, 35 (06) :1509-1522
[3]   SET PARTITIONING VIA INCLUSION-EXCLUSION [J].
Bjorklund, Andreas ;
Husfeldt, Thore ;
Koivisto, Mikko .
SIAM JOURNAL ON COMPUTING, 2009, 39 (02) :546-563
[4]   Trend analysis of categorical data streams with a concept change method [J].
Cao, Fuyuan ;
Huang, Joshua Zhexue ;
Liang, Jiye .
INFORMATION SCIENCES, 2014, 276 :160-173
[5]   A new initialization method for categorical data clustering [J].
Cao, Fuyuan ;
Liang, Jiye ;
Bai, Liang .
EXPERT SYSTEMS WITH APPLICATIONS, 2009, 36 (07) :10223-10228
[6]  
Carroll J. D., 1994, P CLASS SOC N AM M P
[7]   K-modes clustering [J].
Chaturvedi, A ;
Green, PE ;
Carroll, JD .
JOURNAL OF CLASSIFICATION, 2001, 18 (01) :35-55
[8]   Solving the multiple instance problem with axis-parallel rectangles [J].
Dietterich, TG ;
Lathrop, RH ;
LozanoPerez, T .
ARTIFICIAL INTELLIGENCE, 1997, 89 (1-2) :31-71
[9]  
Giannotti F., 2002, Principles of Data Mining and Knowledge Discovery. 6th European Conference, PKDD 2002. Proceedings (Lecture Notes in Artificial Intelligence Vol.2431), P175
[10]   ROCK: A robust clustering algorithm for categorical attributes [J].
Guha, S ;
Rastogi, R ;
Shim, K .
15TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 1999, :512-521