Categorical data clustering: 25 years beyond K-modes

Cited by: 2
Authors
Dinh, Tai [1, 2]
Wong, Hauchi [2]
Fournier-Viger, Philippe [3]
Lisik, Daniil [4]
Ha, Minh-Quyet [5]
Dam, Hieu-Chi [5]
Huynh, Van-Nam [5]
Affiliations
[1] CMC Univ, Dich Vong Hau Ward, 11 Duy Tan St, Hanoi, Vietnam
[2] Kyoto Coll Grad Studies Informat, Sakyo Ward, 7 Tanaka Monzencho, Kyoto, Kyoto, Japan
[3] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen, Guangdong, Peoples R China
[4] Univ Gothenburg, Dept Cell & Mol Biol, Medicinaregatan 1F, S-41390 Gothenburg, Sweden
[5] Japan Adv Inst Sci & Technol, 1-1 Asahidai, Nomi, Ishikawa, Japan
Keywords
Data mining; Cluster analysis; Categorical data; Literature review; Artificial intelligence; Machine learning; LATENT CLASS; SET APPROACH; ALGORITHM; ATTRIBUTE; INFORMATION; FRAMEWORK; EFFICIENT; INITIALIZATION; DISTANCE; SEARCH;
DOI
10.1016/j.eswa.2025.126608
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The clustering of categorical data is a common and important task in computer science, offering profound implications across a spectrum of applications. Unlike purely numerical data, categorical data often lack inherent ordering as in nominal data, or have varying levels of order as in ordinal data, thus requiring specialized methodologies for efficient organization and analysis. This review provides a comprehensive synthesis of categorical data clustering in the past twenty-five years, starting from the introduction of K-MODES. It elucidates the pivotal role of categorical data clustering in diverse fields such as health sciences, natural sciences, social sciences, education, engineering, and economics. Practical comparisons are conducted for algorithms having public implementations, highlighting distinguishing clustering methodologies and revealing the performance of recent algorithms on several benchmark categorical datasets. Finally, challenges and opportunities in the field are discussed.
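As a rough illustration of why unordered categories call for specialized methods, the sketch below shows a K-MODES-style loop in plain Python. It is not the implementation compared in the review: the function names, the toy records, and the choice of starting modes are all hypothetical. It assumes the simple matching dissimilarity (the number of attributes on which two records differ) and uses per-attribute modes in place of numerical means.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Most frequent category of each attribute across the given records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, initial_modes, max_iter=100):
    """Alternate between assigning records to the nearest mode and
    recomputing each cluster's mode, until assignments stop changing."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode under simple matching dissimilarity.
        new_labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: recompute the mode of every non-empty cluster.
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical toy data: (colour, size, shape).
records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("green", "large", "square"),
]

labels, modes = k_modes(records, [records[0], records[3]])
print(labels)
print(modes)
```

The starting modes are fixed here only to keep the example deterministic; the result of this kind of procedure generally depends on how the modes are initialized.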
Pages: 49
Related papers
219 in total
[1]   Clinical Characterization of Data-Driven Diabetes Clusters of Pediatric Type 2 Diabetes [J].
Abbasi, Mahsan ;
Tosur, Mustafa ;
Astudillo, Marcela ;
Refaey, Ahmad ;
Sabharwal, Ashutosh ;
Redondo, Maria J. .
PEDIATRIC DIABETES, 2023, 2023
[2]   Phenotypic Clusters Predict Outcomes in a Longitudinal Interstitial Lung Disease Cohort [J].
Adegunsoye, Ayodeji ;
Oldham, Justin M. ;
Chung, Jonathan H. ;
Montner, Steven M. ;
Lee, Cathryn ;
Witt, Leah J. ;
Stahlbaum, Danielle ;
Bermea, Rene S. ;
Chen, Lena W. ;
Hsu, Scully ;
Husain, Aliya N. ;
Noth, Imre ;
Vij, Rekha ;
Strek, Mary E. ;
Churpek, Matthew .
CHEST, 2018, 153 (02) :349-360
[3]   Malicious accounts: Dark of the social networks [J].
Adewole, Kayode Sakariyah ;
Anuar, Nor Badrul ;
Kamsin, Amirrudin ;
Varathan, Kasturi Dewi ;
Razak, Syed Abdul .
JOURNAL OF NETWORK AND COMPUTER APPLICATIONS, 2017, 79 :41-67
[4]  
Aggarwal C.C., 2013, Data Clustering: Algorithms and Applications, P1
[5]   On clustering massive text and categorical data streams [J].
Aggarwal, Charu C. ;
Yu, Philip S. .
KNOWLEDGE AND INFORMATION SYSTEMS, 2010, 24 (02) :171-196
[6]  
Agresti A., 2012, Categorical data analysis, V792
[7]   A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set [J].
Ahmad, Amir ;
Dey, Lipika .
PATTERN RECOGNITION LETTERS, 2007, 28 (01) :110-118
[8]   Evaluation of Inherited Resistance Genes of Bacterial Leaf Blight, Blast and Drought Tolerance in Improved Rice Lines [J].
Akos, Ibrahim Silas ;
Rafii, Mohd Y. ;
Ismail, Mohd Razi ;
Ramlee, Shairul Izan ;
Shamsudin, Noraziyah Abd Aziz ;
Ramli, Asfaliza ;
Chukwu, Samuel Chibuike ;
Swaray, Senesie ;
Jalloh, Momodu .
RICE SCIENCE, 2021, 28 (03) :279-288
[9]   DISA tool: Discriminative and informative subspace assessment with categorical and numerical outcomes [J].
Alexandre, Leonardo ;
Costa, Rafael S. ;
Henriques, Rui .
PLOS ONE, 2022, 17 (10)
[10]   Clustering in graphs and hypergraphs with categorical edge labels [J].
Amburg, Ilya ;
Veldt, Nate ;
Benson, Austin .
WEB CONFERENCE 2020: PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE (WWW 2020), 2020, :706-717