Learning-Based Dissimilarity for Clustering Categorical Data

Cited by: 4

Authors:

Rivera Rios, Edgar Jacob ^{[1]}

Angel Medina-Perez, Miguel ^{[1]}

Lazo-Cortes, Manuel S. ^{[2]}

Monroy, Raul ^{[1]}

Affiliations:

[1] Tecnol Monterrey, Sch Sci & Engn, Estado De Mexico 52926, Mexico

[2] TecNM Inst Tecnol Tlalnepantla, Tlalnepantla De Baz 54070, Mexico

Source:

APPLIED SCIENCES-BASEL | 2021 / Vol. 11 / No. 08

Keywords:

dissimilarity; categorical data; clustering;

DOI:

10.3390/app11083509

Chinese Library Classification (CLC) number:

O6 [Chemistry];

Discipline classification code:

0703 ;

Abstract:

Comparing data objects is at the heart of machine learning. For continuous data, object dissimilarity is usually taken to be object distance; however, for categorical data, there is no universal agreement, for categories can be ordered in several different ways. Most existing category dissimilarity measures characterize the distance among the values an attribute may take using precisely the number of different values the attribute takes (the attribute space) and the frequency at which they occur. These kinds of measures overlook attribute interdependence, which may provide valuable information when capturing per-attribute object dissimilarity. In this paper, we introduce a novel object dissimilarity measure that we call Learning-Based Dissimilarity, for comparing categorical data. Our measure characterizes the distance between two categorical values of a given attribute in terms of how likely it is that such values are confused or not when all the dataset objects with the remaining attributes are used to predict them. To that end, we provide an algorithm that, given a target attribute, first learns a classification model in order to compute a confusion matrix for the attribute. Then, our method transforms the confusion matrix into a per-attribute dissimilarity measure. We have successfully tested our measure against 55 datasets gathered from the University of California, Irvine (UCI) Machine Learning Repository. Our results show that it surpasses, in terms of various performance indicators for data clustering, the most prominent distance relations put forward in the literature.
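The pipeline outlined in the abstract (predict a target attribute from the remaining ones, record a confusion matrix, turn it into a per-attribute dissimilarity) can be sketched as follows. This is only an illustration under stated assumptions: the abstract names neither the classifier nor the transformation, so the leave-one-out naive Bayes predictor, the symmetrized `1 - confusion rate` formula, and every function name here (`predict_value`, `confusion_matrix`, `value_dissimilarity`, `object_dissimilarity`) are hypothetical choices, not the authors' method. The toy dataset is invented for the example.

```python
from collections import Counter
import math


def predict_value(rows, target, query_index):
    """Guess rows[query_index][target] from the other attributes.

    Assumption: a Laplace-smoothed naive Bayes classifier trained on all
    other rows (leave-one-out). Requires at least two rows.
    """
    train = [r for i, r in enumerate(rows) if i != query_index]
    query = rows[query_index]
    class_counts = Counter(r[target] for r in train)
    best, best_score = None, -math.inf
    for label in sorted(class_counts):
        count = class_counts[label]
        score = math.log(count / len(train))
        for a in range(len(query)):
            if a == target:
                continue
            matches = sum(1 for r in train
                          if r[target] == label and r[a] == query[a])
            n_values = len({r[a] for r in rows})
            score += math.log((matches + 1) / (count + n_values))
        if score > best_score:
            best, best_score = label, score
    return best


def confusion_matrix(rows, target):
    """matrix[true][predicted] counts for one target attribute."""
    values = sorted({r[target] for r in rows})
    matrix = {u: {v: 0 for v in values} for u in values}
    for i, row in enumerate(rows):
        matrix[row[target]][predict_value(rows, target, i)] += 1
    return matrix


def value_dissimilarity(matrix):
    """Assumed transform: values that are often confused are close.

    d(u, v) = 1 - mean of the two row-normalized confusion rates,
    and d(u, u) = 0. This formula is illustrative only.
    """
    d = {}
    for u in matrix:
        for v in matrix:
            if u == v:
                d[u, v] = 0.0
                continue
            rate_uv = matrix[u][v] / sum(matrix[u].values())
            rate_vu = matrix[v][u] / sum(matrix[v].values())
            d[u, v] = 1.0 - (rate_uv + rate_vu) / 2
    return d


def object_dissimilarity(x, y, tables):
    """Sum the per-attribute dissimilarities of two objects."""
    return sum(tables[a][x[a], y[a]] for a in range(len(x)))


# Toy categorical dataset: (color, shape, fruit).
rows = [
    ("red", "round", "apple"), ("red", "round", "apple"),
    ("green", "round", "apple"), ("yellow", "long", "banana"),
    ("yellow", "long", "banana"), ("green", "long", "banana"),
]
tables = [value_dissimilarity(confusion_matrix(rows, a)) for a in range(3)]
print(object_dissimilarity(rows[0], rows[3], tables))
```

The resulting `tables` give one lookup per attribute, so any clustering routine that accepts a precomputed pairwise dissimilarity could consume `object_dissimilarity` directly.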

Pages: 17

