A Unified Entropy-Based Distance Metric for Ordinal-and-Nominal-Attribute Data Clustering

Cited by: 33
Authors
Zhang, Yiqun [1]
Cheung, Yiu-Ming [1]
Tan, Kay Chen [2]
Affiliations
[1] Hong Kong Baptist Univ, Dept Comp Sci, Hong Kong, Peoples R China
[2] City Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Distance measurement; Clustering algorithms; Task analysis; Data analysis; Benchmark testing; Entropy; Categorical data; clustering algorithms; data analysis; distance metric; entropy; order information; ordinal attribute; K-MEANS ALGORITHM; DISSIMILARITY MEASURE; ROUGH APPROXIMATION; FEATURE-SELECTION; SETS; ASSOCIATION; INFORMATION;
DOI
10.1109/TNNLS.2019.2899381
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Ordinal data are common in many data mining and machine learning tasks. Compared to nominal data, the possible values (also called categories interchangeably) of an ordinal attribute are naturally ordered. Nevertheless, since the data values are not quantitative, the distance between two categories of an ordinal attribute is generally not well defined, which surely has a serious impact on the result of the quantitative analysis if an inappropriate distance metric is utilized. From the practical perspective, ordinal-and-nominal-attribute categorical data, i.e., categorical data associated with a mixture of nominal and ordinal attributes, is common, but the distance metric for such data has yet to be well explored in the literature. In this paper, within the framework of clustering analysis, we therefore first propose an entropy-based distance metric for ordinal attributes, which exploits the underlying order information among categories of an ordinal attribute for the distance measurement. Then, we generalize this distance metric and propose a unified one accordingly, which is applicable to ordinal-and-nominal-attribute categorical data. Compared with the existing metrics proposed for categorical data, the proposed metric is simple to use and nonparametric. More importantly, it reasonably exploits the underlying order information of ordinal attributes and statistical information of nominal attributes for distance measurement. Extensive experiments show that the proposed metric outperforms the existing counterparts on both the real and benchmark data sets.
Pages: 39-52
Number of pages: 14