Clustering of imbalanced high-dimensional media data

Cited by: 0
Authors
Šárka Brodinová
Maia Zaharieva
Peter Filzmoser
Thomas Ortner
Christian Breiteneder
Affiliations
[1] TU Wien, Institute of Software Technology and Interactive Systems
[2] University of Vienna, Multimedia Information Systems Group
[3] TU Wien, Institute of Statistics and Mathematical Methods in Economics
Source
Advances in Data Analysis and Classification | 2018 / Vol. 12
Keywords
Clustering; Imbalanced data; High-dimensional data; Media data; LOF; 62H30
DOI
Not available
Abstract
Media content in large repositories usually exhibits multiple groups of strongly varying sizes. Media of potential interest often form notably smaller groups. Such groups differ so much from the remaining data that they may be worth examining in more detail. In contrast, media with popular content appear in larger groups. Identifying groups of varying sizes is addressed by clustering of imbalanced data. Clustering highly imbalanced media groups is additionally challenged by the high dimensionality of the underlying features. In this paper, we present the imbalanced clustering (IClust) algorithm designed to reveal group structures in high-dimensional media data. IClust employs an existing clustering method to find a large initial set of potentially highly pure clusters, which are then successively merged. The main advantage of IClust is that the number of clusters does not have to be pre-specified and that no specific assumptions about the cluster or data characteristics need to be made. Experiments on real-world media data demonstrate that, compared with existing methods, IClust better identifies media groups, especially small ones.
Pages: 261-284
Page count: 23
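
The two-stage idea in the abstract (over-cluster first, then merge successively) can be illustrated with a toy sketch. This is not the published IClust algorithm: the abstract does not say which base clustering method or merge criterion is used. The sketch assumes plain k-means with farthest-point seeding for the first stage and a centroid-distance threshold for the merge stage. The function names (`over_cluster`, `merge_close`) and the parameters `k` and `threshold` are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of points, as a tuple."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def farthest_point_seeds(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from all seeds chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds

def over_cluster(points, k, iters=10):
    """Stage 1 (assumed): Lloyd's k-means with a deliberately large k,
    so that the resulting clusters are small and likely pure."""
    centers = farthest_point_seeds(points, k)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return [g for g in groups if g]  # drop empty clusters

def merge_close(clusters, threshold):
    """Stage 2 (assumed rule): repeatedly merge the two clusters with the
    closest centroids until the closest pair is farther apart than threshold."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        pairs = [(math.dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d > threshold:
            break
        clusters[i].extend(clusters.pop(j))
    return clusters

# Toy imbalanced data: one large group and two much smaller ones.
large = [(0.1 * (i % 8), 0.1 * (i // 8)) for i in range(40)]
small = [(10.0 + 0.1 * i, 10.0) for i in range(5)]
tiny = [(-10.0, 10.0 + 0.1 * i) for i in range(3)]
points = large + small + tiny

initial = over_cluster(points, k=8)
final = merge_close(initial, threshold=3.0)
sizes = sorted(len(c) for c in final)
print(sizes)  # [3, 5, 40]
```

The number of final clusters is not passed in; it follows from where merging stops. In this sketch a hand-picked threshold decides that, which is a simplification made for the example only.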