Constrained Clustering with Seeds and Term Weighting Scheme

Citations: 0
Authors
Buatoom, Uraiwan [1]
Kongprawechnon, Waree [1]
Theeramunkong, Thanaruk [1, 2]
Affiliations
[1] Thammasat Univ, Sirindhorn Int Inst Technol, Pathum Thani, Thailand
[2] Royal Soc Thailand, Bangkok, Thailand
Source
2018 THIRTEENTH INTERNATIONAL CONFERENCE ON KNOWLEDGE, INFORMATION AND CREATIVITY SUPPORT SYSTEMS (KICSS) | 2018
Keywords
Semi-supervised; Term weighting; Distribution class; Ambiguity class and Seeded k-means; SEMI-SUPERVISED CLASSIFICATION;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Traditional unsupervised learning is blind, and its performance relies on the choice of initial seeds. Constrained clustering instead uses a small number of labeled instances to partly guide the clustering of a large number of unlabeled instances. It focuses on a set of predefined classes, with the aim of increasing the performance of supervised and unsupervised learning by using constraints. This paper proposes a semi-supervised learning approach based on seeded constrained clustering, in which the clustering guidance comes from the statistics of a small set of labeled data. In contrast to existing seeded K-Means approaches, where the labeled instances are specified directly, the proposed work investigates how term weighting obtained from a training set affects the seeded-clustering results. Experimental results are reported for three groups of term-weighting statistics (in-collection, intra-class, and inter-class), based on frequencies/distributions and on class ambiguity expressed through an entropy value. Text datasets are studied in the experiments. The results also indicate that the term weighting scheme is a potential means of controlling and guiding the initialization and clustering process, compared with a standard term weighting scheme.
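The general idea of seeded k-means with a per-term weight vector can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `seeded_kmeans`, the toy document-term matrix, and the weight vector are all hypothetical, and the weights here are hand-picked rather than derived from the in-collection, intra-class, or inter-class statistics the abstract mentions. The sketch assumes squared Euclidean distance and initializes each centroid from the mean of that class's labeled seeds.

```python
import numpy as np

def seeded_kmeans(X, seed_idx, seed_labels, term_weights=None, n_iter=20):
    """Illustrative seeded k-means: centroids start at the mean of each class's seeds."""
    X = np.asarray(X, dtype=float)
    seed_idx = np.asarray(seed_idx)
    seed_labels = np.asarray(seed_labels)
    if term_weights is not None:
        # Scale every term (column) by its weight before clustering.
        X = X * np.asarray(term_weights, dtype=float)
    classes = np.unique(seed_labels)
    # One initial centroid per class, computed from the labeled seeds only.
    centroids = np.stack(
        [X[seed_idx[seed_labels == c]].mean(axis=0) for c in classes]
    )
    for _ in range(n_iter):
        # Squared Euclidean distance from every document to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for k in range(len(classes)):
            members = X[assign == k]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return classes[assign], centroids

# Hypothetical toy data: 6 documents, 3 terms; the third term occurs
# equally often everywhere, so the assumed weight vector sets it to 0.
X = [[3, 0, 1], [2, 0, 1], [0, 3, 1], [0, 2, 1], [2, 1, 1], [1, 2, 1]]
labels, centroids = seeded_kmeans(
    X, seed_idx=[0, 2], seed_labels=[0, 1], term_weights=[1.0, 1.0, 0.0]
)
print(labels)
```

In this sketch the seeds only fix the starting centroids; after the first iteration they are free to move like any other document. A variant that keeps seed labels fixed throughout would need an extra step in the assignment loop.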
Pages: 99-104
Number of pages: 6