Reproducible clustering with non-Euclidean distances: a simulation and case study

Cited by: 1
Authors
Staples, Lauren [1]
Ring, Janelle [2]
Fontana, Scott [2]
Stradwick, Christina [1]
DeMaio, Joe [1]
Ray, Herman [1]
Zhang, Yifan [1]
Zhang, Xinyan [1]
Affiliations
[1] Kennesaw State Univ, Sch Data Sci & Analyt, 3391 Town Point Dr NW, Kennesaw, GA 30144 USA
[2] Provider Consulting & Analyt, BlueCross BlueShield Tennessee, 1 Cameron Cir, Chattanooga, TN 37402 USA
Keywords
K-means; K-medoids; Jaccard; Edit distance; Reproducibility; Prediction strength; Clustering; Non-Euclidean; Initialization; Validation; Algorithm
DOI
10.1007/s41060-023-00429-1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Certain categorical sequence clustering applications require path connectivity, such as the clustering of DNA sequences, click-paths through web-user sessions, or paths of care built from sequences of patient medical billing codes. K-means and k-medoids clustering with non-Euclidean distance metrics, such as the Jaccard or edit distance, maintains such path connectivity. Although these methods have enjoyed success in these domains, the limits of accurate cluster recovery under these conditions have not yet been defined. As a first step toward this goal, we performed a simulation study using k-means and k-medoids clustering with non-Euclidean distances and show that performance deteriorates beyond a certain level of noise and as the number of clusters increases. However, we identify initialization strategies that improve cluster recovery in the presence of noise. We use the Prediction Strength method of Tibshirani and Walther (J Comput Graph Stat 14(3):511-528, 2005), which sets up a hypothesis test of whether the data have clustering structure (that is, whether the clusters are reproducible), with the null hypothesis being that there is none. We then applied the framework to perinatal episodes of care; the clusters split reproducibly and organically between Cesarean and vaginal deliveries, which is not itself a clinical finding but sensibly validates the approach. Further visualizations of the clusters revealed subclusters that split along groups of physicians, cost and risk scores, warranting the outlined future work on improving the resolution of this framework.
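To make the abstract's setup concrete, here is a minimal sketch of k-medoids clustering over categorical sequences using the Jaccard and edit distances. It is not the authors' implementation: the function names, the toy sequences, and the fixed initial medoids are all assumptions made for illustration, and neither the paper's initialization strategies nor the Prediction Strength procedure is reproduced.

```python
from typing import List, Sequence, Tuple


def jaccard_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the sets of codes (order and repeats ignored)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def assign(dist: List[List[float]], medoids: List[int]) -> List[int]:
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist: List[List[float]], init_medoids: List[int],
              max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Alternate assignment and medoid update on a precomputed distance matrix."""
    medoids = list(init_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # empty cluster: keep the previous medoid
                new_medoids.append(old)
                continue
            # The new medoid minimizes total distance to the cluster's members.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Hypothetical toy sequences (for example, billing-code paths); not from the paper.
sequences = [
    ["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C", "D"],
    ["X", "Y", "Z"], ["X", "Y"], ["X", "Z", "Y"],
]
dist = [[edit_distance(s, t) for t in sequences] for s in sequences]
medoids, labels = k_medoids(dist, init_medoids=[0, 3])
print(medoids, labels)
```

Because k-medoids only needs pairwise distances, swapping `edit_distance` for `jaccard_distance` when building `dist` changes the metric without touching the clustering loop. The edit distance is order-sensitive, whereas the Jaccard distance compares only which codes occur.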
Pages: 20