Reproducible clustering with non-Euclidean distances: a simulation and case study

Cited by: 1
Authors
Staples, Lauren [1]
Ring, Janelle [2]
Fontana, Scott [2]
Stradwick, Christina [1]
DeMaio, Joe [1]
Ray, Herman [1]
Zhang, Yifan [1]
Zhang, Xinyan [1]
Affiliations
[1] Kennesaw State Univ, Sch Data Sci & Analyt, 3391 Town Point Dr NW, Kennesaw, GA 30144 USA
[2] Provider Consulting & Analyt, BlueCross BlueShield Tennessee, 1 Cameron Cir, Chattanooga, TN 37402 USA
Keywords
K-means; K-medoids; Jaccard; Edit distance; Reproducibility; Prediction strength; Clustering; Non-Euclidean; Initialization; Validation; Algorithm
DOI
10.1007/s41060-023-00429-1
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Certain categorical sequence clustering applications require path connectivity, such as the clustering of DNA, of click-paths through web-user sessions, or of paths of care represented as sequences of patient medical billing codes. K-means and k-medoids clustering with non-Euclidean distance metrics such as the Jaccard or edit distance maintain such path connectivity. Although k-means and k-medoids clustering with the Jaccard and edit distances have enjoyed success in these domains, the limits of accurate cluster recovery under these conditions have not yet been defined. As a first step toward this goal, we performed a simulation study using k-means and k-medoids clustering with non-Euclidean distances and show that performance deteriorates beyond a certain level of noise and as the number of clusters increases. However, we identify initialization strategies that improve cluster recovery in the presence of noise. We use the Prediction Strength method of Tibshirani and Walther (J Comput Graph Stat 14(3):511-528, 2005), which sets up a hypothesis test of whether the data have clustering structure (that is, whether the clusters are reproducible), with the null hypothesis being that there is none. We then applied the framework to perinatal episodes of care; the clusters split reproducibly and organically between Cesarean and vaginal deliveries, which is not itself a clinical finding but sensibly validates the approach. Further visualizations of the clusters did reveal subclusters that split along groups of physicians, cost, and risk scores, warranting the outlined future work on improving the framework's resolution.
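The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice to compute the Jaccard distance on the set of codes in each sequence, and the handling of clusters with fewer than two members are all assumptions made here for illustration. The prediction strength function follows the general idea of the cited method: for each test-set cluster, take the proportion of member pairs that are also placed together when the test items are assigned using the training-set cluster centers, then report the minimum over clusters.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance 1 - |A & B| / |A | B| on the sets of symbols in a and b.

    Assumption: each sequence is reduced to its set of distinct codes.
    """
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0  # two empty sequences are treated as identical
    return 1.0 - len(sa & sb) / len(sa | sb)


def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def assign_to_medoids(seqs, medoids, dist):
    """Label each sequence with the index of its nearest medoid under dist."""
    return [min(range(len(medoids)), key=lambda k: dist(s, medoids[k]))
            for s in seqs]


def prediction_strength(test_labels, predicted_labels, k):
    """Minimum over test clusters of the share of member pairs that the
    training-derived assignment (predicted_labels) also keeps together.

    Assumption: clusters with fewer than two members are skipped.
    """
    strengths = []
    for c in range(k):
        members = [i for i, lab in enumerate(test_labels) if lab == c]
        pairs = list(combinations(members, 2))
        if not pairs:
            continue
        together = sum(predicted_labels[i] == predicted_labels[j]
                       for i, j in pairs)
        strengths.append(together / len(pairs))
    return min(strengths) if strengths else 0.0
```

In use, one would cluster a training split and a test split separately, call `assign_to_medoids` on the test sequences with the training medoids to obtain `predicted_labels`, and compare them with the test split's own labels. A value near 1 suggests the clusters are reproducible across splits; low values suggest they are not.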
Pages: 20