Reproducible clustering with non-Euclidean distances: a simulation and case study

被引:1
|
作者
Staples, Lauren [1 ]
Ring, Janelle [2 ]
Fontana, Scott [2 ]
Stradwick, Christina [1 ]
DeMaio, Joe [1 ]
Ray, Herman [1 ]
Zhang, Yifan [1 ]
Zhang, Xinyan [1 ]
机构
[1] Kennesaw State Univ, Sch Data Sci & Analyt, 3391 Town Point Dr NW, Kennesaw, GA 30144 USA
[2] Provider Consulting & Analyt, BlueCross BlueShield Tennessee, 1 Cameron Cir, Chattanooga, TN 37402 USA
关键词
K-means; K-medoids; Jaccard; Edit distance; Reproducibility; Prediction strength; Clustering; Non-Euclidean; Initialization; EDIT DISTANCE; VALIDATION; ALGORITHM;
D O I
10.1007/s41060-023-00429-1
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Certain categorical sequence clustering applications require path connectivity, such as the clustering of DNA, click-paths through web-user sessions, or paths of care clustering with sequences of patient medical billing codes. K-means and k-medoids clustering with non-Euclidean distance metrics such as the Jaccard or edit distances maintains such path connectivity. Although k-means and k-medoids clustering with the Jaccard and edit distances have enjoyed success in these domains, the limits of accurate cluster recovery in these conditions have not yet been defined. As a first step in approaching this goal, we performed a simulated study using k-means and k-medoids clustering with non-Euclidean distances and show the performance deteriorates at a certain level of noise and when the number of clusters increases. However, we identify initialization strategies that improve upon cluster recovery in the presence of noise. We employ the use of the Tibshirani and Guenther (J Comput Graph Stat 14(3):511-528, 2005) Prediction Strength method, which creates a hypothesis testing scenario that determines if there is clustering structure to the data (if the clusters are reproducible), with the null hypothesis being there is none. We then applied the framework to perinatal episodes of care and the clusters reproducibly and organically split between Cesarean and vaginal deliveries, which itself is not a clinical finding but sensibly validates the approach. Further visualizations of the clusters did bring insights into subclusters that split along groups of physicians, cost and risk scores, warranting the outlined future work into ways of improving this framework for better resolution.
引用
收藏
页数:20
相关论文
共 50 条
  • [21] Non-Euclidean Grid-Partitioning to Mitigate Cascading Risk in Multi-Infeed HVDC System
    Wang, Xiaohui
    Song, Kaige
    Hao, Quanrui
    Gao, Feng
    IEEE TRANSACTIONS ON POWER SYSTEMS, 2024, 39 (05) : 6426 - 6440
  • [22] Monte Carlo simulations of primitive model (PM) electrolytes in non-Euclidean geometries
    Hanassab, S
    VanderNoot, TJ
    MOLECULAR SIMULATION, 2004, 30 (05) : 301 - 311
  • [23] An application of the self-organizing map in the non-Euclidean Traveling Salesman Problem
    Faigl, Jan
    Kulich, Miroslav
    Vonasek, Vojtech
    Preucil, Libor
    NEUROCOMPUTING, 2011, 74 (05) : 671 - 679
  • [24] Modelling hyperbolic space: Designing a computational context for learning non-euclidean geometry
    Stevenson I.
    International Journal of Computers for Mathematical Learning, 2000, 5 (02): : 143 - 167
  • [25] Cognitive Maps for a Non-Euclidean Environment: Path Integration and Spatial Memory on a Sphere
    Kim, Misun
    Doeller, Christian F.
    PSYCHOLOGICAL SCIENCE, 2024, 35 (11) : 1217 - 1230
  • [26] Mass-spring-damper networks for distributed optimization in non-Euclidean spaces
    Yu, Yue
    Acikmese, Behcet
    Mesbahi, Mehran
    AUTOMATICA, 2020, 112
  • [27] Adjusted, non-Euclidean cluster detection of Vibrio parahaemolyticus in the Chesapeake Bay, USA
    Kvit, Anton
    Davis, Benjamin
    Jacobs, John
    Curriero, Frank C.
    GEOSPATIAL HEALTH, 2019, 14 (02) : 211 - 218
  • [28] On the iteration-complexity of a non-Euclidean hybrid proximal extragradient framework and of a proximal ADMM
    Goncalves, Max L. N.
    Melo, Jefferson G.
    Monteiro, Renato D. C.
    OPTIMIZATION, 2020, 69 (04) : 847 - 873
  • [29] If The Map Fits! Exploring Minimaps as Distractors from Non-Euclidean Spaces in Virtual Reality
    Auda, Jonas
    Gruenefeld, Uwe
    Schneegass, Stefan
    EXTENDED ABSTRACTS OF THE 2022 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, CHI 2022, 2022,
  • [30] IMPROVED POINTWISE ITERATION-COMPLEXITY OF A REGULARIZED ADMM AND OF A REGULARIZED NON-EUCLIDEAN HPE FRAMEWORK
    Goncalves, Max L. N.
    Melo, Jefferson G.
    Monteiro, Renato D. C.
    SIAM JOURNAL ON OPTIMIZATION, 2017, 27 (01) : 379 - 407