Reproducible clustering with non-Euclidean distances: a simulation and case study

被引:1
|
作者
Staples, Lauren [1 ]
Ring, Janelle [2 ]
Fontana, Scott [2 ]
Stradwick, Christina [1 ]
DeMaio, Joe [1 ]
Ray, Herman [1 ]
Zhang, Yifan [1 ]
Zhang, Xinyan [1 ]
机构
[1] Kennesaw State Univ, Sch Data Sci & Analyt, 3391 Town Point Dr NW, Kennesaw, GA 30144 USA
[2] Provider Consulting & Analyt, BlueCross BlueShield Tennessee, 1 Cameron Cir, Chattanooga, TN 37402 USA
关键词
K-means; K-medoids; Jaccard; Edit distance; Reproducibility; Prediction strength; Clustering; Non-Euclidean; Initialization; EDIT DISTANCE; VALIDATION; ALGORITHM;
D O I
10.1007/s41060-023-00429-1
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Certain categorical sequence clustering applications require path connectivity, such as the clustering of DNA, click-paths through web-user sessions, or paths of care clustering with sequences of patient medical billing codes. K-means and k-medoids clustering with non-Euclidean distance metrics such as the Jaccard or edit distances maintains such path connectivity. Although k-means and k-medoids clustering with the Jaccard and edit distances have enjoyed success in these domains, the limits of accurate cluster recovery in these conditions have not yet been defined. As a first step in approaching this goal, we performed a simulated study using k-means and k-medoids clustering with non-Euclidean distances and show the performance deteriorates at a certain level of noise and when the number of clusters increases. However, we identify initialization strategies that improve upon cluster recovery in the presence of noise. We employ the use of the Tibshirani and Guenther (J Comput Graph Stat 14(3):511-528, 2005) Prediction Strength method, which creates a hypothesis testing scenario that determines if there is clustering structure to the data (if the clusters are reproducible), with the null hypothesis being there is none. We then applied the framework to perinatal episodes of care and the clusters reproducibly and organically split between Cesarean and vaginal deliveries, which itself is not a clinical finding but sensibly validates the approach. Further visualizations of the clusters did bring insights into subclusters that split along groups of physicians, cost and risk scores, warranting the outlined future work into ways of improving this framework for better resolution.
引用
收藏
页数:20
相关论文
共 50 条
  • [1] A case example of insect gymnastics: how is non-Euclidean geometry learned?
    Junius, Premalatha
    INTERNATIONAL JOURNAL OF MATHEMATICAL EDUCATION IN SCIENCE AND TECHNOLOGY, 2008, 39 (08) : 987 - 1002
  • [2] Elastic theory of unconstrained non-Euclidean plates
    Efrati, E.
    Sharon, E.
    Kupferman, R.
    JOURNAL OF THE MECHANICS AND PHYSICS OF SOLIDS, 2009, 57 (04) : 762 - 775
  • [3] Soft learning vector quantization and clustering algorithms based on non-Euclidean norms: Multinorm algorithms
    Karayiannis, NB
    Randolph-Gips, MM
    IEEE TRANSACTIONS ON NEURAL NETWORKS, 2003, 14 (01): : 89 - 102
  • [4] Non-Euclidean Sliced Optimal Transport Sampling
    Genest, Baptiste
    Courty, Nicolas
    Coeurjolly, David
    COMPUTER GRAPHICS FORUM, 2024, 43 (02)
  • [5] Clustering incomplete relational data using the non-Euclidean relational fuzzy c-means algorithm
    Hathaway, RJ
    Bezdek, JC
    PATTERN RECOGNITION LETTERS, 2002, 23 (1-3) : 151 - 160
  • [6] A Fuzzy Neural Network Based on Non-Euclidean Distance Clustering for Quality Index Model in Slashing Process
    Zhang, Yuxian
    Li, Song
    Qian, Xiaoyi
    Wang, Jianhui
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2015, 2015
  • [7] Structure Preserving Encoding of Non-euclidean Similarity Data
    Muench, Maximilian
    Raab, Christoph
    Biehl, Michael
    Schleif, Frank-Michael
    ICPRAM: PROCEEDINGS OF THE 9TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION APPLICATIONS AND METHODS, 2020, : 43 - 51
  • [8] A Performance Comparison of Euclidean, Manhattan and Minkowski Distances in K-Means Clustering
    Haviluddin
    Iqbal, Muhammad
    Putra, Gubtha Mahendra
    Puspitasari, Novianti
    Setyadi, Hario Jati
    Dwiyanto, Felix Andika
    Wibawa, Aji Prasetya
    Alfred, Rayner
    2020 6TH INTERNATIONAL CONFERENCE ON SCIENCE IN INFORMATION TECHNOLOGY (ICSITECH): EMBRACING INDUSTRY 4.0: TOWARDS INNOVATION IN DISASTER MANAGEMENT, 2020, : 184 - 188
  • [9] Unsupervised learning with normalised data and non-Euclidean norms
    Doherty, K. A. J.
    Adams, R. G.
    Davey, N.
    APPLIED SOFT COMPUTING, 2007, 7 (01) : 203 - 210
  • [10] Hanson-Wright inequality in Hilbert spaces with application to K-means clustering for non-Euclidean data
    Chen, Xiaohui
    Yang, Yun
    BERNOULLI, 2021, 27 (01) : 586 - 614