Trimmed scores regression for k-means clustering data with high-missing ratio

Cited by: 1
Authors
Guo, Guangbao [1]
Niu, Ruiling [1]
Qian, Guoqi [2]
Lu, Tao [1]
Affiliations
[1] Shandong Univ Technol, Sch Math & Stat, Zibo, Peoples R China
[2] Univ Melbourne, Sch Math & Stat, Melbourne, Vic, Australia
Keywords
High-missing ratio; k-means clustering; Missing data; Trimmed scores regression; Principal component analysis; Imputation; Values
DOI
10.1080/03610918.2022.2091779
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Subject classification codes
020208 ; 070103 ; 0714 ;
Abstract
Data sets with missing values pose a major challenge for k-means clustering (KMC). Most existing studies address KMC data with a low missing ratio, and few consider KMC data with a high missing ratio. Current imputation methods have two problems when applied to KMC data: (1) the error between the imputed value and the original true value is large, so imputation precision is poor; (2) the imputation results strongly influence the clustering results, which reduces clustering accuracy. To address these problems, we propose a novel imputation method called trimmed scores regression (TSR), which obtains an imputation estimator from a regression equation with a trimmed score matrix, together with a novel clustering approach based on the k-means method. In numerical analysis, the TSR method performs better than the other imputation methods compared.
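The impute-then-cluster idea in the abstract can be illustrated with a minimal sketch. It assumes a common PCA-based formulation of trimmed scores regression, in which the missing entries of a row are regressed on the scores computed from its observed entries only. The abstract does not give the paper's exact estimator or its k-means variant, so the function names `tsr_impute` and `kmeans`, the defaults, and the plain Lloyd iteration below are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def tsr_impute(X, n_components=2, max_iter=200, tol=1e-8):
    """Iterative PCA-based imputation in the spirit of TSR (illustrative only).

    Assumes NaN marks missing entries and every column has at least one
    observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial fill: column means of the observed entries.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        mean = filled.mean(axis=0)
        centered = filled - mean
        S = np.cov(centered, rowvar=False)
        # Loadings: eigenvectors of S for the largest eigenvalues.
        _, eigvecs = np.linalg.eigh(S)
        P = eigvecs[:, ::-1][:, :n_components]
        updated = filled.copy()
        for i in np.flatnonzero(missing.any(axis=1)):
            m = missing[i]
            o = ~m
            if not o.any():
                continue  # no observed entries in this row; keep current fill
            P_o = P[o]
            # Trimmed scores: projection using observed variables only.
            t_trim = P_o.T @ centered[i, o]
            # Regress the missing variables on the trimmed scores.
            G = P_o.T @ S[np.ix_(o, o)] @ P_o
            B = S[np.ix_(m, o)] @ P_o @ np.linalg.pinv(G)
            updated[i, m] = mean[m] + B @ t_trim
        change = np.max(np.abs(updated - filled))
        filled = updated
        if change < tol:
            break
    return filled


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm, used here as a stand-in clustering step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

A typical call would be `filled = tsr_impute(X, n_components=2)` followed by `labels, centers = kmeans(filled, k=3)`, where `X` is a numeric array with `np.nan` in the missing positions. Observed entries are left untouched; only the missing positions are overwritten.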
Pages: 2805-2821
Page count: 17