Efficient technique of microarray missing data imputation using clustering and weighted nearest neighbour

Cited by: 17
Authors
Dubey, Aditya [1 ]
Rasool, Akhtar [1 ]
Affiliations
[1] Maulana Azad Natl Inst Technol, Dept Comp Sci & Engn, Bhopal 462003, India
Keywords
GENE-EXPRESSION; HYBRID APPROACH; PREDICTION; FRAMEWORK; CANCER;
DOI
10.1038/s41598-021-03438-x
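Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Subject classification code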
07 ; 0710 ; 09 ;
Abstract
Most statistical methods in bioinformatics, particularly those for gene expression data classification, prognosis, and prediction, require a complete dataset. Gene sample values can be missing because of hardware failure, software failure, or manual mistakes, and missing data in gene expression research dramatically affects the analysis of the collected data. An efficient imputation algorithm is therefore needed. This paper proposes a technique that exploits the local similarity structure of the data, predicting missing values with a combination of clustering and a top-K nearest neighbor approach. A similarity-based spectral clustering approach is combined with K-means. The spectral clustering parameters, cluster size, and weighting factors are optimized first, and the missing values are then predicted. To impute the missing values in each cluster, the top-K nearest neighbor approach uses a weighted distance. The technique is evaluated on microarray gene expression datasets covering different cancers and tumors, drawn from a variety of biological areas, with experimentally inserted missing values ranging from 5% to 25%. The experimental results show that the proposed technique makes more accurate predictions than the other imputation procedures it was compared against. The main contribution of this research is to show that local similarity-based techniques can be used for imputation even when datasets vary in dimensionality and characteristics.
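The per-cluster imputation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes cluster labels are already available (for example from the spectral clustering plus K-means step), it uses inverse-distance weights as one plausible reading of "weighted distance", and the names `impute_within_clusters`, `rows`, `labels`, `k`, and `eps` are hypothetical. Missing entries are represented as `None`.

```python
import math


def impute_within_clusters(rows, labels, k=3, eps=1e-9):
    """Fill None entries from the top-k nearest rows of the same cluster.

    Assumptions (not taken from the paper): distances are root-mean-square
    differences over the columns both rows have observed, and neighbours
    are weighted by inverse distance.
    """
    filled = [list(r) for r in rows]  # copy so the input stays untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # Only rows in the same cluster that observed column j can donate.
                if m == i or labels[m] != labels[i] or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            top = sorted(candidates, key=lambda c: c[0])[:k]
            if not top:
                continue  # no usable neighbour in this cluster: leave the gap
            weights = [1.0 / (d + eps) for d, _ in top]
            weighted_sum = sum(w * v for w, (_, v) in zip(weights, top))
            filled[i][j] = weighted_sum / sum(weights)
    return filled


# Toy expression matrix: rows are genes, columns are samples.
rows = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, None],
    [1.2, 2.2, 3.4],
    [9.0, 8.0, 7.0],
    [9.0, None, 7.2],
]
labels = [0, 0, 0, 1, 1]  # assumed output of the clustering step
result = impute_within_clusters(rows, labels, k=2)
```

Restricting the neighbour search to a single cluster is what makes the estimate depend on local similarity structure: the gap in the second row is filled only from rows with a similar profile, and the rows of the other cluster are never consulted.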
Pages: 12