Robust subspace methods for outlier detection in genomic data circumvents the curse of dimensionality

Cited by: 14
Authors
Shetta, Omar [1 ]
Niranjan, Mahesan [1 ]
Affiliations
[1] Univ Southampton, Elect & Comp Sci, Southampton SO17 1BJ, Hants, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
dimensionality reduction; outlier detection; high-dimensional data; genomics; TRANSCRIPTOME; CANCER;
DOI
10.1098/rsos.190714
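Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy, Earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes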
07 ; 0710 ; 09 ;
Abstract
The application of machine learning to inference problems in biology is dominated by supervised learning problems of regression and classification, and by unsupervised learning problems of clustering and variants of low-dimensional projection for visualization. A class of problems that has not gained much attention is the detection of outliers in datasets, which arise from causes such as gross experimental, reporting or labelling errors. Outliers could also be small parts of a dataset that are functionally distinct from the majority of a population. Outlier data are often identified by considering the probability density of normal data and comparing data likelihoods against some threshold. This classical approach suffers from the curse of dimensionality, a serious problem for omics data, which are often very high-dimensional. We develop an outlier detection method based on structured low-rank approximation. The objective function includes a regularizer based on neighbourhood information captured in the graph Laplacian. Results on publicly available genomic data show that our method robustly detects outliers, whereas a density-based method fails even at moderate dimensions. Moreover, we show that our method has better clustering and visualization performance on the recovered low-dimensional projection when compared with popular dimensionality reduction techniques.
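The classical density-based baseline mentioned in the abstract (fit a density to the data, then flag points whose likelihood falls below a threshold) can be sketched as follows. This is a minimal illustration and not the authors' low-rank method: the single-Gaussian model, the function names, the synthetic data, the 1% quantile threshold and the small ridge term are all assumptions made for the example.

```python
import numpy as np

def gaussian_log_likelihood(X, mean, cov):
    """Log-density of each row of X under a Gaussian N(mean, cov)."""
    d = X.shape[1]
    diff = X - mean
    # Solve a linear system rather than inverting the covariance explicitly.
    solved = np.linalg.solve(cov, diff.T).T
    mahalanobis_sq = np.sum(diff * solved, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (mahalanobis_sq + logdet + d * np.log(2.0 * np.pi))

def flag_outliers(X, quantile=0.01, ridge=1e-6):
    """Flag rows whose log-likelihood falls below the given quantile.

    `quantile` and `ridge` are illustrative choices, not values from the paper.
    """
    mean = X.mean(axis=0)
    # The ridge keeps the covariance invertible; with more features than
    # samples the unregularized sample covariance is singular.
    cov = np.cov(X, rowvar=False) + ridge * np.eye(X.shape[1])
    log_lik = gaussian_log_likelihood(X, mean, cov)
    threshold = np.quantile(log_lik, quantile)
    return log_lik < threshold, log_lik

# Synthetic data (an assumption): 200 inliers in 5 dimensions plus one
# planted point far from the bulk of the data.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 5))
planted = np.full((1, 5), 8.0)
X = np.vstack([inliers, planted])

flags, log_lik = flag_outliers(X)
```

In this low-dimensional setting the planted point receives the lowest likelihood. When the number of features approaches or exceeds the number of samples, as is typical for omics data, the covariance estimate becomes unreliable, which is the difficulty the abstract attributes to the curse of dimensionality.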
Pages: 14