Bayesian clustering with uncertain data

Authors
Nicholls, Kath [1, 2]
Kirk, Paul D. W. [1, 2, 3]
Wallace, Chris [1, 2]
Affiliations
[1] Univ Cambridge, Cambridge Inst Therapeut Immunol & Infect Dis, Cambridge, England
[2] Univ Cambridge, MRC Biostat Unit, Cambridge, England
[3] Univ Cambridge, Canc Res UK Cambridge Ctr, Ovarian Canc Programme, Cambridge, England
Funding
Wellcome Trust (UK); Science and Technology Facilities Council (UK); Engineering and Physical Sciences Research Council (UK);
Keywords
T-CELL EXHAUSTION; DENSITY-ESTIMATION; CLASSIFICATION; AUTOIMMUNITY; SIGNATURE;
DOI
10.1371/journal.pcbi.1012301
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Clustering is widely used in bioinformatics and many other fields, with applications from exploratory analysis to prediction. Many types of data have associated uncertainty or measurement error, but this is rarely used to inform the clustering. We present Dirichlet Process Mixtures with Uncertainty (DPMUnc), an extension of a Bayesian nonparametric clustering algorithm which makes use of the uncertainty associated with data points. We show that DPMUnc out-performs existing methods on simulated data. We cluster immune-mediated diseases (IMD) using GWAS summary statistics, which have uncertainty linked with the sample size of the study. DPMUnc separates autoimmune from autoinflammatory diseases and isolates other subgroups such as adult-onset arthritis. We additionally consider how DPMUnc can be used to cluster gene expression datasets that have been summarised using gene signatures. We first introduce a novel procedure for generating a summary of a gene signature on a dataset different to the one where it was discovered, which incorporates a measure of the variability in expression across signature genes within each individual. We summarise three public gene expression datasets containing patients with a range of IMD, using three relevant gene signatures. We find association between disease and the clusters returned by DPMUnc, with clustering structure replicated across the datasets. The significance of this work is two-fold. Firstly, we demonstrate that when data has associated uncertainty, this uncertainty should be used to inform clustering and we present a method which does this, DPMUnc. Secondly, we present a procedure for using gene signatures in datasets other than where they were originally defined. We show the value of this procedure by summarising gene expression data from patients with immune-mediated diseases using relevant gene signatures, and clustering these patients using DPMUnc. 
Identifying groups of items that are similar to each other, a process called clustering, has a range of applications. For example, if patients split into two distinct groups this suggests that a disease may have subtypes which should be treated differently. Real data often has measurement error associated with it, but this error is frequently discarded by clustering methods. We propose a clustering method which makes use of the measurement error and use it to cluster diseases linked to the immune system. Gene expression datasets measure the activity level of all ~20,000 genes in the human genome. We propose a procedure for summarising gene expression data using gene signatures, lists of genes produced by highly focused studies. For example, a study might list the genes which increase activity after exposure to a particular virus. The genes in the gene signature may not be as tightly correlated in a new dataset, and so our procedure measures the strength of the gene signature in the new dataset, effectively defining measurement error for the summary. We summarise gene expression datasets related to the immune system using relevant gene signatures and find that our method groups patients with the same disease.
Pages: 17