Effective Clustering for Single Cell Sequencing Cancer Data

Cited by: 2
Authors
Ciccolella, Simone [1]
Patterson, Murray D. [1, 2]
Bonizzoni, Paola [1]
Della Vedova, Gianluca [1]
Affiliations
[1] Univ Milano Bicocca, Dipartimento Informat Sistemist & Comunicaz, Milan, Italy
[2] Fairfield Univ, Dept Comp Sci & Engn, Sch Engn, Fairfield, CT 06430 USA
Source
ACM-BCB'19: PROCEEDINGS OF THE 10TH ACM INTERNATIONAL CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND HEALTH INFORMATICS | 2019
Keywords
CLONAL EVOLUTION; INFERENCE;
DOI
10.1145/3307339.3342149
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Background. Single cell sequencing (SCS) technologies provide a level of resolution that makes them indispensable for inferring, from a sequenced tumor, evolutionary trees (phylogenies) that represent the accumulation of cancerous mutations. A drawback of SCS is its elevated false negative and missing value rates, which result in a large space of possible solutions and in turn make some approaches and tools infeasible. While this has not inhibited the development of methods for inferring phylogenies from SCS data, the continuing increase in the size and resolution of these data is beginning to strain such methods. One possible solution is to reduce the size of an SCS instance, usually represented as a matrix of presence, absence and missing values of the mutations found in the different sequenced cells, and to infer the tree from this reduced-size instance. Previous approaches have used k-means to this end, clustering groups of mutations and/or cells and using the resulting means as the reduced instance. Such an approach typically uses the Euclidean distance for computing means. However, since the values in these matrices are categorical (with three categories: present, absent and missing), we explore applying techniques for clustering categorical data, commonly used in data mining and machine learning, to SCS data.
Results. In this work, we present celluloid, a new procedure for clustering categorical vector or matrix data, here representing SCS instances. We demonstrate that celluloid clusters mutations with high precision, never pairing too many mutations that are unrelated in the ground truth, and that it also obtains accurate results in terms of the phylogeny inferred downstream from the reduced instance it produces. Finally, we demonstrate the usefulness of a clustering step by applying the entire pipeline (clustering + inference method) to a real dataset, showing a significant reduction in runtime and considerably raising the upper bound on the size of SCS instances that can be solved in practice.
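To make the idea of clustering categorical data concrete, the following is a minimal sketch of k-modes, a standard categorical analogue of k-means that replaces means with per-column modes and Euclidean distance with a count of mismatching positions. It is an illustration only and is not taken from the celluloid implementation, whose exact dissimilarity and initialization are not given in this abstract. The encoding (0 = absent, 1 = present, 2 = missing), the function names, the deterministic initialization and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter

# Assumed encoding for this sketch: 0 = absent, 1 = present, 2 = missing.

def mismatch(a, b):
    """Simple matching dissimilarity: number of positions whose categories differ."""
    return sum(x != y for x, y in zip(a, b))

def column_mode(values):
    """Most frequent category; ties go to the smallest code (an arbitrary choice)."""
    counts = Counter(values)
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)

def k_modes(rows, k, max_iter=100):
    """Cluster categorical rows into k groups; returns (labels, modes).

    Initial modes are the first k distinct rows, a simplification that keeps
    the example deterministic; practical tools use better initializations.
    """
    modes = []
    for r in rows:
        if list(r) not in modes:
            modes.append(list(r))
        if len(modes) == k:
            break
    if len(modes) < k:
        raise ValueError("fewer than k distinct rows")

    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest mode.
        new_labels = [min(range(k), key=lambda j: mismatch(r, modes[j]))
                      for r in rows]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: recompute each mode column by column.
        for j in range(k):
            members = [r for r, l in zip(rows, labels) if l == j]
            if members:
                modes[j] = [column_mode(col) for col in zip(*members)]
    return labels, modes

# Hypothetical toy instance: rows are mutations, columns are cells.
mutations = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 2, 0],
    [0, 2, 1, 1],
    [1, 1, 0, 0],
]
labels, modes = k_modes(mutations, 2)
print(labels)  # cluster index per mutation
print(modes)   # the modes form the reduced-size instance
```

In this sketch the returned modes play the role of the reduced instance: each cluster of mutations is replaced by a single representative row that stays within the three categories, whereas a k-means centroid would generally contain fractional values that no longer correspond to present, absent or missing.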
Pages: 437-446
Number of pages: 10