Clustering transformed compositional data using K-means, with applications in gene expression and bicycle sharing system data

Cited by: 35
Authors
Godichon-Baggioni, Antoine [1]
Maugis-Rabusseau, Cathy [2]
Rau, Andrea [3]
Affiliations
[1] Univ Toulouse, CNRS, UMR 5219, Inst Math Toulouse, UPS, F-31062 Toulouse 9, France
[2] Univ Toulouse, CNRS, UMR 5219, Inst Math Toulouse, INSA, F-31077 Toulouse, France
[3] Univ Paris Saclay, AgroParisTech, INRA, GABI, Paris, France
Keywords
Clustering; compositional data; data transformations; K-means; STATISTICAL-ANALYSIS; MIXTURE MODEL; DATA SET; NUMBER; CRITERION;
DOI
10.1080/02664763.2018.1454894
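Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline classification code
020208; 070103; 0714;
Abstract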
Although there is no shortage of clustering algorithms proposed in the literature, the question of the most relevant strategy for clustering compositional data (i.e. data whose rows belong to the simplex) remains largely unexplored in cases where the observed value is equal to or close to zero for one or more samples. This work is motivated by the analysis of two applications, both focused on the categorization of compositional profiles: (1) identifying groups of co-expressed genes from high-throughput RNA sequencing data, in which a given gene may be completely silent in one or more experimental conditions; and (2) finding patterns in the usage of stations over the course of one week in the Velib' bicycle sharing system in Paris, France. For both of these applications, we make use of appropriately chosen data transformations, including the Centered Log Ratio and a novel extension called the Log Centered Log Ratio, in conjunction with the K-means algorithm. We use a non-asymptotic penalized criterion, whose penalty is calibrated with the slope heuristics, to select the number of clusters. Finally, we illustrate the performance of this clustering strategy, which is implemented in the Bioconductor package coseq, on both the gene expression and bicycle sharing system data.
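The general idea of transforming compositional profiles and then running K-means can be illustrated with a minimal sketch. It is not the coseq implementation and makes several assumptions that do not come from the abstract: the toy profiles are invented, the `pseudo` offset is a hypothetical guard against taking the log of zero, the number of clusters is fixed at 2 instead of being selected by the slope heuristics, and the K-means initialization is a simple deterministic farthest-first rule. Only the standard Centered Log Ratio is shown; the Log Centered Log Ratio is not defined in the abstract and is therefore left out.

```python
import math


def clr(row, pseudo=1e-6):
    """Centered log ratio: log of each part minus the mean log over all parts.

    `pseudo` is a hypothetical offset (an assumption of this sketch) that
    keeps the logarithm finite when a part is exactly zero.
    """
    shifted = [x + pseudo for x in row]
    total = sum(shifted)
    logs = [math.log(x / total) for x in shifted]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100):
    """Lloyd iterations with a deterministic farthest-first initialization."""
    # Initialization (assumption of this sketch): start from the first point,
    # then repeatedly add the point farthest from the centers chosen so far.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Invented toy profiles: each row is a composition over three parts.
profiles = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.75, 0.10, 0.15],
    [0.10, 0.10, 0.80],
    [0.10, 0.20, 0.70],
    [0.15, 0.10, 0.75],
]

transformed = [clr(row) for row in profiles]
labels, centers = kmeans(transformed, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each transformed row sums to zero by construction, which is the defining property of the Centered Log Ratio. The toy profiles deliberately avoid exact zeros: with a small offset, a zero part maps to a very large negative value and dominates the Euclidean distances, which is the situation the abstract identifies as problematic.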
Pages: 47-65
Page count: 19