Selective Inference for Hierarchical Clustering

Cited by: 49
Authors
Gao, Lucy L. [1]
Bien, Jacob [2]
Witten, Daniela [3, 4]
Affiliations
[1] Univ British Columbia, Dept Stat, Vancouver, BC, Canada
[2] Univ Southern Calif, Dept Data Sci & Operat, Los Angeles, CA 90007 USA
[3] Univ Washington, Dept Stat, Seattle, WA 98195 USA
[4] Univ Washington, Dept Biostat, Seattle, WA 98195 USA
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Difference in means; Hypothesis testing; Post-selection inference; Type I error; STATISTICAL SIGNIFICANCE; SINGLE;
DOI
10.1080/01621459.2022.2116331
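Chinese Library Classification number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract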
Classical tests for a difference in means control the Type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated Type I error rate. Notably, this problem persists even if two separate and independent datasets are used to define the groups and to test for a difference in their means. To address this problem, in this article, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective Type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data. Supplementary materials for this article are available online.
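The inflation of the Type I error rate described in the abstract can be reproduced with a small simulation. The sketch below is not the authors' selective inference procedure; it only illustrates the problem that procedure addresses. It assumes Gaussian data with no true clusters and known variance `sigma**2`, average linkage cut at two clusters, and a classical chi-squared test on the difference in cluster means. The function names `naive_pvalue` and `rejection_rate`, and all parameter values, are hypothetical choices made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import chi2


def naive_pvalue(X, sigma=1.0):
    """Classical test for a difference in means between two clusters
    that were estimated from the same data X (rows are observations).

    Assumes independent N(mu_i, sigma^2 I) rows with sigma known.
    """
    # Average-linkage hierarchical clustering, cut into two clusters.
    tree = linkage(X, method="average")
    labels = fcluster(tree, t=2, criterion="maxclust")
    a, b = X[labels == 1], X[labels == 2]

    # For groups fixed in advance, this statistic is chi-squared with
    # q degrees of freedom under the null of equal means.
    diff = a.mean(axis=0) - b.mean(axis=0)
    stat = diff @ diff / (sigma**2 * (1.0 / len(a) + 1.0 / len(b)))
    return chi2.sf(stat, df=X.shape[1])


def rejection_rate(n_sims=200, n=50, q=2, alpha=0.05, seed=0):
    """Fraction of null datasets on which the naive test rejects."""
    rng = np.random.default_rng(seed)
    # Every dataset is pure noise: all observations share one mean.
    pvals = [naive_pvalue(rng.standard_normal((n, q))) for _ in range(n_sims)]
    return float(np.mean(np.array(pvals) < alpha))


rate = rejection_rate()
print(f"nominal level 0.05, empirical rejection rate {rate:.2f}")
```

Because the clusters are chosen to be far apart, the naive p-values concentrate near zero even though no true difference in means exists, so the empirical rejection rate lands far above the nominal level.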
Pages: 332-342
Page count: 11