Intrinsic entropy model for feature selection of scRNA-seq data

Cited by: 7
Authors
Li, Lin [1, 2]
Tang, Hui [3]
Xia, Rui [1, 2]
Dai, Hao [1]
Liu, Rui [3]
Chen, Luonan [1, 4, 5, 6]
Affiliations
[1] Chinese Acad Sci, CAS Ctr Excellence Mol Cell Sci, Shanghai Inst Biochem & Cell Biol, State Key Lab Cell Biol, Shanghai 200031, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] South China Univ Technol, Sch Math, Guangzhou 510640, Peoples R China
[4] Chinese Acad Sci, Ctr Excellence Anim Evolut & Genet, Kunming 650223, Yunnan, Peoples R China
[5] Chinese Acad Sci, Univ Chinese Acad Sci, Hangzhou Inst Adv Study, Key Lab Syst Hlth Sci Zhejiang Prov, Hangzhou 310024, Peoples R China
[6] Guangdong Inst Intelligence Sci & Technol, Zhuhai 519031, Peoples R China
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
Keywords
scRNA-seq; feature selection; intrinsic entropy; extrinsic entropy; entropy decomposition; informative genes
DOI
10.1093/jmcb/mjac008
Chinese Library Classification
Q2 [Cell biology]
Discipline classification codes
071009; 090102
Abstract
Recent advances in single-cell RNA sequencing (scRNA-seq) technologies have led to extensive study of cellular heterogeneity and cell-to-cell variation. However, the high frequency of dropout events and noise in scRNA-seq data confounds downstream analyses such as clustering, whose accuracy depends heavily on the selected feature genes. Here, by deriving an entropy decomposition formula, we propose a feature selection method, the intrinsic entropy (IE) model, to identify informative genes for accurate clustering analysis. Specifically, by eliminating the 'noisy' fluctuation, or extrinsic entropy (EE), we extract the IE of each gene from its total entropy (TE), i.e. TE = IE + EE. We show that the IE of each gene reflects the regulatory fluctuation of that gene in a cellular process, and thus high-IE genes provide rich information for cell type or state analysis. To validate the performance of the high-IE genes, we conduct computational analyses on both simulated and real single-cell datasets and compare the IE model with other representative methods. The results show that our IE model is not only broadly applicable and robust across different clustering and classification methods, but also sensitive to novel cell types. Our results also demonstrate that the intrinsic entropy/fluctuation of a gene, in contrast to its total entropy/fluctuation, serves as information rather than noise.
Pages: 11