Category encoding method to select feature genes for the classification of bulk and single-cell RNA-seq data

Cited by: 2
Authors
Zhou, Yan [1]
Zhang, Li [1]
Xu, Jinfeng [2]
Zhang, Jun [1]
Yan, Xiaodong [3]
Affiliations
[1] Shenzhen Univ, Coll Math & Stat, Inst Stat Sci, Shenzhen Key Lab Adv Machine Learning & Applicat, Shenzhen, Peoples R China
[2] Univ Hong Kong, Dept Math, Pokfulam, Hong Kong, Peoples R China
[3] Shandong Univ, Zhongtai Secur Inst Financial Studies, Jinan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
CAEN; classification; feature selection; single-cell RNA-seq; DISCRIMINANT-ANALYSIS; NORMALIZATION; FILTER; MODEL
DOI
10.1002/sim.9015
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
Bulk and single-cell RNA-seq (scRNA-seq) data are increasingly used as alternatives to traditional technologies in biological and medical research, for example to detect differentially expressed (DE) genes. Several statistical methods have been developed for classifying bulk and single-cell RNA-seq data, and the feature genes they rely on are vitally important to classification. Most genes, however, are not DE and are thus irrelevant for class distinction. Removing these irrelevant genes is necessary to improve classification performance and save computation time, and it also aids the detection of the important feature genes. Widely used schemes in the literature, such as the BSS/WSS (BW) method, assume that the data are normally distributed and may not be suitable for bulk and single-cell RNA-seq data. In this article, a category encoding (CAEN) method is proposed to select feature genes for bulk and single-cell RNA-seq data classification. This novel method encodes categories by employing the rank of sequence samples for each gene in each class. Correlation coefficients between gene and class are then computed from the rank of the samples and the new rank of the category. The genes with the highest correlation coefficients are taken as feature genes, which are the most effective for classifying bulk and single-cell RNA-seq datasets. The sure screening and rank consistency properties of the proposed CAEN method are also established. Simulation studies show that the classifier using the proposed CAEN method performs better than, or at least as well as, the existing methods in most settings. Existing real datasets were also analyzed, and the results demonstrate superior performance of the proposed method over current competitors. The method has been implemented in an R package named "CAEN" to facilitate wide use.
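The abstract describes the scoring step only in words, so the following is a minimal Python sketch of one possible reading of it, not the code of the "CAEN" R package. It assumes that samples are ranked within each gene, that each class is re-encoded by the order of its mean sample rank, and that the gene's score is the Pearson correlation between the sample ranks and the encoded classes. The function names (`average_ranks`, `caen_like_score`, `select_feature_genes`) and the toy data are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def caen_like_score(counts, labels):
    """Score one gene (assumed reading of the abstract, see lead-in)."""
    sample_ranks = average_ranks(counts)
    classes = sorted(set(labels))
    # Mean rank of the samples belonging to each class.
    class_means = [
        mean([r for r, lab in zip(sample_ranks, labels) if lab == c])
        for c in classes
    ]
    # Encode each class by the rank of its mean sample rank.
    code = dict(zip(classes, average_ranks(class_means)))
    encoded = [code[lab] for lab in labels]
    return pearson(sample_ranks, encoded)


def select_feature_genes(matrix, labels, top_k):
    """Return indices of the top_k highest-scoring genes and all scores."""
    scores = [caen_like_score(row, labels) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda g: scores[g], reverse=True)
    return ranked[:top_k], scores


# Toy data: rows are genes, columns are samples (hypothetical values).
labels = ["A", "A", "B", "B", "C", "C"]
matrix = [
    [1, 2, 10, 11, 5, 6],  # expression clearly separates the classes
    [3, 9, 4, 8, 5, 7],    # class mean ranks are identical
    [4, 4, 4, 4, 4, 4],    # constant gene
]
top, scores = select_feature_genes(matrix, labels, top_k=1)
print(top, [round(s, 3) for s in scores])
```

In this toy example only the first gene separates the classes, so it is the only one with a nonzero score. Because the ranks ignore the magnitude of the counts, a score of this kind does not depend on a normality assumption, which is the limitation the abstract attributes to the BW method.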
Pages: 4077-4089
Number of pages: 13