ForestSubtype: a cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest

Cited by: 1
Authors
Luo, Junwei [1]
Feng, Yading [1]
Wu, Xuyang [1]
Li, Ruimin [1]
Shi, Jiawei [1]
Chang, Wenjing [1]
Wang, Junfeng [1]
Affiliation
[1] Henan Polytech Univ, Sch Software, Jiaozuo, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cancer subtyping; Random forest; Gene expression data; Machine learning; Auto Encoder; BREAST-CANCER; PROGNOSTIC BIOMARKER; PREDICTOR; SELECTION; PACKAGE; GROWTH; FOXC1;
DOI
10.1186/s12859-023-05412-y
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Cancer subtype classification is helpful for personalized cancer treatment. Although some approaches have been developed to classify cancer subtypes based on high-dimensional gene expression data, it remains difficult to obtain satisfactory classification results. Meanwhile, some cancers have been well studied and classified into subtypes that are adopted by most researchers. Hence, this prior knowledge is significant for further identifying new meaningful subtypes.
Results: In this paper, we present ForestSubtype, an approach for cancer subtype identification that combines parallel random forests (RF) and an autoencoder and is based on high-dimensional gene expression data. First, ForestSubtype adopts the parallel RF and the prior knowledge of cancer subtypes to train a model and extract significant candidate features. Second, ForestSubtype uses a random forest as the base model and ten parallel random forests to compute the weight of each feature and rank the features separately. Then, the intersection of the features with the larger weights output by the ten parallel random forests is taken as the set of subsequent candidate features. Third, ForestSubtype uses an autoencoder to condense the selected features into two-dimensional data. Fourth, ForestSubtype utilizes k-means++ to obtain new cancer subtype identification results. The breast cancer gene expression data obtained from The Cancer Genome Atlas are used for training and validation, and an independent breast cancer dataset from the Molecular Taxonomy of Breast Cancer International Consortium is used for testing. Additionally, we use two other cancer datasets to validate the generalizability of ForestSubtype. ForestSubtype outperforms the other two methods in terms of the distribution of clusters and the internal and external metric results. The open-source code is available at https://github.com/lffyd/ForestSubtype.
Conclusions: Our work shows that the combination of high-dimensional gene expression data, parallel random forests, and an autoencoder, guided by a priori knowledge, can identify new subtypes more effectively than existing methods of cancer subtype classification.
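The four steps named in the abstract (feature ranking with several random forests, intersection of the top-ranked features, compression to two dimensions with an autoencoder, and k-means++ clustering) can be illustrated with a minimal sketch. This is not the authors' implementation, which is in the repository linked above. It assumes scikit-learn and NumPy, uses synthetic data in place of gene expression data, trains the forests one after another instead of in parallel, and stands in a single-hidden-layer `MLPRegressor` for the autoencoder. The helper names `select_features` and `encode_2d`, the cutoff `top_k=30`, and all other hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler


def select_features(X, y, n_forests=10, top_k=30):
    """Intersect the top_k most important features of several forests.

    Hypothetical helper: each forest differs only in its random seed, and
    the forests are fitted sequentially here for simplicity.
    """
    selected = None
    for seed in range(n_forests):
        forest = RandomForestClassifier(n_estimators=50, random_state=seed)
        forest.fit(X, y)
        # Indices of the top_k features with the largest importance weights.
        top = set(np.argsort(forest.feature_importances_)[-top_k:].tolist())
        selected = top if selected is None else selected & top
    return sorted(selected)


def encode_2d(X, seed=0):
    """Fit a one-hidden-layer autoencoder and return its 2-D hidden codes.

    Hypothetical helper: the network is trained to reconstruct its own
    standardized input, and the two tanh hidden units are the encoding.
    """
    X_std = StandardScaler().fit_transform(X)
    ae = MLPRegressor(hidden_layer_sizes=(2,), activation="tanh",
                      max_iter=2000, random_state=seed)
    ae.fit(X_std, X_std)
    # Recompute the hidden layer by hand from the learned weights.
    return np.tanh(X_std @ ae.coefs_[0] + ae.intercepts_[0])


# Synthetic stand-in for labelled expression data: 200 samples, 100 features,
# 4 known classes playing the role of the established subtypes.
X, y = make_classification(n_samples=200, n_features=100, n_informative=8,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=1, random_state=0)

features = select_features(X, y)          # supervised feature selection
codes = encode_2d(X[:, features])         # compression to two dimensions
labels = KMeans(n_clusters=4, init="k-means++", n_init=10,
                random_state=0).fit_predict(codes)  # unsupervised clustering
```

Taking the intersection instead of the union keeps only features that every forest ranks highly, which trades recall for stability across random seeds. The number of clusters is fixed at 4 here only because the synthetic data has four classes.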
Pages: 19