ForestSubtype: a cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest

Cited by: 1
Authors
Luo, Junwei [1]
Feng, Yading [1]
Wu, Xuyang [1]
Li, Ruimin [1]
Shi, Jiawei [1]
Chang, Wenjing [1]
Wang, Junfeng [1]
Affiliations
[1] Henan Polytechnic University, School of Software, Jiaozuo, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
Cancer subtyping; Random forest; Gene expression data; Machine learning; Auto Encoder; BREAST-CANCER; PROGNOSTIC BIOMARKER; PREDICTOR; SELECTION; PACKAGE; GROWTH; FOXC1;
DOI
10.1186/s12859-023-05412-y
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Cancer subtype classification supports personalized cancer treatment. Although some approaches have been developed to classify cancer subtypes from high-dimensional gene expression data, satisfactory classification results remain difficult to obtain. Meanwhile, some cancers have been well studied and divided into subtypes that are adopted by most researchers. This prior knowledge is therefore valuable for identifying new meaningful subtypes.
Results: In this paper, we present ForestSubtype, an approach that combines parallel random forests and an autoencoder for cancer subtype identification based on high-dimensional gene expression data. First, ForestSubtype adopts the parallel random forest (RF) and prior knowledge of cancer subtypes to train a module and extract significant candidate features. Second, ForestSubtype uses a random forest as the base module and ten parallel random forests to compute each feature's weight and rank the features separately. The intersection of the features with the larger weights output by the ten parallel random forests is then taken as the set of candidate features for the subsequent steps. Third, ForestSubtype uses an autoencoder to condense the selected features into two-dimensional data. Fourth, ForestSubtype utilizes k-means++ to obtain new cancer subtype identification results. Breast cancer gene expression data obtained from The Cancer Genome Atlas are used for training and validation, and an independent breast cancer dataset from the Molecular Taxonomy of Breast Cancer International Consortium is used for testing. Additionally, two other cancer datasets are used to validate the generalizability of ForestSubtype. ForestSubtype outperforms the two other methods in terms of cluster distribution and internal and external metric results. The open-source code is available at https://github.com/lffyd/ForestSubtype.
Conclusions: Our work shows that combining high-dimensional gene expression data with parallel random forests and an autoencoder, guided by prior knowledge, can identify new subtypes more effectively than existing methods of cancer subtype classification.
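The feature-intersection and k-means++ steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that): random forest training and the autoencoder are omitted, the feature weights are made-up numbers, only three weight vectors are used instead of ten, and the function names `top_k`, `intersect_top_features`, and `kmeans_pp_seeds` are hypothetical. Only the Python standard library is assumed.

```python
import random


def top_k(weights, k):
    """Return the indices of the k largest weights as a set."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(ranked[:k])


def intersect_top_features(weight_vectors, k):
    """Keep only the features that every forest ranks among its top k."""
    selected = top_k(weight_vectors[0], k)
    for weights in weight_vectors[1:]:
        selected &= top_k(weights, k)
    return sorted(selected)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, n_clusters, rng):
    """k-means++ seeding: the first centre is uniform at random; each later
    centre is drawn with probability proportional to its squared distance
    from the nearest centre chosen so far."""
    centres = [rng.choice(points)]
    while len(centres) < n_clusters:
        d2 = [min(squared_distance(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=d2, k=1)[0])
    return centres


# Hypothetical feature weights from three forests over five features
# (assumption for illustration; the abstract describes ten forests).
forest_weights = [
    [0.30, 0.05, 0.25, 0.10, 0.30],
    [0.28, 0.12, 0.20, 0.05, 0.35],
    [0.10, 0.30, 0.25, 0.05, 0.30],
]
selected = intersect_top_features(forest_weights, k=3)
print(selected)  # [2, 4]

# Hypothetical two-dimensional points standing in for autoencoder output.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9)]
seeds = kmeans_pp_seeds(points, n_clusters=2, rng=random.Random(0))
```

Taking the intersection rather than the union keeps only features that are ranked highly by every forest, so it trades recall for stability; the seeds returned here would normally be passed to standard k-means iterations.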
Pages: 19
Related papers
50 in total
  • [1] ForestSubtype: a cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest
    Junwei Luo
    Yading Feng
    Xuyang Wu
    Ruimin Li
    Jiawei Shi
    Wenjing Chang
    Junfeng Wang
    BMC Bioinformatics, 24
  • [2] Deep-learning approach to identifying cancer subtypes using high-dimensional genomic data
    Chen, Runpu
    Yang, Le
    Goodison, Steve
    Sun, Yijun
    BIOINFORMATICS, 2020, 36 (05) : 1476 - 1483
  • [3] Enriched Random Forest for High Dimensional Genomic Data
    Ghosh, Debopriya
    Cabrera, Javier
    IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2022, 19 (05) : 2817 - 2828
  • [4] The Visualization of E-commerce High-dimensional Data Based on Random Forest
    Zhu Xianwen
    Yin Hongtan
    AGRO FOOD INDUSTRY HI-TECH, 2017, 28 (01) : 987 - 991
  • [5] The visualization of e-commerce high-dimensional data based on random forest
    Xianwen, Zhu, 1600, TeknoScienze, Viale Brianza,22, Milano, 20127, Italy (28):
  • [6] Deep learning approach for cancer subtype classification using high-dimensional gene expression data
    Jiquan Shen
    Jiawei Shi
    Junwei Luo
    Haixia Zhai
    Xiaoyan Liu
    Zhengjiang Wu
    Chaokun Yan
    Huimin Luo
    BMC Bioinformatics, 23
  • [7] Deep learning approach for cancer subtype classification using high-dimensional gene expression data
    Shen, Jiquan
    Shi, Jiawei
    Luo, Junwei
    Zhai, Haixia
    Liu, Xiaoyan
    Wu, Zhengjiang
    Yan, Chaokun
    Luo, Huimin
    BMC BIOINFORMATICS, 2022, 23 (01)
  • [8] Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification
    Quist, Jelmar
    Taylor, Lawson
    Staaf, Johan
    Grigoriadis, Anita
    CANCERS, 2021, 13 (05) : 1 - 15
  • [9] Laplacian-Weighted Random Forest for High-Dimensional Data Classification
    Liang, Jianheng
    Huang, Dong
    2019 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (IEEE SSCI 2019), 2019, : 748 - 753
  • [10] Bayesian weighted random forest for classification of high-dimensional genomics data
    Olaniran, Oyebayo Ridwan
    Abdullah, Mohd Asrul A.
    KUWAIT JOURNAL OF SCIENCE, 2023, 50 (04) : 477 - 484