ForestSubtype: a cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest

Cited by: 1
Authors
Luo, Junwei [1]
Feng, Yading [1]
Wu, Xuyang [1]
Li, Ruimin [1]
Shi, Jiawei [1]
Chang, Wenjing [1]
Wang, Junfeng [1]
Affiliations
[1] Henan Polytechnic University, School of Software, Jiaozuo, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Cancer subtyping; Random forest; Gene expression data; Machine learning; Auto Encoder; BREAST-CANCER; PROGNOSTIC BIOMARKER; PREDICTOR; SELECTION; PACKAGE; GROWTH; FOXC1;
DOI
10.1186/s12859-023-05412-y
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Cancer subtype classification is helpful for personalized cancer treatment. Although some approaches have been developed to classify cancer subtypes based on high-dimensional gene expression data, it is difficult to obtain satisfactory classification results. Meanwhile, some cancers have been well studied and classified into subtypes that are adopted by most researchers. Hence, this prior knowledge is significant for further identifying new meaningful subtypes.
Results: In this paper, we present ForestSubtype, an approach that combines parallel random forests and an autoencoder for cancer subtype identification based on high-dimensional gene expression data. First, ForestSubtype adopts the parallel random forest (RF) and the prior knowledge of cancer subtypes to train a model and extract significant candidate features. Second, ForestSubtype uses a random forest as the base model and ten parallel random forests to compute each feature's weight and rank the features separately. The intersection of the features with the larger weights output by the ten parallel random forests is then taken as the subsequent candidate features. Third, ForestSubtype uses an autoencoder to condense the selected features into two-dimensional data. Fourth, ForestSubtype utilizes k-means++ to obtain new cancer subtype identification results. The breast cancer gene expression data obtained from The Cancer Genome Atlas are used for training and validation, and an independent breast cancer dataset from the Molecular Taxonomy of Breast Cancer International Consortium is used for testing. Additionally, we use two other cancer datasets to validate the generalizability of ForestSubtype. ForestSubtype outperforms the other two methods in terms of the distribution of clusters and the internal and external metric results. The open-source code is available at https://github.com/lffyd/ForestSubtype.
Conclusions: Our work shows that the combination of high-dimensional gene expression data, parallel random forests, and an autoencoder, guided by prior knowledge, can identify new subtypes more effectively than existing methods of cancer subtype classification.
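
The four steps described in the abstract can be sketched roughly as follows. This is not the authors' implementation (their code is at the repository linked above). It is a minimal illustration that assumes scikit-learn is available, uses synthetic data in place of gene expression data, and uses an `MLPRegressor` trained to reconstruct its input as a simple autoencoder. The function names (`select_features`, `encode_2d`, `run_demo`) and all hyperparameters (`top_k`, layer sizes, number of trees) are hypothetical choices made for the example, and the ten forests are trained one after another rather than in parallel.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler


def select_features(X, y, n_forests=10, top_k=20):
    """Keep the features ranked in the top_k by every one of n_forests forests."""
    selected = None
    for seed in range(n_forests):
        forest = RandomForestClassifier(n_estimators=100, random_state=seed)
        forest.fit(X, y)  # y holds the known subtype labels (prior knowledge)
        top = set(np.argsort(forest.feature_importances_)[-top_k:].tolist())
        selected = top if selected is None else selected & top
    if not selected:
        raise ValueError("no feature was ranked in the top_k by all forests")
    return sorted(selected)


def encode_2d(X, seed=0):
    """Fit X -> X through a 2-unit bottleneck and return the bottleneck values."""
    autoencoder = MLPRegressor(hidden_layer_sizes=(8, 2, 8), activation="tanh",
                               max_iter=500, random_state=seed)
    autoencoder.fit(X, X)
    hidden = X
    # Forward pass through the first two layers only (the encoder half).
    for weights, bias in zip(autoencoder.coefs_[:2], autoencoder.intercepts_[:2]):
        hidden = np.tanh(hidden @ weights + bias)
    return hidden


def run_demo(n_clusters=4):
    # Synthetic stand-in for an expression matrix: 300 samples, 200 features.
    X, y = make_classification(n_samples=300, n_features=200, n_informative=10,
                               n_redundant=0, n_classes=4, n_clusters_per_class=1,
                               class_sep=2.0, random_state=0)
    selected = select_features(X, y)
    X_selected = StandardScaler().fit_transform(X[:, selected])
    embedding = encode_2d(X_selected)
    kmeans = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                    random_state=0)
    labels = kmeans.fit_predict(embedding)
    return selected, embedding, labels


if __name__ == "__main__":
    selected, embedding, labels = run_demo()
    print("selected features:", len(selected))
    print("embedding shape:", embedding.shape)
    print("cluster sizes:", np.bincount(labels))
```

Taking the intersection, rather than the union, of the per-forest rankings keeps only the features that rank highly regardless of the random seed, which is one plausible reading of why the abstract uses ten forests instead of one.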
Pages: 19
    Lewinger, Juan Pablo
    GENETIC EPIDEMIOLOGY, 2017, 41 (07) : 704 - 704