ForestSubtype: a cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest

Cited by: 1
Authors
Luo, Junwei [1]
Feng, Yading [1]
Wu, Xuyang [1]
Li, Ruimin [1]
Shi, Jiawei [1]
Chang, Wenjing [1]
Wang, Junfeng [1]
Affiliations
[1] Henan Polytechnic University, School of Software, Jiaozuo, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
Cancer subtyping; Random forest; Gene expression data; Machine learning; Auto Encoder; BREAST-CANCER; PROGNOSTIC BIOMARKER; PREDICTOR; SELECTION; PACKAGE; GROWTH; FOXC1;
DOI
10.1186/s12859-023-05412-y
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: Cancer subtype classification is helpful for personalized cancer treatment. Although some approaches have been developed to classify cancer subtypes based on high-dimensional gene expression data, it is difficult to obtain satisfactory classification results. Meanwhile, some cancers have been well studied and classified into subtypes that are adopted by most researchers. Hence, this a priori knowledge is significant for further identifying new meaningful subtypes.
Results: In this paper, we present ForestSubtype, an approach that combines parallel random forests and an autoencoder for cancer subtype identification based on high-dimensional gene expression data. First, ForestSubtype adopts the parallel RF and the a priori knowledge of cancer subtypes to train a module and extract significant candidate features. Second, ForestSubtype uses a random forest as the base module and ten parallel random forests to compute the weight of each feature and rank the features separately. The intersection of the features with the larger weights output by the ten parallel random forests is then taken as the set of candidate features for the subsequent steps. Third, ForestSubtype uses an autoencoder to condense the selected features into two-dimensional data. Fourth, ForestSubtype utilizes k-means++ to obtain new cancer subtype identification results. The breast cancer gene expression data obtained from The Cancer Genome Atlas are used for training and validation, and an independent breast cancer dataset from the Molecular Taxonomy of Breast Cancer International Consortium is used for testing. Additionally, we use two other cancer datasets to validate the generalizability of ForestSubtype. ForestSubtype outperforms the two other methods it is compared with in terms of the distribution of clusters and of internal and external metric results. The open-source code is available at https://github.com/lffyd/ForestSubtype.
Conclusions: Our work shows that combining high-dimensional gene expression data with parallel random forests and an autoencoder, guided by a priori knowledge, can identify new subtypes more effectively than existing methods of cancer subtype classification.
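The four steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the repository linked above for that). It assumes scikit-learn, synthetic data from `make_classification` in place of gene expression profiles, forests trained one after another rather than in parallel, and a tiny `MLPRegressor` standing in for the autoencoder. The function names `select_features` and `encode_2d`, the `top_k` cutoff, and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler


def select_features(X, y, n_forests=10, top_k=40):
    """Keep the features ranked in the top_k by every one of n_forests forests."""
    selected = None
    for seed in range(n_forests):
        # Each forest differs only in its random seed (an assumption of this sketch).
        forest = RandomForestClassifier(n_estimators=100, random_state=seed)
        forest.fit(X, y)
        top = set(np.argsort(forest.feature_importances_)[-top_k:].tolist())
        selected = top if selected is None else selected & top
    if not selected:
        raise ValueError("empty intersection; try a larger top_k")
    return sorted(selected)


def encode_2d(X, seed=0):
    """Fit a one-hidden-layer autoencoder (X -> 2 units -> X) and return the 2-D codes."""
    ae = MLPRegressor(hidden_layer_sizes=(2,), activation="tanh",
                      max_iter=2000, random_state=seed)
    ae.fit(X, X)
    # Manual forward pass through the bottleneck layer only.
    return np.tanh(X @ ae.coefs_[0] + ae.intercepts_[0])


# Synthetic stand-in for expression data: 300 samples, 200 features, 3 known subtypes.
X, y = make_classification(n_samples=300, n_features=200, n_informative=10,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           class_sep=2.0, random_state=0)

selected = select_features(X, y)                      # steps 1 and 2
X_sel = StandardScaler().fit_transform(X[:, selected])
codes = encode_2d(X_sel)                              # step 3
labels = KMeans(n_clusters=3, init="k-means++", n_init=10,
                random_state=0).fit_predict(codes)    # step 4

print("selected features:", len(selected))
print("silhouette (internal metric):", silhouette_score(codes, labels))
print("adjusted Rand index (external metric):", adjusted_rand_score(y, labels))
```

Taking the intersection rather than the union keeps only features that every forest agrees on, which trades recall for stability. The silhouette score and adjusted Rand index are shown as common examples of internal and external metrics; the abstract does not name the metrics the authors used.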
Pages: 19