Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features

Cited by: 11
Authors
Tian, Leqi [1, 2]
Wu, Wenbin [1]
Yu, Tianwei [1, 2, 3]
Affiliations
[1] Chinese Univ Hong Kong, Sch Data Sci, Shenzhen 518172, Peoples R China
[2] Shenzhen Res Inst Big Data, Shenzhen 518172, Peoples R China
[3] Guangdong Prov Key Lab Big Data Comp, Shenzhen 518172, Peoples R China
Keywords
feature selection; random forest; gene network; cancer
DOI
10.3390/biom13071153
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Random Forest (RF) is a widely used machine learning method that performs well on classification and regression tasks. It works well when sample sizes are small, which benefits applications in biology. For example, gene expression data often involve far more features (p) than samples (n). Although the predictive accuracy of RF is often high, selecting important genes with RF is problematic: the genes it selects are usually scattered across the gene network, which conflicts with the biological assumption of functional consistency between effective features. To improve feature selection by incorporating external topological information between genes, we propose the Graph Random Forest (GRF), which identifies highly connected important features by using the known biological network when constructing the forest. The algorithm can identify effective features that form highly connected sub-graphs while achieving classification accuracy equivalent to RF. To evaluate the proposed method, we conducted simulation experiments and applied it to two real datasets: non-small cell lung cancer RNA-seq data from The Cancer Genome Atlas, and a human embryonic stem cell RNA-seq dataset (GSE93593). The resulting high classification accuracy, connectivity of selected sub-graphs, and interpretable feature selection results suggest the method is a helpful addition to graph-based classification models and feature selection procedures.
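The abstract contrasts selected features that are "scattered on the gene network" with features that "form highly connected sub-graphs". One simple way to quantify that difference is the size of the largest connected component in the sub-graph induced by the selected features. The sketch below is only an illustration under that assumption: the function name `largest_component_size`, the toy gene labels, and the toy edge list are hypothetical, and the record does not say which connectivity measure the paper itself uses.

```python
from collections import deque


def largest_component_size(selected, edges):
    """Size of the largest connected component of the sub-graph
    induced by `selected` on an undirected network given as `edges`."""
    selected = set(selected)
    # Keep only edges whose two endpoints were both selected.
    adj = {node: set() for node in selected}
    for u, v in edges:
        if u in selected and v in selected:
            adj[u].add(v)
            adj[v].add(u)

    seen = set()
    best = 0
    for start in selected:
        if start in seen:
            continue
        # Breadth-first search over one component.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        best = max(best, size)
    return best


# Hypothetical toy network: a path g1-g2-g3-g4 plus a separate edge g5-g6.
edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g5", "g6")]
scattered = ["g1", "g3", "g5"]   # no two selected genes share an edge
connected = ["g1", "g2", "g3"]   # selected genes form one path

print(largest_component_size(scattered, edges))
print(largest_component_size(connected, edges))
```

On this toy network the scattered selection yields only isolated nodes, while the connected selection forms a single component, which is the kind of contrast the abstract describes between RF and GRF feature sets.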
Pages: 14