How data heterogeneity affects innovating knowledge and information in gene identification: A statistical learning perspective

Cited by: 0
Authors
Zhao, Jun [1]
Lao, Fangyi [1]
Yan, Guan'ao [2]
Zhang, Yi [3]
Affiliations
[1] Hangzhou City Univ, Dept Stat & Data Sci, Hangzhou, Peoples R China
[2] Univ Calif Los Angeles, Dept Stat, Los Angeles, CA USA
[3] Zhejiang Univ, Sch Math Sci, Hangzhou, Peoples R China
Source
JOURNAL OF INNOVATION & KNOWLEDGE | 2024, Vol. 9, No. 3
Keywords
Data heterogeneity; Gene identification; Statistical learning; Semiparametric modelling; NONCONCAVE PENALIZED LIKELIHOOD; QUANTILE REGRESSION; SELECTION; MODEL;
DOI
10.1016/j.jik.2024.100514
Chinese Library Classification
F [Economics]
Subject classification code
02
Abstract
Data heterogeneity, particularly noted in fields such as genetics, has been identified as a key feature of big data, posing significant challenges to innovation in knowledge and information. This paper focuses on characterizing and understanding the so-called "curse of heterogeneity" in gene identification for low infant birth weight from a statistical learning perspective. Owing to the computational and analytical advantages of expectile regression in handling heterogeneity, this paper proposes a flexible, regularized, partially linear additive expectile regression model for high-dimensional heterogeneous data. Unlike most existing works that assume Gaussian or sub-Gaussian error distributions, we adopt a more realistic, less stringent assumption that the errors have only finite moments. Additionally, we derive a two-step algorithm to address the reduced optimization problem and demonstrate that our method, with a probability approaching one, achieves optimal estimation accuracy. Furthermore, we demonstrate that the proposed algorithm converges at least linearly, ensuring the practical applicability of our method. Monte Carlo simulations reveal that our method's resulting estimator performs well in terms of estimation accuracy, model selection, and heterogeneity identification. Empirical analysis in gene trait expression further underscores the potential for guiding public health interventions. © 2024 The Authors. Published by Elsevier España, S.L.U. on behalf of Journal of Innovation & Knowledge. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
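For readers unfamiliar with expectile regression, the following is a minimal sketch of the basic idea only: an asymmetric squared loss fitted by iteratively reweighted least squares on a plain linear model. It is not the regularized partially linear additive model or the two-step algorithm described in the abstract. The function names, the simulated data, and the choice of expectile levels are all illustrative assumptions.

```python
import numpy as np

def expectile_loss(residuals, tau):
    """Asymmetric squared loss: weight tau on non-negative residuals, 1 - tau otherwise."""
    r = np.asarray(residuals, dtype=float)
    weights = np.where(r >= 0, tau, 1.0 - tau)
    return weights * r ** 2

def fit_linear_expectile(X, y, tau, n_iter=100, tol=1e-10):
    """Fit y ~ X @ beta at expectile level tau (hypothetical helper, IRLS)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from ordinary least squares
    for _ in range(n_iter):
        # Recompute the asymmetric weights from the current residuals.
        w = np.where(y - X @ beta >= 0, tau, 1.0 - tau)
        Xw = X * w[:, None]
        # Solve the weighted normal equations X^T W X beta = X^T W y.
        new_beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Simulated heteroscedastic data (an assumption for illustration):
# the noise scale grows with x, so the upper and lower expectile lines fan out.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=500)
y = 1.0 + 2.0 * x + (0.5 + 2.0 * x) * rng.standard_normal(500)
X = np.column_stack([np.ones_like(x), x])

beta_low = fit_linear_expectile(X, y, tau=0.1)
beta_mid = fit_linear_expectile(X, y, tau=0.5)
beta_high = fit_linear_expectile(X, y, tau=0.9)
print("slopes at tau = 0.1, 0.5, 0.9:", beta_low[1], beta_mid[1], beta_high[1])
```

At tau = 0.5 the weights are constant and the fit reduces to ordinary least squares. Slopes that differ across expectile levels are one simple signal of the kind of heterogeneity the abstract refers to.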
Pages: 10