scGPT: toward building a foundation model for single-cell multi-omics using generative AI

Cited by: 155
Authors
Cui, Haotian [1, 2, 3]
Wang, Chloe [1, 2, 3]
Maan, Hassaan [1, 3, 4]
Pang, Kuan [2, 3]
Luo, Fengning [2, 3]
Duan, Nan [5]
Wang, Bo [1, 2, 3, 4, 6, 7]
Affiliations
[1] Univ Hlth Network, Peter Munk Cardiac Ctr, Toronto, ON, Canada
[2] Univ Toronto, Dept Comp Sci, Toronto, ON, Canada
[3] Vector Inst, Toronto, ON, Canada
[4] Univ Toronto, Dept Med Biophys, Toronto, ON, Canada
[5] Microsoft Res, Redmond, WA, USA
[6] Univ Toronto, Dept Lab Med & Pathobiol, Toronto, ON, Canada
[7] Univ Hlth Network, AI Hub, Toronto, ON, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
ATLAS;
DOI
10.1038/s41592-024-02201-0
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Generative pretrained models have achieved remarkable success in various domains such as language and computer vision. Specifically, the combination of large-scale diverse datasets and pretrained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between language and cellular biology (in which texts comprise words; similarly, cells are defined by genes), our study probes the applicability of foundation models to advance cellular biology and genetic research. Using burgeoning single-cell sequencing data, we have constructed a foundation model for single-cell biology, scGPT, based on a generative pretrained transformer across a repository of over 33 million cells. Our findings illustrate that scGPT effectively distills critical biological insights concerning genes and cells. Through further adaptation of transfer learning, scGPT can be optimized to achieve superior performance across diverse downstream applications. This includes tasks such as cell type annotation, multi-batch integration, multi-omic integration, perturbation response prediction and gene network inference. Pretrained using over 33 million single-cell RNA-sequencing profiles, scGPT is a foundation model facilitating a broad spectrum of downstream single-cell analysis tasks by transfer learning.
Pages: 1470-1480
Number of pages: 23