scGPT: toward building a foundation model for single-cell multi-omics using generative AI

Cited by: 260
Authors
Cui, Haotian [1, 2, 3]
Wang, Chloe [1, 2, 3]
Maan, Hassaan [1, 3, 4]
Pang, Kuan [2, 3]
Luo, Fengning [2, 3]
Duan, Nan [5]
Wang, Bo [1, 2, 3, 4, 6, 7]
Affiliations
[1] Univ Hlth Network, Peter Munk Cardiac Ctr, Toronto, ON, Canada
[2] Univ Toronto, Dept Comp Sci, Toronto, ON, Canada
[3] Vector Inst, Toronto, ON, Canada
[4] Univ Toronto, Dept Med Biophys, Toronto, ON, Canada
[5] Microsoft Res, Redmond, WA, USA
[6] Univ Toronto, Dept Lab Med & Pathobiol, Toronto, ON, Canada
[7] Univ Hlth Network, AI Hub, Toronto, ON, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
ATLAS
DOI
10.1038/s41592-024-02201-0
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Generative pretrained models have achieved remarkable success in various domains such as language and computer vision. Specifically, the combination of large-scale diverse datasets and pretrained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between language and cellular biology (in which texts comprise words; similarly, cells are defined by genes), our study probes the applicability of foundation models to advance cellular biology and genetic research. Using burgeoning single-cell sequencing data, we have constructed a foundation model for single-cell biology, scGPT, based on a generative pretrained transformer across a repository of over 33 million cells. Our findings illustrate that scGPT effectively distills critical biological insights concerning genes and cells. Through further adaptation of transfer learning, scGPT can be optimized to achieve superior performance across diverse downstream applications. This includes tasks such as cell type annotation, multi-batch integration, multi-omic integration, perturbation response prediction and gene network inference. Pretrained using over 33 million single-cell RNA-sequencing profiles, scGPT is a foundation model facilitating a broad spectrum of downstream single-cell analysis tasks by transfer learning.
Pages: 1470-1480
Page count: 23