Embeddings of genomic region sets capture rich biological associations in lower dimensions

Cited by: 13
Authors
Gharavi, Erfaneh [1 ,2 ]
Gu, Aaron [1 ,3 ]
Zheng, Guangtao [3 ]
Smith, Jason P. [1 ,4 ]
Cho, Hyun Jae [1 ,3 ]
Zhang, Aidong [3 ]
Brown, Donald E. [2 ]
Sheffield, Nathan C. [1 ,2 ,4 ,5 ,6 ]
Affiliations
[1] Univ Virginia, Ctr Publ Hlth Genom, Charlottesville, VA 22903 USA
[2] Univ Virginia, Sch Data Sci, Charlottesville, VA 22903 USA
[3] Univ Virginia, Dept Comp Sci, Charlottesville, VA 22903 USA
[4] Univ Virginia, Dept Biochem & Mol Genet, Charlottesville, VA 22903 USA
[5] Univ Virginia, Dept Publ Hlth Sci, Charlottesville, VA 22903 USA
[6] Univ Virginia, Dept Biomed Engn, Charlottesville, VA 22903 USA
Funding
National Science Foundation (USA); National Institutes of Health (USA);
Keywords
DNA; CHROMATIN; ELEMENTS;
DOI
10.1093/bioinformatics/btab439
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Motivation: Genomic region sets summarize functional genomics data and define locations of interest in the genome such as regulatory regions or transcription factor binding sites. The number of publicly available region sets has increased dramatically, leading to challenges in data analysis.

Results: We propose a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. We compared our approach to two simpler methods based on interval unions or term frequency-inverse document frequency and evaluated the methods in three ways: first, by classifying the cell line, antibody or tissue type of the region set; second, by assessing whether similarity among embeddings can reflect simulated random perturbations of genomic regions; and third, by testing robustness of the proposed representations to different signal thresholds for calling peaks. Our word2vec-based region set embeddings reduce dimensionality from more than a hundred thousand to 100 without significant loss in classification performance. The vector representation could identify cell line, antibody and tissue type with over 90% accuracy. We also found that the vectors could quantitatively summarize simulated random perturbations to region sets and are more robust to subsampling the data derived from different peak calling thresholds. Our evaluations demonstrate that the vectors retain useful biological information in relatively lower-dimensional spaces. We propose that vector representation of region sets is a promising approach for efficient analysis of genomic region data.
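The term frequency-inverse document frequency baseline named in the abstract can be pictured with a toy sketch. This is not the authors' implementation: it assumes each region set has already been mapped onto a shared "universe" of regions (a step this record does not describe), and the function names, region labels, and the smoothed IDF formula are all illustrative choices. Each region set is treated as a document and each universe region as a term.

```python
import math
from collections import Counter


def tfidf_vectors(region_sets, universe):
    """Return one TF-IDF vector per region set, with one entry per universe region.

    Hypothetical helper for illustration only; the weighting (relative term
    frequency times log(N / df) + 1) is an assumed variant, not taken from the paper.
    """
    n_sets = len(region_sets)
    # Document frequency: number of region sets that contain each region.
    df = Counter(region for regions in region_sets for region in set(regions))
    vectors = []
    for regions in region_sets:
        tf = Counter(regions)
        total = len(regions)
        vec = []
        for region in universe:
            if tf[region] == 0:
                vec.append(0.0)
                continue
            idf = math.log(n_sets / df[region]) + 1.0
            vec.append((tf[region] / total) * idf)
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy universe and region sets (made-up coordinates, not real data).
universe = ["chr1:100-200", "chr1:500-600", "chr2:50-150", "chr2:900-1000"]
set_a = ["chr1:100-200", "chr1:500-600", "chr2:50-150"]
set_b = ["chr1:100-200", "chr1:500-600"]
set_c = ["chr2:900-1000"]

vectors = tfidf_vectors([set_a, set_b, set_c], universe)
sim_ab = cosine(vectors[0], vectors[1])  # sets sharing two regions
sim_ac = cosine(vectors[0], vectors[2])  # sets sharing no regions
print(sim_ab > sim_ac)
```

In this representation the vector length equals the size of the universe, which is the high-dimensional setting the abstract contrasts with 100-dimensional embeddings.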
Pages: 4299-4306
Page count: 8
Related papers
36 in total
[1]   Principal component analysis [J].
Abdi, Herve ;
Williams, Lynne J. .
WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL STATISTICS, 2010, 2 (04) :433-459
[2]   Dimensionality reduction for visualizing single-cell data using UMAP [J].
Becht, Etienne ;
McInnes, Leland ;
Healy, John ;
Dutertre, Charles-Antoine ;
Kwok, Immanuel W. H. ;
Ng, Lai Guan ;
Ginhoux, Florent ;
Newell, Evan W. .
NATURE BIOTECHNOLOGY, 2019, 37 (01) :38-+
[3]   Representation Learning: A Review and New Perspectives [J].
Bengio, Yoshua ;
Courville, Aaron ;
Vincent, Pascal .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2013, 35 (08) :1798-1828
[4]   Sequence embedding for fast construction of guide trees for multiple sequence alignment [J].
Blackshields, Gordon ;
Sievers, Fabian ;
Shi, Weifeng ;
Wilm, Andreas ;
Higgins, Desmond G. .
ALGORITHMS FOR MOLECULAR BIOLOGY, 2010, 5
[5]   Buenrostro JD, 2013, NAT METHODS, V10, P1213, DOI 10.1038/nmeth.2688
[6]   Assessment of computational methods for the analysis of single-cell ATAC-seq data [J].
Chen, Huidong ;
Lareau, Caleb A. ;
Andreani, Tommaso ;
Vinyard, Michael E. ;
Garcia, Sara P. ;
Clement, Kendell ;
Andrade-Navarro, Miguel ;
Buenrostro, Jason D. ;
Pinello, Luca .
GENOME BIOLOGY, 2019, 20 (01)
[7]   Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution [J].
Corces, M. Ryan ;
Buenrostro, Jason D. ;
Wu, Beijing ;
Greenside, Peyton G. ;
Chan, Steven M. ;
Koenig, Julie L. ;
Snyder, Michael P. ;
Pritchard, Jonathan K. ;
Kundaje, Anshul ;
Greenleaf, William J. ;
Majeti, Ravindra ;
Chang, Howard Y. .
NATURE GENETICS, 2016, 48 (10) :1193-1203
[8]   Sequence2Vec: a novel embedding approach for modeling transcription factor binding affinity landscape [J].
Dai, Hanjun ;
Umarov, Ramzan ;
Kuwahara, Hiroyuki ;
Li, Yu ;
Song, Le ;
Gao, Xin .
BIOINFORMATICS, 2017, 33 (22) :3575-3583
[9]   Epigenomic annotation-based interpretation of genomic data: from enrichment analysis to machine learning [J].
Dozmorov, Mikhail G. .
BIOINFORMATICS, 2017, 33 (20) :3323-3330
[10]   Gene2vec: distributed representation of genes based on co-expression [J].
Du, Jingcheng ;
Jia, Peilin ;
Dai, Yulin ;
Tao, Cui ;
Zhao, Zhongming ;
Zhi, Degui .
BMC GENOMICS, 2019, 20 (Suppl 1)