InvBFM: finding genomic inversions from high-throughput sequence data based on feature mining

Cited by: 4
Authors
Wu, Zhongjia [1]
Wu, Yufeng [2]
Gao, Jingyang [1]
Affiliations
[1] Beijing Univ Chem Technol, Coll Informat Sci & Technol, Beijing, Peoples R China
[2] Univ Connecticut, Dept Comp Sci & Engn, Storrs, CT USA
Funding
Beijing Natural Science Foundation; National Science Foundation (USA);
Keywords
Genomics; High-throughput sequencing; Structural variation; Inversion; Support vector machine; PAIRED-END; DISCOVERY; GENE;
DOI
10.1186/s12864-020-6585-1
Chinese Library Classification
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Discipline classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: Genomic inversion is one type of structural variation (SV) and is known to play an important biological role. Calling inversions from high-throughput sequence data is an established problem in sequence data analysis. Inversions are more difficult to detect because they are often surrounded by duplications or other types of SVs in the inversion areas. Existing inversion detection tools are mainly based on three approaches: paired-end reads, split-mapped reads, and assembly. However, existing tools suffer from unsatisfactory precision or sensitivity (e.g., only 50~60% sensitivity), which needs to be improved.
Results: In this paper, we present a new inversion calling method called InvBFM, which calls inversions based on feature mining. InvBFM first gathers the results of existing inversion detection tools as candidate inversions. It then extracts features from the candidates. Finally, it calls the true inversions with a trained support vector machine (SVM) classifier.
Conclusions: Our results on real sequence data from the 1000 Genomes Project show that, by combining feature mining and a machine learning model, InvBFM outperforms existing tools. InvBFM is written in Python and Shell and is available for download at https://github.com/wzj1234/InvBFM.
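The three steps named in the abstract (gather candidates, extract features, classify with an SVM) can be illustrated with a minimal, standard-library-only sketch. This is not the InvBFM implementation: the three features, the toy training values, the candidate names, and the plain subgradient-descent trainer for a linear SVM are all assumptions chosen for illustration, and the abstract does not specify which features or SVM kernel InvBFM uses.

```python
# Hypothetical feature vector per candidate inversion (NOT taken from InvBFM):
# (fraction of discordant read pairs, fraction of split reads,
#  fraction of existing tools that reported the candidate)
# Label +1 = true inversion, -1 = false candidate. All values are made up.
TRAIN = [
    ((0.80, 0.60, 1.00), 1),
    ((0.70, 0.50, 0.67), 1),
    ((0.60, 0.70, 0.67), 1),
    ((0.10, 0.05, 0.33), -1),
    ((0.20, 0.10, 0.33), -1),
    ((0.05, 0.15, 0.33), -1),
]


def decision(w, b, x):
    """Signed score of a linear SVM: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(samples, epochs=200, lr=0.1, lam=0.01):
    """Fit (w, b) by subgradient descent on the L2-regularised hinge loss."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            if y * decision(w, b, x) < 1:
                # Sample violates the margin: step along the hinge subgradient.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Margin satisfied: only the regulariser shrinks w.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def call_inversions(w, b, candidates):
    """Keep the candidates that the classifier scores as positive."""
    return [name for name, x in candidates if decision(w, b, x) > 0]


W, B = train_linear_svm(TRAIN)

# Candidates pooled from existing callers (names and values are hypothetical).
CANDIDATES = [
    ("cand_A", (0.75, 0.55, 1.00)),
    ("cand_B", (0.08, 0.10, 0.33)),
]
CALLED = call_inversions(W, B, CANDIDATES)
print(CALLED)
```

In practice a library SVM with feature scaling and cross-validation would likely replace the hand-written trainer; the sketch only shows how pooled candidates, per-candidate features, and a trained classifier fit together.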
Pages: 10