Interpreting alignment-free sequence comparison: what makes a score a good score?

Cited by: 2
Authors
Swain, Martin T. [1 ]
Vickers, Martin [2 ]
Affiliations
[1] Aberystwyth Univ, Dept Life Sci, Aberystwyth SY23 3DA, Ceredigion, Wales
[2] Norwich Res Pk, John Innes Ctr, Norwich NR4 7UH, Norfolk, England
Funding
UK Biotechnology and Biological Sciences Research Council (BBSRC);
Keywords
CHAOS GAME REPRESENTATION; K-WORD MATCHES; GENOME SEQUENCE; SIMILARITY; GENERATION; SEARCH; METRICS; TOOLS; BLAST;
DOI
10.1093/nargab/lqac062
Chinese Library Classification (CLC) number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
Alignment-free methods are alternatives to alignment-based methods when searching sequence data sets. The output from an alignment-free sequence comparison is a similarity score, the interpretation of which is not straightforward. We propose objective functions to interpret and calibrate outputs from alignment-free searches, noting that different objective functions are necessary for different biological contexts. This leads to advantages: visualising and comparing score distributions, including those from true positives, may be a relatively simple method to gain insight into the performance of different metrics. Using an empirical approach with both DNA and protein sequences, we characterise different similarity score distributions generated under different parameters. In particular, we demonstrate how sequence length can affect the scores. We show that scores of true positive sequence pairs may correlate significantly with their mean length; and even if the correlation is weak, the relative difference in length of the sequence pair may significantly reduce the effectiveness of alignment-free metrics. Importantly, we show how objective functions can be used with test data to accurately estimate the probability of true positives. This can significantly increase the utility of alignment-free approaches. Finally, we have developed a general-purpose software tool called KAST for use in high-throughput workflows on Linux clusters.
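To illustrate the abstract's point that sequence length can affect alignment-free scores, the following is a minimal sketch, not the KAST implementation. It assumes a generic k-mer (k-word) profile compared by Euclidean distance, which is only one of many possible alignment-free measures. The function names `kmer_profile`, `euclidean` and `true_positive_fraction`, and the toy sequences, are hypothetical and chosen for illustration only.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=3, normalise=True):
    """Count overlapping k-mers; optionally convert counts to frequencies."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if not normalise or total == 0:
        return dict(counts)
    return {word: n / total for word, n in counts.items()}


def euclidean(p, q):
    """Euclidean distance between two sparse k-mer profiles (0 = identical)."""
    words = set(p) | set(q)
    return sqrt(sum((p.get(w, 0) - q.get(w, 0)) ** 2 for w in words))


def true_positive_fraction(scored_pairs, threshold):
    """Empirical fraction of true positives among pairs scoring <= threshold.

    scored_pairs: list of (distance, is_true_positive) tuples from labelled
    test data (an assumed input format, not one taken from the paper).
    Returns None when no pair falls under the threshold.
    """
    hits = [is_tp for dist, is_tp in scored_pairs if dist <= threshold]
    return sum(hits) / len(hits) if hits else None


# Two toy sequences with near-identical composition but different lengths.
short_seq = "ACGTACGTAC"
long_seq = short_seq * 5

# Raw counts grow with length, so the distance is dominated by the length gap.
raw = euclidean(kmer_profile(short_seq, normalise=False),
                kmer_profile(long_seq, normalise=False))
# Frequencies remove most, but not necessarily all, of that length effect.
freq = euclidean(kmer_profile(short_seq), kmer_profile(long_seq))
print(f"raw-count distance: {raw:.3f}, frequency distance: {freq:.3f}")
```

With labelled test pairs, `true_positive_fraction` gives a crude empirical estimate of how often a pair under a given distance threshold is a true positive; the objective functions described in the abstract are a more principled way to make that calibration.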
Pages: 22