Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data

Cited by: 146
Authors
Chung, Neo Christopher [1]
Miasojedow, Blazej [2]
Startek, Michal [1]
Gambin, Anna [1]
Affiliations
[1] Univ Warsaw, Fac Math Informat & Mech, Inst Informat, Stefana Banacha 2, PL-02097 Warsaw, Poland
[2] Polish Acad Sci, Inst Math, Jana & Jedrzeja Sniadeckich 8, PL-00656 Warsaw, Poland
Keywords
Jaccard; Tanimoto; Binary similarity; Presence-absence; Co-occurrences; P-value; SPECIES COOCCURRENCES; BETA-DIVERSITY; NONRANDOMNESS; COMMUNITIES; MODEL;
DOI
10.1186/s12859-019-3118-5
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Surveys of the presence and absence of specific species across multiple biogeographic units (or bioregions) are used in a broad range of biological studies, from ecology to microbiology. Using binary presence-absence data, we evaluate species co-occurrences that help elucidate relationships among organisms and environments. To summarize the similarity between the occurrences of two species, we routinely use the Jaccard/Tanimoto coefficient, which is the ratio of the size of their intersection to the size of their union. It is natural, then, to identify statistically significant Jaccard/Tanimoto coefficients, which suggest non-random co-occurrences of species. However, statistical hypothesis testing with this similarity coefficient has seldom been used or studied.
Results: We introduce a hypothesis test of similarity for biological presence-absence data, using the Jaccard/Tanimoto coefficient. Several key improvements are presented, including unbiased estimation of the expectation and centered Jaccard/Tanimoto coefficients that account for occurrence probabilities. The exact and asymptotic solutions are derived. To overcome the computational burden of high dimensionality, we propose bootstrap and measurement concentration algorithms that efficiently estimate the statistical significance of binary similarity. Comprehensive simulation studies demonstrate that the proposed methods produce accurate p-values and false discovery rates. The proposed estimation methods are orders of magnitude faster than the exact solution, particularly as dimensionality increases. We showcase their application in evaluating co-occurrences of bird species on 28 islands of Vanuatu and of fish species in 3347 freshwater habitats in France. The proposed methods are implemented in an open source R package called jaccard (https://cran.r-project.org/package=jaccard).
Conclusion: We introduce a suite of statistical methods for the Jaccard/Tanimoto similarity coefficient for binary data that enable straightforward incorporation of probabilistic measures into the analysis of species co-occurrences. Owing to their generality, the proposed methods and implementations are applicable to a wide range of binary data arising in genomics, biochemistry, and other areas of science.
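Illustration: the quantities named in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation in the jaccard R package. The function names are hypothetical. The centering term is taken to be the ratio of the expected intersection to the expected union when the two species occur independently with their observed frequencies. The p-value comes from a simple permutation test, swapped in here for brevity; it is not the exact, asymptotic, bootstrap, or measurement concentration procedure of the paper.

```python
import random


def jaccard(x, y):
    """Jaccard/Tanimoto coefficient of two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("presence-absence vectors must have equal length")
    both = sum(1 for a, b in zip(x, y) if a and b)    # size of intersection
    either = sum(1 for a, b in zip(x, y) if a or b)   # size of union
    return both / either if either else 0.0


def centered_jaccard(x, y):
    """Observed coefficient minus its assumed value under independence.

    Assumption: with occurrence frequencies px and py, the reference value
    is px*py / (px + py - px*py), i.e. expected intersection over
    expected union for independent species.
    """
    m = len(x)
    px, py = sum(x) / m, sum(y) / m
    denom = px + py - px * py
    expected = px * py / denom if denom else 0.0
    return jaccard(x, y) - expected


def permutation_pvalue(x, y, n_resamples=2000, seed=0):
    """Two-sided permutation p-value (illustrative stand-in, see lead-in)."""
    rng = random.Random(seed)
    observed = abs(centered_jaccard(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(shuffled)  # breaks any association, keeps frequencies
        if abs(centered_jaccard(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical presence-absence records of two species at 8 sites.
species_a = [1, 1, 1, 0, 0, 0, 1, 0]
species_b = [1, 1, 0, 0, 0, 0, 1, 0]
print(jaccard(species_a, species_b))            # intersection 3, union 4
print(centered_jaccard(species_a, species_b))
print(permutation_pvalue(species_a, species_b))
```

With only 8 sites the permutation distribution is very coarse; the example is meant to show the shape of the computation, not a realistic analysis.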
Pages: 11