Discovery of power-laws in chemical space

Cited by: 51
Authors
Benz, Ryan W. [1]
Swamidass, S. Joshua [1]
Baldi, Pierre [1, 2]
Affiliations
[1] Univ Calif Irvine, Sch Informat & Comp Sci, Inst Genom & Bioinformat, Irvine, CA 92697 USA
[2] Univ Calif Irvine, Sch Informat & Comp Sci, Dept Biol Chem, Irvine, CA 92697 USA
DOI
10.1021/ci700353m
Chinese Library Classification (CLC) number
R914 [Medicinal chemistry]
Discipline classification code
100701
Abstract
Power-law distributions have been observed in a wide variety of areas. To our knowledge, however, there has been no systematic observation of power-law distributions in chemoinformatics. Here, we present several examples of power-law distributions arising from the features of small, organic molecules. The distributions of rigid segments and ring systems, the distributions of molecular paths and circular substructures, and the sizes of molecular similarity clusters all show linear trends on log-log rank/frequency plots, suggesting underlying power-law distributions. The number of unique features also follows Heaps'-like laws. The characteristic exponents of the power-laws lie in the 1.5-3 range, consistent with the exponents observed in other power-law phenomena. The power-law nature of these distributions leads to several applications, including the prediction of the growth of available data through Heaps' law and the optimal allocation of experimental or computational resources via the 80/20 rule. More importantly, we also show how the power-laws can be leveraged to efficiently compress chemical fingerprints in a lossless manner, which is useful for the improved storage and retrieval of molecules in large chemical databases.
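The log-log rank/frequency diagnostic mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `rank_frequency` and `loglog_slope` are hypothetical, the data are synthetic feature labels rather than molecular features, and an ordinary least-squares fit on log-log axes is only a rough check for a linear trend, not necessarily the fitting procedure used in the paper.

```python
import math
from collections import Counter


def rank_frequency(features):
    """Return (rank, count) pairs, most frequent feature first."""
    counts = sorted(Counter(features).values(), reverse=True)
    return list(enumerate(counts, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic stand-in for a feature list: the k-th feature occurs
# round(1000 / k) times, so counts fall off roughly as 1 / rank.
features = []
for k in range(1, 51):
    features.extend([f"feature_{k}"] * round(1000 / k))

slope = loglog_slope(rank_frequency(features))
print(f"log-log rank/frequency slope: {slope:.2f}")
```

A slope that stays roughly constant across the plotted ranks is what "linear trend on a log-log rank/frequency plot" refers to. With real data, the tail of the plot is usually noisy, so a straight-line fit like this one should be read as suggestive rather than as proof of a power law.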
Pages: 1138-1151
Page count: 14