Revealing and avoiding bias in semantic similarity scores for protein pairs

Cited by: 34
Authors
Wang, Jing [1]
Zhou, Xianxiao [1]
Zhu, Jing [1]
Zhou, Chenggui [1]
Guo, Zheng [1,2]
Affiliations
[1] Univ Elect Sci & Technol China, Bioinformat Ctr, Sch Life Sci & Technol, Chengdu 610054, Peoples R China
[2] Harbin Med Univ, Coll Bioinformat Sci & Technol, Harbin 150086, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
GENE ONTOLOGY; INTERACTION NETWORK; FUNCTIONAL SIMILARITY; MICROARRAY DATA; DISEASE; EXPRESSION; PREDICTION; SEQUENCE; MODULES; CANCER;
DOI
10.1186/1471-2105-11-290
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: Semantic similarity scores for protein pairs are widely applied in functional genomics research for finding functional clusters of proteins, predicting protein functions and protein-protein interactions, and identifying putative disease genes. However, because some proteins, such as those related to diseases, tend to be studied more intensively, annotations are likely to be biased, which may affect applications based on semantic similarity measures. Thus, it is necessary to evaluate the effects of this bias on semantic similarity scores between proteins and then find a method to avoid them.

Results: First, we evaluated 14 commonly used semantic similarity scores for protein pairs and demonstrated that they correlated significantly with the numbers of annotation terms for the proteins (also known as the protein annotation length). These results suggested that current applications of the semantic similarity scores between proteins might be unreliable. Then, to reduce this annotation bias effect, we proposed normalizing the semantic similarity scores between proteins using the power transformation of the scores. We provide evidence that this improves performance in some applications.

Conclusions: Current semantic similarity measures for protein pairs are highly dependent on protein annotation lengths, which are subject to biological research bias. This affects applications that are based on these semantic similarity scores, especially clustering studies that rely on score magnitudes. The normalized scores proposed in this paper can reduce the effects of this bias to some extent.
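The bias check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact correlation statistic, the definition of annotation length for a pair, or the exponent of the power transformation. The sketch assumes Spearman rank correlation, a pair's annotation length taken as the summed term counts of its two proteins, and a hypothetical exponent; the toy numbers are invented purely for illustration.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def power_transform(scores, exponent):
    """Element-wise power transform of non-negative scores.

    The exponent is a hypothetical parameter; the abstract does not
    state how the paper chooses it.
    """
    return [s ** exponent for s in scores]


# Invented toy data: one entry per protein pair.
# Assumption: pair annotation length = sum of both proteins' term counts.
pair_annotation_lengths = [2, 4, 6, 8, 10]
pair_similarity_scores = [0.1, 0.3, 0.2, 0.6, 0.9]

rho = spearman(pair_annotation_lengths, pair_similarity_scores)
print(f"Spearman correlation with annotation length: {rho:.2f}")

rescaled = power_transform(pair_similarity_scores, 0.5)
print("Power-transformed scores:", [round(s, 3) for s in rescaled])
```

A strongly positive correlation in such a check would indicate that heavily annotated pairs tend to receive higher scores. Note that a power transform with a positive exponent is monotonic on non-negative scores, so on its own it changes score magnitudes but not their rank order; how the paper combines it with annotation length is not stated in the abstract.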
Pages: 11