Integrating large and distributed life sciences resources for systems biology research: Progress and new challenges

Cited by: 1
Authors
Jamil H. [1]
Affiliations
[1] Department of Computer Science, Wayne State University
Source
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | 2011 / Vol. 6790 LNCS
Funding
National Science Foundation (US); National Institutes of Health (US);
Keywords
Compendex;
DOI
10.1007/978-3-642-23074-5_9
Chinese Library Classification number
Subject classification number
Abstract
Researchers in Systems Biology routinely access vast collections of hidden web research resources freely available on the internet. These collections include online data repositories, online and downloadable data analysis tools, publications, text mining systems, and visualization artifacts. Almost always, these resources have complex data formats that are heterogeneous in representation, data type, interpretation and even identity. Researchers are therefore often forced to develop analysis pipelines and data management applications that involve extensive and prohibitive manual interactions. Such approaches act as a barrier to optimal use of these resources and thus impede the progress of research. In this paper, we discuss our experience of building a new middleware approach to data and application integration for Systems Biology that leverages recent developments in schema matching, wrapper generation, workflow management, and query language design. This approach advocates ad hoc integration of arbitrary resources and computational pipeline construction using a declarative language. We highlight the features and advantages of this new data management system, called LifeDB, and its query language BioFlow. Based on our experience, we highlight the new challenges it raises and potential solutions to these research issues on the way toward a viable platform for large scale autonomous data integration. We believe the research issues we raise are of general interest to the autonomous data integration community and apply equally to research unrelated to LifeDB. © 2011 Springer-Verlag Berlin Heidelberg.
Pages: 208-237
Number of pages: 29
Related papers
76 in total
[1] The Open Protein Structure Annotation Network
[2] Ahmed E., Jamil H., Post processing wrapper generated tables for labeling anonymous datasets, ACM International Workshop on Web Information and Data Management, (2009)
[3] Ala U., Piro R.M., Grassi E., Damasco C., Silengo L., Oti M., Provero P., Cunto F.D., Prediction of human disease genes by human-mouse conserved coexpression analysis, PLoS Computational Biology, 4, 3, pp. 1-17, (2008)
[4] Altintas I., Berkley C., Jaeger E., Jones M., Ludascher B., Mock S., Kepler: An extensible system for design and execution of scientific workflows, SSDBM, (2004)
[5] Amin M.S., Jamil H., Top-k similar graph enumeration using neighborhood biased, β-signatures in Biological Networks, (2010)
[6] Amin M.S., Bhattacharjee A., Jamil H., Wikipedia driven autonomous label assignment in wrapper induced tables with missing column names, ACM International Symposium on Applied Computing, pp. 1656-1660, (2010)
[7] Amin M.S., Bhattacharjee A., Russell J., Finley L., Jamil H., A stochastic approach to candidate disease gene subnetwork extraction, ACM International Symposium on Applied Computing, pp. 1534-1538, (2010)
[8] Amin M.S., Jamil H., Ontology guided autonomous label assignment for wrapper induced tables with missing column names, IEEE International Conference on Information Reuse and Integration, (2009)
[9] Amin M.S., Jamil H., An efficient web-based wrapper and annotator for tabular data, International Journal of Software Engineering and Knowledge Engineering, 20, 2, pp. 215-231, (2010)
[10] Aumueller D., Do H.-H., Massmann S., Rahm E., Schema and ontology matching with COMA++, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 906-908, (2005)