1
Cheng T, Chin PJ, Cha K, Petrick N, Mikailov M. Profiling the BLAST bioinformatics application for load balancing on high-performance computing clusters. BMC Bioinformatics 2022; 23:544. [PMID: 36526957 PMCID: PMC9758941 DOI: 10.1186/s12859-022-05029-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2022] [Accepted: 10/31/2022] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND The Basic Local Alignment Search Tool (BLAST) is a suite of commonly used algorithms for identifying matches between biological sequences. The user supplies a database file and query file of sequences for BLAST to find identical sequences between the two. The typical millions of database and query sequences make BLAST computationally challenging but also well suited for parallelization on high-performance computing clusters. The efficacy of parallelization depends on the data partitioning, where the optimal data partitioning relies on an accurate performance model. In previous studies, a BLAST job was sped up by 27 times by partitioning the database and query among thousands of processor nodes. However, the optimality of the partitioning method was not studied. Unlike BLAST performance models proposed in the literature that usually have problem size and hardware configuration as the only variables, the execution time of a BLAST job is a function of database size, query size, and hardware capability. In this work, the nucleotide BLAST application BLASTN was profiled using three methods: shell-level profiling with the Unix "time" command, code-level profiling with the built-in "profiler" module, and system-level profiling with the Unix "gprof" program. The runtimes were measured for six node types, using six different database files and 15 query files, on a heterogeneous HPC cluster with 500+ nodes. The empirical measurement data were fitted with quadratic functions to develop performance models that were used to guide the data parallelization for BLASTN jobs. RESULTS Profiling results showed that BLASTN contains more than 34,500 different functions, but a single function, RunMTBySplitDB, takes 99.12% of the total runtime. Among its 53 child functions, five core functions were identified to make up 92.12% of the overall BLASTN runtime. 
Based on the performance models, static load balancing algorithms can be applied to the BLASTN input data to minimize the runtime of the longest job on an HPC cluster. Four test cases were run on homogeneous and heterogeneous clusters. The experimental results showed that redistributing the workload reduced the runtime by 81% on a homogeneous cluster and by 20% on a heterogeneous cluster. DISCUSSION Optimal data partitioning can improve BLASTN's overall runtime 5.4-fold in comparison with dividing the database and query into the same number of fragments. The proposed methodology can be applied to the other applications in the BLAST+ suite, or to any other application whose source code is available.
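The static load balancing idea in the abstract above can be illustrated with a minimal sketch. It assumes each node type has a fitted quadratic runtime model `a + b*size + c*size**2` and uses a greedy largest-job-first heuristic; the coefficients, the function names, and the heuristic itself are illustrative assumptions, not the authors' published models or algorithm.

```python
def quadratic_runtime(size, a, b, c):
    """Predicted runtime for a fragment of the given size (hypothetical model)."""
    return a + b * size + c * size * size


def balance(job_sizes, node_models):
    """Greedy static load balancing.

    job_sizes   -- sizes of the input fragments to distribute
    node_models -- one (a, b, c) coefficient tuple per node (assumed values)
    Returns (assignment, makespan), where assignment[n] lists the fragment
    sizes given to node n and makespan is the predicted longest runtime.
    """
    loads = [0.0] * len(node_models)
    assignment = [[] for _ in node_models]
    # Place the largest fragments first; each goes to the node that would
    # finish earliest after taking it.
    for size in sorted(job_sizes, reverse=True):
        best = min(
            range(len(node_models)),
            key=lambda n: loads[n] + quadratic_runtime(size, *node_models[n]),
        )
        loads[best] += quadratic_runtime(size, *node_models[best])
        assignment[best].append(size)
    return assignment, max(loads)


if __name__ == "__main__":
    # Two nodes: the second is assumed to be twice as slow as the first.
    models = [(0.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
    print(balance([4, 3, 2, 1], models))
```

With a heterogeneous pair of nodes like the one in the usage example, the faster node receives more fragments, which is the effect the abstract attributes to redistributing the workload.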
Affiliation(s)
- Trinity Cheng
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD 20993 USA (GRID: grid.417587.8; ISNI: 0000 0001 2243 3366); Department of Biomedical Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218 USA (GRID: grid.21107.35; ISNI: 0000 0001 2171 9311)
- Pei-Ju Chin
- Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD 20993 USA (GRID: grid.290496.0; ISNI: 0000 0001 1945 2072)
- Kenny Cha
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD 20993 USA (GRID: grid.417587.8; ISNI: 0000 0001 2243 3366)
- Nicholas Petrick
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD 20993 USA (GRID: grid.417587.8; ISNI: 0000 0001 2243 3366)
- Mike Mikailov
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD 20993 USA (GRID: grid.417587.8; ISNI: 0000 0001 2243 3366)
2
Eldred LE, Thorn RG, Smith DR. Simple Matching Using QIIME 2 and RDP Reveals Misidentified Sequences and an Underrepresentation of Fungi in Reference Datasets. Front Genet 2021; 12:768473. [PMID: 34899856 PMCID: PMC8662557 DOI: 10.3389/fgene.2021.768473] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/31/2021] [Accepted: 11/08/2021] [Indexed: 11/21/2022] Open
Abstract
Simple nucleotide matching identification methods are not as accurate as once thought at identifying environmental fungal sequences. This is largely because of incorrect naming and the underrepresentation of various fungal groups in reference datasets. Here, we explore these issues by examining an environmental metabarcoding dataset of partial large subunit rRNA sequences of Basidiomycota and basal fungi. We employed the simple matching method using the QIIME 2 classifier and the RDP Classifier in conjunction with the latest releases of the SILVA (138.1, 2020) and RDP (11, 2014) reference datasets and then compared the results with a manual phylogenetic binning approach. Of the 71 query sequences tested, 21 and 42% were misidentified using QIIME 2 and the RDP Classifier, respectively. Of these simple matching misidentifications, more than half resulted from the underrepresentation of various groups of fungi in the SILVA and RDP reference datasets. More comprehensive reference datasets with fewer misidentified sequences will increase the accuracy of simple matching identifications. However, we argue that the phylogenetic binning approach is a better alternative to simple matching since, in addition to better accuracy, it provides evolutionary information about query sequences.
Affiliation(s)
- Lauren E Eldred
- Department of Biology, University of Western Ontario, London, ON, Canada
- R Greg Thorn
- Department of Biology, University of Western Ontario, London, ON, Canada
- David Roy Smith
- Department of Biology, University of Western Ontario, London, ON, Canada
3
Kumar PS, Dabdoub SM, Ganesan SM. Probing periodontal microbial dark matter using metataxonomics and metagenomics. Periodontol 2000 2020; 85:12-27. [PMID: 33226714 DOI: 10.1111/prd.12349] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 12/12/2022]
Abstract
Our view of the periodontal microbial community has been shaped by a century or more of cultivation-based and microscopic investigations. While these studies firmly established the infection-mediated etiology of periodontal diseases, it was apparent from the very early days that periodontal microbiology suffered from what Staley and Konopka described as the "great plate count anomaly", in that these culturable bacteria were only a minor part of what was visible under the microscope. For nearly a century, much effort has been devoted to finding the right tools to investigate this uncultivated majority, also known as "microbial dark matter". The discovery that DNA was an effective tool to "see" microbial dark matter was a significant breakthrough in environmental microbiology, and oral microbiologists were among the earliest to capitalize on these advances. By identifying the order in which nucleotides are arranged in a stretch of DNA (DNA sequencing) and creating a repository of these sequences, sequence databases were created. Computational tools that used probability-driven analysis of these sequences enabled the discovery of new and unsuspected species and ascribed novel functions to these species. This review will trace the development of DNA sequencing as a quantitative, open-ended, comprehensive approach to characterize microbial communities in their native environments, and explore how this technology has shifted traditional dogmas on how the oral microbiome promotes health and its role in disease causation and perpetuation.
Affiliation(s)
- Purnima S Kumar
- Department of Periodontology, College of Dentistry, The Ohio State University, Columbus, Ohio, USA
- Shareef M Dabdoub
- Department of Periodontology, College of Dentistry, The Ohio State University, Columbus, Ohio, USA
- Sukirth M Ganesan
- Department of Periodontics, College of Dentistry and Dental Clinics, The University of Iowa, Iowa City, Iowa, USA
4
Ge H, Sun L, Yu J. Fast batch searching for protein homology based on compression and clustering. BMC Bioinformatics 2017; 18:508. [PMID: 29162030 PMCID: PMC5697088 DOI: 10.1186/s12859-017-1938-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 07/18/2017] [Accepted: 11/14/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In the bioinformatics community, many tasks involve matching a set of protein query sequences against large sequence datasets. To conduct multiple queries in the database, a commonly used method is to run BLAST on each original query or on the concatenated queries. This is inefficient because it does not exploit the common subsequences shared by the queries. RESULTS We propose a compression and cluster based BLASTP (C2-BLASTP) algorithm to further exploit the joint information among the query sequences and the database. Firstly, the queries and database are compressed in turn by procedures of redundancy analysis, redundancy removal and distinction record. Secondly, the database is clustered according to the Hamming distance among the subsequences. To improve the sensitivity and selectivity of sequence alignments, ten groups of reduced amino acid alphabets are used. Following this, the hit-finding operator is applied to the clustered database. Furthermore, an execution database is constructed from the potential hits found, with the objective of mitigating the effect of the increasing scale of the sequence database. Finally, the homology search is performed in the execution database. Experiments on the NCBI NR database demonstrate the effectiveness of the proposed C2-BLASTP for batch homology searching in sequence databases. The results are evaluated in terms of homology accuracy, search speed and memory usage. CONCLUSIONS C2-BLASTP achieves competitive results compared with some state-of-the-art methods.
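Two of the ingredients named in the abstract above, reduced amino acid alphabets and Hamming-distance clustering, can be sketched as follows. The six-group alphabet and the greedy first-fit clustering rule are hypothetical choices for illustration; the abstract does not specify the ten alphabets or the clustering procedure that C2-BLASTP actually uses.

```python
# Hypothetical reduced alphabet: amino acids grouped by broad chemical class.
REDUCED_GROUPS = {
    "AVLIMC": "h",  # hydrophobic
    "FWYH": "a",    # aromatic
    "STNQ": "p",    # polar
    "KR": "+",      # positively charged
    "DE": "-",      # negatively charged
    "GP": "s",      # small / special
}
REDUCED = {aa: code for group, code in REDUCED_GROUPS.items() for aa in group}


def reduce_sequence(seq):
    """Map a protein sequence onto the reduced alphabet ('x' for unknowns)."""
    return "".join(REDUCED.get(aa, "x") for aa in seq)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def cluster(subseqs, max_dist):
    """Greedy clustering of equal-length subsequences.

    Each subsequence joins the first cluster whose representative (in the
    reduced alphabet) is within max_dist; otherwise it starts a new cluster.
    """
    clusters = []  # list of (reduced representative, members)
    for s in subseqs:
        reduced = reduce_sequence(s)
        for rep, members in clusters:
            if hamming(rep, reduced) <= max_dist:
                members.append(s)
                break
        else:
            clusters.append((reduced, [s]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    print(cluster(["AVKD", "LIRE", "GGGG"], max_dist=0))
```

Reducing the alphabet before comparing lets subsequences that differ only by chemically similar residues fall into the same cluster.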
Affiliation(s)
- Hongwei Ge
- College of Computer Science and Technology, Dalian University of Technology, No.2, Linggong Road, Dalian, China
- Liang Sun
- College of Computer Science and Technology, Dalian University of Technology, No.2, Linggong Road, Dalian, China
- Jinghong Yu
- College of Computer Science and Technology, Dalian University of Technology, No.2, Linggong Road, Dalian, China
5
Chen Y, Ye W, Zhang Y, Xu Y. High speed BLASTN: an accelerated MegaBLAST search tool. Nucleic Acids Res 2015; 43:7762-8. [PMID: 26250111 PMCID: PMC4652774 DOI: 10.1093/nar/gkv784] [Citation(s) in RCA: 292] [Impact Index Per Article: 32.4] [Received: 01/27/2015] [Accepted: 07/22/2015] [Indexed: 11/14/2022] Open
Abstract
Sequence alignment is a long standing problem in bioinformatics. The Basic Local Alignment Search Tool (BLAST) is one of the most popular and fundamental alignment tools. The explosive growth of biological sequences calls for speedup of sequence alignment tools such as BLAST. To this end, we develop high speed BLASTN (HS-BLASTN), a parallel and fast nucleotide database search tool that accelerates MegaBLAST—the default module of NCBI-BLASTN. HS-BLASTN builds a new lookup table using the FMD-index of the database and employs an accurate and effective seeding method to find short stretches of identities (called seeds) between the query and the database. HS-BLASTN produces the same alignment results as MegaBLAST and its computational speed is much faster than MegaBLAST. Specifically, our experiments conducted on a 12-core server show that HS-BLASTN can be 22 times faster than MegaBLAST and exhibits better parallel performance than MegaBLAST. HS-BLASTN is written in C++ and the related source code is available at https://github.com/chenying2016/queries under the GPLv3 license.
Affiliation(s)
- Ying Chen
- Guangdong Province Key Laboratory of Computational Science, School of Mathematics and Computational Science, Sun Yat-sen University, Guangzhou 510275, P. R. China
- Weicai Ye
- Guangdong Province Key Laboratory of Computational Science, School of Mathematics and Computational Science, Sun Yat-sen University, Guangzhou 510275, P. R. China
- Yongdong Zhang
- Guangdong Province Key Laboratory of Computational Science, School of Mathematics and Computational Science, Sun Yat-sen University, Guangzhou 510275, P. R. China
- Yuesheng Xu
- Guangdong Province Key Laboratory of Computational Science, School of Mathematics and Computational Science, Sun Yat-sen University, Guangzhou 510275, P. R. China; Department of Mathematics, Syracuse University, Syracuse, NY 13244, USA
6
Kotsifakos A, Stefan A, Athitsos V, Das G, Papapetrou P. DRESS: dimensionality reduction for efficient sequence search. Data Min Knowl Discov 2015. [DOI: 10.1007/s10618-015-0413-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
7
Liu H, Beck TN, Golemis EA, Serebriiskii IG. Integrating in silico resources to map a signaling network. Methods Mol Biol 2014; 1101:197-245. [PMID: 24233784 DOI: 10.1007/978-1-62703-721-1_11] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 12/23/2022]
Abstract
The abundance of publicly available life science databases offers a wealth of information that can support interpretation of experimentally derived data and greatly enhance hypothesis generation. Protein interaction and functional networks are not simply new renditions of existing data: they provide the opportunity to gain insights into the specific physical and functional role a protein plays as part of the biological system. In this chapter, we describe different in silico tools that can quickly and conveniently retrieve data from existing data repositories and we discuss how the available tools are best utilized for different purposes. While emphasizing protein-protein interaction databases (e.g., BioGrid and IntAct), we also introduce metasearch platforms such as STRING and GeneMANIA, pathway databases (e.g., BioCarta and Pathway Commons), text mining approaches (e.g., PubMed and Chilibot), and resources for drug-protein interactions, genetic information for model organisms and gene expression information based on microarray data mining. Furthermore, we provide a simple step-by-step protocol for building customized protein-protein interaction networks in Cytoscape, a powerful network assembly and visualization program, integrating data retrieved from these various databases. As we illustrate, generation of composite interaction networks enables investigators to extract significantly more information about a given biological system than utilization of a single database or sole reliance on primary literature.
Affiliation(s)
- Hanqing Liu
- Fox Chase Cancer Center, Philadelphia, PA, USA
8
Dinov ID, Torri F, Macciardi F, Petrosyan P, Liu Z, Zamanyan A, Eggert P, Pierce J, Genco A, Knowles JA, Clark AP, Van Horn JD, Ames J, Kesselman C, Toga AW. Applications of the pipeline environment for visual informatics and genomics computations. BMC Bioinformatics 2011; 12:304. [PMID: 21791102 PMCID: PMC3199760 DOI: 10.1186/1471-2105-12-304] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 03/24/2011] [Accepted: 07/26/2011] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND Contemporary informatics and genomics research require efficient, flexible and robust management of large heterogeneous data, advanced computational tools, powerful visualization, reliable hardware infrastructure, interoperability of computational resources, and detailed data and analysis-protocol provenance. The Pipeline is a client-server distributed computational environment that facilitates the visual graphical construction, execution, monitoring, validation and dissemination of advanced data analysis protocols. RESULTS This paper reports on the applications of the LONI Pipeline environment to address two informatics challenges - graphical management of diverse genomics tools, and the interoperability of informatics software. Specifically, this manuscript presents the concrete details of deploying general informatics suites and individual software tools to new hardware infrastructures, the design, validation and execution of new visual analysis protocols via the Pipeline graphical interface, and integration of diverse informatics tools via the Pipeline eXtensible Markup Language syntax. We demonstrate each of these processes using several established informatics packages (e.g., miBLAST, EMBOSS, mrFAST, GWASS, MAQ, SAMtools, Bowtie) for basic local sequence alignment and search, molecular biology data analysis, and genome-wide association studies. These examples demonstrate the power of the Pipeline graphical workflow environment to enable integration of bioinformatics resources which provide a well-defined syntax for dynamic specification of the input/output parameters and the run-time execution controls. CONCLUSIONS The LONI Pipeline environment http://pipeline.loni.ucla.edu provides a flexible graphical infrastructure for efficient biomedical computing and distributed informatics research. The interactive Pipeline resource manager enables the utilization and interoperability of diverse types of informatics resources. 
The Pipeline client-server model provides computational power to a broad spectrum of informatics investigators: experienced developers and novice users, users with or without access to advanced computational resources (e.g., Grid, data), as well as basic and translational scientists. The open development, validation and dissemination of computational networks (pipeline workflows) facilitates the sharing of knowledge, tools, protocols and best practices, and enables the unbiased validation and replication of scientific findings by the entire community.
Affiliation(s)
- Ivo D Dinov
- Laboratory of Neuro Imaging (LONI), University of California, Los Angeles, Los Angeles, CA 90095, USA
9
Klingstrom T, Plewczynski D. Protein-protein interaction and pathway databases, a graphical review. Brief Bioinform 2010; 12:702-13. [DOI: 10.1093/bib/bbq064] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.1] [Indexed: 11/14/2022] Open
10
Qu W, Shen Z, Zhao D, Yang Y, Zhang C. MFEprimer: multiple factor evaluation of the specificity of PCR primers. Bioinformatics 2008; 25:276-8. [PMID: 19038987 DOI: 10.1093/bioinformatics/btn614] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.6] [Indexed: 11/14/2022]
Abstract
SUMMARY We developed a program named MFEprimer for evaluating the specificity of PCR primers based on multiple factors, including sequence similarity, stability at the 3'-end of the primer, melting temperature, GC content and number of binding sites between the primer and DNA templates. MFEprimer can help the user to select more suitable primers before running either standard or multiplex PCR reactions. The cDNA and genomic DNA databases of 10 widely used species, as well as user custom databases, were used as DNA templates for analyzing primer specificity. Furthermore, we maintained a Primer3Plus server with a modified Primer3Manager for one-stop primer design and specificity checking.
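Two of the factors listed in the summary above, GC content and melting temperature, are simple to compute for a single primer. The sketch below uses the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C), which is a common approximation for short oligonucleotides; it is an assumption made for illustration and not necessarily the formula MFEprimer applies.

```python
def gc_content(primer):
    """Fraction of bases in the primer that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature in degrees Celsius (Wallace rule).

    Tm = 2 * (A + T) + 4 * (G + C); a rough estimate for short primers only.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    primer = "ATGCATGC"  # made-up example sequence
    print(gc_content(primer), wallace_tm(primer))
```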
Affiliation(s)
- Wubin Qu
- Beijing Institute of Radiation Medicine, State Key Laboratory of Proteomics, Beijing, China
11
Tarcea VG, Weymouth T, Ade A, Bookvich A, Gao J, Mahavisno V, Wright Z, Chapman A, Jayapandian M, Ozgür A, Tian Y, Cavalcoli J, Mirel B, Patel J, Radev D, Athey B, States D, Jagadish HV. Michigan molecular interactions r2: from interacting proteins to pathways. Nucleic Acids Res 2008; 37:D642-6. [PMID: 18978014 PMCID: PMC2686565 DOI: 10.1093/nar/gkn722] [Citation(s) in RCA: 76] [Impact Index Per Article: 4.8] [Indexed: 11/13/2022] Open
Abstract
Molecular interaction data exists in a number of repositories, each with its own data format, molecule identifier and information coverage. Michigan molecular interactions (MiMI) assists scientists searching through this profusion of molecular interaction data. The original release of MiMI gathered data from well-known protein interaction databases, and deep merged this information while keeping track of provenance. Based on the feedback received from users, MiMI has been completely redesigned. This article describes the resulting MiMI Release 2 (MiMIr2). New functionality includes extension from proteins to genes and to pathways; identification of highlighted sentences in source publications; seamless two-way linkage with Cytoscape; query facilities based on MeSH/GO terms and other concepts; approximate graph matching to find relevant pathways; support for querying in bulk; and a user focus-group driven interface design. MiMI is part of the NIH's National Center for Integrative Biomedical Informatics (NCIBI) and is publicly available at: http://mimi.ncibi.org.
Affiliation(s)
- V Glenn Tarcea
- Center for Computational Medicine and Biology, University of Michigan, Ann Arbor, MI 48109, USA
12
Morgulis A, Coulouris G, Raytselis Y, Madden TL, Agarwala R, Schäffer AA. Database indexing for production MegaBLAST searches. Bioinformatics 2008; 24:1757-64. [PMID: 18567917 PMCID: PMC2696921 DOI: 10.1093/bioinformatics/btn322] [Citation(s) in RCA: 726] [Impact Index Per Article: 45.4] [Indexed: 11/14/2022]
Abstract
Motivation: The BLAST software package for sequence comparison speeds up homology search by preprocessing a query sequence into a lookup table. Numerous research studies have suggested that preprocessing the database instead would give better performance. However, production usage of sequence comparison methods that preprocess the database has been limited to programs such as BLAT and SSAHA that are designed to find matches when query and database subsequences are highly similar. Results: We developed a new version of the MegaBLAST module of BLAST that does the initial phase of finding short seeds for matches by searching a database index. We also developed a program makembindex that preprocesses the database into a data structure for rapid seed searching. We show that the new ‘indexed MegaBLAST’ is faster than the ‘non-indexed’ version for most practical uses. We show that indexed MegaBLAST is faster than miBLAST, another implementation of BLAST nucleotide searching with a preprocessed database, for most of the 200 queries we tested. To deploy indexed MegaBLAST as part of NCBI's Web BLAST service, the storage of databases and the queueing mechanism were modified, so that some machines are now dedicated to serving queries for a specific database. The response time for such Web queries is now faster than it was when each computer handled queries for multiple databases. Availability: The code for indexed MegaBLAST is part of the blastn program in the NCBI C++ toolkit. The preprocessor program makembindex is also in the toolkit. Indexed MegaBLAST has been used in production on NCBI's Web BLAST service to search one version of the human and mouse genomes since October 2007.
The Linux command-line executables for blastn and makembindex, documentation, and some query sets used to carry out the tests described below are available in the directory: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast Contact: schaffer@helix.nih.gov Supplementary information: Supplementary data are available at Bioinformatics online.
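The idea of preprocessing the database rather than the query, as described in the abstract above, can be sketched with a plain dictionary that maps every k-mer in the database to its locations. The index is built once and then reused to find exact-match seeds for any number of queries. This is an illustrative toy only: the function names are hypothetical and the data structure that makembindex actually produces is not described here.

```python
from collections import defaultdict


def build_index(database, k):
    """Index every k-mer of every database sequence.

    database -- dict mapping sequence name to sequence string
    Returns a dict mapping k-mer -> list of (name, offset).
    """
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def find_seeds(query, index, k):
    """Return (query_offset, name, db_offset) for each exact k-mer match."""
    seeds = []
    for q in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict
        for name, d in index.get(query[q:q + k], ()):
            seeds.append((q, name, d))
    return seeds


if __name__ == "__main__":
    db = {"s1": "ACGTACGT", "s2": "TTTT"}  # made-up sequences
    idx = build_index(db, k=4)             # built once
    print(find_seeds("GTAC", idx, k=4))    # reused per query
    print(find_seeds("ACGT", idx, k=4))
```

The cost of building the index is paid once per database, so it is amortized over all the queries that are later searched against it.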
Affiliation(s)
- Aleksandr Morgulis
- Department of Health and Human Services, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD 20894, USA
13
Dinov ID, Rubin D, Lorensen W, Dugan J, Ma J, Murphy S, Kirschner B, Bug W, Sherman M, Floratos A, Kennedy D, Jagadish HV, Schmidt J, Athey B, Califano A, Musen M, Altman R, Kikinis R, Kohane I, Delp S, Parker DS, Toga AW. iTools: a framework for classification, categorization and integration of computational biology resources. PLoS One 2008; 3:e2265. [PMID: 18509477 PMCID: PMC2386255 DOI: 10.1371/journal.pone.0002265] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Received: 02/20/2008] [Accepted: 03/27/2008] [Indexed: 11/22/2022] Open
Abstract
The advancement of the computational biology field hinges on progress in three fundamental directions – the development of new computational algorithms, the availability of informatics resource management infrastructures and the capability of tools to interoperate and synergize. There is an explosion in algorithms and tools for computational biology, which makes it difficult for biologists to find, compare and integrate such resources. We describe a new infrastructure, iTools, for managing the query, traversal and comparison of diverse computational biology resources. Specifically, iTools stores information about three types of resources – data, software tools and web-services. The iTools design, implementation and resource meta-data content reflect the broad research, computational, applied and scientific expertise available at the seven National Centers for Biomedical Computing. iTools provides a system for classification, categorization and integration of different computational biology resources across space-and-time scales, biomedical problems, computational infrastructures and mathematical foundations. A large number of resources are already iTools-accessible to the community and this infrastructure is rapidly growing. iTools includes human and machine interfaces to its resource meta-data repository. Investigators or computer programs may utilize these interfaces to search, compare, expand, revise and mine meta-data descriptions of existent computational biology resources. We propose two ways to browse and display the iTools dynamic collection of resources. The first one is based on an ontology of computational biology resources, and the second one is derived from hyperbolic projections of manifolds or complex structures onto planar discs. iTools is an open source project both in terms of the source code development as well as its meta-data content. iTools employs a decentralized, portable, scalable and lightweight framework for long-term resource management.
We demonstrate several applications of iTools as a framework for integrated bioinformatics. iTools and the complete details about its specifications, usage and interfaces are available at the iTools web page http://iTools.ccb.ucla.edu.
Affiliation(s)
- Ivo D. Dinov
- Center for Computational Biology, University of California Los Angeles, Los Angeles, California, United States of America
- Daniel Rubin
- National Center for Biomedical Ontology, Stanford University, Stanford, California, United States of America
- William Lorensen
- National Alliance for Medical Imaging Computing, Harvard University, Cambridge, Massachusetts, United States of America
- Jonathan Dugan
- Center for Physics-based Simulation of Biological Structures, Stanford University, Stanford, California, United States of America
- Jeff Ma
- Center for Computational Biology, University of California Los Angeles, Los Angeles, California, United States of America
- Shawn Murphy
- Informatics for Integrating Biology and the Bedside, Harvard University, Cambridge, Massachusetts, United States of America
- Beth Kirschner
- National Center for Integrative Biomedical Informatics, University of Michigan, Ann Arbor, Michigan, United States of America
- William Bug
- National Center for Microscopy Imaging Research, University of California San Diego, San Diego, California, United States of America
- Michael Sherman
- Center for Physics-based Simulation of Biological Structures, Stanford University, Stanford, California, United States of America
- Aris Floratos
- National Center for Multi-Scale Study of Cellular Networks, Columbia University, New York, New York, United States of America
- David Kennedy
- Neuroscience Center, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- H. V. Jagadish
- National Center for Integrative Biomedical Informatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Jeanette Schmidt
- Center for Physics-based Simulation of Biological Structures, Stanford University, Stanford, California, United States of America
- Brian Athey
- National Center for Integrative Biomedical Informatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Andrea Califano
- National Center for Multi-Scale Study of Cellular Networks, Columbia University, New York, New York, United States of America
- Mark Musen
- National Center for Biomedical Ontology, Stanford University, Stanford, California, United States of America
- Russ Altman
- Center for Physics-based Simulation of Biological Structures, Stanford University, Stanford, California, United States of America
- Ron Kikinis
- National Alliance for Medical Imaging Computing, Harvard University, Cambridge, Massachusetts, United States of America
- Isaac Kohane
- Informatics for Integrating Biology and the Bedside, Harvard University, Cambridge, Massachusetts, United States of America
- Scott Delp
- Center for Physics-based Simulation of Biological Structures, Stanford University, Stanford, California, United States of America
- D. Stott Parker
- Center for Computational Biology, University of California Los Angeles, Los Angeles, California, United States of America
- Arthur W. Toga
- Center for Computational Biology, University of California Los Angeles, Los Angeles, California, United States of America
14
Jayapandian M, Chapman A, Tarcea VG, Yu C, Elkiss A, Ianni A, Liu B, Nandi A, Santos C, Andrews P, Athey B, States D, Jagadish HV. Michigan Molecular Interactions (MiMI): putting the jigsaw puzzle together. Nucleic Acids Res 2006; 35:D566-71. [PMID: 17130145 PMCID: PMC1716720 DOI: 10.1093/nar/gkl859] [Citation(s) in RCA: 88] [Impact Index Per Article: 4.9] [Indexed: 11/15/2022] Open
Abstract
Protein interaction data exists in a number of repositories. Each repository has its own data format, molecule identifier and supplementary information. Michigan Molecular Interactions (MiMI) assists scientists searching through this overwhelming amount of protein interaction data. MiMI gathers data from well-known protein interaction databases and deep-merges the information. Utilizing an identity function, molecules that may have different identifiers but represent the same real-world object are merged. Thus, MiMI allows the users to retrieve information from many different databases at once, highlighting complementary and contradictory information. To help scientists judge the usefulness of a piece of data, MiMI tracks the provenance of all data. Finally, a simple yet powerful user interface aids users in their queries, and frees them from the onerous task of knowing the data format or learning a query language. MiMI allows scientists to query all data, whether corroborative or contradictory, and specify which sources to utilize. MiMI is part of the National Center for Integrative Biomedical Informatics (NCIBI) and is publicly available at: http://mimi.ncibi.org.
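The deep-merge step described in the abstract above can be sketched as follows: an identity function maps each source-specific identifier to a shared key, records with the same key are merged, and every attribute value keeps the list of sources that reported it (its provenance). The record layout, the source names `dbA` and `dbB`, and the lookup-table identity function are hypothetical; MiMI's real schema and identity function are not given in the abstract.

```python
def deep_merge(records, identity):
    """Merge records that refer to the same real-world molecule.

    records  -- list of dicts with keys 'source', 'id' and 'attrs'
    identity -- function (source, id) -> canonical key (assumed to be given)
    Returns {key: {attribute: {value: [sources reporting that value]}}}.
    """
    merged = {}
    for rec in records:
        key = identity(rec["source"], rec["id"])
        entry = merged.setdefault(key, {})
        for attr, value in rec["attrs"].items():
            # Keep every reported value together with its provenance, so
            # agreement and disagreement between sources stay visible.
            entry.setdefault(attr, {}).setdefault(value, []).append(rec["source"])
    return merged


if __name__ == "__main__":
    # Made-up records from two made-up repositories.
    records = [
        {"source": "dbA", "id": "P1", "attrs": {"name": "TP53", "organism": "human"}},
        {"source": "dbB", "id": "X9", "attrs": {"name": "p53", "organism": "human"}},
    ]
    same_molecule = {("dbA", "P1"): "prot1", ("dbB", "X9"): "prot1"}
    print(deep_merge(records, lambda s, i: same_molecule[(s, i)]))
```

In the merged entry, a value with several sources is corroborated, while an attribute with several distinct values shows where the repositories disagree or complement each other.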
Affiliation(s)
- Adriane Chapman
- To whom correspondence should be addressed at Department of Electrical Engineering and Computer Science, University of Michigan, 2260 Hayward Avenue, Ann Arbor, MI 48109, USA. Tel: +1 734 763 4433; Fax: +1 734 763 8094