Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Kim YJ, Boyd A, Athey BD, Patel JM. miBLAST: scalable evaluation of a batch of nucleotide sequence queries with BLAST. Nucleic Acids Res 2005;33:4335-44. [PMID: 16061938 PMCID: PMC1182166 DOI: 10.1093/nar/gki739] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Indexed: 11/23/2022] Open

Cited by Other Article(s)

Cheng T, Chin PJ, Cha K, Petrick N, Mikailov M. Profiling the BLAST bioinformatics application for load balancing on high-performance computing clusters. BMC Bioinformatics 2022;23:544. [PMID: 36526957 PMCID: PMC9758941 DOI: 10.1186/s12859-022-05029-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2022] [Accepted: 10/31/2022] [Indexed: 12/23/2022] Open

Abstract

BACKGROUND

The Basic Local Alignment Search Tool (BLAST) is a suite of commonly used algorithms for identifying matches between biological sequences. The user supplies a database file and a query file of sequences, and BLAST finds identical sequences between the two. Database and query files typically hold millions of sequences, which makes BLAST computationally challenging but also well suited to parallelization on high-performance computing (HPC) clusters. The efficacy of parallelization depends on the data partitioning, and optimal data partitioning relies on an accurate performance model. In previous studies, a BLAST job was sped up 27-fold by partitioning the database and query among thousands of processor nodes. However, the optimality of the partitioning method was not studied. Performance models proposed in the literature usually have problem size and hardware configuration as the only variables, whereas the execution time of a BLAST job is a function of database size, query size, and hardware capability. In this work, the nucleotide BLAST application BLASTN was profiled using three methods: shell-level profiling with the Unix "time" command, code-level profiling with the built-in "profiler" module, and system-level profiling with the Unix "gprof" program. The runtimes were measured for six node types, using six different database files and 15 query files, on a heterogeneous HPC cluster with 500+ nodes. The empirical measurement data were fitted with quadratic functions to develop performance models that were used to guide the data parallelization for BLASTN jobs.
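As a rough illustration of the curve-fitting step described above, the sketch below fits a quadratic to runtime measurements by ordinary least squares. It is not the authors' code: the function names, the single-variable form (runtime against one size measure, for one node type), and the sample numbers are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: need at least three distinct sizes")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_quadratic(sizes: Sequence[float], runtimes: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of runtime ~ a*size**2 + b*size + c (normal equations)."""
    if len(sizes) != len(runtimes):
        raise ValueError("sizes and runtimes must have the same length")
    # Power sums of the sizes and moment sums of the runtimes.
    s = [sum(x ** k for x in sizes) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(sizes, runtimes)) for k in range(3)]
    normal = [
        [s[4], s[3], s[2]],
        [s[3], s[2], s[1]],
        [s[2], s[1], s[0]],
    ]
    a, b, c = solve_linear_system(normal, [t[2], t[1], t[0]])
    return a, b, c


def predict(coeffs: Tuple[float, float, float], size: float) -> float:
    """Evaluate the fitted model at a given size."""
    a, b, c = coeffs
    return a * size * size + b * size + c


# Hypothetical measurements: sizes in arbitrary units, runtimes in seconds.
coeffs = fit_quadratic([1, 2, 3, 4, 5], [6, 15, 28, 45, 66])
estimated = predict(coeffs, 10)
```

A separate set of coefficients per node type would be one way to reflect hardware capability; that choice is an assumption of this sketch, not something stated in the abstract.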

RESULTS

Profiling results showed that BLASTN contains more than 34,500 different functions, but a single function, RunMTBySplitDB, takes 99.12% of the total runtime. Among its 53 child functions, five core functions were identified that make up 92.12% of the overall BLASTN runtime. Based on the performance models, static load balancing algorithms can be applied to the BLASTN input data to minimize the runtime of the longest job on an HPC cluster. Four test cases were run on homogeneous and heterogeneous clusters. Experimental results showed that the runtime can be reduced by 81% on a homogeneous cluster and by 20% on a heterogeneous cluster by re-distributing the workload.
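The idea of static load balancing that minimizes the runtime of the longest job can be sketched with a generic greedy heuristic (longest processing time first). This is a textbook heuristic chosen for illustration, not the algorithm used in the paper; the cost values, the per-node speed factors, and the function names are hypothetical.

```python
from typing import List, Sequence


def assign_fragments(costs: Sequence[float], speeds: Sequence[float]) -> List[List[int]]:
    """Assign fragment indices to nodes, most expensive fragment first.

    costs[i]  : predicted cost of fragment i on a reference node (assumed input)
    speeds[n] : relative speed of node n (1.0 = reference node, assumed input)
    Each fragment goes to the node that would finish it earliest.
    """
    loads = [0.0] * len(speeds)
    plan: List[List[int]] = [[] for _ in speeds]
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    for i in order:
        best = min(range(len(speeds)), key=lambda n: loads[n] + costs[i] / speeds[n])
        loads[best] += costs[i] / speeds[best]
        plan[best].append(i)
    return plan


def longest_job(costs: Sequence[float], speeds: Sequence[float], plan: List[List[int]]) -> float:
    """Runtime of the slowest node under a given plan."""
    return max(
        sum(costs[i] for i in fragments) / speed
        for fragments, speed in zip(plan, speeds)
    )


# Hypothetical example: five fragments, two identical nodes.
costs = [8.0, 7.0, 6.0, 5.0, 4.0]
speeds = [1.0, 1.0]
plan = assign_fragments(costs, speeds)
worst = longest_job(costs, speeds, plan)
```

The greedy result is not guaranteed to be optimal; it only shows how predicted costs from a performance model could drive the assignment before any job starts.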

DISCUSSION

Optimal data partitioning can improve BLASTN's overall runtime 5.4-fold in comparison with dividing the database and query into the same number of fragments. The proposed methodology can be used for the other applications in the BLAST+ suite, or for any other application, as long as source code is available.

Eldred LE, Thorn RG, Smith DR. Simple Matching Using QIIME 2 and RDP Reveals Misidentified Sequences and an Underrepresentation of Fungi in Reference Datasets. Front Genet 2021;12:768473. [PMID: 34899856 PMCID: PMC8662557 DOI: 10.3389/fgene.2021.768473] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/31/2021] [Accepted: 11/08/2021] [Indexed: 11/21/2022] Open

Kumar PS, Dabdoub SM, Ganesan SM. Probing periodontal microbial dark matter using metataxonomics and metagenomics. Periodontol 2000 2020;85:12-27. [PMID: 33226714 DOI: 10.1111/prd.12349] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 12/12/2022]

Ge H, Sun L, Yu J. Fast batch searching for protein homology based on compression and clustering. BMC Bioinformatics 2017;18:508. [PMID: 29162030 PMCID: PMC5697088 DOI: 10.1186/s12859-017-1938-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 07/18/2017] [Accepted: 11/14/2017] [Indexed: 11/10/2022] Open

Chen Y, Ye W, Zhang Y, Xu Y. High speed BLASTN: an accelerated MegaBLAST search tool. Nucleic Acids Res 2015;43:7762-8. [PMID: 26250111 PMCID: PMC4652774 DOI: 10.1093/nar/gkv784] [Citation(s) in RCA: 292] [Impact Index Per Article: 32.4] [Received: 01/27/2015] [Accepted: 07/22/2015] [Indexed: 11/14/2022] Open

Kotsifakos A, Stefan A, Athitsos V, Das G, Papapetrou P. DRESS: dimensionality reduction for efficient sequence search. Data Min Knowl Discov 2015. [DOI: 10.1007/s10618-015-0413-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]

Liu H, Beck TN, Golemis EA, Serebriiskii IG. Integrating in silico resources to map a signaling network. Methods Mol Biol 2014;1101:197-245. [PMID: 24233784 DOI: 10.1007/978-1-62703-721-1_11] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 12/23/2022]

Dinov ID, Torri F, Macciardi F, Petrosyan P, Liu Z, Zamanyan A, Eggert P, Pierce J, Genco A, Knowles JA, Clark AP, Van Horn JD, Ames J, Kesselman C, Toga AW. Applications of the pipeline environment for visual informatics and genomics computations. BMC Bioinformatics 2011;12:304. [PMID: 21791102 PMCID: PMC3199760 DOI: 10.1186/1471-2105-12-304] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 03/24/2011] [Accepted: 07/26/2011] [Indexed: 01/19/2023] Open

Abstract

BACKGROUND

Contemporary informatics and genomics research requires efficient, flexible and robust management of large heterogeneous data, advanced computational tools, powerful visualization, reliable hardware infrastructure, interoperability of computational resources, and detailed data and analysis-protocol provenance. The Pipeline is a client-server distributed computational environment that facilitates the visual graphical construction, execution, monitoring, validation and dissemination of advanced data analysis protocols.

RESULTS

This paper reports on the applications of the LONI Pipeline environment to address two informatics challenges: graphical management of diverse genomics tools, and the interoperability of informatics software. Specifically, this manuscript presents the concrete details of deploying general informatics suites and individual software tools to new hardware infrastructures, the design, validation and execution of new visual analysis protocols via the Pipeline graphical interface, and integration of diverse informatics tools via the Pipeline eXtensible Markup Language syntax. We demonstrate each of these processes using several established informatics packages (e.g., miBLAST, EMBOSS, mrFAST, GWASS, MAQ, SAMtools, Bowtie) for basic local sequence alignment and search, molecular biology data analysis, and genome-wide association studies. These examples demonstrate the power of the Pipeline graphical workflow environment to enable integration of bioinformatics resources which provide a well-defined syntax for dynamic specification of the input/output parameters and the run-time execution controls.

CONCLUSIONS

The LONI Pipeline environment (http://pipeline.loni.ucla.edu) provides a flexible graphical infrastructure for efficient biomedical computing and distributed informatics research. The interactive Pipeline resource manager enables the utilization and interoperability of diverse types of informatics resources. The Pipeline client-server model provides computational power to a broad spectrum of informatics investigators: experienced developers and novice users, users with or without access to advanced computational resources (e.g., Grid, data), as well as basic and translational scientists. The open development, validation and dissemination of computational networks (pipeline workflows) facilitates the sharing of knowledge, tools, protocols and best practices, and enables the unbiased validation and replication of scientific findings by the entire community.

Klingstrom T, Plewczynski D. Protein-protein interaction and pathway databases, a graphical review. Brief Bioinform 2010;12:702-13. [DOI: 10.1093/bib/bbq064] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.1] [Indexed: 11/14/2022] Open

Qu W, Shen Z, Zhao D, Yang Y, Zhang C. MFEprimer: multiple factor evaluation of the specificity of PCR primers. Bioinformatics 2008;25:276-8. [PMID: 19038987 DOI: 10.1093/bioinformatics/btn614] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.6] [Indexed: 11/14/2022]

Tarcea VG, Weymouth T, Ade A, Bookvich A, Gao J, Mahavisno V, Wright Z, Chapman A, Jayapandian M, Ozgür A, Tian Y, Cavalcoli J, Mirel B, Patel J, Radev D, Athey B, States D, Jagadish HV. Michigan molecular interactions r2: from interacting proteins to pathways. Nucleic Acids Res 2008;37:D642-6. [PMID: 18978014 PMCID: PMC2686565 DOI: 10.1093/nar/gkn722] [Citation(s) in RCA: 76] [Impact Index Per Article: 4.8] [Indexed: 11/13/2022] Open

Morgulis A, Coulouris G, Raytselis Y, Madden TL, Agarwala R, Schäffer AA. Database indexing for production MegaBLAST searches. Bioinformatics 2008;24:1757-64. [PMID: 18567917 PMCID: PMC2696921 DOI: 10.1093/bioinformatics/btn322] [Citation(s) in RCA: 726] [Impact Index Per Article: 45.4] [Indexed: 11/14/2022]

Abstract

Motivation: The BLAST software package for sequence comparison speeds up homology search by preprocessing a query sequence into a lookup table. Numerous research studies have suggested that preprocessing the database instead would give better performance. However, production usage of sequence comparison methods that preprocess the database has been limited to programs such as BLAT and SSAHA that are designed to find matches when query and database subsequences are highly similar.

Results: We developed a new version of the MegaBLAST module of BLAST that does the initial phase of finding short seeds for matches by searching a database index. We also developed a program makembindex that preprocesses the database into a data structure for rapid seed searching. We show that the new ‘indexed MegaBLAST’ is faster than the ‘non-indexed’ version for most practical uses. We show that indexed MegaBLAST is faster than miBLAST, another implementation of BLAST nucleotide searching with a preprocessed database, for most of the 200 queries we tested. To deploy indexed MegaBLAST as part of NCBI's Web BLAST service, the storage of databases and the queueing mechanism were modified, so that some machines are now dedicated to serving queries for a specific database. The response time for such Web queries is now faster than it was when each computer handled queries for multiple databases.
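To make the contrast between query-side and database-side preprocessing concrete, the sketch below builds a toy k-mer index over a database and uses it to find exact seed matches for a query. It only illustrates the general idea of seed lookup in a preprocessed database; it is not the data structure produced by makembindex, and the function names, the value of k, and the sequences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Maps each k-mer to the (sequence name, offset) pairs where it occurs.
SeedIndex = Dict[str, List[Tuple[str, int]]]


def build_index(database: Dict[str, str], k: int) -> SeedIndex:
    """Preprocess the database once: record every position of every k-mer."""
    index: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return dict(index)


def find_seeds(index: SeedIndex, query: str, k: int) -> List[Tuple[int, str, int]]:
    """Return (query offset, sequence name, database offset) for each exact k-mer hit."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], ()):
            seeds.append((qpos, name, dpos))
    return seeds


# Hypothetical toy database and query.
database = {"s1": "ACGTACGT", "s2": "TTTTACGA"}
index = build_index(database, k=4)
hits = find_seeds(index, "TACG", k=4)
```

The index is built once and reused for every query, so the cost of scanning the database is paid up front rather than per search; a real tool would follow seed lookup with extension and scoring, which this sketch omits.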

Availability: The code for indexed MegaBLAST is part of the blastn program in the NCBI C++ toolkit. The preprocessor program makembindex is also in the toolkit. Indexed MegaBLAST has been used in production on NCBI's Web BLAST service to search one version of the human and mouse genomes since October 2007. The Linux command-line executables for blastn and makembindex, documentation, and some query sets used to carry out the tests described below are available in the directory: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast

Contact: schaffer@helix.nih.gov

Supplementary information: Supplementary data are available at Bioinformatics online.

Dinov ID, Rubin D, Lorensen W, Dugan J, Ma J, Murphy S, Kirschner B, Bug W, Sherman M, Floratos A, Kennedy D, Jagadish HV, Schmidt J, Athey B, Califano A, Musen M, Altman R, Kikinis R, Kohane I, Delp S, Parker DS, Toga AW. iTools: a framework for classification, categorization and integration of computational biology resources. PLoS One 2008;3:e2265. [PMID: 18509477 PMCID: PMC2386255 DOI: 10.1371/journal.pone.0002265] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Received: 02/20/2008] [Accepted: 03/27/2008] [Indexed: 11/22/2022] Open

Abstract

The advancement of the computational biology field hinges on progress in three fundamental directions: the development of new computational algorithms, the availability of informatics resource management infrastructures and the capability of tools to interoperate and synergize. There is an explosion in algorithms and tools for computational biology, which makes it difficult for biologists to find, compare and integrate such resources. We describe a new infrastructure, iTools, for managing the query, traversal and comparison of diverse computational biology resources. Specifically, iTools stores information about three types of resources: data, software tools and web-services. The iTools design, implementation and resource meta-data content reflect the broad research, computational, applied and scientific expertise available at the seven National Centers for Biomedical Computing. iTools provides a system for classification, categorization and integration of different computational biology resources across space-and-time scales, biomedical problems, computational infrastructures and mathematical foundations. A large number of resources are already iTools-accessible to the community and this infrastructure is rapidly growing. iTools includes human and machine interfaces to its resource meta-data repository. Investigators or computer programs may utilize these interfaces to search, compare, expand, revise and mine meta-data descriptions of existing computational biology resources. We propose two ways to browse and display the iTools dynamic collection of resources. The first one is based on an ontology of computational biology resources, and the second one is derived from hyperbolic projections of manifolds or complex structures onto planar discs. iTools is an open source project both in its source code development and in its meta-data content. iTools employs a decentralized, portable, scalable and lightweight framework for long-term resource management. We demonstrate several applications of iTools as a framework for integrated bioinformatics. iTools and the complete details about its specifications, usage and interfaces are available at the iTools web page http://iTools.ccb.ucla.edu.

Affiliation(s)

Ivo D. Dinov: Center for Computational Biology, University of California Los Angeles, Los Angeles, California, United States of America
Daniel Rubin: National Center for Biomedical Ontology, Stanford University, Stanford, California, United States of America
William Lorensen: National Alliance for Medical Imaging Computing, Harvard University, Cambridge, Massachusetts, United States of America
Jonathan Dugan: Center for Physics-based Simulation of Biological Structures, Stanford University, Stanford, California, United States of America
Jeff Ma: Center for Computational Biology, University of California Los Angeles, Los Angeles, California, United States of America
Shawn Murphy: Informatics for Integrating Biology and the Bedside, Harvard University, Cambridge, Massachusetts, United States of America
Beth Kirschner: National Center for Integrative Biomedical Informatics, University of Michigan, Ann Arbor, Michigan, United States of America
William Bug: National Center for Microscopy Imaging Research, University of California San Diego, San Diego, California, United States of America
Michael Sherman: Center for Physics-based Simulation of Biological Structures, Stanford University, Stanford, California, United States of America
Aris Floratos: National Center for Multi-Scale Study of Cellular Networks, Columbia University, New York, New York, United States of America
David Kennedy: Neuroscience Center, Massachusetts General Hospital, Boston, Massachusetts, United States of America
H. V. Jagadish: National Center for Integrative Biomedical Informatics, University of Michigan, Ann Arbor, Michigan, United States of America
Jeanette Schmidt: Center for Physics-based Simulation of Biological Structures, Stanford University, Stanford, California, United States of America
Brian Athey: National Center for Integrative Biomedical Informatics, University of Michigan, Ann Arbor, Michigan, United States of America
Andrea Califano: National Center for Multi-Scale Study of Cellular Networks, Columbia University, New York, New York, United States of America
Mark Musen: National Center for Biomedical Ontology, Stanford University, Stanford, California, United States of America
Russ Altman: Center for Physics-based Simulation of Biological Structures, Stanford University, Stanford, California, United States of America
Ron Kikinis: National Alliance for Medical Imaging Computing, Harvard University, Cambridge, Massachusetts, United States of America
Isaac Kohane: Informatics for Integrating Biology and the Bedside, Harvard University, Cambridge, Massachusetts, United States of America
Scott Delp: Center for Physics-based Simulation of Biological Structures, Stanford University, Stanford, California, United States of America
D. Stott Parker: Center for Computational Biology, University of California Los Angeles, Los Angeles, California, United States of America
Arthur W. Toga: Center for Computational Biology, University of California Los Angeles, Los Angeles, California, United States of America * E-mail:

Jayapandian M, Chapman A, Tarcea VG, Yu C, Elkiss A, Ianni A, Liu B, Nandi A, Santos C, Andrews P, Athey B, States D, Jagadish HV. Michigan Molecular Interactions (MiMI): putting the jigsaw puzzle together. Nucleic Acids Res 2006;35:D566-71. [PMID: 17130145 PMCID: PMC1716720 DOI: 10.1093/nar/gkl859] [Citation(s) in RCA: 88] [Impact Index Per Article: 4.9] [Indexed: 11/15/2022] Open