1
|
Mustguseal and Sister Web-Methods: A Practical Guide to Bioinformatic Analysis of Protein Superfamilies. Methods Mol Biol 2021; 2231:179-200. [PMID: 33289894 DOI: 10.1007/978-1-0716-1036-7_12] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 02/18/2023]
Abstract
Bioinformatic analysis of functionally diverse superfamilies can help to study the structure-function relationship in proteins, but represents a methodological challenge. The Mustguseal web-server can build large structure-guided sequence alignments of thousands of homologs that cover all currently available sequence variants within a common structural fold. The input to the method is a PDB code of the query protein, which represents the protein superfamily of interest. The collection and subsequent alignment of protein sequences and structures is fully automated and driven by the particular choice of parameters. Four integrated sister web-methods (Zebra, pocketZebra, visualCMAT, and Yosshi) are available to further analyze the resulting superimposition and identify conserved, subfamily-specific, and co-evolving residues, as well as to classify and study disulfide bonds in protein superfamilies. The integration of these web-based bioinformatic tools provides an out-of-the-box, easy-to-use solution, the first of its kind, to study protein function and regulation, design improved enzyme variants for practical applications, and design selective ligands to modulate their functional properties. In this chapter, we provide a step-by-step protocol for a comprehensive bioinformatic analysis of a protein superfamily using a web-browser as the main tool, together with notes on selecting the appropriate values for the key algorithm parameters depending on your research objective. The web-servers are freely available to all users at https://biokinet.belozersky.msu.ru/m-platform with no login requirement.
|
2
|
Abstract
Many fields of biology rely on the inference of accurate multiple sequence alignments (MSA) of biological sequences. Unfortunately, the problem of assembling an MSA is NP-complete, thus limiting computation to approximate solutions using heuristics. The progressive algorithm is one of the most popular frameworks for the computation of MSAs. It involves pre-clustering the sequences and aligning them starting with the most similar ones. The scalability of this framework is limited, especially with respect to accuracy. We present here an alternative approach named the regressive algorithm. In this framework, sequences are first clustered and then aligned starting with the most distantly related ones. This approach has been shown to greatly improve accuracy during scale-up, especially on datasets featuring 10,000 sequences or more. Another benefit is the possibility to integrate third-party clustering methods and third-party MSA aligners. The regressive algorithm has been tested on up to 1.5 million sequences; its implementation is available in the T-Coffee package.
|
3
|
Abstract
Multiple sequence alignment (MSA) is a central step in many bioinformatics and computational biology analyses. Although there exist many methods to perform MSA, most of them fail when dealing with large datasets due to their high computational cost. MSAProbs-MPI is a publicly available tool (http://msaprobs.sourceforge.net) that provides highly accurate results in a relatively short runtime by exploiting the hardware resources of multicore clusters. In this chapter, I explain the statistical and biological concepts employed in MSAProbs-MPI to complete the alignments, as well as the high-performance computing techniques used to accelerate it. Moreover, I provide some hints about the configuration parameters that should be used to guarantee high-performance executions.
|
4
|
Accelerating Sequence Alignments Based on FM-Index Using the Intel KNL Processor. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1093-1104. [PMID: 30530369 DOI: 10.1109/tcbb.2018.2884701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
FM-index is a compact data structure suitable for fast matches of short reads to large reference genomes. The matching algorithm using this index exhibits irregular memory access patterns that cause frequent cache misses, resulting in a memory bound problem. This paper analyzes different FM-index versions presented in the literature, focusing on those computing aspects related to the data access. As a result of the analysis, we propose a new organization of FM-index that minimizes the demand for memory bandwidth, allowing a great improvement of performance on processors with high-bandwidth memory, such as the second-generation Intel Xeon Phi (Knights Landing, or KNL), integrating ultra high-bandwidth stacked memory technology. As the roofline model shows, our implementation reaches 95 percent of the peak random access bandwidth limit when executed on the KNL and almost all of the available bandwidth when executed on other Intel Xeon architectures with conventional DDR memory. In addition, the obtained throughput in KNL is much higher than the results reported for GPUs in the literature.
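The matching algorithm mentioned in this abstract is usually a backward search over the FM-index. The following is a minimal sketch of that search, not the memory layout proposed in the paper: the index is built naively by sorting suffixes, the Occ table is stored uncompressed, and the function names `build_fm_index` and `count_matches` are hypothetical. Each loop iteration reads Occ at two data-dependent positions, which is the kind of irregular access that causes the cache misses described above.

```python
def build_fm_index(text):
    """Build a toy FM-index: first-column offsets, Occ table, and BWT length."""
    text += "$"  # sentinel; assumed absent from text and smaller than every symbol
    # Naive suffix array: sort suffix start positions by the suffix itself.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around for i == 0).
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first_col[c] = number of characters in text strictly smaller than c.
    first_col, total = {}, 0
    for c in alphabet:
        first_col[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first_col, occ, len(bwt)


def count_matches(index, pattern):
    """Count occurrences of pattern by backward search over the index."""
    first_col, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows still matching
    for ch in reversed(pattern):
        if ch not in first_col:
            return 0
        lo = first_col[ch] + occ[ch][lo]
        hi = first_col[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("banana")
print(count_matches(index, "ana"))
```

A production index would sample the Occ table and pack the BWT into a few bits per symbol; how those samples are laid out in memory is the design question the paper addresses.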
|
5
|
BLASTP-ACC: Parallel Architecture and Hardware Accelerator Design for BLAST-Based Protein Sequence Alignment. IEEE Transactions on Biomedical Circuits and Systems 2019; 13:1771-1782. [PMID: 31581096 DOI: 10.1109/tbcas.2019.2943539] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 06/10/2023]
Abstract
In this study, we design a hardware accelerator for a widely used sequence alignment algorithm, the basic local alignment search tool for proteins (BLASTP). The architecture of the proposed accelerator consists of five stages: a new systolic-array-based one-hit finding stage, a novel RAM-REG-based two-hit finding stage, a refined ungapped extension stage, a faster gapped extension stage, and a highly efficient parallel sorter. The system is implemented on an Altera Stratix V FPGA with a processing speed of more than 500 giga cell updates per second (GCUPS). It can receive a query sequence, compare it with the sequences in the database, and generate a list sorted in descending order of the similarity scores between the query sequence and the subject sequences. Moreover, it is capable of processing both query and subject protein sequences comprising as many as 8192 amino acid residues in a single pass. Using data from the National Center for Biotechnology Information (NCBI) database, we show that a speed-up of more than 3X can be achieved with our hardware compared to the runtime required by BLASTP software on an 8-thread Intel Xeon CPU with 144 GB DRAM.
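For readers unfamiliar with the GCUPS unit quoted above, the following sketch shows how the figure is conventionally computed: the number of dynamic-programming cells is the query length times the total number of database residues, divided by the runtime and by 10^9. The function name and the sample numbers are illustrative assumptions, not measurements from the paper.

```python
def gcups(query_len, subject_lens, seconds):
    """Giga cell updates per second for one query against a set of subjects."""
    if seconds <= 0:
        raise ValueError("runtime must be positive")
    # One cell per (query residue, subject residue) pair.
    cells = query_len * sum(subject_lens)
    return cells / seconds / 1e9


# Hypothetical run: a 1000-residue query against two subjects in 2 seconds.
print(gcups(1000, [500, 1500], 2.0))
```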
|
6
|
Freiburg RNA tools: a central online resource for RNA-focused research and teaching. Nucleic Acids Res 2018; 46:W25-W29. [PMID: 29788132 PMCID: PMC6030932 DOI: 10.1093/nar/gky329] [Citation(s) in RCA: 72] [Impact Index Per Article: 12.0] [Received: 01/30/2018] [Revised: 04/03/2018] [Accepted: 05/18/2018] [Indexed: 12/20/2022] Open
Abstract
The Freiburg RNA tools webserver is a well-established online resource for RNA-focused research. It provides a unified user interface and comprehensive result visualization for efficient command line tools. The webserver includes RNA-RNA interaction prediction (IntaRNA, CopraRNA, metaMIR), sRNA homology search (GLASSgo), sequence-structure alignments (LocARNA, MARNA, CARNA, ExpaRNA), CRISPR repeat classification (CRISPRmap), sequence design (antaRNA, INFO-RNA, SECISDesign), structure aberration evaluation of point mutations (RaSE), RNA/protein-family model visualization (CMV), and other methods. Open education resources offer interactive visualizations of RNA structure and RNA-RNA interaction prediction as well as basic and advanced sequence alignment algorithms. The services are freely available at http://rna.informatik.uni-freiburg.de.
|
7
|
NGS-QCbox and Raspberry for Parallel, Automated and Rapid Quality Control Analysis of Large-Scale Next Generation Sequencing (Illumina) Data. PLoS One 2015; 10:e0139868. [PMID: 26460497 PMCID: PMC4604202 DOI: 10.1371/journal.pone.0139868] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 06/01/2015] [Accepted: 09/16/2015] [Indexed: 12/04/2022] Open
Abstract
The rapid popularity and adoption of next generation sequencing (NGS) approaches have generated huge volumes of data. High throughput platforms like Illumina HiSeq produce terabytes of raw data that require quick processing. Quality control of the data is an important step prior to downstream analyses. To address these issues, we have developed a quality control pipeline, NGS-QCbox, that scales up to process hundreds or thousands of samples. Raspberry is an in-house tool, developed in the C language utilizing HTSlib (v1.2.1) (http://htslib.org), for computing read/base level statistics. It can be used as a stand-alone application and can process both compressed and uncompressed FASTQ format files. NGS-QCbox integrates Raspberry with other open-source tools for alignment (Bowtie2), SNP calling (SAMtools) and other utilities (bedtools) to analyze raw NGS data at higher efficiency and in a high-throughput manner. The pipeline implements batch processing of jobs in parallel using Bpipe (https://github.com/ssadedin/bpipe) and, internally, fine-grained task parallelization utilizing OpenMP. It reports read and base statistics along with genome coverage and variants in a user-friendly format. The pipeline presents a simple menu-driven interface and can be used in either quick or complete mode. In addition, the pipeline in quick mode outperforms other similar existing QC pipelines and tools in speed. The NGS-QCbox pipeline, Raspberry tool and associated scripts are available at https://github.com/CEG-ICRISAT/NGS-QCbox and https://github.com/CEG-ICRISAT/Raspberry for rapid quality control analysis of large-scale next generation sequencing (Illumina) data.
|
8
|
B-MIC: An Ultrafast Three-Level Parallel Sequence Aligner Using MIC. Interdiscip Sci 2015; 8:28-34. [PMID: 26358141 DOI: 10.1007/s12539-015-0278-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 02/04/2014] [Revised: 05/19/2014] [Accepted: 05/19/2014] [Indexed: 12/29/2022]
Abstract
Sequence alignment, in which raw sequencing data are mapped to a reference genome, is the central process of sequence analysis. The large amount of data generated by NGS is far beyond the processing capabilities of existing alignment tools. Consequently, sequence alignment becomes the bottleneck of sequence analysis. Intensive computing power is required to address this challenge. Intel recently announced the MIC coprocessor, which can provide massive computing power. The Tianhe-2, currently the world's fastest supercomputer, is equipped with three MIC coprocessors per compute node. A key feature of sequence alignment is that different reads are independent. Considering this property, we proposed a MIC-oriented three-level parallelization strategy to speed up BWA, a widely used sequence alignment tool, and developed our ultrafast parallel sequence aligner: B-MIC. B-MIC contains three levels of parallelization: first, parallelization of data I/O and read alignment by a three-stage parallel pipeline; second, parallelization enabled by MIC coprocessor technology; third, inter-node parallelization implemented by MPI. In this paper, we demonstrate that B-MIC outperforms BWA by a combination of those techniques using an Inspur NF5280M server and the Tianhe-2 supercomputer. To the best of our knowledge, B-MIC is the first sequence alignment tool to run on Intel MIC, and it can achieve more than fivefold speedup over the original BWA while maintaining the alignment precision.
|
9
|
Efficient and Accurate OTU Clustering with GPU-Based Sequence Alignment and Dynamic Dendrogram Cutting. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2015; 12:1060-1073. [PMID: 26451819 DOI: 10.1109/tcbb.2015.2407574] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 06/05/2023]
Abstract
De novo clustering is a popular technique to perform taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). In this work, we introduce a new dendrogram-based OTU clustering pipeline called CRiSPy. The key idea used in CRiSPy to improve clustering accuracy is the application of an anomaly detection technique to obtain a dynamic distance cutoff instead of using the de facto value of 97 percent sequence similarity as in most existing OTU clustering pipelines. This technique works by detecting an abrupt change in the merging heights of a dendrogram. To produce the output dendrograms, CRiSPy employs the OTU hierarchical clustering approach that is computed on a genetic distance matrix derived from an all-against-all read comparison by pairwise sequence alignment. However, most existing dendrogram-based tools have difficulty processing datasets larger than 10,000 unique reads due to high computational complexity. We address this difficulty by developing two efficient algorithms for CRiSPy: a compute-efficient GPU-accelerated parallel algorithm for pairwise distance matrix computation and a memory-efficient hierarchical clustering algorithm. Our experiments on various datasets with distinct attributes show that CRiSPy is able to produce more accurate OTU groupings than most OTU clustering applications.
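The idea of choosing a cutoff from an abrupt change in merging heights can be illustrated with a very small sketch. The abstract does not specify the anomaly detection technique, so the code below substitutes a plain largest-gap rule as an assumed stand-in; the function names and the sample heights are hypothetical.

```python
def dynamic_cut_height(merge_heights):
    """Place the cut in the middle of the largest jump between sorted merge heights."""
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merges to find a jump")
    gaps = [b - a for a, b in zip(heights, heights[1:])]
    k = max(range(len(gaps)), key=gaps.__getitem__)  # index of the largest gap
    return (heights[k] + heights[k + 1]) / 2


def count_clusters(merge_heights, cut):
    """Number of clusters left when only merges at or below the cut are applied."""
    n_leaves = len(merge_heights) + 1  # a full dendrogram on n leaves has n - 1 merges
    return n_leaves - sum(1 for h in merge_heights if h <= cut)


# Hypothetical dendrogram on six reads: three tight merges, then two distant ones.
heights = [0.01, 0.02, 0.03, 0.20, 0.22]
cut = dynamic_cut_height(heights)
print(cut, count_clusters(heights, cut))
```

With a fixed 97 percent similarity threshold the cut would sit at a distance of 0.03 regardless of the data; the data-driven rule instead follows where the dendrogram itself separates.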
|
10
|
Concurrent and Accurate Short Read Mapping on Multicore Processors. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2015; 12:995-1007. [PMID: 26451814 DOI: 10.1109/tcbb.2015.2392077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (an open-source application available at http://www.opencb.org), exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), and leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA on RNA reads of 100-400 nucleotides, where it excels in execution time and sensitivity compared with state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR.
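The Smith-Waterman step that handles reads the suffix array cannot place can be sketched as the textbook local-alignment recurrence. This is a minimal score-only version with a linear gap penalty; the scoring values and the function name are assumptions chosen for illustration and are not taken from HPG Aligner SA.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at zero lets a local alignment restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The quadratic cost of this recurrence is why aligners of this kind reserve it for the minority of reads that exact or seed-based mapping leaves unresolved.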
|
11
|
Accelerating the Next Generation Long Read Mapping with the FPGA-Based System. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:840-852. [PMID: 26356857 DOI: 10.1109/tcbb.2014.2326876] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 06/05/2023]
Abstract
Comparing newly determined sequences against the subject sequences stored in databases is a critical task in bioinformatics. Fortunately, a recent survey reports that state-of-the-art aligners are already fast enough to handle very large numbers of short sequence reads in a reasonable time. However, aligning the long sequence reads (>400 bp) generated by next generation sequencing (NGS) technology is still quite inefficient with present aligners. Furthermore, the challenge becomes more serious as both the lengths and the numbers of sequence reads keep increasing with improvements in sequencing technology. Thus, researchers urgently need to enhance the performance of long read alignment. In this paper, we propose a novel FPGA-based system to improve the efficiency of long read mapping. Compared to the state-of-the-art long read aligner BWA-SW, our accelerating platform achieves high performance with almost the same sensitivity. Experiments demonstrate that, for reads with lengths ranging from 512 up to 4,096 base pairs, the described system obtains a 10x-48x speedup for the bottleneck of the software. For the whole mapping procedure, the FPGA-based platform achieves a 1.8x-3.3x speedup versus the BWA-SW aligner, reducing the alignment cycles from weeks to days.
|
12
|
Fragment merger: an online tool to merge overlapping long sequence fragments. Viruses 2013; 5:824-33. [PMID: 23482300 PMCID: PMC3705298 DOI: 10.3390/v5030824] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Received: 01/16/2013] [Revised: 02/26/2013] [Accepted: 02/26/2013] [Indexed: 11/16/2022] Open
Abstract
While PCR amplicons extend to a few thousand bases, the length of sequences from direct Sanger sequencing is limited to 500–800 nucleotides. Therefore, several fragments may be required to cover an amplicon, a gene or an entire genome. These fragments are typically sequenced in an overlapping fashion and assembled by manually sliding and aligning the sequences visually. This is time-consuming, repetitive and error-prone, and further complicated by circular genomes. An online tool merging two to twelve long overlapping sequence fragments was developed. Either chromatograms or FASTA files are submitted to the tool, which trims poor quality ends of chromatograms according to user-specified parameters. Fragments are assembled into a single sequence by repeatedly calling the EMBOSS merger tool in a consecutive manner. Output includes the number of trimmed nucleotides, details of each merge, and an optional alignment to a reference sequence. The final merge sequence is displayed and can be downloaded in FASTA format. All output files can be downloaded as a ZIP archive. This tool allows for easy and automated assembly of overlapping sequences and is aimed at researchers without specialist computer skills. The tool is genome- and organism-agnostic and has been developed using hepatitis B virus sequence data.
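The consecutive merging described above, where each fragment is joined to the running result, can be sketched as follows. The tool itself calls EMBOSS merger for each step; the version below is a deliberate simplification that assumes error-free reads and joins on the longest exact suffix-prefix overlap. The function names and the `min_overlap` parameter are hypothetical.

```python
def merge_fragments(left, right, min_overlap=3):
    """Join two fragments on their longest exact suffix-prefix overlap."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    # Try the longest possible overlap first and shrink until one matches.
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    raise ValueError("fragments do not overlap by at least min_overlap bases")


def merge_all(fragments, min_overlap=3):
    """Merge ordered fragments one after another into a single sequence."""
    merged = fragments[0]
    for fragment in fragments[1:]:
        merged = merge_fragments(merged, fragment, min_overlap)
    return merged


print(merge_all(["ACGTAC", "TACGGA", "GGATTT"]))
```

Real Sanger reads contain base-calling errors near their ends, which is why the tool trims chromatograms first and relies on an alignment-based merge rather than exact matching.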
|
13
|
QuShape: rapid, accurate, and best-practices quantification of nucleic acid probing information, resolved by capillary electrophoresis. RNA (New York, N.Y.) 2013; 19. [PMID: 23188808 PMCID: PMC3527727 DOI: 10.1261/rna.036327.112] [Citation(s) in RCA: 158] [Impact Index Per Article: 14.4] [Indexed: 05/03/2023]
Abstract
Chemical probing of RNA and DNA structure is a widely used and highly informative approach for examining nucleic acid structure and for evaluating interactions with protein and small-molecule ligands. Use of capillary electrophoresis to analyze chemical probing experiments yields hundreds of nucleotides of information per experiment and can be performed on automated instruments. Extraction of the information from capillary electrophoresis electropherograms is a computationally intensive multistep analytical process, and no current software provides rapid, automated, and accurate data analysis. To overcome this bottleneck, we developed a platform-independent, user-friendly software package, QuShape, that yields quantitatively accurate nucleotide reactivity information with minimal user supervision. QuShape incorporates newly developed algorithms for signal decay correction, alignment of time-varying signals within and across capillaries and relative to the RNA nucleotide sequence, and signal scaling across channels or experiments. An analysis-by-reference option enables multiple, related experiments to be fully analyzed in minutes. We illustrate the usefulness and robustness of QuShape by analysis of RNA SHAPE (selective 2'-hydroxyl acylation analyzed by primer extension) experiments.
|
14
|
PAirwise Sequence Comparison (PASC) and its application in the classification of filoviruses. Viruses 2012; 4:1318-27. [PMID: 23012628 PMCID: PMC3446765 DOI: 10.3390/v4081318] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Received: 07/03/2012] [Revised: 08/16/2012] [Accepted: 08/16/2012] [Indexed: 11/26/2022] Open
Abstract
PAirwise Sequence Comparison (PASC) is a tool that uses genome sequence similarity to help with virus classification. The PASC tool at NCBI uses two methods: local alignment based on BLAST and global alignment based on the Needleman-Wunsch algorithm. It works for complete genomes of viruses of several families/groups; for the family Filoviridae, it currently includes 52 complete genomes available in GenBank. It has been shown that the BLAST-based alignment approach works better for filoviruses and is therefore recommended for establishing taxon demarcation criteria. When more genome sequences with high divergence become available, these demarcations will most likely become more precise. The tool can compare new genome sequences of filoviruses with the ones already in the database and propose their taxonomic classification.
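The global method named in this abstract, Needleman-Wunsch, forces both sequences to be aligned end to end, in contrast to the local BLAST-based comparison. A minimal score-only sketch of the recurrence follows; the scoring values and the function name are illustrative assumptions and do not reflect the parameters PASC uses.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score between a and b (linear gap penalty)."""
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATACA"))
```

Because every residue must be accounted for, end gaps and rearranged regions are penalized here, which may be one reason a local method can behave differently on divergent genomes.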
|
15
|
SIMPLEX: cloud-enabled pipeline for the comprehensive analysis of exome sequencing data. PLoS One 2012; 7:e41948. [PMID: 22870267 PMCID: PMC3411592 DOI: 10.1371/journal.pone.0041948] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Received: 02/23/2012] [Accepted: 06/28/2012] [Indexed: 01/24/2023] Open
Abstract
In recent studies, exome sequencing has proven to be a successful screening tool for the identification of candidate genes causing rare genetic diseases. Although underlying targeted sequencing methods are well established, necessary data handling and focused, structured analysis still remain demanding tasks. Here, we present a cloud-enabled autonomous analysis pipeline, which comprises the complete exome analysis workflow. The pipeline combines several in-house developed and published applications to perform the following steps: (a) initial quality control, (b) intelligent data filtering and pre-processing, (c) sequence alignment to a reference genome, (d) SNP and DIP detection, (e) functional annotation of variants using different approaches, and (f) detailed report generation during various stages of the workflow. The pipeline connects the selected analysis steps, exposes all available parameters for customized usage, performs required data handling, and distributes computationally expensive tasks either on a dedicated high-performance computing infrastructure or on the Amazon cloud environment (EC2). The presented application has already been used in several research projects including studies to elucidate the role of rare genetic diseases. The pipeline is continuously tested and is publicly available under the GPL as a VirtualBox or Cloud image at http://simplex.i-med.ac.at; additional supplementary data is provided at http://www.icbi.at/exome.
|
16
|
Protein secondary structure prediction using modular reciprocal bidirectional recurrent neural networks. Computer Methods and Programs in Biomedicine 2010; 100:237-247. [PMID: 20472322 DOI: 10.1016/j.cmpb.2010.04.005] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 11/11/2009] [Revised: 04/04/2010] [Accepted: 04/13/2010] [Indexed: 05/29/2023]
Abstract
The supervised learning of recurrent neural networks well-suited for prediction of protein secondary structures from the underlying amino acid sequence is studied. Modular reciprocal recurrent neural networks (MRR-NN) are proposed to model the strong correlations between adjacent secondary structure elements. In addition, a multilayer bidirectional recurrent neural network (MBR-NN) is introduced to capture the long-range intramolecular interactions between amino acids in the formation of the secondary structure. The final modular prediction system is based on the interactive integration of the MRR-NN and the MBR-NN structures, to arbitrarily engage the neighboring effects of the secondary structure types while memorizing the sequential dependencies of amino acids along the protein chain. The combined network raises the percentage accuracy (Q₃) to 79.36% and the segment overlap (SOV) to 70.09% when tested on the PSIPRED dataset in three-fold cross-validation.
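The Q₃ figure reported above is the percentage of residues whose predicted three-state label (helix, strand, or coil) matches the observed one. A minimal sketch of that calculation, with a hypothetical function name and made-up label strings:

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues with the correct three-state label (H, E, or C)."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("label strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical 8-residue example with one wrong label.
print(q3_accuracy("HHHEECCC", "HHHEEECC"))
```

Q₃ counts residues independently, so it cannot tell whether errors fragment a helix or strand; the segment overlap (SOV) measure quoted alongside it is meant to capture that.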
|
17
|
A score of the ability of a three-dimensional protein model to retrieve its own sequence as a quantitative measure of its quality and appropriateness. PLoS One 2010; 5:e12483. [PMID: 20830209 PMCID: PMC2935356 DOI: 10.1371/journal.pone.0012483] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 04/26/2010] [Accepted: 08/03/2010] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND: Despite the remarkable progress of bioinformatics, how the primary structure of a protein leads to a three-dimensional fold, and in turn determines its function, remains an elusive question. Alignments of sequences with known function can be used to identify proteins with the same or similar function with high success. However, identification of function-related and structure-related amino acid positions is only possible after a detailed study of every protein. Folding pattern diversity seems to be much narrower than sequence diversity, and the amino acid sequences of natural proteins have evolved under a selective pressure comprising structural and functional requirements acting in parallel. PRINCIPAL FINDINGS: The approach described in this work begins by generating a large number of amino acid sequences using ROSETTA [Dantas G et al. (2003) J Mol Biol 332:449-460], a program with notable robustness in the assignment of amino acids to a known three-dimensional structure. The resulting sequence sets showed no conservation of amino acids at active sites or protein-protein interfaces. Hidden Markov models built from the resulting sequence sets were used to search sequence databases. Surprisingly, the models retrieved from the database sequences that belonged to proteins with the same or a very similar function. Given an appropriate cutoff, the rate of false positives was zero. According to our results, this protocol, here referred to as Rd.HMM, detects fine structural details of the folding patterns that seem to be tightly linked to the fitness of a structural framework for a specific biological function. CONCLUSION: Because the sequence of the native protein used to create the Rd.HMM model was always amongst the top hits, the procedure is a reliable tool to score, very accurately, the quality and appropriateness of computer-modeled 3D-structures, without the need for spectroscopy data. However, Rd.HMM is very sensitive to the conformational features of the models' backbone.
|
18
|
MetaGenomeThreader: a software tool for predicting genes in DNA-sequences of metagenome projects. Methods Mol Biol 2010; 668:325-338. [PMID: 20830575 DOI: 10.1007/978-1-60761-823-2_23] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/29/2023]
Abstract
We consider a gene finding method that is specifically designed to work on metagenome sequences. The method can handle short metagenome sequences with in-frame stop codons as well as frame shifts. It delivers gene predictions for a set of metagenome sequences, which may be individual reads or a collection of assembled reads sequenced from an environmental sample. The method searches for stretches of DNA that are conserved within the environmental sample. Conserved coding sequences are discriminated from conserved non-coding regions based on their synonymous substitution rate. We describe the program MetaGenomeThreader which implements the method and show its application on a synthetic metagenome.
|
19
|
Highly sensitive detection of individual HEAT and ARM repeats with HHpred and COACH. PLoS One 2009; 4:e7148. [PMID: 19777061 PMCID: PMC2744927 DOI: 10.1371/journal.pone.0007148] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Received: 04/18/2009] [Accepted: 08/21/2009] [Indexed: 11/19/2022] Open
Abstract
Background: HEAT and ARM repeats occur in a large number of eukaryotic proteins. As these repeats are often highly diverged, the prediction of HEAT or ARM domains can be challenging. Except for the most clear-cut cases, identification at the individual repeat level is indispensable, in particular for determining domain boundaries. However, methods using single sequence queries do not have the sensitivity required to deal with more divergent repeats and, when applied to proteins with known structures, in some cases failed to detect a single repeat. Methodology and Principal Findings: Testing algorithms which use multiple sequence alignments as queries, we found two of them, HHpred and COACH, to detect HEAT and ARM repeats with greatly enhanced sensitivity. Calibration against experimentally determined structures suggests the use of three score classes with increasing confidence in the prediction, and prediction thresholds for each method. When we applied a new protocol using both HHpred and COACH to these structures, it detected 82% of HEAT repeats and 90% of ARM repeats, with the minimum for a given protein of 57% for HEAT repeats and 60% for ARM repeats. Application to bona fide HEAT and ARM proteins or domains indicated that similar numbers can be expected for the full complement of HEAT/ARM proteins. A systematic screen of the Protein Data Bank for false positive hits revealed their number to be low, in particular for ARM repeats. Double false positive hits for a given protein were rare for HEAT and not at all observed for ARM repeats. In combination with fold prediction and consistency checking (multiple sequence alignments, secondary structure prediction, and position analysis), repeat prediction with the new HHpred/COACH protocol dramatically improves prediction in the twilight zone of fold prediction methods, as well as the delineation of HEAT/ARM domain boundaries. Significance: A protocol is presented for the identification of individual HEAT or ARM repeats which is straightforward to implement. It provides high sensitivity at a low false positive rate and will therefore greatly enhance the accuracy of predictions of HEAT and ARM domains.
|
20
|
Prodepth: predict residue depth by support vector regression approach from protein sequences only. PLoS One 2009; 4:e7072. [PMID: 19759917 PMCID: PMC2742725 DOI: 10.1371/journal.pone.0007072] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.3] [Received: 04/14/2009] [Accepted: 08/20/2009] [Indexed: 11/24/2022] Open
Abstract
Residue depth (RD) is a solvent exposure measure that complements the information provided by conventional accessible surface area (ASA) and describes to what extent a residue is buried in the protein structure space. Previous studies have established that RD is correlated with several protein properties, such as protein stability, residue conservation and amino acid types. Accurate prediction of RD has many potentially important applications in the field of structural bioinformatics, for example, facilitating the identification of functionally important residues, or residues in the folding nucleus, or enzyme active sites from sequence information. In this work, we introduce an efficient approach that uses support vector regression to quantify the relationship between RD and protein sequence. We systematically investigated eight different sequence encoding schemes including both local and global sequence characteristics and examined their respective prediction performances. For the objective evaluation of our approach, we used 5-fold cross-validation to assess the prediction accuracies and showed that the overall best performance could be achieved with a correlation coefficient (CC) of 0.71 between the observed and predicted RD values and a root mean square error (RMSE) of 1.74, after incorporating the relevant multiple sequence features. The results suggest that residue depth could be reliably predicted solely from protein primary sequences: local sequence environments are the major determinants, while global sequence features could influence the prediction performance marginally. We highlight two examples as a comparison in order to illustrate the applicability of this approach. We also discuss the potential implications of this new structural parameter in the field of protein structure prediction and homology modeling. This method might prove to be a powerful tool for sequence analysis.
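The two evaluation measures quoted in this abstract, the correlation coefficient (CC) and the root mean square error (RMSE) between observed and predicted RD values, can be illustrated with a minimal sketch. It assumes CC means the Pearson correlation coefficient, which the abstract does not state explicitly; the function names and the toy numbers are hypothetical and are not taken from Prodepth.

```python
import math


def pearson_cc(observed, predicted):
    """Pearson correlation between observed and predicted values (assumed meaning of CC)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Toy residue-depth values, purely illustrative.
observed_rd = [3.2, 5.1, 4.0, 7.8, 6.3]
predicted_rd = [3.0, 5.5, 4.4, 7.0, 6.0]
print(pearson_cc(observed_rd, predicted_rd), rmse(observed_rd, predicted_rd))
```

In a 5-fold cross-validation such as the one described, both measures would typically be computed on each held-out fold and then averaged.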
|
21
|
GRAT--genome-scale rapid alignment tool. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2007; 86:87-92. [PMID: 17292508 DOI: 10.1016/j.cmpb.2007.01.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/14/2006] [Revised: 01/04/2007] [Accepted: 01/04/2007] [Indexed: 05/13/2023]
Abstract
Modern alignment methods designed to work rapidly and efficiently with large datasets often do so at the cost of method sensitivity. To overcome this, we have developed a novel alignment program, GRAT, built to accurately align short, highly similar DNA sequences. The program runs rapidly and requires no more memory and CPU power than a desktop computer. In addition, specificity is ensured by statistically separating the true alignments from spurious matches using phred quality values. An efficient separation is especially important when searching large datasets and whenever there are repeats present in the dataset. Results are superior in comparison to widely used existing software, and analysis of two large genomic datasets shows the usefulness and scalability of the algorithm.
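The abstract does not describe how GRAT uses phred quality values, so the following is only a sketch of the standard phred definition, Q = -10 log10(P), where P is the estimated base-calling error probability. The helper `prob_all_bases_correct` is a hypothetical illustration of how per-base qualities might feed a statistical filter; it is not GRAT's actual statistic.

```python
import math


def phred_to_error_prob(q):
    """Convert a phred quality value Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to a phred quality value Q = -10 log10(P)."""
    return -10 * math.log10(p)


def prob_all_bases_correct(qualities):
    """Probability that every base in a read segment was called correctly,
    assuming independent errors (a simplifying assumption for this sketch)."""
    prob = 1.0
    for q in qualities:
        prob *= 1.0 - phred_to_error_prob(q)
    return prob


print(phred_to_error_prob(20))               # one error in 100 base calls
print(prob_all_bases_correct([30, 30, 20]))  # three-base segment
```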
|
22
|
Abstract
MOTIVATION High-resolution mass spectrometers generate large data files that are complex, noisy and require extensive processing to extract the optimal data from raw spectra. This processing is readily achieved in software and is often embedded in manufacturers' instrument control and data processing environments. However, the speed of this data processing is such that it is usually performed off-line, post data acquisition. We have been exploring strategies that would allow real-time advanced processing of mass spectrometric data, making use of the reconfigurable computing paradigm, which exploits the flexibility and versatility of Field Programmable Gate Arrays (FPGAs). This approach has emerged as a powerful solution for speeding up time-critical algorithms. We describe here a reconfigurable computing solution for processing raw mass spectrometric data generated by MALDI-ToF instruments. The hardware-implemented algorithms for de-noising, baseline correction, peak identification and deisotoping, running on a Xilinx Virtex 2 FPGA at 180 MHz, generate a mass fingerprint over 100 times faster than an equivalent algorithm written in C, running on a Dual 3 GHz Xeon workstation.
|
23
|
Abstract
MOTIVATION Pattern identification in biological sequence data is one of the main objectives of bioinformatics research. However, few methods are available for detecting patterns (substructures) in unordered datasets. Data mining algorithms mainly developed outside the realm of bioinformatics have been adapted for that purpose, but typically do not determine the statistical significance of the identified patterns. Moreover, these algorithms do not exploit the often modular structure of biological data. RESULTS We present the algorithm DASS (Discovery of All Significant Substructures) that first identifies all substructures in unordered data (DASS(Sub)) in a manner that is especially efficient for modular data. In addition, DASS calculates the statistical significance of the identified substructures, for sets with at most one element of each type (DASS(P(set))), or for sets with multiple occurrences of elements (DASS(P(mset))). The power and versatility of DASS are demonstrated by four examples: combinations of protein domains in multi-domain proteins, combinations of proteins in protein complexes (protein subcomplexes), combinations of transcription factor target sites in promoter regions and evolutionarily conserved protein interaction subnetworks. AVAILABILITY The program code and additional data are available at http://www.fli-leibniz.de/tsb/DASS
|
24
|
A unique and universal molecular barcode array. Nat Methods 2006; 3:601-3. [PMID: 16862133 DOI: 10.1038/nmeth905] [Citation(s) in RCA: 98] [Impact Index Per Article: 5.4] [Received: 03/23/2006] [Accepted: 06/19/2006] [Indexed: 11/08/2022]
Abstract
Molecular barcode arrays allow the analysis of thousands of biological samples in parallel through the use of unique 20-base-pair (bp) DNA tags. Here we present a new barcode array, which is unique among microarrays in that it includes at least five replicates of every tag feature. The use of smaller dispersed replicate features dramatically improves performance versus a single larger feature and allows the correction of previously undetectable hybridization defects.
|
25
|
ProfNet, a method to derive profile-profile alignment scoring functions that improves the alignments of distantly related proteins. BMC Bioinformatics 2005; 6:253. [PMID: 16225676 PMCID: PMC1274300 DOI: 10.1186/1471-2105-6-253] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 05/09/2005] [Accepted: 10/14/2005] [Indexed: 11/10/2022] Open
Abstract
Background Profile-profile methods have been used for some years now to detect and align homologous proteins. The best such methods use information from the background distribution of amino acids and substitution tables either when constructing the profiles or in the scoring. This makes the methods dependent on the quality and choice of substitution table as well as the construction of the profiles. Here, we introduce a novel method called ProfNet that is used to derive a profile-profile scoring function. The method optimizes the discrimination between scores of related and unrelated residues and it is fast and straightforward to use. This new method derives a scoring function that is mainly dependent on the actual alignment of residues from a training set, and it does not use any additional information about the background distribution. Results It is shown that ProfNet improves the discrimination of related and unrelated residues. Further it can be used to improve the alignment of distantly related proteins. Conclusion The best performance is obtained using superfamily related proteins in the training of ProfNet, and a classifier that is related to the distance between the structurally aligned residues. The main difference between the new scoring function and a traditional profile-profile scoring function is that conserved residues on average score higher with the new function.
|
26
|
Abstract
Aligning hundreds of sequences using progressive alignment tools such as ClustalW requires several hours on state-of-the-art workstations. We present a new approach to compute multiple sequence alignments in far shorter time using reconfigurable hardware. This results in an implementation of ClustalW with significant runtime savings on a standard off-the-shelf FPGA.
|
27
|
Abstract
SUMMARY To increase compatibility between different generations of Affymetrix GeneChip arrays, we propose a method of filtering probes based on their sequences. Our method is implemented as a web-based service for downloading necessary materials for converting the raw data files (*.CEL) for comparative analysis. The user can specify the appropriate level of filtering by setting the criteria for the minimum overlap length between probe sequences and the minimum number of usable probe pairs per probe set. Our website supports a within-species comparison for human and mouse GeneChip arrays. AVAILABILITY http://www.crosschip.org
|
28
|
Hardware-accelerated protein identification for mass spectrometry. RAPID COMMUNICATIONS IN MASS SPECTROMETRY : RCM 2005; 19:833-837. [PMID: 15723443 DOI: 10.1002/rcm.1853] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 05/24/2023]
Abstract
An ongoing issue in mass spectrometry is the time it takes to search DNA sequences with MS/MS peptide fragments (see, e.g., Choudary et al., Proteomics 2001; 1: 651-667). Search times are far longer than spectra acquisition time, and parallelization of search software on clusters requires doubling the size of a conventional computing cluster to cut the search time in half. Field programmable gate arrays (FPGAs) are used to create hardware-accelerated algorithms that reduce operating costs and improve search speed compared to large clusters. We present a novel hardware design that takes full spectra and computes 6-frame translation word searches on DNA databases at a rate of approximately 3 billion base pairs per second, with queries of up to 10 amino acids in length and arbitrary wildcard positions. Hardware post-processing identifies in silico tryptic peptides and scores them using a variety of techniques including mass frequency expected values. With faster FPGAs, protein identifications from the human genome can be achieved in less than a second, and this makes it an ideal solution for a number of proteome-scale applications.
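The 6-frame translation word search mentioned in this abstract can be illustrated in software with a minimal sketch. It assumes the standard genetic code and uses `?` as a hypothetical wildcard symbol; the function names are illustrative, and nothing here reflects the FPGA design itself.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the three forward and three reverse-strand reading frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def find_word(dna, query):
    """Find a peptide word in all six frames; '?' matches any residue.
    Returns (frame, protein_position) pairs, including overlapping hits."""
    body = "".join("." if ch == "?" else re.escape(ch) for ch in query)
    pattern = re.compile("(?=" + body + ")")
    return [(frame, m.start())
            for frame, protein in six_frame_translation(dna).items()
            for m in pattern.finditer(protein)]


print(six_frame_translation("ATGGCCTAA"))
print(find_word("ATGGCCTAA", "M?"))
```

A software loop like this touches each base several times per query, which is the cost that dedicated hardware is meant to remove.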
|
29
|
Abstract
MOTIVATION Multiple transcription factors coordinately control transcriptional regulation of genes in eukaryotes. Although many computational methods consider the identification of individual transcription factor binding sites (TFBSs), very few focus on the interactions between these sites. We consider finding TFBSs and their context specific interactions using microarray gene expression data. We devise a hybrid approach called LogicMotif composed of a TFBS identification method combined with the new regression methodology logic regression. LogicMotif has two steps: First, potential binding sites are identified from transcription control regions of genes of interest. Various available methods can be used in this step when the genes of interest can be divided into groups such as up- and downregulated. For this step, we also develop a simple univariate regression and extension method MFURE to extract candidate TFBSs from a large number of genes in the availability of microarray gene expression data. MFURE provides an alternative method for this step when partitioning of the genes into disjoint groups is not preferred. This first step aims to identify individual sites within gene groups of interest or sites that are correlated with the gene expression outcome. In the second step, logic regression is used to build a predictive model of outcome of interest (either gene expression or up- and down-regulation) using these potential sites. This 2-fold approach creates a rich diverse set of potential binding sites in the first step and builds regression or classification models in the second step using logic regression that is particularly good at identifying complex interactions. RESULTS LogicMotif is applied to two publicly available datasets. A genome-wide gene expression data set of Saccharomyces cerevisiae is used for validation. The regression models obtained are interpretable and the biological implications are in agreement with the known results.
This analysis suggests that LogicMotif provides biologically more reasonable regression models than previous analysis of this dataset with standard linear regression methods. Another dataset of S. cerevisiae illustrates the use of LogicMotif in classification questions by building a model that discriminates between up- and down-regulated genes in iron copper deficiency. LogicMotif identifies an inductive and two repressor motifs in this dataset. The inductive motif matches the binding site of the transcription factor Aft1p that has a key role in regulation of the uptake process. One of the novel repressor sites is highly present in transcription control regions of FeS genes. This site could represent a TFBS for an unknown transcription factor involved in repression of genes encoding FeS proteins in iron deficiency. We establish the robustness of the method to the type of outcome variable used by considering both continuous and binary outcome variables for this dataset. Our results indicate that logic regression used in combination with cluster/group operating binding site identification methods or with our proposed method MFURE is a powerful and flexible alternative to linear regression based motif finding methods. AVAILABILITY Source code for logic regression is freely available as a package of the R programming language by Ruczinski et al. (2003) and can be downloaded at http://bear.fhcrc.org/~ingor/logic/download/download.html; an R package for MFURE is available at http://www.stat.berkeley.edu/~sunduz/software.html
|
30
|
Abstract
We present the architecture of PROSIDIS, a special purpose co-processor designed to search for the occurrence of substrings similar to a given 'template string' within a proteome. Actual tests show speed-up figures ranging from 5 to 50 with respect to conventional general-purpose processors. AVAILABILITY The PROSIDIS configuration file and the C code are available at http://www.enea.it/hpcn/php/rosato/
|
32
|
Monitoring DNA hybridization on alkyl modified silicon surface through capacitance measurement. Biosens Bioelectron 2003; 18:1157-63. [PMID: 12788558 DOI: 10.1016/s0956-5663(03)00002-2] [Citation(s) in RCA: 88] [Impact Index Per Article: 4.2] [Indexed: 11/16/2022]
Abstract
Single strand oligodeoxynucleotide is attached to the alkyl modified silicon surface through a peptide bond. The oligodeoxynucleotide-modified silicon substrate is used as a working electrode in an electrochemical cell system. After the electrode is treated by a solution containing strands of complementary oligodeoxynucleotide the Mott-Schottky measurements exhibit obvious negative shift in the flat band potential of the electrode, while in a control experiment treated with a solution of non-complementary oligodeoxynucleotide such a shift does not occur. The DNA hybridization is also manifested in a real time capacitance measurement. A DNA sensor based on the capacitance measurement could be more convenient than that based on a fluorescence detection.
|
33
|
Achieving differentiation of single-base mutations through hairpin oligonucleotide and electric potential control. Biosens Bioelectron 2003; 18:1149-55. [PMID: 12788557 DOI: 10.1016/s0956-5663(02)00249-x] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.5] [Indexed: 10/27/2022]
Abstract
A novel assay for surface DNA hybridization, which is free of sample and probe labeling, convenient and of low cost, sensitive and capable of differentiation of single-base mutations, is reported. Hairpin oligonucleotides are carefully designed as probes and are covalently attached to Si chips. Segments of the human p53 gene are chosen to demonstrate the major features of the novel technique. Impedance measurement is used to detect the hybridization. To further optimize the performance, electric potential is applied on the chip. The apparently different responses of the chip to the complementary strand and the single-base mutant are shown under electric potential control. The criteria on the design of the hairpin oligonucleotides are discussed.
|
35
|
Strategies and tools for whole-genome alignments. Genome Res 2003; 13:73-80. [PMID: 12529308 PMCID: PMC430965 DOI: 10.1101/gr.762503] [Citation(s) in RCA: 159] [Impact Index Per Article: 7.6] [Received: 09/04/2002] [Accepted: 11/06/2002] [Indexed: 11/25/2022]
Abstract
The availability of the assembled mouse genome makes possible, for the first time, an alignment and comparison of two large vertebrate genomes. We investigated different strategies of alignment for the subsequent analysis of conservation of genomes that are effective for assemblies of different quality. These strategies were applied to the comparison of the working draft of the human genome with the Mouse Genome Sequencing Consortium assembly, as well as other intermediate mouse assemblies. Our methods are fast and the resulting alignments exhibit a high degree of sensitivity, covering more than 90% of known coding exons in the human genome. We obtained such coverage while preserving specificity. With a view towards the end user, we developed a suite of tools and Web sites for automatically aligning and subsequently browsing and working with whole-genome comparisons. We describe the use of these tools to identify conserved non-coding regions between the human and mouse genomes, some of which have not been identified by other methods.
|
36
|
Abstract
The Mouse Genome Analysis Consortium aligned the human and mouse genome sequences for a variety of purposes, using alignment programs that suited the various needs. For investigating issues regarding genome evolution, a particularly sensitive method was needed to permit alignment of a large proportion of the neutrally evolving regions. We selected a program called BLASTZ, an independent implementation of the Gapped BLAST algorithm specifically designed for aligning two long genomic sequences. BLASTZ was subsequently modified, both to attain efficiency adequate for aligning entire mammalian genomes and to increase its sensitivity. This work describes BLASTZ, its modifications, the hardware environment on which we run it, and several empirical studies to validate its results.
|
37
|
Abstract
Similarity searches are a powerful method for solving important biological problems such as database scanning, evolutionary studies, gene prediction, and protein structure prediction. FASTA is a widely used sequence comparison tool for rapid database scanning. Here we describe the GWFASTA server that was developed to assist the FASTA user in similarity searches against partially and/or completely sequenced genomes. GWFASTA consists of more than 60 microbial genomes, eight eukaryote genomes, and proteomes of annotated genomes. In fact, it provides the maximum number of databases for similarity searching from a single platform. GWFASTA allows the submission of more than one sequence as a single query for a FASTA search. It also provides integrated post-processing of FASTA output, including compositional analysis of proteins, multiple sequence alignment, and phylogenetic analysis. Furthermore, it summarizes the search results organism-wise for prokaryotes and chromosome-wise for eukaryotes. Thus, the integration of different tools for sequence analyses makes GWFASTA a powerful tool for biologists.
|
39
|
Abstract
SUMMARY SOAP is a stand-alone, multi-platform program to test the stability of a multiple alignment of molecular sequences.
|
40
|
Abstract
With the availability of a nearly complete sequence of the human genome, aligning expressed sequence tags (EST) to the genomic sequence has become a practical and powerful strategy for gene prediction. Elucidating gene structure is a complex problem requiring the identification of splice junctions, gene boundaries, and alternative splicing variants. We have developed a software tool, Transcript Assembly Program (TAP), to delineate gene structures using genomically aligned EST sequences. TAP assembles the joint gene structure of the entire genomic region from individual splice junction pairs, using a novel algorithm that uses the EST-encoded connectivity and redundancy information to sort out the complex alternative splicing patterns. A method called polyadenylation site scan (PASS) has been developed to detect poly-A sites in the genome. TAP uses these predictions to identify gene boundaries by segmenting the joint gene structure at polyadenylated terminal exons. Reconstructing 1007 known transcripts, TAP scored a sensitivity (Sn) of 60% and a specificity (Sp) of 92% at the exon level. The gene boundary identification process was found to be accurate 78% of the time. TAP also reports alternative splicing patterns in EST alignments. An analysis of alternative splicing in 1124 genic regions suggested that more than half of human genes undergo alternative splicing. Surprisingly, we saw that an absolute majority of the detected alternative splicing events affect the coding region. Furthermore, the evolutionary conservation of alternative splicing between human and mouse was analyzed using an EST-based approach. (See http://stl.wustl.edu/~zkan/TAP/)
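The abstract does not define its exon-level sensitivity (Sn) and specificity (Sp). The sketch below assumes the convention that is common in gene-prediction evaluation: an exon counts as correct only when both boundaries match exactly, Sn is the fraction of annotated exons recovered, and Sp is the fraction of predicted exons that are correct. The function name and coordinates are hypothetical.

```python
def exon_level_accuracy(annotated, predicted):
    """Return (Sn, Sp) for exons given as (start, end) tuples.

    Assumed definitions (not stated in the abstract):
      Sn = correct exons / annotated exons
      Sp = correct exons / predicted exons
    """
    annotated_set = set(annotated)
    predicted_set = set(predicted)
    correct = len(annotated_set & predicted_set)
    sn = correct / len(annotated_set) if annotated_set else 0.0
    sp = correct / len(predicted_set) if predicted_set else 0.0
    return sn, sp


# Toy coordinates: three of five annotated exons are predicted exactly,
# and one predicted exon has a wrong start boundary.
annotated = [(1, 100), (200, 300), (400, 500), (600, 700), (800, 900)]
predicted = [(1, 100), (200, 300), (400, 500), (650, 700)]
print(exon_level_accuracy(annotated, predicted))
```

Under these assumed definitions, a high Sp combined with a moderate Sn, as reported for TAP, would mean that most predicted exons are correct while a sizeable share of annotated exons is missed.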
|
41
|
Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics 2000; 16:699-706. [PMID: 11099256 DOI: 10.1093/bioinformatics/16.8.699] [Citation(s) in RCA: 67] [Impact Index Per Article: 2.8] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Sequence database searching is among the most important and challenging tasks in bioinformatics. The ultimate choice of sequence-search algorithm is that of Smith-Waterman. However, because of the computationally demanding nature of this method, heuristic programs or special-purpose hardware alternatives have been developed. Increased speed has been obtained at the cost of reduced sensitivity or very expensive hardware. RESULTS A fast implementation of the Smith-Waterman sequence-alignment algorithm using Single-Instruction, Multiple-Data (SIMD) technology is presented. This implementation is based on the MultiMedia eXtensions (MMX) and Streaming SIMD Extensions (SSE) technology that is embedded in Intel's latest microprocessors. Similar technology exists also in other modern microprocessors. Six-fold speed-up relative to the fastest previously known Smith-Waterman implementation on the same hardware was achieved by an optimized 8-way parallel processing approach. A speed of more than 150 million cell updates per second was obtained on a single Intel Pentium III 500 MHz microprocessor. This is probably the fastest implementation of this algorithm on a single general-purpose microprocessor described to date.
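For orientation, the Smith-Waterman recurrence that this work accelerates can be written as a short scalar sketch. It assumes a simple match/mismatch score and a linear gap penalty, with hypothetical parameter values; the published implementation uses SIMD instructions and its own scoring scheme, neither of which is reproduced here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b.

    Uses two rows of the dynamic-programming matrix; each inner-loop
    iteration is one 'cell update' in the sense used in the abstract.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Comparing a query of length m against a database of n residues costs m × n cell updates, which is why throughput for this algorithm is reported in cell updates per second.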
|
42
|
Abstract
RESULTS JavaShade is a multiple sequence alignment box-and-shade tool for generating high quality printed output that uses a variety of methods for boxing and shading, allowing the most appropriate functions to be chosen for displaying the most meaningful positions in an alignment. AVAILABILITY JavaShade is available from the WWW at http://industry.ebi.ac.uk/JavaShade
|
43
|
SAMBA: hardware accelerator for biological sequence comparison. COMPUTER APPLICATIONS IN THE BIOSCIENCES : CABIOS 1997; 13:609-15. [PMID: 9475989 DOI: 10.1093/bioinformatics/13.6.609] [Citation(s) in RCA: 23] [Impact Index Per Article: 0.9] [Indexed: 02/06/2023]
Abstract
MOTIVATION SAMBA (Systolic Accelerator for Molecular Biological Applications) is a 128 processor hardware accelerator for speeding up the sequence comparison process. The short-term objective is to provide a low-cost board to boost PC or workstation performance on this class of applications. This paper places SAMBA amongst other existing systems and highlights the original features. RESULTS Real performance obtained from the prototype is demonstrated. For example, a sequence of 300 amino acids is scanned against SWISS-PROT-34 (21 210 389 residues) in 30 s using the Smith and Waterman algorithm. More time-consuming applications, like the bank-to-bank comparison, are computed in a few hours instead of days on standard workstations. Technology allows the prototype to fit onto a single PCI board for plugging into any PC or workstation. AVAILABILITY SAMBA can be tested on the WEB server at URL http://www.irisa.fr/SAMBA/.
|
46
|
[DNA sequencing with a DNA sequencer using fluorescence labelling]. TANPAKUSHITSU KAKUSAN KOSO. PROTEIN, NUCLEIC ACID, ENZYME 1990; 35:2295-301. [PMID: 2267322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/31/2022]
|