1
|
Duality Between the Local Score of One Sequence and Constrained Hidden Markov Model. Methodol Comput Appl Probab 2022. [DOI: 10.1007/s11009-021-09856-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
2
|
Muthuvel Prasath K, Ganesan K, Parthasarathy S. PredictSuperFam-PSS-3D1D: A server for predicting superfamily for the annotation of twilight zone protein sequences. J Struct Biol 2020; 210:107479. [PMID: 32081792 DOI: 10.1016/j.jsb.2020.107479] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2019] [Revised: 12/25/2019] [Accepted: 02/14/2020] [Indexed: 10/25/2022]
Abstract
Annotation of twilight zone protein sequences has hitherto been attempted by predicting the fold of the given sequence. We report here the PredictSuperFam-PSS-3D1D method, which predicts the superfamily for a given twilight zone (TZ) protein sequence. We previously reported that adding predicted secondary structure information to threading methods could improve fold prediction, especially for TZ protein sequences. In this study, we analyse the application of the same method to superfamily prediction. In this method, the TZ protein sequence is threaded against the 3D1D profiles of a library of known protein superfamilies, and a weight for the predicted secondary structure (PSS) is also applied. The performance of the method is benchmarked with TZ sequences. In the benchmarks, superfamily prediction rates of 62% and 65% are obtained with GOR IV and NPS@ predicted secondary structures, respectively. Receiver Operating Characteristic (ROC) curves indicate that the method is sensitive in predicting superfamilies. A case study was conducted with the hypothetical protein sequences of Schistosoma haematobium (blood fluke) using this method, and the results are analysed. Our method predicts the superfamily for TZ sequences for which methods based on sequence similarity alone are inadequate. A web server has been developed for our method and is available online at http://bioinfo.bdu.ac.in/psfpss.
Affiliation(s)
- K Muthuvel Prasath
- Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli 620 024, Tamil Nadu, India
- K Ganesan
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai 600 036, Tamil Nadu, India
- S Parthasarathy
- Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli 620 024, Tamil Nadu, India
|
3
|
Barlowe S, Coan HB, Youker RT. SubVis: an interactive R package for exploring the effects of multiple substitution matrices on pairwise sequence alignment. PeerJ 2017; 5:e3492. [PMID: 28674656 PMCID: PMC5490468 DOI: 10.7717/peerj.3492] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/16/2017] [Accepted: 05/27/2017] [Indexed: 01/13/2023]
Abstract
Understanding how proteins mutate is critical to solving a host of biological problems. Mutations occur when an amino acid is substituted for another in a protein sequence. The set of likelihoods for amino acid substitutions is stored in a matrix and input to alignment algorithms. The quality of the resulting alignment is used to assess the similarity of two or more sequences and can vary according to assumptions modeled by the substitution matrix. Substitution strategies with minor parameter variations are often grouped together in families. For example, the BLOSUM and PAM matrix families are commonly used because they provide a standard, predefined way of modeling substitutions. However, researchers often do not know if a given matrix family or any individual matrix within a family is the most suitable. Furthermore, predefined matrix families may inaccurately reflect a particular hypothesis that a researcher wishes to model or otherwise result in unsatisfactory alignments. In these cases, the ability to compare the effects of one or more custom matrices may be needed. This laborious process is often performed manually because the ability to simultaneously load multiple matrices and then compare their effects on alignments is not readily available in current software tools. This paper presents SubVis, an interactive R package for loading and applying multiple substitution matrices to pairwise alignments. Users can simultaneously explore alignments resulting from multiple predefined and custom substitution matrices. SubVis utilizes several of the alignment functions found in R, a common language among protein scientists. Functions are tied together with the Shiny platform which allows the modification of input parameters. Information regarding alignment quality and individual amino acid substitutions is displayed with the JavaScript language which provides interactive visualizations for revealing both high-level and low-level alignment information.
Affiliation(s)
- Scott Barlowe
- Department of Mathematics and Computer Science, Western Carolina University, Cullowhee, NC, United States of America
- Heather B Coan
- Department of Biology, Western Carolina University, Cullowhee, NC, United States of America
- Robert T Youker
- Department of Biology, Western Carolina University, Cullowhee, NC, United States of America
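To illustrate the kind of comparison the SubVis abstract describes, the following minimal Python sketch scores one pair of sequences under two substitution schemes. It is not SubVis code (SubVis is an R package); the function names `sw_score`, `identity`, `grouped`, and `compare`, the toy scoring values, the residue groups, and the linear gap penalty are all assumptions made for illustration.

```python
from typing import Callable, Dict

Scheme = Callable[[str, str], int]


def sw_score(a: str, b: str, sub: Scheme, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            cell = max(
                0,                          # start a new local alignment
                prev[j - 1] + sub(ca, cb),  # align ca with cb
                prev[j] + gap,              # gap in b
                cur[j - 1] + gap,           # gap in a
            )
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def identity(x: str, y: str) -> int:
    """Toy scheme (assumption): reward identities only."""
    return 2 if x == y else -1


# Hypothetical groups of residues treated as conservative substitutions.
CONSERVATIVE = {frozenset("IL"), frozenset("KR"), frozenset("DE")}


def grouped(x: str, y: str) -> int:
    """Toy scheme (assumption): mild reward for conservative substitutions."""
    if x == y:
        return 2
    return 1 if frozenset((x, y)) in CONSERVATIVE else -1


def compare(a: str, b: str, schemes: Dict[str, Scheme]) -> Dict[str, int]:
    """Score the same sequence pair under every supplied scheme."""
    return {name: sw_score(a, b, f) for name, f in schemes.items()}


if __name__ == "__main__":
    print(compare("HEAGKLE", "HEAGRIE", {"identity": identity, "grouped": grouped}))
```

For the pair `HEAGKLE` / `HEAGRIE`, the grouped scheme rewards the K/R and L/I substitutions, so it reports a higher local score than the identity scheme for the same pair. Side-by-side differences of this kind are what a multi-matrix comparison tool is meant to surface.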
|
4
|
Lagnoux A, Mercier S, Vallois P. Statistical significance based on length and position of the local score in a model of i.i.d. sequences. Bioinformatics 2017; 33:654-660. [PMID: 28035025 DOI: 10.1093/bioinformatics/btw699] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/01/2016] [Accepted: 11/08/2016] [Indexed: 11/14/2022]
Abstract
Motivation: The local score is a mathematical tool widely used to analyse biological sequences. Consequently, determining an accurate estimate of its distribution is crucial. Results: First, we study the accuracy of classical results on the local score distribution in the independent and identically distributed (i.i.d.) model using a Kolmogorov-Smirnov goodness-of-fit test. Second, we highlight how the length of the segment that realizes the local score improves on the classical setting based on the local score only. Finally, we study which part of the sequence contributes to the local score. Contact: mercier@univ-tlse2.fr.
Affiliation(s)
- Agnès Lagnoux
- Institut de Mathématiques de Toulouse, UMR5219, Université de Toulouse 2 Jean Jaurès, 5 allées Antonio Machado, Toulouse, Cedex 09 31058, France
- Sabine Mercier
- Institut de Mathématiques de Toulouse, UMR5219, Université de Toulouse 2 Jean Jaurès, 5 allées Antonio Machado, Toulouse, Cedex 09 31058, France
- Pierre Vallois
- Institut Elie Cartan, UMR7502 CNRS, INRIA-BIGS, Université de Lorraine, Vandoeuvre-lès-Nancy Cedex 54506, France
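The local score discussed in the abstract above is the largest sum of scores over any contiguous segment of the sequence (or zero if every segment sums to a non-positive value). A minimal Python sketch, assuming per-position scores are already given as numbers, computes it with the Lindley recursion and also reports the position and length of the segment that realizes it. The function name `local_score` and the example scores are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence, Tuple


def local_score(scores: Sequence[float]) -> Tuple[float, int, int]:
    """Return (local score, start index, length) of the best-scoring segment.

    Uses the Lindley recursion W_k = max(0, W_{k-1} + X_k); the local score
    is the maximum of W_k over k. Returns (0.0, 0, 0) if no segment has a
    positive sum.
    """
    best, best_start, best_len = 0.0, 0, 0
    running, start = 0.0, 0
    for k, x in enumerate(scores):
        if running + x <= 0:
            # The process returns to zero; a new excursion starts after k.
            running = 0.0
            start = k + 1
        else:
            running += x
        if running > best:
            best, best_start, best_len = running, start, k - start + 1
    return best, best_start, best_len


if __name__ == "__main__":
    print(local_score([1, -2, 3, 2, -1, 4, -10, 2]))
```

Returning the start index and length alongside the score mirrors the abstract's point that the segment realizing the local score carries information beyond the score itself.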
|
5
|
Jäntschi L, Bolboacă SD. Distribution on contingency of alignment of two literal sequences under constrains. Acta Biotheor 2015; 63:55-69. [PMID: 25524134 DOI: 10.1007/s10441-014-9243-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2014] [Accepted: 12/05/2014] [Indexed: 10/24/2022]
Abstract
The case of ungapped alignment of two literal sequences under constraints is considered. The analysis led to general formulas for the probability mass function and cumulative distribution function for the general case of an alphabet with a chosen number of letters (e.g. 4 for deoxyribonucleic acid sequences) used to express the literal sequences. Formulas for three statistics (mean, mode, and standard deviation) were obtained. Distributions are depicted for three important particular cases: alignment of binary sequences, alignment of trinomial series (such as those coming from the generalized Kronecker delta), and alignment of genetic sequences (with four literals in the alphabet). The particular case in which both sequences contain each letter of the alphabet at least once has also been analyzed, and some statistics for this restricted case are given.
|
6
|
Abstract
Multiple sequence alignment involves identifying related subsequences among biological sequences. When matches are found, the associated pieces are shifted so that when the sequences are presented as successive rows (one sequence per row), homologous residues line up in columns. Exact alignment of more than a few sequences is known to be computationally prohibitive. Thus many heuristic algorithms have been developed to produce good alignments in a reasonable amount of time by determining an order in which pairs of sequences are progressively aligned and merged. GRAMALIGN is such a progressive alignment algorithm; it uses a grammar-based relative complexity distance metric to determine the alignment order. This technique allows for a computationally efficient and scalable program that quickly aligns both large numbers of sequences and sets of long sequences. The GRAMALIGN software is available at http://bioinfo.unl.edu/gramalign.php for both source code download and a web-based alignment server.
Affiliation(s)
- David J Russell
- Department of Electrical Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA
|
7
|
Alignment free comparison: k word voting model and its applications. J Theor Biol 2013; 335:276-82. [DOI: 10.1016/j.jtbi.2013.06.037] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/22/2012] [Revised: 04/25/2013] [Accepted: 06/26/2013] [Indexed: 02/06/2023]
|
8
|
Abstract
Thanks to advances in next-generation technologies, genome sequences are now being generated at a breadth (e.g. across environments) and depth (thousands of closely related strains, individuals or samples) unimaginable only a few years ago. Phylogenomics (the study of evolutionary relationships based on comparative analysis of genome-scale data) has so far been developed as industrial-scale molecular phylogenetics, proceeding in the two classical steps: multiple alignment of homologous sequences, followed by inference of a tree (or multiple trees). However, the algorithms typically employed for these steps scale poorly with the number of sequences, such that for an increasing number of problems, high-quality phylogenomic analysis is (or soon will be) computationally infeasible. Moreover, next-generation data are often incomplete and error-prone, and analysis may be further complicated by genome rearrangement, gene fusion and deletion, lateral genetic transfer, and transcript variation. Here we argue that next-generation data require next-generation phylogenomics, including so-called alignment-free approaches.
Affiliation(s)
- Cheong Xin Chan
- Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland, Brisbane, QLD, 4072, Australia
|
9
|
Wang D, Tapan S. MISCORE: a new scoring function for characterizing DNA regulatory motifs in promoter sequences. BMC Syst Biol 2012; 6 Suppl 2:S4. [PMID: 23282090 PMCID: PMC3521183 DOI: 10.1186/1752-0509-6-s2-s4] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 12/31/2022]
Abstract
Background: Computational approaches for finding DNA regulatory motifs in promoter sequences are useful to biologists in terms of reducing experimental costs and speeding up the discovery of de novo binding sites. It is important for rule-based or clustering-based motif searching schemes to effectively and efficiently evaluate the similarity between a k-mer (a k-length subsequence) and a motif model, without assuming the independence of nucleotides in motif models and without employing computationally expensive Markov chain models to estimate the background probabilities of k-mers. It is also interesting and beneficial to use a priori knowledge in developing advanced searching tools. Results: This paper presents a new scoring function, termed MISCORE, for functional motif characterization and evaluation. MISCORE is free from (i) any assumption on model dependency and (ii) the use of a Markov chain model for background modeling. It integrates the compositional complexity of motif instances into the function. Performance evaluations comparing it with the well-known Maximum a Posteriori (MAP) score and Information Content (IC) have shown that MISCORE has promising capabilities to separate and recognize functional DNA motifs and their instances from non-functional ones. Conclusions: MISCORE is a fast computational tool for candidate motif characterization, evaluation and selection. It allows a priori known motif models to be embedded for computing motif-to-motif similarity, which is an advantage over IC and the MAP score. In addition, MISCORE can automatically filter out some repetitive k-mers from a motif model owing to the introduction of compositional complexity in the function. Consequently, the merits of MISCORE in terms of both motif signal modeling power and computational efficiency will make it more applicable in the development of computational motif discovery tools.
Affiliation(s)
- Dianhui Wang
- Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Victoria 3086, Australia.
|
10
|
Sarkar D, Goldstein S, Schwartz DC, Newton MA. Statistical significance of optical map alignments. J Comput Biol 2012; 19:478-92. [PMID: 22506568 DOI: 10.1089/cmb.2011.0221] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022]
Abstract
The Optical Mapping System constructs ordered restriction maps spanning entire genomes through the assembly and analysis of large datasets comprising individually analyzed genomic DNA molecules. Such restriction maps uniquely reveal mammalian genome structure and variation, but also raise computational and statistical questions beyond those that have been solved in the analysis of smaller, microbial genomes. We address the problem of how to filter maps that align poorly to a reference genome. We obtain map-specific thresholds that control errors and improve iterative assembly. We also show how an optimal self-alignment score provides an accurate approximation to the probability of alignment, which is useful in applications seeking to identify structural genomic abnormalities.
Affiliation(s)
- Deepayan Sarkar
- Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, New Delhi, India
|
11
|
Zhang Y, Misra S, Agrawal A, Patwary MMA, Liao WK, Qin Z, Choudhary A. Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power. BMC Bioinformatics 2012; 13 Suppl 5:S3. [PMID: 22537007 PMCID: PMC3318904 DOI: 10.1186/1471-2105-13-s5-s3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/27/2022]
Abstract
Background: Pairwise statistical significance has been recognized as able to accurately identify related sequences, a cornerstone procedure in numerous bioinformatics applications. However, it is both computationally and data intensive, which poses a major challenge in terms of performance and scalability. Results: We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. By carefully studying the algorithm's data access characteristics, we developed a tile-based scheme that produces contiguous data access in GPU global memory and sustains a large number of threads to achieve high GPU occupancy. We further extend the parallelization technique to estimate pairwise statistical significance using position-specific substitution matrices, which have earlier demonstrated significantly better sequence comparison accuracy than standard substitution matrices. The implementation is also extended to take advantage of dual GPUs. We observe end-to-end speedups of nearly 250× (370×) using a single Tesla C2050 GPU (dual Tesla C2050 GPUs) over the CPU implementation using an Intel® Core™ i7 CPU 920 processor. Conclusions: Harvesting the high performance of modern GPUs is a promising approach to accelerate pairwise statistical significance estimation for local sequence alignment.
Affiliation(s)
- Yuhong Zhang
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China.
|
12
|
PSS-3D1D: an improved 3D1D profile method of protein fold recognition for the annotation of twilight zone sequences. J Struct Funct Genomics 2011; 12:181-9. [DOI: 10.1007/s10969-011-9119-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/15/2011] [Accepted: 11/24/2011] [Indexed: 10/14/2022]
|
13
|
Freschi V, Bogliolo A. A Monte Carlo method for assessing the quality of duplication-aware alignment algorithms. Evol Bioinform Online 2011; 7:31-40. [PMID: 21698090 PMCID: PMC3118696 DOI: 10.4137/ebo.s6662] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/06/2022]
Abstract
The increasing availability of high-throughput sequencing technologies poses several challenges concerning the analysis of genomic data. Within this context, duplication-aware sequence alignment that takes complex mutation events into account is regarded as an important problem, particularly in light of recent evolutionary bioinformatics research that highlighted the role of tandem duplications as one of the most important mutation events. Traditional sequence comparison algorithms do not take these events into account, resulting in alignments of poor biological significance, mainly because of their assumption of statistical independence among contiguous residues. Several duplication-aware algorithms have been proposed in recent years, which differ either in the type of duplications they consider or in the methods adopted to identify and compare them. However, no solution clearly outperforms the others, and no methods exist for assessing the reliability of the resulting alignments. This paper proposes a Monte Carlo method for assessing the quality of duplication-aware alignment algorithms and for guiding the choice of the most appropriate alignment technique in a specific context. The applicability and usefulness of the proposed approach are demonstrated in a case study, namely the comparison of alignments based on edit distance with or without repeat masking.
Affiliation(s)
- Valerio Freschi
- DiSBeF-Department of Base Sciences and Fundamentals, University of Urbino, Italy
|
14
|
Chan HP, Zhang NR, Chen LHY. Importance sampling of word patterns in DNA and protein sequences. J Comput Biol 2011; 17:1697-709. [PMID: 21128856 DOI: 10.1089/cmb.2008.0233] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022]
Abstract
Monte Carlo methods can provide accurate p-value estimates of word counting test statistics and are easy to implement. They are especially attractive when an asymptotic theory is absent or when either the search sequence or the word pattern is too short for the application of asymptotic formulae. Naive direct Monte Carlo is undesirable for the estimation of small probabilities because the associated rare events of interest are seldom generated. We propose instead efficient importance sampling algorithms that use controlled insertion of the desired word patterns on randomly generated sequences. The implementation is illustrated on word patterns of biological interest: palindromes and inverted repeats, patterns arising from position-specific weight matrices (PSWMs), and co-occurrences of pairs of motifs.
Affiliation(s)
- Hock Peng Chan
- Department of Statistics and Applied Probability, National University of Singapore, Singapore, Republic of Singapore
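The abstract above contrasts importance sampling with naive direct Monte Carlo. As a point of reference only, here is a minimal Python sketch of the naive approach for a word-count statistic, assuming an i.i.d. uniform background over the alphabet. It does not implement the importance sampling algorithms of the paper; the function names, the default alphabet, and the add-one correction are assumptions chosen for illustration.

```python
import random


def count_word(seq: str, word: str) -> int:
    """Count occurrences of word in seq, allowing overlaps."""
    return sum(
        1 for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i)
    )


def mc_pvalue(
    observed: int,
    word: str,
    length: int,
    alphabet: str = "ACGT",
    n_sim: int = 2000,
    seed: int = 0,
) -> float:
    """Naive Monte Carlo estimate of P(count >= observed).

    Sequences are simulated i.i.d. uniform over the alphabet (an assumption).
    The add-one correction keeps the estimate strictly positive.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        sim = "".join(rng.choice(alphabet) for _ in range(length))
        if count_word(sim, word) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    print(mc_pvalue(observed=3, word="ACG", length=100))
```

With this estimator the smallest reportable value is 1 / (n_sim + 1), which makes concrete why direct simulation becomes expensive for very small probabilities: the rare event is almost never generated.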
|
15
|
Using Markov model to improve word normalization algorithm for biological sequence comparison. Amino Acids 2011; 42:1867-77. [DOI: 10.1007/s00726-011-0906-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/13/2010] [Accepted: 03/29/2011] [Indexed: 10/18/2022]
|
16
|
Miller CA, Settle SH, Sulman EP, Aldape KD, Milosavljevic A. Discovering functional modules by identifying recurrent and mutually exclusive mutational patterns in tumors. BMC Med Genomics 2011; 4:34. [PMID: 21489305 PMCID: PMC3102606 DOI: 10.1186/1755-8794-4-34] [Citation(s) in RCA: 87] [Impact Index Per Article: 6.7] [Received: 06/25/2010] [Accepted: 04/14/2011] [Indexed: 11/10/2022]
Abstract
Background: Assays of multiple tumor samples frequently reveal recurrent genomic aberrations, including point mutations and copy-number alterations, that affect individual genes. Analyses that extend beyond single genes are often restricted to examining pathways, interactions and functional modules that are already known. Methods: We present a method that identifies functional modules without any information other than patterns of recurrent and mutually exclusive aberrations (RME patterns) that arise due to positive selection for key cancer phenotypes. Our algorithm efficiently constructs and searches networks of potential interactions and identifies significant modules (RME modules) by using the algorithmic significance test. Results: We apply the method to the TCGA collection of 145 glioblastoma samples, resulting in extension of known pathways and discovery of new functional modules. The method predicts a role for EP300 that was previously unknown in glioblastoma. We demonstrate the clinical relevance of these results by validating that expression of EP300 is prognostic, predicting survival independent of age at diagnosis and tumor grade. Conclusions: We have developed a sensitive, simple, and fast method for automatically detecting functional modules in tumors based solely on patterns of recurrent genomic aberration. Due to its ability to analyze very large amounts of diverse data, we expect it to be increasingly useful when applied to the many tumor panels scheduled to be assayed in the near future.
Affiliation(s)
- Christopher A Miller
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
|
17
|
Agrawal A, Huang X. Pairwise statistical significance of local sequence alignment using sequence-specific and position-specific substitution matrices. IEEE/ACM Trans Comput Biol Bioinform 2011; 8:194-205. [PMID: 21071807 DOI: 10.1109/tcbb.2009.69] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 05/30/2023]
Abstract
Pairwise sequence alignment is a central problem in bioinformatics, which forms the basis of various other applications. Two related sequences are expected to have a high alignment score, but relatedness is usually judged by statistical significance rather than by alignment score. Recently, it was shown that pairwise statistical significance gives promising results as an alternative to database statistical significance for getting individual significance estimates of pairwise alignment scores. The improvement was mainly attributed to making the statistical significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive the estimates of pairwise statistical significance, which is expected to bring more sequence-specific information into the estimation. Experiments were conducted on a benchmark database with sequence-specific substitution matrices at different levels of sequence-specific contribution. The results confirm that using sequence-specific substitution matrices for estimating pairwise statistical significance is significantly better than using a standard matrix like BLOSUM62, and better than the database statistical significance estimates reported by popular database search programs like BLAST, PSI-BLAST (without pretrained PSSMs), and SSEARCH; with pretrained PSSMs, however, PSI-BLAST results are significantly better. Further, using position-specific substitution matrices for estimating pairwise statistical significance gives significantly better results even than PSI-BLAST using pretrained PSSMs.
Affiliation(s)
- Ankit Agrawal
- Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA.
|
18
|
Abstract
Alignment algorithms are powerful tools for searching for homologous proteins in databases, providing a score for each sequence present in the database. It has been well known for 20 years that the shape of the score distribution looks like an extreme value distribution. The extremely large number of times biologists face this class of distributions raises the question of the evolutionary origin of this probability law. We investigated the possibility of deriving the main properties of sequence alignment score distributions from a basic evolutionary process: a duplication-divergence protein evolution process in a sequence space. Firstly, the distribution of sequences in this space was defined with respect to the genetic distance between sequences. Secondly, we derived a basic relation between the genetic distance and the alignment score. We obtained a novel score probability distribution which is qualitatively very similar to that of Karlin-Altschul but performs better than all previous models.
Affiliation(s)
- Philippe Ortet
- CNRS (UMR 6191)-CEA Cadarache-Aix-Marseille Université, Laboratoire d'Ecologie Microbienne de la Rhizosphere, Institut de Biologie Environementale et Biotechnologie, CEA Cadarache, F-13108 Saint Paul-lez-Durance, France
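For reference, the classical Karlin-Altschul result that the abstract compares against gives the tail of the optimal local alignment score S for sequences of lengths m and n in extreme value (Gumbel) form, P(S ≥ x) ≈ 1 − exp(−K m n e^{−λx}). The Python sketch below evaluates this expression. The function names and the parameter values in the usage line are illustrative assumptions, and the sketch does not implement the new distribution proposed in the paper.

```python
import math


def ka_evalue(score: float, m: int, n: int, lam: float, K: float) -> float:
    """Expected number of local alignments scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)


def ka_pvalue(score: float, m: int, n: int, lam: float, K: float) -> float:
    """P(S >= score) = 1 - exp(-E), with E from ka_evalue.

    expm1 keeps precision when E is very small.
    """
    return -math.expm1(-ka_evalue(score, m, n, lam, K))


if __name__ == "__main__":
    # lam and K depend on the scoring system; these values are placeholders.
    print(ka_pvalue(score=50, m=100, n=100, lam=0.3, K=0.1))
```

For small E the p-value and the E-value nearly coincide, which is why search tools often report only the latter.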
|
19
|
Oberto J. FITBAR: a web tool for the robust prediction of prokaryotic regulons. BMC Bioinformatics 2010; 11:554. [PMID: 21070640 PMCID: PMC3098098 DOI: 10.1186/1471-2105-11-554] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Received: 06/29/2010] [Accepted: 11/11/2010] [Indexed: 11/24/2022]
Abstract
Background: The binding of regulatory proteins to their specific DNA targets determines the accurate expression of the neighboring genes. The in silico prediction of new binding sites in completely sequenced genomes is a key aspect in the deeper understanding of gene regulatory networks. Several algorithms have been described to discriminate against false positives in the prediction of new binding targets; however, none of them has been implemented so far to assist the detection of binding sites at the genomic scale. Results: FITBAR (Fast Investigation Tool for Bacterial and Archaeal Regulons) is a web service designed to identify new protein binding sites on fully sequenced prokaryotic genomes. This tool consists of a workbench where the significance of the predictions can be compared using different statistical methods, a feature not found in existing resources. The Local Markov Model and the Compound Importance Sampling algorithms have been implemented to compute the P-value of newly discovered binding sites. In addition, FITBAR provides two optimized genomic scanning algorithms using either log-odds or entropy-weighted position-specific scoring matrices. Other significant features include the production of a detailed genomic context map for each detected binding site and the export of the search results in spreadsheet and portable document formats. FITBAR's discovery of a high-affinity Escherichia coli NagC binding site was validated experimentally in vitro as well as in vivo and published. Conclusions: FITBAR was developed to allow fast, accurate and statistically robust predictions of prokaryotic regulons. This feature constitutes the main advantage of this web tool over other matrix search programs and does not impair its performance. The web service is available at http://archaea.u-psud.fr/fitbar.
Affiliation(s)
- Jacques Oberto
- Université Paris-Sud 11, Centre National de la Recherche Scientifique, UMR 8621, Institut de Génétique et Microbiologie, Orsay, France.
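The abstract above mentions genomic scanning with log-odds position-specific scoring matrices. A minimal Python sketch of that general technique follows; it is not FITBAR's implementation. The function names, the pseudocount, the uniform background, and the example sites are assumptions for illustration, and the entropy-weighted variant and the P-value algorithms named in the abstract are not shown.

```python
import math
from typing import Dict, List, Sequence, Tuple

Pssm = List[Dict[str, float]]


def log_odds_pssm(
    sites: Sequence[str], alphabet: str = "ACGT", pseudocount: float = 1.0
) -> Pssm:
    """Build a log2-odds matrix from equal-length aligned binding sites.

    Assumes a uniform background over the alphabet (an illustrative choice).
    """
    background = 1.0 / len(alphabet)
    total = len(sites) + pseudocount * len(alphabet)
    pssm: Pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pssm.append(
            {
                c: math.log2(((column.count(c) + pseudocount) / total) / background)
                for c in alphabet
            }
        )
    return pssm


def scan(seq: str, pssm: Pssm) -> List[Tuple[int, float]]:
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pssm)
    return [
        (i, sum(col[seq[i + k]] for k, col in enumerate(pssm)))
        for i in range(len(seq) - width + 1)
    ]


if __name__ == "__main__":
    # Hypothetical aligned sites, used only to exercise the functions.
    matrix = log_odds_pssm(["TTGACA", "TTGACA", "TTGATA", "TTGACA"])
    hits = scan("GGTTGACAGG", matrix)
    print(max(hits, key=lambda h: h[1]))
```

In practice each window score would then be passed to a significance estimate, which is the step the abstract's statistical methods address.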
|
20
|
Lane J, Duroux P, Lefranc MP. From IMGT-ONTOLOGY to IMGT/LIGMotif: the IMGT standardized approach for immunoglobulin and T cell receptor gene identification and description in large genomic sequences. BMC Bioinformatics 2010; 11:223. [PMID: 20433708 PMCID: PMC2880031 DOI: 10.1186/1471-2105-11-223] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Received: 07/07/2009] [Accepted: 04/30/2010] [Indexed: 01/25/2023]
Abstract
Background: The antigen receptors, immunoglobulins (IG) and T cell receptors (TR), are specific molecular components of the adaptive immune response of vertebrates. Their genes are organized in the genome in several loci (7 in humans) that comprise different gene types: variable (V), diversity (D), joining (J) and constant (C) genes. Synthesis of the IG and TR proteins requires rearrangements of V and J, or V, D and J genes at the DNA level, followed by the splicing at the RNA level of the rearranged V-J and V-D-J genes to C genes. Owing to the particularities of IG and TR gene structures related to these molecular mechanisms, conventional bioinformatic software and tools are not adapted to the identification and description of IG and TR genes in large genomic sequences. In order to answer that need, IMGT®, the international ImMunoGeneTics information system®, has developed IMGT/LIGMotif, a tool for IG and TR gene annotation. This tool is based on standardized rules defined in IMGT-ONTOLOGY, the first ontology in immunogenetics and immunoinformatics. Results: IMGT/LIGMotif currently annotates human and mouse IG and TR loci in large genomic sequences. The annotation includes gene identification and orientation on DNA strand, description of the V, D and J genes by assigning IMGT® labels, gene functionality, and finally, gene delimitation and cluster assembly. IMGT/LIGMotif analyses sequences up to 2.5 megabase pairs and can analyse them in batch files. Conclusions: IMGT/LIGMotif is currently used by the IMGT® biocurators to annotate, in a first step, IG and TR genomic sequences of human and mouse in new haplotypes and those of closely related species, nonhuman primates and rat, respectively. In a next step, and following enrichment of its reference databases, IMGT/LIGMotif will be used to annotate IG and TR of more distantly related vertebrate species. IMGT/LIGMotif is available at http://www.imgt.org/ligmotif/.
Affiliation(s)
- Jérôme Lane
- Université Montpellier 2, Laboratoire d'ImmunoGénétique Moléculaire LIGM, UPR CNRS 1142, Institut de Génétique Humaine IGH, 141 rue de la Cardonille, 34396 Montpellier cedex 5, France
|
21
|
Margelevicius M, Venclovas C. Detection of distant evolutionary relationships between protein families using theory of sequence profile-profile comparison. BMC Bioinformatics 2010; 11:89. [PMID: 20158924 PMCID: PMC2837030 DOI: 10.1186/1471-2105-11-89] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.7] [Received: 08/28/2009] [Accepted: 02/17/2010] [Indexed: 01/31/2023] Open
Abstract
Background Detection of common evolutionary origin (homology) is a primary means of inferring protein structure and function. At present, comparison of protein families represented as sequence profiles is arguably the most effective homology detection strategy. However, finding the best way to represent evolutionary information of a protein sequence family in the profile, to compare profiles and to estimate the biological significance of such comparisons, remains an active area of research. Results Here, we present a new homology detection method based on sequence profile-profile comparison. The method has a number of new features including position-dependent gap penalties and a global score system. Position-dependent gap penalties provide a more biologically relevant way to represent and align protein families as sequence profiles. The global score system enables an analytical solution of the statistical parameters needed to estimate the statistical significance of profile-profile similarities. The new method, together with other state-of-the-art profile-based methods (HHsearch, COMPASS and PSI-BLAST), is benchmarked in all-against-all comparison of a challenging set of SCOP domains that share at most 20% sequence identity. For benchmarking, we use a reference ("gold standard") free model-based evaluation framework. Evaluation results show that at the level of protein domains our method compares favorably to all other tested methods. We also provide examples of the new method outperforming structure-based similarity detection and alignment. The implementation of the new method both as a standalone software package and as a web server is available at http://www.ibt.lt/bioinformatics/coma. Conclusion Due to a number of developments, the new profile-profile comparison method shows an improved ability to match distantly related protein domains. Therefore, the method should be useful for annotation and homology modeling of uncharacterized proteins.
|
22
|
Dai Q, Liu X, Li L, Yao Y, Han B, Zhu L. Using Gaussian model to improve biological sequence comparison. J Comput Chem 2010; 31:351-61. [PMID: 19479732 PMCID: PMC7166749 DOI: 10.1002/jcc.21322] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/11/2009] [Accepted: 04/14/2009] [Indexed: 11/08/2022]
Abstract
One of the major tasks in biological sequence analysis is to compare biological sequences, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Numerous efficient methods have been developed for sequence comparison, but challenges remain. In this article, we proposed a novel method to compare biological sequences based on a Gaussian model. Instead of comparing the frequencies of k-words in biological sequences directly, we considered the k-word frequency distribution under a Gaussian model, which gives the different expression levels of k-words. The proposed method was tested by similarity search, evaluation on functionally related genes, and phylogenetic analysis. The performance of our method was further compared with alignment-based and alignment-free methods. The results demonstrate that the Gaussian model provides more information about k-word frequencies and improves the efficiency of sequence comparison.
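For orientation, the direct k-word frequency comparison that this abstract contrasts itself with can be sketched as follows. This is only a minimal illustration of the alignment-free baseline, not the Gaussian-model method of the paper; the function names `kword_frequencies` and `euclidean_distance` are hypothetical, and the choice of Euclidean distance is an assumption.

```python
from collections import Counter


def kword_frequencies(seq, k):
    """Relative frequency of every overlapping k-word in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def euclidean_distance(freq_a, freq_b):
    """Euclidean distance between two k-word frequency profiles.

    Words missing from one profile are treated as frequency 0.
    """
    words = set(freq_a) | set(freq_b)
    return sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2 for w in words) ** 0.5


if __name__ == "__main__":
    a = kword_frequencies("ACGTACGTAC", 2)
    b = kword_frequencies("ACGTTTGTAC", 2)
    print(euclidean_distance(a, b))
```

A smaller distance indicates more similar k-word composition; how the Gaussian model reweights these frequencies is described only in the paper itself.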
Affiliation(s)
- Qi Dai
- Institute for Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou 310018, People's Republic of China
- Xiaoqing Liu
- School of Science, Hangzhou Dianzi University, Hangzhou 310018, People's Republic of China
- Lihua Li
- Institute for Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou 310018, People's Republic of China
- Yuhua Yao
- College of Life Sciences, Zhejiang Sci‐Tech University, Hangzhou 310018, People's Republic of China
- Bin Han
- Institute for Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou 310018, People's Republic of China
- Lei Zhu
- Institute for Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou 310018, People's Republic of China
|
23
|
Park HC, Eo HS, Kim W. A computational approach for the classification of protein tyrosine kinases. Mol Cells 2009; 28:195-200. [PMID: 19756393 DOI: 10.1007/s10059-009-0122-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 07/08/2009] [Accepted: 07/20/2009] [Indexed: 10/20/2022] Open
Abstract
Protein tyrosine kinases (PTKs) play a central role in the modulation of a wide variety of cellular events such as differentiation, proliferation and metabolism, and their unregulated activation can lead to various diseases including cancer and diabetes. PTKs represent a diverse family of proteins including both receptor tyrosine kinases (RTKs) and non-receptor tyrosine kinases (NRTKs). Due to the diversity and important cellular roles of PTKs, accurate classification methods are required to better understand and differentiate different PTKs. In addition, PTKs have become important targets for drugs, providing a further need to develop novel methods to accurately classify this set of important biological molecules. Here, we introduce a novel statistical model for the classification of PTKs that is based on their structural features. The approach allows for both the recognition of PTKs and the classification of RTKs into their subfamilies. This novel approach had an overall accuracy of 98.5% for the identification of PTKs, and 99.3% for the classification of RTKs.
Affiliation(s)
- Hyun-Chul Park
- Program in Bioinformatics, Seoul National University, Korea
|
24
|
Newberg LA. Error statistics of hidden Markov model and hidden Boltzmann model results. BMC Bioinformatics 2009; 10:212. [PMID: 19589158 PMCID: PMC2722652 DOI: 10.1186/1471-2105-10-212] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 02/09/2009] [Accepted: 07/09/2009] [Indexed: 11/29/2022] Open
Abstract
Background Hidden Markov models and hidden Boltzmann models are employed in computational biology and a variety of other scientific fields for a variety of analyses of sequential data. Whether the associated algorithms are used to compute an actual probability or, more generally, an odds ratio or some other score, a frequent requirement is that the error statistics of a given score be known. What is the chance that random data would achieve that score or better? What is the chance that a real signal would achieve a given score threshold? Results Here we present a novel general approach to estimating these false positive and true positive rates that is significantly more efficient than are existing general approaches. We validate the technique via an implementation within the HMMER 3.0 package, which scans DNA or protein sequence databases for patterns of interest, using a profile-HMM. Conclusion The new approach is faster than general naïve sampling approaches, and more general than other current approaches. It provides an efficient mechanism by which to estimate error statistics for hidden Markov model and hidden Boltzmann model results.
Affiliation(s)
- Lee A Newberg
- The Wadsworth Center, New York State Department of Health, Albany, NY 12201, USA.
|
25
|
Agrawal A, Huang X. Pairwise statistical significance of local sequence alignment using multiple parameter sets and empirical justification of parameter set change penalty. BMC Bioinformatics 2009; 10 Suppl 3:S1. [PMID: 19344477 PMCID: PMC2665049 DOI: 10.1186/1471-2105-10-s3-s1] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Accurate estimation of statistical significance of a pairwise alignment is an important problem in sequence comparison. Recently, a comparative study of pairwise statistical significance with database statistical significance was conducted. In this paper, we extend the earlier work on pairwise statistical significance by incorporating the use of multiple parameter sets. RESULTS Results for a knowledge discovery application of homology detection reveal that using multiple parameter sets for pairwise statistical significance estimates gives better coverage than using a single parameter set, at least at some error levels. Further, the results of pairwise statistical significance using multiple parameter sets are shown to be significantly better than database statistical significance estimates reported by BLAST and PSI-BLAST, and comparable to, and at times significantly better than, SSEARCH. Using non-zero parameter set change penalty values gives better performance than zero penalty. CONCLUSION The fact that the homology detection performance does not degrade when using multiple parameter sets is strong evidence for the validity of the assumption that the alignment score distribution follows an extreme value distribution even when using multiple parameter sets. Parameter set change penalty is a useful parameter for alignment using multiple parameter sets. Pairwise statistical significance using multiple parameter sets can be effectively used to determine the relatedness of a (or a few) pair(s) of sequences without performing a time-consuming database search.
Affiliation(s)
- Ankit Agrawal
- Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA.
|
26
|
Abstract
Measurement of the statistical significance of extreme sequence alignment scores is key to many important applications, but it is difficult. To precisely approximate alignment score significance, we draw random samples directly from a well-chosen, importance-sampling probability distribution. We apply our technique to pairwise local sequence alignment of nucleic acid and amino acid sequences of length up to 1000. For instance, using a BLOSUM62 scoring system for local sequence alignment, we compute that the p-value of a score of 6000 for the alignment of two sequences of length 1000 is (3.4 ± 0.3) × 10^-1314. Further, we show that the extreme value significance statistic for the local alignment model that we examine does not follow a Gumbel distribution. A web server for this application is available at http://bayesweb.wadsworth.org/alignmentSignificanceV1/.
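The general idea of importance sampling for very small tail probabilities can be illustrated with a toy case. The sketch below is not the alignment-score sampler described in this abstract: it estimates the upper tail of a standard normal variable by sampling from a proposal shifted onto the threshold, and all names (`tail_probability_is`, `tail_probability_exact`) are hypothetical.

```python
import math
import random


def tail_probability_is(t, n_samples, rng):
    """Estimate P(Z > t) for standard normal Z by importance sampling.

    Samples are drawn from the proposal N(t, 1), which places about half
    of its mass in the rare region, and each hit is reweighted by the
    density ratio phi(z) / phi(z - t) = exp(-t*z + t*t/2).
    """
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(t, 1.0)
        if z > t:
            total += math.exp(-t * z + 0.5 * t * t)
    return total / n_samples


def tail_probability_exact(t):
    """Closed-form P(Z > t) for comparison."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


if __name__ == "__main__":
    rng = random.Random(0)
    print(tail_probability_is(6.0, 50_000, rng), tail_probability_exact(6.0))
```

Naive sampling from the standard normal would almost never produce a value above 6, whereas the shifted proposal yields a usable estimate from a modest number of draws; the same principle motivates sampling alignment scores from a distribution biased toward high scores.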
Affiliation(s)
- Lee A Newberg
- Center for Bioinformatics, Wadsworth Center, New York State Department of Health, Albany, New York 12201-0509, USA
|
27
|
Russell DJ, Otu HH, Sayood K. Grammar-based distance in progressive multiple sequence alignment. BMC Bioinformatics 2008; 9:306. [PMID: 18616828 PMCID: PMC2478692 DOI: 10.1186/1471-2105-9-306] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.1] [Received: 02/21/2008] [Accepted: 07/10/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND We propose a multiple sequence alignment (MSA) algorithm and compare the alignment quality and execution time of the proposed algorithm with those of existing algorithms. The proposed progressive alignment algorithm uses a grammar-based distance metric to determine the order in which biological sequences are to be pairwise aligned. The progressive alignment occurs via pairwise aligning new sequences with an ensemble of the sequences previously aligned. RESULTS The performance of the proposed algorithm is validated via comparison to popular progressive multiple alignment approaches, ClustalW and T-Coffee, and to the more recently developed algorithms MAFFT, MUSCLE, Kalign, and PSAlign using the BAliBASE 3.0 database of amino acid alignment files and a set of longer sequences generated by Rose software. The proposed algorithm has successfully built multiple alignments comparable to other programs with significant improvements in running time. The results are especially striking for large datasets. CONCLUSION We introduce a computationally efficient progressive alignment algorithm using a grammar-based sequence distance that is particularly useful in aligning large datasets.
Affiliation(s)
- David J Russell
- Department of Electrical Engineering, University of Nebraska-Lincoln, 209N WSEC, Lincoln, NE 68588-0511, USA.
|
28
|
Eddy SR. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol 2008; 4:e1000069. [PMID: 18516236 PMCID: PMC2396288 DOI: 10.1371/journal.pcbi.1000069] [Citation(s) in RCA: 221] [Impact Index Per Article: 13.8] [Received: 12/05/2007] [Accepted: 03/26/2008] [Indexed: 11/19/2022] Open
Abstract
Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.
Sequence database searches are a fundamental tool of molecular biology, enabling researchers to identify related sequences in other organisms, which often provides invaluable clues to the function and evolutionary history of genes. The power of database searches to detect more and more remote evolutionary relationships – essentially, to look back deeper in time – has improved steadily, with the adoption of more complex and realistic models. However, database searches require not just a realistic scoring model, but also the ability to distinguish good scores from bad ones – the ability to calculate the statistical significance of scores. For many models and scoring schemes, accurate statistical significance calculations have either involved expensive computational simulations, or not been feasible at all. Here, I introduce a probabilistic model of local sequence alignment that has readily predictable score statistics for position-specific profile scoring systems, and not just for traditional optimal alignment scores, but also for more powerful log-likelihood ratio scores derived in a full probabilistic inference framework. These results remove one of the main obstacles that have impeded the use of more powerful and biologically realistic statistical inference methods in sequence homology searches.
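The two conjectured score distributions can be turned into p-values and E-values with a few lines of arithmetic. The sketch below is a minimal illustration assuming the fixed slope λ = log 2 stated in the abstract; the location parameters `mu` and `tau`, the function names, and the simple "p-value times number of targets" E-value are assumptions introduced here, and the locations would have to be estimated separately for each model.

```python
import math

LAMBDA = math.log(2)  # constant slope conjectured in the abstract for bit scores


def viterbi_pvalue(score_bits, mu):
    """Gumbel survival function: P(S >= s) = 1 - exp(-exp(-LAMBDA * (s - mu))).

    mu is a hypothetical, separately estimated location parameter.
    """
    return -math.expm1(-math.exp(-LAMBDA * (score_bits - mu)))


def forward_tail_pvalue(score_bits, tau):
    """Exponential tail: P(S >= s) = exp(-LAMBDA * (s - tau)), capped at 1.

    tau is a hypothetical location of the tail; the formula is only
    meant for scores in the high-scoring tail.
    """
    return min(1.0, math.exp(-LAMBDA * (score_bits - tau)))


def evalue(pvalue, n_targets):
    """Expected number of chance hits at this p-value among n_targets sequences."""
    return pvalue * n_targets


if __name__ == "__main__":
    p = viterbi_pvalue(score_bits=25.0, mu=-5.0)
    print(p, evalue(p, n_targets=1_000_000))
```

With λ = log 2, each additional bit of score halves the tail probability, which is why a single location parameter per model is enough to convert bit scores into E-values under these conjectures.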
Affiliation(s)
- Sean R Eddy
- Howard Hughes Medical Institute, Janelia Farm Research Campus, Ashburn, Virginia, United States of America.
|
29
|
|