1
|
Karami A, Fayyaz Movaghar A, Mercier S, Ferre L. New Approximate Statistical Significance of Gapped Alignments Based on the Greedy Extension Model. J Comput Biol 2020; 27:1361-1372. [PMID: 31913652 DOI: 10.1089/cmb.2018.0203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Sequence alignment is a fundamental concept in bioinformatics to distinguish regions of similarity among various sequences. The degree of similarity has been considered as a score. There are a number of various methods to find the statistical significance of similarity in the gapped and ungapped cases. In this article, we improve the statistical significance accuracy of the local score by introducing a new approximate p-value. This is developed according to Poisson clumping and the exact distribution of a partial sum of random variables. The efficiency of the proposed method is compared with that of previous methods on real and simulated data. The results yield a remarkable improvement in accuracy of the p-value in the gapped case. This is an evidence for the method to be considered as a prospective candidate for sequences comparison.
Collapse
Affiliation(s)
- Amirhossein Karami
- Department of Statistics, Faculty of Mathematical Sciences, University of Mazandaran, Babolsar, Iran
| | - Afshin Fayyaz Movaghar
- Department of Statistics, Faculty of Mathematical Sciences, University of Mazandaran, Babolsar, Iran
| | - Sabine Mercier
- Institut de Mathematiques de Toulouse, Department of Mathematics and Computer Science, Universite Toulouse Jean Jaures, Toulouse, France
| | - Louis Ferre
- Institut de Mathematiques de Toulouse, Toulouse, France
| |
Collapse
|
2
|
Abstract
In bioinformatics, the notion of an ‘island’ enhances the efficient simulation of gapped local alignment statistics. This paper generalizes several results relevant to gapless local alignment statistics from one to higher dimensions, with a particular eye to applications in gapped alignment statistics. For example, reversal of paths (rather than of discrete time) generalizes a distributional equality, from queueing theory, between the Lindley (local sum) and maximum processes. Systematic investigation of an ‘ownership’ relationship among vertices in ℤ2 formalizes the notion of an island as a set of vertices having a common owner. Predictably, islands possess some stochastic ordering and spatial averaging properties. Moreover, however, the average number of vertices in a subcritical stationary island is 1, generalizing a theorem of Kac about stationary point processes. The generalization leads to alternative ways of simulating some island statistics.
Collapse
|
3
|
Abstract
In bioinformatics, the notion of an ‘island’ enhances the efficient simulation of gapped local alignment statistics. This paper generalizes several results relevant to gapless local alignment statistics from one to higher dimensions, with a particular eye to applications in gapped alignment statistics. For example, reversal of paths (rather than of discrete time) generalizes a distributional equality, from queueing theory, between the Lindley (local sum) and maximum processes. Systematic investigation of an ‘ownership’ relationship among vertices in ℤ2 formalizes the notion of an island as a set of vertices having a common owner. Predictably, islands possess some stochastic ordering and spatial averaging properties. Moreover, however, the average number of vertices in a subcritical stationary island is 1, generalizing a theorem of Kac about stationary point processes. The generalization leads to alternative ways of simulating some island statistics.
Collapse
|
4
|
Sheetlin SL, Park Y, Frith MC, Spouge JL. Frameshift alignment: statistics and post-genomic applications. ACTA ACUST UNITED AC 2014; 30:3575-82. [PMID: 25172925 DOI: 10.1093/bioinformatics/btu576] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023]
Abstract
MOTIVATION The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. RESULTS We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two 'post-genomic' applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results.
Collapse
Affiliation(s)
- Sergey L Sheetlin
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA and Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
| | - Yonil Park
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA and Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
| | - Martin C Frith
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA and Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
| | - John L Spouge
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA and Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
| |
Collapse
|
5
|
Abstract
The comparison of homologous proteins from different species is a first step toward a function assignment and a reconstruction of the species evolution. Though local alignment is mostly used for this purpose, global alignment is important for constructing multiple alignments or phylogenetic trees. However, statistical significance of global alignments is not completely clear, lacking a specific statistical model to describe alignments or depending on computationally expensive methods like Z-score. Recently we presented a normalized global alignment, defined as the best compromise between global alignment cost and length, and showed that this new technique led to better classification results than Z-score at a much lower computational cost. However, it is necessary to analyze the statistical significance of the normalized global alignment in order to be considered a completely functional algorithm for protein alignment. Experiments with unrelated proteins extracted from the SCOP ASTRAL database showed that normalized global alignment scores can be fitted to a log-normal distribution. This fact, obtained without any theoretical support, can be used to derive statistical significance of normalized global alignments. Results are summarized in a table with fitted parameters for different scoring schemes.
Collapse
Affiliation(s)
- Guillermo Peris
- Department de Llenguatges i Sistemes Informátics, Universitat Jaume I , Castelló, Spain
| | | |
Collapse
|
6
|
Hähnke V, Rupp M, Hartmann AK, Schneider G. Pharmacophore Alignment Search Tool (PhAST): Significance Assessment of Chemical Similarity. Mol Inform 2013; 32:625-46. [PMID: 27481770 DOI: 10.1002/minf.201300021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2013] [Accepted: 04/19/2013] [Indexed: 11/06/2022]
Abstract
Previously, we proposed a ligand-based virtual screening technique (PhAST) based on global alignment of linearized interaction patterns. Here, we applied techniques developed for similarity assessment in local sequence alignments to our method resulting in p-values for chemical similarity. We compared two sampling strategies, a simple sampling strategy and a Markov Chain Monte Carlo (MCMC) method, and investigated the similarity of sampled distributions to Gaussian, Gumbel, modified Gumbel, and Gamma distributions. The Gumbel distribution with a Gaussian correction term was identified as the most similar to the observed empirical distributions. These techniques were applied in retrospective screenings on a drug-like dataset. Obtained p-values were adjusted to the size of the screening library with four different methods. Evaluation of E-value thresholds corroborated the Bonferroni correction as a preferred means to identify significant chemical similarity with PhAST. An online version of PhAST with significance estimation is available at http://modlab-cadd.ethz.ch/.
Collapse
Affiliation(s)
- Volker Hähnke
- Eidgenössische Technische Hochschule (ETH), Department of Chemistry and Applied Biosciences, Institute of Pharmaceutical Sciences, Wolfgang-Pauli-Str. 10, 8093 Zürich, Switzerland phone: +1 (202)436-5989.
| | - Matthias Rupp
- Eidgenössische Technische Hochschule (ETH), Department of Chemistry and Applied Biosciences, Institute of Pharmaceutical Sciences, Wolfgang-Pauli-Str. 10, 8093 Zürich, Switzerland phone: +1 (202)436-5989
| | - Alexander K Hartmann
- Universität Oldenburg, Computational Theoretical Physics, Institut für Physik, Carl-von-Ossietzky Strasse 9-11, 26111 Oldenburg, Germany
| | - Gisbert Schneider
- Eidgenössische Technische Hochschule (ETH), Department of Chemistry and Applied Biosciences, Institute of Pharmaceutical Sciences, Wolfgang-Pauli-Str. 10, 8093 Zürich, Switzerland phone: +1 (202)436-5989
| |
Collapse
|
7
|
Park Y, Sheetlin S, Ma N, Madden TL, Spouge JL. New finite-size correction for local alignment score distributions. BMC Res Notes 2012; 5:286. [PMID: 22691307 PMCID: PMC3483159 DOI: 10.1186/1756-0500-5-286] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2012] [Accepted: 05/16/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Local alignment programs often calculate the probability that a match occurred by chance. The calculation of this probability may require a "finite-size" correction to the lengths of the sequences, as an alignment that starts near the end of either sequence may run out of sequence before achieving a significant score. FINDINGS We present an improved finite-size correction that considers the distribution of sequence lengths rather than simply the corresponding means. This approach improves sensitivity and avoids substituting an ad hoc length for short sequences that can underestimate the significance of a match. We use a test set derived from ASTRAL to show improved ROC scores, especially for shorter sequences. CONCLUSIONS The new finite-size correction improves the calculation of probabilities for a local alignment. It is now used in the BLAST+ package and at the NCBI BLAST web site ( http://blast.ncbi.nlm.nih.gov).
Collapse
Affiliation(s)
- Yonil Park
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA
| | | | | | | | | |
Collapse
|
8
|
Sun H, Buhler JD. PhyLAT: a phylogenetic local alignment tool. Bioinformatics 2012; 28:1336-44. [PMID: 22492645 PMCID: PMC3465089 DOI: 10.1093/bioinformatics/bts158] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2011] [Revised: 03/29/2012] [Accepted: 03/30/2012] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The expansion of DNA sequencing capacity has enabled the sequencing of whole genomes from a number of related species. These genomes can be combined in a multiple alignment that provides useful information about the evolutionary history at each genomic locus. One area in which evolutionary information can productively be exploited is in aligning a new sequence to a database of existing, aligned genomes. However, existing high-throughput alignment tools are not designed to work effectively with multiple genome alignments. RESULTS We introduce PhyLAT, the phylogenetic local alignment tool, to compute local alignments of a query sequence against a fixed multiple-genome alignment of closely related species. PhyLAT uses a known phylogenetic tree on the species in the multiple alignment to improve the quality of its computed alignments while also estimating the placement of the query on this tree. It combines a probabilistic approach to alignment with seeding and expansion heuristics to accelerate discovery of significant alignments. We provide evidence, using alignments of human chromosome 22 against a five-species alignment from the UCSC Genome Browser database, that PhyLAT's alignments are more accurate than those of other commonly used programs, including BLAST, POY, MAFFT, MUSCLE and CLUSTAL. PhyLAT also identifies more alignments in coding DNA than does pairwise alignment alone. Finally, our tool determines the evolutionary relationship of query sequences to the database more accurately than do POY, RAxML, EPA or pplacer.
Collapse
Affiliation(s)
- Hongtao Sun
- Department of Computer Science and Engineering, Washington University, Saint Louis, MO 63130, USA.
| | | |
Collapse
|
9
|
WONG WINGCHEONG, MAURER-STROH SEBASTIAN, EISENHABER FRANK. THE JANUS-FACED E-VALUES OF HMMER2: EXTREME VALUE DISTRIBUTION OR LOGISTIC FUNCTION? J Bioinform Comput Biol 2011. [DOI: 10.1142/s0219720011005264] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
E-value guided extrapolation of protein domain annotation from libraries such as Pfam with the HMMER suite is indispensable for hypothesizing about the function of experimentally uncharacterized protein sequences. Since the recent release of HMMER3 does not supersede all functions of HMMER2, the latter will remain relevant for ongoing research as well as for the evaluation of annotations that reside in databases and in the literature. In HMMER2, the E-value is computed from the score via a logistic function or via a domain model-specific extreme value distribution (EVD); the lower of the two is returned as E-value for the domain hit in the query sequence. We find that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow (for Pfam release 23, 99% in the global mode and 75% in the fragment mode). If the score corresponding to the breakpoint results in an E-value above a user-defined threshold (e.g. 0.1), a critical score region with conflicting E-values from the logistic function (below the threshold) and from EVD (above the threshold) does exist. Thus, this switch will affect E-value guided annotation decisions in an automated mode. To emphasize, switching in the fragment mode is of no practical relevance since it occurs only at E-values far below 0.1. Unfortunately, a critical score region does exist for 185 domain models in the hmmpfam and 1,748 domain models in the hmmsearch global-search mode. For 145 out the respective 185 models, the critical score region is indeed populated by actual sequences. In total, 24.4% of their hits have a logistic function-derived E-value < 0.1 when the EVD provides an E-value > 0.1. We provide examples of false annotations and critically discuss the appropriateness of a logistic function as alternative to the EVD.
Collapse
Affiliation(s)
- WING-CHEONG WONG
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A *STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
| | - SEBASTIAN MAURER-STROH
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A *STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore 63755, Singapore
| | - FRANK EISENHABER
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A *STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore 117597, Singapore
- School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore 637553, Singapore
| |
Collapse
|
10
|
Sheetlin S, Park Y, Spouge JL. Objective method for estimating asymptotic parameters, with an application to sequence alignment. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2011; 84:031914. [PMID: 22060410 PMCID: PMC3233989 DOI: 10.1103/physreve.84.031914] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/02/2011] [Revised: 06/14/2011] [Indexed: 05/31/2023]
Abstract
Sequence alignment is an indispensable computational tool in modern molecular biology. The model underlying biological sequence alignment is of interest to physicists because it approximates the statistical mechanics of DNA and protein annealing, while bearing an intimate relationship to models of directed polymers in random media. Recent methods for determining the statistics of random sequence alignments have reduced the computation time to less than 1 s, opening up some interesting possibilities for online computation with biological search engines. Before implementation, however, the methods required an objective technique for computing regression coefficients pertinent to an asymptotic regime. Typically, physicists estimate parameters pertinent to an asymptotic regime subjectively: They eyeball their data; estimate the asymptotic regime where the regression model holds with reasonable accuracy; and then regress data only within the estimated asymptotic regime. Our publicly available computer program ARRP replaces the subjective assessment of the asymptotic regime with an objective change-point detection method, increasing confidence in the scientific objectivity of the parameter estimates. Asymptotic regression has potential applications across most of physics.
Collapse
Affiliation(s)
- Sergey Sheetlin
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA
| | - Yonil Park
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA
| | - John L. Spouge
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA
| |
Collapse
|
11
|
Abstract
Alignment algorithms are powerful tools for searching for homologous proteins in databases, providing a score for each sequence present in the database. It has been well known for 20 years that the shape of the score distribution looks like an extreme value distribution. The extremely large number of times biologists face this class of distributions raises the question of the evolutionary origin of this probability law. We investigated the possibility of deriving the main properties of sequence alignment score distributions from a basic evolutionary process: a duplication-divergence protein evolution process in a sequence space. Firstly, the distribution of sequences in this space was defined with respect to the genetic distance between sequences. Secondly, we derived a basic relation between the genetic distance and the alignment score. We obtained a novel score probability distribution which is qualitatively very similar to that of Karlin-Altschul but performing better than all other previous model.
Collapse
Affiliation(s)
- Philippe Ortet
- CNRS (UMR 6191)-CEA Cadarache-Aix-Marseille Université, Laboratoire d'Ecologie Microbienne de la Rhizosphere, Institut de Biologie Environementale et Biotechnologie, CEA Cadarache, F-13108 Saint Paul-lez-Durance, France
| | | |
Collapse
|
12
|
Park Y, Sheetlin S, Spouge JL. ESTIMATING THE GUMBEL SCALE PARAMETER FOR LOCAL ALIGNMENT OF RANDOM SEQUENCES BY IMPORTANCE SAMPLING WITH STOPPING TIMES. Ann Stat 2009; 37:3697. [PMID: 20148197 DOI: 10.1214/08-aos663] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Abstract
The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.
Collapse
Affiliation(s)
- Yonil Park
- National Center for Biotechnology Information National Library of Medicine National Institutes of Health 8600 Rockville Pike Bethesda, Maryland 20894 USA
| | | | | |
Collapse
|
13
|
Mooney C, Pollastri G. Beyond the Twilight Zone: Automated prediction of structural properties of proteins by recursive neural networks and remote homology information. Proteins 2009; 77:181-90. [DOI: 10.1002/prot.22429] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
|
14
|
Frith MC, Park Y, Sheetlin SL, Spouge JL. The whole alignment and nothing but the alignment: the problem of spurious alignment flanks. Nucleic Acids Res 2008; 36:5863-71. [PMID: 18796526 PMCID: PMC2566872 DOI: 10.1093/nar/gkn579] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Pairwise sequence alignment is a ubiquitous tool for inferring the evolution and function of DNA, RNA and protein sequences. It is therefore essential to identify alignments arising by chance alone, i.e. spurious alignments. On one hand, if an entire alignment is spurious, statistical techniques for identifying and eliminating it are well known. On the other hand, if only a part of the alignment is spurious, elimination is much more problematic. In practice, even the sizes and frequencies of spurious subalignments remain unknown. This article shows that some common scoring schemes tend to overextend alignments and generate spurious alignment flanks up to hundreds of base pairs/amino acids in length. In the UCSC genome database, e.g. spurious flanks probably comprise >18% of the human-fugu genome alignment. To evaluate the possibility that chance alone generated a particular flank on a particular pairwise alignment, we provide a simple 'overalignment' P-value. The overalignment P-value can identify spurious alignment flanks, thereby eliminating potentially misleading inferences about evolution and function. Moreover, by explicitly demonstrating the tradeoff between over- and under-alignment, our methods guide the rational choice of scoring schemes for various alignment tasks.
Collapse
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
| | | | | | | |
Collapse
|
15
|
Eddy SR. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol 2008; 4:e1000069. [PMID: 18516236 PMCID: PMC2396288 DOI: 10.1371/journal.pcbi.1000069] [Citation(s) in RCA: 221] [Impact Index Per Article: 13.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2007] [Accepted: 03/26/2008] [Indexed: 11/19/2022] Open
Abstract
Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments. Sequence database searches are a fundamental tool of molecular biology, enabling researchers to identify related sequences in other organisms, which often provides invaluable clues to the function and evolutionary history of genes. The power of database searches to detect more and more remote evolutionary relationships – essentially, to look back deeper in time – has improved steadily, with the adoption of more complex and realistic models. However, database searches require not just a realistic scoring model, but also the ability to distinguish good scores from bad ones – the ability to calculate the statistical significance of scores. For many models and scoring schemes, accurate statistical significance calculations have either involved expensive computational simulations, or not been feasible at all. Here, I introduce a probabilistic model of local sequence alignment that has readily predictable score statistics for position-specific profile scoring systems, and not just for traditional optimal alignment scores, but also for more powerful log-likelihood ratio scores derived in a full probabilistic inference framework. These results remove one of the main obstacles that have impeded the use of more powerful and biologically realistic statistical inference methods in sequence homology searches.
Collapse
Affiliation(s)
- Sean R Eddy
- Howard Hughes Medical Institute, Janelia Farm Research Campus, Ashburn, Virginia, United States of America.
| |
Collapse
|
16
|
Mitrophanov AY, Borodovsky M. Statistical significance in biological sequence analysis. Brief Bioinform 2008; 7:2-24. [PMID: 16761361 DOI: 10.1093/bib/bbk001] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
One of the major goals of computational sequence analysis is to find sequence similarities, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Since the degree of similarity is usually assessed by the sequence alignment score, it is necessary to know if a score is high enough to indicate a biologically interesting alignment. A powerful approach to defining score cutoffs is based on the evaluation of the statistical significance of alignments. The statistical significance of an alignment score is frequently assessed by its P-value, which is the probability that this score or a higher one can occur simply by chance, given the probabilistic models for the sequences. In this review we discuss the general role of P-value estimation in sequence analysis, and give a description of theoretical methods and computational approaches to the estimation of statistical signifiance for important classes of sequence analysis problems. In particular, we concentrate on the P-value estimation techniques for single sequence studies (both score-based and score-free), global and local pairwise sequence alignments, multiple alignments, sequence-to-profile alignments and alignments built with hidden Markov models. We anticipate that the review will be useful both to researchers professionally working in bioinformatics as well as to biomedical scientists interested in using contemporary methods of DNA and protein sequence analysis.
Collapse
|
17
|
Abstract
We examine a Poisson heuristic for judging the significance of local sequence alignments with gaps. Model parameters are estimated directly from the sequences to be aligned, so that laborious prior simulation studies or database comparisons for the estimation of parameters describing the connection between score and E-value are unnecessary. Simulation studies give evidence that this method gives reasonable results even when the usual assumptions like the independence of sequence positions are violated.
Collapse
Affiliation(s)
- Dirk Metzler
- Institut für Informatik, Johann Wolfgang Goethe-Universität, Frankfurt am Main, Germany.
| |
Collapse
|
18
|
Sheetlin S, Park Y, Spouge JL. The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment. Nucleic Acids Res 2005; 33:4987-94. [PMID: 16147981 PMCID: PMC1199557 DOI: 10.1093/nar/gki800] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open
Abstract
The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter λ and the pre-factor k. Presently, the basic local alignment search tool (BLAST) programs (BLASTP (BLAST for proteins), PSI-BLAST, etc.) use all time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter λ can be as much as five times faster, if they use global instead of local alignment [R. Bundschuh (2002) J. Comput. Biol., 9, 243–260]. Unfortunately, the acceleration does not extend in determining the Gumbel pre-factor k, because k has no known mathematical relationship to global alignment. This paper relates k to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters λ and k within the errors required (λ, 0.8%; k, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor.
Collapse
Affiliation(s)
| | | | - John L. Spouge
- To whom correspondence should be addressed. Tel: +301 402 9310; Fax: +301 480 2288;
| |
Collapse
|
19
|
Park Y, Sheetlin S, Spouge JL. Accelerated convergence and robust asymptotic regression of the Gumbel scale parameter for gapped sequence alignment. ACTA ACUST UNITED AC 2004. [DOI: 10.1088/0305-4470/38/1/006] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
|
20
|
Grossmann S, Yakir B. Large deviations for global maxima of independent superadditive processes with negative drift and an application to optimal sequence alignments. BERNOULLI 2004. [DOI: 10.3150/bj/1099579157] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Affiliation(s)
- Steffen Grossmann
- Department of Computational Molecular Biology, Max-Planck-Institute for MolecularGenetics
| | | |
Collapse
|