1
|
Rios S, Fernandez MF, Caltabiano G, Campillo M, Pardo L, Gonzalez A. GPCRtm: An amino acid substitution matrix for the transmembrane region of class A G Protein-Coupled Receptors. BMC Bioinformatics 2015; 16:206. [PMID: 26134144 PMCID: PMC4489126 DOI: 10.1186/s12859-015-0639-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2015] [Accepted: 06/06/2015] [Indexed: 01/08/2023] Open
Abstract
Background Protein sequence alignments and database search methods use standard scoring matrices calculated from amino acid substitution frequencies in general sets of proteins. These general-purpose matrices are not optimal to align accurately sequences with marked compositional biases, such as hydrophobic transmembrane regions found in membrane proteins. In this work, an amino acid substitution matrix (GPCRtm) is calculated for the membrane spanning segments of the G protein-coupled receptor (GPCR) rhodopsin family; one of the largest transmembrane protein family in humans with great importance in health and disease. Results The GPCRtm matrix reveals the amino acid compositional bias distinctive of the GPCR rhodopsin family and differs from other standard substitution matrices. These membrane receptors, as expected, are characterized by a high content of hydrophobic residues with regard to globular proteins. On the other hand, the presence of polar and charged residues is higher than in average membrane proteins, displaying high frequencies of replacement within themselves. Conclusions Analysis of amino acid frequencies and values obtained from the GPCRtm matrix reveals patterns of residue replacements different from other standard substitution matrices. GPCRs prioritize the reactivity properties of the amino acids over their bulkiness in the transmembrane regions. A distinctive role is that charged and polar residues seem to evolve at different rates than other amino acids. This observation is related to the role of the transmembrane bundle in the binding of ligands, that in many cases involve electrostatic and hydrogen bond interactions. This new matrix can be useful in database search and for the construction of more accurate sequence alignments of GPCRs. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0639-4) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Santiago Rios
- Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
| | - Marta F Fernandez
- Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
| | - Gianluigi Caltabiano
- Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
| | - Mercedes Campillo
- Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
| | - Leonardo Pardo
- Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
| | - Angel Gonzalez
- Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain.
| |
Collapse
|
2
|
Frith MC, Kawaguchi R. Split-alignment of genomes finds orthologies more accurately. Genome Biol 2015; 16:106. [PMID: 25994148 PMCID: PMC4464727 DOI: 10.1186/s13059-015-0670-9] [Citation(s) in RCA: 65] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/06/2015] [Accepted: 05/08/2015] [Indexed: 04/29/2023] Open
Abstract
We present a new pair-wise genome alignment method, based on a simple concept of finding an optimal set of local alignments. It gains accuracy by not masking repeats, and by using a statistical model to quantify the (un)ambiguity of each alignment part. Compared to previous animal genome alignments, it aligns thousands of locations differently and with much higher similarity, strongly suggesting that the previous alignments are non-orthologous. The previous methods suffer from an overly-strong assumption of long un-rearranged blocks. The new alignments should help find interesting and unusual features, such as fast-evolving elements and micro-rearrangements, which are confounded by alignment errors.
Collapse
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan.
| | - Risa Kawaguchi
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan. .,Department of Computational Biology, Faculty of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, 277-8561, Japan.
| |
Collapse
|
3
|
Sheetlin SL, Park Y, Frith MC, Spouge JL. Frameshift alignment: statistics and post-genomic applications. ACTA ACUST UNITED AC 2014; 30:3575-82. [PMID: 25172925 DOI: 10.1093/bioinformatics/btu576] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023]
Abstract
MOTIVATION The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. RESULTS We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two 'post-genomic' applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results.
Collapse
Affiliation(s)
- Sergey L Sheetlin
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA and Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
| | - Yonil Park
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA and Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
| | - Martin C Frith
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA and Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
| | - John L Spouge
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA and Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
| |
Collapse
|
4
|
Frith MC, Noé L. Improved search heuristics find 20,000 new alignments between human and mouse genomes. Nucleic Acids Res 2014; 42:e59. [PMID: 24493737 PMCID: PMC3985675 DOI: 10.1093/nar/gku104] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Sequence similarity search is a fundamental way of analyzing nucleotide sequences. Despite decades of research, this is not a solved problem because there exist many similarities that are not found by current methods. Search methods are typically based on a seed-and-extend approach, which has many variants (e.g. spaced seeds, transition seeds), and it remains unclear how to optimize this approach. This study designs and tests seeding methods for inter-mammal and inter-insect genome comparison. By considering substitution patterns of real genomes, we design sets of multiple complementary transition seeds, which have better performance (sensitivity per run time) than previous seeding strategies. Often the best seed patterns have more transition positions than those used previously. We also point out that recent computer memory sizes (e.g. 60 GB) make it feasible to use multiple (e.g. eight) seeds for whole mammal genomes. Interestingly, the most sensitive settings achieve diminishing returns for human–dog and melanogaster–pseudoobscura comparisons, but not for human–mouse, which suggests that we still miss many human–mouse alignments. Our optimized heuristics find ∼20 000 new human–mouse alignments that are missing from the standard UCSC alignments. We tabulate seed patterns and parameters that work well so they can be used in future research.
Collapse
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center, AIST, 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan and LIFL (UMR 8022)/INRIA Lille Nord-Europe, bat M3 Ext., Université Lille1 59655 Villeneuve d'Ascq Cedex, France
| | | |
Collapse
|
5
|
Abstract
The comparison of homologous proteins from different species is a first step toward a function assignment and a reconstruction of the species evolution. Though local alignment is mostly used for this purpose, global alignment is important for constructing multiple alignments or phylogenetic trees. However, statistical significance of global alignments is not completely clear, lacking a specific statistical model to describe alignments or depending on computationally expensive methods like Z-score. Recently we presented a normalized global alignment, defined as the best compromise between global alignment cost and length, and showed that this new technique led to better classification results than Z-score at a much lower computational cost. However, it is necessary to analyze the statistical significance of the normalized global alignment in order to be considered a completely functional algorithm for protein alignment. Experiments with unrelated proteins extracted from the SCOP ASTRAL database showed that normalized global alignment scores can be fitted to a log-normal distribution. This fact, obtained without any theoretical support, can be used to derive statistical significance of normalized global alignments. Results are summarized in a table with fitted parameters for different scoring schemes.
Collapse
Affiliation(s)
- Guillermo Peris
- Department de Llenguatges i Sistemes Informátics, Universitat Jaume I , Castelló, Spain
| | | |
Collapse
|
6
|
Chen TW, Gan RCR, Wu TH, Huang PJ, Lee CY, Chen YYM, Chen CC, Tang P. FastAnnotator--an efficient transcript annotation web tool. BMC Genomics 2012; 13 Suppl 7:S9. [PMID: 23281853 PMCID: PMC3521244 DOI: 10.1186/1471-2164-13-s7-s9] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Recent developments in high-throughput sequencing (HTS) technologies have made it feasible to sequence the complete transcriptomes of non-model organisms or metatranscriptomes from environmental samples. The challenge after generating hundreds of millions of sequences is to annotate these transcripts and classify the transcripts based on their putative functions. Because many biological scientists lack the knowledge to install Linux-based software packages or maintain databases used for transcript annotation, we developed an automatic annotation tool with an easy-to-use interface. METHODS To elucidate the potential functions of gene transcripts, we integrated well-established annotation tools: Blast2GO, PRIAM and RPS BLAST in a web-based service, FastAnnotator, which can assign Gene Ontology (GO) terms, Enzyme Commission numbers (EC numbers) and functional domains to query sequences. RESULTS Using six transcriptome sequence datasets as examples, we demonstrated the ability of FastAnnotator to assign functional annotations. FastAnnotator annotated 88.1% and 81.3% of the transcripts from the well-studied organisms Caenorhabditis elegans and Streptococcus parasanguinis, respectively. Furthermore, FastAnnotator annotated 62.9%, 20.4%, 53.1% and 42.0% of the sequences from the transcriptomes of sweet potato, clam, amoeba, and Trichomonas vaginalis, respectively, which lack reference genomes. We demonstrated that FastAnnotator can complete the annotation process in a reasonable amount of time and is suitable for the annotation of transcriptomes from model organisms or organisms for which annotated reference genomes are not avaiable. CONCLUSIONS The sequencing process no longer represents the bottleneck in the study of genomics, and automatic annotation tools have become invaluable as the annotation procedure has become the limiting step. We present FastAnnotator, which was an automated annotation web tool designed to efficiently annotate sequences with their gene functions, enzyme functions or domains. FastAnnotator is useful in transcriptome studies and especially for those focusing on non-model organisms or metatranscriptomes. FastAnnotator does not require local installation and is freely available at http://fastannotator.cgu.edu.tw.
Collapse
Affiliation(s)
- Ting-Wen Chen
- Molecular Medicine Research Center, Chang Gung University, Taoyuan, Taiwan
| | | | | | | | | | | | | | | |
Collapse
|
7
|
Ashida H, Asai K, Hamada M. Shape-based alignment of genomic landscapes in multi-scale resolution. Nucleic Acids Res 2012; 40:6435-48. [PMID: 22561376 PMCID: PMC3413149 DOI: 10.1093/nar/gks354] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023] Open
Abstract
Due to dramatic advances in DNA technology, quantitative measures of annotation data can now be obtained in continuous coordinates across the entire genome, allowing various heterogeneous ‘genomic landscapes’ to emerge. Although much effort has been devoted to comparing DNA sequences, not much attention has been given to comparing these large quantities of data comprehensively. In this article, we introduce a method for rapidly detecting local regions that show high correlations between genomic landscapes. We overcame the size problem for genome-wide data by converting the data into series of symbols and then carrying out sequence alignment. We also decomposed the oscillation of the landscape data into different frequency bands before analysis, since the real genomic landscape is a mixture of embedded and confounded biological processes working at different scales in the cell nucleus. To verify the usefulness and generality of our method, we applied our approach to well investigated landscapes from the human genome, including several histone modifications. Furthermore, by applying our method to over 20 genomic landscapes in human and 12 in mouse, we found that DNA replication timing and the density of Alu insertions are highly correlated genome-wide in both species, even though the Alu elements have amplified independently in the two genomes. To our knowledge, this is the first method to align genomic landscapes at multiple scales according to their shape.
Collapse
Affiliation(s)
- Hiroki Ashida
- Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba 277-8561, Japan.
| | | | | |
Collapse
|
8
|
Frith MC. Gentle masking of low-complexity sequences improves homology search. PLoS One 2011; 6:e28819. [PMID: 22205972 PMCID: PMC3242753 DOI: 10.1371/journal.pone.0028819] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2011] [Accepted: 11/15/2011] [Indexed: 11/19/2022] Open
Abstract
Detection of sequences that are homologous, i.e. descended from a common ancestor, is a fundamental task in computational biology. This task is confounded by low-complexity tracts (such as atatatatatat), which arise frequently and independently, causing strong similarities that are not homologies. There has been much research on identifying low-complexity tracts, but little research on how to treat them during homology search. We propose to find homologies by aligning sequences with “gentle” masking of low-complexity tracts. Gentle masking means that the match score involving a masked letter is , where is the unmasked score. Gentle masking slightly but noticeably improves the sensitivity of homology search (compared to “harsh” masking), without harming specificity. We show examples in three useful homology search problems: detection of NUMTs (nuclear copies of mitochondrial DNA), recruitment of metagenomic DNA reads to reference genomes, and pseudogene detection. Gentle masking is currently the best way to treat low-complexity tracts during homology search.
Collapse
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Koto-ku, Tokyo, Japan.
| |
Collapse
|
9
|
Sheetlin S, Park Y, Spouge JL. Objective method for estimating asymptotic parameters, with an application to sequence alignment. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2011; 84:031914. [PMID: 22060410 PMCID: PMC3233989 DOI: 10.1103/physreve.84.031914] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/02/2011] [Revised: 06/14/2011] [Indexed: 05/31/2023]
Abstract
Sequence alignment is an indispensable computational tool in modern molecular biology. The model underlying biological sequence alignment is of interest to physicists because it approximates the statistical mechanics of DNA and protein annealing, while bearing an intimate relationship to models of directed polymers in random media. Recent methods for determining the statistics of random sequence alignments have reduced the computation time to less than 1 s, opening up some interesting possibilities for online computation with biological search engines. Before implementation, however, the methods required an objective technique for computing regression coefficients pertinent to an asymptotic regime. Typically, physicists estimate parameters pertinent to an asymptotic regime subjectively: They eyeball their data; estimate the asymptotic regime where the regression model holds with reasonable accuracy; and then regress data only within the estimated asymptotic regime. Our publicly available computer program ARRP replaces the subjective assessment of the asymptotic regime with an objective change-point detection method, increasing confidence in the scientific objectivity of the parameter estimates. Asymptotic regression has potential applications across most of physics.
Collapse
Affiliation(s)
- Sergey Sheetlin
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA
| | - Yonil Park
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA
| | - John L. Spouge
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA
| |
Collapse
|
10
|
Hamada M, Yamada K, Sato K, Frith MC, Asai K. CentroidHomfold-LAST: accurate prediction of RNA secondary structure using automatically collected homologous sequences. Nucleic Acids Res 2011; 39:W100-6. [PMID: 21565800 PMCID: PMC3125741 DOI: 10.1093/nar/gkr290] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Although secondary structure predictions of an individual RNA sequence have been widely used in a number of sequence analyses of RNAs, accuracy is still limited. Recently, we proposed a method (called 'CentroidHomfold'), which includes information about homologous sequences into the prediction of the secondary structure of the target sequence, and showed that it substantially improved the performance of secondary structure predictions. CentroidHomfold, however, forces users to prepare homologous sequences of the target sequence. We have developed a Web application (CentroidHomfold-LAST) that predicts the secondary structure of the target sequence using automatically collected homologous sequences. LAST, which is a fast and sensitive local aligner, and CentroidHomfold are employed in the Web application. Computational experiments with a commonly-used data set indicated that CentroidHomfold-LAST substantially outperformed conventional secondary structure predictions including CentroidFold and RNAfold.
Collapse
Affiliation(s)
- Michiaki Hamada
- Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8562, Japan.
| | | | | | | | | |
Collapse
|
11
|
Agrawal A, Huang X. Pairwise statistical significance of local sequence alignment using sequence-specific and position-specific substitution matrices. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:194-205. [PMID: 21071807 DOI: 10.1109/tcbb.2009.69] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
Pairwise sequence alignment is a central problem in bioinformatics, which forms the basis of various other applications. Two related sequences are expected to have a high alignment score, but relatedness is usually judged by statistical significance rather than by alignment score. Recently, it was shown that pairwise statistical significance gives promising results as an alternative to database statistical significance for getting individual significance estimates of pairwise alignment scores. The improvement was mainly attributed to making the statistical significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive the estimates of pairwise statistical significance, which is expected to use more sequence-specific information in estimating pairwise statistical significance. Experiments on a benchmark database with sequence-specific substitution matrices at different levels of sequence-specific contribution were conducted, and results confirm that using sequence-specific substitution matrices for estimating pairwise statistical significance is significantly better than using a standard matrix like BLOSUM62, and than database statistical significance estimates reported by popular database search programs like BLAST, PSI-BLAST (without pretrained PSSMs), and SSEARCH on a benchmark database, but with pretrained PSSMs, PSI-BLAST results are significantly better. Further, using position-specific substitution matrices for estimating pairwise statistical significance gives significantly better results even than PSI-BLAST using pretrained PSSMs.
Collapse
Affiliation(s)
- Ankit Agrawal
- Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA.
| | | |
Collapse
|
12
|
Frith MC, Hamada M, Horton P. Parameters for accurate genome alignment. BMC Bioinformatics 2010; 11:80. [PMID: 20144198 PMCID: PMC2829014 DOI: 10.1186/1471-2105-11-80] [Citation(s) in RCA: 145] [Impact Index Per Article: 10.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2009] [Accepted: 02/09/2010] [Indexed: 11/25/2022] Open
Abstract
Background Genome sequence alignments form the basis of much research. Genome alignment depends on various mundane but critical choices, such as how to mask repeats and which score parameters to use. Surprisingly, there has been no large-scale assessment of these choices using real genomic data. Moreover, rigorous procedures to control the rate of spurious alignment have not been employed. Results We have assessed 495 combinations of score parameters for alignment of animal, plant, and fungal genomes. As our gold-standard of accuracy, we used genome alignments implied by multiple alignments of proteins and of structural RNAs. We found the HOXD scoring schemes underlying alignments in the UCSC genome database to be far from optimal, and suggest better parameters. Higher values of the X-drop parameter are not always better. E-values accurately indicate the rate of spurious alignment, but only if tandem repeats are masked in a non-standard way. Finally, we show that γ-centroid (probabilistic) alignment can find highly reliable subsets of aligned bases. Conclusions These results enable more accurate genome alignment, with reliability measures for local alignments and for individual aligned bases. This study was made possible by our new software, LAST, which can align vertebrate genomes in a few hours http://last.cbrc.jp/.
Collapse
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Tokyo 135-0064, Japan.
| | | | | |
Collapse
|
13
|
Park Y, Sheetlin S, Spouge JL. ESTIMATING THE GUMBEL SCALE PARAMETER FOR LOCAL ALIGNMENT OF RANDOM SEQUENCES BY IMPORTANCE SAMPLING WITH STOPPING TIMES. Ann Stat 2009; 37:3697. [PMID: 20148197 DOI: 10.1214/08-aos663] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Abstract
The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.
Collapse
Affiliation(s)
- Yonil Park
- National Center for Biotechnology Information National Library of Medicine National Institutes of Health 8600 Rockville Pike Bethesda, Maryland 20894 USA
| | | | | |
Collapse
|
14
|
Agrawal A, Huang X. Pairwise statistical significance of local sequence alignment using multiple parameter sets and empirical justification of parameter set change penalty. BMC Bioinformatics 2009; 10 Suppl 3:S1. [PMID: 19344477 PMCID: PMC2665049 DOI: 10.1186/1471-2105-10-s3-s1] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Accurate estimation of statistical significance of a pairwise alignment is an important problem in sequence comparison. Recently, a comparative study of pairwise statistical significance with database statistical significance was conducted. In this paper, we extend the earlier work on pairwise statistical significance by incorporating with it the use of multiple parameter sets. RESULTS Results for a knowledge discovery application of homology detection reveal that using multiple parameter sets for pairwise statistical significance estimates gives better coverage than using a single parameter set, at least at some error levels. Further, the results of pairwise statistical significance using multiple parameter sets are shown to be significantly better than database statistical significance estimates reported by BLAST and PSI-BLAST, and comparable and at times significantly better than SSEARCH. Using non-zero parameter set change penalty values give better performance than zero penalty. CONCLUSION The fact that the homology detection performance does not degrade when using multiple parameter sets is a strong evidence for the validity of the assumption that the alignment score distribution follows an extreme value distribution even when using multiple parameter sets. Parameter set change penalty is a useful parameter for alignment using multiple parameter sets. Pairwise statistical significance using multiple parameter sets can be effectively used to determine the relatedness of a (or a few) pair(s) of sequences without performing a time-consuming database search.
Collapse
Affiliation(s)
- Ankit Agrawal
- Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA.
| | | |
Collapse
|