1
|
Prillo S, Deng Y, Boyeau P, Li X, Chen PY, Song YS. CherryML: scalable maximum likelihood estimation of phylogenetic models. Nat Methods 2023; 20:1232-1236. [PMID: 37386188 PMCID: PMC10644697 DOI: 10.1038/s41592-023-01917-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2022] [Accepted: 05/18/2023] [Indexed: 07/01/2023]
Abstract
Phylogenetic models of molecular evolution are central to numerous biological applications spanning diverse timescales, from hundreds of millions of years involving orthologous proteins to just tens of days relating to single cells within an organism. A fundamental problem in these applications is estimating model parameters, for which maximum likelihood estimation is typically employed. Unfortunately, maximum likelihood estimation is a computationally expensive task, in some cases prohibitively so. To address this challenge, we here introduce CherryML, a broadly applicable method that achieves several orders of magnitude speedup by using a quantized composite likelihood over cherries in the trees. The massive speedup offered by our method should enable researchers to consider more complex and biologically realistic models than previously possible. Here we demonstrate CherryML's utility by applying it to estimate a general 400 × 400 rate matrix for residue-residue coevolution at contact sites in three-dimensional protein structures; we estimate that using current state-of-the-art methods such as the expectation-maximization algorithm for the same task would take >100,000 times longer.
Collapse
Affiliation(s)
- Sebastian Prillo
- Computer Science Division, University of California, Berkeley, CA, USA
| | - Yun Deng
- Graduate Group in Computational Biology, University of California, Berkeley, CA, USA
| | - Pierre Boyeau
- Computer Science Division, University of California, Berkeley, CA, USA
| | - Xingyu Li
- Computer Science Division, University of California, Berkeley, CA, USA
| | - Po-Yen Chen
- Computer Science Division, University of California, Berkeley, CA, USA
| | - Yun S Song
- Computer Science Division, University of California, Berkeley, CA, USA.
- Department of Statistics, University of California, Berkeley, CA, USA.
| |
Collapse
|
2
|
Andrikos C, Makris E, Kolaitis A, Rassias G, Pavlatos C, Tsanakas P. Knotify: An Efficient Parallel Platform for RNA Pseudoknot Prediction Using Syntactic Pattern Recognition. Methods Protoc 2022; 5:mps5010014. [PMID: 35200530 PMCID: PMC8876629 DOI: 10.3390/mps5010014] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2022] [Revised: 01/27/2022] [Accepted: 01/30/2022] [Indexed: 11/16/2022] Open
Abstract
Obtaining valuable clues for noncoding RNA (ribonucleic acid) subsequences remains a significant challenge, acknowledging that most of the human genome transcribes into noncoding RNA parts related to unknown biological operations. Capturing these clues relies on accurate “base pairing” prediction, also known as “RNA secondary structure prediction”. As COVID-19 is considered a severe global threat, the single-stranded SARS-CoV-2 virus reveals the importance of establishing an efficient RNA analysis toolkit. This work aimed to contribute to that by introducing a novel system committed to predicting RNA secondary structure patterns (i.e., RNA’s pseudoknots) that leverage syntactic pattern-recognition strategies. Having focused on the pseudoknot predictions, we formalized the secondary structure prediction of the RNA to be primarily a parsing and, secondly, an optimization problem. The proposed methodology addresses the problem of predicting pseudoknots of the first order (H-type). We introduce a context-free grammar (CFG) that affords enough expression power to recognize potential pseudoknot pattern. In addition, an alternative methodology of detecting possible pseudoknots is also implemented as well, using a brute-force algorithm. Any input sequence may highlight multiple potential folding patterns requiring a strict methodology to determine the single biologically realistic one. We conscripted a novel heuristic over the widely accepted notion of free-energy minimization to tackle such ambiguity in a performant way by utilizing each pattern’s context to unveil the most prominent pseudoknot pattern. The overall process features polynomial-time complexity, while its parallel implementation enhances the end performance, as proportional to the deployed hardware. The proposed methodology does succeed in predicting the core stems of any RNA pseudoknot of the test dataset by performing a 76.4% recall ratio. The methodology achieved a F1-score equal to 0.774 and MCC equal 0.543 in discovering all the stems of an RNA sequence, outperforming the particular task. Measurements were taken using a dataset of 262 RNA sequences establishing a performance speed of 1.31, 3.45, and 7.76 compared to three well-known platforms. The implementation source code is publicly available under knotify github repo.
Collapse
Affiliation(s)
- Christos Andrikos
- School of Electrical and Computer Engineering, National Technical University of Athens, 9 Iroon Polytechniou St., 15780 Athens, Greece; (C.A.); (E.M.); (A.K.); (G.R.); (P.T.)
| | - Evangelos Makris
- School of Electrical and Computer Engineering, National Technical University of Athens, 9 Iroon Polytechniou St., 15780 Athens, Greece; (C.A.); (E.M.); (A.K.); (G.R.); (P.T.)
| | - Angelos Kolaitis
- School of Electrical and Computer Engineering, National Technical University of Athens, 9 Iroon Polytechniou St., 15780 Athens, Greece; (C.A.); (E.M.); (A.K.); (G.R.); (P.T.)
| | - Georgios Rassias
- School of Electrical and Computer Engineering, National Technical University of Athens, 9 Iroon Polytechniou St., 15780 Athens, Greece; (C.A.); (E.M.); (A.K.); (G.R.); (P.T.)
| | - Christos Pavlatos
- Hellenic Air Force Academy, Dekelia Air Base, Acharnes, 13671 Athens, Greece
- Correspondence: ; Tel.: +30-210-7722541
| | - Panayiotis Tsanakas
- School of Electrical and Computer Engineering, National Technical University of Athens, 9 Iroon Polytechniou St., 15780 Athens, Greece; (C.A.); (E.M.); (A.K.); (G.R.); (P.T.)
| |
Collapse
|
3
|
Chang H, Nie Y, Zhang N, Zhang X, Sun H, Mao Y, Qiu Z, Huang Y. MtOrt: an empirical mitochondrial amino acid substitution model for evolutionary studies of Orthoptera insects. BMC Evol Biol 2020; 20:57. [PMID: 32429841 PMCID: PMC7236349 DOI: 10.1186/s12862-020-01623-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2020] [Accepted: 05/05/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Amino acid substitution models play an important role in inferring phylogenies from proteins. Although different amino acid substitution models have been proposed, only a few were estimated from mitochondrial protein sequences for specific taxa such as the mtArt model for Arthropoda. The increasing of mitochondrial genome data from broad Orthoptera taxa provides an opportunity to estimate the Orthoptera-specific mitochondrial amino acid empirical model. RESULTS We sequenced complete mitochondrial genomes of 54 Orthoptera species, and estimated an amino acid substitution model (named mtOrt) by maximum likelihood method based on the 283 complete mitochondrial genomes available currently. The results indicated that there are obvious differences between mtOrt and the existing models, and the new model can better fit the Orthoptera mitochondrial protein datasets. Moreover, topologies of trees constructed using mtOrt and existing models are frequently different. MtOrt does indeed have an impact on likelihood improvement as well as tree topologies. The comparisons between the topologies of trees constructed using mtOrt and existing models show that the new model outperforms the existing models in inferring phylogenies from Orthoptera mitochondrial protein data. CONCLUSIONS The new mitochondrial amino acid substitution model of Orthoptera shows obvious differences from the existing models, and outperforms the existing models in inferring phylogenies from Orthoptera mitochondrial protein sequences.
Collapse
Affiliation(s)
- Huihui Chang
- College of Life Sciences, Shaanxi Normal University, No. 620, West Chang'an Avenue, Xi'an, 710119, Shaanxi, China
| | - Yimeng Nie
- College of Life Sciences, Shaanxi Normal University, No. 620, West Chang'an Avenue, Xi'an, 710119, Shaanxi, China
| | - Nan Zhang
- College of Life Sciences, Shaanxi Normal University, No. 620, West Chang'an Avenue, Xi'an, 710119, Shaanxi, China
| | - Xue Zhang
- College of Life Sciences, Shaanxi Normal University, No. 620, West Chang'an Avenue, Xi'an, 710119, Shaanxi, China
| | - Huimin Sun
- College of Life Sciences, Shaanxi Normal University, No. 620, West Chang'an Avenue, Xi'an, 710119, Shaanxi, China
| | - Ying Mao
- College of Life Sciences, Shaanxi Normal University, No. 620, West Chang'an Avenue, Xi'an, 710119, Shaanxi, China
| | - Zhongying Qiu
- School of Basic Medical Sciences & Shaanxi Key Laboratory of Brain Disorders, Xi'an Medical University, Xi'an, 710021, China
| | - Yuan Huang
- College of Life Sciences, Shaanxi Normal University, No. 620, West Chang'an Avenue, Xi'an, 710119, Shaanxi, China.
| |
Collapse
|
4
|
Boutte J, Fishbein M, Liston A, Straub SCK. NGS-Indel Coder: A pipeline to code indel characters in phylogenomic data with an example of its application in milkweeds (Asclepias). Mol Phylogenet Evol 2019; 139:106534. [PMID: 31212081 DOI: 10.1016/j.ympev.2019.106534] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2019] [Revised: 05/12/2019] [Accepted: 06/13/2019] [Indexed: 12/30/2022]
Abstract
Targeted genome sequencing approaches allow characterization of evolutionary relationships using a considerable number of nuclear genes and informative characters. However, most phylogenomic analyses only utilize single nucleotide polymorphisms (SNPs). Studies at the species level, especially in groups that have recently radiated, often recover low amounts of phylogenetically informative variation in coding regions, and require non-coding sequences, which are richer in indels, to resolve gene trees. Here, NGS-Indel Coder, a pipeline to detect and omit false positive indels inferred from assemblies of short read sequence data, was developed to resolve the relationships among and within major clades of the American milkweeds (Asclepias), which are the result of a rapid and recent evolutionary radiation, and whose phylogeny has been difficult to resolve. This pipeline was applied to a Hyb-Seq data set of 768 loci including targeted exons and flanking intron regions from 33 milkweed species. Robust species tree inference was improved by excluding small alignment partitions (<100 bp) that increased gene tree ambiguity and incongruence. To further investigate the robustness of indel coding, data sets that included small and large indels were explored, and species trees derived from concatenated loci versus coalescent methods based on gene trees were compared. The phylogeny of Asclepias obtained using nuclear data was well resolved, and phylogenetic information from indels improved resolution of specific nodes. The Temperate North American, Mexican Highland, and Incarnatae clades were well supported as monophyletic. Asclepias coulteri, which has been considered part of the Sonoran Desert clade based on plastome analyses, was placed as sister to all the other milkweed species studied here, rather than as a member of that clade. Two groups within the Temperate North American and Mexican clades were not resolved, and the inferred relationships strongly conflicted when comparing results based on data sets that did or did not include indel characters. This new pipeline represents a step forward in making maximal use of the information content in phylogenomic data sets.
Collapse
Affiliation(s)
- Julien Boutte
- Department of Biology, Hobart and William Smith Colleges, Geneva, NY, USA
| | - Mark Fishbein
- Department of Plant Biology, Ecology and Evolution, Oklahoma State University, Stillwater, OK, USA
| | - Aaron Liston
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA
| | - Shannon C K Straub
- Department of Biology, Hobart and William Smith Colleges, Geneva, NY, USA.
| |
Collapse
|
5
|
Zhai Y, Alexandre BC. A Poissonian Model of Indel Rate Variation for Phylogenetic Tree Inference. Syst Biol 2018; 66:698-714. [PMID: 28204784 DOI: 10.1093/sysbio/syx033] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2015] [Accepted: 01/27/2017] [Indexed: 01/22/2023] Open
Abstract
While indel rate variation has been observed and analyzed in detail, it is not taken into account by current indel-aware phylogenetic reconstruction methods. In this work, we introduce a continuous time stochastic process, the geometric Poisson indel process, that generalizes the Poisson indel process by allowing insertion and deletion rates to vary across sites. We design an efficient algorithm for computing the probability of a given multiple sequence alignment based on our new indel model. We describe a method to construct phylogeny estimates from a fixed alignment using neighbor joining. Using simulation studies, we show that ignoring indel rate variation may have a detrimental effect on the accuracy of the inferred phylogenies, and that our proposed method can sidestep this issue by inferring latent indel rate categories. We also show that our phylogenetic inference method may be more stable to taxa subsampling than methods that either ignore indels or indel rate variation. [evolutionary stochastic process; indel rate variation; Poisson indel process; TKF91.].
Collapse
Affiliation(s)
- Yongliang Zhai
- Department of Statistics, University of British Columbia, Vancouver, British Columbia, V6T 1Z4, Canada
| | - Bouchard-Côté Alexandre
- Department of Statistics, University of British Columbia, Vancouver, British Columbia, V6T 1Z4, Canada
| |
Collapse
|
6
|
Abstract
BACKGROUND Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. MAIN TEXT This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees. CONCLUSIONS While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances.
Collapse
Affiliation(s)
- Ian H. Holmes
- 0000 0001 2181 7878grid.47840.3fDept of Bioengineering, University of California, Berkeley, 94720 USA
| |
Collapse
|
7
|
Tremblay-Savard O, Reinharz V, Waldispühl J. Reconstruction of ancestral RNA sequences under multiple structural constraints. BMC Genomics 2016; 17:862. [PMID: 28185557 PMCID: PMC5123390 DOI: 10.1186/s12864-016-3105-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Secondary structures form the scaffold of multiple sequence alignment of non-coding RNA (ncRNA) families. An accurate reconstruction of ancestral ncRNAs must use this structural signal. However, the inference of ancestors of a single ncRNA family with a single consensus structure may bias the results towards sequences with high affinity to this structure, which are far from the true ancestors. METHODS In this paper, we introduce achARNement, a maximum parsimony approach that, given two alignments of homologous ncRNA families with consensus secondary structures and a phylogenetic tree, simultaneously calculates ancestral RNA sequences for these two families. RESULTS We test our methodology on simulated data sets, and show that achARNement outperforms classical maximum parsimony approaches in terms of accuracy, but also reduces by several orders of magnitude the number of candidate sequences. To conclude this study, we apply our algorithms on the Glm clan and the FinP-traJ clan from the Rfam database. CONCLUSIONS Our results show that our methods reconstruct small sets of high-quality candidate ancestors with better agreement to the two target structures than with classical approaches. Our program is freely available at: http://csb.cs.mcgill.ca/acharnement .
Collapse
Affiliation(s)
- Olivier Tremblay-Savard
- School of Computer Science, McGill University, Montreal, H3A 0E9, Canada.,Department of Computer Science, University of Manitoba, Winnipeg, R3T 2N2, Canada
| | - Vladimir Reinharz
- School of Computer Science, McGill University, Montreal, H3A 0E9, Canada
| | - Jérôme Waldispühl
- School of Computer Science, McGill University, Montreal, H3A 0E9, Canada.
| |
Collapse
|
8
|
Hobolth A, Jensen JL. Summary Statistics for Endpoint-Conditioned Continuous-Time Markov Chains. J Appl Probab 2016. [DOI: 10.1239/jap/1324046009] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Continuous-time Markov chains are a widely used modelling tool. Applications include DNA sequence evolution, ion channel gating behaviour, and mathematical finance. We consider the problem of calculating properties of summary statistics (e.g. mean time spent in a state, mean number of jumps between two states, and the distribution of the total number of jumps) for discretely observed continuous-time Markov chains. Three alternative methods for calculating properties of summary statistics are described and the pros and cons of the methods are discussed. The methods are based on (i) an eigenvalue decomposition of the rate matrix, (ii) the uniformization method, and (iii) integrals of matrix exponentials. In particular, we develop a framework that allows for analyses of rather general summary statistics using the uniformization method.
Collapse
|
9
|
Abstract
Models of protein evolution are used to describe evolutionary processes, for phylogenetic analyses and homology detection. Widely used general models of protein evolution are biased toward globular domains and lack resolution to describe evolutionary processes for other protein types. As three-dimensional structure is a major constraint to protein evolution, specific models have been proposed for other types of proteins. Here, we consider evolutionary patterns in coiled-coil forming proteins. Coiled-coils are widespread structural domains, formed by a repeated motif of seven amino acids (heptad repeat). Coiled-coil forming proteins are frequently rods and spacers, structuring both the intracellular and the extracellular spaces that often form protein interaction interfaces. We tested the hypothesis that due to their specific structure the associated evolutionary constraints differ from those of globular proteins. We showed that substitution patterns in coiled-coil regions are different than those observed in globular regions, beyond the simple heptad repeat. Based on these substitution patterns we developed a coiled-coil specific (CC) model that in the context of phylogenetic reconstruction outperforms general models in tree likelihood, often leading to different topologies. For multidomain proteins containing both a coiled-coil region and a globular domain, we showed that a combination of the CC model and a general one gives higher likelihoods than a single model. Finally, we showed that the model can be used for homology detection to increase search sensitivity for coiled-coil proteins. The CC model, software, and other supplementary materials are available at http://www.evocell.org/cgl/resources (last accessed January 29, 2015).
Collapse
|
10
|
Mirsky A, Kazandjian L, Anisimova M. Antibody-specific model of amino acid substitution for immunological inferences from alignments of antibody sequences. Mol Biol Evol 2014; 32:806-19. [PMID: 25534034 PMCID: PMC4327158 DOI: 10.1093/molbev/msu340] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022] Open
Abstract
Antibodies are glycoproteins produced by the immune system as a dynamically adaptive line of defense against invading pathogens. Very elegant and specific mutational mechanisms allow B lymphocytes to produce a large and diversified repertoire of antibodies, which is modified and enhanced throughout all adulthood. One of these mechanisms is somatic hypermutation, which stochastically mutates nucleotides in the antibody genes, forming new sequences with different properties and, eventually, higher affinity and selectivity to the pathogenic target. As somatic hypermutation involves fast mutation of antibody sequences, this process can be described using a Markov substitution model of molecular evolution. Here, using large sets of antibody sequences from mice and humans, we infer an empirical amino acid substitution model AB, which is specific to antibody sequences. Compared with existing general amino acid models, we show that the AB model provides significantly better description for the somatic evolution of mice and human antibody sequences, as demonstrated on large next generation sequencing (NGS) antibody data. General amino acid models are reflective of conservation at the protein level due to functional constraints, with most frequent amino acids exchanges taking place between residues with the same or similar physicochemical properties. In contrast, within the variable part of antibody sequences we observed an elevated frequency of exchanges between amino acids with distinct physicochemical properties. This is indicative of a sui generis mutational mechanism, specific to antibody somatic hypermutation. We illustrate this property of antibody sequences by a comparative analysis of the network modularity implied by the AB model and general amino acid substitution models. We recommend using the new model for computational studies of antibody sequence maturation, including inference of alignments and phylogenetic trees describing antibody somatic hypermutation in large NGS data sets. The AB model is implemented in the open-source software CodonPhyML (http://sourceforge.net/projects/codonphyml) and can be downloaded and supplied by the user to ProGraphMSA (http://sourceforge.net/projects/prographmsa) or other alignment and phylogeny reconstruction programs that allow for user-defined substitution models.
Collapse
Affiliation(s)
- Alexander Mirsky
- Systems Biophysics and Functional Nanosystems, Ludwig-Maximilians-Universität München, München, Germany Department of Computer Science, Swiss Federal Institute of Technology (ETH Zürich), Zürich, Switzerland
| | | | - Maria Anisimova
- Department of Computer Science, Swiss Federal Institute of Technology (ETH Zürich), Zürich, Switzerland Institute of Applied Simulation (IAS), School of Life Sciences and Facility Management, Zürich University of Applied Sciences (ZHAW), Wädenswil, Switzerland
| |
Collapse
|
11
|
Dang CC, Le VS, Gascuel O, Hazes B, Le QS. FastMG: a simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets. BMC Bioinformatics 2014; 15:341. [PMID: 25344302 PMCID: PMC4287512 DOI: 10.1186/1471-2105-15-341] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2014] [Accepted: 09/29/2014] [Indexed: 11/11/2022] Open
Abstract
Background Amino acid replacement rate matrices are a crucial component of many protein analysis systems such as sequence similarity search, sequence alignment, and phylogenetic inference. Ideally, the rate matrix reflects the mutational behavior of the actual data under study; however, estimating amino acid replacement rate matrices requires large protein alignments and is computationally expensive and complex. As a compromise, sub-optimal pre-calculated generic matrices are typically used for protein-based phylogeny. Sequence availability has now grown to a point where problem-specific rate matrices can often be calculated if the computational cost can be controlled. Results The most time consuming step in estimating rate matrices by maximum likelihood is building maximum likelihood phylogenetic trees from protein alignments. We propose a new procedure, called FastMG, to overcome this obstacle. The key innovation is the alignment-splitting algorithm that splits alignments with many sequences into non-overlapping sub-alignments prior to estimating amino acid replacement rates. Experiments with different large data sets showed that the FastMG procedure was an order of magnitude faster than without splitting. Importantly, there was no apparent loss in matrix quality if an appropriate splitting procedure is used. Conclusions FastMG is a simple, fast and accurate procedure to estimate amino acid replacement rate matrices from large data sets. It enables researchers to study the evolutionary relationships for specific groups of proteins or taxa with optimized, data-specific amino acid replacement rate matrices. The programs, data sets, and the new mammalian mitochondrial protein rate matrix are available at http://fastmg.codeplex.com.
Collapse
Affiliation(s)
| | | | | | | | - Quang Si Le
- The Wellcome Trust Center for Human Genetics, Oxford University, Oxford, UK.
| |
Collapse
|
12
|
Abstract
Models of codon evolution have attracted particular interest because of their unique capabilities to detect selection forces and their high fit when applied to sequence evolution. We described here a novel approach for modeling codon evolution, which is based on Kronecker product of matrices. The 61 × 61 codon substitution rate matrix is created using Kronecker product of three 4 × 4 nucleotide substitution matrices, the equilibrium frequency of codons, and the selection rate parameter. The entities of the nucleotide substitution matrices and selection rate are considered as parameters of the model, which are optimized by maximum likelihood. Our fully mechanistic model allows the instantaneous substitution matrix between codons to be fully estimated with only 19 parameters instead of 3,721, by using the biological interdependence existing between positions within codons. We illustrate the properties of our models using computer simulations and assessed its relevance by comparing the AICc measures of our model and other models of codon evolution on simulations and a large range of empirical data sets. We show that our model fits most biological data better compared with the current codon models. Furthermore, the parameters in our model can be interpreted in a similar way as the exchangeability rates found in empirical codon models.
Collapse
Affiliation(s)
- Maryam Zaheri
- Department of Ecology and Evolution, Biophore, University of Lausanne, 1015 Lausanne, SwitzerlandSwiss Institute of Bioinformatics, Genopode, Quartier Sorge, 1015 Lausanne, Switzerland
| | - Linda Dib
- Department of Ecology and Evolution, Biophore, University of Lausanne, 1015 Lausanne, SwitzerlandSwiss Institute of Bioinformatics, Genopode, Quartier Sorge, 1015 Lausanne, Switzerland
| | - Nicolas Salamin
- Department of Ecology and Evolution, Biophore, University of Lausanne, 1015 Lausanne, SwitzerlandSwiss Institute of Bioinformatics, Genopode, Quartier Sorge, 1015 Lausanne, Switzerland
| |
Collapse
|
13
|
Sükösd Z, Andersen ES, Lyngsø R. SCFGs in RNA secondary structure prediction RNA secondary structure prediction: a hands-on approach. Methods Mol Biol 2014; 1097:143-162. [PMID: 24639159 DOI: 10.1007/978-1-62703-709-9_8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
Stochastic context-free grammars (SCFGs) were first established in the context of natural language modelling, and only later found their applications in RNA secondary structure prediction. In this chapter, we discuss the basic SCFG algorithms (CYK and inside-outside algorithms) in an application-centered manner and use the pfold grammar as a case study to show how the algorithms can be adapted to a grammar in a nonstandard form. We extend our discussion to the use of grammars with additional information (such as evolutionary information) to improve the quality of predictions. Finally, we provide a brief survey of programs that use stochastic context-free grammars for RNA secondary structure prediction and modelling.
Collapse
Affiliation(s)
- Zsuzsanna Sükösd
- Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
| | | | | |
Collapse
|
14
|
Abstract
Empirical codon models (ECMs) estimated from a large number of globular protein families outperformed mechanistic codon models in their description of the general process of protein evolution. Among other factors, ECMs implicitly model the influence of amino acid properties and multiple nucleotide substitutions (MNS). However, the estimation of ECMs requires large quantities of data, and until recently, only few suitable data sets were available. Here, we take advantage of several new Drosophila species genomes to estimate codon models from genome-wide data. The availability of large numbers of genomes over varying phylogenetic depths in the Drosophila genus allows us to explore various divergence levels. In consequence, we can use these data to determine the appropriate level of divergence for the estimation of ECMs, avoiding overestimation of MNS rates caused by saturation. To account for variation in evolutionary rates along the genome, we develop new empirical codon hidden Markov models (ecHMMs). These models significantly outperform previous ones with respect to maximum likelihood values, suggesting that they provide a better fit to the evolutionary process. Using ECMs and ecHMMs derived from genome-wide data sets, we devise new likelihood ratio tests (LRTs) of positive selection. We found classical LRTs very sensitive to the presence of MNSs, showing high false-positive rates, especially with small phylogenies. The new LRTs are more conservative than the classical ones, having acceptable false-positive rates and reduced power.
Collapse
Affiliation(s)
- Nicola De Maio
- Institut für Populationsgenetik, Vetmeduni Vienna, Wien, Austria.
| | | | | | | |
Collapse
|
15
|
Zoller S, Schneider A. Improving phylogenetic inference with a semiempirical amino acid substitution model. Mol Biol Evol 2012; 30:469-79. [PMID: 23002090 DOI: 10.1093/molbev/mss229] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023] Open
Abstract
Amino acid substitution matrices describe the rates by which amino acids are replaced during evolution. In contrast to nucleotide or codon models, amino acid substitution matrices are in general parameterless and empirically estimated, probably because there is no obvious parametrization for amino acid substitutions. Principal component analysis has previously been used to improve codon substitution models by empirically finding the most relevant parameters. Here, we apply the same method to amino acid substitution matrices, leading to a semiempirical substitution model that can adjust the transition rates to the protein sequences under investigation. Our new model almost invariably achieves the best likelihood values in large-scale comparisons with established amino acid substitution models (JTT, WAG, and LG). In particular for longer alignments, these likelihood gains are considerably larger than what could be expected from simply having more parameters. The application of our model differs from that of mixture models (such as UL2 or UL3), as we optimize one rate matrix per alignment, whereas mixture models apply the variation per alignments site. This makes our model computationally more efficient, while the performance is comparable to that of UL3. Applied to the phylogenetic problem of the origin of placental mammals, our new model and the UL3 mixed model are the only ones of the tested models that cluster Afrotheria and Xenarthra into a clade called Atlantogenata, which would be in correspondence with recent findings using more sophisticated phylogenetic methods.
Collapse
Affiliation(s)
- Stefan Zoller
- Computational Biochemistry Research Group, ETH Zürich, Zürich, Switzerland.
| | | |
Collapse
|
16
|
Westesson O, Holmes I. Developing and applying heterogeneous phylogenetic models with XRate. PLoS One 2012; 7:e36898. [PMID: 22693624 PMCID: PMC3367922 DOI: 10.1371/journal.pone.0036898] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2011] [Accepted: 04/09/2012] [Indexed: 12/17/2022] Open
Abstract
Modeling sequence evolution on phylogenetic trees is a useful technique in computational biology. Especially powerful are models which take account of the heterogeneous nature of sequence evolution according to the "grammar" of the encoded gene features. However, beyond a modest level of model complexity, manual coding of models becomes prohibitively labor-intensive. We demonstrate, via a set of case studies, the new built-in model-prototyping capabilities of XRate (macros and Scheme extensions). These features allow rapid implementation of phylogenetic models which would have previously been far more labor-intensive. XRate 's new capabilities for lineage-specific models, ancestral sequence reconstruction, and improved annotation output are also discussed. XRate 's flexible model-specification capabilities and computational efficiency make it well-suited to developing and prototyping phylogenetic grammar models. XRate is available as part of the DART software package: http://biowiki.org/DART.
Collapse
Affiliation(s)
| | - Ian Holmes
- Department of Bioengineering, University of California, Berkeley, California, United States of America
| |
Collapse
|
17
|
Le SQ, Dang CC, Gascuel O. Modeling protein evolution with several amino acid replacement matrices depending on site rates. Mol Biol Evol 2012; 29:2921-36. [PMID: 22491036 DOI: 10.1093/molbev/mss112] [Citation(s) in RCA: 134] [Impact Index Per Article: 11.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Most protein substitution models use a single amino acid replacement matrix summarizing the biochemical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors that influence the substitution patterns. In this paper, we investigate the use of different substitution matrices for different site evolutionary rates. Indeed, the variability of evolutionary rates corresponds to one of the most apparent heterogeneity factors among sites, and there is no reason to assume that the substitution patterns remain identical regardless of the evolutionary rate. We first introduce LG4M, which is composed of four matrices, each corresponding to one discrete gamma rate category (of four). These matrices differ in their amino acid equilibrium distributions and in their exchangeabilities, contrary to the standard gamma model where only the global rate differs from one category to another. Next, we present LG4X, which also uses four different matrices, but leaves aside the gamma distribution and follows a distribution-free scheme for the site rates. All these matrices are estimated from a very large alignment database, and our two models are tested using a large sample of independent alignments. Detailed analysis of resulting matrices and models shows the complexity of amino acid substitutions and the advantage of flexible models such as LG4M and LG4X. Both significantly outperform single-matrix models, providing gains of dozens to hundreds of log-likelihood units for most data sets. LG4X obtains substantial gains compared with LG4M, thanks to its distribution-free scheme for site rates. Since LG4M and LG4X display such advantages but require the same memory space and have comparable running times to standard models, we believe that LG4M and LG4X are relevant alternatives to single replacement matrices. Our models, data, and software are available from http://www.atgc-montpellier.fr/models/lg4x.
Collapse
Affiliation(s)
- Si Quang Le
- Méthodes et Algorithmes pour la Bioinformatique (LIRMM & IBC), Centre National de la Recherche Scientifique (CNRS)-Université Montpellier II, Montpellier Cedex 5, France
| | | | | |
Collapse
|
18
|
Abstract
Populations evolve as mutations arise in individual organisms and, through hereditary transmission, may become "fixed" (shared by all individuals) in the population. Most mutations are lethal or have negative fitness consequences for the organism. Others have essentially no effect on organismal fitness and can become fixed through the neutral stochastic process known as random drift. However, mutations may also produce a selective advantage that boosts their chances of reaching fixation. Regions of genes where new mutations are beneficial, rather than neutral or deleterious, tend to evolve more rapidly due to positive selection. Genes involved in immunity and defense are a well-known example; rapid evolution in these genes presumably occurs because new mutations help organisms to prevail in evolutionary "arms races" with pathogens. In recent years, genome-wide scans for selection have enlarged our understanding of the evolution of the protein-coding regions of the various species. In this chapter, we focus on the methods to detect selection in protein-coding genes. In particular, we discuss probabilistic models and how they have changed with the advent of new genome-wide data now available.
Collapse
|
19
|
Tataru P, Hobolth A. Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains. BMC Bioinformatics 2011; 12:465. [PMID: 22142146 PMCID: PMC3329461 DOI: 10.1186/1471-2105-12-465] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2011] [Accepted: 12/05/2011] [Indexed: 11/10/2022] Open
Abstract
Background Continuous time Markov chains (CTMCs) is a widely used model for describing the evolution of DNA sequences on the nucleotide, amino acid or codon level. The sufficient statistics for CTMCs are the time spent in a state and the number of changes between any two states. In applications past evolutionary events (exact times and types of changes) are unaccessible and the past must be inferred from DNA sequence data observed in the present. Results We describe and implement three algorithms for computing linear combinations of expected values of the sufficient statistics, conditioned on the end-points of the chain, and compare their performance with respect to accuracy and running time. The first algorithm is based on an eigenvalue decomposition of the rate matrix (EVD), the second on uniformization (UNI), and the third on integrals of matrix exponentials (EXPM). The implementation in R of the algorithms is available at http://www.birc.au.dk/~paula/. Conclusions We use two different models to analyze the accuracy and eight experiments to investigate the speed of the three algorithms. We find that they have similar accuracy and that EXPM is the slowest method. Furthermore we find that UNI is usually faster than EVD.
Collapse
Affiliation(s)
- Paula Tataru
- Bioinformatics Research Center, Aarhus University, Aarhus, Denmark.
| | | |
Collapse
|
20
|
La D, Kihara D. A novel method for protein-protein interaction site prediction using phylogenetic substitution models. Proteins 2011; 80:126-41. [PMID: 21989996 DOI: 10.1002/prot.23169] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2011] [Revised: 07/07/2011] [Accepted: 08/17/2011] [Indexed: 11/10/2022]
Abstract
Protein-protein binding events mediate many critical biological functions in the cell. Typically, functionally important sites in proteins can be well identified by considering sequence conservation. However, protein-protein interaction sites exhibit higher sequence variation than other functional regions, such as catalytic sites of enzymes. Consequently, the mutational behavior leading to weak sequence conservation poses significant challenges to the protein-protein interaction site prediction. Here, we present a phylogenetic framework to capture critical sequence variations that favor the selection of residues essential for protein-protein binding. Through the comprehensive analysis of diverse protein families, we show that protein binding interfaces exhibit distinct amino acid substitution as compared with other surface residues. On the basis of this analysis, we have developed a novel method, BindML, which utilizes the substitution models to predict protein-protein binding sites of protein with unknown interacting partners. BindML estimates the likelihood that a phylogenetic tree of a local surface region in a query protein structure follows the substitution patterns of protein binding interface and nonbinding surfaces. BindML is shown to perform well compared to alternative methods for protein binding interface prediction. The methodology developed in this study is very versatile in the sense that it can be generally applied for predicting other types of functional sites, such as DNA, RNA, and membrane binding sites in proteins.
Collapse
Affiliation(s)
- David La
- Department of Biological Sciences, College of Science, Purdue University, West Lafayette, Indiana 47907, USA
| | | |
Collapse
|
21
|
Dang CC, Lefort V, Le VS, Le QS, Gascuel O. ReplacementMatrix: a web server for maximum-likelihood estimation of amino acid replacement rate matrices. Bioinformatics 2011; 27:2758-60. [PMID: 21791535 DOI: 10.1093/bioinformatics/btr435] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023] Open
Abstract
SUMMARY Amino acid replacement rate matrices are an essential basis of protein studies (e.g. in phylogenetics and alignment). A number of general purpose matrices have been proposed (e.g. JTT, WAG, LG) since the seminal work of Margaret Dayhoff and co-workers. However, it has been shown that matrices specific to certain protein groups (e.g. mitochondrial) or life domains (e.g. viruses) differ significantly from general average matrices, and thus perform better when applied to the data to which they are dedicated. This Web server implements the maximum-likelihood estimation procedure that was used to estimate LG, and provides a number of tools and facilities. Users upload a set of multiple protein alignments from their domain of interest and receive the resulting matrix by email, along with statistics and comparisons with other matrices. A non-parametric bootstrap is performed optionally to assess the variability of replacement rate estimates. Maximum-likelihood trees, inferred using the estimated rate matrix, are also computed optionally for each input alignment. Finely tuned procedures and up-to-date ML software (PhyML 3.0, XRATE) are combined to perform all these heavy calculations on our clusters. AVAILABILITY http://www.atgc-montpellier.fr/ReplacementMatrix/ CONTACT olivier.gascuel@lirmm.fr SUPPLEMENTARY INFORMATION Supplementary data are available at http://www.atgc-montpellier.fr/ReplacementMatrix/
Collapse
Affiliation(s)
- Cuong Cao Dang
- College of Technology and Information Technology Institute, Vietnam National University, Hanoi, Vietnam
| | | | | | | | | |
Collapse
|
22
|
Kiryu H. Sufficient statistics and expectation maximization algorithms in phylogenetic tree models. ACTA ACUST UNITED AC 2011; 27:2346-53. [PMID: 21757463 DOI: 10.1093/bioinformatics/btr420] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION Measuring evolutionary conservation is a routine step in the identification of functional elements in genome sequences. Although a number of studies have proposed methods that use the continuous time Markov models (CTMMs) to find evolutionarily constrained elements, their probabilistic structures have been less frequently investigated. RESULTS In this article, we investigate a sufficient statistic for CTMMs. The statistic is composed of the fractional duration of nucleotide characters over evolutionary time, F(d), and the number of substitutions occurring in phylogenetic trees, N(s). We first derive basic properties of the sufficient statistic. Then, we derive an expectation maximization (EM) algorithm for estimating the parameters of a phylogenetic model, which iteratively computes the expectation values of the sufficient statistic. We show that the EM algorithm exhibits much faster convergence than other optimization methods that use numerical gradient descent algorithms. Finally, we investigate the genome-wide distribution of fractional duration time F(d) which, unlike the number of substitutions N(s), has rarely been investigated. We show that F(d) has evolutionary information that is distinct from that in N(s), which may be useful for detecting novel types of evolutionary constraints existing in the human genome. AVAILABILITY The C++ source code of the 'Fdur' software is available at http://www.ncrna.org/software/fdur/ CONTACT kiryu-h@k.u-tokyo.ac.jp SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Hisanori Kiryu
- Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan.
| |
Collapse
|
23
|
Kosiol C, Goldman N. Markovian and non-Markovian protein sequence evolution: aggregated Markov process models. J Mol Biol 2011; 411:910-23. [PMID: 21718704 PMCID: PMC3157587 DOI: 10.1016/j.jmb.2011.06.005] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2010] [Revised: 05/28/2011] [Accepted: 06/03/2011] [Indexed: 12/03/2022]
Abstract
Over the years, there have been claims that evolution proceeds according to systematically different processes over different timescales and that protein evolution behaves in a non-Markovian manner. On the other hand, Markov models are fundamental to many applications in evolutionary studies. Apparent non-Markovian or time-dependent behavior has been attributed to influence of the genetic code at short timescales and dominance of physicochemical properties of the amino acids at long timescales. However, any long time period is simply the accumulation of many short time periods, and it remains unclear why evolution should appear to act systematically differently across the range of timescales studied. We show that the observed time-dependent behavior can be explained qualitatively by modeling protein sequence evolution as an aggregated Markov process (AMP): a time-homogeneous Markovian substitution model observed only at the level of the amino acids encoded by the protein-coding DNA sequence. The study of AMPs sheds new light on the relationship between amino acid-level and codon-level models of sequence evolution, and our results suggest that protein evolution should be modeled at the codon level rather than using amino acid substitution models.
Collapse
Affiliation(s)
- Carolin Kosiol
- Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, A-1210 Wien, Austria.
| | | |
Collapse
|
24
|
Szalkowski AM, Anisimova M. Markov models of amino acid substitution to study proteins with intrinsically disordered regions. PLoS One 2011; 6:e20488. [PMID: 21647374 PMCID: PMC3103576 DOI: 10.1371/journal.pone.0020488] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2011] [Accepted: 04/27/2011] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Intrinsically disordered proteins (IDPs) or proteins with disordered regions (IDRs) do not have a well-defined tertiary structure, but perform a multitude of functions, often relying on their native disorder to achieve the binding flexibility through changing to alternative conformations. Intrinsic disorder is frequently found in all three kingdoms of life, and may occur in short stretches or span whole proteins. To date most studies contrasting the differences between ordered and disordered proteins focused on simple summary statistics. Here, we propose an evolutionary approach to study IDPs, and contrast patterns specific to ordered protein regions and the corresponding IDRs. RESULTS Two empirical Markov models of amino acid substitutions were estimated, based on a large set of multiple sequence alignments with experimentally verified annotations of disordered regions from the DisProt database of IDPs. We applied new methods to detect differences in Markovian evolution and evolutionary rates between IDRs and the corresponding ordered protein regions. Further, we investigated the distribution of IDPs among functional categories, biochemical pathways and their preponderance to contain tandem repeats. CONCLUSIONS We find significant differences in the evolution between ordered and disordered regions of proteins. Most importantly we find that disorder promoting amino acids are more conserved in IDRs, indicating that in some cases not only amino acid composition but the specific sequence is important for function. This conjecture is also reinforced by the observation that for of our data set IDRs evolve more slowly than the ordered parts of the proteins, while we still support the common view that IDRs in general evolve more quickly. The improvement in model fit indicates a possible improvement for various types of analyses e.g. de novo disorder prediction using a phylogenetic Hidden Markov Model based on our matrices showed a performance similar to other disorder predictors.
Collapse
|
25
|
Zoller S, Schneider A. Empirical analysis of the most relevant parameters of codon substitution models. J Mol Evol 2010; 70:605-12. [PMID: 20526712 DOI: 10.1007/s00239-010-9356-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2010] [Accepted: 05/17/2010] [Indexed: 02/04/2023]
Abstract
Traditionally, codon models of evolution have been parametric, meaning that the 61 x 61 substitution rate matrix was derived from only a handful of parameters, typically the equilibrium frequencies, the ratio of nonsynonymous to synonymous substitution rates and the ratio between transition and transversion rates. These parameters are reasonable choices and are based on observations of what aspects of evolution often vary in coding DNA. However, the choices are relatively arbitrary and no systematic empirical search has ever been performed to identify the best parameters for a codon model. Even for the empirical or semi-empirical models that have been presented recently, only the average substitution rates have been estimated from databases of real coding DNA, but the parameters used were essentially the same as before. In this study we attempted to investigate empirically what the most relevant parameters for a codon model are. By performing a principal component analysis (PCA) on 3666 substitution rate matrices estimated from single gene families, the sets of the most co-varying substitution rates were determined. Interestingly, the two most significant principal components (PCs) describe clearly identifiable parameters: the first PC separates synonymous and nonsynonymous substitutions while the second PC distinguishes between substitutions where only one nucleotide changes and substitutions with two or three nucleotide changes. For the third and subsequent PCs no simple descriptions could be found.
Collapse
Affiliation(s)
- Stefan Zoller
- Computer Science Department, ETH Zurich, Zürich, Switzerland.
| | | |
Collapse
|
26
|
Le SQ, Gascuel O. Accounting for Solvent Accessibility and Secondary Structure in Protein Phylogenetics Is Clearly Beneficial. Syst Biol 2010; 59:277-87. [DOI: 10.1093/sysbio/syq002] [Citation(s) in RCA: 72] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Affiliation(s)
- Si Quang Le
- Méthodes et Algorithmes pour la Bioinformatique, LIRMM, CNRS–Université Montpellier II, 161 rue Ada, 34392 Montpellier Cedex 5, France
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK
| | - Olivier Gascuel
- Méthodes et Algorithmes pour la Bioinformatique, LIRMM, CNRS–Université Montpellier II, 161 rue Ada, 34392 Montpellier Cedex 5, France
| |
Collapse
|
27
|
Dang CC, Le QS, Gascuel O, Le VS. FLU, an amino acid substitution model for influenza proteins. BMC Evol Biol 2010; 10:99. [PMID: 20384985 PMCID: PMC2873421 DOI: 10.1186/1471-2148-10-99] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2009] [Accepted: 04/12/2010] [Indexed: 01/28/2023] Open
Abstract
Background The amino acid substitution model is the core component of many protein analysis systems such as sequence similarity search, sequence alignment, and phylogenetic inference. Although several general amino acid substitution models have been estimated from large and diverse protein databases, they remain inappropriate for analyzing specific species, e.g., viruses. Emerging epidemics of influenza viruses raise the need for comprehensive studies of these dangerous viruses. We propose an influenza-specific amino acid substitution model to enhance the understanding of the evolution of influenza viruses. Results A maximum likelihood approach was applied to estimate an amino acid substitution model (FLU) from ~113, 000 influenza protein sequences, consisting of ~20 million residues. FLU outperforms 14 widely used models in constructing maximum likelihood phylogenetic trees for the majority of influenza protein alignments. On average, FLU gains ~42 log likelihood points with an alignment of 300 sites. Moreover, topologies of trees constructed using FLU and other models are frequently different. FLU does indeed have an impact on likelihood improvement as well as tree topologies. It was implemented in PhyML and can be downloaded from ftp://ftp.sanger.ac.uk/pub/1000genomes/lsq/FLU or included in PhyML 3.0 server at http://www.atgc-montpellier.fr/phyml/. Conclusions FLU should be useful for any influenza protein analysis system which requires an accurate description of amino acid substitutions.
Collapse
Affiliation(s)
- Cuong Cao Dang
- College of Technology, Vietnam National University Hanoi, Cau Giay, Hanoi, Vietnam
| | | | | | | |
Collapse
|
28
|
Bernhart SH, Hofacker IL. From consensus structure prediction to RNA gene finding. BRIEFINGS IN FUNCTIONAL GENOMICS AND PROTEOMICS 2009; 8:461-71. [PMID: 19833701 DOI: 10.1093/bfgp/elp043] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
Reliable structure prediction is a prerequisite for most types of bioinformatical analysis of RNA. Since the accuracy of structure prediction from single sequences is limited, one often resorts to computing the consensus structure for a set of related RNA sequences. Since functionally important RNA structures are expected to evolve much more slowly than the underlying sequences, the pattern of sequence (co-)variation can be exploited to dramatically improve structure prediction. Since a conserved common structure is only expected when the RNA structure is under selective pressure, consensus structure prediction also provides an ideal starting point for the de novo detection of structured non-coding RNAs. Here, we review different strategies for the prediction of consensus secondary structures, and show how these approaches can be used to predict non-coding RNA genes.
Collapse
Affiliation(s)
- Stephan H Bernhart
- Department of Theoretical Chemistry, University of Vienna, Währingerstrasse 17, A-1090 Wien, Austria.
| | | |
Collapse
|
29
|
Bradley RK, Uzilov AV, Skinner ME, Bendaña YR, Barquist L, Holmes I. Evolutionary modeling and prediction of non-coding RNAs in Drosophila. PLoS One 2009; 4:e6478. [PMID: 19668382 PMCID: PMC2721679 DOI: 10.1371/journal.pone.0006478] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2009] [Accepted: 06/30/2009] [Indexed: 12/19/2022] Open
Abstract
We performed benchmarks of phylogenetic grammar-based ncRNA gene prediction, experimenting with eight different models of structural evolution and two different programs for genome alignment. We evaluated our models using alignments of twelve Drosophila genomes. We find that ncRNA prediction performance can vary greatly between different gene predictors and subfamilies of ncRNA gene. Our estimates for false positive rates are based on simulations which preserve local islands of conservation; using these simulations, we predict a higher rate of false positives than previous computational ncRNA screens have reported. Using one of the tested prediction grammars, we provide an updated set of ncRNA predictions for D. melanogaster and compare them to previously-published predictions and experimental data. Many of our predictions show correlations with protein-coding genes. We found significant depletion of intergenic predictions near the 3' end of coding regions and furthermore depletion of predictions in the first intron of protein-coding genes. Some of our predictions are colocated with larger putative unannotated genes: for example, 17 of our predictions showing homology to the RFAM family snoR28 appear in a tandem array on the X chromosome; the 4.5 Kbp spanned by the predicted tandem array is contained within a FlyBase-annotated cDNA.
Collapse
Affiliation(s)
- Robert K. Bradley
- Biophysics Graduate Group, University of California, Berkeley, California, United States of America
| | - Andrew V. Uzilov
- Department of Bioengineering, University of California, Berkeley, California, United States of America
| | - Mitchell E. Skinner
- Department of Bioengineering, University of California, Berkeley, California, United States of America
| | - Yuri R. Bendaña
- Department of Bioengineering, University of California, Berkeley, California, United States of America
| | - Lars Barquist
- Department of Bioengineering, University of California, Berkeley, California, United States of America
| | - Ian Holmes
- Biophysics Graduate Group, University of California, Berkeley, California, United States of America
- Department of Bioengineering, University of California, Berkeley, California, United States of America
| |
Collapse
|
30
|
Le SQ, Lartillot N, Gascuel O. Phylogenetic mixture models for proteins. Philos Trans R Soc Lond B Biol Sci 2009; 363:3965-76. [PMID: 18852096 DOI: 10.1098/rstb.2008.0180] [Citation(s) in RCA: 141] [Impact Index Per Article: 9.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Standard protein substitution models use a single amino acid replacement rate matrix that summarizes the biological, chemical and physical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors: genetic code; solvent exposure; secondary and tertiary structure; protein function; etc. These impact the substitution pattern and, in most cases, a single replacement matrix is not enough to represent all the complexity of the evolutionary processes. This paper explores in maximum-likelihood framework phylogenetic mixture models that combine several amino acid replacement matrices to better fit protein evolution.We learn these mixture models from a large alignment database extracted from HSSP, and test the performance using independent alignments from TREEBASE.We compare unsupervised learning approaches, where the site categories are unknown, to supervised ones, where in estimations we use the known category of each site, based on its exposure or its secondary structure. All our models are combined with gamma-distributed rates across sites. Results show that highly significant likelihood gains are obtained when using mixture models compared with the best available single replacement matrices. Mixtures of matrices also improve over mixtures of profiles in the manner of the CAT model. The unsupervised approach tends to be better than the supervised one, but it appears difficult to implement and highly sensitive to the starting values of the parameters, meaning that the supervised approach is still of interest for initialization and model comparison. Using an unsupervised model involving three matrices, the average AIC gain per site with TREEBASE test alignments is 0.31, 0.49 and 0.61 compared with LG (named after Le & Gascuel 2008 Mol. Biol. Evol. 25, 1307-1320), WAG and JTT, respectively. This three-matrix model is significantly better than LG for 34 alignments (among 57), and significantly worse for 1 alignment only. Moreover, tree topologies inferred with our mixture models frequently differ from those obtained with single matrices, indicating that using these mixtures impacts not only the likelihood value but also the output tree. All our models and a PhyML implementation are available from http://atgc.lirmm.fr/mixtures.
Collapse
Affiliation(s)
- Si Quang Le
- Méthodes et Algorithmes pour Bioinformatique, LIRMM, CNRS - Université Montpellier II, 161 rue Ada, 34392 Montpellier Cedex 5, France
| | | | | |
Collapse
|
31
|
Heger A, Ponting CP, Holmes I. Accurate estimation of gene evolutionary rates using XRATE, with an application to transmembrane proteins. Mol Biol Evol 2009; 26:1715-21. [PMID: 19380462 DOI: 10.1093/molbev/msp080] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/01/2023] Open
Abstract
XRATE implements algorithms for comparative annotation, ancestral reconstruction, evolutionary rate estimation, and simulation. Its modeling repertoire includes phylogenetic stochastic context-free grammars and phylo-hidden Markov models. Following earlier tests of XRATE as a machine-learning tool suitable for alignment annotation, we now report the first tests of XRATE as a precise quantitative instrument for estimating evolutionary rates. We implement a codon model similar to that of Goldman and Yang (1994) (A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 11: 725-736) and show that XRATE's parameter estimates are consistent with those of PAML. To demonstrate its utility, we apply the model to measure the difference in selective strength (omega) between intracellular and secreted regions of type I transmembrane proteins. In 215 of 303 instances, a complex model with individual omega for each region provides a better fit to the data than the simpler single omega value model. Secreted portions of type I transmembrane proteins show an elevation in omega similar to that seen for secreted protein genes. Less stringent purifying selection is thus a general property of the extracellular milieu, rather than being specific to only soluble and secreted proteins.
Collapse
Affiliation(s)
- Andreas Heger
- Department of Physiology, Anatomy and Genetics, MRC Functional Genomics Unit, University of Oxford, Oxford, UK.
| | | | | |
Collapse
|
32
|
Paten B, Herrero J, Fitzgerald S, Beal K, Flicek P, Holmes I, Birney E. Genome-wide nucleotide-level mammalian ancestor reconstruction. Genes Dev 2008; 18:1829-43. [PMID: 18849525 PMCID: PMC2577868 DOI: 10.1101/gr.076521.108] [Citation(s) in RCA: 132] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2008] [Accepted: 09/09/2008] [Indexed: 11/24/2022]
Abstract
Recently attention has been turned to the problem of reconstructing complete ancestral sequences from large multiple alignments. Successful generation of these genome-wide reconstructions will facilitate a greater knowledge of the events that have driven evolution. We present a new evolutionary alignment modeler, called "Ortheus," for inferring the evolutionary history of a multiple alignment, in terms of both substitutions and, importantly, insertions and deletions. Based on a multiple sequence probabilistic transducer model of the type proposed by Holmes, Ortheus uses efficient stochastic graph-based dynamic programming methods. Unlike other methods, Ortheus does not rely on a single fixed alignment from which to work. Ortheus is also more scaleable than previous methods while being fast, stable, and open source. Large-scale simulations show that Ortheus performs close to optimally on a deep mammalian phylogeny. Simulations also indicate that significant proportions of errors due to insertions and deletions can be avoided by not assuming a fixed alignment. We additionally use a challenging hold-out cross-validation procedure to test the method; using the reconstructions to predict extant sequence bases, we demonstrate significant improvements over using closest extant neighbor sequences. Accompanying this paper, a new, public, and genome-wide set of Ortheus ancestor alignments provide an intriguing new resource for evolutionary studies in mammals. As a first piece of analysis, we attempt to recover "fossilized" ancestral pseudogenes. We confidently find 31 cases in which the ancestral sequence had a more complete sequence than any of the extant sequences.
Collapse
Affiliation(s)
- Benedict Paten
- Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California 95064, USA
| | - Javier Herrero
- EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
| | - Stephen Fitzgerald
- EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
| | - Kathryn Beal
- EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
| | - Paul Flicek
- EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
| | - Ian Holmes
- Department of Bioengineering, University of California Berkeley, Berkeley, California 94720, USA
| | - Ewan Birney
- EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
| |
Collapse
|
33
|
Anisimova M, Kosiol C. Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol 2008; 26:255-71. [PMID: 18922761 DOI: 10.1093/molbev/msn232] [Citation(s) in RCA: 127] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
This review is motivated by the true explosion in the number of recent studies both developing and ameliorating probabilistic models of codon evolution. Traditionally parametric, the first codon models focused on estimating the effects of selective pressure on the protein via an explicit parameter in the maximum likelihood framework. Likelihood ratio tests of nested codon models armed the biologists with powerful tools, which provided unambiguous evidence for positive selection in real data. This, in turn, triggered a new wave of methodological developments. The new generation of models views the codon evolution process in a more sophisticated way, relaxing several mathematical assumptions. These models make a greater use of physicochemical amino acid properties, genetic code machinery, and the large amounts of data from the public domain. The overview of the most recent advances on modeling codon evolution is presented here, and a wide range of their applications to real data is discussed. On the downside, availability of a large variety of models, each accounting for various biological factors, increases the margin for misinterpretation; the biological meaning of certain parameters may vary among models, and model selection procedures also deserve greater attention. Solid understanding of the modeling assumptions and their applicability is essential for successful statistical data analysis.
Collapse
Affiliation(s)
- Maria Anisimova
- Institute of Computational Science, Swiss Federal Institute of Technology, Zurich, Switzerland.
| | | |
Collapse
|
34
|
Bradley RK, Pachter L, Holmes I. Specific alignment of structured RNA: stochastic grammars and sequence annealing. ACTA ACUST UNITED AC 2008; 24:2677-83. [PMID: 18796475 DOI: 10.1093/bioinformatics/btn495] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023]
Abstract
MOTIVATION Whole-genome screens suggest that eukaryotic genomes are dense with non-coding RNAs (ncRNAs). We introduce a novel approach to RNA multiple alignment which couples a generative probabilistic model of sequence and structure with an efficient sequence annealing approach for exploring the space of multiple alignments. This leads to a new software program, Stemloc-AMA, that is both accurate and specific in the alignment of multiple related RNA sequences. RESULTS When tested on the benchmark datasets BRalibase II and BRalibase 2.1, Stemloc-AMA has comparable sensitivity to and better specificity than the best competing methods. We use a large-scale random sequence experiment to show that while most alignment programs maximize sensitivity at the expense of specificity, even to the point of giving complete alignments of non-homologous sequences, Stemloc-AMA aligns only sequences with detectable homology and leaves unrelated sequences largely unaligned. Such accurate and specific alignments are crucial for comparative-genomics analysis, from inferring phylogeny to estimating substitution rates across different lineages. AVAILABILITY Stemloc-AMA is available from http://biowiki.org/StemLocAMA as part of the dart software package for sequence analysis.
Collapse
Affiliation(s)
- Robert K Bradley
- Biophysics Graduate Group, Department of Mathematics and Department of Bioengineering, University of California, Berkeley, CA 94720, USA
| | | | | |
Collapse
|
35
|
Abstract
Phylo-grammars, probabilistic models combining Markov chain substitution models with stochastic grammars, are powerful models for annotating structured features in multiple sequence alignments and analyzing the evolution of those features. In the past, these methods have been cumbersome to implement and modify. xrate provides means for the rapid development of phylo-grammars (using a simple file format) and automated parameterization of those grammars from training data (via the Expectation Maximization algorithm). xREI (pron. ‘X-ray’) is an intuitive, flexible AJAX (Asynchronous Javascript And XML) web interface to xrate providing grammar visualization tools as well as access to xrate's training and annotation functionality. It is hoped that this application will serve as a valuable tool to those developing phylo-grammars, and as a means for the exploration and dissemination of such models. xREI is available at http://harmony.biowiki.org/xrei/
Collapse
Affiliation(s)
- Lars Barquist
- Department of Bioengineering, University of California, Berkeley, USA.
| | | |
Collapse
|
36
|
Abstract
Amino acid replacement matrices are an essential basis of protein phylogenetics. They are used to compute substitution probabilities along phylogeny branches and thus the likelihood of the data. They are also essential in protein alignment. A number of replacement matrices and methods to estimate these matrices from protein alignments have been proposed since the seminal work of Dayhoff et al. (1972). An important advance was achieved by Whelan and Goldman (2001) and their WAG matrix, thanks to an efficient maximum likelihood estimation approach that accounts for the phylogenies of sequences within each training alignment. We further refine this method by incorporating the variability of evolutionary rates across sites in the matrix estimation and using a much larger and diverse database than BRKALN, which was used to estimate WAG. To estimate our new matrix (called LG after the authors), we use an adaptation of the XRATE software and 3,912 alignments from Pfam, comprising approximately 50,000 sequences and approximately 6.5 million residues overall. To evaluate the LG performance, we use an independent sample consisting of 59 alignments from TreeBase and randomly divide Pfam alignments into 3,412 training and 500 test alignments. The comparison with WAG and JTT shows a clear likelihood improvement. With TreeBase, we find that 1) the average Akaike information criterion gain per site is 0.25 and 0.42, when compared with WAG and JTT, respectively; 2) LG is significantly better than WAG for 38 alignments (among 59), and significantly worse with 2 alignments only; and 3) tree topologies inferred with LG, WAG, and JTT frequently differ, indicating that using LG impacts not only the likelihood value but also the output tree. Results with the test alignments from Pfam are analogous. LG and a PHYML implementation can be downloaded from http://atgc.lirmm.fr/LG.
Collapse
Affiliation(s)
- Si Quang Le
- Méthodes et Algorithmes pour la Bioinformatique, LIRMM, CNRS, Université Montpellier II, Montpellier, France
| | | |
Collapse
|
37
|
Abstract
Summary: Interactive examination of RNA multiple alignments for covariant mutations is a useful step in non-coding RNA sequence analysis. We present three parallel implementations of an RNA visualization metaphor: Colorstock, a command-line script using ANSI terminal color; SScolor, a Perl script that generates static HTML pages; and Ratón, an AJAX web application generating dynamic HTML. Each tool can be used to color RNA alignments by secondary structure and to visually highlight compensatory mutations in stems. Availability: All source code is freely available under the GPL. The source code can be downloaded and a prototype of Ratón can be accessed at http://biowiki.org/RnaAlignmentViewers Contact:ihh@berkeley.edu
Collapse
Affiliation(s)
- Yuri R Bendaña
- Department of Bioengineering, UC Berkeley, Berkeley, CA, USA
| | | |
Collapse
|
38
|
Margulies EH, Cooper GM, Asimenos G, Thomas DJ, Dewey CN, Siepel A, Birney E, Keefe D, Schwartz AS, Hou M, Taylor J, Nikolaev S, Montoya-Burgos JI, Löytynoja A, Whelan S, Pardi F, Massingham T, Brown JB, Bickel P, Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Stone EA, Rosenbloom KR, Kent WJ, Bouffard GG, Guan X, Hansen NF, Idol JR, Maduro VVB, Maskeri B, McDowell JC, Park M, Thomas PJ, Young AC, Blakesley RW, Muzny DM, Sodergren E, Wheeler DA, Worley KC, Jiang H, Weinstock GM, Gibbs RA, Graves T, Fulton R, Mardis ER, Wilson RK, Clamp M, Cuff J, Gnerre S, Jaffe DB, Chang JL, Lindblad-Toh K, Lander ES, Hinrichs A, Trumbower H, Clawson H, Zweig A, Kuhn RM, Barber G, Harte R, Karolchik D, Field MA, Moore RA, Matthewson CA, Schein JE, Marra MA, Antonarakis SE, Batzoglou S, Goldman N, Hardison R, Haussler D, Miller W, Pachter L, Green ED, Sidow A. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res 2007; 17:760-74. [PMID: 17567995 PMCID: PMC1891336 DOI: 10.1101/gr.6034307] [Citation(s) in RCA: 149] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023]
Abstract
A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.
Collapse
Affiliation(s)
- Elliott H Margulies
- Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|