1
Passow CN, Bronikowski AM, Blackmon H, Parsai S, Schwartz TS, McGaugh SE. Contrasting Patterns of Rapid Molecular Evolution within the p53 Network across Mammal and Sauropsid Lineages. Genome Biol Evol 2019; 11:629-643. [PMID: 30668691 PMCID: PMC6406535 DOI: 10.1093/gbe/evy273] [Accepted: 01/04/2019]
Abstract
Cancer is a threat to multicellular organisms, yet the molecular evolution of pathways that prevent the accumulation of genetic damage has been largely unexplored. The p53 network regulates how cells respond to DNA-damaging stressors. We know little about p53 network molecular evolution as a whole. In this study, we performed comparative genetic analyses of the p53 network to quantify the number of genes within the network that are rapidly evolving and constrained, and the association between lifespan and the patterns of evolution. Based on our previous published data set, we used genomes and transcriptomes of 34 sauropsids and 32 mammals to analyze the molecular evolution of 45 genes within the p53 network. We found that genes in the network exhibited evidence of positive selection and divergent molecular evolution in mammals and sauropsids. Specifically, we found more evidence of positive selection in sauropsids than mammals, indicating that sauropsids have different targets of selection. In sauropsids, more genes upstream in the network exhibited positive selection, and this observation is driven by positive selection in squamates, which is consistent with previous work showing rapid divergence and adaptation of metabolic and stress pathways in this group. Finally, we identified a negative correlation between maximum lifespan and the number of genes with evidence of divergent molecular evolution, indicating that species with longer lifespans likely experienced less variation in selection across the network. In summary, our study offers evidence that comparative genomic approaches can provide insights into how molecular networks have evolved across diverse species.
Affiliation(s)
- Courtney N Passow
  - Department of Ecology, Evolution, and Behavior, University of Minnesota
- Anne M Bronikowski
  - Department of Ecology, Evolution, and Organismal Biology, Iowa State University
- Heath Blackmon
  - Department of Ecology, Evolution, and Behavior, University of Minnesota
  - Department of Biology, Texas A&M University, College Station, TX
- Shikha Parsai
  - Department of Ecology, Evolution, and Organismal Biology, Iowa State University
- Tonia S Schwartz
  - Department of Ecology, Evolution, and Organismal Biology, Iowa State University
  - Department of Biological Sciences, Auburn University, Auburn, AL
- Suzanne E McGaugh
  - Department of Ecology, Evolution, and Behavior, University of Minnesota
2
Bogusz M, Whelan S. Phylogenetic Tree Estimation With and Without Alignment: New Distance Methods and Benchmarking. Syst Biol 2018; 66:218-231. [PMID: 27633353 DOI: 10.1093/sysbio/syw074] [Received: 02/03/2016] [Accepted: 08/23/2016]
Abstract
Phylogenetic tree inference is a critical component of many systematic and evolutionary studies. The majority of these studies are based on the two-step process of multiple sequence alignment followed by tree inference, despite persistent evidence that the alignment step can lead to biased results. Here we present a two-part study that first presents PaHMM-Tree, a novel neighbor joining-based method that estimates pairwise distances without assuming a single alignment. We then use simulations to benchmark its performance against a wide range of other phylogenetic tree inference methods, including the first comparison of alignment-free distance-based methods against more conventional tree estimation methods. Our new method for calculating pairwise distances based on statistical alignment provides distance estimates that are as accurate as those obtained using standard methods based on the true alignment. Pairwise distance estimates based on the two-step process tend to be substantially less accurate. This improved performance carries through to tree inference, where PaHMM-Tree provides more accurate tree estimates than all of the pairwise distance methods assessed. For close to moderately divergent sequence data we find that the two-step methods using statistical inference, where information from all sequences is included in the estimation procedure, tend to perform better than PaHMM-Tree, particularly full statistical alignment, which simultaneously estimates both the tree and the alignment. For deep divergences we find the alignment step becomes so prone to error that our distance-based PaHMM-Tree outperforms all other methods of tree inference. Finally, we find that the accuracy of alignment-free methods tends to decline faster than standard two-step methods in the presence of alignment uncertainty, and identify no conditions where alignment-free methods are equal to or more accurate than standard phylogenetic methods even in the presence of substantial alignment error. [Alignment-free; distance-based phylogenetics; pair Hidden Markov Models; phylogenetic inference; statistical alignment.]
Affiliation(s)
- Marcin Bogusz
  - Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Norbyvägen 18D, 752 36 Uppsala, Sweden
- Simon Whelan
  - Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Norbyvägen 18D, 752 36 Uppsala, Sweden
3
Gaya E, Redelings BD, Navarro-Rosinés P, Llimona X, De Cáceres M, Lutzoni F. Align or not to align? Resolving species complexes within the Caloplaca saxicola group as a case study. Mycologia 2017; 103:361-78. [DOI: 10.3852/10-120]
Affiliation(s)
- Ester Gaya
  - Department of Plant Biology (Botany Unit), Facultat de Biologia, Universitat de Barcelona, Av. Diagonal 645, 08028 Barcelona, Spain
- Xavier Llimona
  - Department of Plant Biology (Botany Unit), Facultat de Biologia, Universitat de Barcelona, Av. Diagonal 645, 08028 Barcelona, Spain
- Miquel De Cáceres
  - Biodiversity and Landscape Ecology Lab, Centre Tecnològic Forestal de Catalunya, Ctra. St. Llorenç de Morunys km 2, 25280 Solsona, Spain
- François Lutzoni
  - Department of Biology, Duke University, Durham, North Carolina 27708-0338
4
Liu H, Xie Z, Tan S, Zhang X, Yang S. Relationship between amino acid usage and amino acid evolution in primates. Gene 2015; 557:182-7. [DOI: 10.1016/j.gene.2014.12.033] [Received: 11/13/2013] [Revised: 12/04/2014] [Accepted: 12/14/2014]
5
McCandlish DM, Epstein CL, Plotkin JB. Formal properties of the probability of fixation: identities, inequalities and approximations. Theor Popul Biol 2014; 99:98-113. [PMID: 25450112 DOI: 10.1016/j.tpb.2014.11.004] [Received: 04/10/2014] [Revised: 11/03/2014] [Accepted: 11/11/2014]
Abstract
The formula for the probability of fixation of a new mutation is widely used in theoretical population genetics and molecular evolution. Here we derive a series of identities, inequalities and approximations for the exact probability of fixation of a new mutation under the Moran process (equivalent results hold for the approximate probability of fixation under the Wright-Fisher process, after an appropriate change of variables). We show that the logarithm of the fixation probability has particularly simple behavior when the selection coefficient is measured as a difference of Malthusian fitnesses, and we exploit this simplicity to derive inequalities and approximations. We also present a comprehensive comparison of both existing and new approximations for the fixation probability, highlighting those approximations that induce a reversible Markov chain when used to describe the dynamics of evolution under weak mutation. To demonstrate the power of these results, we consider the classical problem of determining the total substitution rate across an ensemble of biallelic loci and prove that, at equilibrium, a strict majority of substitutions are due to drift rather than selection.
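As a rough illustration of the quantity this abstract discusses, the sketch below evaluates the standard textbook closed form for the fixation probability of a single new mutant in a Moran process of size N, with the selection coefficient s taken as a difference of Malthusian fitnesses (relative fitness exp(s)). The function name and parameter names are assumptions made for this example and are not taken from the paper; the paper's own identities and approximations are not reproduced here.

```python
import math


def moran_fixation_probability(s: float, n: int) -> float:
    """Fixation probability of one new mutant in a Moran process of size n.

    s is a Malthusian selection coefficient, so the mutant's relative
    fitness is exp(s). Closed form: (1 - exp(-s)) / (1 - exp(-n * s)).
    """
    if n < 1:
        raise ValueError("population size must be at least 1")
    if abs(s) < 1e-12:
        # Neutral limit of the closed form.
        return 1.0 / n
    if s > 0:
        # expm1 keeps precision when s is small.
        return math.expm1(-s) / math.expm1(-n * s)
    # For s < 0, use the algebraically equivalent form
    # exp((n - 1) * s) * (exp(s) - 1) / (exp(n * s) - 1) to avoid overflow.
    return math.exp((n - 1) * s) * math.expm1(s) / math.expm1(n * s)


if __name__ == "__main__":
    for s in (-0.1, 0.0, 0.1):
        print(s, moran_fixation_probability(s, 100))
```

With s = 0 the function returns 1/N, the neutral expectation; beneficial mutants fix with higher probability and deleterious ones with lower probability.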
Affiliation(s)
- David M McCandlish
  - Department of Biology, University of Pennsylvania, Philadelphia, PA, United States
- Charles L Epstein
  - Department of Mathematics, University of Pennsylvania, Philadelphia, PA, United States
- Joshua B Plotkin
  - Department of Biology, University of Pennsylvania, Philadelphia, PA, United States
6
McCandlish DM, Stoltzfus A. Modeling evolution using the probability of fixation: history and implications. Q Rev Biol 2014; 89:225-52. [PMID: 25195318 DOI: 10.1086/677571]
Abstract
Many models of evolution calculate the rate of evolution by multiplying the rate at which new mutations originate within a population by a probability of fixation. Here we review the historical origins, contemporary applications, and evolutionary implications of these "origin-fixation" models, which are widely used in evolutionary genetics, molecular evolution, and phylogenetics. Origin-fixation models were first introduced in 1969, in association with an emerging view of "molecular" evolution. Early origin-fixation models were used to calculate an instantaneous rate of evolution across a large number of independently evolving loci; in the 1980s and 1990s, a second wave of origin-fixation models emerged to address a sequence of fixation events at a single locus. Although origin fixation models have been applied to a broad array of problems in contemporary evolutionary research, their rise in popularity has not been accompanied by an increased appreciation of their restrictive assumptions or their distinctive implications. We argue that origin-fixation models constitute a coherent theory of mutation-limited evolution that contrasts sharply with theories of evolution that rely on the presence of standing genetic variation. A major unsolved question in evolutionary biology is the degree to which these models provide an accurate approximation of evolution in natural populations.
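The product described in the first sentence of this abstract can be written out as a short worked equation. The symbols are assumptions chosen for illustration (a haploid population of size N, per-locus mutation rate \mu, selection coefficient s, fixation probability \pi) and are not taken from the review itself.

```latex
% Origin-fixation rate: (rate of origin of new mutations) x (probability of fixation).
% Assumed symbols: N = haploid population size, \mu = per-locus mutation rate,
% s = selection coefficient, \pi(s, N) = probability of fixation of a new mutation.
K \;=\; \underbrace{N\mu}_{\text{origin}} \;\times\; \underbrace{\pi(s, N)}_{\text{fixation}}

% Neutral special case: \pi(0, N) = 1/N, so the population size cancels.
K_{\text{neutral}} \;=\; N\mu \cdot \frac{1}{N} \;=\; \mu
```

The neutral case recovers the familiar result that the substitution rate equals the mutation rate, independent of population size.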
7
Whelan S, Allen JE, Blackburne BP, Talavera D. ModelOMatic: fast and automated model selection between RY, nucleotide, amino acid, and codon substitution models. Syst Biol 2014; 64:42-55. [PMID: 25209223 DOI: 10.1093/sysbio/syu062]
Abstract
Molecular phylogenetics is a powerful tool for inferring both the process and pattern of evolution from genomic sequence data. Statistical approaches, such as maximum likelihood and Bayesian inference, are now established as the preferred methods of inference. The choice of models that a researcher uses for inference is of critical importance, and there are established methods for model selection conditioned on a particular type of data, such as nucleotides, amino acids, or codons. A major limitation of existing model selection approaches is that they can only compare models acting upon a single type of data. Here, we extend model selection to allow comparisons between models describing different types of data by introducing the idea of adapter functions, which project aggregated models onto the originally observed sequence data. These projections are implemented in the program ModelOMatic and used to perform model selection on 3722 families from the PANDIT database, 68 genes from an arthropod phylogenomic data set, and 248 genes from a vertebrate phylogenomic data set. For the PANDIT and arthropod data, we find that amino acid models are selected for the overwhelming majority of alignments; with progressively smaller numbers of alignments selecting codon and nucleotide models, and no families selecting RY-based models. In contrast, nearly all alignments from the vertebrate data set select codon-based models. The sequence divergence, the number of sequences, and the degree of selection acting upon the protein sequences may contribute to explaining this variation in model selection. Our ModelOMatic program is fast, with most families from PANDIT taking fewer than 150 s to complete, and should therefore be easily incorporated into existing phylogenetic pipelines. ModelOMatic is available at https://code.google.com/p/modelomatic/.
Affiliation(s)
- Simon Whelan
  - Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala 75236, Sweden and Faculty of Life Sciences, University of Manchester, Manchester, UK
- James E Allen
  - Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala 75236, Sweden and Faculty of Life Sciences, University of Manchester, Manchester, UK
- Benjamin P Blackburne
  - Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala 75236, Sweden and Faculty of Life Sciences, University of Manchester, Manchester, UK
- David Talavera
  - Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala 75236, Sweden and Faculty of Life Sciences, University of Manchester, Manchester, UK
8
Allen JE, Whelan S. Assessing the state of substitution models describing noncoding RNA evolution. Genome Biol Evol 2014; 6:65-75. [PMID: 24391153 PMCID: PMC3914692 DOI: 10.1093/gbe/evt206]
Abstract
Phylogenetic inference is widely used to investigate the relationships between homologous sequences. RNA molecules have played a key role in these studies because they are present throughout life and tend to evolve slowly. Phylogenetic inference has been shown to be dependent on the substitution model used. A wide range of models have been developed to describe RNA evolution, either with 16 states describing all possible canonical base pairs or with 7 states where the 10 mismatched nucleotides are reduced to a single state. Formal model selection has become a standard practice for choosing an inferential model and works well for comparing models of a specific type, such as comparisons within nucleotide models or within amino acid models. Model selection cannot function across different sized state spaces because the likelihoods are conditioned on different data. Here, we introduce statistical state-space projection methods that allow the direct comparison of likelihoods between nucleotide models and 7-state and 16-state RNA models. To demonstrate the general applicability of our new methods, we extract 287 RNA families from genomic alignments and perform model selection. We find that in 281/287 families, RNA models are selected in preference to nucleotide models, with simple 7-state RNA models selected for more conserved families with shorter stems and more complex 16-state RNA models selected for more divergent families with longer stems. Other factors, such as the function of the RNA molecule or the GC-content, have limited impact on model selection. Our models and model selection methods are freely available in the open-source PHASE 3.0 software.
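The relationship between the two RNA state spaces mentioned in this abstract can be sketched in a few lines. The example assumes the common convention that the six Watson-Crick and G-U wobble pairs keep their own states and the ten remaining pairs collapse into one mismatch state; the names `pair_states_16`, `to_7_state`, and the label `"MM"` are hypothetical and are not taken from the PHASE software.

```python
from itertools import product

BASES = "ACGU"

# Assumption: Watson-Crick pairs plus G-U wobble pairs are the six paired states.
PAIRED = {"AU", "UA", "GC", "CG", "GU", "UG"}


def pair_states_16() -> list[str]:
    """All 16 ordered dinucleotide states for a pair of stem positions."""
    return ["".join(p) for p in product(BASES, repeat=2)]


def to_7_state(pair: str) -> str:
    """Project a 16-state pair onto a 7-state alphabet (6 pairs + 1 mismatch)."""
    return pair if pair in PAIRED else "MM"


if __name__ == "__main__":
    states = pair_states_16()
    reduced = sorted({to_7_state(p) for p in states})
    print(len(states), "states reduce to", len(reduced), ":", reduced)
```

Because several 16-state pairs map to the same 7-state symbol, likelihoods under the two alphabets are conditioned on different data, which is the comparison problem the abstract's projection methods address.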
Affiliation(s)
- James E Allen
  - Faculty of Life Sciences, University of Manchester, Manchester, United Kingdom
9
Wang HC, Susko E, Roger AJ. An amino acid substitution-selection model adjusts residue fitness to improve phylogenetic estimation. Mol Biol Evol 2014; 31:779-92. [PMID: 24441033 DOI: 10.1093/molbev/msu044]
Abstract
Standard protein phylogenetic models use fixed rate matrices of amino acid interchange derived from analyses of large databases. Differences between the stationary amino acid frequencies of these rate matrices from those of a data set of interest are typically adjusted for by matrix multiplication that converts the empirical rate matrix to an exchangeability matrix which is then postmultiplied by the amino acid frequencies in the alignment. The result is a time-reversible rate matrix with stationary amino acid frequencies equal to the data set frequencies. On the basis of population genetics principles, we develop an amino acid substitution-selection model that parameterizes the fitness of an amino acid as the logarithm of the ratio of the frequency of the amino acid to the frequency of the same amino acid under no selection. The model gives rise to a different sequence of matrix multiplications to convert an empirical rate matrix to one that has stationary amino acid frequencies equal to the data set frequencies. We incorporated the substitution-selection model with an improved amino acid class frequency mixture (cF) model to partially take into account site-specific amino acid frequencies in the phylogenetic models. We show that 1) the selection models fit data significantly better than corresponding models without selection for most of the 21 test data sets; 2) both cF and cF selection models favored the phylogenetic trees that were inferred under current sophisticated models and methods for three difficult phylogenetic problems (the positions of microsporidia and breviates in eukaryote phylogeny and the position of the root of the angiosperm tree); and 3) for data simulated under site-specific residue frequencies, the cF selection models estimated trees closer to the generating trees than a standard Γ model or cF without selection.
We also explored several ways of estimating amino acid frequencies under neutral evolution that are required for these selection models. By better modeling the amino acid substitution process, the cF selection models will be valuable for phylogenetic inference and evolutionary studies.
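The fitness parameterization stated in this abstract (the logarithm of an amino acid's frequency divided by its frequency under no selection) can be illustrated with a minimal sketch. The function name and the example frequencies are hypothetical; the sequence of matrix multiplications that the paper uses to build the rate matrix is not given in the abstract and is not reproduced here.

```python
import math


def residue_fitness(observed: dict[str, float], neutral: dict[str, float]) -> dict[str, float]:
    """Per-residue fitness as log(observed frequency / neutral frequency)."""
    fitness = {}
    for residue, freq in observed.items():
        baseline = neutral[residue]
        if freq <= 0 or baseline <= 0:
            raise ValueError(f"frequencies for {residue!r} must be positive")
        fitness[residue] = math.log(freq / baseline)
    return fitness


if __name__ == "__main__":
    # Hypothetical frequencies for two residues, for illustration only.
    observed = {"A": 0.2, "K": 0.05}
    neutral = {"A": 0.1, "K": 0.1}
    print(residue_fitness(observed, neutral))
```

A residue that is more frequent than its neutral expectation gets positive fitness, and one that is less frequent gets negative fitness.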
Affiliation(s)
- Huai-Chun Wang
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada
10
Rothfels CJ, Schuettpelz E. Accelerated Rate of Molecular Evolution for Vittarioid Ferns is Strong and Not Driven by Selection. Syst Biol 2013; 63:31-54. [DOI: 10.1093/sysbio/syt058]
Affiliation(s)
- Carl J. Rothfels
  - Department of Biology, Duke University, Box 90338, Durham, NC 27708, USA; Department of Zoology, University of British Columbia, #4200-6270 University Blvd., Vancouver, BC V6T 1Z4, Canada; Department of Biology and Marine Biology, University of North Carolina Wilmington, 601 South College Road, Wilmington, NC 28403, USA; and Department of Botany (MRC 166), National Museum of Natural History, Smithsonian Institution, PO Box 37012, Washington DC 20013-7012, USA
- Eric Schuettpelz
  - Department of Biology, Duke University, Box 90338, Durham, NC 27708, USA; Department of Zoology, University of British Columbia, #4200-6270 University Blvd., Vancouver, BC V6T 1Z4, Canada; Department of Biology and Marine Biology, University of North Carolina Wilmington, 601 South College Road, Wilmington, NC 28403, USA; and Department of Botany (MRC 166), National Museum of Natural History, Smithsonian Institution, PO Box 37012, Washington DC 20013-7012, USA
11
Gil M, Zanetti MS, Zoller S, Anisimova M. CodonPhyML: fast maximum likelihood phylogeny estimation under codon substitution models. Mol Biol Evol 2013; 30:1270-80. [PMID: 23436912 PMCID: PMC3649670 DOI: 10.1093/molbev/mst034]
Abstract
Markov models of codon substitution naturally incorporate the structure of the genetic code and the selection intensity at the protein level, providing a more realistic representation of protein-coding sequences compared with nucleotide or amino acid models. Thus, for protein-coding genes, phylogenetic inference is expected to be more accurate under codon models. So far, phylogeny reconstruction under codon models has been elusive due to computational difficulties of dealing with high dimension matrices. Here, we present a fast maximum likelihood (ML) package for phylogenetic inference, CodonPhyML offering hundreds of different codon models, the largest variety to date, for phylogeny inference by ML. CodonPhyML is tested on simulated and real data and is shown to offer excellent speed and convergence properties. In addition, CodonPhyML includes most recent fast methods for estimating phylogenetic branch supports and provides an integral framework for models selection, including amino acid and DNA models.
Affiliation(s)
- Manuel Gil
  - Department of Computer Science, Swiss Federal Institute of Technology, Zürich, Switzerland
12
Abstract
Background The PRDM9 locus in mammals has increasingly attracted research attention due to its role in mediating chromosomal recombination and possible involvement in hybrid sterility and hence speciation processes. The aim of this study was to characterize sequence variation at the PRDM9 locus in a sample of our closest living relatives, the chimpanzees and bonobos. Methodology/Principal Findings PRDM9 contains a highly variable and repetitive zinc finger array. We amplified this domain using long-range PCR and determined the DNA sequences using conventional Sanger sequencing. From 17 chimpanzees representing three subspecies and five bonobos we obtained a total of 12 alleles differing at the nucleotide level. Based on a data set consisting of our data and recently published Pan PRDM9 sequences, we found that at the subspecies level, diversity levels did not differ among chimpanzee subspecies or between chimpanzee subspecies and bonobos. In contrast, the sample of chimpanzees harbors significantly more diversity at PRDM9 than samples of humans. Pan PRDM9 shows signs of rapid evolution including no alleles or ZnFs in common with humans as well as signals of positive selection in the residues responsible for DNA binding. Conclusions and Significance The high number of alleles specific to the genus Pan, signs of positive selection in the DNA binding residues, and reported lack of conservation of recombination hotspots between chimpanzees and humans suggest that PRDM9 could be active in hotspot recruitment in the genus Pan. Chimpanzees and bonobos are considered separate species and do not have overlapping ranges in the wild, making the presence of shared alleles at the amino acid level between the chimpanzee and bonobo species interesting in view of the hypothesis that PRDM9 plays a universal role in interspecific hybrid sterility.
13
Context-Dependent Evolutionary Models for Non-Coding Sequences: An Overview of Several Decades of Research and an Analysis of Laurasiatheria and Primate Evolution. Evol Biol 2011. [DOI: 10.1007/s11692-011-9139-2]
14
Addressing inter-gene heterogeneity in maximum likelihood phylogenomic analysis: yeasts revisited. PLoS One 2011; 6:e22783. [PMID: 21850235 PMCID: PMC3151265 DOI: 10.1371/journal.pone.0022783] [Received: 02/09/2011] [Accepted: 07/05/2011]
Abstract
Phylogenomic approaches to the resolution of inter-species relationships have become well established in recent years. Often these involve concatenation of many orthologous genes found in the respective genomes followed by analysis using standard phylogenetic models. Genome-scale data promise increased resolution by minimising sampling error, yet are associated with well-known but often inappropriately addressed caveats arising through data heterogeneity and model violation. These can lead to the reconstruction of highly-supported but incorrect topologies. With the aim of obtaining a species tree for 18 species within the ascomycetous yeasts, we have investigated the use of appropriate evolutionary models to address inter-gene heterogeneities and the scalability and validity of supermatrix analysis as the phylogenetic problem becomes more difficult and the number of genes analysed approaches truly phylogenomic dimensions. We have extended a widely-known early phylogenomic study of yeasts by adding additional species to increase diversity and augmenting the number of genes under analysis. We have investigated sophisticated maximum likelihood analyses, considering not only a concatenated version of the data but also partitioned models where each gene constitutes a partition and parameters are free to vary between the different partitions (thereby accounting for variation in the evolutionary processes at different loci). We find considerable increases in likelihood using these complex models, arguing for the need for appropriate models when analyzing phylogenomic data. Using these methods, we were able to reconstruct a well-supported tree for 18 ascomycetous yeasts spanning about 250 million years of evolution.
15
Kosiol C, Goldman N. Markovian and non-Markovian protein sequence evolution: aggregated Markov process models. J Mol Biol 2011; 411:910-23. [PMID: 21718704 PMCID: PMC3157587 DOI: 10.1016/j.jmb.2011.06.005] [Received: 09/14/2010] [Revised: 05/28/2011] [Accepted: 06/03/2011]
Abstract
Over the years, there have been claims that evolution proceeds according to systematically different processes over different timescales and that protein evolution behaves in a non-Markovian manner. On the other hand, Markov models are fundamental to many applications in evolutionary studies. Apparent non-Markovian or time-dependent behavior has been attributed to influence of the genetic code at short timescales and dominance of physicochemical properties of the amino acids at long timescales. However, any long time period is simply the accumulation of many short time periods, and it remains unclear why evolution should appear to act systematically differently across the range of timescales studied. We show that the observed time-dependent behavior can be explained qualitatively by modeling protein sequence evolution as an aggregated Markov process (AMP): a time-homogeneous Markovian substitution model observed only at the level of the amino acids encoded by the protein-coding DNA sequence. The study of AMPs sheds new light on the relationship between amino acid-level and codon-level models of sequence evolution, and our results suggest that protein evolution should be modeled at the codon level rather than using amino acid substitution models.
Affiliation(s)
- Carolin Kosiol
  - Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, A-1210 Wien, Austria
16
Pan K, Long J, Sun H, Tobin GJ, Nara PL, Deem MW. Selective pressure to increase charge in immunodominant epitopes of the H3 hemagglutinin influenza protein. J Mol Evol 2010; 72:90-103. [PMID: 21086120 PMCID: PMC3033527 DOI: 10.1007/s00239-010-9405-4] [Received: 01/22/2010] [Accepted: 10/25/2010]
Abstract
The evolutionary speed and the consequent immune escape of H3N2 influenza A virus make it an interesting evolutionary system. Charged amino acid residues are often significant contributors to the free energy of binding for protein–protein interactions, including antibody–antigen binding and ligand–receptor binding. We used Markov chain theory and maximum likelihood estimation to model the evolution of the number of charged amino acids on the dominant epitope in the hemagglutinin protein of circulating H3N2 virus strains. The number of charged amino acids increased in the dominant epitope B of the H3N2 virus since introduction in humans in 1968. When epitope A became dominant in 1989, the number of charged amino acids increased in epitope A and decreased in epitope B. Interestingly, the number of charged residues in the dominant epitope of the dominant circulating strain is never fewer than that in the vaccine strain. We propose these results indicate selective pressure for charged amino acids that increase the affinity of the virus epitope for water and decrease the affinity for host antibodies. The standard PAM model of generic protein evolution is unable to capture these trends. The reduced alphabet Markov model (RAMM) we introduce captures the increased selective pressure for charged amino acids in the dominant epitope of hemagglutinin of H3N2 influenza (R² > 0.98 between 1968 and 1988). The RAMM model calibrated to historical H3N2 influenza virus evolution in humans fit well to the H3N2/Wyoming virus evolution data from guinea pig animal model studies.
Collapse
Affiliation(s)
- Keyao Pan
- Department of Bioengineering, Rice University, 6100 Main Street, Houston, TX 77005, USA
|
17
|
Baele G, Van de Peer Y, Vansteelandt S. Efficient context-dependent model building based on clustering posterior distributions for non-coding sequences. BMC Evol Biol 2009; 9:87. [PMID: 19405957 PMCID: PMC2695821 DOI: 10.1186/1471-2148-9-87] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 01/02/2009] [Accepted: 04/30/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Many recent studies that relax the assumption of independent evolution of sites have done so at the expense of a drastic increase in the number of substitution parameters. While additional parameters cannot be avoided to model context-dependent evolution, a large increase in model dimensionality is only justified when accompanied with careful model-building strategies that guard against overfitting. An increased dimensionality leads to increases in numerical computations of the models, increased convergence times in Bayesian Markov chain Monte Carlo algorithms and even more tedious Bayes Factor calculations. RESULTS We have developed two model-search algorithms which reduce the number of Bayes Factor calculations by clustering posterior densities to decide on the equality of substitution behavior in different contexts. The selected model's fit is evaluated using a Bayes Factor, which we calculate via model-switch thermodynamic integration. To reduce computation time and to increase the precision of this integration, we propose to split the calculations over different computers and to appropriately calibrate the individual runs. Using the proposed strategies, we find, in a dataset of primate Ancestral Repeats, that careful modeling of context-dependent evolution may increase model fit considerably and that the combination of a context-dependent model with the assumption of varying rates across sites offers even larger improvements in terms of model fit. Using a smaller nuclear SSU rRNA dataset, we show that context-dependence may only become detectable upon applying model-building strategies. CONCLUSION While context-dependent evolutionary models can increase the model fit over traditional independent evolutionary models, such complex models will often contain too many parameters. Justification for the added parameters is thus required so that only those parameters that model evolutionary processes previously unaccounted for are added to the evolutionary model. To obtain an optimal balance between the number of parameters in a context-dependent model and the performance in terms of model fit, we have designed two parameter-reduction strategies and we have shown that model fit can be greatly improved by reducing the number of parameters in a context-dependent evolutionary model.
Affiliation(s)
- Guy Baele
- Department of Applied Mathematics and Computer Science, Ghent University, Ghent, Belgium.
|
18
|
Abstract
In 1994, Muse and Gaut (MG) and Goldman and Yang (GY) proposed evolutionary models that recognize the coding structure of the nucleotide sequences under study, by defining a Markovian substitution process with a state space consisting of the 61 sense codons (assuming the universal genetic code). Several variations and extensions to their models have since been proposed, but no general and flexible framework for contrasting the relative performance of alternative approaches has yet been applied. Here, we compute Bayes factors to evaluate the relative merit of several MG and GY styles of codon substitution models, including recent extensions acknowledging heterogeneous nonsynonymous rates across sites, as well as selective effects inducing uneven amino acid or codon preferences. Our results on three real data sets support a logical model construction following the MG formulation, allowing for a flexible account of global amino acid or codon preferences, while maintaining distinct parameters governing overall nucleotide propensities. Through posterior predictive checks, we highlight the importance of such a parameterization. Altogether, the framework presented here suggests a broad modeling project in the MG style, stressing the importance of combining and contrasting available model formulations and grounding developments in a sound probabilistic paradigm.
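To make the state space described in this abstract concrete, the following minimal sketch enumerates the 61 sense codons of the universal genetic code and computes one off-diagonal substitution rate. The function name `codon_rate`, the parameters `kappa` (transition/transversion ratio) and `omega` (nonsynonymous/synonymous ratio), and the single-`omega` simplification are assumptions made for illustration; they are not the specific parameterizations compared in the study. Passing the frequency of the target codon as `target_freq` gives a GY-style rate, while passing the frequency of the target nucleotide gives an MG-style rate.

```python
from itertools import product

# Universal genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# The Markov state space: all codons except the three stop codons.
SENSE_CODONS = [codon for codon, aa in GENETIC_CODE.items() if aa != "*"]

PURINES = {"A", "G"}


def codon_rate(source, target, kappa, omega, target_freq):
    """Illustrative off-diagonal rate between two sense codons.

    Changes at more than one codon position get rate zero.
    target_freq is the equilibrium frequency of the target codon
    (GY style) or of the target nucleotide (MG style).
    """
    diffs = [(a, b) for a, b in zip(source, target) if a != b]
    if len(diffs) != 1:
        return 0.0
    old, new = diffs[0]
    rate = target_freq
    # A transition keeps the base within purines or within pyrimidines.
    if (old in PURINES) == (new in PURINES):
        rate *= kappa
    # A nonsynonymous change alters the encoded amino acid.
    if GENETIC_CODE[source] != GENETIC_CODE[target]:
        rate *= omega
    return rate


if __name__ == "__main__":
    print(len(SENSE_CODONS))
    print(codon_rate("TTT", "TTC", kappa=2.0, omega=0.5, target_freq=1.0))
```

The two formulations differ only in what `target_freq` stands for, which is why the sketch leaves that choice to the caller.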
|
19
|
Abstract
Molecular phylogenetics examines how biological sequences evolve and the historical relationships between them. An important aspect of many such studies is the estimation of a phylogenetic tree, which explicitly describes evolutionary relationships between the sequences. This chapter provides an introduction to evolutionary trees and some commonly used inferential methodology, focusing on the assumptions made and how they affect an analysis. Detailed discussion is also provided about some common algorithms used for phylogenetic tree estimation. Finally, there are a few practical guidelines, including how to combine multiple software packages to improve inference, and a comparison between Bayesian and maximum likelihood phylogenetics.
|
20
|
Incorporating indel information into phylogeny estimation for rapidly emerging pathogens. BMC Evol Biol 2007; 7:40. [PMID: 17359539 PMCID: PMC1853084 DOI: 10.1186/1471-2148-7-40] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.2] [Received: 08/28/2006] [Accepted: 03/14/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Phylogenies of rapidly evolving pathogens can be difficult to resolve because of the small number of substitutions that accumulate in the short times since divergence. To improve resolution of such phylogenies we propose using insertion and deletion (indel) information in addition to substitution information. We accomplish this through joint estimation of alignment and phylogeny in a Bayesian framework, drawing inference using Markov chain Monte Carlo. Joint estimation of alignment and phylogeny sidesteps biases that stem from conditioning on a single alignment by taking into account the ensemble of near-optimal alignments. RESULTS We introduce a novel Markov chain transition kernel that improves computational efficiency by proposing non-local topology rearrangements and by block sampling alignment and topology parameters. In addition, we extend our previous indel model to increase biological realism by placing indels preferentially on longer branches. We demonstrate the ability of indel information to increase phylogenetic resolution in examples drawn from within-host viral sequence samples. We also demonstrate the importance of taking alignment uncertainty into account when using such information. Finally, we show that codon-based substitution models can significantly affect alignment quality and phylogenetic inference by unrealistically forcing indels to begin and end between codons. CONCLUSION These results indicate that indel information can improve phylogenetic resolution of recently diverged pathogens and that alignment uncertainty should be considered in such analyses.
|
21
|
Arvestad L. Efficient Methods for Estimating Amino Acid Replacement Rates. J Mol Evol 2006; 62:663-73. [PMID: 16752207 DOI: 10.1007/s00239-004-0113-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 04/11/2004] [Accepted: 01/17/2006] [Indexed: 11/30/2022]
Abstract
Replacement rate matrices describe the process of evolution at one position in a protein and are used in many applications where proteins are studied with an evolutionary perspective. Several general matrices have been suggested and have proved to be good approximations of the real process. However, there are data for which general matrices are inappropriate, for example, special protein families, certain lineages in the tree of life, or particular parts of proteins. Analysis of such data could benefit from adaption of a data-specific rate matrix. This paper suggests two new methods for estimating replacement rate matrices from independent pairwise protein sequence alignments and also carefully studies Müller-Vingron's resolvent method. Comprehensive tests on synthetic datasets show that both new methods perform better than the resolvent method in a variety of settings. The best method is furthermore demonstrated to be robust on small datasets as well as practical on very large datasets of real data. Neither short nor divergent sequence pairs have to be discarded, making the method economical with data. A generalization to multialignment data is suggested and used in a test on protein-domain family phylogenies, where it is shown that the method offers family-specific rate matrices that often have a significantly better likelihood than a general matrix.
Affiliation(s)
- Lars Arvestad
- Stockholm Bioinformatics Center, Albanova University Center, Royal Institute of Technology (KTH), SE-100 44, Stockholm, Sweden.
|
22
|
Whelan S, de Bakker PIW, Quevillon E, Rodriguez N, Goldman N. PANDIT: an evolution-centric database of protein and associated nucleotide domains with inferred trees. Nucleic Acids Res 2006; 34:D327-31. [PMID: 16381879 PMCID: PMC1347450 DOI: 10.1093/nar/gkj087] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.9] [Indexed: 11/22/2022] Open
Abstract
PANDIT is a database of homologous sequence alignments accompanied by estimates of their corresponding phylogenetic trees. It provides a valuable resource to those studying phylogenetic methodology and the evolution of coding-DNA and protein sequences. Currently in version 17.0, PANDIT comprises 7738 families of homologous protein domains; for each family, DNA and corresponding amino acid sequence multiple alignments are available together with high quality phylogenetic tree estimates. Recent improvements include expanded methods for phylogenetic tree inference, assessment of alignment quality and a redesigned web interface, available at the URL .
Affiliation(s)
- Simon Whelan
- EMBL-European Bioinformatics Institute Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
|
23
|
Kosiol C, Bofkin L, Whelan S. Phylogenetics by likelihood: Evolutionary modeling as a tool for understanding the genome. J Biomed Inform 2006; 39:51-61. [PMID: 16226061 DOI: 10.1016/j.jbi.2005.08.003] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Received: 02/28/2005] [Revised: 07/27/2005] [Accepted: 08/08/2005] [Indexed: 11/18/2022]
Abstract
Molecular evolutionary studies provide a means of investigating how cells function and how organisms adapt to their environment. The products of evolutionary studies provide medically important insights to the source of major diseases, such as HIV, and hold the key to understand the developing immunity of pathogenic bacteria to antibiotics. They have also helped mankind understand its place in nature, casting light on the selective forces and environmental conditions that resulted in modern humans. The use of likelihood as a framework for statistical modeling in phylogenetics has played a fundamental role in studying molecular evolution, enabling rigorous and robust conclusions to be drawn from sequence data. The first half of this article is a general introduction to the likelihood method for inferring phylogenies, the properties of the models used, and how it can be used for statistical testing. The latter half of the article focuses on the emerging new generation of phylogenetic models that describe heterogeneity in the evolutionary process along sequences, including the recoding of protein coding sequence data to amino acids and codons, and various approaches for describing dependencies between sites in a sequence. We conclude with a detailed case study examining how modern modeling approaches have been successfully employed to identify adaptive evolution in proteins.
Affiliation(s)
- Carolin Kosiol
- EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
24
|
Olsen R, Loomis WF. A collection of amino acid replacement matrices derived from clusters of orthologs. J Mol Evol 2005; 61:659-65. [PMID: 16245010 DOI: 10.1007/s00239-005-0060-0] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 03/16/2005] [Accepted: 06/04/2005] [Indexed: 12/01/2022]
Abstract
Sequence divergence among orthologous proteins was characterized with 34 amino acid replacement matrices, sequence context analysis, and a phylogenetic tree. The model was trained on very large datasets of aligned protein sequences drawn from 15 organisms including protists, plants, Dictyostelium, fungi, and animals. Comparative tests with models currently used in phylogeny, i.e., with JTT+gamma+/-F and WAG+gamma+/-F, made on a test dataset of 380 multiple alignments containing protein sequences from all five of the major taxonomic groups mentioned, indicate that our model should be preferred over the JTT+gamma+/-F and WAG+gamma+/-F models on datasets similar to the test dataset. The strong performance of our model of orthologous protein sequence divergence can be attributed to its ability to better approximate amino acid equilibrium frequencies to compositions found in alignment columns.
Affiliation(s)
- Rolf Olsen
- Department of Physics, University of California at San Diego, La Jolla, CA 92093, USA
|
25
|
McDonald JH. Apparent trends of amino acid gain and loss in protein evolution due to nearly neutral variation. Mol Biol Evol 2005; 23:240-4. [PMID: 16195487 DOI: 10.1093/molbev/msj026] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Indexed: 11/15/2022] Open
Abstract
It has recently been claimed that certain amino acids have been increasing in frequency in all living organisms for most of the history of life on earth, while other amino acids have been decreasing in frequency. Three lines of evidence have been offered for this assertion, but each has a more plausible alternative interpretation. Here I show that unequal patterns of gains and losses for particular pairs of amino acids (such as more leucine --> phenylalanine than phenylalanine --> leucine substitutions in humans and chimpanzees since they split from a common ancestor) are consistent with a simple neutral model at equilibrium amino acid frequencies. Unequal numbers of gains and losses for particular amino acids (such as more gains than losses of cysteine) are shown by simulations to be consistent with a model of nearly neutral evolution. Unequal numbers of gains and losses for particular amino acids in human polymorphism data are shown by simulations to be explainable by the nearly neutral model as well. In a comparison of protein sequences from four strains of Escherichia coli, polarized by one outgroup strain of Salmonella, the disparity in number of gains and losses for particular amino acids is strong in terminal branches but weaker or nonexistent in internal branches, which is inconsistent with the universal trend model but as expected under the nearly neutral model.
Affiliation(s)
- John H McDonald
- Department of Biological Sciences, University of Delaware, USA.
|
26
|
Atchley WR, Zhao J, Fernandes AD, Drüke T. Solving the protein sequence metric problem. Proc Natl Acad Sci U S A 2005; 102:6395-400. [PMID: 15851683 PMCID: PMC1088356 DOI: 10.1073/pnas.0408677102] [Citation(s) in RCA: 298] [Impact Index Per Article: 15.7] [Received: 12/14/2004] [Indexed: 11/18/2022] Open
Abstract
Biological sequences are composed of long strings of alphabetic letters rather than arrays of numerical values. Lack of a natural underlying metric for comparing such alphabetic data significantly inhibits sophisticated statistical analyses of sequences, modeling structural and functional aspects of proteins, and related problems. Herein, we use multivariate statistical analyses on almost 500 amino acid attributes to produce a small set of highly interpretable numeric patterns of amino acid variability. These high-dimensional attribute data are summarized by five multidimensional patterns of attribute covariation that reflect polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. Numerical scores for each amino acid then transform amino acid sequences for statistical analyses. Relationships between transformed data and amino acid substitution matrices show significant associations for polarity and codon diversity scores. Transformed alphabetic data are used in analysis of variance and discriminant analysis to study DNA binding in the basic helix-loop-helix proteins. The transformed scores offer a general solution for analyzing a wide variety of sequence analysis problems.
Affiliation(s)
- William R Atchley
- Department of Genetics, Graduate Program in Biomathematics, and Center for Computational Biology, North Carolina State University, Raleigh, NC 27695-7614, USA.
|
27
|
Rivas E. Evolutionary models for insertions and deletions in a probabilistic modeling framework. BMC Bioinformatics 2005; 6:63. [PMID: 15780137 PMCID: PMC1087829 DOI: 10.1186/1471-2105-6-63] [Citation(s) in RCA: 51] [Impact Index Per Article: 2.7] [Received: 12/16/2004] [Accepted: 03/21/2005] [Indexed: 11/15/2022] Open
Abstract
BACKGROUND Probabilistic models for sequence comparison (such as hidden Markov models and pair hidden Markov models for proteins and mRNAs, or their context-free grammar counterparts for structural RNAs) often assume a fixed degree of divergence. Ideally we would like these models to be conditional on evolutionary divergence time. Probabilistic models of substitution events are well established, but there has not been a completely satisfactory theoretical framework for modeling insertion and deletion events. RESULTS I have developed a method for extending standard Markov substitution models to include gap characters, and another method for the evolution of state transition probabilities in a probabilistic model. These methods use instantaneous rate matrices in a way that is more general than those used for substitution processes, and are sufficient to provide time-dependent models for standard linear and affine gap penalties, respectively. Given a probabilistic model, we can make all of its emission probabilities (including gap characters) and all its transition probabilities conditional on a chosen divergence time. To do this, we only need to know the parameters of the model at one particular divergence time instance, as well as the parameters of the model at the two extremes of zero and infinite divergence. I have implemented these methods in a new generation of the RNA genefinder QRNA (eQRNA). CONCLUSION These methods can be applied to incorporate evolutionary models of insertions and deletions into any hidden Markov model or stochastic context-free grammar, in a pair or profile form, for sequence modeling.
MESH Headings
- Algorithms
- Biological Evolution
- Computational Biology/methods
- Computer Simulation
- Evolution, Molecular
- Gene Deletion
- Likelihood Functions
- Markov Chains
- Models, Biological
- Models, Genetic
- Models, Statistical
- Models, Theoretical
- Molecular Sequence Data
- Mutation
- Probability
- Protein Conformation
- Protein Structure, Tertiary
- RNA/chemistry
- RNA, Messenger/metabolism
- Sequence Analysis, DNA
- Sequence Analysis, Protein
- Sequence Analysis, RNA
- Software
- Time Factors
Affiliation(s)
- Elena Rivas
- Department of Genetics, Washington University School of Medicine, 4444 Forest Park Blvd., Saint Louis, Missouri 63108, USA.
|
28
|
Knudsen B, Miyamoto MM. Using equilibrium frequencies in models of sequence evolution. BMC Evol Biol 2005; 5:21. [PMID: 15743518 PMCID: PMC554786 DOI: 10.1186/1471-2148-5-21] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 07/19/2004] [Accepted: 03/02/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The f factor is a new parameter for accommodating the influence of both the starting and ending states in the rate matrices of "generalized weighted frequencies" (+gwF) models for sequence evolution. In this study, we derive an expected value for f, starting from a nearly neutral model of weak selection, and then assess the biological interpretation of this factor with evolutionary simulations. RESULTS An expected value of f = 0.5 (i.e., equal dependency on the starting and ending states) is derived for sequences that are evolving under the nearly neutral model of this study. However, this expectation is sensitive to violations of its underlying assumptions as illustrated with the evolutionary simulations. CONCLUSION This study illustrates how selection, drift, and mutation at the population level can be linked to the rate matrices of models for sequence evolution to derive an expected value of f. However, as f is affected by a number of factors that limit its biological interpretation, this factor should normally be estimated as a free parameter rather than fixed a priori in a +gwF analysis.
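The role of f in a rate matrix can be sketched as follows. The form used here, q_ij = s_ij · π_j^(1−f) / π_i^f, is an assumed convention chosen for illustration (the abstract does not state the formula), and the names `gwf_rate_matrix`, `exchange`, and `freqs` are hypothetical. Under this convention f = 0 makes each rate depend only on the frequency of the ending state, and f = 0.5 weights the starting and ending states equally.

```python
def gwf_rate_matrix(exchange, freqs, f):
    """Build a rate matrix with q_ij = s_ij * pi_j**(1 - f) / pi_i**f.

    exchange: symmetric matrix of exchangeabilities s_ij (list of lists).
    freqs:    equilibrium frequencies pi (positive, summing to 1).
    f:        weight between starting and ending states, 0 <= f <= 1.
    NOTE: this parameterization is an assumption made for illustration.
    """
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchange[i][j] * freqs[j] ** (1 - f) / freqs[i] ** f
        # The diagonal makes each row sum to zero (it is still 0.0 here,
        # so the sum covers only the off-diagonal entries).
        q[i][i] = -sum(q[i])
    return q


if __name__ == "__main__":
    s = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
    pi = [0.1, 0.2, 0.3, 0.4]
    q = gwf_rate_matrix(s, pi, f=0.5)
    # Detailed balance: pi_i * q_ij equals pi_j * q_ji for every pair.
    print(all(abs(pi[i] * q[i][j] - pi[j] * q[j][i]) < 1e-12
              for i in range(4) for j in range(4)))
```

Because π_i · q_ij = s_ij · (π_i π_j)^(1−f) is symmetric in i and j, the process is reversible with stationary distribution π for any f in this form, so f changes the rates without changing the equilibrium frequencies.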
Affiliation(s)
- Bjarne Knudsen
- Department of Zoology, Box 118525, University of Florida, Gainesville, FL 32611-8525, USA
- Michael M Miyamoto
- Department of Zoology, Box 118525, University of Florida, Gainesville, FL 32611-8525, USA
|
29
|
Jordan IK, Kondrashov FA, Adzhubei IA, Wolf YI, Koonin EV, Kondrashov AS, Sunyaev S. A universal trend of amino acid gain and loss in protein evolution. Nature 2005; 433:633-8. [PMID: 15660107 DOI: 10.1038/nature03306] [Citation(s) in RCA: 233] [Impact Index Per Article: 12.3] [Received: 09/14/2004] [Accepted: 12/15/2004] [Indexed: 11/08/2022]
Abstract
Amino acid composition of proteins varies substantially between taxa and, thus, can evolve. For example, proteins from organisms with (G + C)-rich (or (A + T)-rich) genomes contain more (or fewer) amino acids encoded by (G + C)-rich codons. However, no universal trends in ongoing changes of amino acid frequencies have been reported. We compared sets of orthologous proteins encoded by triplets of closely related genomes from 15 taxa representing all three domains of life (Bacteria, Archaea and Eukaryota), and used phylogenies to polarize amino acid substitutions. Cys, Met, His, Ser and Phe accrue in at least 14 taxa, whereas Pro, Ala, Glu and Gly are consistently lost. The same nine amino acids are currently accrued or lost in human proteins, as shown by analysis of non-synonymous single-nucleotide polymorphisms. All amino acids with declining frequencies are thought to be among the first incorporated into the genetic code; conversely, all amino acids with increasing frequencies, except Ser, were probably recruited late. Thus, expansion of initially under-represented amino acids, which began over 3,400 million years ago, apparently continues to this day.
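The idea of polarizing substitutions with a genome triplet can be sketched in a few lines. This is a simplified parsimony-style illustration, not the authors' pipeline: it assumes three pre-aligned protein sequences of equal length (two sister taxa and one outgroup), and the function name `polarize` and the gap character `-` are assumptions. At a site where the sisters differ and the outgroup matches one of them, the outgroup state is taken as ancestral, so the other sister's residue counts as a gain and the ancestral residue as a loss.

```python
from collections import Counter


def polarize(sister_a, sister_b, outgroup):
    """Count amino acid gains and losses at sites where two sisters differ.

    A site is informative only when the outgroup residue matches exactly
    one of the two sister residues; other sites are skipped.
    """
    gains, losses = Counter(), Counter()
    for a, b, o in zip(sister_a, sister_b, outgroup):
        if a == b or "-" in (a, b, o):
            continue  # no difference between sisters, or an alignment gap
        if o == a:
            ancestral, derived = a, b
        elif o == b:
            ancestral, derived = b, a
        else:
            continue  # outgroup matches neither sister: direction unknown
        gains[derived] += 1
        losses[ancestral] += 1
    return gains, losses


if __name__ == "__main__":
    gains, losses = polarize("ACLP", "ACFS", "ACLG")
    print(dict(gains), dict(losses))
```

In the toy call, the third column (L, F, outgroup L) is scored as a gain of F and a loss of L, while the fourth column (P, S, outgroup G) is skipped because the direction of change cannot be inferred.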
Affiliation(s)
- I King Jordan
- National Center for Biotechnology Information, NIH, Bethesda, Maryland 20894, USA
|
30
|
Chang BSW, Ugalde JA, Matz MV. Applications of Ancestral Protein Reconstruction in Understanding Protein Function: GFP-Like Proteins. Methods Enzymol 2005; 395:652-70. [PMID: 15865989 DOI: 10.1016/s0076-6879(05)95034-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Indexed: 02/09/2023]
Abstract
Recreating ancestral proteins in the laboratory increasingly is being used to study the evolutionary history of protein function. More efficient gene synthesis techniques and the decreasing costs of commercial oligosynthesis are making this approach both simpler and less expensive to perform. Developments in ancestral reconstruction methods, particularly more realistic likelihood models of molecular evolution, allow for the accurate reconstruction of more ancient proteins than previously possible. This chapter reviews phylogenetic methods of ancestral inference, strategies for investigating alternative reconstructions, gene synthesis, and design, and an application of these methods to the reconstruction of an ancestor in the green fluorescent protein family.
Affiliation(s)
- Belinda S W Chang
- Department of Zoology, University of Toronto, Toronto, Ontario M5S 3G5, Canada
|
31
|
Lartillot N, Philippe H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol 2004; 21:1095-109. [PMID: 15014145 DOI: 10.1093/molbev/msh112] [Citation(s) in RCA: 1020] [Impact Index Per Article: 51.0] [Indexed: 11/13/2022] Open
Abstract
Most current models of sequence evolution assume that all sites of a protein evolve under the same substitution process, characterized by a 20 x 20 substitution matrix. Here, we propose to relax this assumption by developing a Bayesian mixture model that allows the amino-acid replacement pattern at different sites of a protein alignment to be described by distinct substitution processes. Our model, named CAT, assumes the existence of distinct processes (or classes) differing by their equilibrium frequencies over the 20 residues. Through the use of a Dirichlet process prior, the total number of classes and their respective amino-acid profiles, as well as the affiliations of each site to a given class, are all free variables of the model. In this way, the CAT model is able to adapt to the complexity actually present in the data, and it yields an estimate of the substitutional heterogeneity through the posterior mean number of classes. We show that a significant level of heterogeneity is present in the substitution patterns of proteins, and that the standard one-matrix model fails to account for this heterogeneity. By evaluating the Bayes factor, we demonstrate that the standard model is outperformed by CAT on all of the data sets which we analyzed. Altogether, these results suggest that the complexity of the pattern of substitution of real sequences is better captured by the CAT model, offering the possibility of studying its impact on phylogenetic reconstruction and its connections with structure-function determinants.
Affiliation(s)
- Nicolas Lartillot
- Canadian Institute for Advanced Research, Département de Biochimie, Université de Montréal, Montréal, Québec Canada.
|
32
|
Katsu Y, Bermudez DS, Braun EL, Helbing C, Miyagawa S, Gunderson MP, Kohno S, Bryan TA, Guillette LJ, Iguchi T. Molecular cloning of the estrogen and progesterone receptors of the American alligator. Gen Comp Endocrinol 2004; 136:122-33. [PMID: 14980803 DOI: 10.1016/j.ygcen.2003.11.008] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.3] [Received: 08/11/2003] [Revised: 11/07/2003] [Accepted: 11/12/2003] [Indexed: 11/29/2022]
Abstract
Steroid hormones perform many essential roles in vertebrates during embryonic development, reproduction, growth, water balance, and responses to stress. The estrogens are essential for normal reproductive activity in female and male vertebrates and appear to have direct actions during sex determination in some vertebrates. To begin to understand the molecular mechanisms of estrogen action in alligators, we have isolated cDNAs encoding the estrogen receptors (ER) from the ovary. Degenerate PCR primers specific to ER were designed and used to amplify alligator ovary RNA. Two different DNA fragments (ERalpha and ERbeta) were obtained and the full-length alligator ERalpha cDNA was obtained using 5' and 3' RACE. The inferred amino acid sequence of alligator ERalpha (aERalpha) was very similar to the chicken ERalpha (91% identity), although phylogenetic analyses suggested profound differences in the rate of sequence evolution for vertebrate ER sequences. We also isolated partial DNA fragments encoding ERbeta and the progesterone receptor (PR) of the alligator, both of which show strong sequence similarities to avian ERbeta and PR. We examined the expression levels of these three steroid receptors (ERalpha, ERbeta, and PR) in the ovary of juvenile alligators and observed detectable levels of all three receptors. Quantitative RT-PCR showed that gonadal ERalpha transcript levels in juvenile alligators decreased after E2 treatment whereas ERbeta and PR transcripts were not changed. These results provide tools that will allow future studies examining the regulation and ontogenic expression of steroid receptors in alligators and expand our knowledge of vertebrate steroid receptor evolution.
Affiliation(s)
- Yoshinao Katsu
- Center for Integrative Bioscience, National Institute for Basic Biology, Okazaki National Research Institutes, Higashiyama, Myodaiji, Okazaki, Japan
|