1
Treasurer's Report for Financial Year 2022. Mol Biol Evol 2024; 41:msad281. [PMID: 38174624 PMCID: PMC10786193 DOI: 10.1093/molbev/msad281] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2023] [Accepted: 12/21/2023] [Indexed: 01/05/2024] Open
2
Scalable Bayesian Divergence Time Estimation With Ratio Transformations. Syst Biol 2023; 72:1136-1153. [PMID: 37458991 PMCID: PMC10636426 DOI: 10.1093/sysbio/syad039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2022] [Revised: 06/13/2023] [Accepted: 06/30/2023] [Indexed: 11/08/2023] Open
Abstract
Divergence time estimation is crucial to provide temporal signals for dating biologically important events from species divergence to viral transmissions in space and time. With the advent of high-throughput sequencing, recent Bayesian phylogenetic studies have analyzed hundreds to thousands of sequences. Such large-scale analyses challenge divergence time reconstruction by requiring inference on highly correlated internal node heights that often become computationally infeasible. To overcome this limitation, we explore a ratio transformation that maps the original $N-1$ internal node heights into a space of one height parameter and $N-2$ ratio parameters. To make the analyses scalable, we develop a collection of linear-time algorithms to compute the gradient and Jacobian-associated terms of the log-likelihood with respect to these ratios. We then apply Hamiltonian Monte Carlo sampling with the ratio transform in a Bayesian framework to learn the divergence times in 4 pathogenic viruses (West Nile virus, rabies virus, Lassa virus, and Ebola virus) and the coralline red algae. Our method both resolves a mixing issue in the West Nile virus example and improves inference efficiency by at least 5-fold for the Lassa and rabies virus examples as well as for the algae example. Our method now also makes it computationally feasible to incorporate mixed-effects molecular clock models for the Ebola virus example, confirms the findings from the original study, and reveals clearer multimodal distributions of the divergence times of some clades of interest.
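As a rough illustration of the ratio transformation described in this abstract, the Python sketch below maps internal node heights to one root height plus ratio parameters and back. It assumes that all tips sit at height 0, so each ratio is simply a node's height divided by its parent's height; the paper's handling of tips sampled at different times and its linear-time gradient algorithms are not reproduced here. The function names and the `parent` dictionary representation are hypothetical.

```python
import math

def heights_to_ratios(parent, heights, root):
    """Map internal node heights to (root height, ratios).

    Assumption: all tips are at height 0, so ratio = height / parent height.
    `parent` maps each non-root internal node to its parent internal node.
    """
    ratios = {
        node: h / heights[parent[node]]
        for node, h in heights.items()
        if node != root
    }
    return heights[root], ratios

def ratios_to_heights(parent, root_height, ratios, root):
    """Invert the transform by multiplying ratios down from the root."""
    heights = {root: root_height}

    def height_of(node):
        # Resolve the parent's height first, then scale by this node's ratio.
        if node not in heights:
            heights[node] = ratios[node] * height_of(parent[node])
        return heights[node]

    for node in ratios:
        height_of(node)
    return heights

def log_jacobian(parent, heights, root):
    """Log |det| of d(heights)/d(root height, ratios) under the same assumption.

    Each non-root height equals ratio * parent height, so in root-to-tip order
    the Jacobian is triangular with the parent heights on its diagonal.
    """
    return sum(math.log(heights[parent[n]]) for n in heights if n != root)

# Toy three-node chain of internal nodes: A is the root, B below A, C below B.
parent = {"B": "A", "C": "B"}
heights = {"A": 10.0, "B": 5.0, "C": 2.0}
root_height, ratios = heights_to_ratios(parent, heights, "A")
recovered = ratios_to_heights(parent, root_height, ratios, "A")
```

With $N-1$ internal nodes, the result is one height parameter and $N-2$ ratios, each constrained to lie between 0 and 1 instead of being bounded by neighbouring node heights.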
3
Interlocus Gene Conversion, Natural Selection, and Paralog Homogenization. Mol Biol Evol 2023; 40:msad198. [PMID: 37675606 PMCID: PMC10503786 DOI: 10.1093/molbev/msad198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Revised: 08/07/2023] [Accepted: 09/05/2023] [Indexed: 09/08/2023] Open
Abstract
Following a duplication, the resulting paralogs tend to diverge. While mutation and natural selection can accelerate this process, they can also slow it. Here, we quantify the paralog homogenization that is caused by point mutations and interlocus gene conversion (IGC). Among 164 duplicated teleost genes, the median percentage of postduplication codon substitutions that arise from IGC rather than point mutation is estimated to be between 7% and 8%. By differentiating between the nonsynonymous codon substitutions that homogenize the protein sequences of paralogs and the nonhomogenizing nonsynonymous substitutions, we estimate the homogenizing nonsynonymous rates to be higher for 163 of the 164 teleost data sets as well as for all 14 data sets of duplicated yeast ribosomal protein-coding genes that we consider. For all 14 yeast data sets, the estimated homogenizing nonsynonymous rates exceed the synonymous rates.
4
Treasurer's Report for Financial Year (FY) 2021. Mol Biol Evol 2022; 39:6872745. [PMID: 36468441 PMCID: PMC9720543 DOI: 10.1093/molbev/msac244] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022] Open
5
Correlations between alignment gaps and nucleotide substitution or amino acid replacement. Proc Natl Acad Sci U S A 2022; 119:e2204435119. [PMID: 35972964 PMCID: PMC9407537 DOI: 10.1073/pnas.2204435119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2022] [Accepted: 07/11/2022] [Indexed: 11/18/2022] Open
Abstract
To assess the conventional treatment in evolutionary inference of alignment gaps as missing data, we propose a simple nonparametric test of the null hypothesis that the locations of alignment gaps are independent of the nucleotide substitution or amino acid replacement process. When we apply the test to 1,390 protein alignments that are informed by protein tertiary structure and use a 5% significance level, the null hypothesis of independence between amino acid replacement and gap location is rejected for ∼65% of datasets. Via simulations that include substitution and insertion-deletion, we show that the test performs well with true alignments. When we simulate according to the null hypothesis and then apply the test to optimal alignments that are inferred by each of four widely used software packages, the null hypothesis is rejected too frequently. Via further simulations and analyses, we show that the overly frequent rejections of the null hypothesis are not solely due to weaknesses of widely used software for finding optimal alignments. Instead, our evidence suggests that optimal alignments are unrepresentative of true alignments and that biased evolutionary inferences may result from relying upon individual optimal alignments.
6
Convergent evolution of polyploid genomes from across the eukaryotic tree of life. G3 GENES|GENOMES|GENETICS 2022; 12:6572348. [PMID: 35451464 PMCID: PMC9157103 DOI: 10.1093/g3journal/jkac094] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 01/05/2022] [Accepted: 04/15/2022] [Indexed: 11/14/2022]
Abstract
By modeling the homoeologous gene losses that occurred in 50 genomes deriving from ten distinct polyploidy events, we show that the evolutionary forces acting on polyploids are remarkably similar, regardless of whether they occur in flowering plants, ciliates, fishes, or yeasts. We show that many of the events show a relative rate of duplicate gene loss before the first postpolyploidy speciation that is significantly higher than in later phases of their evolution. The relatively weak selective constraint experienced by the single-copy genes these losses produced leads us to suggest that most of the purely selectively neutral duplicate gene losses occur in the immediate postpolyploid period. Nearly all of the events show strong evidence of biases in the duplicate losses, consistent with them being allopolyploidies, with 2 distinct progenitors contributing to the modern species. We also find ongoing and extensive reciprocal gene losses (alternative losses of duplicated ancestral genes) between these genomes. With the exception of a handful of closely related taxa, all of these polyploid organisms are separated from each other by tens to thousands of reciprocal gene losses. As a result, it is very unlikely that viable diploid hybrid species could form between these taxa, since matings between such hybrids would tend to produce offspring lacking essential genes. It is, therefore, possible that the relatively high frequency of recurrent polyploidies in some lineages may be due to the ability of new polyploidies to bypass reciprocal gene loss barriers.
7
Exome sequencing of hepatocellular carcinoma in lemurs identifies potential cancer drivers: A pilot study. Evol Med Public Health 2022; 10:221-230. [PMID: 35557512 PMCID: PMC9086584 DOI: 10.1093/emph/eoac016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2021] [Accepted: 04/17/2022] [Indexed: 11/24/2022] Open
Abstract
Background and objectives: Hepatocellular carcinoma occurs frequently in prosimians, but the cause of these liver cancers in this group is unknown. Characterizing the genetic changes associated with hepatocellular carcinoma in prosimians may point to possible causes, treatments and methods of prevention, aiding conservation efforts that are particularly crucial to the survival of endangered lemurs. Although genomic studies of cancer in non-human primates have been hampered by a lack of tools, recent studies have demonstrated the efficacy of using human exome capture reagents across primates. Methodology: In this proof-of-principle study, we applied human exome capture reagents to tumor-normal pairs from five lemurs with hepatocellular carcinoma to characterize the mutational landscape of this disease in lemurs. Results: Several genes implicated in human hepatocellular carcinoma, including ARID1A, TP53 and CTNNB1, were mutated in multiple lemurs, and analysis of cancer driver genes mutated in these samples identified enrichment of genes involved with TP53 degradation and regulation. In addition to these similarities with human hepatocellular carcinoma, we also noted unique features, including six genes that contain mutations in all five lemurs. Interestingly, these genes are infrequently mutated in human hepatocellular carcinoma, suggesting potential differences in the etiology and/or progression of this cancer in lemurs and humans. Conclusions and implications: Collectively, this pilot study suggests that human exome capture reagents are a promising tool for genomic studies of cancer in lemurs and other non-human primates. Lay Summary: Hepatocellular carcinoma occurs frequently in prosimians, but the cause of these liver cancers is unknown. In this proof-of-principle study, we applied human DNA sequencing tools to tumor-normal pairs from five lemurs with hepatocellular carcinoma and compared the lemur mutation profiles to those of human hepatocellular carcinomas.
8
Measuring Phylogenetic Information of Incomplete Sequence Data. Syst Biol 2021; 71:630-648. [PMID: 34469581 DOI: 10.1093/sysbio/syab073] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/02/2021] [Revised: 08/26/2021] [Accepted: 08/27/2021] [Indexed: 11/13/2022] Open
Abstract
Widely used approaches for extracting phylogenetic information from aligned sets of molecular sequences rely upon probabilistic models of nucleotide substitution or amino-acid replacement. The phylogenetic information that can be extracted depends on the number of columns in the sequence alignment and will be decreased when the alignment contains gaps due to insertion or deletion events. Motivated by the measurement of information loss, we suggest assessment of the Effective Sequence Length (ESL) of an aligned data set. The ESL can differ from the actual number of columns in a sequence alignment because of the presence of alignment gaps. Furthermore, the estimation of phylogenetic information is affected by model misspecification. Inevitably, the actual process of molecular evolution differs from the probabilistic models employed to describe this process. This disparity means the amount of phylogenetic information in an actual sequence alignment will differ from the amount in a simulated data set of equal size, which motivated us to develop a new test for model adequacy. Via theory and empirical data analysis, we show how to disentangle the effects of gaps and model misspecification. By comparing the Fisher information of actual and simulated sequences, we identify which alignment sites and tree branches are most affected by gaps and model misspecification.
9
Pedigree-based and phylogenetic methods support surprising patterns of mutation rate and spectrum in the gray mouse lemur. Heredity (Edinb) 2021; 127:233-244. [PMID: 34272504 PMCID: PMC8322134 DOI: 10.1038/s41437-021-00446-5] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 05/20/2020] [Revised: 05/25/2021] [Accepted: 05/26/2021] [Indexed: 02/06/2023] Open
Abstract
Mutations are the raw material on which evolution acts, and knowledge of their frequency and genomic distribution is crucial for understanding how evolution operates at both long and short timescales. At present, the rate and spectrum of de novo mutations have been directly characterized in relatively few lineages. Our study provides the first direct mutation-rate estimate for a strepsirrhine (i.e., the lemurs and lorises), which comprises nearly half of the primate clade. Using high-coverage linked-read sequencing for a focal quartet of gray mouse lemurs (Microcebus murinus), we estimated the mutation rate to be among the highest calculated for a mammal at $1.52 \times 10^{-8}$ (95% credible interval: $1.28 \times 10^{-8}$ to $1.78 \times 10^{-8}$) mutations/site/generation. Further, we found an unexpectedly low count of paternal mutations, and only a modest overrepresentation of mutations at CpG sites. Despite the surprising nature of these results, we found both the rate and spectrum to be robust to the manipulation of a wide range of computational filtering criteria. We also sequenced a technical replicate to estimate a false-negative and false-positive rate for our data and show that any point estimate of a de novo mutation rate should be considered with a large degree of uncertainty. For validation, we conducted an independent analysis of context-dependent substitution types for gray mouse lemur and five additional primate species for which de novo mutation rates have also been estimated. These comparisons revealed general consistency of the mutation spectrum between the pedigree-based and the substitution-rate analyses for all species compared.
10
Treasurer's Report for Financial Year (FY) 2019. Mol Biol Evol 2021; 38:3028. [PMID: 34009329 PMCID: PMC8233483 DOI: 10.1093/molbev/msab012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022] Open
11
Advances in understanding and in multi-disciplinary methodology used to assess lipid regulation of signalling cascades from the cancer cell plasma membrane. Prog Lipid Res 2020; 81:101080. [PMID: 33359620 DOI: 10.1016/j.plipres.2020.101080] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 10/12/2020] [Revised: 12/18/2020] [Accepted: 12/18/2020] [Indexed: 12/31/2022]
Abstract
The lipid bilayer is a functional component of cells, forming a stable platform for the initiation of key biological processes, including cell signalling. There are distinct changes in the lipid composition of cell membranes during oncogenic transformation, resulting in aberrant activation and inactivation of signal transduction pathways. Studying the role of the cell membrane in cell signalling is challenging, since techniques are often limited by timescale, resolution, sensitivity, and averaging. To overcome these limitations, combining 'computational', 'wet-lab' and 'semi-dry' approaches offers the best opportunity to resolve the complex biological processes involved in membrane organisation. In this review, we highlight analytical tools that have been applied to the study of cell signalling initiation from cancer cell membranes through computational microscopy, biological assays, and membrane biophysics. The cancer therapeutic potential of extracellular membrane-modulating agents, such as cholesterol-reducing agents, is also discussed, as is the need for future collaborative inter-disciplinary research for studying the role of the cell membrane and its components in cancer therapy.
12
Abstract
Evolutionary models of proteins are widely used for statistical sequence alignment and inference of homology and phylogeny. However, the vast majority of these models rely on an unrealistic assumption of independent evolution between sites. Here we focus on the related problem of protein structure alignment, a classic tool of computational biology that is widely used to identify structural and functional similarity and to infer homology among proteins. A site-independent statistical model for protein structural evolution has previously been introduced and shown to significantly improve alignments and phylogenetic inferences compared with approaches that utilize only amino acid sequence information. Here we extend this model to account for correlated evolutionary drift among neighboring amino acid positions. The result is a spatiotemporal model of protein structure evolution, described by a multivariate diffusion process convolved with a spatial birth-death process. This extended site-dependent model (SDM) comes with little additional computational cost or analytical complexity compared with the site-independent model (SIM). We demonstrate that this SDM yields a significant reduction of bias in estimated evolutionary distances and helps further improve phylogenetic tree reconstruction. We also develop a simple model of site-dependent sequence evolution, which we use to demonstrate the bias resulting from the application of standard site-independent sequence evolution models.
13
Abstract
Despite a considerable expenditure of time and resources and significant advances in experimental models of disease, cancer research continues to suffer from extremely low success rates in translating preclinical discoveries into clinical practice. The continued failure of cancer drug development, particularly late in the course of human testing, not only impacts patient outcomes, but also drives up the cost for those therapies that do succeed. It is clear that a paradigm shift is necessary if improvements in this process are to occur. One promising direction for increasing translational success is comparative oncology, the study of cancer across species, often involving veterinary patients that develop naturally occurring cancers. Comparative oncology leverages the power of cross-species analyses to understand the fundamental drivers of cancer protective mechanisms, as well as factors contributing to cancer initiation and progression. Clinical trials in veterinary patients with cancer provide an opportunity to evaluate novel therapeutics in a setting that recapitulates many of the key features of human cancers, including genomic aberrations that underlie tumor development, response and resistance to treatment, and the presence of comorbidities that can affect outcomes. With a concerted effort from basic scientists, human physicians and veterinarians, comparative oncology has the potential to enhance the cost-effectiveness and efficiency of pipelines for cancer drug discovery and other cancer treatments.
14
Information Criteria for Comparing Partition Schemes. Syst Biol 2018; 67:616-632. [PMID: 29309694 DOI: 10.1093/sysbio/syx097] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 06/16/2017] [Accepted: 12/17/2017] [Indexed: 01/10/2023] Open
Abstract
When inferring phylogenies, one important decision is whether and how nucleotide substitution parameters should be shared across different subsets or partitions of the data. One sort of partitioning error occurs when heterogeneous subsets are mistakenly lumped together and treated as if they share parameter values. The opposite kind of error is mistakenly treating homogeneous subsets as if they result from distinct sets of parameters. Lumping and splitting errors are not equally bad. Lumping errors can yield parameter estimates that do not accurately reflect any of the subsets that were combined whereas splitting errors yield estimates that did not benefit from sharing information across partitions. Phylogenetic partitioning decisions are often made by applying information criteria such as the Akaike information criterion (AIC). As with other information criteria, the AIC evaluates a model or partition scheme by combining the maximum log-likelihood value with a penalty that depends on the number of parameters being estimated. For the purpose of selecting an optimal partitioning scheme, we derive an adjustment to the AIC that we refer to as the AIC$^{(p)}$ and that is motivated by the idea that splitting errors are less serious than lumping errors. We also introduce a similar adjustment to the Bayesian information criterion (BIC) that we refer to as the BIC$^{(p)}$. Via simulation and empirical data analysis, we contrast AIC and BIC behavior to our suggested adjustments. We discuss these results and also emphasize why we expect the probability of lumping errors with the AIC$^{(p)}$ and the BIC$^{(p)}$ to be relatively robust to model parameterization.
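To make the penalty structure mentioned in this abstract concrete, the Python sketch below computes the standard, unadjusted criteria, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, for two partition schemes. The adjusted AIC$^{(p)}$ and BIC$^{(p)}$ are not defined in the abstract and are not shown. The function names, log-likelihoods, parameter counts, and sample size are made-up values for illustration only.

```python
import math

def aic(log_likelihood, num_params):
    """Standard Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, sample_size):
    """Standard Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return num_params * math.log(sample_size) - 2 * log_likelihood

# Hypothetical schemes: "lumped" shares parameters across subsets,
# "split" estimates a separate parameter set for each subset.
schemes = {
    "lumped": {"log_likelihood": -1250.0, "num_params": 10},
    "split": {"log_likelihood": -1235.0, "num_params": 20},
}
sample_size = 1000  # assumed number of alignment sites

aic_scores = {
    name: aic(s["log_likelihood"], s["num_params"]) for name, s in schemes.items()
}
bic_scores = {
    name: bic(s["log_likelihood"], s["num_params"], sample_size)
    for name, s in schemes.items()
}
best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
```

With these made-up numbers the two criteria disagree: the heavier BIC penalty favours lumping, while the AIC favours splitting. Partitioning decisions of this kind are what the adjustments described in the abstract are meant to inform.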
15
Decoding the Feline Leukocyte Antigen MHC class I system via SMRT sequencing. THE JOURNAL OF IMMUNOLOGY 2018. [DOI: 10.4049/jimmunol.200.supp.59.26] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2023]
Abstract
CD8+ T-cell cytotoxicity is an important component of anti-viral immune responses and is restricted by MHC class I peptide presentation. Investigating virus-specific CTL in cats is impeded by the rudimentary understanding of the Feline Leukocyte Antigen class I (FLAI) system. As polygeny and polymorphisms are inherent MHC features, defining functional loci and cataloging alleles are the first steps in characterizing FLAI. Additionally, identification of prevalent alleles will permit discovery of immunodominant CTL responses. Previously, using conventional sequencing of class I hypervariable regions, we defined three class Ia loci, FLA-E, -H and -K, and found allele sharing, even in our small sample of cats. Because of high FLAI homology, partial-length sequences did not allow unambiguous assignment of all alleles to their originating locus. We hypothesized that the PacBio SMRT sequencing platform would allow us to efficiently obtain full-length genotyping of all class Ia loci in multi-cat cohorts. Consensus building of FLAI contigs provided complete MHC sequences with high accuracy, greater depth and sensitivity, and in a fraction of the time of traditional cloning and Sanger techniques. In sequencing 17 cats (including 2 reference individuals), we clarified locus assignments, identified >40 novel classical class I alleles and discovered several common allele supergroups (FLA-E*003, -E*009, -K*002, and -K*003). Also, for the first time, we confirmed the identity and expression of 2 class Ib loci (FLA-J, -L). This work brings the current total of known FLAI alleles to ~100 across 3 class Ia and ~35 across 2 class Ib loci. This new dataset should accelerate feline anti-viral research and facilitate allotransplantation in cats.
16
Grouping substitution types into different relaxed molecular clocks. Philos Trans R Soc Lond B Biol Sci 2017; 371:rstb.2015.0141. [PMID: 27325837 DOI: 10.1098/rstb.2015.0141] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Accepted: 01/07/2016] [Indexed: 11/12/2022] Open
Abstract
Different types of nucleotide substitutions experience different patterns of rate change over time. We propose clustering context-dependent (or context-independent) nucleotide substitution types according to how their rates change and then using the grouping for divergence time estimation. With our models, relative rates among types that are in the same group are fixed, whereas absolute rates of the types within a group change over time according to a shared relaxed molecular clock. We illustrate our procedure by analysing a 0.15 Mb intergenic region to infer divergence times relating eight primates. The different groupings of substitution types that we explore have little effect on the posterior means of divergence times, but the widths of the credibility intervals decrease as the number of groups increases. This article is part of the themed issue 'Dating species divergences using rocks and clocks'.
17
A Phylogenetic Approach Finds Abundant Interlocus Gene Conversion in Yeast. Mol Biol Evol 2016; 33:2469-76. [PMID: 27297467 DOI: 10.1093/molbev/msw114] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022] Open
Abstract
Interlocus gene conversion (IGC) homogenizes repeats. While genomes can be repeat-rich, the evolutionary importance of IGC is poorly understood. Additional statistical tools for characterizing it are needed. We propose a composite likelihood strategy for incorporating IGC into widely-used probabilistic models for sequence changes that originate with point mutation. We estimated the percentage of nucleotide substitutions that originate with an IGC event rather than a point mutation in 14 groups of yeast ribosomal protein-coding genes, and found values ranging from 20% to 38%. We designed and applied a procedure to determine whether these percentages are inflated due to artifacts arising from model misspecification. The results of this procedure are consistent with IGC having had an important role in the evolution of each of these 14 gene families. We further investigate the properties of our IGC approach via simulation. In contrast to usual practice, our findings suggest that the IGC should and can be considered when multigene family evolution is investigated.
18
Mitochondrial genome sequences reveal evolutionary relationships of the Phytophthora 1c clade species. Curr Genet 2015; 61:567-77. [PMID: 25754775 PMCID: PMC4659649 DOI: 10.1007/s00294-015-0480-3] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 02/20/2014] [Revised: 02/11/2015] [Accepted: 02/12/2015] [Indexed: 01/24/2023]
Abstract
Phytophthora infestans is one of the most destructive plant pathogens of potato and tomato globally. The pathogen is closely related to four other Phytophthora species in the 1c clade, including P. phaseoli, P. ipomoeae, P. mirabilis and P. andina, that are important pathogens of other wild and domesticated hosts. P. andina is an interspecific hybrid between P. infestans and an unknown Phytophthora species. We have sequenced mitochondrial genomes of the sister species of P. infestans and examined the evolutionary relationships within the clade. Phylogenetic analysis indicates that the P. phaseoli mitochondrial lineage is basal within the clade. P. mirabilis and P. ipomoeae are sister lineages and share a common ancestor with the Ic mitochondrial lineage of P. andina. These lineages in turn are sister to the P. infestans and P. andina Ia mitochondrial lineages. The P. andina Ic lineage diverged much earlier than the P. andina Ia mitochondrial lineage and P. infestans. The presence of two mitochondrial lineages in P. andina supports the hybrid nature of this species. The ancestral state of the P. andina Ic lineage in the tree and its occurrence only in the Andean regions of Ecuador, Colombia and Peru suggest that the origin of this hybrid species in nature may have occurred there.
19
Abstract
Rates of molecular evolution can vary over time. Diverse statistical techniques for divergence time estimation have been developed to accommodate this variation. These typically require that all sequence (or codon) positions at a locus change independently of one another. They also generally assume that the rates of different types of nucleotide substitutions vary across a phylogeny in the same way. This permits divergence time estimation procedures to employ an instantaneous rate matrix with relative rates that do not differ among branches. However, previous studies have suggested that some substitution types (e.g., CpG to TpG changes in mammals) are more clock-like than others. As has been previously noted, this is biologically plausible given the mutational mechanism of CpG to TpG changes. Through stochastic mapping of sequence histories from context-independent substitution models, our approach allows for context-dependent nucleotide substitutions to change their relative rates over time. We apply our approach to the analysis of a 0.15 Mb intergenic region from eight primates. In accord with previous findings, we find comparatively little rate variation over time for CpG to TpG substitutions but we find more for other substitution types. We conclude by discussing the limitations and prospects of our approach.
20
Roles of solvent accessibility and gene expression in modeling protein sequence evolution. Evol Bioinform Online 2015; 11:85-96. [PMID: 25987828 PMCID: PMC4415675 DOI: 10.4137/ebo.s22911] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 12/14/2014] [Revised: 02/04/2015] [Accepted: 02/09/2015] [Indexed: 11/05/2022] Open
Abstract
Models of protein evolution tend to ignore functional constraints, although structural constraints are sometimes incorporated. Here we propose a probabilistic framework for codon substitution that evaluates joint effects of relative solvent accessibility (RSA), a structural constraint, and gene expression, a functional constraint. First, we explore the relationship between RSA and codon usage at the genomic scale as well as at the individual gene scale. Motivated by these results, we construct our framework by determining how probable an amino acid is, given RSA and gene expression, and then evaluating the relative probability of observing a codon compared to other synonymous codons. We come to the biologically plausible conclusion that both RSA and gene expression are related to amino acid frequencies, but, among synonymous codons, the relative probability of a particular codon is more closely related to gene expression than RSA. To illustrate the potential applications of our framework, we propose a new codon substitution model. Using this model, we obtain estimates of 2Ns, the product of effective population size N and relative fitness difference of allele s. For a training data set consisting of human proteins with known structures and expression data, 2Ns is estimated separately for synonymous and nonsynonymous substitutions in each protein. We then contrast the patterns of synonymous and nonsynonymous 2Ns estimates across proteins while also taking gene expression levels of the proteins into account. We conclude that our 2Ns estimates are too concentrated around 0, and we discuss potential explanations for this lack of variability.
21
The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci 2012; 21:769-85. [PMID: 22528593 PMCID: PMC3403413 DOI: 10.1002/pro.2071] [Citation(s) in RCA: 140] [Impact Index Per Article: 11.7] [Received: 03/02/2012] [Revised: 03/22/2012] [Accepted: 03/23/2012] [Indexed: 12/20/2022]
Abstract
The interface of protein structural biology, protein biophysics, molecular evolution, and molecular population genetics forms the foundation for a mechanistic understanding of many aspects of protein biochemistry. Current efforts in interdisciplinary protein modeling are in their infancy, and the state of the art of such models is described. Beyond the relationship between amino acid substitution and static protein structure, protein function, and corresponding organismal fitness, other considerations are also discussed. More complex mutational processes such as insertion and deletion, domain rearrangements, and even circular permutations should be evaluated. The role of intrinsically disordered proteins is still controversial, but may be increasingly important to consider. Protein geometry and protein dynamics, as a departure from static considerations of protein structure, are also important. Protein expression level is known to be a major determinant of evolutionary rate, and several considerations, including selection at the mRNA level and the role of interaction specificity, are discussed. Lastly, the relationship between modeling and needed high-throughput experimental data, as well as experimental examination of protein evolution using ancestral sequence resurrection and in vitro biochemistry, are presented, with the aim of ultimately generating better models for biological inference and prediction.
|
22
|
Abstract
Although most of the important evolutionary events in the history of biology can only be studied via interspecific comparisons, it is challenging to apply the rich body of population genetic theory to the study of interspecific genetic variation. Probabilistic modeling of the substitution process would ideally be derived from first principles of population genetics, allowing a quantitative connection to be made between the parameters describing mutation, selection, drift, and the patterns of interspecific variation. There has been progress in reconciling population genetics and interspecific evolution for the case where mutation rates are sufficiently low, but when mutation rates are higher, reconciliation has been hampered due to complications from how the loss or fixation of new mutations can be influenced by linked nonneutral polymorphisms (i.e., the Hill-Robertson effect). To investigate the generation of interspecific genetic variation when concurrent fitness-affecting polymorphisms are common and the Hill-Robertson effect is thereby potentially strong, we used the Wright-Fisher model of population genetics to simulate very many generations of mutation, natural selection, and genetic drift. This was done so that the chronological history of advantageous, deleterious, and neutral substitutions could be traced over time along the ancestral lineage. Our simulations show that the process by which a nonrecombining sequence changes over time can markedly deviate from the Markov assumption that is ubiquitous in molecular phylogenetics. In particular, we find tendencies for advantageous substitutions to be followed by deleterious ones and for deleterious substitutions to be followed by advantageous ones. Such non-Markovian patterns reflect the fact that the fate of the ancestral lineage depends not only on its current allelic state but also on gene copies not belonging to the ancestral lineage. 
Although our simulations describe nonrecombining sequences, we conclude by discussing how non-Markovian behavior of the ancestral lineage is plausible even when recombination rates are not low. As a result, we believe that increased attention needs to be devoted to the robustness of evolutionary inference procedures that rely upon the Markov assumption.
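As a rough illustration of the kind of Wright-Fisher simulation this abstract mentions, the Python sketch below follows a single biallelic locus in a haploid population under selection and drift. It is far simpler than the simulations described in the abstract: there is no recurrent mutation, no linked sites, and no tracing of the ancestral lineage, so it cannot show the non-Markovian patterns reported there. The function names, population size, and selection coefficient are assumptions chosen only for illustration.

```python
import random


def wright_fisher_step(count, n, s, rng):
    """Advance one generation for a haploid population of size n.

    count: current number of copies of the focal allele.
    s: relative fitness advantage of the focal allele (assumed constant).
    """
    p = count / n
    # Frequency after selection, before binomial sampling (drift).
    p_sel = p * (1 + s) / (1 + p * s)
    # Binomial(n, p_sel) draw built from n Bernoulli trials.
    return sum(1 for _ in range(n) if rng.random() < p_sel)


def fixes(n, s, rng):
    """Return True if a single new copy of the allele reaches fixation."""
    count = 1
    while 0 < count < n:
        count = wright_fisher_step(count, n, s, rng)
    return count == n


def fixation_fraction(n, s, replicates, rng):
    """Fraction of replicate populations in which the new allele fixes."""
    return sum(fixes(n, s, rng) for _ in range(replicates)) / replicates
```

Calling, for example, `fixation_fraction(50, 0.05, 2000, random.Random(1))` and comparing it with the same call at `s = 0.0` gives a feel for how selection shifts the fate of a new mutation relative to pure drift.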
|
23
|
Coordinated genome-wide modifications within proximal promoter cis-regulatory elements during vertebrate evolution. Genome Biol Evol 2010; 3:66-74. [PMID: 21118975 PMCID: PMC3021792 DOI: 10.1093/gbe/evq078] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 01/05/2023] Open
Abstract
There often exists a "one-to-many" relationship between a transcription factor and a multitude of binding sites throughout the genome. It is commonly assumed that transcription factor binding motifs remain largely static over the course of evolution because changes in binding specificity can alter the interactions with potentially hundreds of sites across the genome. Focusing on regulatory motifs overrepresented at specific locations within or near the promoter, we find that a surprisingly large number of cis-regulatory elements have been subject to coordinated genome-wide modifications during vertebrate evolution, such that the motif frequency changes on a single branch of vertebrate phylogeny. This was found to be the case even between closely related mammal species, with nearly a third of all location-specific consensus motifs exhibiting significant modifications within the human or mouse lineage since their divergence. Many of these modifications are likely to be compensatory changes throughout the genome following changes in protein factor binding affinities, whereas others may be due to changes in mutation rates or effective population size. The likelihood that this happened many times during vertebrate evolution highlights the need to examine additional taxa and to understand the evolutionary and molecular mechanisms underlying the evolution of protein-DNA interactions.
|
24
|
Basing population genetic inferences and models of molecular evolution upon desired stationary distributions of DNA or protein sequences. Philos Trans R Soc Lond B Biol Sci 2009; 363:3931-9. [PMID: 18852105 DOI: 10.1098/rstb.2008.0167] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022] Open
Abstract
Models of molecular evolution tend to be overly simplistic caricatures of biology that are prone to assigning high probabilities to biologically implausible DNA or protein sequences. Here, we explore how to construct time-reversible evolutionary models that yield stationary distributions of sequences that match given target distributions. By adopting comparatively realistic target distributions, evolutionary models can be improved. Instead of focusing on estimating parameters, we concentrate on the population genetic implications of these models. Specifically, we obtain estimates of the product of effective population size and relative fitness difference of alleles. The approach is illustrated with two applications to protein-coding DNA. In the first, a codon-based evolutionary model yields a stationary distribution of sequences, which, when the sequences are translated, matches a variable-length Markov model trained on human proteins. In the second, we introduce an insertion-deletion model that describes selectively neutral evolutionary changes to DNA. We then show how to modify the neutral model so that its stationary distribution at the amino acid level can match a profile hidden Markov model, such as the one associated with the Pfam database.
|
25
|
Estimates of natural selection due to protein tertiary structure inform the ancestry of biallelic loci. Gene 2008; 441:45-52. [PMID: 18725272 DOI: 10.1016/j.gene.2008.07.020] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 03/01/2008] [Accepted: 07/10/2008] [Indexed: 10/21/2022]
Abstract
We consider the inference of which of two alleles is ancestral when the alleles have a single nonsynonymous difference and when natural selection acts via protein tertiary structure. Whereas the probability that an allele is ancestral under neutrality is equal to its frequency, under selection this probability depends on allele frequency and on the magnitude and direction of selection pressure. Although allele frequencies can be well estimated from intraspecific data, small fitness differences have a large evolutionary impact but can be difficult to estimate with only intraspecific data. Methods for predicting aspects of phenotype from genotype can supplement intraspecific sequence data. Recently developed statistical techniques can assess effects of phenotypes, such as protein tertiary structure on molecular evolution. While these techniques were initially designed for comparing protein-coding genes from different species, the resulting interspecific inferences can be assigned population genetic interpretations to assess the effect of selection pressure, and we use them here along with intraspecific allele frequency data to estimate the probability that an allele is ancestral. We focus on 140 nonsynonymous single nucleotide polymorphisms of humans that are in proteins with known tertiary structures. We find that our technique for employing protein tertiary structure information yields some biologically plausible results but that it does not substantially improve the inference of ancestral human allele types.
|
26
|
Rates of nucleotide substitution in Cornaceae (Cornales)-Pattern of variation and underlying causal factors. Mol Phylogenet Evol 2008; 49:327-42. [PMID: 18682295 DOI: 10.1016/j.ympev.2008.07.010] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Received: 05/18/2008] [Revised: 07/09/2008] [Accepted: 07/09/2008] [Indexed: 11/18/2022]
Abstract
Identifying causes of genetic divergence is a central goal in evolutionary biology. Although rates of nucleotide substitution vary among taxa and among genes, the causes of this variation tend to be poorly understood. In the present study, we examined the rate and pattern of molecular evolution for five DNA regions over a phylogeny of Cornus, the single genus of Cornaceae. To identify evolutionary mechanisms underlying the molecular variation, we employed Bayesian methods to estimate divergence times and to infer how absolute rates of synonymous and nonsynonymous substitutions and their ratios change over time. We found that the rates vary among genes, lineages, and through time, and differences in mutation rates, selection type and intensity, and possibly genetic drift all contributed to the variation of substitution rates observed among the major lineages of Cornus. We applied independent contrast analysis to explore whether speciation rates are linked to rates of molecular evolution. The results showed no relationships for individual genes, but suggested a possible localized link between species richness and rate of nonsynonymous nucleotide substitution for the combined cpDNA regions. Furthermore, we detected a positive correlation between rates of molecular evolution and morphological change in Cornus. This was particularly pronounced in the dwarf dogwood lineage, in which genome-wide acceleration in both molecular and morphological evolution has likely occurred.
|
27
|
Protein evolution constraints and model-based techniques to study them. Curr Opin Struct Biol 2007; 17:337-41. [PMID: 17572082 DOI: 10.1016/j.sbi.2007.05.006] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.3] [Received: 02/16/2007] [Revised: 04/11/2007] [Accepted: 05/29/2007] [Indexed: 11/17/2022]
Abstract
There have been substantial improvements in statistical tools for assessing the evolutionary roles of mutation and natural selection from interspecific sequence data. The importance of having the rate at which a point mutation occurs depend on the DNA sequence at sites surrounding the mutation is now better appreciated and can be accommodated in probabilistic models of protein evolution. To quantify the evolutionary impact of some aspect of phenotype, one promising strategy is to develop a system for predicting phenotype from the DNA sequence and to then infer how the evolutionary rates of sequence change are affected by the predicted phenotypic consequences of the changes. Although statistical tools for characterizing protein evolution are improving, the list of candidate phenomena that can affect rates of protein evolution is long and the relative contributions of these phenomena are only beginning to be disentangled.
|
28
|
Abstract
To investigate the evolutionary impact of protein structure, the experimentally determined tertiary structure and the protein-coding DNA sequence were collected for each of 1,195 genes. These genes were studied via a model of sequence change that explicitly incorporates effects on evolutionary rates due to protein tertiary structure. In the model, these effects act via the solvent accessibility environments and pairwise amino acid interactions that are induced by tertiary structure. To compare the hypotheses that structure does and does not have a strong influence on evolution, Bayes factors were estimated for each of the 1,195 sequences. Most of the Bayes factors strongly support the hypothesis that protein structure affects protein evolution. Furthermore, both solvent accessibility and pairwise interactions among amino acids are inferred to have important roles in protein evolution. Our results also indicate that the strength of the relationship between tertiary structure and evolution has a weak but real correlation to the annotation information in the Gene Ontology database. Although their influences on rates of evolution vary among protein families, we find that the mean impacts of solvent accessibility and pairwise interactions are about the same.
|
29
|
Abstract
A central goal of computational biology is the prediction of phenotype from DNA and protein sequence data. Recent models of sequence change use in silico prediction systems to incorporate the effects of phenotype on evolutionary rates. These models have been designed for analyzing sequence data from different species and have been accompanied by statistical techniques for estimating model parameters when the incorporation of phenotype induces dependent change among sequence positions. A difficulty with these efforts to link phenotype and interspecific evolution is that evolution occurs within populations, and parameters of interspecific models should have population genetic interpretations. We show, with two examples, how population genetic interpretations can be assigned to evolutionary models. The first example considers the impact of RNA secondary structure on sequence change, and the second reflects the tendency for protein tertiary structure to influence nonsynonymous substitution rates. We argue that statistical fit to data should not be the sole criterion for assessing models of sequence change. A good interspecific model should also yield a clear and biologically plausible population genetic interpretation.
|
30
|
BRISKET DISEASE. III. SPONTANEOUS REMISSION OF PULMONARY HYPERTENSION AND RECOVERY FROM HEART FAILURE. J Clin Invest 2006; 42:589-96. [PMID: 16695900 PMCID: PMC289323 DOI: 10.1172/jci104749] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Indexed: 11/17/2022] Open
|
31
|
Abstract
Although probabilistic models of genotype (e.g., DNA sequence) evolution have been greatly elaborated, less attention has been paid to the effect of phenotype on the evolution of the genotype. Here we propose an evolutionary model and a Bayesian inference procedure that are aimed at filling this gap. In the model, RNA secondary structure links genotype and phenotype by treating the approximate free energy of a sequence folded into a secondary structure as a surrogate for fitness. The underlying idea is that a nucleotide substitution resulting in a more stable secondary structure should have a higher rate than a substitution that yields a less stable secondary structure. This free energy approach incorporates evolutionary dependencies among sequence positions beyond those that are reflected simply by jointly modeling change at paired positions in an RNA helix. Although there is not a formal requirement with this approach that secondary structure be known and nearly invariant over evolutionary time, computational considerations make these assumptions attractive and they have been adopted in a software program that permits statistical analysis of multiple homologous sequences that are related via a known phylogenetic tree topology. Analyses of 5S ribosomal RNA sequences are presented to illustrate and quantify the strong impact that RNA secondary structure has on substitution rates. Analyses on simulated sequences show that the new inference procedure has reasonable statistical properties. Potential applications of this procedure, including improved ancestral sequence inference and location of functionally interesting sites, are discussed.
|
32
|
Testing for spatial clustering of amino acid replacements within protein tertiary structure. J Mol Evol 2006; 62:682-92. [PMID: 16752209 DOI: 10.1007/s00239-005-0107-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 05/02/2005] [Accepted: 11/30/2005] [Indexed: 11/25/2022]
Abstract
Widely used models of protein evolution ignore protein structure. Therefore, these models do not predict spatial clustering of amino acid replacements with respect to tertiary structure. One formal and biologically implausible possibility is that there is no tendency for amino acid replacements to be spatially clustered during evolution. An alternative to this is that amino acid replacements are spatially clustered and this spatial clustering can be fully explained by a tendency for similar rates of amino acid replacement at sites that are nearby in protein tertiary structure. A third possibility is that the amount of clustering exceeds that which can be explained solely on the basis of independently evolving protein sites with spatially clustered replacement rates. We introduce two simple and not very parametric hypothesis tests that help distinguish these three possibilities. We then apply these tests to 273 homologous protein families. The null hypothesis of no spatial clustering is rejected for 102 of 273 families. The explanation of spatially clustered rates but independent change among sites is rejected for 43 families. These findings need to be reconciled with the common practice of basing evolutionary inferences on models that assume independent change among sites.
|
33
|
Incorporating gene-specific variation when inferring and evaluating optimal evolutionary tree topologies from multilocus sequence data. Proc Natl Acad Sci U S A 2005; 102:4436-41. [PMID: 15764703 PMCID: PMC555482 DOI: 10.1073/pnas.0408313102] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.8] [Received: 11/08/2004] [Indexed: 11/18/2022] Open
Abstract
Because of the increase of genomic data, multiple genes are often available for the inference of phylogenetic relationships. The simple approach for combining multiple genes from the same taxon is to concatenate the sequences and then ignore the fact that different positions in the concatenated sequence came from different genes. Here, we discuss two criteria for inferring the optimal tree topology from data sets with multiple genes. These criteria are designed for multigene data sets where gene-specific evolutionary features are too important to ignore. One criterion is conventional and is obtained by taking the sum of log-likelihoods over all genes. The other criterion is obtained by dividing the log-likelihood for a gene by its sequence length and then taking the arithmetic mean over genes of these ratios. A similar strategy could be adopted with parsimony scores. The optimal tree is then declared to be the one for which the sum or the arithmetic mean is maximized. These criteria are justified within a two-stage hierarchical framework. The first level of the hierarchy represents gene-specific evolutionary features, and the second represents site-specific features for given genes. For testing significance of the optimal topology, we suggest a two-stage bootstrap procedure that involves resampling genes and then resampling alignment columns within resampled genes. An advantage of this procedure over concatenation is that it can effectively account for gene-specific evolutionary features. We discuss the applicability of the two-stage bootstrap idea to the Kishino-Hasegawa test and the Shimodaira-Hasegawa test.
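The two criteria and the two-stage bootstrap described in this abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, each gene alignment is represented simply as a list of columns, and the per-gene log-likelihoods are made-up numbers standing in for values that a phylogenetics program would compute.

```python
import random


def sum_criterion(gene_logliks):
    """Conventional criterion: sum of log-likelihoods over genes."""
    return sum(gene_logliks)


def mean_per_site_criterion(gene_logliks, gene_lengths):
    """Alternative criterion: mean over genes of log-likelihood / length."""
    ratios = [ll / n for ll, n in zip(gene_logliks, gene_lengths)]
    return sum(ratios) / len(ratios)


def two_stage_resample(genes, rng):
    """One bootstrap replicate: resample genes, then columns within genes.

    genes: list of alignments, each a list of alignment columns.
    """
    picked = [rng.choice(genes) for _ in genes]           # stage 1: genes
    return [[rng.choice(cols) for _ in cols] for cols in picked]  # stage 2


# Hypothetical log-likelihoods for two topologies on a long and a short gene.
lengths = [1000, 100]
tree_a = [-1000.0, -210.0]
tree_b = [-1020.0, -205.0]

best_by_sum = "A" if sum_criterion(tree_a) > sum_criterion(tree_b) else "B"
best_by_mean = (
    "A"
    if mean_per_site_criterion(tree_a, lengths)
    > mean_per_site_criterion(tree_b, lengths)
    else "B"
)
```

With these invented numbers the two criteria disagree: the sum favors topology A because the long gene dominates, while the per-site mean favors topology B because each gene carries equal weight.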
|
34
|
Abstract
Estimation of divergence times from sequence data has become increasingly feasible in recent years. Conflicts between fossil evidence and molecular dates have sparked the development of new methods for inferring divergence times, further encouraging these efforts. In this paper, available methods for estimating divergence times are reviewed, especially those geared toward handling the widespread variation in rates of molecular evolution observed among lineages. The assumptions, strengths, and weaknesses of local clock, Bayesian, and rate smoothing methods are described. The rapidly growing literature applying these methods to key divergence times in plant evolutionary history is also reviewed. These include the crown group ages of green plants, land plants, seed plants, angiosperms, and major subclades of angiosperms. Finally, attempts to infer divergence times are described in the context of two very different temporal settings: recent adaptive radiations and much more ancient biogeographic patterns.
|
35
|
Discussions on “A Bayesian Approach to DNA Sequence Segmentation”. Biometrics 2004. [DOI: 10.1111/j.0006-341x.2004.206_5.x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
36
|
Estimating absolute rates of synonymous and nonsynonymous nucleotide substitution in order to characterize natural selection and date species divergences. Mol Biol Evol 2004; 21:1201-13. [PMID: 15014159 DOI: 10.1093/molbev/msh088] [Citation(s) in RCA: 74] [Impact Index Per Article: 3.7] [Indexed: 11/14/2022] Open
Abstract
The rate of molecular evolution can vary among lineages. Sources of this variation have differential effects on synonymous and nonsynonymous substitution rates. Changes in effective population size or patterns of natural selection will mainly alter nonsynonymous substitution rates. Changes in generation length or mutation rates are likely to have an impact on both synonymous and nonsynonymous substitution rates. By comparing changes in synonymous and nonsynonymous rates, the relative contributions of the driving forces of evolution can be better characterized. Here, we introduce a procedure for estimating the chronological rates of synonymous and nonsynonymous substitutions on the branches of an evolutionary tree. Because the widely used ratio of nonsynonymous and synonymous rates is not designed to detect simultaneous increases or simultaneous decreases in synonymous and nonsynonymous rates, the estimation of these rates rather than their ratio can improve characterization of the evolutionary process. With our Bayesian approach, we analyze cytochrome oxidase subunit I evolution in primates and infer that nonsynonymous rates have a greater tendency to change over time than do synonymous rates. Our analysis of these data also suggests that rates have been positively correlated.
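The point that the nonsynonymous/synonymous ratio cannot detect simultaneous changes in both rates can be seen with a small numerical example. The rates below are invented for illustration (units assumed to be substitutions per site per million years), and the function name is hypothetical.

```python
def rate_ratio(nonsyn_rate, syn_rate):
    """Ratio of nonsynonymous to synonymous substitution rates."""
    return nonsyn_rate / syn_rate


# Hypothetical absolute rates on an ancestral and a descendant branch.
ancestral = {"nonsyn": 0.002, "syn": 0.010}
descendant = {"nonsyn": 0.004, "syn": 0.020}  # both rates doubled

ratio_before = rate_ratio(ancestral["nonsyn"], ancestral["syn"])
ratio_after = rate_ratio(descendant["nonsyn"], descendant["syn"])

# The ratio is unchanged even though both absolute rates doubled.
ratio_unchanged = abs(ratio_before - ratio_after) < 1e-12
```

A doubling of both rates, as might follow from a shorter generation length or a higher mutation rate, leaves the ratio unchanged, whereas the absolute rates reveal it directly.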
|
37
|
|
38
|
Time flies, a new molecular time-scale for brachyceran fly evolution without a clock. Syst Biol 2003; 52:745-56. [PMID: 14668115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/27/2023] Open
Abstract
The insect order Diptera, the true flies, contains one of the four largest Mesozoic insect radiations within its suborder Brachycera. Estimates of phylogenetic relationships and divergence dates among the major brachyceran lineages have been problematic or vague because of a lack of consistent evidence and the rarity of well-preserved fossils. Here, we combine new evidence from nucleotide sequence data, morphological reinterpretations, and fossils to improve estimates of brachyceran evolutionary relationships and ages. The 28S ribosomal DNA (rDNA) gene was sequenced for a broad diversity of taxa, and the data were combined with recently published morphological scorings for a parsimony-based phylogenetic analysis. The phylogenetic topology inferred from the combined 28S rDNA and morphology data set supports brachyceran monophyly and the monophyly of the four major brachyceran infraorders and suggests relationships largely consistent with previous classifications. Weak support was found for a basal brachyceran clade comprising the infraorders Stratiomyomorpha (soldier flies and relatives), Xylophagomorpha (xylophagid flies), and Tabanomorpha (horse flies, snipe flies, and relatives). This topology and similar alternative arrangements were used to obtain Bayesian estimates of divergence times, both with and without the assumption of a constant evolutionary rate. The estimated times were relatively robust to the choice of prior distributions. Divergence times based on the 28S rDNA and several fossil constraints indicate that the Brachycera originated in the late Triassic or earliest Jurassic and that all major lower brachyceran fly lineages had near contemporaneous origins in the mid-Jurassic prior to the origin of flowering plants (angiosperms). This study provides increased resolution of brachyceran phylogeny, and our revised estimates of fly ages should improve the temporal context of evolutionary inferences and genomic comparisons between fly model organisms.
|
39
|
Abstract
Markovian models of protein evolution that relax the assumption of independent change among codons are considered. With this comparatively realistic framework, an evolutionary rate at a site can depend both on the state of the site and on the states of surrounding sites. By allowing a relatively general dependence structure among sites, models of evolution can reflect attributes of tertiary structure. To quantify the impact of protein structure on protein evolution, we analyze protein-coding DNA sequence pairs with an evolutionary model that incorporates effects of solvent accessibility and pairwise interactions among amino acid residues. By explicitly considering the relationship between nonsynonymous substitution rates and protein structure, this approach can lead to refined detection and characterization of positive selection. Analyses of simulated sequence pairs indicate that parameters in this evolutionary model can be well estimated. Analyses of lysozyme c and annexin V sequence pairs yield the biologically reasonable result that amino acid replacement rates are higher when the replacements lead to energetically favorable proteins than when they destabilize the proteins. Although the focus here is evolutionary dependence among codons that is associated with protein structure, the statistical approach is quite general and could be applied to diverse cases of evolutionary dependence where surrogates for sequence fitness can be measured or modeled.
|
40
|
Horizontally transferred genes in plant-parasitic nematodes: a high-throughput genomic approach. Genome Biol 2003; 4:R39. [PMID: 12801413 PMCID: PMC193618 DOI: 10.1186/gb-2003-4-6-r39] [Citation(s) in RCA: 116] [Impact Index Per Article: 5.5] [Received: 01/31/2003] [Revised: 03/27/2003] [Accepted: 04/22/2003] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Published accounts of horizontally acquired genes in plant-parasitic nematodes have not been the result of a specific search for gene transfer per se, but rather have emerged from characterization of individual genes. We present a method for a high-throughput genome screen for horizontally acquired genes, illustrated using expressed sequence tag (EST) data from three species of root-knot nematode, Meloidogyne species.
RESULTS: Our approach identified the previously postulated horizontally transferred genes and revealed six new candidates. Screening was partially dependent on sequence quality, with more candidates identified from clustered sequences than from raw EST data. Computational and experimental methods verified the horizontal gene transfer candidates as bona fide nematode genes. Phylogenetic analysis implicated rhizobial ancestors as donors of horizontally acquired genes in Meloidogyne.
CONCLUSIONS: High-throughput genomic screening is an effective way to identify horizontal gene transfer candidates. Transferred genes that have undergone amelioration of nucleotide composition and codon bias have been identified using this approach. Analysis of these horizontally transferred gene candidates suggests a link between horizontally transferred genes in Meloidogyne and parasitism.
|
41
|
Time scale of eutherian evolution estimated without assuming a constant rate of molecular evolution. Genes Genet Syst 2003; 78:267-83. [PMID: 14532706 DOI: 10.1266/ggs.78.267] [Citation(s) in RCA: 127] [Impact Index Per Article: 6.0] [Indexed: 11/23/2022] Open
Abstract
Controversies over the molecular clock hypothesis were reviewed. Since it is evident that the molecular clock does not hold in an exact sense, accounting for evolution of the rate of molecular evolution is a prerequisite when estimating divergence times with molecular sequences. Recently proposed statistical methods that account for this rate variation are overviewed and one of these procedures is applied to the mitochondrial protein sequences and to the nuclear gene sequences from many mammalian species in order to estimate the time scale of eutherian evolution. This Bayesian method not only takes account of the variation of molecular evolutionary rate among lineages and among genes, but it also incorporates fossil evidence via constraints on node times. With denser taxonomic sampling and a more realistic model of molecular evolution, this Bayesian approach is expected to increase the accuracy of divergence time estimates.
|
42
|
Abstract
Bayesian methods for estimating evolutionary divergence times are extended to multigene data sets, and a technique is described for detecting correlated changes in evolutionary rates among genes. Simulations are employed to explore the effect of multigene data on divergence time estimation, and the methodology is illustrated with a previously published data set representing diverse plant taxa. The fact that evolutionary rates and times are confounded when sequence data are compared is emphasized and the importance of fossil information for disentangling rates and times is stressed.
|
43
|
Estimation of effective population size of HIV-1 within a host: a pseudomaximum-likelihood approach. Genetics 2002; 160:1283-93. [PMID: 11973287 PMCID: PMC1462041 DOI: 10.1093/genetics/160.4.1283] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.5] [Indexed: 11/14/2022] Open
Abstract
Using pseudomaximum-likelihood approaches to phylogenetic inference and coalescent theory, we develop a computationally tractable method of estimating effective population size from serially sampled viral data. We show that the variance of the maximum-likelihood estimator of effective population size depends on the serial sampling design only because internal node times on a coalescent genealogy can be better estimated with some designs than with others. Given the internal node times and the number of sequences sampled, the variance of the maximum-likelihood estimator is independent of the serial sampling design. We then estimate the effective size of the HIV-1 population within nine hosts. If we assume that the mutation rate is $2.5 \times 10^{-5}$ substitutions/generation and is the same in all patients, estimated generation lengths vary from 0.73 to 2.43 days/generation and the mean (1.47) is similar to the generation lengths estimated by other researchers. If we assume that generation length is 1.47 days and is the same in all patients, mutation rate estimates vary from $1.52 \times 10^{-5}$ to $5.02 \times 10^{-5}$. Our results indicate that effective viral population size and evolutionary rate per year are negatively correlated among HIV-1 patients.
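The two sets of figures in this abstract appear to be rescalings of one another, which a short arithmetic sketch can make explicit. It assumes that the quantity fixed for each patient is the substitution rate per day (mutation rate per generation divided by days per generation); that reading is an inference from the reported numbers rather than something the abstract states, and the function names are hypothetical.

```python
# Figures quoted in the abstract.
MU_ASSUMED = 2.5e-5    # substitutions per generation (first assumption)
GEN_ASSUMED = 1.47     # days per generation (second assumption)

def rate_per_day(mu_per_generation, days_per_generation):
    """Substitution rate per day implied by a rate and a generation length."""
    return mu_per_generation / days_per_generation

def mutation_rate_at_fixed_generation(gen_len_estimated):
    """Convert a generation length estimated under MU_ASSUMED into the
    mutation rate implied when the generation length is GEN_ASSUMED."""
    per_day = rate_per_day(MU_ASSUMED, gen_len_estimated)
    return per_day * GEN_ASSUMED

# Extremes of the reported generation-length range (days/generation).
highest = mutation_rate_at_fixed_generation(0.73)  # roughly 5.0e-5
lowest = mutation_rate_at_fixed_generation(2.43)   # roughly 1.5e-5
print(lowest, highest)
```

Under this assumption the computed range comes out close to the reported $1.52 \times 10^{-5}$ to $5.02 \times 10^{-5}$; the small differences are consistent with rounding of the published generation lengths.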
|
44
|
A viral sampling design for testing the molecular clock and for estimating evolutionary rates and divergence times. Bioinformatics 2002; 18:115-23. [PMID: 11836219 DOI: 10.1093/bioinformatics/18.1.115] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.8] [Indexed: 11/14/2022]
Abstract
MOTIVATION The high pace of viral sequence change means that variation in the times at which sequences are sampled can have a profound effect both on the ability to detect trends over time in evolutionary rates and on the power to reject the Molecular Clock Hypothesis (MCH). Trends in viral evolutionary rates are of particular interest because their detection may allow connections to be established between a patient's treatment or condition and the process of evolution. Variation in sequence isolation times also impacts the uncertainty associated with estimates of divergence times and evolutionary rates. Variation in isolation times can be intentionally adjusted to increase the power of hypothesis tests and to reduce the uncertainty of evolutionary parameter estimates, but this fact has received little previous attention. RESULTS We provide approximations for the power to reject the MCH when the alternative is that rates change in a linear fashion over time and when the alternative is that rates differ randomly among branches. In addition, we approximate the standard deviation of estimated evolutionary rates and divergence times. We illustrate how these approximations can be exploited to determine which viral sample to sequence when samples representing different dates are available.
|
45
|
Abstract
Rates of molecular evolution vary over time and, hence, among lineages. In contrast, widely used methods for estimating divergence times from molecular sequence data assume constancy of rates. Therefore, methods for estimation of divergence times that incorporate rate variation are attractive. Improvements on a previously proposed Bayesian technique for divergence time estimation are described. New parameterization more effectively captures the phylogenetic structure of rate evolution on a tree. Fossil information and other evidence can now be included in Bayesian analyses in the form of constraints on divergence times. Simulation results demonstrate that the accuracy of divergence time estimation is substantially enhanced when constraints are included.
|
46
|
Abstract
Homologous sequences are correlated due to their common ancestry. Probabilistic models of sequence evolution are employed routinely to properly account for these phylogenetic correlations. These increasingly realistic models provide a basis for studying evolution and for exploiting it to better understand protein structure and function. Notable recent advances have been made in the treatment of insertion and deletion events, the estimation of amino-acid replacement rates, and the detection of positive selection.
|
47
|
Abstract
A simple model for the evolution of the rate of molecular evolution is presented. With a Bayesian approach, this model can serve as the basis for estimating dates of important evolutionary events even in the absence of the assumption of constant rates among evolutionary lineages. The method can be used in conjunction with any of the widely used models for nucleotide substitution or amino acid replacement. It is illustrated by analyzing a data set of rbcL protein sequences.
|
48
|
Abstract
MOTIVATION Evolutionary models of amino acid sequences can be adapted to incorporate structure information; protein structure biologists can use phylogenetic relationships among species to improve prediction accuracy. RESULTS A computer program called PASSML ('Phylogeny and Secondary Structure using Maximum Likelihood') has been developed to implement an evolutionary model that combines protein secondary structure and amino acid replacement. The model is related to that of Dayhoff and co-workers, but we distinguish eight categories of structural environment: alpha helix, beta sheet, turn and coil, each further classified according to solvent accessibility, i.e. buried or exposed. The model of sequence evolution for each of the eight categories is a Markov process with discrete states in continuous time, and the organization of structure along protein sequences is described by a hidden Markov model. This paper describes the PASSML software and illustrates how it allows both the reconstruction of phylogenies and prediction of secondary structure from aligned amino acid sequences. AVAILABILITY PASSML 'ANSI C' source code and the example data sets described here are available at http://ng-dec1.gen.cam.ac.uk/hmm/Passml.html and 'downstream' Web pages. CONTACT P.Lio@gen.cam.ac.uk
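As a rough illustration of what "a Markov process with discrete states in continuous time" implies computationally, the sketch below derives transition probabilities from a rate matrix via $P(t) = e^{Qt}$. It is not PASSML code: the 3-state rate matrix `Q` is a hypothetical toy (a real amino acid model has 20 states and empirically estimated rates), and the truncated Taylor series is chosen only to keep the example dependency-free.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(q, t, terms=30):
    """Approximate P(t) = exp(Q t) with a truncated Taylor series.
    Adequate for small matrices and modest values of Q * t."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]        # holds (Q t)^k / k!
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

# Hypothetical symmetric rate matrix: every off-diagonal rate is 0.5 and
# each row sums to zero, as required of a rate matrix.
Q = [[-1.0, 0.5, 0.5],
     [0.5, -1.0, 0.5],
     [0.5, 0.5, -1.0]]

P = transition_matrix(Q, 0.3)

# For this symmetric toy model the probability of remaining in the same
# state has a closed form, which serves as a check on the series.
stay_closed_form = 1.0 / 3.0 + (2.0 / 3.0) * math.exp(-3 * 0.5 * 0.3)
print(P[0][0], stay_closed_form)
```

In a model with several structural categories, each category would have its own rate matrix, and a per-site likelihood would combine the resulting transition probabilities along the tree.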
|
49
|
Abstract
Empirically derived models of amino acid replacement are employed to study the association between various physical features of proteins and evolution. The strengths of these associations are statistically evaluated by applying the models of protein evolution to 11 diverse sets of protein sequences. Parametric bootstrap tests indicate that the solvent accessibility status of a site has a particularly strong association with the process of amino acid replacement that it experiences. Significant association between secondary structure environment and the amino acid replacement process is also observed. Careful description of the length distribution of secondary structure elements and of the organization of secondary structure and solvent accessibility along a protein did not always significantly improve the fit of the evolutionary models to the data sets that were analyzed. As indicated by the strength of the association of both solvent accessibility and secondary structure with amino acid replacement, the process of protein evolution, both above and below the species level, will not be well understood until the physical constraints that affect protein evolution are identified and characterized.
|
50
|
|