1
Ferreiro D, Branco C, Arenas M. Selection among site-dependent structurally constrained substitution models of protein evolution by approximate Bayesian computation. Bioinformatics 2024; 40:btae096. [PMID: 38374231 PMCID: PMC10914458 DOI: 10.1093/bioinformatics/btae096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2023] [Revised: 01/15/2024] [Accepted: 02/16/2024] [Indexed: 02/21/2024] Open
Abstract
MOTIVATION: The selection among substitution models of molecular evolution is fundamental for obtaining accurate phylogenetic inferences. At the protein level, evolutionary analyses are traditionally based on empirical substitution models, but these models make unrealistic assumptions and are being surpassed by structurally constrained substitution (SCS) models. The SCS models often consider site-dependent evolution, a process that provides realism but complicates their implementation into the likelihood functions that are commonly used for substitution model selection. RESULTS: We present a method to perform selection among site-dependent SCS models, and between empirical and site-dependent SCS models, based on the approximate Bayesian computation (ABC) approach and its implementation into the computational framework ProteinModelerABC. The framework implements ABC with and without regression adjustments and includes diverse empirical and site-dependent SCS models of protein evolution. Using extensive simulated data, we found that it provides selection among SCS and empirical models with acceptable accuracy. As illustrative examples, we applied the framework to analyze a variety of protein families, observing that SCS models fit them better than the corresponding best-fitting empirical substitution models. AVAILABILITY AND IMPLEMENTATION: ProteinModelerABC is freely available from https://github.com/DavidFerreiro/ProteinModelerABC, can run in parallel and includes a graphical user interface. The framework is distributed with detailed documentation and ready-to-use examples.
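As an illustration of the general idea behind ABC model selection, the following is a minimal rejection-ABC sketch in Python. It is not the ProteinModelerABC implementation: the two toy models (`unconstrained`, `constrained`), the simulator, the summary statistics and all parameter values are assumptions made for this example, and no regression adjustment is applied. Data are simulated under each candidate model, the simulations whose summary statistics lie closest to the observed ones are retained, and the share of retained simulations per model approximates its posterior probability.

```python
import math
import random


def simulate(model, n_sites, rng):
    """Toy simulator (hypothetical): 1 means the site changed, 0 means it did not."""
    rate = rng.uniform(0.05, 0.5)  # prior on the mean per-site change probability
    if model == "unconstrained":
        probs = [rate] * n_sites
    else:
        # "constrained": odd sites are frozen, even sites change twice as often
        probs = [2 * rate if i % 2 == 0 else 0.0 for i in range(n_sites)]
    return [1 if rng.random() < p else 0 for p in probs]


def summaries(sites):
    """Two summary statistics: fraction of changed sites, fraction of adjacent changed pairs."""
    n = len(sites)
    frac_changed = sum(sites) / n
    adjacent = sum(1 for a, b in zip(sites, sites[1:]) if a and b) / (n - 1)
    return (frac_changed, adjacent)


def abc_model_choice(observed, models, n_sims=2000, keep_fraction=0.05,
                     n_sites=200, seed=1):
    """Rejection ABC: approximate posterior model probabilities from the closest simulations."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        model = rng.choice(models)  # uniform prior over the candidate models
        stats = summaries(simulate(model, n_sites, rng))
        results.append((math.dist(stats, observed), model))
    results.sort(key=lambda r: r[0])
    kept = results[: max(1, int(keep_fraction * n_sims))]
    return {m: sum(1 for _, km in kept if km == m) / len(kept) for m in models}


if __name__ == "__main__":
    # Hypothetical observed summaries: 30% of sites changed, no adjacent changed pairs
    print(abc_model_choice((0.30, 0.0), ["unconstrained", "constrained"]))
```

The fraction of retained simulations acts as the tolerance: a smaller value gives a closer approximation to the posterior at the cost of more simulations.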
Affiliation(s)
- David Ferreiro
  - CINBIO, Universidade de Vigo, 36310 Vigo, Spain
  - Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, 36310 Vigo, Spain
- Catarina Branco
  - CINBIO, Universidade de Vigo, 36310 Vigo, Spain
  - Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, 36310 Vigo, Spain
- Miguel Arenas
  - CINBIO, Universidade de Vigo, 36310 Vigo, Spain
  - Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, 36310 Vigo, Spain
2
Teterina AA, Willis JH, Lukac M, Jovelin R, Cutter AD, Phillips PC. Genomic diversity landscapes in outcrossing and selfing Caenorhabditis nematodes. PLoS Genet 2023; 19:e1010879. [PMID: 37585484 PMCID: PMC10461856 DOI: 10.1371/journal.pgen.1010879] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2022] [Revised: 08/28/2023] [Accepted: 07/21/2023] [Indexed: 08/18/2023] Open
Abstract
Caenorhabditis nematodes form an excellent model for studying how the mode of reproduction affects genetic diversity, as some species reproduce via outcrossing whereas others can self-fertilize. Currently, chromosome-level patterns of diversity and recombination are only available for self-reproducing Caenorhabditis, making the generality of genomic patterns across the genus unclear given the profound potential influence of reproductive mode. Here we present a whole-genome diversity landscape, coupled with a new genetic map, for the outcrossing nematode C. remanei. We demonstrate that the genomic distribution of recombination in C. remanei, like the model nematode C. elegans, shows high recombination rates on chromosome arms and low rates toward the central regions. Patterns of genetic variation across the genome are also similar between these species, but differ dramatically in scale, being tenfold greater for C. remanei. Historical reconstructions of variation in effective population size over the past million generations echo this difference in polymorphism. Evolutionary simulations demonstrate how selection, recombination, mutation, and selfing shape variation along the genome, and that multiple drivers can produce patterns similar to those observed in natural populations. The results illustrate how genome organization and selection play a crucial role in shaping the genomic pattern of diversity whereas demographic processes scale the level of diversity across the genome as a whole.
Affiliation(s)
- Anastasia A. Teterina
  - Institute of Ecology and Evolution, University of Oregon, Eugene, Oregon, United States of America
  - Center of Parasitology, Severtsov Institute of Ecology and Evolution RAS, Moscow, Russia
- John H. Willis
  - Institute of Ecology and Evolution, University of Oregon, Eugene, Oregon, United States of America
- Matt Lukac
  - Institute of Ecology and Evolution, University of Oregon, Eugene, Oregon, United States of America
- Richard Jovelin
  - Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
- Asher D. Cutter
  - Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
- Patrick C. Phillips
  - Institute of Ecology and Evolution, University of Oregon, Eugene, Oregon, United States of America
3
Del Amparo R, Arenas M. Influence of substitution model selection on protein phylogenetic tree reconstruction. Gene 2023; 865:147336. [PMID: 36871672 DOI: 10.1016/j.gene.2023.147336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2023] [Revised: 02/22/2023] [Accepted: 02/28/2023] [Indexed: 03/06/2023]
Abstract
Probabilistic phylogenetic tree reconstruction is traditionally performed under a best-fitting substitution model of molecular evolution previously selected according to diverse statistical criteria. Interestingly, some recent studies proposed that this procedure is unnecessary for phylogenetic tree reconstruction leading to a debate in the field. In contrast to DNA sequences, phylogenetic tree reconstruction from protein sequences is traditionally based on empirical exchangeability matrices that can differ among taxonomic groups and protein families. Considering this aspect, here we investigated the influence of selecting a substitution model of protein evolution on phylogenetic tree reconstruction by the analyses of real and simulated data. We found that phylogenetic tree reconstructions based on a selected best-fitting substitution model of protein evolution are the most accurate, in terms of topology and branch lengths, compared with those derived from substitution models with amino acid replacement matrices far from the selected best-fitting model, especially when the data has large genetic diversity. Indeed, we found that substitution models with similar amino acid replacement matrices produce similar reconstructed phylogenetic trees, suggesting the use of substitution models as similar as possible to a selected best-fitting model when the latter cannot be used. Therefore, we recommend the use of the traditional protocol of selection among substitution models of evolution for protein phylogenetic tree reconstruction.
Affiliation(s)
- Roberto Del Amparo
  - CINBIO, Universidade de Vigo, 36310 Vigo, Spain
  - Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, 36310 Vigo, Spain
- Miguel Arenas
  - CINBIO, Universidade de Vigo, 36310 Vigo, Spain
  - Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, 36310 Vigo, Spain
  - Galicia Sur Health Research Institute (IIS Galicia Sur), 36310 Vigo, Spain
4
Muñoz-Baena L, Wade KE, Poon AFY. HexSE: Simulating evolution in overlapping reading frames. Virus Evol 2023; 9:vead009. [PMID: 36846827 PMCID: PMC9949996 DOI: 10.1093/ve/vead009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2022] [Revised: 01/11/2023] [Accepted: 01/27/2023] [Indexed: 02/04/2023] Open
Abstract
Gene overlap occurs when two or more genes are encoded by the same nucleotides. This phenomenon is found in all taxonomic domains, but is particularly common in viruses, where it may provide a mechanism to increase the information content of compact genomes. The presence of overlapping reading frames (OvRFs) can skew estimates of selection based on the rates of non-synonymous and synonymous substitutions, since a substitution that is synonymous in one reading frame may be non-synonymous in another and vice versa. To understand the impact of OvRFs on molecular evolution, we implemented a versatile simulation model of nucleotide sequence evolution along a phylogeny with any distribution of open reading frames in linear or circular genomes. We use a custom data structure to track the substitution rates at every nucleotide site, which are determined by the stationary nucleotide frequencies, transition bias and the distribution of selection biases (dN/dS) in the respective reading frames. Our simulation model is implemented in the Python scripting language. All source code is released under the GNU General Public License version 3 and is available at https://github.com/PoonLab/HexSE.
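The point that a single substitution can be synonymous in one reading frame and non-synonymous in another can be checked with a short Python sketch. This is an illustrative example only, not code from HexSE: the function name `classify`, its arguments and the six-nucleotide test sequence are assumptions made for this example, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering (TTT, TTC, TTA, TTG, TCT, ...)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify(seq, pos, new_base, frame_start):
    """Classify a substitution at `pos` for the reading frame starting at `frame_start`.

    Returns "synonymous" or "nonsynonymous", or None when the position is not
    covered by a complete codon in that frame.
    """
    if pos < frame_start:
        return None
    codon_start = frame_start + 3 * ((pos - frame_start) // 3)
    codon = seq[codon_start:codon_start + 3]
    if len(codon) < 3:
        return None
    offset = pos - codon_start
    mutated = codon[:offset] + new_base + codon[offset + 1:]
    same = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "synonymous" if same else "nonsynonymous"


if __name__ == "__main__":
    seq = "CTACGT"
    # A -> G at position 2: CTA -> CTG in frame 0 (Leu -> Leu),
    # but TAC -> TGC in the frame shifted by one nucleotide (Tyr -> Cys).
    print(classify(seq, 2, "G", 0))
    print(classify(seq, 2, "G", 1))
```

A simulator that handles overlapping frames has to combine such per-frame effects into one substitution rate per nucleotide site, which is what the abstract describes tracking with a custom data structure.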
Affiliation(s)
- Kaitlyn E Wade
  - Department of Pathology and Laboratory Medicine, Western University, Dental Sciences Building 4044, London N6A 5C1, Canada
5
Gupta MK, Vadde R. Next-generation development and application of codon model in evolution. Front Genet 2023; 14:1091575. [PMID: 36777719 PMCID: PMC9911445 DOI: 10.3389/fgene.2023.1091575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Accepted: 01/17/2023] [Indexed: 01/28/2023] Open
Abstract
To date, numerous nucleotide, amino acid, and codon substitution models have been developed to estimate the evolutionary history of any sequence or organism in a more comprehensive way. Of these three, the codon substitution model is the most powerful. These models have been utilized extensively to detect selective pressure on a protein and codon usage bias, and for ancestral and phylogenetic reconstruction. However, because codon substitution models are more computationally demanding than nucleotide and amino acid substitution models, only a few studies have employed them to understand the heterogeneity of the evolutionary process in genome-scale analyses. Hence, there is always a question of how to develop more robust but less computationally demanding codon substitution models to get more accurate results. In this review article, the authors attempted to understand the basis of the development of different types of codon substitution models and how this information can be utilized to develop more robust but less computationally demanding codon substitution models. The codon substitution model makes it possible to detect the selection regime under which any gene or gene region is evolving, codon usage bias in any organism or tissue-specific region, and phylogenetic relationships between different lineages more accurately than nucleotide and amino acid substitution models. Thus, in the near future, these codon models can be utilized in the fields of conservation, breeding and medicine.
6
Del Amparo R, González-Vázquez LD, Rodríguez-Moure L, Bastolla U, Arenas M. Consequences of Genetic Recombination on Protein Folding Stability. J Mol Evol 2023; 91:33-45. [PMID: 36463317 PMCID: PMC9849154 DOI: 10.1007/s00239-022-10080-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2022] [Accepted: 11/25/2022] [Indexed: 12/05/2022]
Abstract
Genetic recombination is a common evolutionary mechanism that produces molecular diversity. However, its consequences on protein folding stability have not attracted the same attention as in the case of point mutations. Here, we studied the effects of homologous recombination on the computationally predicted protein folding stability for several protein families, finding less detrimental effects than we previously expected. Although recombination can affect multiple protein sites, we found that the fraction of recombined proteins that are eliminated by negative selection because of insufficient stability is not significantly larger than the corresponding fraction of proteins produced by mutation events. Indeed, although recombination disrupts epistatic interactions, the mean stability of recombinant proteins is not lower than that of their parents. On the other hand, the difference of stability between recombined proteins is amplified with respect to the parents, promoting phenotypic diversity. As a result, at least one third of recombined proteins present stability between those of their parents, and a substantial fraction have higher or lower stability than those of both parents. As expected, we found that parents with similar sequences tend to produce recombined proteins with stability close to that of the parents. Finally, the simulation of protein evolution along the ancestral recombination graph with empirical substitution models commonly used in phylogenetics, which ignore constraints on protein folding stability, showed that recombination favors the decrease of folding stability, supporting the convenience of adopting structurally constrained models when possible for inferences of protein evolutionary histories with recombination.
Affiliation(s)
- Roberto Del Amparo
  - CINBIO, Universidade de Vigo, 36310 Vigo, Spain
  - Departamento de Bioquímica, Genética e Inmunología, Universidade de Vigo, 36310 Vigo, Spain
- Luis Daniel González-Vázquez
  - CINBIO, Universidade de Vigo, 36310 Vigo, Spain
  - Departamento de Bioquímica, Genética e Inmunología, Universidade de Vigo, 36310 Vigo, Spain
- Laura Rodríguez-Moure
  - CINBIO, Universidade de Vigo, 36310 Vigo, Spain
  - Departamento de Bioquímica, Genética e Inmunología, Universidade de Vigo, 36310 Vigo, Spain
- Ugo Bastolla
  - Centre for Molecular Biology Severo Ochoa (CSIC-UAM), 28049 Madrid, Spain
- Miguel Arenas
  - CINBIO, Universidade de Vigo, 36310 Vigo, Spain
  - Departamento de Bioquímica, Genética e Inmunología, Universidade de Vigo, 36310 Vigo, Spain
  - Galicia Sur Health Research Institute (IIS Galicia Sur), 36310 Vigo, Spain
7
De Maio N, Boulton W, Weilguny L, Walker CR, Turakhia Y, Corbett-Detig R, Goldman N. phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets. PLoS Comput Biol 2022; 18:e1010056. [PMID: 35486906 PMCID: PMC9094560 DOI: 10.1371/journal.pcbi.1010056] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/24/2021] [Revised: 05/11/2022] [Accepted: 03/25/2022] [Indexed: 11/26/2022] Open
Abstract
Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, and are an essential component of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here, we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and it implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.
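The Gillespie approach mentioned in the abstract can be sketched for a single branch as follows. This is a simplified illustration, not phastSim code: it assumes equal substitution rates at every site (a Jukes-Cantor-like model), so the multi-layered search tree that phastSim uses to pick the mutating site is replaced by a uniform random choice, and the function name `evolve_branch` and its parameters are hypothetical. The sketch shows why the approach suits short branches: the work done is proportional to the number of substitution events, not to the genome length.

```python
import random

NUCLEOTIDES = "ACGT"


def evolve_branch(seq, branch_length, rate_per_site, rng):
    """Gillespie simulation of substitutions along one branch.

    `seq` is a list of nucleotides and is modified in place; `rate_per_site`
    must be positive. Returns the list of events as (position, old, new).
    """
    events = []
    total_rate = rate_per_site * len(seq)  # sum of the rates over all sites
    t = rng.expovariate(total_rate)        # waiting time to the first event
    while t < branch_length:
        pos = rng.randrange(len(seq))      # equal rates: every site is equally likely
        old = seq[pos]
        new = rng.choice([n for n in NUCLEOTIDES if n != old])
        seq[pos] = new
        events.append((pos, old, new))
        t += rng.expovariate(total_rate)   # waiting time to the next event
    return events


if __name__ == "__main__":
    rng = random.Random(42)
    genome = list("ACGT" * 250)            # toy 1,000-nucleotide genome
    events = evolve_branch(genome, branch_length=0.001, rate_per_site=1.0, rng=rng)
    print(len(events), "substitution event(s) on this branch")
```

With site-specific rates, choosing the mutating site requires sampling in proportion to each site's rate, which is the step an efficient search structure is meant to speed up.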
Affiliation(s)
- Nicola De Maio
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
- William Boulton
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
- Lukas Weilguny
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
- Conor R. Walker
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
  - Department of Genetics, University of Cambridge, Cambridge, United Kingdom
- Yatish Turakhia
  - Department of Electrical and Computer Engineering, University of California San Diego, San Diego, California, United States of America
- Russell Corbett-Detig
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, California, United States of America
- Nick Goldman
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
8
Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale AP, Tsambos G, Zhu S, Eldon B, Ellerman EC, Galloway JG, Gladstein AL, Gorjanc G, Guo B, Jeffery B, Kretzschmar WW, Lohse K, Matschiner M, Nelson D, Pope NS, Quinto-Cortés CD, Rodrigues MF, Saunack K, Sellinger T, Thornton K, van Kemenade H, Wohns AW, Wong Y, Gravel S, Kern AD, Koskela J, Ralph PL, Kelleher J. Efficient ancestry and mutation simulation with msprime 1.0. Genetics 2021; 220:6460344. [PMID: 34897427 PMCID: PMC9176297 DOI: 10.1093/genetics/iyab229] [Citation(s) in RCA: 91] [Impact Index Per Article: 30.3] [Received: 09/22/2021] [Accepted: 12/03/2021] [Indexed: 11/13/2022] Open
Abstract
Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime’s many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.
Affiliation(s)
- Franz Baumdicker
  - Cluster of Excellence "Controlling Microbes to Fight Infections", Mathematical and Computational Population Genetics, University of Tübingen, 72076 Tübingen, Germany
- Gertjan Bisschop
  - Institute of Evolutionary Biology, The University of Edinburgh, EH9 3FL, UK
- Daniel Goldstein
  - Khoury College of Computer Sciences, Northeastern University, MA 02115, USA
  - No affiliation
- Graham Gower
  - Lundbeck GeoGenetics Centre, Globe Institute, University of Copenhagen, 1350 Copenhagen K, Denmark
- Aaron P Ragsdale
  - Department of Integrative Biology, University of Wisconsin-Madison, WI 53706, USA
- Georgia Tsambos
  - Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Victoria, 3010, Australia
- Sha Zhu
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, OX3 7LF, UK
- Bjarki Eldon
  - Leibniz Institute for Evolution and Biodiversity Science, Museum für Naturkunde Berlin, 10115, Germany
- Jared G Galloway
  - Institute of Ecology and Evolution, Department of Biology, University of Oregon, OR 97403-5289, USA
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA 98102, USA
- Ariella L Gladstein
  - Department of Genetics, University of North Carolina at Chapel Hill, NC 27599-7264, USA
  - Embark Veterinary, Inc., Boston, MA 02111, USA
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, EH25 9RG, UK
- Bing Guo
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, 21201, USA
- Ben Jeffery
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, OX3 7LF, UK
- Warren W Kretzschmar
  - Center for Hematology and Regenerative Medicine, Karolinska Institute, 141 83 Huddinge, Sweden
- Konrad Lohse
  - Institute of Evolutionary Biology, The University of Edinburgh, EH9 3FL, UK
- Dominic Nelson
  - Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
- Nathaniel S Pope
  - Department of Entomology, Pennsylvania State University, PA 16802, USA
- Consuelo D Quinto-Cortés
  - National Laboratory of Genomics for Biodiversity (LANGEBIO), Unit of Advanced Genomics, CINVESTAV, Irapuato, Mexico
- Murillo F Rodrigues
  - Institute of Ecology and Evolution, Department of Biology, University of Oregon, OR 97403-5289, USA
- Kumar Saunack
  - IIT Bombay, Powai, Mumbai 400 076, Maharashtra, India
- Thibaut Sellinger
  - Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, 85354 Freising, Germany
- Kevin Thornton
  - Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA
- Anthony W Wohns
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, OX3 7LF, UK
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, OX3 7LF, UK
- Simon Gravel
  - Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
- Andrew D Kern
  - Institute of Ecology and Evolution, Department of Biology, University of Oregon, OR 97403-5289, USA
- Jere Koskela
  - Department of Statistics, University of Warwick, CV4 7AL, UK
- Peter L Ralph
  - Institute of Ecology and Evolution, Department of Biology, University of Oregon, OR 97403-5289, USA
  - Department of Mathematics, University of Oregon, OR 97403-5289, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, OX3 7LF, UK
9
Ongaro L, Molinaro L, Flores R, Marnetto D, Capodiferro MR, Alarcón-Riquelme ME, Moreno-Estrada A, Mabunda N, Ventura M, Tambets K, Achilli A, Capelli C, Metspalu M, Pagani L, Montinaro F. Evaluating the Impact of Sex-Biased Genetic Admixture in the Americas through the Analysis of Haplotype Data. Genes (Basel) 2021; 12:genes12101580. [PMID: 34680976 PMCID: PMC8535939 DOI: 10.3390/genes12101580] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/04/2021] [Revised: 10/04/2021] [Accepted: 10/06/2021] [Indexed: 01/30/2023] Open
Abstract
A general imbalance in the proportion of disembarked males and females in the Americas has been documented during the Trans-Atlantic Slave Trade and the Colonial Era and, although less prominent, more recently. This imbalance may have left a signature on the genomes of modern-day populations characterised by high levels of admixture. The analysis of the uniparental systems and the evaluation of the ratio of continental ancestry proportions between autosomes and X chromosomes revealed a general sex imbalance towards males for European and towards females for African and Indigenous American ancestries. However, the consistency and degree of this imbalance are variable, suggesting that other factors, such as cultural and social practices, may have played a role in shaping it. Moreover, very few investigations have evaluated the sex imbalance using haplotype data, which contain more information than genotypes. Here, we analysed genome-wide data for more than 5000 admixed American individuals to assess the presence, direction and magnitude of sex-biased admixture in the Americas. For this purpose, we applied two haplotype-based approaches, ELAI and NNLS, and we compared them with a genotype-based method, ADMIXTURE. In doing so, besides a general agreement between methods, we found that the post-colonial admixture dynamics show higher complexity than previously described.
Affiliation(s)
- Linda Ongaro
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Riia 23b, 51010 Tartu, Estonia
- Ludovica Molinaro
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Riia 23b, 51010 Tartu, Estonia
- Rodrigo Flores
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Riia 23b, 51010 Tartu, Estonia
- Davide Marnetto
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Riia 23b, 51010 Tartu, Estonia
- Marco R. Capodiferro
  - Department of Biology and Biotechnology “L. Spallanzani”, University of Pavia, 27100 Pavia, Italy
- Marta E. Alarcón-Riquelme
  - Department of Medical Genomics, GENYO, Centro Pfizer—Universidad de Granada—Junta de Andalucía de Genómica e Investigación Oncológica, Av de la Ilustración 114, Parque Tecnológico de la Salud (PTS), 18016 Granada, Spain
- Andrés Moreno-Estrada
  - National Laboratory of Genomics for Biodiversity (LANGEBIO), CINVESTAV, Irapuato, Guanajuato 36821, Mexico
- Nedio Mabunda
  - Instituto Nacional de Saúde, Distrito de Marracuene, Estrada Nacional N°1, Província de Maputo, Maputo 1120, Mozambique
- Mario Ventura
  - Department of Biology-Genetics, University of Bari, 70126 Bari, Italy
- Kristiina Tambets
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Riia 23b, 51010 Tartu, Estonia
- Alessandro Achilli
  - Department of Biology and Biotechnology “L. Spallanzani”, University of Pavia, 27100 Pavia, Italy
- Cristian Capelli
  - Department of Zoology, University of Oxford, Oxford OX1 3SZ, UK
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, 43124 Parma, Italy
- Mait Metspalu
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Riia 23b, 51010 Tartu, Estonia
- Luca Pagani
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Riia 23b, 51010 Tartu, Estonia
  - Department of Biology, University of Padua, 35131 Padua, Italy
- Francesco Montinaro
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Riia 23b, 51010 Tartu, Estonia
  - Department of Biology-Genetics, University of Bari, 70126 Bari, Italy
10
De Maio N, Boulton W, Weilguny L, Walker CR, Turakhia Y, Corbett-Detig R, Goldman N. phastSim: efficient simulation of sequence evolution for pandemic-scale datasets. bioRxiv: the preprint server for biology 2021:2021.03.15.435416. [PMID: 33758852 PMCID: PMC7987011 DOI: 10.1101/2021.03.15.435416] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/30/2022]
Abstract
Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, as well as being part of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software is available from https://github.com/NicolaDM/phastSim and allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.
Affiliation(s)
- Nicola De Maio
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- William Boulton
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Lukas Weilguny
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Conor R. Walker
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
  - Department of Genetics, University of Cambridge, Cambridge, CB2 3EH, UK
- Yatish Turakhia
  - Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA 92093, USA
- Russell Corbett-Detig
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Nick Goldman
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
11
Arenas M. ProteinEvolverABC: coestimation of recombination and substitution rates in protein sequences by approximate Bayesian computation. Bioinformatics 2021; 38:58-64. [PMID: 34450622 PMCID: PMC8696103 DOI: 10.1093/bioinformatics/btab617] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/15/2021] [Revised: 07/24/2021] [Accepted: 08/24/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION The evolutionary processes of mutation and recombination, upon which selection operates, are fundamental to understanding the observed molecular diversity. Unlike for nucleotide sequences, the estimation of the recombination rate in protein sequences has been little explored and has not been implemented in evolutionary frameworks, even though protein sequencing methods are widely used. RESULTS To address this need, here I present a computational framework, called ProteinEvolverABC, to jointly estimate recombination and substitution rates from alignments of protein sequences. The framework implements the approximate Bayesian computation approach, with and without regression adjustments, and includes a variety of substitution models of protein evolution, demographics and longitudinal sampling. It also implements several nuisance parameters such as heterogeneous amino acid frequencies and rates of change among sites, and the proportion of invariable sites. The framework produces accurate coestimation of recombination and substitution rates under diverse evolutionary scenarios. As illustrative examples of usage, I applied it to several viral protein families, including coronaviruses, showing heterogeneous substitution and recombination rates. AVAILABILITY AND IMPLEMENTATION ProteinEvolverABC is freely available from https://github.com/miguelarenas/proteinevolverabc and includes a graphical user interface to help specify the input settings, extensive documentation and ready-to-use examples. Conveniently, the simulations can run in parallel on multicore machines. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
12
|
Abstract
Organisms evolve to increase their fitness, a process that may be described as climbing the fitness landscape. However, the fitness landscape of an individual site, i.e., the vector of fitness values corresponding to different variants at this site, can itself change with time due to changes in the environment or substitutions at other epistatically interacting sites. While there exist a number of simulators for modeling different aspects of molecular evolution, very few can accommodate changing landscapes. We present SELVa, the Simulator of Evolution with Landscape Variation, aimed at modeling the substitution process under a changing single-position fitness landscape in a set of evolving lineages that form a phylogeny of arbitrary shape. Written in Java and distributed as an executable jar file, SELVa provides a flexible framework that allows the user to choose from a number of implemented rules governing landscape change.
|
13
|
Currat M, Arenas M, Quilodràn CS, Excoffier L, Ray N. SPLATCHE3: simulation of serial genetic data under spatially explicit evolutionary scenarios including long-distance dispersal. Bioinformatics 2020; 35:4480-4483. [PMID: 31077292 PMCID: PMC6821363 DOI: 10.1093/bioinformatics/btz311] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 01/22/2019] [Revised: 04/18/2019] [Accepted: 04/29/2019] [Indexed: 01/25/2023] Open
Abstract
SUMMARY SPLATCHE3 simulates genetic data under a variety of spatially explicit evolutionary scenarios, extending previous versions of the framework. The new capabilities include long-distance migration, spatially and temporally heterogeneous short-scale migrations, alternative hybridization models, simulation of serial samples of genetic data and a large variety of DNA mutation models. These implementations have been applied independently in various studies but are grouped together in the current version. AVAILABILITY AND IMPLEMENTATION SPLATCHE3 is written in C++ and is freely available for non-commercial use from the website http://www.splatche.com/splatche3. It includes console versions for Linux, macOS and Windows and a user-friendly GUI for Windows, as well as detailed documentation and ready-to-use examples.
Affiliation(s)
- Mathias Currat
  - Laboratory of Anthropology, Genetics and Peopling History, Department of Genetics and Evolution - Anthropology Unit, University of Geneva, Geneva 1205, Switzerland
  - Institute of Genetics and Genomics in Geneva (IGE3), University of Geneva, Geneva 1211, Switzerland
- Miguel Arenas
  - Department of Biochemistry, Genetics and Immunology, Vigo 36310, Spain
  - Biomedical Research Center (CINBIO), University of Vigo, Vigo 36310, Spain
- Claudio S Quilodràn
  - Laboratory of Anthropology, Genetics and Peopling History, Department of Genetics and Evolution - Anthropology Unit, University of Geneva, Geneva 1205, Switzerland
- Laurent Excoffier
  - Computational and Molecular Population Genetics Laboratory, Institute of Ecology and Evolution, University of Bern, Bern 3012, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Nicolas Ray
  - Institute of Global Health, GeoHealth Group, University of Geneva, Geneva 1205, Switzerland
  - Institute for Environmental Sciences, University of Geneva, Geneva 1205, Switzerland
|
14
|
Del Amparo R, Vicens A, Arenas M. The influence of heterogeneous codon frequencies along sequences on the estimation of molecular adaptation. Bioinformatics 2020; 36:430-436. [PMID: 31304972 DOI: 10.1093/bioinformatics/btz558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2019] [Revised: 07/08/2019] [Accepted: 07/11/2019] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION The nonsynonymous/synonymous substitution rate ratio (dN/dS) is a commonly used parameter to quantify molecular adaptation in protein-coding data. It is known that the estimation of dN/dS can be biased if some evolutionary processes are ignored. In this regard, common ML methods to estimate dN/dS assume invariable codon frequencies among sites, although this characteristic is rare in nature, and this assumption could bias the estimation of this parameter. RESULTS Here we studied the influence of variable codon frequencies among genetic regions on the estimation of dN/dS. We explored scenarios varying the number of genetic regions that differ in codon frequencies, the amount of variability of codon frequencies among regions and the nucleotide frequencies at each codon position among regions. We found that ignoring heterogeneous codon frequencies among regions overall leads to underestimation of dN/dS and that the bias increases with the level of heterogeneity of codon frequencies. Interestingly, we also found that varying nucleotide frequencies among regions at the first or second codon position leads to underestimation of dN/dS while variation at the third codon position leads to overestimation of dN/dS. Next, we present a methodology to reduce this bias based on the analysis of partitions presenting similar codon frequencies, and we applied it to analyze four real datasets. We conclude that accounting for heterogeneous codon frequencies along sequences is required to obtain realistic estimates of molecular adaptation through this relevant evolutionary parameter. AVAILABILITY AND IMPLEMENTATION The applied frameworks for the computer simulations of protein-coding data and estimation of molecular adaptation are SGWE and PAML, respectively. Both are publicly available and referenced in the study. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Roberto Del Amparo
  - Department of Biochemistry, Genetics and Immunology
  - Biomedical Research Center (CINBIO), University of Vigo, 36310 Vigo, Spain
- Alberto Vicens
  - Department of Biochemistry, Genetics and Immunology
  - Biomedical Research Center (CINBIO), University of Vigo, 36310 Vigo, Spain
- Miguel Arenas
  - Department of Biochemistry, Genetics and Immunology
  - Biomedical Research Center (CINBIO), University of Vigo, 36310 Vigo, Spain
|
15
|
Abstract
Coalescent simulation is a fundamental tool in modern population genetics. The msprime library provides unprecedented scalability in terms of both the simulations that can be performed and the efficiency with which the results can be processed. We show how coalescent models for population structure and demography can be constructed using a simple Python API, as well as how we can process the results of such simulations to efficiently calculate statistics of interest. We illustrate msprime's flexibility by implementing a simple (but functional) approximate Bayesian computation inference method in just a few tens of lines of code.
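To make the closing remark about approximate Bayesian computation concrete, here is a minimal rejection-ABC sketch in plain Python. It does not use the msprime API. As an assumption made for illustration, the simulator draws the number of segregating sites under the standard neutral coalescent, and the inference target is the scaled mutation rate theta with a uniform prior. All function and parameter names are hypothetical.

```python
import random


def simulate_segregating_sites(theta, n_samples, rng):
    """Draw the number of segregating sites for one neutral locus."""
    # Total branch length of the genealogy in coalescent time units:
    # while k lineages remain, the waiting time is Exp(k * (k - 1) / 2).
    total_length = sum(
        k * rng.expovariate(k * (k - 1) / 2)
        for k in range(2, n_samples + 1)
    )
    # Mutations fall on the tree as a Poisson process with rate theta / 2.
    # The Poisson count is sampled by counting unit-rate arrivals.
    budget = theta / 2 * total_length
    count, position = 0, rng.expovariate(1.0)
    while position < budget:
        count += 1
        position += rng.expovariate(1.0)
    return count


def abc_rejection(observed, n_samples, n_draws, tolerance, prior_max, rng):
    """Keep prior draws of theta whose simulated summary is near the data."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, prior_max)  # draw from the prior
        simulated = simulate_segregating_sites(theta, n_samples, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(theta)  # approximate posterior sample
    return accepted


if __name__ == "__main__":
    rng = random.Random(42)
    posterior = abc_rejection(
        observed=20, n_samples=10, n_draws=5000,
        tolerance=2, prior_max=20.0, rng=rng,
    )
    print(len(posterior), sum(posterior) / len(posterior))
```

The accepted values approximate the posterior of theta given the observed summary statistic. A smaller tolerance gives a closer approximation at the cost of fewer accepted draws.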
Affiliation(s)
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
- Konrad Lohse
  - Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK
|
16
|
Pascual-García A, Arenas M, Bastolla U. The Molecular Clock in the Evolution of Protein Structures. Syst Biol 2019; 68:987-1002. [DOI: 10.1093/sysbio/syz022] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 03/08/2018] [Revised: 03/20/2019] [Accepted: 04/09/2019] [Indexed: 12/11/2022] Open
Abstract
The molecular clock hypothesis, which states that substitutions accumulate in protein sequences at a constant rate, plays a fundamental role in molecular evolution, but it is violated when selective or mutational processes vary with time. Such violations of the molecular clock have been widely investigated for protein sequences, but not yet for protein structures. Here, we introduce a novel statistical test (Significant Clock Violations) and perform a large-scale assessment of the molecular clock in the evolution of both protein sequences and structures in three large superfamilies. After validating our method with computer simulations, we find that clock violations are generally consistent in sequence and structure evolution, but they tend to be larger and more significant in structure evolution. Moreover, changes of function assessed through Gene Ontology and InterPro terms are associated with large and significant clock violations in structure evolution. We found that almost one third of significant clock violations are significant in structure evolution but not in sequence evolution, highlighting the advantage of using structure information for assessing accelerated evolution and gathering hints of positive selection. Clock violations between closely related pairs are frequently significant in sequence evolution, consistent with the observed time dependence of the substitution rate attributed to segregation of neutral and slightly deleterious polymorphisms, but not in structure evolution, suggesting that these substitutions do not affect protein structure although they may affect stability. These results are consistent with the view that natural selection, both negative and positive, constrains protein structures more strongly than protein sequences. Our code for computing clock violations is freely available at https://github.com/ugobas/Molecular_clock.
Affiliation(s)
- Alberto Pascual-García
  - Centro de Biologia Molecular “Severo Ochoa” CSIC-UAM Cantoblanco, 28049 Madrid, Spain
  - Department of Life Sciences, Imperial College London, Silwood Park Campus, Ascot, UK
  - Institute of Integrative Biology, ETH Zürich, Zürich, Switzerland
- Miguel Arenas
  - Centro de Biologia Molecular “Severo Ochoa” CSIC-UAM Cantoblanco, 28049 Madrid, Spain
  - Department of Biochemistry, Genetics and Immunology, University of Vigo, Spain
- Ugo Bastolla
  - Centro de Biologia Molecular “Severo Ochoa” CSIC-UAM Cantoblanco, 28049 Madrid, Spain
|
17
|
The Influence of Protein Stability on Sequence Evolution: Applications to Phylogenetic Inference. Methods Mol Biol 2019; 1851:215-231. [PMID: 30298399 DOI: 10.1007/978-1-4939-8736-8_11] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 01/13/2023]
Abstract
Phylogenetic inference from protein data is traditionally based on empirical substitution models of evolution that assume that protein sites evolve independently of each other and under the same substitution process. However, it is well known that the structural properties of a protein site in the native state affect its evolution, in particular the sequence entropy and the substitution rate. Starting from the seminal proposal by Halpern and Bruno, where structural properties are incorporated in the evolutionary model through site-specific amino acid frequencies, several models have been developed to tackle the influence of protein structure on sequence evolution. Here we describe stability-constrained substitution (SCS) models that explicitly consider the stability of the native state against both unfolded and misfolded states. One of them, the mean-field model, provides an independent sites approximation that can be readily incorporated in maximum likelihood methods of phylogenetic inference, including ancestral sequence reconstruction. Next, we describe its validation with simulated and real proteins and its limitations and advantages with respect to empirical models that lack site specificity. We finally provide guidelines and recommendations to analyze protein data accounting for stability constraints, including computer simulations and inferences of protein evolution based on maximum likelihood. Some practical examples are included to illustrate these procedures.
|
18
|
Selecting among Alternative Scenarios of Human Evolution by Simulated Genetic Gradients. Genes (Basel) 2018; 9:genes9100506. [PMID: 30340387 PMCID: PMC6210830 DOI: 10.3390/genes9100506] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 09/13/2018] [Revised: 10/11/2018] [Accepted: 10/16/2018] [Indexed: 11/16/2022] Open
Abstract
Selecting among alternative scenarios of human evolution is nowadays a common methodology to investigate the history of our species. This strategy is usually based on computer simulations of genetic data under different evolutionary scenarios, followed by a fitting of the simulated data with the real data. A recent trend in the investigation of ancestral evolutionary processes of modern humans is the application of genetic gradients as a measure of fitting, since evolutionary processes such as range expansions, range contractions, and population admixture (among others) can lead to different genetic gradients. In addition, this strategy allows the analysis of the genetic causes of the observed genetic gradients. Here, we review recent findings on the selection among alternative scenarios of human evolution based on simulated genetic gradients, including pros and cons. First, we describe common methodologies to simulate genetic gradients and apply them to select among alternative scenarios of human evolution. Next, we review previous studies on the influence of range expansions, population admixture, last glacial period, and migration with long-distance dispersal on genetic gradients for some regions of the world. Finally, we discuss this analytical approach, including technical limitations, required improvements, and advice. Although here we focus on human evolution, this approach could be extended to study other species.
|
19
|
Branco C, Velasco M, Benguigui M, Currat M, Ray N, Arenas M. Consequences of diverse evolutionary processes on american genetic gradients of modern humans. Heredity (Edinb) 2018; 121:548-556. [PMID: 30022169 DOI: 10.1038/s41437-018-0122-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 04/26/2018] [Revised: 07/02/2018] [Accepted: 07/03/2018] [Indexed: 11/09/2022] Open
Abstract
European genetic gradients of modern humans were initially interpreted as a consequence of the demic diffusion of expanding Neolithic farmers. However, recent studies showed that these gradients may also be influenced by other evolutionary processes such as population admixture or range contractions. Genetic gradients were observed in the Americas, although their specific evolutionary causes were not investigated. Here we extended the approach used to study genetic gradients in Europe to analyze the influence of diverse evolutionary scenarios on American genetic gradients. Using extensive computer simulations, we evaluated the impact of (i) admixture between expansion waves of modern humans, (ii) the presence of ice sheets during the last glacial maximum (LGM) and (iii) long-distance dispersal (LDD) events on the genetic gradients (detected by principal component analysis) of the entire continent, North America and South America. The specific simulation of North and South America showed that genetic gradients are usually orthogonal to the direction of range expansions (either expansions from Bering or subsequent re-expansions to recolonize northern regions after the ice sheets melted), and we suggest that they result from allele surfing processes. Conversely, our results on the entire continent show a northwest-southeast gradient obtained with any scenario, which we interpreted as a consequence of isolation by distance along the long length of the continent. These findings suggest that distinct genetic gradients can be detected in different regions of the Americas and that subcontinental regions present gradients more sensitive to evolutionary and environmental factors (such as LDD and the LGM) than the whole continent.
Affiliation(s)
- Catarina Branco
  - Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo, Spain
  - Institute of Molecular Pathology and Immunology of the University of Porto (IPATIMUP), Porto, Portugal
  - Instituto de Investigação e Inovação em Saúde (i3S), Universidade do Porto, Porto, Portugal
  - Department of Genetics and Biotechnology, University of Trás-os-Montes and Alto Douro (UTAD), Vila Real, Portugal
- Miguel Velasco
  - Centre for Molecular Biology "Severo Ochoa", Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain
- Macarena Benguigui
  - Centre for Molecular Biology "Severo Ochoa", Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain
- Mathias Currat
  - Anthropology, Genetics and Peopling History Lab, Department of Genetics & Evolution - Anthropology Unit, University of Geneva, Geneva, Switzerland
  - Institute of Genetics and Genomics in Geneva (IGE3), University of Geneva, Geneva, Switzerland
- Nicolas Ray
  - EnviroSPACE Lab, Institute for Environmental Sciences, University of Geneva, Geneva, Switzerland
  - Institute of Global Health, University of Geneva, Geneva, Switzerland
- Miguel Arenas
  - Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo, Spain
  - Institute of Molecular Pathology and Immunology of the University of Porto (IPATIMUP), Porto, Portugal
  - Instituto de Investigação e Inovação em Saúde (i3S), Universidade do Porto, Porto, Portugal
  - Centre for Molecular Biology "Severo Ochoa", Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain
|
20
|
Pimenta J, Lopes AM, Comas D, Amorim A, Arenas M. Evaluating the Neolithic Expansion at Both Shores of the Mediterranean Sea. Mol Biol Evol 2017; 34:3232-3242. [DOI: 10.1093/molbev/msx256] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Indexed: 01/08/2023] Open
|
21
|
Abstract
Molecular population genetics aims to explain genetic variation and molecular evolution from population genetics principles. The field was born 50 years ago with the first measures of genetic variation in allozyme loci, continued with the nucleotide sequencing era, and is currently in the era of population genomics. During this period, molecular population genetics has been revolutionized by progress in data acquisition and theoretical developments. The conceptual elegance of the neutral theory of molecular evolution or the footprint carved by natural selection on the patterns of genetic variation are two examples of the vast number of inspiring findings of population genetics research. Since the inception of the field, Drosophila has been the prominent model species: molecular variation in populations was first described in Drosophila and most of the population genetics hypotheses were tested in Drosophila species. In this review, we describe the main concepts, methods, and landmarks of molecular population genetics, using the Drosophila model as a reference. We describe the different genetic data sets made available by advances in molecular technologies, and the theoretical developments fostered by these data. Finally, we review the results and new insights provided by the population genomics approach, and conclude by enumerating challenges and new lines of inquiry posed by increasingly large population scale sequence data.
|
22
|
Pelletier A, Obbard ME, Harnden M, McConnell S, Howe EJ, Burrows FG, White BN, Kyle CJ. Determining causes of genetic isolation in a large carnivore (Ursus americanus) population to direct contemporary conservation measures. PLoS One 2017; 12:e0172319. [PMID: 28235066 PMCID: PMC5325280 DOI: 10.1371/journal.pone.0172319] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.9] [Received: 09/02/2016] [Accepted: 02/02/2017] [Indexed: 11/30/2022] Open
Abstract
The processes leading to genetic isolation influence a population’s local extinction risk, and should thus be identified before conservation actions are implemented. Natural or human-induced circumstances can result in historical or contemporary barriers to gene flow and/or demographic bottlenecks. Distinguishing between these hypotheses can be achieved by comparing genetic diversity and differentiation in isolated vs. continuous neighboring populations. In Ontario, American black bears (Ursus americanus) are continuously distributed, genetically diverse, and exhibit an isolation-by-distance structuring pattern, except on the Bruce Peninsula (BP). To identify the processes that led to the genetic isolation of BP black bears, we modelled various levels of historical and contemporary migration and population size reductions using forward simulations. We compared simulation results with empirical genetic indices from Ontario black bear populations under different levels of geographic isolation, and conducted additional simulations to determine if translocations could help achieve genetic restoration. From a genetic standpoint, conservation concerns for BP black bears are warranted because our results show that: i) a recent demographic bottleneck associated with recently reduced migration best explains the low genetic diversity on the BP; and ii) under sustained isolation, BP black bears could lose between 70% and 80% of their rare alleles within 100 years. Although restoring migration corridors would be the most effective method to enhance long-term genetic diversity and prevent inbreeding, it is unrealistic to expect connectivity to be re-established. Current levels of genetic diversity could be maintained by successfully translocating 10 bears onto the peninsula every 5 years. 
Such regular translocations may be more practical than landscape restoration, because areas connecting the peninsula to nearby mainland black bear populations have been irreversibly modified by humans, and form strong barriers to movement.
Affiliation(s)
- Agnès Pelletier
  - Environmental and Life Sciences Program, Trent University, Peterborough, ON, Canada
  - Department of Biology, Trent University, Peterborough, ON, Canada
- Martyn E. Obbard
  - Wildlife Research and Monitoring Section, Ontario Ministry of Natural Resources and Forestry, Peterborough, ON, Canada
- Matthew Harnden
  - Department of Biology, Trent University, Peterborough, ON, Canada
- Sabine McConnell
  - Department of Computing and Information Systems, Trent University, Peterborough, ON, Canada
- Eric J. Howe
  - Wildlife Research and Monitoring Section, Ontario Ministry of Natural Resources and Forestry, Peterborough, ON, Canada
- Frank G. Burrows
  - Bruce Peninsula National Park and Fathom Five National Marine Park, Parks Canada, Tobermory, ON, Canada
- Bradley N. White
  - Department of Biology, Trent University, Peterborough, ON, Canada
  - Forensic Science Department, Trent University, Peterborough, ON, Canada
- Christopher J. Kyle
  - Environmental and Life Sciences Program, Trent University, Peterborough, ON, Canada
  - Department of Biology, Trent University, Peterborough, ON, Canada
  - Forensic Science Department, Trent University, Peterborough, ON, Canada
|
23
|
Montemuiño C, Espinosa A, Moure JC, Vera G, Hernández P, Ramos-Onsins S. Approaching Long Genomic Regions and Large Recombination Rates with msParSm as an Alternative to MaCS. Evol Bioinform Online 2016; 12:223-228. [PMID: 27721650 PMCID: PMC5047705 DOI: 10.4137/ebo.s40268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2016] [Revised: 07/19/2016] [Accepted: 07/21/2016] [Indexed: 11/05/2022] Open
Abstract
The msParSm application is an evolution of msPar, the parallel version of the coalescent simulation program ms. It removes the limitation on simulating long stretches of DNA sequences with large recombination rates, without compromising the accuracy of the standard coalescent. This work introduces msParSm, describes its significant performance improvements over msPar and its shared-memory parallelization details, and shows how it can achieve execution times similar to, if not better than, those of MaCS. Two case studies with different mutation rates were analyzed, one approximating the human average and the other approximating the Drosophila melanogaster average. Source code is available at https://github.com/cmontemuino/msparsm.
Affiliation(s)
- Carlos Montemuiño
  - Computer Architecture and Operating Systems Department (CAOS), Universitat Autònoma de Barcelona, Bellaterra, Spain
- Antonio Espinosa
  - Computer Architecture and Operating Systems Department (CAOS), Universitat Autònoma de Barcelona, Bellaterra, Spain
- Juan C Moure
  - Computer Architecture and Operating Systems Department (CAOS), Universitat Autònoma de Barcelona, Bellaterra, Spain
- Gonzalo Vera
  - Centre for Research in Agricultural Genomics (CRAG) Consortium CSIC-IRTA-UAB-UB Edifici CRAG, Campus UAB, Bellaterra, Spain
- Porfidio Hernández
  - Computer Architecture and Operating Systems Department (CAOS), Universitat Autònoma de Barcelona, Bellaterra, Spain
- Sebastián Ramos-Onsins
  - Centre for Research in Agricultural Genomics (CRAG) Consortium CSIC-IRTA-UAB-UB Edifici CRAG, Campus UAB, Bellaterra, Spain
|
24
|
Kelleher J, Etheridge AM, McVean G. Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLoS Comput Biol 2016; 12:e1004842. [PMID: 27145223 PMCID: PMC4856371 DOI: 10.1371/journal.pcbi.1004842] [Citation(s) in RCA: 328] [Impact Index Per Article: 41.0] [Received: 12/10/2015] [Accepted: 03/02/2016] [Indexed: 01/23/2023] Open
Abstract
A central challenge in the analysis of genetic variation is to provide realistic genome simulation across millions of samples. Present day coalescent simulations do not scale well, or use approximations that fail to capture important long-range linkage properties. Analysing the results of simulations also presents a substantial challenge, as current methods to store genealogies consume a great deal of space, are slow to parse and do not take advantage of shared structure in correlated trees. We solve these problems by introducing sparse trees and coalescence records as the key units of genealogical analysis. Using these tools, exact simulation of the coalescent with recombination for chromosome-sized regions over hundreds of thousands of samples is possible, and substantially faster than present-day approximate methods. We can also analyse the results orders of magnitude more quickly than with existing methods. Our understanding of the distribution of genetic variation in natural populations has been driven by mathematical models of the underlying biological and demographic processes. A key strength of such coalescent models is that they enable efficient simulation of data we might see under a variety of evolutionary scenarios. However, current methods are not well suited to simulating genome-scale data sets on hundreds of thousands of samples, which is essential if we are to understand the data generated by population-scale sequencing projects. Similarly, processing the results of large simulations also presents researchers with a major challenge, as it can take many days just to read the data files. In this paper we solve these problems by introducing a new way to represent information about the ancestral process. This new representation leads to huge gains in simulation speed and storage efficiency so that large simulations complete in minutes and the output files can be processed in seconds.
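The idea of storing a genealogy as a flat list of coalescence events can be sketched as follows. This is a simplified illustration, not the data structure from the paper: it assumes a single non-recombining locus under the standard Kingman coalescent, so each record is just a hypothetical `(parent, children, time)` tuple with no genomic interval attached.

```python
import random


def simulate_coalescent_records(n_samples, rng=None):
    """Simulate one Kingman coalescent tree as a list of records.

    Each record is (parent_node, (child_a, child_b), time). Samples are
    nodes 0..n_samples-1 and internal nodes are numbered in the order
    in which they are created.
    """
    rng = rng or random.Random()
    lineages = list(range(n_samples))
    next_node = n_samples
    time = 0.0
    records = []
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time until the next pair of lineages coalesces.
        time += rng.expovariate(k * (k - 1) / 2)
        a, b = rng.sample(lineages, 2)
        lineages.remove(a)
        lineages.remove(b)
        records.append((next_node, (a, b), time))
        lineages.append(next_node)
        next_node += 1
    return records


if __name__ == "__main__":
    for parent, children, time in simulate_coalescent_records(5, random.Random(1)):
        print(parent, children, round(time, 4))
```

A tree on n samples needs only n - 1 such records. With recombination, records would also carry the genomic interval over which they apply, which is what allows neighbouring trees to share structure.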
Affiliation(s)
- Jerome Kelleher
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
- Gilean McVean
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
  - Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, United Kingdom
|
25
|
Currat M, Gerbault P, Di D, Nunes JM, Sanchez-Mazas A. Forward-in-Time, Spatially Explicit Modeling Software to Simulate Genetic Lineages Under Selection. Evol Bioinform Online 2016; 11:27-39. [PMID: 26949332 PMCID: PMC4768942 DOI: 10.4137/ebo.s33488] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 10/02/2015] [Revised: 12/10/2015] [Accepted: 12/13/2015] [Indexed: 12/20/2022] Open
Abstract
SELECTOR is a software package for studying the evolution of multiallelic genes under balancing or positive selection while simulating complex evolutionary scenarios that integrate demographic growth and migration in a spatially explicit population framework. Parameters can be varied both in space and time to account for geographical, environmental, and cultural heterogeneity. SELECTOR can be used within an approximate Bayesian computation estimation framework. We first describe the principles of SELECTOR and validate the algorithms by comparing its outputs for simple models with theoretical expectations. Then, we show how it can be used to investigate genetic differentiation of loci under balancing selection in interconnected demes with spatially heterogeneous gene flow. We identify situations in which balancing selection reduces genetic differentiation between population groups compared with neutrality and explain conflicting outcomes observed for human leukocyte antigen loci. These results and three previously published applications demonstrate that SELECTOR is efficient and robust for building insight into human settlement history and evolution.
Affiliation(s)
- Mathias Currat
- Laboratory of Anthropology, Genetics and Peopling History, Department of Genetics and Evolution - Anthropology Unit, University of Geneva, Geneva, Switzerland
- Pascale Gerbault
- Research Department of Genetics, Evolution and Environment, University College London, London, UK; Department of Anthropology, University College London, London, UK
- Da Di
- Laboratory of Anthropology, Genetics and Peopling History, Department of Genetics and Evolution - Anthropology Unit, University of Geneva, Geneva, Switzerland
- José M Nunes
- Laboratory of Anthropology, Genetics and Peopling History, Department of Genetics and Evolution - Anthropology Unit, University of Geneva, Geneva, Switzerland
- Alicia Sanchez-Mazas
- Laboratory of Anthropology, Genetics and Peopling History, Department of Genetics and Evolution - Anthropology Unit, University of Geneva, Geneva, Switzerland
|
26
|
Dib L, Meyer X, Artimo P, Ioannidis V, Stockinger H, Salamin N. Coev-web: a web platform designed to simulate and evaluate coevolving positions along a phylogenetic tree. BMC Bioinformatics 2015; 16:394. [PMID: 26597459 PMCID: PMC4657261 DOI: 10.1186/s12859-015-0785-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 05/04/2015] [Accepted: 10/20/2015] [Indexed: 01/18/2023] Open
Abstract
BACKGROUND Available methods to simulate nucleotide or amino acid data typically use Markov models to simulate each position independently. These approaches are not appropriate to assess the performance of combinatorial and probabilistic methods that look for coevolving positions in nucleotide or amino acid sequences. RESULTS We have developed a web-based platform that gives user-friendly access to two phylogenetic-based methods implementing the Coev model: the evaluation of coevolving scores and the simulation of coevolving positions. We have also extended the capabilities of the Coev model to allow for the generalization of the alphabet used in the Markov model, which can now analyse both nucleotide and amino acid data sets. The simulation of coevolving positions is novel and builds upon the developments of the Coev model. It allows users to simulate pairs of dependent nucleotide or amino acid positions. CONCLUSIONS The main focus of our paper is the new simulation method we present for coevolving positions. The implementation of this method is embedded within the web platform Coev-web that is freely accessible at http://coev.vital-it.ch/, and was tested in most modern web browsers.
Affiliation(s)
- Linda Dib
- Department of Ecology and Evolution, University of Lausanne, Lausanne, 1015, Switzerland; SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland; Laboratoire de recherche en neuroimagerie, CHUV, Lausanne, 1011, Switzerland.
- Xavier Meyer
- Department of Ecology and Evolution, University of Lausanne, Lausanne, 1015, Switzerland; SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland; Computer Science department, University of Geneva, Carouge, 1227, Switzerland.
- Panu Artimo
- SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland.
- Heinz Stockinger
- SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland.
- Nicolas Salamin
- Department of Ecology and Evolution, University of Lausanne, Lausanne, 1015, Switzerland; SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland.
|
27
|
Arenas M. Trends in substitution models of molecular evolution. Front Genet 2015; 6:319. [PMID: 26579193 PMCID: PMC4620419 DOI: 10.3389/fgene.2015.00319] [Citation(s) in RCA: 78] [Impact Index Per Article: 8.7] [Received: 07/27/2015] [Accepted: 10/09/2015] [Indexed: 11/13/2022] Open
Abstract
Substitution models of evolution describe the process of genetic variation through fixed mutations and constitute the basis of evolutionary analysis at the molecular level. Almost 40 years after the development of the first substitution models, highly sophisticated and data-specific substitution models continue to emerge with the aim of better mimicking real evolutionary processes. Here I describe current trends in substitution models of DNA, codon and amino acid sequence evolution, including advantages and pitfalls of the most popular models. The perspective concludes that despite the large number of currently available substitution models, further research is required for more realistic modeling, especially for DNA coding and amino acid data. Additionally, the development of more accurate complex models should be coupled with new implementations and improvements of methods and frameworks for substitution model selection and downstream evolutionary analysis.
Affiliation(s)
- Miguel Arenas
- Institute of Molecular Pathology and Immunology of the University of Porto, Porto, Portugal
|
28
|
Spielman SJ, Wilke CO. Pyvolve: A Flexible Python Module for Simulating Sequences along Phylogenies. PLoS One 2015; 10:e0139047. [PMID: 26397960 PMCID: PMC4580465 DOI: 10.1371/journal.pone.0139047] [Citation(s) in RCA: 70] [Impact Index Per Article: 7.8] [Received: 06/15/2015] [Accepted: 09/07/2015] [Indexed: 11/19/2022] Open
Abstract
We introduce Pyvolve, a flexible Python module for simulating genetic data along a phylogeny using continuous-time Markov models of sequence evolution. Easily incorporated into Python bioinformatics pipelines, Pyvolve can simulate sequences according to most standard models of nucleotide, amino-acid, and codon sequence evolution. All model parameters are fully customizable. Users can additionally specify custom evolutionary models, with custom rate matrices and/or states to evolve. This flexibility makes Pyvolve a convenient framework not only for simulating sequences under a wide variety of conditions, but also for developing and testing new evolutionary models. Pyvolve is an open-source project under a FreeBSD license, and it is available for download, along with a detailed user-manual and example scripts, from http://github.com/sjspielman/pyvolve.
Affiliation(s)
- Stephanie J. Spielman
- Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute of Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712, United States of America
- Claus O. Wilke
- Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute of Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712, United States of America
|
29
|
Ewing GB, Reiff PA, Jensen JD. PopPlanner: visually constructing demographic models for simulation. Front Genet 2015; 6:150. [PMID: 25954301 PMCID: PMC4407479 DOI: 10.3389/fgene.2015.00150] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 01/12/2015] [Accepted: 03/31/2015] [Indexed: 11/18/2022] Open
Abstract
Currently there are a number of coalescent simulation programs that support a wide range of features, such as arbitrary demographic models, migration, and substructure. The model is typically defined with either text files or command line switches. Although this has proven to be a powerful method of defining models of high complexity, it is often error prone and difficult to read without familiarity both with command lines and the program in question. An intuitive GUI-based population structure program that can both read and write applicable command lines would dramatically simplify the construction, modification, and error checking of such models by a wider user base. Results: PopPlanner is a tool to both construct and inspect complicated demographic models visually with a GUI where the user's primary interaction is through mouse gestures. Because of their popularity, we focus on ms and by extension msms, command line coalescent simulation programs. Our program can be used to find errors with existing command lines, or to build original command lines. Furthermore, the graphical output supports a number of editing and output features including export of publication quality figures.
Affiliation(s)
- Gregory B Ewing
- School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Pauline A Reiff
- School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Jeffrey D Jensen
- School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
|
30
|
McManus KF. popRange: a highly flexible spatially and temporally explicit Wright-Fisher simulator. Source Code for Biology and Medicine 2015; 10:6. [PMID: 25883677 PMCID: PMC4399400 DOI: 10.1186/s13029-015-0036-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 10/09/2014] [Accepted: 03/30/2015] [Indexed: 01/31/2023]
Abstract
Background Sequencing and genotyping technology advancements have led to massive, growing repositories of spatially explicit genetic data and increasing quantities of temporal data (i.e., ancient DNA). These data will allow more complex and fine-scale inferences about population history than ever before; however, new methods are needed to test complex hypotheses. Results This article presents popRange, a forward genetic simulator, which incorporates large-scale genetic data with stochastic spatially and temporally explicit demographic and selective models. Features such as spatially and temporally variable selection coefficients and demography are incorporated in a highly flexible manner. popRange is implemented as an R package and presented with an example simulation exploring a selected allele’s trajectory in multiple subpopulations. Conclusions popRange allows researchers to evaluate and test complex scenarios by simulating large-scale data with complicated demographic and selective features. popRange is available for download at http://cran.r-project.org/web/packages/popRange/index.html.
Affiliation(s)
- Kimberly F McManus
- Departments of Biology and Biomedical Informatics, Stanford University, Stanford, CA 94305 USA; Departments of Biomedical Informatics, Stanford University, Stanford, CA 94305 USA
|
31
|
Pérez-Losada M, Arenas M, Galán JC, Palero F, González-Candelas F. Recombination in viruses: mechanisms, methods of study, and evolutionary consequences. Infection, Genetics and Evolution: Journal of Molecular Epidemiology and Evolutionary Genetics in Infectious Diseases 2015; 30:296-307. [PMID: 25541518 PMCID: PMC7106159 DOI: 10.1016/j.meegid.2014.12.022] [Citation(s) in RCA: 198] [Impact Index Per Article: 22.0] [Received: 11/23/2014] [Revised: 12/15/2014] [Accepted: 12/17/2014] [Indexed: 02/08/2023]
Abstract
Recombination is a pervasive process generating diversity in most viruses. It joins variants that arise independently within the same molecule, creating new opportunities for viruses to overcome selective pressures and to adapt to new environments and hosts. Consequently, the analysis of viral recombination attracts the interest of clinicians, epidemiologists, molecular biologists and evolutionary biologists. In this review we present an overview of three major areas related to viral recombination: (i) the molecular mechanisms that underlie recombination in model viruses, including DNA-viruses (Herpesvirus) and RNA-viruses (Human Influenza Virus and Human Immunodeficiency Virus), (ii) the analytical procedures to detect recombination in viral sequences and to determine the recombination breakpoints, along with the conceptual and methodological tools currently used and a brief overview of the impact of new sequencing technologies on the detection of recombination, and (iii) the major areas in the evolutionary analysis of viral populations on which recombination has an impact. These include the evaluation of selective pressures acting on viral populations, the application of evolutionary reconstructions in the characterization of centralized genes for vaccine design, and the evaluation of linkage disequilibrium and population structure.
Affiliation(s)
- Marcos Pérez-Losada
- CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Portugal; Computational Biology Institute, George Washington University, Ashburn, VA 20147, USA
- Miguel Arenas
- Centre for Molecular Biology "Severo Ochoa", Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain
- Juan Carlos Galán
- Servicio de Microbiología, Hospital Ramón y Cajal and Instituto Ramón y Cajal de Investigación Sanitaria (IRYCIS), Madrid, Spain; CIBER en Epidemiología y Salud Pública, Spain
- Ferran Palero
- CIBER en Epidemiología y Salud Pública, Spain; Unidad Mixta Infección y Salud Pública, FISABIO-Universitat de València, Valencia, Spain
- Fernando González-Candelas
- CIBER en Epidemiología y Salud Pública, Spain; Unidad Mixta Infección y Salud Pública, FISABIO-Universitat de València, Valencia, Spain.
|
32
|
Peng B, Chen HS, Mechanic LE, Racine B, Clarke J, Gillanders E, Feuer EJ. Genetic data simulators and their applications: an overview. Genet Epidemiol 2014; 39:2-10. [PMID: 25504286 DOI: 10.1002/gepi.21876] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 06/15/2014] [Revised: 09/14/2014] [Accepted: 10/31/2014] [Indexed: 11/10/2022]
Abstract
Computer simulations have played an indispensable role in the development and applications of statistical models and methods for genetic studies across multiple disciplines. The need to simulate complex evolutionary scenarios and pseudo-datasets for various studies has fueled the development of dozens of computer programs with varying reliability, performance, and application areas. To help researchers compare and choose the most appropriate simulators for their studies, we have created the genetic simulation resources (GSR) website, which allows authors of simulation software to register their applications and describe them with more than 160 defined attributes. This article summarizes the properties of 93 simulators currently registered at GSR and provides an overview of the development and applications of genetic simulators. Unlike other review articles that address technical issues or compare simulators for particular application areas, we focus on software development, maintenance, and features of simulators, often from a historical perspective. Publications that cite these simulators are used to summarize both the applications of genetic simulations and the utilization of simulators.
Affiliation(s)
- Bo Peng
- Department of Bioinformatics and Computational Biology, The University of Texas, MD Anderson Cancer Center, Houston, Texas, United States of America
|
33
|
Groussin M, Hobbs JK, Szöllősi GJ, Gribaldo S, Arcus VL, Gouy M. Toward more accurate ancestral protein genotype-phenotype reconstructions with the use of species tree-aware gene trees. Mol Biol Evol 2014; 32:13-22. [PMID: 25371435 PMCID: PMC4271536 DOI: 10.1093/molbev/msu305] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Indexed: 12/17/2022] Open
Abstract
The resurrection of ancestral proteins provides direct insight into how natural selection has shaped proteins found in nature. By tracing substitutions along a gene phylogeny, ancestral proteins can be reconstructed in silico and subsequently synthesized in vitro. This elegant strategy reveals the complex mechanisms responsible for the evolution of protein functions and structures. However, to date, all protein resurrection studies have used simplistic approaches for ancestral sequence reconstruction (ASR), including the assumption that a single sequence alignment alone is sufficient to accurately reconstruct the history of the gene family. The impact of such shortcuts on conclusions about ancestral functions has not been investigated. Here, we show with simulations that utilizing information on species history using a model that accounts for the duplication, horizontal transfer, and loss (DTL) of genes statistically increases ASR accuracy. This underscores the importance of the tree topology in the inference of putative ancestors. We validate our in silico predictions using in vitro resurrection of the LeuB enzyme for the ancestor of the Firmicutes, a major and ancient bacterial phylum. With this particular protein, our experimental results demonstrate that information on the species phylogeny results in a biochemically more realistic and kinetically more stable ancestral protein. Additional resurrection experiments with different proteins are necessary to statistically quantify the impact of using species tree-aware gene trees on ancestral protein phenotypes. Nonetheless, our results suggest the need for incorporating both sequence and DTL information in future studies of protein resurrections to accurately define the genotype-phenotype space in which proteins diversify.
Affiliation(s)
- Mathieu Groussin
- Laboratoire de Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, UMR5558, Villeurbanne, France
- Joanne K Hobbs
- Department of Biological Sciences, University of Waikato, Hamilton, New Zealand
- Gergely J Szöllősi
- Laboratoire de Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, UMR5558, Villeurbanne, France; ELTE-MTA "Lendület" Biophysics Research Group, Pázmány, Budapest, Hungary
- Simonetta Gribaldo
- Unité de Biologie Moléculaire du Gène chez les Extrêmophiles, Département de Microbiologie, Institut Pasteur, Paris cedex, France
- Vickery L Arcus
- Department of Biological Sciences, University of Waikato, Hamilton, New Zealand
- Manolo Gouy
- Laboratoire de Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, UMR5558, Villeurbanne, France
|
34
|
Dellicour S, Kastally C, Hardy OJ, Mardulyn P. Comparing phylogeographic hypotheses by simulating DNA sequences under a spatially explicit model of coalescence. Mol Biol Evol 2014; 31:3359-72. [PMID: 25261404 DOI: 10.1093/molbev/msu277] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 11/13/2022] Open
Abstract
Computer simulations of genetic data are increasingly used to investigate the impact of complex historical scenarios on patterns of genetic variation. Yet, in most empirical studies, relatively large portions of species ranges are often treated as panmictic populations, ignoring the underlying spatial context. In some cases, however, a more accurate spatial model is required. We use a spatially explicit model of coalescence (easily constructed by overlaying a two-dimensional grid on maps displaying an estimate of past and current species ranges) to evaluate the potential of several summary statistics to differentiate three typical phylogeographic scenarios. We first explore the variation of each summary statistic within the boundaries of each phylogeographic scenario, and identify those that appear most promising for a comparison of historical scenarios and/or to infer historical parameters. We then combine a selected set of summary statistics in a single chi-square statistic and evaluate whether it can be used to differentiate past geographic fragmentation or range expansion from a simple scenario of isolation by distance. We also investigate the benefits of using a spatially explicit model by comparing its performance to alternative models that are less spatially explicit (lower geographic resolution). The results identify conditions in which each summary statistic is useful to infer the evolution of a species range, and allow us to validate our spatially explicit model of coalescence and our procedure to compare simulated and observed sequence data. We also provide a detailed description of the spatially explicit model of coalescence used, which is currently lacking.
Affiliation(s)
- Simon Dellicour
- Evolutionary Biology and Ecology, Université Libre de Bruxelles, Brussels, Belgium
- Chedly Kastally
- Evolutionary Biology and Ecology, Université Libre de Bruxelles, Brussels, Belgium
- Olivier J Hardy
- Evolutionary Biology and Ecology, Université Libre de Bruxelles, Brussels, Belgium
- Patrick Mardulyn
- Evolutionary Biology and Ecology, Université Libre de Bruxelles, Brussels, Belgium
|
35
|
Benguigui M, Arenas M. Spatial and temporal simulation of human evolution. Methods, frameworks and applications. Curr Genomics 2014; 15:245-55. [PMID: 25132795 PMCID: PMC4133948 DOI: 10.2174/1389202915666140506223639] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 03/14/2014] [Revised: 04/05/2014] [Accepted: 05/04/2014] [Indexed: 01/29/2023] Open
Abstract
Analyses of human evolution are fundamental to understanding the current gradients of human diversity. In this regard, genetic samples collected from current populations together with archaeological data are the most important resources to study human evolution. However, they are often insufficient to properly evaluate a variety of evolutionary scenarios, leading to continuous debates and discussions. A commonly applied strategy consists of the use of computer simulations, based on evolutionary models that are as realistic as possible, to evaluate alternative evolutionary scenarios through statistical correlations with the real data. Computer simulations can also be applied to estimate evolutionary parameters or to study the role of each parameter on the evolutionary process. Here we review the most commonly used methods and evolutionary frameworks to perform realistic spatially explicit computer simulations of human evolution. Although we focus on human evolution, most of the methods and software we describe can also be used to study other species. We also describe the importance of considering spatially explicit models to better mimic human evolutionary scenarios based on a variety of phenomena such as range expansions, range shifts, range contractions, sex-biased dispersal, long-distance dispersal or admixtures of populations. We finally discuss future implementations to improve current spatially explicit simulations and their derived applications in human evolution.
Affiliation(s)
- Macarena Benguigui
- Centre for Molecular Biology "Severo Ochoa", Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain
- Miguel Arenas
- Centre for Molecular Biology "Severo Ochoa", Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain
|
36
|
Bielejec F, Lemey P, Carvalho LM, Baele G, Rambaut A, Suchard MA. πBUSS: a parallel BEAST/BEAGLE utility for sequence simulation under complex evolutionary scenarios. BMC Bioinformatics 2014; 15:133. [PMID: 24885610 PMCID: PMC4020384 DOI: 10.1186/1471-2105-15-133] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 10/10/2013] [Accepted: 04/24/2014] [Indexed: 01/12/2023] Open
Abstract
Background Simulated nucleotide or amino acid sequences are frequently used to assess the performance of phylogenetic reconstruction methods. BEAST, a Bayesian statistical framework that focuses on reconstructing time-calibrated molecular evolutionary processes, supports a wide array of evolutionary models, but lacked matching machinery for simulation of character evolution along phylogenies. Results We present a flexible Monte Carlo simulation tool, called πBUSS, that employs the BEAGLE high performance library for phylogenetic computations to rapidly generate large sequence alignments under complex evolutionary models. πBUSS sports a user-friendly graphical user interface (GUI) that allows combining a rich array of models across an arbitrary number of partitions. A command-line interface mirrors the options available through the GUI and facilitates scripting in large-scale simulation studies. πBUSS may serve as an easy-to-use, standard sequence simulation tool, but the available models and data types are particularly useful to assess the performance of complex BEAST inferences. The connection with BEAST is further strengthened through the use of a common extensible markup language (XML), which also allows more advanced evolutionary models to be specified. To support simulation under the latter, as well as to support simulation and analysis in a single run, we also add the πBUSS core simulation routine to the list of BEAST XML parsers. Conclusions πBUSS offers a unique combination of flexibility and ease-of-use for sequence simulation under realistic evolutionary scenarios. Through different interfaces, πBUSS supports simulation studies ranging from modest endeavors for illustrative purposes to complex and large-scale assessments of evolutionary inference procedures. Applications are not restricted to the BEAST framework, or even time-measured evolutionary histories, and πBUSS can be connected to various other programs using standard input and output formats.
Affiliation(s)
- Filip Bielejec
- Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium.
|
37
|
Hoban S. An overview of the utility of population simulation software in molecular ecology. Mol Ecol 2014; 23:2383-401. [DOI: 10.1111/mec.12741] [Citation(s) in RCA: 64] [Impact Index Per Article: 6.4] [Received: 11/14/2013] [Revised: 03/22/2014] [Accepted: 03/26/2014] [Indexed: 01/12/2023]
Affiliation(s)
- Sean Hoban
- National Institute for Mathematical and Biological Synthesis; University of Tennessee; 1122 Volunteer Blvd. Suite 110A Knoxville TN 37996-3410 USA
|
38
|
Arenas M, Posada D. Simulation of genome-wide evolution under heterogeneous substitution models and complex multispecies coalescent histories. Mol Biol Evol 2014; 31:1295-301. [PMID: 24557445 PMCID: PMC3995339 DOI: 10.1093/molbev/msu078] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Indexed: 01/28/2023] Open
Abstract
Genomic evolution can be highly heterogeneous. Here, we introduce a new framework to simulate genome-wide sequence evolution under a variety of substitution models that may change along the genome and the phylogeny, following complex multispecies coalescent histories that can include recombination, demographics, longitudinal sampling, population subdivision/species history, and migration. A key aspect of our simulation strategy is that the heterogeneity of the whole evolutionary process can be parameterized according to statistical prior distributions specified by the user. We used this framework to carry out a study of the impact of variable codon frequencies across genomic regions on the estimation of the genome-wide nonsynonymous/synonymous ratio. We found that both variable codon frequencies across genes and rate variation among sites and regions can lead to severe underestimation of the global dN/dS values. The program SGWE—Simulation of Genome-Wide Evolution—is freely available from http://code.google.com/p/sgwe-project/, including extensive documentation and detailed examples.
Affiliation(s)
- Miguel Arenas
- Centre for Molecular Biology "Severo Ochoa," Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain
|
39
|
Bay RA, Ramakrishnan U, Hadly EA. A call for tiger management using "reserves" of genetic diversity. J Hered 2013; 105:295-302. [PMID: 24336928 DOI: 10.1093/jhered/est086] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022] Open
Abstract
Tigers (Panthera tigris), like many large carnivores, are threatened by anthropogenic impacts, primarily habitat loss and poaching. Current conservation plans for tigers focus on population expansion, with the goal of doubling census size in the next 10 years. Previous studies have shown that because the demographic decline was recent, tiger populations still retain a large amount of genetic diversity. Although maintaining this diversity is extremely important to avoid deleterious effects of inbreeding, management plans have yet to consider predictive genetic models. We used coalescent simulations based on previously sequenced mitochondrial fragments (n = 125) from 5 of 6 extant subspecies to predict the population growth needed to maintain current genetic diversity over the next 150 years. We found that the level of gene flow between populations has a large effect on the local population growth necessary to maintain genetic diversity, without which tigers may face decreases in fitness. In the absence of gene flow, we demonstrate that maintaining genetic diversity is impossible based on known demographic parameters for the species. Thus, managing for the genetic diversity of the species should be prioritized over the riskier preservation of distinct subspecies. These predictive simulations provide unique management insights, hitherto not possible using existing analytical methods.
Affiliation(s)
- Rachael A Bay
- Department of Biology, Stanford University, Stanford, CA 94305
|
40
|
Arenas M. The importance and application of the ancestral recombination graph. Front Genet 2013; 4:206. [PMID: 24133504 PMCID: PMC3796270 DOI: 10.3389/fgene.2013.00206] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 08/28/2013] [Accepted: 09/24/2013] [Indexed: 11/13/2022] Open
Affiliation(s)
- Miguel Arenas
- Centre for Molecular Biology “Severo Ochoa,” Consejo Superior de Investigaciones Científicas, Universidad Autónoma de Madrid, Madrid, Spain
|
41
|
Abdalla S, Al-Hadeethi Y. Genes alternations with exposure time of environmental factors. Gene 2013; 528:256-60. [PMID: 23860326 DOI: 10.1016/j.gene.2013.06.065] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/13/2013] [Revised: 06/20/2013] [Accepted: 06/21/2013] [Indexed: 01/01/2023]
Abstract
A theoretical model of how the exposure time of environmental factors (EFs) affects genes, which leads to human diseases, is presented using a multi-logistic model. The advantages and limitations of this model are discussed in terms of its usefulness for simulating genetic samples. It is shown that EFs affect genes to the same degree at a high exposure level with a low exposure time as at a low exposure level with a high exposure time.
Affiliation(s)
- S Abdalla
- Department of Physics, Faculty of Science, King Abdulaziz University Jeddah, P.O. Box 80203, Jeddah 21589, Saudi Arabia.
|
42
|
Johansson ML, Raimondi PT, Reed DC, Coelho NC, Serrão EA, Alberto FA. Looking into the black box: simulating the role of self-fertilization and mortality in the genetic structure of Macrocystis pyrifera. Mol Ecol 2013; 22:4842-54. [PMID: 23962179 DOI: 10.1111/mec.12444] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 12/03/2012] [Accepted: 07/03/2013] [Indexed: 01/10/2024]
Abstract
Patterns of spatial genetic structure (SGS), typically estimated by genotyping adults, integrate migration over multiple generations and measure the effective gene flow of populations. SGS results can be compared with direct ecological studies of dispersal or mating system to gain additional insights. When mismatches occur, simulations can be used to illuminate the causes of these mismatches. Here, we report an SGS and simulation-based study of self-fertilization in Macrocystis pyrifera, the giant kelp. We found that SGS is weaker than expected in M. pyrifera and used computer simulations to identify selfing and early mortality rates for which the individual heterozygosity distribution fits that of the observed data. Only one of the three populations showed both elevated kinship in the smallest distance class and a significant negative slope between kinship and geographical distance. All simulations had poor fit to the observed data unless mortality due to inbreeding depression was imposed. This mortality could only be imposed for selfing, as these were the only simulations to show an excess of homozygous individuals relative to the observed data. Thus, the expected data consistently achieved nonsignificant differences from the observed data only under models of selfing with mortality, with best fits between 32% and 42% selfing. Inbreeding depression ranged from 0.70 to 0.73. The results suggest that density-dependent mortality of early life stages is a significant force in structuring Macrocystis populations, with few highly homozygous individuals surviving. The success of these results should help to validate simulation approaches even in data-poor systems, as a means to estimate otherwise difficult-to-measure life cycle parameters.
Affiliation(s)
- Mattias L Johansson
- Department of Biological Sciences, University of Wisconsin - Milwaukee, PO Box 413, Milwaukee, WI, 53201, USA
|
43
|
Arenas M, Dos Santos HG, Posada D, Bastolla U. Protein evolution along phylogenetic histories under structurally constrained substitution models. Bioinformatics 2013; 29:3020-8. [PMID: 24037213 DOI: 10.1093/bioinformatics/btt530] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Indexed: 11/12/2022]
Abstract
MOTIVATION Models of molecular evolution aim at describing the evolutionary processes at the molecular level. However, current models rarely incorporate information from protein structure. Conversely, structure-based models of protein evolution have not been commonly applied to simulate sequence evolution in a phylogenetic framework, and they often ignore relevant evolutionary processes such as recombination. A simulation evolutionary framework that integrates substitution models that account for protein structure stability should be able to generate more realistic in silico evolved proteins for a variety of purposes. RESULTS We developed a method to simulate protein evolution that combines models of protein folding stability, such that the fitness depends on the stability of the native state both with respect to unfolding and misfolding, with phylogenetic histories that can be either specified by the user or simulated with the coalescent under complex evolutionary scenarios, including recombination, demographics and migration. We have implemented this framework in a computer program called ProteinEvolver. Remarkably, comparing these models with empirical amino acid replacement models, we found that the former produce amino acid distributions closer to distributions observed in real protein families, and proteins that are predicted to be more stable. Therefore, we conclude that evolutionary models that consider protein stability and realistic evolutionary histories constitute a better approximation of the real evolutionary process.
Affiliation(s)
- Miguel Arenas
- Centre for Molecular Biology 'Severo Ochoa', Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain and Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo, Spain
|
44
|
Arenas M. Computer programs and methodologies for the simulation of DNA sequence data with recombination. Front Genet 2013; 4:9. [PMID: 23378848 PMCID: PMC3561691 DOI: 10.3389/fgene.2013.00009] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 11/20/2012] [Accepted: 01/17/2013] [Indexed: 11/13/2022] Open
Abstract
Computer simulations are useful in evolutionary biology for hypothesis testing, to verify analytical methods, to analyze interactions among evolutionary processes, and to estimate evolutionary parameters. In particular, the simulation of DNA sequences with recombination may help in understanding the role of recombination in diverse evolutionary questions, such as the genome structure. Consequently, plenty of computer simulators have been developed to simulate DNA sequence data with recombination. However, the choice of an appropriate tool, among all currently available simulators, is critical if recombination simulations are to be biologically meaningful. This review provides a practical survival guide to commonly used computer programs and methodologies for the simulation of coding and non-coding DNA sequences with recombination. It may help in the correct design of computer simulation experiments of recombination. In addition, the study includes a review of simulation studies investigating the impact of ignoring recombination when performing various evolutionary analyses, such as phylogenetic tree and ancestral sequence reconstructions. Alternative analytical methodologies accounting for recombination are also reviewed.
Affiliation(s)
- Miguel Arenas
- Centre for Molecular Biology "Severo Ochoa," Consejo Superior de Investigaciones Científicas, Madrid, Spain
|