1
Patané JSL, Martins J, Setubal JC. A Guide to Phylogenomic Inference. Methods Mol Biol 2024; 2802:267-345. [PMID: 38819564 DOI: 10.1007/978-1-0716-3838-5_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Phylogenomics aims at reconstructing the evolutionary histories of organisms taking into account whole genomes or large fractions of genomes. Phylogenomics has significant applications in fields such as evolutionary biology, systematics, comparative genomics, and conservation genetics, providing valuable insights into the origins and relationships of species and contributing to our understanding of biological diversity and evolution. This chapter surveys phylogenetic concepts and methods aimed at both gene tree and species tree reconstruction while also addressing common pitfalls, providing references to relevant computer programs. A practical phylogenomic analysis example including bacterial genomes is presented at the end of the chapter.
Affiliation(s)
- José S L Patané
- Laboratório de Genética e Cardiologia Molecular, Instituto do Coração/Heart Institute, Hospital das Clínicas - Faculdade de Medicina da Universidade de São Paulo, São Paulo, SP, Brazil
- Joaquim Martins
- Integrative Omics group, Biorenewables National Laboratory, Brazilian Center for Research in Energy and Materials, Campinas, SP, Brazil
- João Carlos Setubal
- Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, São Paulo, SP, Brazil
2
Lu B. Evolutionary Insights into the Relationship of Frogs, Salamanders, and Caecilians and Their Adaptive Traits, with an Emphasis on Salamander Regeneration and Longevity. Animals (Basel) 2023; 13:3449. [PMID: 38003067 PMCID: PMC10668855 DOI: 10.3390/ani13223449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2023] [Revised: 11/01/2023] [Accepted: 11/06/2023] [Indexed: 11/26/2023] Open
Abstract
The extant amphibians have developed uncanny abilities to adapt to their environment. I compared the genes of amphibians to those of other vertebrates to investigate the genetic changes underlying their unique traits, especially salamanders' regeneration and longevity. Using the well-supported Batrachia tree, I found that salamander genomes have undergone accelerated adaptive evolution, especially for development-related genes. The group-based comparison showed that several genes are under positive selection, rapid evolution, and unexpected parallel evolution with traits shared by distantly related species, such as the tail-regenerative lizard and the longer-lived naked mole rat. The genes, such as EEF1E1, PAFAH1B1, and OGFR, may be involved in salamander regeneration, as they are involved in the apoptotic process, blastema formation, and cell proliferation, respectively. The genes PCNA and SIRT1 may be involved in extending lifespan, as they are involved in DNA repair and histone modification, respectively. Some genes, such as PCNA and OGFR, have dual roles in regeneration and aging, which suggests that these two processes are interconnected. My experiment validated the time course differential expression pattern of SERPINI1 and OGFR, two genes that have evolved in parallel in salamanders and lizards during the regeneration process of salamander limbs. In addition, I found several candidate genes responsible for frogs' frequent vocalization and caecilians' degenerative vision. This study provides much-needed insights into the processes of regeneration and aging, and the discovery of the critical genes paves the way for further functional analysis, which could open up new avenues for exploiting the genetic potential of humans and improving human well-being.
Affiliation(s)
- Bin Lu
- Chengdu Institute of Biology, Chinese Academy of Sciences, Chengdu 610041, China
3
Vera-Ruiz VA, Robinson J, Jermiin LS. A Likelihood-Ratio Test for Lumpability of Phylogenetic Data: Is the Markovian Property of an Evolutionary Process retained in Recoded DNA? Syst Biol 2021; 71:660-675. [PMID: 34498090 DOI: 10.1093/sysbio/syab074] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/05/2021] [Revised: 08/19/2021] [Accepted: 08/27/2021] [Indexed: 11/12/2022] Open
Abstract
In molecular phylogenetics, it is typically assumed that the evolutionary process for DNA can be approximated by independent and identically distributed Markovian processes at the variable sites and that these processes diverge over the edges of a rooted bifurcating tree. Sometimes the nucleotides are transformed from a 4-state alphabet to a 3- or 2-state alphabet by a procedure that is called recoding, lumping, or grouping of states. Here, we introduce a likelihood-ratio test for lumpability for DNA that has diverged under different Markovian conditions, which assesses the assumption that the Markovian property of the evolutionary process over each edge is retained after recoding of the nucleotides. The test is derived and validated numerically on simulated data. To demonstrate the insights that can be gained by using the test, we assessed two published data sets, one of mitochondrial DNA from a phylogenetic study of the ratites (Syst. Biol. 59:90-107 [2010]) and the other of nuclear DNA from a phylogenetic study of yeast (Mol. Biol. Evol. 21:1455-1458 [2004]). Our analysis of these data sets revealed that recoding of the DNA eliminated some of the compositional heterogeneity detected over the sequences. However, the Markovian property of the original evolutionary process was not retained by the recoding, leading to some significant distortions of edge lengths in reconstructed trees.
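As an illustration of the recoding step mentioned in this abstract, the sketch below lumps the four nucleotide states into two (purines R = {A, G}, pyrimidines Y = {C, T}). The RY scheme, the helper name `recode`, and the toy alignment are assumptions chosen for illustration only; the abstract does not say which recoding schemes were examined, and this is not the likelihood-ratio test itself.

```python
# Hypothetical lumping scheme: 4-state DNA alphabet -> 2-state RY alphabet.
RY_SCHEME = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def recode(sequence, scheme=RY_SCHEME, unknown="-"):
    """Map each nucleotide to its lumped state.

    Characters not covered by the scheme (gaps, N, other ambiguity
    codes) are replaced by `unknown`.
    """
    return "".join(scheme.get(base, unknown) for base in sequence.upper())


# Toy alignment (made-up sequences, not data from the study).
alignment = {"taxon1": "ACGTAC", "taxon2": "ATGCAN"}
recoded = {name: recode(seq) for name, seq in alignment.items()}

# A 3-state scheme can be expressed the same way, e.g. lumping only C and T.
THREE_STATE = {"A": "A", "G": "G", "C": "Y", "T": "Y"}
recoded_three = recode("ACGT", scheme=THREE_STATE)
```

Whether the Markovian property survives such a mapping is exactly what the test in the paper is designed to assess; the sketch only shows the mechanical transformation of the alignment.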
Affiliation(s)
- Victor A Vera-Ruiz
- School of Mathematics and Statistics, University of Sydney, NSW 2006, Australia; Department of Mathematics and Statistics, University of Nevada, Reno, NV 89557, USA
- John Robinson
- School of Mathematics and Statistics, University of Sydney, NSW 2006, Australia
- Lars S Jermiin
- Research School of Biology, Australian National University, Canberra, ACT 2601, Australia; School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland; Earth Institute, University College Dublin, Belfield, Dublin 4, Ireland
4
Jermiin LS, Catullo RA, Holland BR. A new phylogenetic protocol: dealing with model misspecification and confirmation bias in molecular phylogenetics. NAR Genom Bioinform 2020; 2:lqaa041. [PMID: 33575594 PMCID: PMC7671319 DOI: 10.1093/nargab/lqaa041] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/17/2020] [Revised: 05/18/2020] [Accepted: 06/04/2020] [Indexed: 12/15/2022] Open
Abstract
Molecular phylogenetics plays a key role in comparative genomics and has increasingly significant impacts on science, industry, government, public health and society. In this paper, we posit that the current phylogenetic protocol is missing two critical steps, and that their absence allows model misspecification and confirmation bias to unduly influence phylogenetic estimates. Based on the potential offered by well-established but under-used procedures, such as assessment of phylogenetic assumptions and tests of goodness of fit, we introduce a new phylogenetic protocol that will reduce confirmation bias and increase the accuracy of phylogenetic estimates.
Affiliation(s)
- Lars S Jermiin
- CSIRO Land & Water, Canberra, ACT 2601, Australia
- Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
- School of Biology & Environment Science, University College Dublin, Belfield, Dublin 4, Ireland
- Earth Institute, University College Dublin, Belfield, Dublin 4, Ireland
- Renee A Catullo
- CSIRO Land & Water, Canberra, ACT 2601, Australia
- Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
- School of Science and Health & Hawkesbury Institute of the Environment, Western Sydney University, Penrith, NSW 2751, Australia
- Barbara R Holland
- School of Natural Sciences, University of Tasmania, Hobart, TAS 7001, Australia
5
Abstract
Phylogenomics aims at reconstructing the evolutionary histories of organisms taking into account whole genomes or large fractions of genomes. The abundance of genomic data for an enormous variety of organisms has enabled phylogenomic inference of many groups, and this has motivated the development of many computer programs implementing the associated methods. This chapter surveys phylogenetic concepts and methods aimed at both gene tree and species tree reconstruction while also addressing common pitfalls, providing references to relevant computer programs. A practical phylogenomic analysis example including bacterial genomes is presented at the end of the chapter.
Affiliation(s)
- José S L Patané
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, Av. Prof. Lineu Prestes 748, São Paulo, SP, 05508-000, Brazil
- Joaquim Martins
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, Av. Prof. Lineu Prestes 748, São Paulo, SP, 05508-000, Brazil
- João C Setubal
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, Av. Prof. Lineu Prestes 748, São Paulo, SP, 05508-000, Brazil
6
Abstract
Most phylogenetic methods are model-based and depend on models of evolution designed to approximate the evolutionary processes. Several methods have been developed to identify suitable models of evolution for phylogenetic analysis of alignments of nucleotide or amino acid sequences and some of these methods are now firmly embedded in the phylogenetic protocol. However, in a disturbingly large number of cases, it appears that these models were used without acknowledgement of their inherent shortcomings. In this chapter, we discuss the problem of model selection and show how some of the inherent shortcomings may be identified and overcome.
Affiliation(s)
- Vivek Jayaswal
- School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD, Australia
- Faisal M Ababneh
- Department of Mathematics & Statistics, Al-Hussein Bin Talal University, Ma'an, Jordan
- John Robinson
- School of Mathematics & Statistics, University of Sydney, Sydney, NSW, Australia
7
Jayaswal V, Wong TKF, Robinson J, Poladian L, Jermiin LS. Mixture models of nucleotide sequence evolution that account for heterogeneity in the substitution process across sites and across lineages. Syst Biol 2014; 63:726-42. [PMID: 24927722 DOI: 10.1093/sysbio/syu036] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.1] [Indexed: 11/14/2022] Open
Abstract
Molecular phylogenetic studies of homologous sequences of nucleotides often assume that the underlying evolutionary process was globally stationary, reversible, and homogeneous (SRH), and that a model of evolution with one or more site-specific and time-reversible rate matrices (e.g., the GTR rate matrix) is enough to accurately model the evolution of data over the whole tree. However, an increasing body of data suggests that evolution under these conditions is an exception, rather than the norm. To address this issue, several non-SRH models of molecular evolution have been proposed, but they either ignore heterogeneity in the substitution process across sites (HAS) or assume it can be modeled accurately using the Γ distribution. As an alternative to these models of evolution, we introduce a family of mixture models that approximate HAS without the assumption of an underlying predefined statistical distribution. This family of mixture models is combined with non-SRH models of evolution that account for heterogeneity in the substitution process across lineages (HAL). We also present two algorithms for searching model space and identifying an optimal model of evolution that is less likely to over- or underparameterize the data. The performance of the two new algorithms was evaluated using alignments of nucleotides with 10 000 sites simulated under complex non-SRH conditions on a 25-tipped tree. The algorithms were found to be very successful, identifying the correct HAL model with a 75% success rate (the average success rate for assigning rate matrices to the tree's 48 edges was 99.25%) and, for the correct HAL model, identifying the correct HAS model with a 98% success rate. Finally, parameter estimates obtained under the correct HAL-HAS model were found to be accurate and precise.
The merits of our new algorithms were illustrated with an analysis of 42 337 second codon sites extracted from a concatenation of 106 alignments of orthologous genes encoded by the nuclear genomes of Saccharomyces cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, S. castellii, S. kluyveri, S. bayanus, and Candida albicans. Our results show that second codon sites in the ancestral genome of these species contained 49.1% invariable sites, 39.6% variable sites belonging to one rate category (V1), and 11.3% variable sites belonging to a second rate category (V2). The ancestral nucleotide content was found to differ markedly across these three sets of sites, and the evolutionary processes operating at the variable sites were found to be non-SRH and best modeled by a combination of eight edge-specific rate matrices (four for V1 and four for V2). The number of substitutions per site at the variable sites also differed markedly, with sites belonging to V1 evolving slower than those belonging to V2 along the lineages separating the seven species of Saccharomyces. Finally, sites belonging to V1 appeared to have ceased evolving along the lineages separating S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus, implying that they might have become so selectively constrained that they could be considered invariable sites in these species.
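The figure of 48 edges quoted for the 25-tipped simulation tree in this abstract is what one would expect for a rooted, fully bifurcating tree, which has 2n - 2 edges for n tips. The short check below assumes that tree shape (the abstract does not state it explicitly), and `rooted_edge_count` is a hypothetical helper introduced only for this illustration.

```python
def rooted_edge_count(n_tips):
    """Number of edges in a rooted, fully bifurcating tree with n_tips tips."""
    if n_tips < 2:
        raise ValueError("a bifurcating tree needs at least two tips")
    return 2 * n_tips - 2


edges = rooted_edge_count(25)

# Average number of edges assigned the correct rate matrix, taking the
# 99.25% per-edge success rate reported in the abstract at face value.
mean_correct_edges = 0.9925 * edges
```

Under these assumptions, a 99.25% per-edge success rate corresponds to fewer than one misassigned edge per tree on average, which helps explain how the whole-model success rate (75%) can be much lower than the per-edge rate.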
Affiliation(s)
- Vivek Jayaswal
- School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia; School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia; CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia; and Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia
- Thomas K F Wong
- School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia; School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia; CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia; and Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia
- John Robinson
- School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia; School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia; CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia; and Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia
- Leon Poladian
- School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia; School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia; CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia; and Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia
- Lars S Jermiin
- School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia; School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia; CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia; and Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia
8
A Genomic Island in Salmonella enterica ssp. salamae provides new insights on the genealogy of the locus of enterocyte effacement. PLoS One 2012; 7:e41615. [PMID: 22860002 PMCID: PMC3408504 DOI: 10.1371/journal.pone.0041615] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 03/26/2012] [Accepted: 06/22/2012] [Indexed: 12/19/2022] Open
Abstract
The genomic island encoding the locus of enterocyte effacement (LEE) is an important virulence factor of the human pathogenic Escherichia coli. LEE typically encodes a type III secretion system (T3SS) and secreted effectors capable of forming attaching and effacing lesions. Although prominent in pathogenic E. coli such as serotype O157:H7, LEE has also been detected in Citrobacter rodentium and E. albertii and, although not confirmed, it is likely to be present in Shigella boydii as well. Previous phylogenetic analysis of LEE indicated the genomic island was evolving through stepwise acquisition of various components. This study describes a new LEE region from two strains of Salmonella enterica subspecies salamae serovar Sofia along with a phylogenetic analysis of LEE that provides new insights into the likely evolution of this genomic island. The Salmonella LEE contains 36 of the 41 genes typically observed in LEE within a genomic island of 49,371 bp that encodes a total of 54 genes. A phylogenetic analysis was performed on the entire T3SS and four T3SS genes (escF, escJ, escN, and escV) to elucidate the genealogy of LEE. Phylogenetic analysis inferred that the previously known LEE islands are members of a single lineage distinct from the new Salmonella LEE lineage. The previously known lineage of LEE diverged between islands found in Citrobacter and those in Escherichia and Shigella. Although recombination and horizontal gene transfer are important factors in the genealogy of most genomic islands, the phylogeny of the T3SS of LEE can be interpreted with a bifurcating tree. It seems likely that the LEE island entered the Enterobacteriaceae through horizontal gene transfer as a single unit, rather than as separate subsections, which was then subjected to the forces of both mutational change and recombination.