1
Janzen T, Etienne RS. Phylogenetic tree statistics: A systematic overview using the new R package 'treestats'. Mol Phylogenet Evol 2024; 200:108168. [PMID: 39117295 DOI: 10.1016/j.ympev.2024.108168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 07/19/2024] [Accepted: 08/04/2024] [Indexed: 08/10/2024]
Abstract
Phylogenetic trees are believed to contain a wealth of information on diversification processes. However, comparing phylogenetic trees is not straightforward due to their high dimensionality. Researchers have therefore defined a wide range of low-dimensional summary statistics. Currently, it remains unexplored to what extent these summary statistics cover the same underlying information and what summary statistics best explain observed variation across phylogenies. Furthermore, a large subset of available summary statistics focusses on measuring the topological features of a phylogenetic tree, but are often only explored at the extreme edge cases of the fully balanced or imbalanced tree and not for trees of intermediate balance. Here, we introduce a new R package called 'treestats', that provides speed optimized code to compute 70 summary statistics. We study correlations between summary statistics on empirical trees and on trees simulated using several diversification models. Furthermore, we introduce an algorithm to create intermediately balanced trees in a well-defined manner, in order to explore variation in summary statistics across a balance gradient. We find that almost all summary statistics are correlated with tree size, and find that it is difficult, if not impossible, to correct for tree size, unless the tree generating model is known. Furthermore, we find that across empirical and simulated trees, at least three large clusters of correlated summary statistics can be found, where statistics group together based on information used (topology or branching times). However, the finer grained correlation structure appears to depend strongly on either the taxonomic group studied (in empirical studies) or the tree generating model (in simulation studies). Amongst statistics describing the (im)balance of a tree, we find that almost all statistics vary non-linearly, and sometimes even non-monotonically, with our generated balance gradient. 
This indicates that balance is perhaps a more complex property of a tree than previously thought. Furthermore, using our new imbalancing algorithm, we devise a numerical test to identify balance statistics, and identify several statistics as balance statistics that were not previously considered as such. Lastly, our results lead to several recommendations on which statistics to select when analyzing and comparing phylogenetic trees.
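As an illustration of what a topological balance statistic measures, the following is a minimal Python sketch of the Colless index, a widely used imbalance statistic. It is not the 'treestats' API (the package itself is written for R); the nested-tuple tree encoding and the function name `colless_index` are assumptions made only for this example.

```python
def colless_index(tree):
    """Return (number of tips, Colless index) for a binary tree.

    Assumed encoding: a tip is any non-tuple value, and an internal
    node is a 2-tuple (left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        return 1, 0  # a single tip contributes no imbalance
    left, right = tree
    n_left, c_left = colless_index(left)
    n_right, c_right = colless_index(right)
    # Each internal node adds the absolute difference in tip counts
    # between its two daughter clades.
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


# The two extreme cases for four tips.
balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")

print(colless_index(balanced))     # (4, 0)
print(colless_index(caterpillar))  # (4, 3)
```

The fully balanced and fully imbalanced (caterpillar) trees give the minimum and maximum values for four tips; trees of intermediate balance fall between these two values.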
Affiliation(s)
- Thijs Janzen
  - Groningen Institute for Evolutionary Life Sciences, University of Groningen, Groningen, the Netherlands
- Rampal S Etienne
  - Groningen Institute for Evolutionary Life Sciences, University of Groningen, Groningen, the Netherlands
2
Tenorio-Salgado S, Villalpando-Aguilar JL, Hernandez-Guerrero R, Poot-Hernández AC, Perez-Rueda E. Exploring the enzymatic repertoires of Bacteria and Archaea and their associations with metabolic maps. Braz J Microbiol 2024:10.1007/s42770-024-01462-3. [PMID: 39052173 DOI: 10.1007/s42770-024-01462-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2024] [Accepted: 07/11/2024] [Indexed: 07/27/2024]
Abstract
The evolution, survival, and adaptation of microbes are consequences of gene duplication, acquisition, and divergence in response to environmental challenges. In this context, enzymes play a central role in the evolution of organisms, because they are fundamental to cell metabolism. Here, we analyzed the enzymatic repertoire of 6,467 microbial genomes, including enzyme abundances and their associations with metabolic maps. We found that the enzymes follow a power-law distribution in relation to genome size. Therefore, we evaluated the total proportion of enzymatic classes in relation to the genomes, identifying the following descending order of proportions: transferases (EC:2.-), hydrolases (EC:3.-), oxidoreductases (EC:1.-), ligases (EC:6.-), lyases (EC:4.-), isomerases (EC:5.-), and translocases (EC:7.-). In addition, we identified a preferential use of enzymatic classes in metabolic pathways for xenobiotics, cofactors and vitamins, carbohydrates, amino acids, glycans, and energy. This analysis therefore provides clues about the functional constraints associated with the enzymatic repertoire of Bacteria and Archaea.
Affiliation(s)
- Silvia Tenorio-Salgado
  - Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica del Estado de Yucatán, Mérida, Yucatán, México
  - Tecnológico Nacional de México, Instituto Tecnológico de Mérida, Av. Tecnológico km. 4.5, 97118, Merida, Yucatan, Mexico
- José Luis Villalpando-Aguilar
  - Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica del Estado de Yucatán, Mérida, Yucatán, México
  - Facultad Ciencias de la Salud, Universidad Vizcaya de las Américas, Prolongación Allende, Campeche, 24035, Campeche, Mexico
- Rafael Hernandez-Guerrero
  - Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica del Estado de Yucatán, Mérida, Yucatán, México
- Augusto César Poot-Hernández
  - Unidad de Bioinformática y Manejo de la Información, Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, Coyoacán, Ciudad de México, México
- Ernesto Perez-Rueda
  - Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica del Estado de Yucatán, Mérida, Yucatán, México
3
Khurana MP, Scheidwasser-Clow N, Penn MJ, Bhatt S, Duchêne DA. The Limits of the Constant-rate Birth-Death Prior for Phylogenetic Tree Topology Inference. Syst Biol 2024; 73:235-246. [PMID: 38153910 PMCID: PMC11129600 DOI: 10.1093/sysbio/syad075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2023] [Revised: 12/20/2023] [Accepted: 12/27/2023] [Indexed: 12/30/2023]
Abstract
Birth-death models are stochastic processes describing speciation and extinction through time and across taxa and are widely used in biology for inference of evolutionary timescales. Previous research has highlighted how the expected trees under the constant-rate birth-death (crBD) model tend to differ from empirical trees, for example, with respect to the amount of phylogenetic imbalance. However, our understanding of how trees differ between the crBD model and the signal in empirical data remains incomplete. In this Point of View, we aim to expose the degree to which the crBD model differs from empirically inferred phylogenies and test the limits of the model in practice. Using a wide range of topology indices to compare crBD expectations against a comprehensive dataset of 1189 empirically estimated trees, we confirm that crBD model trees frequently differ topologically compared with empirical trees. To place this in the context of standard practice in the field, we conducted a meta-analysis for a subset of the empirical studies. When comparing studies that used Bayesian methods and crBD priors with those that used other non-crBD priors and non-Bayesian methods (i.e., maximum likelihood methods), we do not find any significant differences in tree topology inferences. To scrutinize this finding for the case of highly imbalanced trees, we selected the 100 trees with the greatest imbalance from our dataset, simulated sequence data for these tree topologies under various evolutionary rates, and re-inferred the trees under maximum likelihood and using the crBD model in a Bayesian setting. We find that when the substitution rate is low, the crBD prior results in overly balanced trees, but the tendency is negligible when substitution rates are sufficiently high. 
Overall, our findings demonstrate the general robustness of crBD priors across a broad range of phylogenetic inference scenarios but also highlight that empirically observed phylogenetic imbalance is highly improbable under the crBD model, leading to systematic bias in data sets with limited information content.
Affiliation(s)
- Mark P Khurana
  - Section of Epidemiology, Department of Public Health, University of Copenhagen, 1352 Copenhagen, Denmark
- Neil Scheidwasser-Clow
  - Section of Epidemiology, Department of Public Health, University of Copenhagen, 1352 Copenhagen, Denmark
- Matthew J Penn
  - Department of Statistics, University of Oxford, OX1 3LB, Oxford, UK
- Samir Bhatt
  - Section of Epidemiology, Department of Public Health, University of Copenhagen, 1352 Copenhagen, Denmark
  - MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College London, SW7 2AZ, London, UK
- David A Duchêne
  - Centre for Evolutionary Hologenomics, University of Copenhagen, 1352 Copenhagen, Denmark
4
Duarte CM, Ketcheson DI, Eguíluz VM, Agustí S, Fernández-Gracia J, Jamil T, Laiolo E, Gojobori T, Alam I. Rapid evolution of SARS-CoV-2 challenges human defenses. Sci Rep 2022; 12:6457. [PMID: 35440671 PMCID: PMC9017738 DOI: 10.1038/s41598-022-10097-z] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 07/14/2021] [Accepted: 03/23/2022] [Indexed: 12/25/2022]
Abstract
The race between pathogens and their hosts is a major evolutionary driver, where both reshuffle their genomes to overcome and reorganize the defenses for infection, respectively. Evolutionary theory helps formulate predictions on the future evolutionary dynamics of SARS-CoV-2, which can be monitored through unprecedented real-time tracking of SARS-CoV-2 population genomics at the global scale. Here we quantify the accelerating evolution of SARS-CoV-2 by tracking SARS-CoV-2 mutations globally, with a focus on the Receptor Binding Domain (RBD) of the spike protein, which determines infection success. We estimate that the >820 million people who had been infected by October 5, 2021, produced up to 10^21 copies of the virus, with 12 new effective RBD variants appearing, on average, daily. Doubling of the number of RBD variants every 89 days, followed by selection of the most infective variants, challenges our defenses and calls for a shift to anticipatory, rather than reactive, tactics involving collaborative global sequencing and vaccination.
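To make the reported 89-day doubling time concrete, the following minimal Python sketch converts a doubling time into a multiplicative growth factor over an arbitrary interval. The function name `growth_factor` is hypothetical, and the sketch assumes that the doubling time stays constant, which the abstract does not claim beyond the observed period.

```python
import math

DOUBLING_TIME_DAYS = 89  # doubling time of RBD variant counts, as reported


def growth_factor(days, doubling_time=DOUBLING_TIME_DAYS):
    """Factor by which a quantity grows over `days`, assuming it doubles
    every `doubling_time` days (pure exponential growth)."""
    return 2.0 ** (days / doubling_time)


# Equivalent continuous growth rate per day: r = ln(2) / doubling time.
rate_per_day = math.log(2) / DOUBLING_TIME_DAYS

print(growth_factor(89))   # one doubling time
print(growth_factor(365))  # growth over one year under this assumption
print(rate_per_day)
```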
Affiliation(s)
- Carlos M Duarte
  - Red Sea Research Centre (RSRC), King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
  - Computational Bioscience Research Centre (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
- David I Ketcheson
  - Computer, Electrical, and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
- Víctor M Eguíluz
  - Instituto de Física Interdisciplinar y Sistemas Complejos IFISC (UIB-CSIC), Palma de Mallorca, Spain
- Susana Agustí
  - Red Sea Research Centre (RSRC), King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
- Juan Fernández-Gracia
  - Instituto de Física Interdisciplinar y Sistemas Complejos IFISC (UIB-CSIC), Palma de Mallorca, Spain
- Tahira Jamil
  - Red Sea Research Centre (RSRC), King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
  - Computational Bioscience Research Centre (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
- Elisa Laiolo
  - Red Sea Research Centre (RSRC), King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
  - Computational Bioscience Research Centre (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
- Takashi Gojobori
  - Computational Bioscience Research Centre (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
- Intikhab Alam
  - Computational Bioscience Research Centre (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
5
Scale-invariant topology and bursty branching of evolutionary trees emerge from niche construction. Proc Natl Acad Sci U S A 2020; 117:7879-7887. [PMID: 32209672 DOI: 10.1073/pnas.1915088117] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 01/03/2023]
Abstract
Phylogenetic trees describe both the evolutionary process and community diversity. Recent work has established that they exhibit scale-invariant topology, which quantifies the fact that their branching lies in between the two extreme cases of balanced binary trees and maximally unbalanced ones. In addition, the backbones of phylogenetic trees exhibit bursts of diversification on all timescales. Here, we present a simple, coarse-grained statistical model of niche construction coupled to speciation. Finite-size scaling analysis of the dynamics shows that the resultant phylogenetic tree topology is scale-invariant due to a singularity arising from large niche construction fluctuations that follow extinction events. The same model recapitulates the bursty pattern of diversification in time. These results show how dynamical scaling laws of phylogenetic trees on long timescales can reflect the indelible imprint of the interplay between ecological and evolutionary processes.
6
Chakraborty C, Sharma AR, Sharma G, Bhattacharya M, Lee SS. Insight into Evolution and Conservation Patterns of B1-Subfamily Members of GPCR. Int J Pept Res Ther 2020; 26:2505-2517. [PMID: 32421105 PMCID: PMC7223794 DOI: 10.1007/s10989-020-10043-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Accepted: 01/30/2020] [Indexed: 11/25/2022]
Abstract
The diverse, evolutionary architectures of proteins can be regarded as molecular fossils, tracing a historical path that marks important milestones across life. The B1-subfamily of GPCRs (G-protein-coupled receptors) are medically significant proteins that comprise 15 transmembrane receptor proteins in Homo sapiens. These proteins control the intracellular concentration of cyclic AMP as well as various vital processes in the body. However, little is known about the evolutionary correlation and conservational blueprint of this GPCR subfamily. We performed a comprehensive analysis to understand the evolutionary architecture among 13 members of the B1-subfamily. Multiple sequence alignment analysis exhibited six multiple sequence aligned blocks and five highly aligned blocks. Molecular phylogenetics indicated that CRHR1 and CRHR2 share a typical ancestral relationship and are siblings in 100% bootstrap replications with a total of 24 nodes observed in the cladogram. CRHR2 has the maximum number of extremely conserved amino acids followed by ADCYAP1R1. The longest continuous number sequence logos (74) were found between sequence location 349 and 423, and consequently, the maximum and minimum logo height recorded was 3.6 bits and 0.18 bits, respectively. Finally, to understand the model and pattern of evolutionary relatedness, the conservation blueprint, and the diversification among the members of a protein family, GPCR distribution from several species throughout the animal kingdom was analysed. Together, the study provides an evolutionary insight and offers a rapid method to explore the potential of depicting the evolutionary relationship, conservation blueprint, and diversification among the B1-subfamily of GPCRs using bioinformatics, algorithm analysis, and mathematical models.
Affiliation(s)
- Chiranjib Chakraborty
  - Adamas University, North, 24 Parganas, Kolkata, 700126 West Bengal India
  - Institute for Skeletal Aging & Orthopedic Surgery, Chuncheon Sacred Heart Hospital, Hallym University, Chuncheon, 24252 Republic of Korea
- Ashish Ranjan Sharma
  - Institute for Skeletal Aging & Orthopedic Surgery, Chuncheon Sacred Heart Hospital, Hallym University, Chuncheon, 24252 Republic of Korea
- Garima Sharma
  - Neuropsychopharmacology and Toxicology Program, College of Pharmacy, Kangwon National University, Chuncheon, 24341 Republic of Korea
- Manojit Bhattacharya
  - Institute for Skeletal Aging & Orthopedic Surgery, Chuncheon Sacred Heart Hospital, Hallym University, Chuncheon, 24252 Republic of Korea
- Sang-Soo Lee
  - Institute for Skeletal Aging & Orthopedic Surgery, Chuncheon Sacred Heart Hospital, Hallym University, Chuncheon, 24252 Republic of Korea
7
Keller-Schmidt S, Tuğrul M, Eguíluz VM, Hernández-García E, Klemm K. Anomalous scaling in an age-dependent branching model. Phys Rev E Stat Nonlin Soft Matter Phys 2015; 91:022803. [PMID: 25768548 DOI: 10.1103/physreve.91.022803] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/04/2013] [Indexed: 06/04/2023]
Abstract
We introduce a one-parametric family of tree growth models, in which branching probabilities decrease with branch age τ as τ^(-α). Depending on the exponent α, the scaling of tree depth with tree size n displays a transition between the logarithmic scaling of random trees and an algebraic growth. At the transition (α = 1), tree depth grows as (log n)^2. This anomalous scaling is in good agreement with the trend observed in the evolution of biological species, thus providing theoretical support for age-dependent speciation and associating it with the occurrence of a critical point.
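The growth rule described in this abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes discrete time steps, one branching event per step, leaf age measured in steps since the leaf was created, and selection of the branching leaf with probability proportional to τ^(-α). The function name `grow_tree` and all parameter names are hypothetical.

```python
import random


def grow_tree(n_leaves, alpha, seed=0):
    """Grow a binary tree to `n_leaves` leaves and return the leaf depths.

    Assumed rule: at each step one leaf is chosen with probability
    proportional to age**(-alpha) and replaced by two new leaves.
    """
    rng = random.Random(seed)
    leaves = [(0, 0)]  # each leaf is (birth_step, depth); start from the root
    step = 1
    while len(leaves) < n_leaves:
        # Age is always >= 1 because new leaves are born at the current step
        # and are first considered for branching at the next step.
        weights = [(step - birth) ** (-alpha) for birth, _ in leaves]
        chosen = rng.choices(range(len(leaves)), weights=weights)[0]
        _, depth = leaves.pop(chosen)
        leaves.append((step, depth + 1))
        leaves.append((step, depth + 1))
        step += 1
    return [depth for _, depth in leaves]


# Mean leaf depth for a few exponents at fixed tree size.
for alpha in (0.0, 1.0, 2.0):
    depths = grow_tree(200, alpha)
    print(alpha, sum(depths) / len(depths))
```

With α = 0 every leaf is equally likely to branch, which corresponds to the random-tree case mentioned in the abstract; larger α favors young leaves.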
Affiliation(s)
- Stephanie Keller-Schmidt
  - Bioinformatics, Institute of Computer Science, University Leipzig, Härtelstr. 16-18, 04107 Leipzig, Germany
- Murat Tuğrul
  - IST Austria, Am Campus 1, 3400 Klosterneuburg, Austria
- Víctor M Eguíluz
  - IFISC (CSIC-UIB), Instituto de Física Interdisciplinar y Sistemas Complejos, E-07122 Palma de Mallorca, Spain
- Emilio Hernández-García
  - IFISC (CSIC-UIB), Instituto de Física Interdisciplinar y Sistemas Complejos, E-07122 Palma de Mallorca, Spain
- Konstantin Klemm
  - Bioinformatics, Institute of Computer Science, University Leipzig, Härtelstr. 16-18, 04107 Leipzig, Germany
  - Bioinformatics and Computational Biology, University of Vienna, Währingerstraße 29, 1090 Vienna, Austria
  - Theoretical Chemistry, University of Vienna, Währingerstraße 17, 1090 Vienna, Austria
  - School of Science and Technology, Nazarbayev University, Kabanbay Batyr Ave. 53, 010000 Astana, Kazakhstan
8
Li W, Freudenberg J, Miramontes P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics 2014; 15:2. [PMID: 24386976 PMCID: PMC3927684 DOI: 10.1186/1471-2105-15-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 08/25/2013] [Accepted: 12/17/2013] [Indexed: 11/10/2022]
Abstract
Background: The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read length on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp.
Results: We observe that the proportion of non-singleton k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values of k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank-frequency plots exhibit a concave Zipf's curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications.
Conclusion: Read mappability, as measured by the proportion of singletons, increases steadily up to a length scale of around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read length much more slowly than linearly. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.
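The singleton-based notion of mappability used in this abstract can be illustrated with a short Python sketch. It is a toy version under stated assumptions, not the authors' pipeline: it counts exact k-mers on one strand only (no reverse complements), holds all k-mers in memory, and defines the singleton fraction as the share of k-mer positions whose k-mer occurs exactly once. The function name `singleton_fraction` is hypothetical.

```python
from collections import Counter


def singleton_fraction(sequence, k):
    """Fraction of k-mer positions in `sequence` whose k-mer is unique."""
    n_positions = len(sequence) - k + 1
    if k < 1 or n_positions < 1:
        raise ValueError("k must be between 1 and len(sequence)")
    kmers = [sequence[i:i + k] for i in range(n_positions)]
    counts = Counter(kmers)
    # A position is uniquely mappable if its k-mer occurs exactly once.
    unique_positions = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique_positions / n_positions


# Toy sequence containing a repeated "ACGT" unit.
toy = "ACGTACGTTT"
for k in (2, 4, 6):
    print(k, singleton_fraction(toy, k))
```

On this toy sequence the singleton fraction rises with k, mirroring at a very small scale the trend that longer reads are more likely to map uniquely.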
Affiliation(s)
- Wentian Li
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, USA
9
Abstract
A new word, phylodynamics, was coined to emphasize the interconnection between phylogenetic properties, as observed for instance in a phylogenetic tree, and the epidemic dynamics of viruses, where selection, mediated by the host immune response, and transmission play a crucial role. The challenges faced when investigating the evolution of RNA viruses call for a virtuous loop of data collection, data analysis and modeling. This has already resulted both in the collection of massive sequence databases and in the formulation of hypotheses on the main mechanisms driving qualitative differences observed in the (reconstructed) evolutionary patterns of different RNA viruses. Qualitatively, it has been observed that selection driven by the host immune response induces an uneven survival ability among co-existing strains. As a consequence, the imbalance level of the phylogenetic tree is manifestly more pronounced than in the case when the interaction with the host immune system does not play a central role in the evolutionary dynamics. While many imbalance metrics have been introduced, reliable methods to discriminate quantitatively between different levels of imbalance are still lacking. In our work, we reconstruct and analyze the phylogenetic trees of six RNA viruses, with a special emphasis on the human Influenza A virus, due to its relevance for vaccine preparation as well as for the theoretical challenges it poses due to its peculiar evolutionary dynamics. We focus in particular on topological properties. We point out the limitations of standard imbalance metrics, and we introduce a new methodology with which we assign the correct imbalance level of the phylogenetic trees, in agreement with the phylodynamics of the viruses. 
Our thorough quantitative analysis allows for a deeper understanding of the evolutionary dynamics of the considered RNA viruses, which is crucial in order to provide a valuable framework for a quantitative assessment of theoretical predictions.
Affiliation(s)
- Simone Pompei
  - Complex Systems Lagrange Lab, Institute for Scientific Interchange-ISI, Torino, Italy
10
Caetano-Anollés G, Nasir A. Benefits of using molecular structure and abundance in phylogenomic analysis. Front Genet 2012; 3:172. [PMID: 22973296 PMCID: PMC3434437 DOI: 10.3389/fgene.2012.00172] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 07/25/2012] [Accepted: 08/18/2012] [Indexed: 12/25/2022]
Affiliation(s)
- Gustavo Caetano-Anollés
  - Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois Urbana-Champaign, IL, USA