1
Shah PT, Xing L. Reply to Abrantes et al. Recombination-Based Perspectives on Lagovirus Classification, Phylogenetic Patterns, and Evolutionary Dynamics. Comment on "Shah et al. Genetic Characteristics and Phylogeographic Dynamics of Lagoviruses, 1988-2021. Viruses 2023, 15, 815". Viruses 2024; 16:928. [PMID: 38932220 PMCID: PMC11209430 DOI: 10.3390/v16060928] [Received: 05/29/2024] [Accepted: 05/31/2024] [Indexed: 06/28/2024] Open
Abstract
Recently, Abrantes et al [...].
Affiliation(s)
- Pir Tariq Shah
  - Faculty of Medicine, School of Biomedical Engineering, Dalian University of Technology, No. 2 Linggong Road, Dalian 116024, China
  - Shandong Laboratory of Yantai Drug Discovery, Bohai Rim Advanced Research Institute for Drug Discovery, Yantai 264000, China
- Li Xing
  - Institute of Biomedical Sciences, Shanxi University, 92 Wucheng Road, Taiyuan 030006, China
  - Shanxi Provincial Key Laboratory of Medical Molecular Cell Biology, Shanxi University, 92 Wucheng Road, Taiyuan 030006, China
2
Wehbi S, Wheeler A, Morel B, Minh BQ, Lauretta DS, Masel J. Order of amino acid recruitment into the genetic code resolved by Last Universal Common Ancestor's protein domains. bioRxiv 2024:2024.04.13.589375. [PMID: 38659899 PMCID: PMC11042313 DOI: 10.1101/2024.04.13.589375] [Indexed: 04/26/2024]
Abstract
The current "consensus" order in which amino acids were added to the genetic code is based on potentially biased criteria, such as the absence of sulfur-containing amino acids from the Urey-Miller experiment, which lacked sulfur. Even if inferred perfectly, abiotic abundance might not reflect abundance in the organisms in which the genetic code evolved. Here, we instead exploit the fact that proteins that emerged prior to the genetic code's completion are likely enriched in early amino acids and depleted in late amino acids. We identify the most ancient protein-coding sequences born prior to the archaeal-bacterial split. Amino acid usage in protein sequences whose ancestors date back to a single homolog in the Last Universal Common Ancestor (LUCA) largely matches the consensus order. However, our findings indicate that metal-binding (cysteine and histidine) and sulfur-containing (cysteine and methionine) amino acids were added to the genetic code much earlier than previously thought. Surprisingly, even more ancient protein sequences, those that had already diversified into multiple distinct copies in LUCA, show a different pattern from single-copy LUCA sequences: they are significantly less depleted in the late amino acids tryptophan and tyrosine, and enriched rather than depleted in phenylalanine. This is compatible with at least some of these sequences predating the current genetic code. Their distinct enrichment patterns thus provide hints about earlier, alternative genetic codes.
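As a rough illustration of the comparison the abstract above describes (amino acid usage in one set of protein sequences relative to another), the following minimal Python sketch computes per-residue frequencies in two sequence sets and reports their log2 ratio. It is not the authors' method: the function names, the toy sequences, and the pseudocount are all assumptions made for illustration only.

```python
from collections import Counter
from math import log2

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(sequences, pseudocount=1.0):
    """Relative frequency of each standard amino acid across all sequences.

    The pseudocount (an illustrative assumption) keeps every frequency
    non-zero so that ratios and logarithms are always defined.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts[aa] for aa in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def log2_enrichment(focal, background, pseudocount=1.0):
    """log2(frequency in focal set / frequency in background set) per residue.

    Positive values mean the residue is enriched in the focal set,
    negative values mean it is depleted.
    """
    f = aa_frequencies(focal, pseudocount)
    b = aa_frequencies(background, pseudocount)
    return {aa: log2(f[aa] / b[aa]) for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Hypothetical toy sequences, not real data.
    focal_set = ["GAVGGADVGA", "AGGVDAGGAV"]
    background_set = ["WYFGAVWYFM", "MWYFCHWYKR"]
    scores = log2_enrichment(focal_set, background_set)
    for aa in sorted(scores, key=scores.get, reverse=True):
        print(f"{aa}\t{scores[aa]:+.2f}")
```

A real analysis would also need to control for confounders such as protein length, structure, and phylogenetic non-independence, which this sketch ignores.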
Affiliation(s)
- Sawsan Wehbi
  - Genetics Graduate Interdisciplinary Program, University of Arizona, Tucson, Arizona, 85721, USA
- Andrew Wheeler
  - Genetics Graduate Interdisciplinary Program, University of Arizona, Tucson, Arizona, 85721, USA
- Benoit Morel
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Bui Quang Minh
  - School of Computing, Australian National University, Canberra, ACT, Australia
- Dante S Lauretta
  - Lunar and Planetary Laboratory, University of Arizona, Tucson, AZ 85721, USA
- Joanna Masel
  - Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, 85721, USA
3
Gupta A, Mirarab S, Turakhia Y. Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES. bioRxiv 2024:2024.05.27.596098. [PMID: 38854139 PMCID: PMC11160643 DOI: 10.1101/2024.05.27.596098] [Indexed: 06/11/2024]
Abstract
Inference of species trees plays a crucial role in advancing our understanding of evolutionary relationships and has immense significance for diverse biological and medical applications. Extensive genome sequencing efforts are currently in progress across a broad spectrum of life forms, holding the potential to unravel the intricate branching patterns within the tree of life. However, estimating species trees starting from raw genome sequences is quite challenging, and the current cutting-edge methodologies require a series of error-prone steps that are neither entirely automated nor standardized. In this paper, we present ROADIES, a novel pipeline for species tree inference from raw genome assemblies that is fully automated, easy to use, scalable, free from reference bias, and provides flexibility to adjust the tradeoff between accuracy and runtime. The ROADIES pipeline eliminates the need to align whole genomes, choose a single reference species, or pre-select loci such as functional genes found using cumbersome annotation steps. Moreover, it leverages recent advances in phylogenetic inference to allow multi-copy genes, eliminating the need to detect orthology. Using the genomic datasets released from large-scale sequencing consortia across three diverse life forms (placental mammals, pomace flies, and birds), we show that ROADIES infers species trees that are comparable in quality with the state-of-the-art approaches but in a fraction of the time. By incorporating optimal approaches and automating all steps from assembled genomes to species and gene trees, ROADIES is poised to improve the accuracy, scalability, and reproducibility of phylogenomic analyses.
Affiliation(s)
- Anshu Gupta
  - Department of Computer Science and Engineering, University of California, San Diego; San Diego, CA 92093, USA
- Siavash Mirarab
  - Department of Electrical and Computer Engineering, University of California, San Diego; San Diego, CA 92093, USA
- Yatish Turakhia
  - Department of Electrical and Computer Engineering, University of California, San Diego; San Diego, CA 92093, USA
4
Stiller J, Feng S, Chowdhury AA, Rivas-González I, Duchêne DA, Fang Q, Deng Y, Kozlov A, Stamatakis A, Claramunt S, Nguyen JMT, Ho SYW, Faircloth BC, Haag J, Houde P, Cracraft J, Balaban M, Mai U, Chen G, Gao R, Zhou C, Xie Y, Huang Z, Cao Z, Yan Z, Ogilvie HA, Nakhleh L, Lindow B, Morel B, Fjeldså J, Hosner PA, da Fonseca RR, Petersen B, Tobias JA, Székely T, Kennedy JD, Reeve AH, Liker A, Stervander M, Antunes A, Tietze DT, Bertelsen MF, Lei F, Rahbek C, Graves GR, Schierup MH, Warnow T, Braun EL, Gilbert MTP, Jarvis ED, Mirarab S, Zhang G. Complexity of avian evolution revealed by family-level genomes. Nature 2024; 629:851-860. [PMID: 38560995 PMCID: PMC11111414 DOI: 10.1038/s41586-024-07323-1] [Received: 04/25/2023] [Accepted: 03/15/2024] [Indexed: 04/04/2024]
Abstract
Despite tremendous efforts in the past decades, relationships among main avian lineages remain heavily debated without a clear resolution. Discrepancies have been attributed to diversity of species sampled, phylogenetic method and the choice of genomic regions [1-3]. Here we address these issues by analysing the genomes of 363 bird species [4] (218 taxonomic families, 92% of total). Using intergenic regions and coalescent methods, we present a well-supported tree but also a marked degree of discordance. The tree confirms that Neoaves experienced rapid radiation at or near the Cretaceous-Palaeogene boundary. Sufficient loci rather than extensive taxon sampling were more effective in resolving difficult nodes. Remaining recalcitrant nodes involve species that are a challenge to model due to either extreme DNA composition, variable substitution rates, incomplete lineage sorting or complex evolutionary events such as ancient hybridization. Assessment of the effects of different genomic partitions showed high heterogeneity across the genome. We discovered sharp increases in effective population size, substitution rates and relative brain size following the Cretaceous-Palaeogene extinction event, supporting the hypothesis that emerging ecological opportunities catalysed the diversification of modern birds. The resulting phylogenetic estimate offers fresh insights into the rapid radiation of modern birds and provides a taxon-rich backbone tree for future comparative studies.
Affiliation(s)
- Josefin Stiller
  - Section for Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Shaohong Feng
  - Center for Evolutionary & Organismal Biology, Liangzhu Laboratory & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, China
  - Department of General Surgery, Sir Run-Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
  - Innovation Center of Yangtze River Delta, Zhejiang University, Jiashan, China
- Al-Aabid Chowdhury
  - School of Life and Environmental Sciences, University of Sydney, Sydney, New South Wales, Australia
- David A Duchêne
  - Center for Evolutionary Hologenomics, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Qi Fang
  - BGI Research, Shenzhen, China
- Yuan Deng
  - BGI Research, Shenzhen, China
  - BGI Research, Wuhan, China
- Alexey Kozlov
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Alexandros Stamatakis
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
  - Institute of Computer Science, Foundation for Research and Technology Hellas, Heraklion, Greece
  - Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Santiago Claramunt
  - Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
  - Department of Natural History, Royal Ontario Museum, Toronto, Ontario, Canada
- Jacqueline M T Nguyen
  - College of Science and Engineering, Flinders University, Adelaide, South Australia, Australia
  - Australian Museum Research Institute, Sydney, New South Wales, Australia
- Simon Y W Ho
  - School of Life and Environmental Sciences, University of Sydney, Sydney, New South Wales, Australia
- Brant C Faircloth
  - Department of Biological Sciences and Museum of Natural Science, Louisiana State University, Baton Rouge, LA, USA
- Julia Haag
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Peter Houde
  - Department of Biology, New Mexico State University, Las Cruces, NM, USA
- Joel Cracraft
  - Department of Ornithology, American Museum of Natural History, New York, NY, USA
- Metin Balaban
  - Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, USA
- Uyen Mai
  - Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
- Guangji Chen
  - BGI Research, Wuhan, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Rongsheng Gao
  - BGI Research, Wuhan, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Yulong Xie
  - Center for Evolutionary & Organismal Biology, Liangzhu Laboratory & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Zijian Huang
  - Center for Evolutionary & Organismal Biology, Liangzhu Laboratory & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Zhen Cao
  - Department of Computer Science, Rice University, Houston, TX, USA
- Zhi Yan
  - Department of Computer Science, Rice University, Houston, TX, USA
- Huw A Ogilvie
  - Department of Computer Science, Rice University, Houston, TX, USA
- Luay Nakhleh
  - Department of Computer Science, Rice University, Houston, TX, USA
- Bent Lindow
  - Natural History Museum Denmark, University of Copenhagen, Copenhagen, Denmark
- Benoit Morel
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
  - Institute of Computer Science, Foundation for Research and Technology Hellas, Heraklion, Greece
- Jon Fjeldså
  - Natural History Museum Denmark, University of Copenhagen, Copenhagen, Denmark
- Peter A Hosner
  - Natural History Museum Denmark, University of Copenhagen, Copenhagen, Denmark
  - Center for Global Mountain Biodiversity, Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Rute R da Fonseca
  - Center for Global Mountain Biodiversity, Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Bent Petersen
  - Center for Evolutionary Hologenomics, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
  - Centre of Excellence for Omics-Driven Computational Biodiscovery, Faculty of Applied Sciences, AIMST University, Bedong, Malaysia
- Joseph A Tobias
  - Department of Life Sciences, Imperial College London, Silwood Park, Ascot, UK
- Tamás Székely
  - Milner Centre for Evolution, University of Bath, Bath, UK
  - ELKH-DE Reproductive Strategies Research Group, University of Debrecen, Debrecen, Hungary
- Jonathan David Kennedy
  - Center for Macroecology, Evolution, and Climate, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Andrew Hart Reeve
  - Natural History Museum Denmark, University of Copenhagen, Copenhagen, Denmark
- Andras Liker
  - HUN-REN-PE Evolutionary Ecology Research Group, University of Pannonia, Veszprém, Hungary
  - Behavioural Ecology Research Group, Center for Natural Sciences, University of Pannonia, Veszprém, Hungary
- Agostinho Antunes
  - CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Porto, Portugal
  - Department of Biology, Faculty of Sciences, University of Porto, Porto, Portugal
- Mads F Bertelsen
  - Centre for Zoo and Wild Animal Health, Copenhagen Zoo, Frederiksberg, Denmark
- Fumin Lei
  - Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  - College of Life Science, University of Chinese Academy of Sciences, Beijing, China
- Carsten Rahbek
  - Center for Global Mountain Biodiversity, Globe Institute, University of Copenhagen, Copenhagen, Denmark
  - Center for Macroecology, Evolution, and Climate, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
  - Institute of Ecology, Peking University, Beijing, China
  - Danish Institute for Advanced Study, University of Southern Denmark, Odense, Denmark
- Gary R Graves
  - Center for Macroecology, Evolution, and Climate, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
  - Department of Vertebrate Zoology, National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
- Tandy Warnow
  - University of Illinois Urbana-Champaign, Champaign, IL, USA
- Edward L Braun
  - Department of Biology, University of Florida, Gainesville, FL, USA
- M Thomas P Gilbert
  - Center for Evolutionary Hologenomics, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
  - University Museum, NTNU, Trondheim, Norway
- Erich D Jarvis
  - Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
  - Howard Hughes Medical Institute, Durham, NC, USA
- Guojie Zhang
  - Center for Evolutionary & Organismal Biology, Liangzhu Laboratory & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, China
  - Innovation Center of Yangtze River Delta, Zhejiang University, Jiashan, China
  - BGI Research, Wuhan, China
  - Villum Center for Biodiversity Genomics, Department of Biology, University of Copenhagen, Copenhagen, Denmark
5
López-Fernández H, Pinto M, Vieira CP, Duque P, Reboiro-Jato M, Vieira J. Auto-phylo v2 and auto-phylo-pipeliner: building advanced, flexible, and reusable pipelines for phylogenetic inferences, estimation of variability levels and identification of positively selected amino acid sites. J Integr Bioinform 2024; 0:jib-2023-0046. [PMID: 38529929 DOI: 10.1515/jib-2023-0046] [Received: 10/31/2023] [Accepted: 01/31/2024] [Indexed: 03/27/2024] Open
Abstract
The vast amount of genome sequence data that is available, and that is predicted to drastically increase in the near future, can only be efficiently dealt with by building automated pipelines. Indeed, the Earth Biogenome Project will produce high-quality reference genome sequences for all 1.8 million named living eukaryote species, providing unprecedented insight into the evolution of genes and gene families, and thus on biological issues. Here, new modules for gene annotation, further BLAST search algorithms, further multiple sequence alignment methods, the addition of reference sequences, further tree rooting methods, the estimation of rates of synonymous and nonsynonymous substitutions, and the identification of positively selected amino acid sites have been added to auto-phylo (version 2), a recently developed software to address biological problems using phylogenetic inferences. Additionally, we present auto-phylo-pipeliner, a graphical user interface application that further facilitates the creation and running of auto-phylo pipelines. Inferences on S-RNase specificity are critical both for cross-based breeding and for the establishment of pollination requirements. Therefore, as a test case, we develop an auto-phylo pipeline to identify amino acid sites under positive selection, which are, in principle, those determining S-RNase specificity, starting from both non-annotated Prunus genomes and sequences available in public databases.
Affiliation(s)
- Hugo López-Fernández
  - CINBIO, Department of Computer Science, ESEI-Escuela Superior de Ingeniería Informática, Universidade de Vigo, 32004 Ourense, Spain
  - SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, 36213 Vigo, Spain
- Miguel Pinto
  - Instituto de Investigação e Inovação em Saúde (I3S), Universidade do Porto, Rua Alfredo Allen, 208, 4200-135 Porto, Portugal
  - Faculdade de Ciências da Universidade do Porto (FCUP), Rua do Campo Alegre, s/n, 4169-007 Porto, Portugal
- Cristina P Vieira
  - Instituto de Investigação e Inovação em Saúde (I3S), Universidade do Porto, Rua Alfredo Allen, 208, 4200-135 Porto, Portugal
  - Instituto de Biologia Molecular e Celular (IBMC), Rua Alfredo Allen, 208, 4200-135 Porto, Portugal
- Pedro Duque
  - Instituto de Investigação e Inovação em Saúde (I3S), Universidade do Porto, Rua Alfredo Allen, 208, 4200-135 Porto, Portugal
  - Faculdade de Ciências da Universidade do Porto (FCUP), Rua do Campo Alegre, s/n, 4169-007 Porto, Portugal
  - Instituto de Biologia Molecular e Celular (IBMC), Rua Alfredo Allen, 208, 4200-135 Porto, Portugal
  - School of Medicine and Biomedical Sciences (ICBAS), Porto University, Rua de Jorge Viterbo Ferreira, 228, 4050-313 Porto, Portugal
- Miguel Reboiro-Jato
  - CINBIO, Department of Computer Science, ESEI-Escuela Superior de Ingeniería Informática, Universidade de Vigo, 32004 Ourense, Spain
  - SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, 36213 Vigo, Spain
- Jorge Vieira
  - Instituto de Investigação e Inovação em Saúde (I3S), Universidade do Porto, Rua Alfredo Allen, 208, 4200-135 Porto, Portugal
  - Instituto de Biologia Molecular e Celular (IBMC), Rua Alfredo Allen, 208, 4200-135 Porto, Portugal
6
McKnight DJE, Wong-Bajracharya J, Okoh EB, Snijders F, Lidbetter F, Webster J, Haughton M, Darling AE, Djordjevic SP, Bogema DR, Chapman TA. Xanthomonas rydalmerensis sp. nov., a non-pathogenic member of Group 1 Xanthomonas. Int J Syst Evol Microbiol 2024; 74:006294. [PMID: 38536071 PMCID: PMC10995728 DOI: 10.1099/ijsem.0.006294] [Received: 11/06/2023] [Accepted: 03/04/2024] [Indexed: 04/07/2024] Open
Abstract
Five bacterial isolates were obtained from Fragaria × ananassa in 1976 in Rydalmere, Australia, during routine biosecurity surveillance. Initially, the results of biochemical characterisation indicated that these isolates represented members of the genus Xanthomonas. To determine their species, further analysis was conducted using both phenotypic and genotypic approaches. Phenotypic analysis involved using MALDI-TOF MS and BIOLOG GEN III microplates, which confirmed that the isolates represented members of the genus Xanthomonas but did not allow them to be classified with respect to species. Genome relatedness indices and the results of extensive phylogenetic analysis confirmed that the isolates were members of the genus Xanthomonas and represented a novel species. On the basis of the minimal presence of virulence-associated factors typically found in genomes of members of the genus Xanthomonas, we suggest that these isolates are non-pathogenic. This conclusion was supported by the results of a pathogenicity assay. On the basis of these findings, we propose the name Xanthomonas rydalmerensis, with DAR 34855T = ICMP 24941 as the type strain.
Affiliation(s)
- Daniel J. E. McKnight
  - NSW Department of Primary Industries, Elizabeth Macarthur Agricultural Institute, Woodbridge Rd, Menangle NSW 2568, Australia
  - University of Technology Sydney, 15 Broadway, Ultimo NSW 2007, Australia
- Johanna Wong-Bajracharya
  - NSW Department of Primary Industries, Elizabeth Macarthur Agricultural Institute, Woodbridge Rd, Menangle NSW 2568, Australia
- Efenaide B. Okoh
  - NSW Department of Primary Industries, Elizabeth Macarthur Agricultural Institute, Woodbridge Rd, Menangle NSW 2568, Australia
  - Western Sydney University, Penrith, NSW, Australia
- Fridtjof Snijders
  - NSW Department of Primary Industries, Elizabeth Macarthur Agricultural Institute, Woodbridge Rd, Menangle NSW 2568, Australia
- Fiona Lidbetter
  - NSW Department of Primary Industries, Elizabeth Macarthur Agricultural Institute, Woodbridge Rd, Menangle NSW 2568, Australia
- John Webster
  - NSW Department of Primary Industries, Elizabeth Macarthur Agricultural Institute, Woodbridge Rd, Menangle NSW 2568, Australia
- Mathew Haughton
  - NSW Department of Primary Industries, Elizabeth Macarthur Agricultural Institute, Woodbridge Rd, Menangle NSW 2568, Australia
- Aaron E. Darling
  - University of Technology Sydney, 15 Broadway, Ultimo NSW 2007, Australia
- Daniel R. Bogema
  - NSW Department of Primary Industries, Elizabeth Macarthur Agricultural Institute, Woodbridge Rd, Menangle NSW 2568, Australia
  - University of Technology Sydney, 15 Broadway, Ultimo NSW 2007, Australia
- Toni A. Chapman
  - NSW Department of Primary Industries, Elizabeth Macarthur Agricultural Institute, Woodbridge Rd, Menangle NSW 2568, Australia
  - University of Technology Sydney, 15 Broadway, Ultimo NSW 2007, Australia
7
Sharon BM, Arute AP, Nguyen A, Tiwari S, Reddy Bonthu SS, Hulyalkar NV, Neugent ML, Palacios Araya D, Dillon NA, Zimmern PE, Palmer KL, De Nisco NJ. Genetic and functional enrichments associated with Enterococcus faecalis isolated from the urinary tract. mBio 2023; 14:e0251523. [PMID: 37962362 PMCID: PMC10746210 DOI: 10.1128/mbio.02515-23] [Received: 09/19/2023] [Accepted: 10/05/2023] [Indexed: 11/15/2023] Open
Abstract
IMPORTANCE: Urinary tract infection (UTI) is a global health issue that imposes a substantial burden on healthcare systems. Women are disproportionately affected by UTI, with >60% of women experiencing at least one UTI in their lifetime. UTIs can recur, particularly in postmenopausal women, leading to diminished quality of life and potentially life-threatening complications. Understanding how pathogens colonize and survive in the urinary tract is necessary to identify new therapeutic targets that are urgently needed due to rising rates of antimicrobial resistance. How Enterococcus faecalis, a bacterium commonly associated with UTI, adapts to the urinary tract remains understudied. Here, we generated a collection of high-quality closed genome assemblies of clinical urinary E. faecalis isolated from the urine of postmenopausal women that we used alongside detailed clinical metadata to perform a robust comparative genomic investigation of genetic factors that may be involved in E. faecalis survival in the urinary tract.
Affiliation(s)
- Belle M. Sharon
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas, USA
- Amanda P. Arute
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas, USA
- Amber Nguyen
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas, USA
- Suman Tiwari
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas, USA
- Neha V. Hulyalkar
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas, USA
- Michael L. Neugent
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas, USA
- Dennise Palacios Araya
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas, USA
- Nicholas A. Dillon
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas, USA
- Philippe E. Zimmern
  - Department of Urology, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Kelli L. Palmer
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas, USA
- Nicole J. De Nisco
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas, USA
  - Department of Urology, University of Texas Southwestern Medical Center, Dallas, Texas, USA
8
Tabatabaee Y, Roch S, Warnow T. QR-STAR: A Polynomial-Time Statistically Consistent Method for Rooting Species Trees Under the Coalescent. J Comput Biol 2023; 30:1146-1181. [PMID: 37902986 DOI: 10.1089/cmb.2023.0185] [Indexed: 11/01/2023] Open
Abstract
We address the problem of rooting an unrooted species tree given a set of unrooted gene trees, under the assumption that gene trees evolve within the model species tree under the multispecies coalescent (MSC) model. Quintet Rooting (QR) is a recently proposed polynomial-time algorithm for this problem, based on the theory developed by Allman, Degnan, and Rhodes that proves the identifiability of rooted 5-taxon trees from unrooted gene trees under the MSC. However, although QR had good accuracy in simulations, its statistical consistency was left as an open problem. We present QR-STAR, a variant of QR with an additional step and a different cost function, and prove that it is statistically consistent under the MSC. Moreover, we derive sample complexity bounds for QR-STAR and show that a particular variant of it based on "short quintets" has polynomial sample complexity. Finally, our simulation study under a variety of model conditions shows that QR-STAR matches or improves on the accuracy of QR. QR-STAR is available in open-source form on GitHub.
Affiliation(s)
- Yasamin Tabatabaee
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Sebastien Roch
  - Department of Mathematics, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Tandy Warnow
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
9
Tabatabaee Y, Zhang C, Warnow T, Mirarab S. Phylogenomic branch length estimation using quartets. Bioinformatics 2023; 39:i185-i193. [PMID: 37387151 DOI: 10.1093/bioinformatics/btad221] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION: Branch lengths and topology of a species tree are essential in most downstream analyses, including estimation of diversification dates, characterization of selection, understanding adaptation, and comparative genomics. Modern phylogenomic analyses often use methods that account for the heterogeneity of evolutionary histories across the genome due to processes such as incomplete lineage sorting. However, these methods typically do not generate branch lengths in units that are usable by downstream applications, forcing phylogenomic analyses to resort to alternative shortcuts such as estimating branch lengths by concatenating gene alignments into a supermatrix. Yet, concatenation and other available approaches for estimating branch lengths fail to address heterogeneity across the genome. RESULTS: In this article, we derive expected values of gene tree branch lengths in substitution units under an extension of the multispecies coalescent (MSC) model that allows substitutions with varying rates across the species tree. We present CASTLES, a new technique for estimating branch lengths on the species tree from estimated gene trees that uses these expected values, and our study shows that CASTLES improves on the most accurate prior methods with respect to both speed and accuracy. AVAILABILITY AND IMPLEMENTATION: CASTLES is available at https://github.com/ytabatabaee/CASTLES.
Affiliation(s)
- Yasamin Tabatabaee
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, United States
- Chao Zhang
  - Department of Integrative Biology, University of California at Berkeley, Berkeley, CA 94720, United States
- Tandy Warnow
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, United States
- Siavash Mirarab
  - Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093, United States
10
Sharon BM, Hulyalkar NV, Zimmern PE, Palmer KL, De Nisco NJ. Inter-species diversity and functional genomic analyses of closed genome assemblies of clinically isolated, megaplasmid-containing Enterococcus raffinosus Er676 and ATCC49464. Access Microbiol 2023; 5:acmi000508.v3. [PMID: 37424546 PMCID: PMC10323788 DOI: 10.1099/acmi.0.000508.v3] [Received: 10/25/2022] [Accepted: 03/10/2023] [Indexed: 07/11/2023] Open
Abstract
Enterococcus raffinosus is an understudied member of its genus possessing a characteristic megaplasmid contributing to a large genome size. Although less commonly associated with human infection compared to other enterococci, this species can cause disease and persist in diverse niches such as the gut, urinary tract, blood and environment. Few complete genome assemblies have been published to date for E. raffinosus . In this study, we report the complete assembly of the first clinical urinary E. raffinosus strain, Er676, isolated from a postmenopausal woman with history of recurrent urinary tract infection. We additionally completed the assembly of clinical type strain ATCC49464. Comparative genomic analyses reveal inter-species diversity driven by large accessory genomes. The presence of a conserved megaplasmid indicates it is a ubiquitous and vital genetic feature of E. raffinosus . We find that the E. raffinosus chromosome is enriched for DNA replication and protein biosynthesis genes while the megaplasmid is enriched for transcription and carbohydrate metabolism genes. Prophage analysis suggests that diversity in the chromosome and megaplasmid sequences arises, in part, from horizontal gene transfer. Er676 demonstrated the largest genome size reported to date for E. raffinosus and the highest probability of human pathogenicity. Er676 also possesses multiple antimicrobial resistance genes, of which all but one are encoded on the chromosome, and has the most complete prophage sequences. Complete assembly and comparative analyses of the Er676 and ATCC49464 genomes provide important insight into the inter-species diversity of E. raffinosus that gives it its ability to colonize and persist in the human body. Investigating genetic factors that contribute to the pathogenicity of this species will provide valuable tools to combat diseases caused by this opportunistic pathogen.
Affiliation(s)
- Belle M. Sharon
- Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas, USA
| | - Neha V. Hulyalkar
- Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas, USA
| | - Philippe E. Zimmern
- Department of Urology, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Kelli L. Palmer
- Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas, USA
| | - Nicole J. De Nisco
- Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas, USA
- Department of Urology, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| |
|
11
|
Sharon BM, Arute AP, Nguyen A, Tiwari S, Bonthu SSR, Hulyalkar NV, Neugent ML, Araya DP, Dillon NA, Zimmern PE, Palmer KL, De Nisco NJ. Functional and genetic adaptations contributing to Enterococcus faecalis persistence in the female urinary tract. bioRxiv 2023:2023.05.18.541374. [PMID: 37293065 PMCID: PMC10245761 DOI: 10.1101/2023.05.18.541374] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Enterococcus faecalis is the leading Gram-positive bacterial species implicated in urinary tract infection (UTI). An opportunistic pathogen, E. faecalis is a commensal of the human gastrointestinal tract (GIT) and its presence in the GIT is a predisposing factor for UTI. The mechanisms by which E. faecalis colonizes and survives in the urinary tract (UT) are poorly understood, especially in uncomplicated or recurrent UTI. The UT is distinct from the GIT and is characterized by a sparse nutrient landscape and unique environmental stressors. In this study, we isolated and sequenced a collection of 37 clinical E. faecalis strains from the urine of primarily postmenopausal women. We generated 33 closed genome assemblies and four highly contiguous draft assemblies and conducted a comparative genomic analysis to identify genetic features enriched in urinary E. faecalis with respect to E. faecalis isolated from the human GIT and blood. Phylogenetic analysis revealed high diversity among urinary strains and a closer relatedness between urine and gut isolates than blood isolates. Plasmid replicon (rep) typing further underscored possible UT-GIT interconnection, identifying nine shared rep types between urine and gut E. faecalis. Both genotypic and phenotypic analysis of antimicrobial resistance among urinary E. faecalis revealed infrequent resistance to front-line UTI antibiotics nitrofurantoin and fluoroquinolones and no vancomycin resistance. Finally, we identified 19 candidate genes enriched among urinary strains that may play a role in adaptation to the UT. These genes are involved in the core processes of sugar transport, cobalamin import, glucose metabolism, and post-transcriptional regulation of gene expression.
IMPORTANCE: Urinary tract infection (UTI) is a global health issue that imposes substantial burden on healthcare systems. Women are disproportionately affected by UTI with >60% of women experiencing at least one UTI in their lifetime. UTIs can recur, particularly in postmenopausal women, leading to diminished quality of life and potentially life-threatening complications. Understanding how pathogens colonize and survive in the urinary tract is necessary to identify new therapeutic targets that are urgently needed due to rising rates of antimicrobial resistance. How Enterococcus faecalis, a bacterium commonly associated with UTI, adapts to the urinary tract remains understudied. Here, we generated a collection of high-quality closed genome assemblies of clinical urinary E. faecalis isolated from the urine of postmenopausal women that we used alongside detailed clinical metadata to perform a robust comparative genomic investigation of genetic factors that may mediate urinary E. faecalis adaptation to the female urinary tract.
|
12
|
Willson J, Tabatabaee Y, Liu B, Warnow T. DISCO+QR: rooting species trees in the presence of GDL and ILS. Bioinformatics Advances 2023; 3:vbad015. [PMID: 36789293 PMCID: PMC9923442 DOI: 10.1093/bioadv/vbad015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Revised: 01/21/2023] [Accepted: 02/06/2023] [Indexed: 02/10/2023]
Abstract
Motivation: Genes evolve under processes such as gene duplication and loss (GDL), which makes gene family trees multi-copy, and incomplete lineage sorting (ILS); both processes produce gene trees that differ from the species tree. The estimation of species trees from sets of gene family trees is challenging, and the estimation of rooted species trees presents additional analytical challenges. Two of the methods developed for this problem are STRIDE, which roots species trees by considering GDL events, and Quintet Rooting (QR), which roots species trees by considering ILS. Results: We present DISCO+QR, a new approach to rooting species trees that first uses DISCO to address GDL and then uses QR to perform rooting in the presence of ILS. DISCO+QR takes the input gene family trees, decomposes them into single-copy trees using DISCO, and then roots the given species tree with QR using the information in the single-copy gene trees. We show that the relative accuracy of STRIDE and DISCO+QR depends on the properties of the dataset (number of species, genes, rate of gene duplication, degree of ILS and gene tree estimation error), and that each provides advantages over the other under some conditions. Availability and implementation: DISCO and QR are available on GitHub. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- James Willson
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Yasamin Tabatabaee
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Baqiao Liu
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | | |
|
13
|
Population Structure and Genomic Characteristics of Australian Erysipelothrix rhusiopathiae Reveals Unobserved Diversity in the Australian Pig Industry. Microorganisms 2023; 11:microorganisms11020297. [PMID: 36838261 PMCID: PMC9964597 DOI: 10.3390/microorganisms11020297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2022] [Revised: 01/18/2023] [Accepted: 01/19/2023] [Indexed: 01/26/2023] Open
Abstract
Erysipelothrix rhusiopathiae is a bacterial pathogen that is the causative agent of erysipelas in a variety of animals, including swine, emus, turkeys, muskox, caribou, moose, and humans. This study aims to investigate the population structure and genomic features of Australian isolates of E. rhusiopathiae in the Australian pig industry and compare them to the broader scope of isolates worldwide. A total of 178 isolates (154 Australian, seven vaccine isolates, six international isolates, and 11 of unknown origin) in this study were screened against an MLST scheme and publicly available reference isolates, identifying 59 new alleles, with isolates separating into two main single locus variant groups. Investigation with BLASTn revealed the presence of the spaA gene in 171 (96%) of the isolates, with three main groups of SpaA protein sequences observed amongst the isolates. Novel SpaA protein sequences, categorised here as group 3 sequences, consisted of two sequence types forming separate clades to groups 1 and 2, with amino acid variants at positions 195 (D/A), 303 (G/E) and 323 (P/L). In addition to the newly identified groups, five new variant positions were identified, 124 (S/N), 307 (Q/R), 323 (P/L), 379 (M/I), and 400 (V/I). Resistance screening identified genes related to lincomycin, streptomycin, erythromycin, and tetracycline resistance. Of the 29 isolates carrying these resistance genes, 82% belonged to SpaA group 2-N101S (n = 22) or 2-N101S-I257L (n = 2). In addition, 79% (n = 23) of these 29 isolates belonged to MLST group ST 5. Our results illustrate that Australia appears to have a unique diversity of E. rhusiopathiae isolates in pig production industries within the wider global context of isolates.
|
14
|
Adam PS, Kolyfetis GE, Bornemann TLV, Vorgias CE, Probst AJ. Genomic remnants of ancestral methanogenesis and hydrogenotrophy in Archaea drive anaerobic carbon cycling. Science Advances 2022; 8:eabm9651. [PMID: 36332026 PMCID: PMC9635834 DOI: 10.1126/sciadv.abm9651] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 10/25/2021] [Accepted: 09/19/2022] [Indexed: 05/19/2023]
Abstract
Anaerobic methane metabolism is among the hallmarks of Archaea, originating very early in their evolution. Here, we show that the ancestor of methane metabolizers was an autotrophic CO2-reducing hydrogenotrophic methanogen that possessed the two main complexes, methyl-CoM reductase (Mcr) and tetrahydromethanopterin-CoM methyltransferase (Mtr), the anaplerotic hydrogenases Eha and Ehb, and a set of other genes collectively called "methanogenesis markers" but could not oxidize alkanes. Overturning recent inferences, we demonstrate that methyl-dependent hydrogenotrophic methanogenesis has emerged multiple times independently, either due to a loss of Mtr while Mcr is inherited vertically or from an ancient lateral acquisition of Mcr. Even if Mcr is lost, Mtr, Eha, Ehb, and the markers can persist, resulting in mixotrophic metabolisms centered around the Wood-Ljungdahl pathway. Through their methanogenesis remnants, Thorarchaeia and two newly reconstructed order-level lineages in Archaeoglobi and Bathyarchaeia act as metabolically versatile players in carbon cycling of anoxic environments across the globe.
Affiliation(s)
- Panagiotis S. Adam
- Environmental Microbiology and Biotechnology, Faculty of Chemistry, University of Duisburg-Essen, Universitätsstraße 5, 45141 Essen, Germany
- Corresponding author.
| | - George E. Kolyfetis
- Environmental Microbiology and Biotechnology, Faculty of Chemistry, University of Duisburg-Essen, Universitätsstraße 5, 45141 Essen, Germany
- Department of Biochemistry and Molecular Biology, Faculty of Biology, National and Kapodistrian University of Athens, Panepistimiopolis Zografou, 15784 Athens, Greece
| | - Till L. V. Bornemann
- Environmental Microbiology and Biotechnology, Faculty of Chemistry, University of Duisburg-Essen, Universitätsstraße 5, 45141 Essen, Germany
| | - Constantinos E. Vorgias
- Department of Biochemistry and Molecular Biology, Faculty of Biology, National and Kapodistrian University of Athens, Panepistimiopolis Zografou, 15784 Athens, Greece
| | - Alexander J. Probst
- Environmental Microbiology and Biotechnology, Faculty of Chemistry, University of Duisburg-Essen, Universitätsstraße 5, 45141 Essen, Germany
- Centre for Water and Environmental Research (ZWU), University of Duisburg-Essen, Universitätsstraße 5, 45141 Essen, Germany
- Research Center One Health Ruhr, Research Alliance Ruhr, Environmental Metagenomics, University of Duisburg-Essen, Universitätsstraße 5, 45141 Essen, Germany
| |
|
15
|
Redmond AK, Pettinello R, Bakke FK, Dooley H. Sharks Provide Evidence for a Highly Complex TNFSF Repertoire in the Jawed Vertebrate Ancestor. The Journal of Immunology 2022; 209:1713-1723. [DOI: 10.4049/jimmunol.2200300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2022] [Accepted: 08/19/2022] [Indexed: 01/04/2023]
Abstract
Cytokines of the TNF superfamily (TNFSF) control many immunological processes and are implicated in the etiology of many immune disorders and diseases. Despite their obvious biological importance, the TNFSF repertoires of many species remain poorly characterized. In this study, we perform detailed bioinformatic, phylogenetic, and syntenic analyses of five cartilaginous fish genomes to identify their TNFSF repertoires. Strikingly, we find that shark genomes harbor ∼30 TNFSF genes, more than any other vertebrate examined to date and substantially more than humans. This is due to better retention of the ancestral jawed vertebrate TNFSF repertoire than any other jawed vertebrate lineage, combined with lineage-specific gene family expansions. All human TNFSFs appear in shark genomes, except for lymphotoxin-α (LTA; TNFSF1) and TNF (TNFSF2), and CD70 (TNFSF7) and 4-1BBL (TNFSF9), which diverged by tandem duplications early in tetrapod and mammalian evolution, respectively. Although lacking one-to-one LTA and TNF orthologs, sharks have evolved lineage-specific clusters of LTA/TNF co-orthologs. Other key findings include the presence of two BAFF (TNFSF13B) genes along with orthologs of APRIL (TNFSF13) and BALM (TNFSF13C) in sharks, and that all cartilaginous fish genomes harbor an ∼400-million-year-old cluster of multiple FASLG (TNFSF6) orthologs. Finally, sharks have retained seven ancestral jawed vertebrate TNFSF genes lost in humans. Taken together, our data indicate that the jawed vertebrate ancestor possessed a much larger and diverse TNFSF repertoire than previously hypothesized and oppose the idea that the cartilaginous fish immune system is “primitive” compared with that of mammals.
Affiliation(s)
- Anthony K. Redmond
- Smurfit Institute of Genetics, Trinity College Dublin, Dublin, Ireland
- Department of Science and Health, Institute of Technology Carlow, Carlow, Ireland
| | - Rita Pettinello
- School of Biological Sciences, University of Aberdeen, Aberdeen, United Kingdom
| | - Fiona K. Bakke
- School of Biological Sciences, University of Aberdeen, Aberdeen, United Kingdom
| | - Helen Dooley
- Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD
- Institute of Marine and Environmental Technology, University of Maryland School of Medicine, Baltimore, MD
| |
|
16
|
Fer E, McGrath KM, Guy L, Hockenberry AJ, Kaçar B. Early divergence of translation initiation and elongation factors. Protein Sci 2022; 31:e4393. [PMID: 36250475 PMCID: PMC9601768 DOI: 10.1002/pro.4393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2022] [Revised: 07/05/2022] [Accepted: 07/11/2022] [Indexed: 11/18/2022]
Abstract
Protein translation is a foundational attribute of all living cells. The translation function carried out by the ribosome critically depends on an assortment of protein interaction partners, collectively referred to as the translation machinery. Various studies suggest that the diversification of the translation machinery occurred prior to the last universal common ancestor, yet it is unclear whether the predecessors of the extant translation machinery factors were functionally distinct from their modern counterparts. Here we reconstructed the shared ancestral trajectory and subsequent evolution of essential translation factor GTPases, elongation factor EF-Tu (aEF-1A/eEF-1A), and initiation factor IF2 (aIF5B/eIF5B). Based upon their similar functions and structural homologies, it has been proposed that EF-Tu and IF2 emerged from an ancient common ancestor. We generated the phylogenetic tree of IF2 and EF-Tu proteins and reconstructed ancestral sequences corresponding to the deepest nodes in their shared evolutionary history, including the last common IF2 and EF-Tu ancestor. By identifying the residue and domain substitutions, as well as structural changes along the phylogenetic history, we developed an evolutionary scenario for the origins, divergence and functional refinement of EF-Tu and IF2 proteins. Our analyses suggest that the common ancestor of IF2 and EF-Tu was an IF2-like GTPase protein. Given the central importance of the translation machinery to all cellular life, its earliest evolutionary constraints and trajectories are key to characterizing the universal constraints and capabilities of cellular evolution.
Affiliation(s)
- Evrim Fer
- Department of Bacteriology, University of Wisconsin‐Madison, Madison, Wisconsin, USA
- Microbiology Doctoral Training Program, University of Wisconsin‐Madison, Madison, Wisconsin, USA
- NASA Center for Early Life and Evolution, University of Wisconsin‐Madison, Madison, Wisconsin, USA
| | - Kaitlyn M. McGrath
- Department of Bacteriology, University of Wisconsin‐Madison, Madison, Wisconsin, USA
- NASA Center for Early Life and Evolution, University of Wisconsin‐Madison, Madison, Wisconsin, USA
- Department of Molecular and Cellular Biology, University of Arizona, Tucson, Arizona, USA
| | - Lionel Guy
- Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Adam J. Hockenberry
- Department of Integrative Biology, The University of Texas at Austin, Austin, Texas, USA
| | - Betül Kaçar
- Department of Bacteriology, University of Wisconsin‐Madison, Madison, Wisconsin, USA
- NASA Center for Early Life and Evolution, University of Wisconsin‐Madison, Madison, Wisconsin, USA
| |
|
17
|
Tabatabaee Y, Sarker K, Warnow T. Quintet Rooting: rooting species trees under the multi-species coalescent model. Bioinformatics 2022; 38:i109-i117. [PMID: 35758805 PMCID: PMC9236578 DOI: 10.1093/bioinformatics/btac224] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022] Open
Abstract
Motivation: Rooted species trees are a basic model with multiple applications throughout biology, including understanding adaptation, biodiversity, phylogeography and co-evolution. Because most species tree estimation methods produce unrooted trees, methods for rooting these trees have been developed. However, most rooting methods either rely on prior biological knowledge or assume that evolution is close to clock-like, which is not usually the case. Furthermore, most prior rooting methods do not account for biological processes that create discordance between gene trees and species trees. Results: We present Quintet Rooting (QR), a method for rooting species trees based on a proof of identifiability of the rooted species tree under the multi-species coalescent model established by Allman, Degnan and Rhodes (J. Math. Biol., 2011). We show that QR is generally more accurate than other rooting methods, except under extreme levels of gene tree estimation error. Availability and implementation: Quintet Rooting is available in open source form at https://github.com/ytabatabaee/Quintet-Rooting. The simulated datasets used in this study are from a prior study and are available at https://www.ideals.illinois.edu/handle/2142/55319. The biological dataset used in this study is also from a prior study and is available at http://gigadb.org/dataset/101041. Contact: warnow@illinois.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yasamin Tabatabaee
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Kowshika Sarker
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| |
|
18
|
Chen K, Moravec JC, Gavryushkin A, Welch D, Drummond AJ. Accounting for errors in data improves divergence time estimates in single-cell cancer evolution. Mol Biol Evol 2022; 39:6613463. [PMID: 35733333 PMCID: PMC9356729 DOI: 10.1093/molbev/msac143] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022] Open
Abstract
Single-cell sequencing provides a new way to explore the evolutionary history of cells. Compared to traditional bulk sequencing, where a population of heterogeneous cells is pooled to form a single observation, single-cell sequencing isolates and amplifies genetic material from individual cells, thereby preserving the information about the origin of the sequences. However, single-cell data is more error-prone than bulk sequencing data due to the limited genomic material available per cell. Here, we present error and mutation models for evolutionary inference of single-cell data within a mature and extensible Bayesian framework, BEAST2. Our framework enables integration with biologically informative models such as relaxed molecular clocks and population dynamic models. Our simulations show that modeling errors increase the accuracy of relative divergence times and substitution parameters. We reconstruct the phylogenetic history of a colorectal cancer patient and a healthy patient from single-cell DNA sequencing data. We find that the estimated times of terminal splitting events are shifted forward in time compared to models which ignore errors. We observed that not accounting for errors can overestimate the phylogenetic diversity in single-cell DNA sequencing data. We estimate that 30-50% of the apparent diversity can be attributed to error. Our work enables a full Bayesian approach capable of accounting for errors in the data within the integrative Bayesian software framework BEAST2.
Affiliation(s)
- Kylie Chen
- School of Computer Science, University of Auckland, Auckland, New Zealand
| | - Jiří C Moravec
- Department of Computer Science, University of Otago, Dunedin, New Zealand
- School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
| | - Alex Gavryushkin
- School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
| | - David Welch
- School of Computer Science, University of Auckland, Auckland, New Zealand
| | - Alexei J Drummond
- School of Computer Science, University of Auckland, Auckland, New Zealand
- School of Biological Sciences, University of Auckland, Auckland, New Zealand
| |
|
19
|
Young C, Meng S, Moshiri N. An Evaluation of Phylogenetic Workflows in Viral Molecular Epidemiology. Viruses 2022; 14:v14040774. [PMID: 35458504 PMCID: PMC9032411 DOI: 10.3390/v14040774] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/03/2022] [Revised: 04/04/2022] [Accepted: 04/06/2022] [Indexed: 01/25/2023] Open
Abstract
The use of viral sequence data to inform public health intervention has become increasingly common in the realm of epidemiology. Such methods typically utilize multiple sequence alignments and phylogenies estimated from the sequence data. Like all estimation techniques, they are error prone, yet the impacts of such imperfections on downstream epidemiological inferences are poorly understood. To address this, we executed multiple commonly used viral phylogenetic analysis workflows on simulated viral sequence data, modeling Human Immunodeficiency Virus (HIV), Hepatitis C Virus (HCV), and Ebolavirus, and we computed multiple methods of accuracy, motivated by transmission-clustering techniques. For multiple sequence alignment, MAFFT consistently outperformed MUSCLE and Clustal Omega, in both accuracy and runtime. For phylogenetic inference, FastTree 2, IQ-TREE, RAxML-NG, and PhyML had similar topological accuracies, but branch lengths and pairwise distances were consistently most accurate in phylogenies inferred by RAxML-NG. However, FastTree 2 was the fastest, by orders of magnitude, and when the other tools were used to optimize branch lengths along a fixed FastTree 2 topology, the resulting phylogenies had accuracies that were indistinguishable from their original counterparts, but with a fraction of the runtime.
|
20
|
Mai U, Mirarab S. Completing gene trees without species trees in sub-quadratic time. Bioinformatics 2022; 38:1532-1541. [PMID: 34978565 DOI: 10.1093/bioinformatics/btab875] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/23/2021] [Revised: 11/27/2021] [Accepted: 12/30/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION: As genome-wide reconstruction of phylogenetic trees becomes more widespread, limitations of available data are being appreciated more than ever before. One issue is that phylogenomic datasets are riddled with missing data, and gene trees, in particular, almost always lack representatives from some species otherwise available in the dataset. Since many downstream applications of gene trees require or can benefit from access to complete gene trees, it will be beneficial to algorithmically complete gene trees. Also, gene trees are often unrooted, and rooting them is useful for downstream applications. While completing and rooting a gene tree with respect to a given species tree has been studied, those problems are not studied in depth when we lack such a reference species tree. RESULTS: We study completion of gene trees without a need for a reference species tree. We formulate an optimization problem to complete the gene trees while minimizing their quartet distance to the given set of gene trees. We extend a seminal algorithm by Brodal et al. to solve this problem in quasi-linear time. In simulated studies and on a large empirical dataset, we show that completion of gene trees using other gene trees is relatively accurate and, unlike the case where a species tree is available, is unbiased. AVAILABILITY AND IMPLEMENTATION: Our method, tripVote, is available at https://github.com/uym2/tripVote. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Uyen Mai
- Department of Computer Science and Engineering, University of California San Diego, San Diego, CA 92093, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA 92093, USA
| |
|
21
|
Chlamydia pecorum Ovine Abortion: Associations between Maternal Infection and Perinatal Mortality. Pathogens 2021; 10:pathogens10111367. [PMID: 34832523 PMCID: PMC8618313 DOI: 10.3390/pathogens10111367] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/30/2021] [Revised: 10/18/2021] [Accepted: 10/20/2021] [Indexed: 12/29/2022] Open
Abstract
Chlamydia pecorum is a common gastrointestinal inhabitant of livestock but infections can manifest in a broad array of clinical presentations and in a range of host species. While C. pecorum is a known cause of ovine abortion, clinical cases have only recently been described in detail. Here, the prevalence and sequence types (STs) of C. pecorum in ewes from a property experiencing high levels of perinatal mortality (PNM) in New South Wales (NSW), Australia, were investigated using serological and molecular methods. Ewes that were PNM+ were statistically more likely to test seropositive compared to PNM− ewes and displayed higher antibody titres; however, an increase in chlamydial shedding from either the rectum, vagina or conjunctiva of PNM+ ewes was not observed. Multilocus sequence typing (MLST) indicated that C. pecorum ST23 was the major ST shed by ewes in the flock, was the only ST identified from the vaginal site, and was the same ST detected within aborted foetal tissues. Whole genome sequencing of C. pecorum isolated from one abortion case revealed that the C. pecorum plasmid (pCpec) contained a unique deletion in coding sequence 1 (CDS1) that was also present in C. pecorum ST23 shed from the ewes. A further unique deletion was noted in a polymorphic membrane protein gene (pmpG) of the C. pecorum chromosome, which warrants further investigation given the role of PmpG in host cell adherence and tissue tropism. This study describes novel infection parameters in a sheep flock experiencing C. pecorum-associated perinatal mortality, provides the first genomic data from an abortigenic C. pecorum strain, and raises questions about possible links between unique genetic features of this strain and C. pecorum abortion.
|
22
|
Cruaud A, Delvare G, Nidelet S, Sauné L, Ratnasingham S, Chartois M, Blaimer BB, Gates M, Brady SG, Faure S, van Noort S, Rossi JP, Rasplus JY. Ultra-Conserved Elements and morphology reciprocally illuminate conflicting phylogenetic hypotheses in Chalcididae (Hymenoptera, Chalcidoidea). Cladistics 2021; 37:1-35. [PMID: 34478176 DOI: 10.1111/cla.12416] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Accepted: 02/15/2020] [Indexed: 11/30/2022] Open
Abstract
Recent technical advances combined with novel computational approaches have promised the acceleration of our understanding of the tree of life. However, when it comes to hyperdiverse and poorly known groups of invertebrates, studies are still scarce. As published phylogenies will be rarely challenged by future taxonomists, careful attention must be paid to potential analytical bias. We present the first molecular phylogenetic hypothesis for the family Chalcididae, a group of parasitoid wasps, with a representative sampling (144 ingroups and seven outgroups) that covers all described subfamilies and tribes, and 82% of the known genera. Analyses of 538 Ultra-Conserved Elements (UCEs) with supermatrix (RAxML and IQTREE) and gene tree reconciliation approaches (ASTRAL, ASTRID) resulted in highly supported topologies in overall agreement with morphology but reveal conflicting topologies for some of the deepest nodes. To resolve these conflicts, we explored the phylogenetic tree space with clustering and gene genealogy interrogation methods, analyzed marker and taxon properties that could bias inferences and performed a thorough morphological analysis (130 characters encoded for 40 taxa representative of the diversity). This joint analysis reveals that UCEs enable attainment of resolution between ancestry and convergent/divergent evolution when morphology is not informative enough, but also shows that a systematic exploration of bias with different analytical methods and a careful analysis of morphological features is required to prevent publication of artifactual results. We highlight a GC content bias for maximum-likelihood approaches, an artifactual mid-point rooting of the ASTRAL tree and a deleterious effect of high percentage of missing data (>85% missing UCEs) on gene tree reconciliation methods. 
Based on the results we propose a new classification of the family into eight subfamilies and ten tribes that lay the foundation for future studies on the evolutionary history of Chalcididae.
Affiliation(s)
- Astrid Cruaud
- CBGP, CIRAD, INRAe, IRD, Montpellier SupAgro, Université de Montpellier, Montpellier, France
| | - Gérard Delvare
- CBGP, CIRAD, INRAe, IRD, Montpellier SupAgro, Université de Montpellier, Montpellier, France
- UMR CBGP, CIRAD, F-34398, Montpellier, France
| | - Sabine Nidelet
- CBGP, CIRAD, INRAe, IRD, Montpellier SupAgro, Université de Montpellier, Montpellier, France
| | - Laure Sauné
- CBGP, CIRAD, INRAe, IRD, Montpellier SupAgro, Université de Montpellier, Montpellier, France
| | | | - Marguerite Chartois
- CBGP, CIRAD, INRAe, IRD, Montpellier SupAgro, Université de Montpellier, Montpellier, France
| | | | - Michael Gates
- USDA, ARS, SEL, c/o Smithsonian Institution, National Museum of Natural History, Washington, DC, USA
| | - Seán G Brady
- Department of Entomology, Smithsonian Institution, National Museum of Natural History, Washington, DC, USA
| | - Sariana Faure
- Department of Zoology and Entomology, Rhodes University, Grahamstown, South Africa
| | - Simon van Noort
- Research and Exhibitions Department, South African Museum, Iziko Museums of South Africa, PO Box 61, Cape Town, 8000, South Africa.,Department of Biological Sciences, University of Cape Town, Private Bag, Rondebosch, 7701, Cape Town, South Africa
| | - Jean-Pierre Rossi
- CBGP, CIRAD, INRAe, IRD, Montpellier SupAgro, Université de Montpellier, Montpellier, France
| | - Jean-Yves Rasplus
- CBGP, CIRAD, INRAe, IRD, Montpellier SupAgro, Université de Montpellier, Montpellier, France
|
23
|
Rasplus JY, Rodriguez LJ, Sauné L, Peng YQ, Bain A, Kjellberg F, Harrison RD, Pereira RAS, Ubaidillah R, Tollon-Cordet C, Gautier M, Rossi JP, Cruaud A. Exploring systematic biases, rooting methods and morphological evidence to unravel the evolutionary history of the genus Ficus (Moraceae). Cladistics 2021; 37:402-422. [PMID: 34478193 DOI: 10.1111/cla.12443] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Accepted: 10/11/2020] [Indexed: 11/28/2022] Open
Abstract
Despite many attempts in the Sanger sequencing era, the phylogeny of fig trees remains unresolved, which limits our ability to analyze the evolution of key traits that may have contributed to their evolutionary and ecological success. We used restriction-site-associated DNA sequencing (c. 420 kb) and 102 morphological characters to elucidate the relationships between 70 species of Ficus. To increase phylogenetic information for higher-level relationships, we targeted conserved regions and assembled paired reads into long loci to enable the retrieval of homologous loci in outgroup genomes. We compared morphological and molecular results to highlight discrepancies and reveal possible inference bias. For the first time, we recovered a monophyletic subgenus Urostigma (stranglers) and a clade with all gynodioecious Ficus. However, we show, with a new approach based on iterative principal component analysis, that it is not (and will probably never be) possible to homogenize evolutionary rates and GC content for all taxa before phylogenetic inference. Four competing positions for the root of the molecular tree are possible. The placement of section Pharmacosycea as sister to other fig trees is not supported by morphological data and considered a result of a long-branch attraction artefact to the outgroups. Regarding morphological features and indirect evidence from the pollinator tree of life, the topology that divides Ficus into monoecious versus gynodioecious species appears most plausible. It seems most likely that the ancestor of fig trees was a freestanding tree and active pollination is inferred as the ancestral state, contrary to previous hypotheses. However, ambiguity remains on the ancestral breeding system. Despite morphological plasticity, we advocate restoring a central role to morphology in our understanding of the evolution of Ficus, as it can help detect systematic errors that appear more pronounced with larger molecular datasets.
Affiliation(s)
- Jean-Yves Rasplus
- CBGP, INRAE, CIRAD, IRD, Montpellier SupAgro, Université de Montpellier, Montpellier, 34988, France
| | - Lillian Jennifer Rodriguez
- Institute of Biology, University of the Philippines Diliman, Quezon City, 1101, Philippines.,Natural Sciences Research Institute, University of the Philippines Diliman, Quezon City, 1101, Philippines
| | - Laure Sauné
- CBGP, INRAE, CIRAD, IRD, Montpellier SupAgro, Université de Montpellier, Montpellier, 34988, France
| | - Yang-Qiong Peng
- CAS Key Laboratory of Tropical Forest Ecology, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Kunming, 650223, China
| | - Anthony Bain
- Department of Biological Sciences, National Sun Yat-sen University, Kaohsiung, 80424, Taiwan
| | - Finn Kjellberg
- CEFE, CNRS, Université Paul-Valéry Montpellier, EPHE, Université de Montpellier, Montpellier, 34090, France
| | - Rhett D Harrison
- World Agroforestry, Eastern and Southern Africa Region, 13 Elm Road, Woodlands, Lusaka, 10101, Zambia
| | - Rodrigo A S Pereira
- Departamento de Biologia, FFCLRP, Universidade de São Paulo, Ribeirão Preto, SP, 14040-901, Brazil
| | - Rosichon Ubaidillah
- Museum Zoologicum Bogoriense, LIPI, Gedung Widyasatwaloka, Jln Raya km 46, Cibinong, Bogor, 16911, Indonesia
| | - Christine Tollon-Cordet
- AGAP, INRA, CIRAD, Montpellier SupAgro, Université de Montpellier, Montpellier, 34398, France
| | - Mathieu Gautier
- CBGP, INRAE, CIRAD, IRD, Montpellier SupAgro, Université de Montpellier, Montpellier, 34988, France
| | - Jean-Pierre Rossi
- CBGP, INRAE, CIRAD, IRD, Montpellier SupAgro, Université de Montpellier, Montpellier, 34988, France
| | - Astrid Cruaud
- CBGP, INRAE, CIRAD, IRD, Montpellier SupAgro, Université de Montpellier, Montpellier, 34988, France
|
24
|
Naser-Khdour S, Minh BQ, Lanfear R. Assessing Confidence in Root Placement on Phylogenies: An Empirical Study Using Non-Reversible Models for Mammals. Syst Biol 2021; 71:959-972. [PMID: 34387349 PMCID: PMC9260635 DOI: 10.1093/sysbio/syab067] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 07/30/2020] [Revised: 08/03/2021] [Accepted: 08/11/2021] [Indexed: 11/14/2022] Open
Abstract
Using time-reversible Markov models is a very common practice in phylogenetic analysis, because although we expect many of their assumptions to be violated by empirical data, they provide high computational efficiency. However, these models lack the ability to infer the root placement of the estimated phylogeny. In order to compensate for the inability of these models to root the tree, many researchers use external information such as using outgroup taxa or additional assumptions such as molecular clocks. In this study, we investigate the utility of nonreversible models to root empirical phylogenies and introduce a new bootstrap measure, the rootstrap, which provides information on the statistical support for any given root position. [Bootstrap; nonreversible models; phylogenetic inference; root estimation.]
Affiliation(s)
- Suha Naser-Khdour
- Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
| | - Bui Quang Minh
- Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia.,Research School of Computer Science, Australian National University, Canberra, Australian Capital Territory, Australia
| | - Robert Lanfear
- Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
|
25
|
Bettisworth B, Stamatakis A. Root Digger: a root placement program for phylogenetic trees. BMC Bioinformatics 2021; 22:225. [PMID: 33932975 PMCID: PMC8088003 DOI: 10.1186/s12859-021-03956-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 10/05/2020] [Accepted: 01/01/2021] [Indexed: 01/30/2023] Open
Abstract
BACKGROUND In phylogenetic analysis, it is common to infer unrooted trees. However, knowing the root location is desirable for downstream analyses and interpretation. There exist several methods to recover a root, such as molecular clock analysis (including midpoint rooting) or rooting the tree using an outgroup. Non-reversible Markov models can also be used to compute the likelihood of a potential root position. RESULTS We present software called RootDigger, which uses a non-reversible Markov model to compute the most likely root location on a given tree and to infer a confidence value for each possible root placement. We find that RootDigger is successful at finding roots when compared to similar tools such as IQ-TREE and MAD, and will occasionally outperform them. Additionally, we find that the exhaustive mode of RootDigger is useful in quantifying and explaining uncertainty in rooting positions. CONCLUSIONS RootDigger can be used on an existing phylogeny to find a root, or to assess the uncertainty of the root placement. RootDigger is available under the MIT licence at https://www.github.com/computations/root_digger.
Affiliation(s)
- Ben Bettisworth
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Institut für Theoretische Informatik, Karlsruhe Institute of Technology, Karlsruhe, Germany
|
26
|
Moshiri N, Smith DM, Mirarab S. HIV Care Prioritization Using Phylogenetic Branch Length. J Acquir Immune Defic Syndr 2021; 86:626-637. [PMID: 33394616 PMCID: PMC7933099 DOI: 10.1097/qai.0000000000002612] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/14/2020] [Accepted: 12/14/2020] [Indexed: 12/22/2022]
Abstract
BACKGROUND The structure of the HIV transmission networks can be dictated by just a few individuals. Public health intervention, such as ensuring people living with HIV adhere to antiretroviral therapy and remain virally suppressed, can help control the spread of the virus. However, such intervention requires using limited public health resource allocations. Determining which individuals are most at risk of transmitting HIV could allow public health officials to focus their limited resources on these individuals. SETTING Molecular epidemiology can help prioritize people living with HIV by patterns of transmission inferred from their sampled viral sequences. Such prioritization has been previously suggested and performed by monitoring cluster growth. In this article, we introduce Prioritization using AnCesTral edge lengths (ProACT), a phylogenetic approach for prioritizing individuals living with HIV. METHODS ProACT starts from a phylogeny inferred from sequence data and orders individuals according to their terminal branch length, breaking ties using ancestral branch lengths. We evaluated ProACT on a real data set of 926 HIV-1 subtype B pol data obtained in San Diego between 2005 and 2014 and a simulation data set modeling the same epidemic. Prioritization methods are compared by their ability to predict individuals who transmit most after the prioritization. RESULTS Across all simulation conditions and most real data sampling conditions, ProACT outperformed monitoring cluster growth for multiple metrics of prioritization efficacy. CONCLUSION The simple strategy used by ProACT improves the effectiveness of prioritization compared with state-of-the-art methods that rely on monitoring the growth of transmission clusters defined based on genetic distance.
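The ordering rule in the METHODS sentence can be illustrated with a short Python sketch. It is not the ProACT implementation: the tree encoding (a child-to-parent map), the function names, and the choice of ascending order (shortest terminal branch first) are assumptions made for illustration, since the abstract does not state them.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical tree encoding: node -> (parent, length of the branch above the node).
# The root maps to (None, 0.0).
Tree = Dict[str, Tuple[Optional[str], float]]


def branch_lengths_to_root(tree: Tree, leaf: str) -> Tuple[float, ...]:
    """Branch lengths from the leaf upward: terminal branch first, then ancestors."""
    lengths = []
    node = leaf
    while tree[node][0] is not None:
        parent, length = tree[node]
        lengths.append(length)
        node = parent
    return tuple(lengths)


def prioritize(tree: Tree) -> List[str]:
    """Order leaves by terminal branch length, breaking ties with ancestral lengths.

    Assumption: shorter branches come first. Leaf names are a final tie-breaker
    so the output is deterministic.
    """
    internal = {parent for parent, _ in tree.values() if parent is not None}
    leaves = [node for node in tree if node not in internal]
    return sorted(leaves, key=lambda leaf: (branch_lengths_to_root(tree, leaf), leaf))


example_tree: Tree = {
    "root": (None, 0.0),
    "n1": ("root", 0.05),
    "n2": ("root", 0.20),
    "A": ("n1", 0.10),
    "B": ("n1", 0.02),
    "C": ("n2", 0.10),
    "D": ("n2", 0.40),
}
ranking = prioritize(example_tree)
```

In this toy tree, A and C share the same terminal branch length, so their parent branches (0.05 versus 0.20) decide their relative order.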
Affiliation(s)
- Niema Moshiri
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, 92093, USA
| | - Davey M. Smith
- Department of Medicine, University of California, San Diego, La Jolla, 92093, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, 92093, USA
|
27
|
Abstract
The rooting of the SARS-CoV-2 phylogeny is important for understanding the origin and early spread of the virus. Previously published phylogenies have used different rootings that do not always provide consistent results. We investigate several different strategies for rooting the SARS-CoV-2 tree and provide measures of statistical uncertainty for all methods. We show that methods based on the molecular clock tend to place the root in the B clade, whereas methods based on outgroup rooting tend to place the root in the A clade. The results from the two approaches are statistically incompatible, possibly as a consequence of deviations from a molecular clock or excess back-mutations. We also show that none of the methods provide strong statistical support for the placement of the root in any particular edge of the tree. These results suggest that phylogenetic evidence alone is unlikely to identify the origin of the SARS-CoV-2 virus and we caution against strong inferences regarding the early spread of the virus based solely on such evidence.
Affiliation(s)
- Lenore Pipes
- Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA
| | - Hongru Wang
- Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA
| | - John P Huelsenbeck
- Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA
| | - Rasmus Nielsen
- Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA
- Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
- Globe Institute, University of Copenhagen, Copenhagen, Denmark
|
28
|
Abstract
Phylogenetic trees inferred from sequence data often have branch lengths measured in the expected number of substitutions and therefore, do not have divergence times estimated. These trees give an incomplete view of evolutionary histories since many applications of phylogenies require time trees. Many methods have been developed to convert the inferred branch lengths from substitution unit to time unit using calibration points, but none is universally accepted as they are challenged in both scalability and accuracy under complex models. Here, we introduce a new method that formulates dating as a nonconvex optimization problem where the variance of log-transformed rate multipliers is minimized across the tree. On simulated and real data, we show that our method, wLogDate, is often more accurate than alternatives and is more robust to various model assumptions.
Affiliation(s)
- Uyen Mai
- Department of Computer Science and Engineering, UC, San Diego, CA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, UC, San Diego, CA
|
29
|
Wade T, Rangel LT, Kundu S, Fournier GP, Bansal MS. Assessing the accuracy of phylogenetic rooting methods on prokaryotic gene families. PLoS One 2020; 15:e0232950. [PMID: 32413061 PMCID: PMC7228096 DOI: 10.1371/journal.pone.0232950] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 11/01/2019] [Accepted: 04/24/2020] [Indexed: 12/18/2022] Open
Abstract
Almost all standard phylogenetic methods for reconstructing gene trees result in unrooted trees; yet, many of the most useful applications of gene trees require that the gene trees be correctly rooted. As a result, several computational methods have been developed for inferring the root of unrooted gene trees. However, the accuracy of such methods has never been systematically evaluated on prokaryotic gene families, where horizontal gene transfer is often one of the dominant evolutionary events driving gene family evolution. In this work, we address this gap by conducting a thorough comparative evaluation of five different rooting methods using large collections of both simulated and empirical prokaryotic gene trees. Our simulation study is based on 6000 true and reconstructed gene trees on 100 species and characterizes the rooting accuracy of the four methods under 36 different evolutionary conditions and 3 levels of gene tree reconstruction error. The empirical study is based on a large, carefully designed data set of 3098 gene trees from 504 bacterial species (406 Alphaproteobacteria and 98 Cyanobacteria) and reveals insights that supplement those gleaned from the simulation study. Overall, this work provides several valuable insights into the accuracy of the considered methods that will help inform the choice of rooting methods to use when studying microbial gene family evolution. Among other findings, this study identifies parsimonious Duplication-Transfer-Loss (DTL) rooting and Minimal Ancestor Deviation (MAD) rooting as two of the most accurate gene tree rooting methods for prokaryotes and specifies the evolutionary conditions under which these methods are most accurate, demonstrates that DTL rooting is highly sensitive to high evolutionary rates and gene tree error, and that rooting methods based on branch-lengths are generally robust to gene tree reconstruction error.
Affiliation(s)
- Taylor Wade
- Department of Computer Science & Engineering, University of Connecticut, Storrs, CT, United States of America
| | - L. Thiberio Rangel
- Department of Earth, Atmospheric & Planetary Sciences, Massachusetts Institute of Technology, Cambridge, MA, United States of America
| | - Soumya Kundu
- Department of Computer Science & Engineering, University of Connecticut, Storrs, CT, United States of America
| | - Gregory P. Fournier
- Department of Earth, Atmospheric & Planetary Sciences, Massachusetts Institute of Technology, Cambridge, MA, United States of America
| | - Mukul S. Bansal
- Department of Computer Science & Engineering, University of Connecticut, Storrs, CT, United States of America
- Institute for Systems Genomics, University of Connecticut, Storrs, CT, United States of America
|
30
|
Stadler PF, Geiß M, Schaller D, López Sánchez A, González Laffitte M, Valdivia DI, Hellmuth M, Hernández Rosales M. From pairs of most similar sequences to phylogenetic best matches. Algorithms Mol Biol 2020; 15:5. [PMID: 32308731 PMCID: PMC7147060 DOI: 10.1186/s13015-020-00165-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/16/2019] [Accepted: 03/26/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Many of the commonly used methods for orthology detection start from mutually most similar pairs of genes (reciprocal best hits) as an approximation for evolutionarily most closely related pairs of genes (reciprocal best matches). This approximation of best matches by best hits becomes exact for ultrametric dissimilarities, i.e., under the Molecular Clock Hypothesis. It fails, however, whenever there are large lineage-specific rate variations among paralogous genes. In practice, this introduces a high level of noise into the input data for best-hit-based orthology detection methods. RESULTS If additive distances between genes are known, then evolutionarily most closely related pairs can be identified by considering certain quartets of genes, provided that in each quartet the outgroup relative to the remaining three genes is known. A priori knowledge of the underlying species phylogeny greatly facilitates the identification of the required outgroup. Although the workflow remains a heuristic since the correct outgroup cannot be determined reliably in all cases, simulations with lineage-specific biases and rate asymmetries show that nearly perfect results can be achieved. In a realistic setting, where distance data have to be estimated from sequence data and hence are noisy, it is still possible to obtain highly accurate sets of best matches. CONCLUSION Improvements of tree-free orthology assessment methods can be expected from a combination of the accurate inference of best matches reported here and recent mathematical advances in the understanding of (reciprocal) best match graphs and orthology relations. AVAILABILITY Accompanying software is available at https://github.com/david-schaller/AsymmeTree.
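The notion of reciprocal best hits used as the starting point in the BACKGROUND can be sketched in a few lines of Python. This is an illustrative sketch only and is unrelated to the AsymmeTree code base: the function name, the dictionary-based distance table, and the toy gene names are assumptions, and ties are resolved by simply taking the first minimum.

```python
from typing import Dict, List, Tuple


def reciprocal_best_hits(
    dist: Dict[Tuple[str, str], float],
    genes_a: List[str],
    genes_b: List[str],
) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where b is a's most similar gene in species B and vice versa.

    `dist[(a, b)]` is a dissimilarity between gene a (species A) and gene b (species B).
    """
    # Best hit in species B for every gene of species A.
    best_for_a = {a: min(genes_b, key=lambda b: dist[(a, b)]) for a in genes_a}
    # Best hit in species A for every gene of species B.
    best_for_b = {b: min(genes_a, key=lambda a: dist[(a, b)]) for b in genes_b}
    # Keep only the mutual pairs.
    return [(a, b) for a, b in best_for_a.items() if best_for_b[b] == a]


toy_dist = {
    ("a1", "b1"): 0.1,
    ("a1", "b2"): 0.5,
    ("a2", "b1"): 0.2,
    ("a2", "b2"): 0.6,
}
pairs = reciprocal_best_hits(toy_dist, ["a1", "a2"], ["b1", "b2"])
```

Here both a1 and a2 have b1 as their best hit, but b1 is closest to a1, so only (a1, b1) is reciprocal. Whether such a pair is also a best match in the evolutionary sense depends on rate variation, which is the gap the abstract addresses.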
Affiliation(s)
- Peter F. Stadler
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, 04107 Leipzig, Germany
- Competence Center for Scalable Data Services and Solutions Dresden/Leipzig, Interdisciplinary Center for Bioinformatics, German Centre for Integrative Biodiversity Research (iDiv), and Leipzig Research Center for Civilization Diseases, Universität Leipzig, Augustusplatz 12, 04107 Leipzig, Germany
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany
- Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, 1090 Vienna, Austria
- Facultad de Ciencias, Universidad Nacional de Colombia, Sede Bogotá, Ciudad Universitaria, 111321 Bogotá, D.C., Colombia
- Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA
| | - Manuela Geiß
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, 04107 Leipzig, Germany
- Software Competence Center Hagenberg GmbH, Softwarepark 21, 4232 Hagenberg, Austria
| | - David Schaller
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, 04107 Leipzig, Germany
| | - Alitzel López Sánchez
- CONACYT-Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO México
| | - Marcos González Laffitte
- CONACYT-Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO México
| | - Dulce I. Valdivia
- Departamento de Ingeniería Genética, Centro de Investigación y de Estudios Avanzados del IPN (CINVESTAV), Km. 9.6 Libramiento Norte Carretera Irapuato-León, 36821 Irapuato, GTO México
| | - Marc Hellmuth
- School of Computing, University of Leeds, E C Stoner Building, Leeds, LS2 9JT UK
| | - Maribel Hernández Rosales
- CONACYT-Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO México
|
31
|
Abstract
Phylogenetic trees are essential to evolutionary biology, and numerous methods exist that attempt to extract phylogenetic information applicable to a wide range of disciplines, such as epidemiology and metagenomics. Currently, the three main Python packages for trees are Bio.Phylo, DendroPy, and the ETE Toolkit, but as dataset sizes grow, parsing and manipulating ultra-large trees becomes impractical for these tools. To address this issue, we present TreeSwift, a user-friendly and massively scalable Python package for traversing and manipulating trees that is ideal for algorithms performed on ultra-large trees.
Affiliation(s)
- N Moshiri
- Department of Computer Science and Engineering, UC San Diego, 92093, USA
|
32
|
Lamarca AP, Schrago CG. Fast speciations and slow genes: uncovering the root of living canids. Biol J Linn Soc Lond 2019. [DOI: 10.1093/biolinnean/blz181] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]
Abstract
Despite ongoing efforts relying on computationally intensive tree-building methods and large datasets, the deeper phylogenetic relationships between living canid genera remain controversial. We demonstrate that this issue arises fundamentally from the uncertainty of root placement as a consequence of the short length of the branch connecting the major canid clades, which probably resulted from a fast radiation during the early diversification of extant Canidae. Using both nuclear and mitochondrial genes, we investigate the position of the canid root and its consistency by using three rooting methods. We find that mitochondrial genomes consistently retrieve a root node separating the tribe Canini from the remaining canids, whereas nuclear data mostly recover a root that places the Urocyon foxes as the sister lineage of living canids. We demonstrate that, to resolve the canid root, the nuclear segments sequenced so far are significantly less informative than mitochondrial genomes. We also propose that short intervals between speciations obscure the place of the true root, because methods are susceptible to stochastic error in the presence of short internal branches near the root.
Affiliation(s)
- Alessandra P Lamarca
- Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
| | - Carlos G Schrago
- Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
|
33
|
Kim KT, Ko J, Song H, Choi G, Kim H, Jeon J, Cheong K, Kang S, Lee YH. Evolution of the Genes Encoding Effector Candidates Within Multiple Pathotypes of Magnaporthe oryzae. Front Microbiol 2019; 10:2575. [PMID: 31781071 PMCID: PMC6851232 DOI: 10.3389/fmicb.2019.02575] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 08/16/2019] [Accepted: 10/24/2019] [Indexed: 01/08/2023] Open
Abstract
Magnaporthe oryzae infects rice, wheat, and many grass species in the Poaceae family by secreting protein effectors. Here, we analyzed the distribution, sequence variation, and genomic context of effector candidate (EFC) genes in 31 isolates that represent five pathotypes of M. oryzae, three isolates of M. grisea, a sister species of M. oryzae, and one strain each for eight species in the family Magnaporthaceae to investigate how the host range expansion of M. oryzae has likely affected the evolution of effectors. We used the EFC genes of M. oryzae strain 70-15, whose genome has served as a reference for many comparative genomics analyses, to identify their homologs in these strains. We also analyzed the previously characterized avirulence (AVR) genes and single-copy orthologous (SCO) genes in these strains, which showed that the EFC and AVR genes evolved faster than the SCO genes. The EFC and AVR repertoires among M. oryzae pathotypes varied widely probably because adaptation to individual hosts exerted different types of selection pressure. Repetitive DNA elements appeared to have caused the variation of some EFC genes. Lastly, we analyzed expression patterns of the AVR and EFC genes to test the hypothesis that such genes are preferentially expressed during host infection. This comprehensive dataset serves as a foundation for future studies on the genetic basis of the evolution and host specialization in M. oryzae.
Affiliation(s)
- Ki-Tae Kim
- Department of Agricultural Biotechnology, Seoul National University, Seoul, South Korea
| | - Jaeho Ko
- Department of Agricultural Biotechnology, Seoul National University, Seoul, South Korea
| | - Hyeunjeong Song
- Interdisciplinary Program in Agricultural Genomics, Seoul National University, Seoul, South Korea
| | - Gobong Choi
- Interdisciplinary Program in Agricultural Genomics, Seoul National University, Seoul, South Korea
| | - Hyunbin Kim
- Interdisciplinary Program in Agricultural Genomics, Seoul National University, Seoul, South Korea
| | - Jongbum Jeon
- Interdisciplinary Program in Agricultural Genomics, Seoul National University, Seoul, South Korea
| | - Kyeongchae Cheong
- Interdisciplinary Program in Agricultural Genomics, Seoul National University, Seoul, South Korea
| | - Seogchan Kang
- Department of Plant Pathology and Environmental Microbiology, The Pennsylvania State University, State College, PA, United States
| | - Yong-Hwan Lee
- Department of Agricultural Biotechnology, Seoul National University, Seoul, South Korea.,Interdisciplinary Program in Agricultural Genomics, Seoul National University, Seoul, South Korea.,Center for Fungal Genetic Resources, Seoul National University, Seoul, South Korea.,Plant Immunity Research Center, Seoul National University, Seoul, South Korea.,Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, South Korea
|
34
|
Balaban M, Moshiri N, Mai U, Jia X, Mirarab S. TreeCluster: Clustering biological sequences using phylogenetic trees. PLoS One 2019; 14:e0221068. [PMID: 31437182 PMCID: PMC6705769 DOI: 10.1371/journal.pone.0221068] [Citation(s) in RCA: 68] [Impact Index Per Article: 13.6] [Received: 05/26/2019] [Accepted: 07/26/2019] [Indexed: 02/01/2023] Open
Abstract
Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints, limiting (1) the diameter of each cluster, (2) the sum of its branch lengths, or (3) chains of pairwise distances. These three problems can be solved in time that increases linearly with the size of the tree, and for two of the three criteria, the algorithms have been known in the theoretical computer science literature. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
Affiliation(s)
- Metin Balaban
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, La Jolla, CA 92093, United States of America
| | - Niema Moshiri
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, La Jolla, CA 92093, United States of America
| | - Uyen Mai
- Computer Science and Engineering, UC San Diego, La Jolla, CA 92093, United States of America
| | - Xingfan Jia
- Department of Mathematics, UC San Diego, La Jolla, CA 92093, United States of America
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, La Jolla, CA 92093, United States of America
|
35
|
Grant T. Outgroup sampling in phylogenetics: Severity of test and successive outgroup expansion. J Zool Syst Evol Res 2019. [DOI: 10.1111/jzs.12317] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 11/29/2022]
Affiliation(s)
- Taran Grant
- Department of Zoology, Institute of Biosciences, University of São Paulo, São Paulo, Brazil
| |
|
36
|
Moshiri N, Mirarab S. A Two-State Model of Tree Evolution and Its Applications to Alu Retrotransposition. Syst Biol 2018; 67:475-489. [PMID: 29165679 DOI: 10.1093/sysbio/syx088] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 03/27/2017] [Accepted: 11/15/2017] [Indexed: 11/14/2022] Open
Abstract
Models of tree evolution have mostly focused on capturing the cladogenesis processes behind speciation. Processes that drive the evolution of genomic elements, such as repeats, are not necessarily captured by these existing models. In this article, we design a model of tree evolution that we call the dual-birth model, and we show how it can be useful in studying the evolution of short Alu repeats found in abundance in the human genome. The dual-birth model extends the traditional birth-only model to have two rates of propagation: one for active nodes that propagate often, and another for inactive nodes that, at a lower rate, activate and start propagating. Adjusting the ratio of the rates controls the expected tree balance. We present several theoretical results under the dual-birth model, introduce parameter estimation techniques, and study the properties of the model in simulations. We then use the dual-birth model to estimate the number of active Alu elements and their rates of propagation and activation in the human genome based on a large phylogenetic tree that we build from close to one million Alu sequences.
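The two-rate process described in this abstract can be made concrete with a small stochastic simulation. This is a rough sketch under stated assumptions, not the authors' implementation: it tracks only the counts of active and inactive lineages, it assumes that every split yields one active and one inactive lineage, and the names `simulate_dual_birth`, `rate_active`, and `rate_inactive` are hypothetical.

```python
import random


def simulate_dual_birth(rate_active, rate_inactive, n_lineages, seed=0):
    """Simulate lineage counts under a two-rate birth process (sketch).

    Assumption of this sketch: a split always leaves one active and one
    inactive lineage. Active lineages split at `rate_active`; inactive
    lineages activate and split at `rate_inactive`.
    """
    rng = random.Random(seed)
    active, inactive = 1, 0  # start from a single active lineage
    elapsed = 0.0
    while active + inactive < n_lineages:
        total_rate = active * rate_active + inactive * rate_inactive
        # Waiting time to the next event is exponential in the total rate.
        elapsed += rng.expovariate(total_rate)
        if rng.random() < active * rate_active / total_rate:
            # An active lineage splits: it stays active, gains an inactive child.
            inactive += 1
        else:
            # An inactive lineage activates and splits into active + inactive.
            active += 1
    return active, inactive, elapsed


active, inactive, elapsed = simulate_dual_birth(10.0, 1.0, n_lineages=100)
print(active, inactive, round(elapsed, 3))
```

Raising `rate_inactive` toward `rate_active` makes the two lineage types behave alike, which is one way to see how the ratio of the rates could influence tree balance.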
Affiliation(s)
- Niema Moshiri
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, 9500 Gilman Dr., La Jolla, CA 92093, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, 9500 Gilman Dr., La Jolla, CA 92093, USA
| |
|
37
|
Scott Chialvo CH, White BE, Reed LK, Dyer KA. A phylogenetic examination of host use evolution in the quinaria and testacea groups of Drosophila. Mol Phylogenet Evol 2018; 130:233-243. [PMID: 30366088 DOI: 10.1016/j.ympev.2018.10.027] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 07/06/2018] [Revised: 10/05/2018] [Accepted: 10/20/2018] [Indexed: 12/26/2022]
Abstract
Adaptive radiations provide an opportunity to examine complex evolutionary processes such as ecological specialization and speciation. While a well-resolved phylogenetic hypothesis is critical to completing such studies, the rapid rates of evolution in these groups can impede phylogenetic studies. Here we study the quinaria and testacea species groups of the immigrans-tripunctata radiation of Drosophila, which represent a recent adaptive radiation and are a developing model system for ecological genetics. We were especially interested in understanding host use evolution in these species. To infer a phylogenetic hypothesis for this group, we sampled loci from both the nuclear genome and the mitochondrial DNA to develop a dataset of 43 protein-coding loci for these two groups along with their close relatives in the immigrans-tripunctata radiation. We used this dataset to examine their evolutionary relationships along with the evolution of feeding behavior. Our analysis recovers strong support for the monophyly of the testacea but not the quinaria group. Results from our ancestral state reconstruction analysis suggest that the ancestor of the testacea and quinaria groups exhibited mushroom-feeding. Within the quinaria group, we infer that the transition to vegetative feeding occurred twice, and that this transition did not coincide with a genome-wide change in the rate of protein evolution.
Affiliation(s)
- Clare H Scott Chialvo
- Department of Biological Sciences, University of Alabama, Tuscaloosa, AL 35487, USA
| | - Brooke E White
- Department of Genetics, University of Georgia, Athens, GA 30602, USA
| | - Laura K Reed
- Department of Biological Sciences, University of Alabama, Tuscaloosa, AL 35487, USA
| | - Kelly A Dyer
- Department of Genetics, University of Georgia, Athens, GA 30602, USA
| |
|
38
|
Abstract
BACKGROUND: Sequence data used in reconstructing phylogenetic trees may include various sources of error. Typically errors are detected at the sequence level, but when missed, the erroneous sequences often appear as unexpectedly long branches in the inferred phylogeny. RESULTS: We propose an automatic method to detect such errors. We build a phylogeny including all the data, then detect sequences that artificially inflate the tree diameter. We formulate an optimization problem, called the k-shrink problem, that seeks to find k leaves that could be removed to maximally reduce the tree diameter. We present an algorithm to find the exact solution for this problem in polynomial time. We then use several statistical tests to find outlier species that have an unexpectedly high impact on the tree diameter. These tests can use a single tree or a set of related gene trees and can also adjust to species-specific patterns of branch length. The resulting method is called TreeShrink. We test our method on six phylogenomic biological datasets and an HIV dataset and show that the method successfully detects and removes long branches. TreeShrink removes sequences more conservatively than rogue taxon removal and often reduces gene tree discordance more than rogue taxon removal once the amount of filtering is controlled. CONCLUSIONS: TreeShrink is an effective method for detecting sequences that lead to unrealistically long branch lengths in phylogenetic trees. The tool is publicly available at https://github.com/uym2/TreeShrink.
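The k-shrink objective stated in this abstract can be illustrated on a toy example. The sketch below is a brute-force search over all sets of k leaves, which is exponential in k and is not the polynomial-time algorithm the abstract refers to. It takes precomputed leaf-to-leaf path lengths as input, and the names `diameter`, `best_removal`, and the example distances are hypothetical.

```python
from itertools import combinations


def diameter(dist, leaves):
    """Largest pairwise path length among `leaves` (0.0 for fewer than two)."""
    return max((dist[frozenset(pair)] for pair in combinations(leaves, 2)),
               default=0.0)


def best_removal(dist, leaves, k):
    """Brute force: find k leaves whose removal minimizes the diameter.

    `dist` maps frozenset({a, b}) to the path length between leaves a and b.
    Removing a leaf does not change path lengths among the remaining leaves,
    so the restricted diameter can be read from the same table.
    """
    best = None
    for removed in combinations(sorted(leaves), k):
        kept = [leaf for leaf in leaves if leaf not in removed]
        d = diameter(dist, kept)
        if best is None or d < best[1]:
            best = (set(removed), d)
    return best


# Hypothetical distances: leaf X sits on an unusually long branch.
pairs = {("A", "B"): 2.0, ("A", "C"): 3.0, ("B", "C"): 3.0,
         ("A", "X"): 10.0, ("B", "X"): 10.0, ("C", "X"): 11.0}
dist = {frozenset(p): d for p, d in pairs.items()}
print(best_removal(dist, ["A", "B", "C", "X"], k=1))
```

In this toy input, dropping the single long-branch leaf shrinks the diameter from 11.0 to 3.0, which is the kind of disproportionate impact the abstract describes testing for.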
Affiliation(s)
- Uyen Mai
- Computer Science and Engineering, University of California at San Diego, San Diego, CA 92093, USA
| | - Siavash Mirarab
- Electrical and Computer Engineering, University of California at San Diego, San Diego, CA 92093, USA
| |
|