1. Criscuolo A. On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference. F1000Res 2020; 9:1309. [PMID: 33335719 PMCID: PMC7713896 DOI: 10.12688/f1000research.26930.1] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Accepted: 10/12/2020] [Indexed: 12/29/2022]
Abstract
Recently developed MinHash-based techniques have proven successful in quickly estimating the level of similarity between large nucleotide sequences. This article discusses their usage and limitations in practice for approximating uncorrected distances between genomes and for transforming these pairwise dissimilarities into proper evolutionary distances. It is notably shown that complex distance measures can be easily approximated using simple transformation formulae based on few parameters. MinHash-based techniques can therefore be very useful for implementing fast yet accurate alignment-free phylogenetic reconstruction procedures from large sets of genomes. This last point is assessed with a simulation study using a dedicated bioinformatics tool.
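The abstract does not state the transformation formulae, so the following is only a minimal sketch of the general idea under stated assumptions. It uses the widely known Mash-style conversion of a MinHash Jaccard estimate into an uncorrected distance, followed by a one-parameter logarithmic correction of the Jukes-Cantor/F81 family. The function names, the default `b = 0.75`, and the choice of correction are illustrative assumptions, not necessarily the formulae proposed in the article.

```python
import math


def mash_distance(jaccard: float, k: int) -> float:
    """Uncorrected distance from a MinHash Jaccard estimate (Mash-style formula).

    jaccard: estimated Jaccard index between the two k-mer sets, in [0, 1].
    k: k-mer length used to build the sketches.
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must lie in [0, 1]")
    if jaccard == 0.0:
        return 1.0  # no shared hash values: report the maximum distance
    d = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return min(1.0, max(0.0, d))


def corrected_distance(p: float, b: float = 0.75) -> float:
    """One-parameter logarithmic correction (assumed, JC69/F81-like).

    p: uncorrected distance (proportion of differing sites).
    b: saturation level; 0.75 corresponds to equal base frequencies.
    """
    if p >= b:
        return math.inf  # saturated: the correction is undefined
    return -b * math.log(1.0 - p / b)


if __name__ == "__main__":
    p = mash_distance(jaccard=0.5, k=21)
    print(f"uncorrected: {p:.5f}  corrected: {corrected_distance(p):.5f}")
```

The correction always returns a value at least as large as the uncorrected distance, which reflects the purpose of such transformations: accounting for multiple substitutions at the same site.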
Affiliation(s)
- Alexis Criscuolo
- Hub de Bioinformatique et Biostatistique - Département Biologie Computationnelle, Institut Pasteur, USR 3756, CNRS, 75015 Paris, France
2. Suvorov A, Hochuli J, Schrider DR. Accurate Inference of Tree Topologies from Multiple Sequence Alignments Using Deep Learning. Syst Biol 2020; 69:221-233. [PMID: 31504938 PMCID: PMC8204903 DOI: 10.1093/sysbio/syz060] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 03/07/2019] [Accepted: 08/28/2019] [Indexed: 11/13/2022]
Abstract
Reconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with their own strengths and weaknesses. Both simulation and empirical studies have identified several "zones" of parameter space where accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e. assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. In this study, we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate on simulated data, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can more accurately assess support for the chosen topology than bootstrap and posterior probability scores from traditional methods. Although numerous practical challenges remain, these findings suggest that deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.
Affiliation(s)
- Anton Suvorov
- Department of Genetics, University of North Carolina at Chapel Hill, 120 Mason Farm Road, UNC-Chapel Hill, Chapel Hill, NC 27599-7264, USA
- Joshua Hochuli
- Biological and Biomedical Sciences Program, University of North Carolina at Chapel Hill, 130 Mason Farm Road, UNC-Chapel Hill, Chapel Hill, NC 27599-7264, USA
- Daniel R Schrider
- Biological and Biomedical Sciences Program, University of North Carolina at Chapel Hill, 130 Mason Farm Road, UNC-Chapel Hill, Chapel Hill, NC 27599-7264, USA
3. Criscuolo A. A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies. Research Ideas and Outcomes 2019. [DOI: 10.3897/rio.5.e36178] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Indexed: 12/26/2022]
Abstract
This paper describes a novel alignment-free distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools. For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution. These pairwise evolutionary distances are then used to infer a phylogenetic tree and assess a confidence support for each internal branch. Analyses of both simulated and real genome datasets show that this bioinformatics procedure allows accurate phylogenetic trees to be reconstructed with fast running times, especially when launched on multiple threads. Implemented in a publicly available script, named JolyTree, this procedure is a useful approach for quickly inferring species trees without the burden and potential biases of multiple sequence alignments.
4. Saarela JM, Burke SV, Wysocki WP, Barrett MD, Clark LG, Craine JM, Peterson PM, Soreng RJ, Vorontsova MS, Duvall MR. A 250 plastome phylogeny of the grass family (Poaceae): topological support under different data partitions. PeerJ 2018; 6:e4299. [PMID: 29416954 PMCID: PMC5798404 DOI: 10.7717/peerj.4299] [Citation(s) in RCA: 76] [Impact Index Per Article: 12.7] [Received: 08/08/2017] [Accepted: 01/08/2018] [Indexed: 12/23/2022]
Abstract
The systematics of grasses has advanced through applications of plastome phylogenomics, although studies have been largely limited to subfamilies or other subgroups of Poaceae. Here we present a plastome phylogenomic analysis of 250 complete plastomes (179 genera) sampled from 44 of the 52 tribes of Poaceae. Plastome sequences were determined from high throughput sequencing libraries and the assemblies represent over 28.7 Mbases of sequence data. Phylogenetic signal was characterized in 14 partitions, including (1) complete plastomes; (2) protein coding regions; (3) noncoding regions; and (4) three loci commonly used in single and multi-gene studies of grasses. Each of the four main partitions was further refined, alternatively including or excluding positively selected codons and also the gaps introduced by the alignment. All 76 protein coding plastome loci were found to be predominantly under purifying selection, but specific codons were found to be under positive selection in 65 loci. The loci that have been widely used in multi-gene phylogenetic studies had among the highest proportions of positively selected codons, suggesting caution in the interpretation of these earlier results. Plastome phylogenomic analyses confirmed the backbone topology for Poaceae with maximum bootstrap support (BP). Among the 14 analyses, 82 clades out of 309 resolved were maximally supported in all trees. Analyses of newly sequenced plastomes were in agreement with current classifications. Five of seven partitions in which alignment gaps were removed retrieved Panicoideae as sister to the remaining PACMAD subfamilies. Alternative topologies were recovered in trees from partitions that included alignment gaps. This suggests that ambiguities in aligning these uncertain regions might introduce a false signal. 
Resolution of these and other critical branch points in the phylogeny of Poaceae will help to better understand the selective forces that drove the radiation of the BOP and PACMAD clades comprising more than 99.9% of grass diversity.
Affiliation(s)
- Jeffery M. Saarela
- Beaty Centre for Species Discovery and Botany Section, Canadian Museum of Nature, Ottawa, ON, Canada
- Sean V. Burke
- Plant Molecular and Bioinformatics Center, Biological Sciences, Northern Illinois University, DeKalb, IL, USA
- William P. Wysocki
- Center for Data Intensive Sciences, University of Chicago, Chicago, IL, USA
- Matthew D. Barrett
- Botanic Gardens and Parks Authority, Kings Park and Botanic Garden, West Perth, WA, Australia
- School of Biological Sciences, The University of Western Australia, Crawley, WA, Australia
- Lynn G. Clark
- Department of Ecology, Evolution and Organismal Biology, Iowa State University, Ames, IA, USA
- Paul M. Peterson
- Department of Botany, National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
- Robert J. Soreng
- Department of Botany, National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
- Maria S. Vorontsova
- Comparative Plant & Fungal Biology, Royal Botanic Gardens, Kew, Richmond, Surrey, UK
- Melvin R. Duvall
- Plant Molecular and Bioinformatics Center, Biological Sciences, Northern Illinois University, DeKalb, IL, USA
5. Levy Karin E, Shkedy D, Ashkenazy H, Cartwright RA, Pupko T. Inferring Rates and Length-Distributions of Indels Using Approximate Bayesian Computation. Genome Biol Evol 2018; 9:1280-1294. [PMID: 28453624 PMCID: PMC5438127 DOI: 10.1093/gbe/evx084] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Accepted: 04/25/2017] [Indexed: 02/07/2023]
Abstract
The most common evolutionary events at the molecular level are single-base substitutions, as well as insertions and deletions (indels) of short DNA segments. A large body of research has been devoted to developing probabilistic substitution models and to inferring their parameters using likelihood and Bayesian approaches. In contrast, relatively little has been done to model indel dynamics, probably due to the difficulty in writing explicit likelihood functions. Here, we contribute to the effort of modeling indel dynamics by presenting SpartaABC, an approximate Bayesian computation (ABC) approach to infer indel parameters from sequence data (either aligned or unaligned). SpartaABC circumvents the need to use an explicit likelihood function by extracting summary statistics from simulated sequences. First, summary statistics are extracted from the input sequence data. Second, SpartaABC samples indel parameters from a prior distribution and uses them to simulate sequences. Third, it computes summary statistics from the simulated sets of sequences. By computing a distance between the summary statistics extracted from the input and each simulation, SpartaABC can provide an approximation to the posterior distribution of indel parameters as well as point estimates. We study the performance of our methodology and show that it provides accurate estimates of indel parameters in simulations. We next demonstrate the utility of SpartaABC by studying the impact of alignment errors on the inference of positive selection. A C++ program implementing SpartaABC is freely available at http://spartaabc.tau.ac.il.
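The three steps described in the abstract follow the general ABC rejection scheme, which can be sketched with a toy model. Everything below is an illustrative assumption rather than SpartaABC itself: the simulator draws indel lengths from a geometric distribution with a single parameter `p`, the only summary statistic is the mean length, the prior is uniform on [0.05, 0.95], and the function names are hypothetical.

```python
import random
import statistics


def simulate_indel_lengths(p, n, rng):
    """Toy simulator (assumed): n indel lengths, geometric on 1, 2, 3, ..."""
    lengths = []
    for _ in range(n):
        length = 1
        while rng.random() > p:  # extend the indel with probability 1 - p
            length += 1
        lengths.append(length)
    return lengths


def summary(lengths):
    """Single summary statistic used by this sketch: the mean indel length."""
    return statistics.mean(lengths)


def abc_rejection(observed, n_draws=2000, n_keep=100, seed=0):
    """Return the n_keep prior draws whose simulations best match the data."""
    rng = random.Random(seed)
    s_obs = summary(observed)                     # step 1: observed statistic
    scored = []
    for _ in range(n_draws):
        p = rng.uniform(0.05, 0.95)               # step 2: sample from prior
        simulated = simulate_indel_lengths(p, len(observed), rng)
        s_sim = summary(simulated)                # step 3: simulated statistic
        scored.append((abs(s_sim - s_obs), p))    # distance between statistics
    scored.sort()
    return [p for _, p in scored[:n_keep]]        # approximate posterior sample


if __name__ == "__main__":
    data = simulate_indel_lengths(0.4, 200, random.Random(1))
    posterior = abc_rejection(data)
    print("point estimate of p:", round(statistics.mean(posterior), 3))
```

The retained draws approximate the posterior distribution, and their mean serves as a point estimate. A real application would use several summary statistics and a weighted distance between them.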
Affiliation(s)
- Eli Levy Karin
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Israel; Department of Molecular Biology & Ecology of Plants, George S. Wise Faculty of Life Sciences, Tel Aviv University, Israel
- Dafna Shkedy
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Israel
- Haim Ashkenazy
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Israel
- Reed A Cartwright
- The Biodesign Institute, Arizona State University, Tempe, AZ; School of Life Sciences, Arizona State University, Tempe, AZ
- Tal Pupko
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Israel
6. Kaiser T, Finstermeier K, Häntzsch M, Faucheux S, Kaase M, Eckmanns T, Bercker S, Kaisers UX, Lippmann N, Rodloff AC, Thiery J, Lübbert C. Stalking a lethal superbug by whole-genome sequencing and phylogenetics: Influence on unraveling a major hospital outbreak of carbapenem-resistant Klebsiella pneumoniae. Am J Infect Control 2018; 46:54-59. [PMID: 28935481 DOI: 10.1016/j.ajic.2017.07.022] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 05/21/2017] [Revised: 07/24/2017] [Accepted: 07/24/2017] [Indexed: 10/18/2022]
Abstract
BACKGROUND: From July 2010-April 2013, Leipzig University Hospital experienced the largest outbreak of a Klebsiella pneumoniae carbapenemase 2 (KPC-2)-producing Klebsiella pneumoniae (KPC-2-Kp) strain observed in Germany to date. After termination of the outbreak, we aimed to reconstruct transmission pathways by phylogenetics based on whole-genome sequencing (WGS).
METHODS: One hundred seventeen KPC-2-Kp isolates from 89 outbreak patients, 5 environmental KPC-2-Kp isolates, and 24 K pneumoniae strains not linked to the outbreak underwent WGS. Phylogenetic analysis was performed blinded to clinical data and based on the genomic reads.
RESULTS: A patient from Greece was confirmed as the source of the outbreak. Transmission pathways for 11 out of 89 patients (12.4%) were plausibly explained by descriptive epidemiology, applying strict definitions. Five of these and an additional 15 (ie, 20 out of 89 patients [22.5%]) were confirmed by phylogenetics. The rate of phylogenetically confirmed transmissions increased significantly from 8 out of 66 (12.1% for the time period before) to 12 out of 23 patients (52.2% for the time period after; P < .001) after implementation of systematic screening for KPC-2-Kp (33,623 screening investigations within 11 months). Using descriptive epidemiology, systematic screening showed no significant effect (7 out of 66 [10.6%] vs 4 out of 23 [17.4%] patients; P = .465). The phylogenetic analysis supported the assumption that a contaminated positioning pillow served as a reservoir for the persistence of KPC-2-Kp.
CONCLUSIONS: Effective phylogenetic identification of transmissions requires systematic microbiologic screening. Extensive screening and phylogenetic analysis based on WGS should be started as soon as possible in a bacterial outbreak situation.
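The proportions reported in the results can be rechecked from the counts given in the abstract. The abstract does not name the statistical test behind the two P values; the sketch below assumes a two-sided Fisher's exact test on the 2x2 tables of confirmed versus unconfirmed transmissions before and after screening. The function name and the table layout are choices made for this illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


def percent(k, n):
    """Percentage rounded to one decimal place, as reported in the abstract."""
    return round(100 * k / n, 1)


if __name__ == "__main__":
    # Rows: before / after screening; columns: confirmed / not confirmed.
    p_phylo = fisher_exact_two_sided(8, 66 - 8, 12, 23 - 12)
    p_descr = fisher_exact_two_sided(7, 66 - 7, 4, 23 - 4)
    print(percent(8, 66), percent(12, 23), p_phylo)
    print(percent(7, 66), percent(4, 23), p_descr)
```

Under this assumption, the first comparison falls below .001 and the second lands close to the reported .465, which is consistent with the figures in the abstract but does not establish which test the authors used.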
7. Brower AVZ. Statistical consistency and phylogenetic inference: a brief review. Cladistics 2017; 34:562-567. [DOI: 10.1111/cla.12216] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Accepted: 07/27/2017] [Indexed: 11/29/2022]
Affiliation(s)
- Andrew V. Z. Brower
- Evolution and Ecology Group, Department of Biology, Middle Tennessee State University, Murfreesboro, TN 37132, USA
8. Truszkowski J, Goldman N. Maximum Likelihood Phylogenetic Inference is Consistent on Multiple Sequence Alignments, with or without Gaps. Syst Biol 2016; 65:328-33. [PMID: 26615177 PMCID: PMC4748752 DOI: 10.1093/sysbio/syv089] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 08/05/2015] [Accepted: 11/19/2015] [Indexed: 11/14/2022]
Abstract
We prove that maximum likelihood phylogenetic inference is consistent on gapped multiple sequence alignments (MSAs) as long as substitution rates across each edge are greater than zero, under mild assumptions on the structure of the alignment. Under these assumptions, maximum likelihood will asymptotically recover the tree with edge lengths corresponding to the mean number of substitutions per site on each edge. This refutes Warnow's recent suggestion (Warnow 2012) that maximum likelihood phylogenetic inference might be statistically inconsistent when gaps are treated as missing data, even if the MSA is correct. We also derive a simple new proof of maximum likelihood consistency of ungapped alignments.
Affiliation(s)
- Jakub Truszkowski
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, UK; Cancer Research UK Cambridge Institute, University of Cambridge, Robinson Way, Cambridge CB2 0RE, UK
- Nick Goldman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, UK