1
Middlebrook EA, Katani R, Fair JM. OrthoPhyl - streamlining large-scale, orthology-based phylogenomic studies of bacteria at broad evolutionary scales. G3 (Bethesda) 2024; 14:jkae119. [PMID: 38839049 PMCID: PMC11304591 DOI: 10.1093/g3journal/jkae119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2024] [Revised: 05/15/2024] [Accepted: 05/29/2024] [Indexed: 06/07/2024]
Abstract
There are a staggering number of publicly available bacterial genome sequences (at writing, 2.0 million assemblies in NCBI's GenBank alone), and the deposition rate continues to increase. This wealth of data begs for phylogenetic analyses to place these sequences within an evolutionary context. A phylogenetic placement not only aids in taxonomic classification but informs the evolution of novel phenotypes, targets of selection, and horizontal gene transfer. Building trees from multi-gene codon alignments is a laborious task that requires bioinformatic expertise, rigorous curation of orthologs, and heavy computation. Compounding the problem is the lack of tools that can streamline these processes for building trees from large-scale genomic data. Here we present OrthoPhyl, which takes bacterial genome assemblies and reconstructs trees from whole genome codon alignments. The analysis pipeline can analyze an arbitrarily large number of input genomes (>1200 tested here) by identifying a diversity-spanning subset of assemblies and using these genomes to build gene models to infer orthologs in the full dataset. To illustrate the versatility of OrthoPhyl, we show three use cases: E. coli/Shigella, Brucella/Ochrobactrum and the order Rickettsiales. We compare trees generated with OrthoPhyl to trees generated with kSNP3 and GToTree along with published trees using alternative methods. We show that OrthoPhyl trees are consistent with other methods while incorporating more data, allowing for greater numbers of input genomes, and more flexibility of analysis.
Affiliation(s)
- Earl A Middlebrook
  - Genomics and Bioanalytics Group, Los Alamos National Laboratory, Mailstop M888, Los Alamos, NM 87545, USA
- Robab Katani
  - 401 Huck Life Sciences Building, Huck Institutes of Life Sciences, Pennsylvania State University, University Park, PA 16802, USA
- Jeanne M Fair
  - Genomics and Bioanalytics Group, Los Alamos National Laboratory, Mailstop M888, Los Alamos, NM 87545, USA
2
Rick JA, Brock CD, Lewanski AL, Golcher-Benavides J, Wagner CE. Reference Genome Choice and Filtering Thresholds Jointly Influence Phylogenomic Analyses. Syst Biol 2024; 73:76-101. [PMID: 37881861 DOI: 10.1093/sysbio/syad065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2022] [Revised: 09/20/2023] [Accepted: 10/20/2023] [Indexed: 10/27/2023]
Abstract
Molecular phylogenies are a cornerstone of modern comparative biology and are commonly employed to investigate a range of biological phenomena, such as diversification rates, patterns in trait evolution, biogeography, and community assembly. Recent work has demonstrated that significant biases may be introduced into downstream phylogenetic analyses from processing genomic data; however, it remains unclear whether there are interactions among bioinformatic parameters or biases introduced through the choice of reference genome for sequence alignment and variant calling. We address these knowledge gaps by employing a combination of simulated and empirical data sets to investigate the extent to which the choice of reference genome in upstream bioinformatic processing of genomic data influences phylogenetic inference, as well as the way that reference genome choice interacts with bioinformatic filtering choices and phylogenetic inference method. We demonstrate that more stringent minor allele filters bias inferred trees away from the true species tree topology, and that these biased trees tend to be more imbalanced and have a higher center of gravity than the true trees. We find the greatest topological accuracy when filtering sites for minor allele count (MAC) >3-4 in our 51-taxa data sets, while tree center of gravity was closest to the true value when filtering for sites with MAC >1-2. In contrast, filtering for missing data increased accuracy in the inferred topologies; however, this effect was small in comparison to the effect of minor allele filters and may be undesirable due to a subsequent mutation spectrum distortion. The bias introduced by these filters differs based on the reference genome used in short read alignment, providing further support that choosing a reference genome for alignment is an important bioinformatic decision with implications for downstream analyses. 
These results demonstrate that attributes of the study system and dataset (and their interaction) add important nuance for how best to assemble and filter short-read genomic data for phylogenetic inference.
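The two site filters discussed in this abstract, a minor allele count (MAC) threshold and a missing-data threshold, can be illustrated with a minimal Python sketch. The function names, the list-of-alleles site representation, and the default thresholds are assumptions made for illustration only; they are not the authors' pipeline, and suitable thresholds depend on the data set.

```python
from collections import Counter


def minor_allele_count(site):
    """Count of the second most common allele at one site.

    `site` is a list of allele strings, one per sampled allele copy;
    None marks a missing call (hypothetical representation).
    """
    counts = Counter(a for a in site if a is not None)
    if len(counts) < 2:
        return 0  # monomorphic or entirely missing site
    return counts.most_common(2)[1][1]


def filter_sites(sites, min_mac=3, max_missing=0.5):
    """Keep sites whose MAC is at least `min_mac` and whose fraction of
    missing calls is at most `max_missing` (both thresholds are assumed)."""
    kept = []
    for site in sites:
        missing = sum(a is None for a in site) / len(site)
        if minor_allele_count(site) >= min_mac and missing <= max_missing:
            kept.append(site)
    return kept


sites = [
    ["A", "A", "A", "G", "G", "G"],      # MAC 3, no missing data
    ["A", "A", "A", "A", "A", "G"],      # MAC 1 (singleton)
    ["A", None, None, None, "G", "G"],   # MAC 1, half missing
]
kept = filter_sites(sites, min_mac=3, max_missing=0.5)
```

Raising `min_mac` removes rare variants first, which is the kind of stringent minor allele filtering the abstract reports as biasing inferred trees.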
Affiliation(s)
- Jessica A Rick
  - School of Natural Resources & the Environment, University of Arizona, Tucson, AZ 85719, USA
- Chad D Brock
  - Department of Biological Sciences, Tarleton State University, Stephenville, TX 76401, USA
- Alexander L Lewanski
  - Department of Integrative Biology and W.K. Kellogg Biological Station, Michigan State University, East Lansing, MI 48824, USA
- Jimena Golcher-Benavides
  - Department of Natural Resource Ecology and Management, Iowa State University, Ames, IA 50011, USA
- Catherine E Wagner
  - Program in Ecology and Evolution, University of Wyoming, Laramie, WY 82071, USA
  - Department of Botany, University of Wyoming, Laramie, WY 82071, USA
3
Lozano-Fernandez J. A Practical Guide to Design and Assess a Phylogenomic Study. Genome Biol Evol 2022; 14:evac129. [PMID: 35946263 PMCID: PMC9452790 DOI: 10.1093/gbe/evac129] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Accepted: 08/03/2022] [Indexed: 11/13/2022]
Abstract
Over the last decade, molecular systematics has undergone a change of paradigm as high-throughput sequencing now makes it possible to reconstruct evolutionary relationships using genome-scale datasets. The advent of "big data" molecular phylogenetics provided a battery of new tools for biologists but simultaneously brought new methodological challenges. The increase in analytical complexity comes at the price of highly specific training in computational biology and molecular phylogenetics, resulting very often in a polarized accumulation of knowledge (technical on one side and biological on the other). Interpreting the robustness of genome-scale phylogenetic studies is not straightforward, particularly as new methodological developments have consistently shown that the general belief of "more genes, more robustness" often does not apply, and because there is a range of systematic errors that plague phylogenomic investigations. This is particularly problematic because phylogenomic studies are highly heterogeneous in their methodology, and best practices are often not clearly defined. The main aim of this article is to present what I consider as the ten most important points to take into consideration when planning a well-thought-out phylogenomic study and while evaluating the quality of published papers. The goal is to provide a practical step-by-step guide that can be easily followed by nonexperts and phylogenomic novices in order to assess the technical robustness of phylogenomic studies or improve the experimental design of a project.
Affiliation(s)
- Jesus Lozano-Fernandez
  - Department of Genetics, Microbiology and Statistics, Biodiversity Research Institute (IRBio), University of Barcelona, Avd. Diagonal 643, 08028 Barcelona, Spain
  - Institute of Evolutionary Biology (CSIC – Universitat Pompeu Fabra), Passeig Marítim de la Barceloneta 37-49, 08003 Barcelona, Spain
4
Smith BT, Mauck WM, Benz BW, Andersen MJ. Uneven Missing Data Skew Phylogenomic Relationships within the Lories and Lorikeets. Genome Biol Evol 2021; 12:1131-1147. [PMID: 32470111 PMCID: PMC7486955 DOI: 10.1093/gbe/evaa113] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Accepted: 05/26/2020] [Indexed: 01/21/2023]
Abstract
The resolution of the Tree of Life has accelerated with advances in DNA sequencing technology. To achieve dense taxon sampling, it is often necessary to obtain DNA from historical museum specimens to supplement modern genetic samples. However, DNA from historical material is generally degraded, which presents various challenges. In this study, we evaluated how the coverage at variant sites and missing data among historical and modern samples impacts phylogenomic inference. We explored these patterns in the brush-tongued parrots (lories and lorikeets) of Australasia by sampling ultraconserved elements in 105 taxa. Trees estimated with low coverage characters had several clades where relationships appeared to be influenced by whether the sample came from historical or modern specimens, which were not observed when more stringent filtering was applied. To assess if the topologies were affected by missing data, we performed an outlier analysis of sites and loci, and a data reduction approach where we excluded sites based on data completeness. Depending on the outlier test, 0.15% of total sites or 38% of loci were driving the topological differences among trees, and at these sites, historical samples had 10.9× more missing data than modern ones. In contrast, 70% data completeness was necessary to avoid spurious relationships. Predictive modeling found that outlier analysis scores were correlated with parsimony informative sites in the clades whose topologies changed the most by filtering. After accounting for biased loci and understanding the stability of relationships, we inferred a more robust phylogenetic hypothesis for lories and lorikeets.
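The data reduction approach mentioned above, excluding sites based on data completeness, can be sketched as a column filter on an alignment. This is an illustrative sketch only: the dictionary-based alignment, the set of characters treated as missing, and the function names are assumptions, and the 0.7 default simply echoes the 70% figure reported in the abstract.

```python
def column_completeness(column, missing="-?Nn"):
    """Fraction of taxa with a called base in one alignment column."""
    return sum(ch not in missing for ch in column) / len(column)


def filter_alignment(alignment, min_completeness=0.7):
    """Drop alignment columns whose completeness is below the threshold.

    `alignment` maps taxon name -> aligned sequence (all the same length).
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    keep = [
        i for i, col in enumerate(zip(*seqs))
        if column_completeness(col) >= min_completeness
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in zip(names, seqs)}


alignment = {
    "modern_1": "ACGT",
    "modern_2": "AC-T",
    "historical_1": "A-GT",
    "historical_2": "A-G?",
}
filtered = filter_alignment(alignment, min_completeness=0.7)
```

In this toy alignment the second column is only 50% complete and is removed, while the columns at 75% and 100% completeness are retained.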
Affiliation(s)
- Brian Tilston Smith
  - Department of Ornithology, American Museum of Natural History, New York, New York
- William M Mauck
  - Department of Ornithology, American Museum of Natural History, New York, New York
  - New York Genome Center, New York, New York
- Brett W Benz
  - Museum of Zoology and Department of Ecology and Evolutionary Biology, University of Michigan
- Michael J Andersen
  - Department of Biology and Museum of Southwestern Biology, University of New Mexico
5
Talavera G, Lukhtanov V, Pierce NE, Vila R. DNA barcodes combined with multi-locus data of representative taxa can generate reliable higher-level phylogenies. Syst Biol 2021; 71:382-395. [PMID: 34022059 PMCID: PMC8830075 DOI: 10.1093/sysbio/syab038] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 07/30/2019] [Revised: 05/13/2021] [Accepted: 05/25/2021] [Indexed: 12/04/2022]
Abstract
Taxa are frequently labeled incertae sedis when their placement is debated at ranks above the species level, such as their subgeneric, generic, or subtribal placement. This is a pervasive problem in groups with complex systematics due to difficulties in identifying suitable synapomorphies. In this study, we propose combining DNA barcodes with a multilocus backbone phylogeny in order to assign taxa to genus or other higher-level categories. This sampling strategy generates molecular matrices containing large amounts of missing data that are not distributed randomly: barcodes are sampled for all representatives, and additional markers are sampled only for a small percentage. We investigate the effects of the degree and randomness of missing data on phylogenetic accuracy using simulations for up to 100 markers in 1000-tip trees, as well as a real case: the subtribe Polyommatina (Lepidoptera: Lycaenidae), a large group including numerous species with unresolved taxonomy. Our simulation tests show that when a strategic and representative selection of species for higher-level categories has been made for multigene sequencing (approximately one per simulated genus), the addition of this multigene backbone DNA data for as few as 5–10% of the specimens in the total data set can produce high-quality phylogenies, comparable to those resulting from 100% multigene sampling. In contrast, trees based exclusively on barcodes performed poorly. We applied this approach to a 1365-specimen data set of Polyommatina (including ca. 80% of described species), with nearly 8% of representative species included in the multigene backbone and the remaining 92% represented only by mitochondrial COI barcodes. The resulting phylogeny highlighted potential misplacements, unrecognized major clades, and placements for incertae sedis taxa. We use this information to make systematic rearrangements within Polyommatina, and to describe two new genera. 
Finally, we propose a systematic workflow to assess higher-level taxonomy in hyperdiverse groups. This research identifies an additional, enhanced value of DNA barcodes for improvements in higher-level systematics using large data sets. [Birabiro; DNA barcoding; incertae sedis; Kipepeo; Lycaenidae; missing data; phylogenomic; phylogeny; Polyommatina; supermatrix; systematics; taxonomy]
Affiliation(s)
- Gerard Talavera
  - Institut Botànic de Barcelona (IBB, CSIC-Ajuntament de Barcelona), Passeig del Migdia s/n, 08038 Barcelona, Catalonia, Spain
  - Department of Organismic and Evolutionary Biology and Museum of Comparative Zoology, Harvard University, 26 Oxford Street, Cambridge, MA 02138, United States
- Vladimir Lukhtanov
  - Department of Karyosystematics, Zoological Institute of Russian Academy of Sciences, Universitetskaya nab. 1, 199034 St. Petersburg, Russia
- Naomi E Pierce
  - Department of Organismic and Evolutionary Biology and Museum of Comparative Zoology, Harvard University, 26 Oxford Street, Cambridge, MA 02138, United States
- Roger Vila
  - Institut de Biologia Evolutiva (CSIC-UPF), Passeig Marítim de la Barceloneta, 08003 Barcelona, Catalonia, Spain
6
Collins RA, Hrbek T. An In Silico Comparison of Protocols for Dated Phylogenomics. Syst Biol 2018; 67:633-650. [PMID: 29319797 DOI: 10.1093/sysbio/syx089] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 11/23/2015] [Accepted: 10/24/2017] [Indexed: 01/02/2023]
Abstract
In the age of genome-scale DNA sequencing, choice of molecular marker arguably remains an important decision in planning a phylogenetic study. Using published genomes from 23 primate species, we make a standardized comparison of four of the most frequently used protocols in phylogenomics, viz., targeted sequence-enrichment using ultraconserved element and exon-capture probes, and restriction-site-associated DNA sequencing (RADseq and ddRADseq). Here, we present a procedure to perform in silico extractions from genomes and create directly comparable data sets for each class of marker. We then compare these data sets in terms of both phylogenetic resolution and ability to consistently and precisely estimate clade ages using fossil-calibrated molecular-clock models. Furthermore, we were also able to directly compare these results to previously published data sets from Sanger-sequenced nuclear exons and mitochondrial genomes under the same analytical conditions. Our results show that, with the exception of the mitochondrial genome data set and the smallest ddRADseq data set, all data classes performed equally well for uncontroversial nodes; that is, they recovered the same well-supported topology. However, for one difficult-to-resolve node comprising a rapid diversification, we report well-supported but conflicting topologies among the marker classes, consistent with the mismodeling of gene tree heterogeneity as demonstrated by species tree analyses of single nucleotide polymorphisms. Likewise, clade age estimates showed consistent discrepancies between data sets under strict and relaxed clock models; for recent nodes, clade ages estimated by nuclear exon data sets were younger than those of the UCE, RADseq and mitochondrial data, but vice versa for the deepest nodes in the primate phylogeny. This observation is explained by temporal differences in phylogenetic informativeness (PI), with the data sets with strong PI peaks toward the present underestimating the deepest node ages. Finally, we conclude by emphasizing that while huge numbers of loci are probably not required for uncontroversial phylogenetic questions (for which practical considerations such as ease of data generation, sharing, and aggregating therefore become increasingly important), accurately modeling heterogeneous data remains as relevant as ever for the more recalcitrant problems.
Affiliation(s)
- Rupert A Collins
  - Laboratório de Evolução e Genética Animal, Department of Genetics, Federal University of Amazonas, Av. Rodrigo Otavio Ramos, 3000, Manaus, AM, 69077-000, Brazil
  - School of Biological Sciences, Life Sciences Building, University of Bristol, 24 Tyndall Ave, Bristol BS8 1TH, UK
- Tomas Hrbek
  - Laboratório de Evolução e Genética Animal, Department of Genetics, Federal University of Amazonas, Av. Rodrigo Otavio Ramos, 3000, Manaus, AM, 69077-000, Brazil
  - Department of Biology, 4102 LSB Brigham Young University, Provo, UT, 84602, USA
7
Saarela JM, Burke SV, Wysocki WP, Barrett MD, Clark LG, Craine JM, Peterson PM, Soreng RJ, Vorontsova MS, Duvall MR. A 250 plastome phylogeny of the grass family (Poaceae): topological support under different data partitions. PeerJ 2018; 6:e4299. [PMID: 29416954 PMCID: PMC5798404 DOI: 10.7717/peerj.4299] [Citation(s) in RCA: 76] [Impact Index Per Article: 12.7] [Received: 08/08/2017] [Accepted: 01/08/2018] [Indexed: 12/23/2022]
Abstract
The systematics of grasses has advanced through applications of plastome phylogenomics, although studies have been largely limited to subfamilies or other subgroups of Poaceae. Here we present a plastome phylogenomic analysis of 250 complete plastomes (179 genera) sampled from 44 of the 52 tribes of Poaceae. Plastome sequences were determined from high throughput sequencing libraries and the assemblies represent over 28.7 Mbases of sequence data. Phylogenetic signal was characterized in 14 partitions, including (1) complete plastomes; (2) protein coding regions; (3) noncoding regions; and (4) three loci commonly used in single and multi-gene studies of grasses. Each of the four main partitions was further refined, alternatively including or excluding positively selected codons and also the gaps introduced by the alignment. All 76 protein coding plastome loci were found to be predominantly under purifying selection, but specific codons were found to be under positive selection in 65 loci. The loci that have been widely used in multi-gene phylogenetic studies had among the highest proportions of positively selected codons, suggesting caution in the interpretation of these earlier results. Plastome phylogenomic analyses confirmed the backbone topology for Poaceae with maximum bootstrap support (BP). Among the 14 analyses, 82 clades out of 309 resolved were maximally supported in all trees. Analyses of newly sequenced plastomes were in agreement with current classifications. Five of seven partitions in which alignment gaps were removed retrieved Panicoideae as sister to the remaining PACMAD subfamilies. Alternative topologies were recovered in trees from partitions that included alignment gaps. This suggests that ambiguities in aligning these uncertain regions might introduce a false signal. 
Resolution of these and other critical branch points in the phylogeny of Poaceae will help to better understand the selective forces that drove the radiation of the BOP and PACMAD clades comprising more than 99.9% of grass diversity.
Affiliation(s)
- Jeffery M. Saarela
  - Beaty Centre for Species Discovery and Botany Section, Canadian Museum of Nature, Ottawa, ON, Canada
- Sean V. Burke
  - Plant Molecular and Bioinformatics Center, Biological Sciences, Northern Illinois University, DeKalb, IL, USA
- William P. Wysocki
  - Center for Data Intensive Sciences, University of Chicago, Chicago, IL, USA
- Matthew D. Barrett
  - Botanic Gardens and Parks Authority, Kings Park and Botanic Garden, West Perth, WA, Australia
  - School of Biological Sciences, The University of Western Australia, Crawley, WA, Australia
- Lynn G. Clark
  - Department of Ecology, Evolution and Organismal Biology, Iowa State University, Ames, IA, USA
- Paul M. Peterson
  - Department of Botany, National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
- Robert J. Soreng
  - Department of Botany, National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
- Maria S. Vorontsova
  - Comparative Plant & Fungal Biology, Royal Botanic Gardens, Kew, Richmond, Surrey, UK
- Melvin R. Duvall
  - Plant Molecular and Bioinformatics Center, Biological Sciences, Northern Illinois University, DeKalb, IL, USA
8
Tripp EA, Tsai YE, Zhuang Y, Dexter KG. RADseq dataset with 90% missing data fully resolves recent radiation of Petalidium (Acanthaceae) in the ultra-arid deserts of Namibia. Ecol Evol 2017; 7:7920-7936. [PMID: 29043045 PMCID: PMC5632676 DOI: 10.1002/ece3.3274] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.9] [Received: 04/25/2017] [Revised: 06/16/2017] [Accepted: 06/20/2017] [Indexed: 01/04/2023]
Abstract
Deserts, even those at tropical latitudes, often have strikingly low levels of plant diversity, particularly within genera. One remarkable exception to this pattern is the genus Petalidium (Acanthaceae), in which 37 of 40 named species occupy one of the driest environments on Earth, the Namib Desert of Namibia and neighboring Angola. To contribute to understanding this enigmatic diversity, we generated RADseq data for 47 accessions of Petalidium representing 22 species. We explored the impacts of 18 different combinations of assembly parameters in de novo assembly of the data across nine levels of missing data plus a best practice assembly using a reference Acanthaceae genome for a total of 171 sequence datasets assembled. RADseq data assembled at several thresholds of missing data, including 90% missing data, yielded phylogenetic hypotheses of Petalidium that were confidently and nearly fully resolved, which is notable given that divergence time analyses suggest a crown age for African species of 3.6-1.4 Ma. De novo assembly of our data yielded the most strongly supported and well-resolved topologies; in contrast, reference-based assembly performed poorly, perhaps due in part to moderate phylogenetic divergence between the reference genome, Ruellia speciosa, and the ingroup. Overall, we found that Petalidium, despite the harshness of the environment in which species occur, shows a net diversification rate (0.8-2.1 species per my) on par with those of diverse genera in tropical, Mediterranean, and alpine environments.
Affiliation(s)
- Erin A. Tripp
  - Department of Ecology & Evolutionary Biology, UCB 334, University of Colorado, Boulder, CO, USA
  - Museum of Natural History, UCB 350, University of Colorado, Boulder, CO, USA
- Yi‐Hsin Erica Tsai
  - Department of Ecology & Evolutionary Biology, UCB 334, University of Colorado, Boulder, CO, USA
  - Museum of Natural History, UCB 350, University of Colorado, Boulder, CO, USA
- Yongbin Zhuang
  - Department of Ecology & Evolutionary Biology, UCB 334, University of Colorado, Boulder, CO, USA
  - Museum of Natural History, UCB 350, University of Colorado, Boulder, CO, USA
- Kyle G. Dexter
  - School of GeoSciences, University of Edinburgh, Edinburgh, UK
  - Royal Botanic Garden Edinburgh, Edinburgh, UK
9
Bonato L, Orlando M, Zapparoli M, Fusco G, Bortolin F. New insights into Plutonium, one of the largest and least known European centipedes (Chilopoda): distribution, evolution and morphology. Zool J Linn Soc 2017. [DOI: 10.1093/zoolinnean/zlw026] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/05/2023]
10
Shen XX, Hittinger CT, Rokas A. Contentious relationships in phylogenomic studies can be driven by a handful of genes. Nat Ecol Evol 2017; 1:126. [PMID: 28812701 PMCID: PMC5560076 DOI: 10.1038/s41559-017-0126] [Citation(s) in RCA: 265] [Impact Index Per Article: 37.9] [Received: 09/25/2016] [Accepted: 03/01/2017] [Indexed: 01/05/2023]
Abstract
Phylogenomic studies have resolved countless branches of the tree of life, but remain strongly contradictory on certain, contentious relationships. Here, we use a maximum likelihood framework to quantify the distribution of phylogenetic signal among genes and sites for 17 contentious branches and 6 well-established control branches in plant, animal and fungal phylogenomic data matrices. We find that resolution in some of these 17 branches rests on a single gene or a few sites, and that removal of a single gene in concatenation analyses or a single site from every gene in coalescence-based analyses diminishes support and can alter the inferred topology. These results suggest that tiny subsets of very large data matrices drive the resolution of specific internodes, providing a dissection of the distribution of support and observed incongruence in phylogenomic analyses. We submit that quantifying the distribution of phylogenetic signal in phylogenomic data is essential for evaluating whether branches, especially contentious ones, are truly resolved. Finally, we offer one detailed example of such an evaluation for the controversy regarding the earliest-branching metazoan phylum, for which examination of the distributions of gene-wise and site-wise phylogenetic signal across eight data matrices consistently supports ctenophores as the sister group to all other metazoans.
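One way to picture the gene-wise signal described above is as a per-gene difference in log-likelihood between two competing topologies, followed by a check of whether a single gene accounts for the overall preference. The sketch below is a simplified illustration under that assumption; the function names and the log-likelihood values are hypothetical and do not come from the study.

```python
def gene_wise_signal(lnl_t1, lnl_t2):
    """Per-gene log-likelihood difference between topology T1 and T2.

    Positive values favor T1, negative values favor T2.
    """
    return {gene: lnl_t1[gene] - lnl_t2[gene] for gene in lnl_t1}


def summarize_signal(delta):
    """Report the summed signal and whether dropping the single strongest
    gene changes which topology the remaining genes favor."""
    total = sum(delta.values())
    strongest = max(delta, key=lambda gene: abs(delta[gene]))
    total_without = total - delta[strongest]
    return {
        "total": total,
        "strongest_gene": strongest,
        "total_without_strongest": total_without,
        "flips": (total > 0) != (total_without > 0),
    }


# Hypothetical per-gene log-likelihoods under two alternative topologies.
lnl_t1 = {"gene1": -1000.0, "gene2": -2000.0, "gene3": -1500.0}
lnl_t2 = {"gene1": -1020.0, "gene2": -1999.0, "gene3": -1499.0}

delta = gene_wise_signal(lnl_t1, lnl_t2)
summary = summarize_signal(delta)
```

In this toy input, one gene contributes all of the support for T1, so removing it reverses the summed preference. That is the pattern the abstract describes for some contentious branches.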
Affiliation(s)
- Xing-Xing Shen
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN 37235, USA
- Chris Todd Hittinger
  - Laboratory of Genetics, Genome Center of Wisconsin, DOE Great Lakes Bioenergy Research Center, Wisconsin Energy Institute, J. F. Crow Institute for the Study of Evolution, University of Wisconsin-Madison, Madison, WI 53706, USA
- Antonis Rokas
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN 37235, USA
11
Truszkowski J, Goldman N. Maximum Likelihood Phylogenetic Inference is Consistent on Multiple Sequence Alignments, with or without Gaps. Syst Biol 2016; 65:328-33. [PMID: 26615177 PMCID: PMC4748752 DOI: 10.1093/sysbio/syv089] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 08/05/2015] [Accepted: 11/19/2015] [Indexed: 11/14/2022]
Abstract
We prove that maximum likelihood phylogenetic inference is consistent on gapped multiple sequence alignments (MSAs) as long as substitution rates across each edge are greater than zero, under mild assumptions on the structure of the alignment. Under these assumptions, maximum likelihood will asymptotically recover the tree with edge lengths corresponding to the mean number of substitutions per site on each edge. This refutes Warnow's recent suggestion (Warnow 2012) that maximum likelihood phylogenetic inference might be statistically inconsistent when gaps are treated as missing data, even if the MSA is correct. We also derive a simple new proof of maximum likelihood consistency of ungapped alignments.
Affiliation(s)
- Jakub Truszkowski
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, UK
  - Cancer Research UK Cambridge Institute, University of Cambridge, Robinson Way, Cambridge CB2 0RE, UK
- Nick Goldman
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, UK
12
Chen MY, Liang D, Zhang P. Selecting Question-Specific Genes to Reduce Incongruence in Phylogenomics: A Case Study of Jawed Vertebrate Backbone Phylogeny. Syst Biol 2015; 64:1104-20. [PMID: 26276158 DOI: 10.1093/sysbio/syv059] [Citation(s) in RCA: 78] [Impact Index Per Article: 8.7] [Received: 01/12/2015] [Accepted: 08/10/2015] [Indexed: 11/13/2022]
Abstract
Incongruence between different phylogenomic analyses is the main challenge faced by phylogeneticists in the genomic era. To reduce incongruence, phylogenomic studies normally adopt some data filtering approaches, such as reducing missing data or using slowly evolving genes, to improve the signal quality of data. Here, we assembled a phylogenomic data set of 58 jawed vertebrate taxa and 4682 genes to investigate the backbone phylogeny of jawed vertebrates under both concatenation and coalescent-based frameworks. To evaluate the efficiency of extracting phylogenetic signals among different data filtering methods, we chose six highly intractable internodes within the backbone phylogeny of jawed vertebrates as our test questions. We found that our phylogenomic data set exhibits substantial conflicting signal among genes for these questions. Our analyses showed that non-specific data sets that are generated without bias toward specific questions are not sufficient to produce consistent results when there are several difficult nodes within a phylogeny. Moreover, phylogenetic accuracy based on non-specific data is considerably influenced by the size of data and the choice of tree inference methods. To address such incongruences, we selected genes that resolve a given internode but not the entire phylogeny. Notably, not only can this strategy yield correct relationships for the question, but it also reduces inconsistency associated with data sizes and inference methods. Our study highlights the importance of gene selection in phylogenomic analyses, suggesting that simply using a large amount of data cannot guarantee correct results. Constructing question-specific data sets may be more powerful for resolving problematic nodes.
Affiliation(s)
- Meng-Yun Chen
  - State Key Laboratory of Biocontrol, College of Ecology and Evolution, School of Life Sciences, Sun Yat-Sen University, Guangzhou 510006, China
- Dan Liang
  - State Key Laboratory of Biocontrol, College of Ecology and Evolution, School of Life Sciences, Sun Yat-Sen University, Guangzhou 510006, China
- Peng Zhang
  - State Key Laboratory of Biocontrol, College of Ecology and Evolution, School of Life Sciences, Sun Yat-Sen University, Guangzhou 510006, China
13
|
McTavish EJ, Steel M, Holder MT. Twisted trees and inconsistency of tree estimation when gaps are treated as missing data - The impact of model mis-specification in distance corrections. Mol Phylogenet Evol 2015; 93:289-95. [PMID: 26256643 DOI: 10.1016/j.ympev.2015.07.027] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 04/28/2015] [Revised: 07/09/2015] [Accepted: 07/21/2015] [Indexed: 10/23/2022]
Abstract
Statistically consistent estimation of phylogenetic trees or gene trees is possible if pairwise sequence dissimilarities can be converted to a set of distances that are proportional to the true evolutionary distances. Susko et al. (2004) reported some strikingly broad results about the forms of inconsistency in tree estimation that can arise if corrected distances are not proportional to the true distances. They showed that if the corrected distance is a concave function of the true distance, then inconsistency due to long branch attraction will occur. If these functions are convex, then two "long branch repulsion" trees will be preferred over the true tree - though these two incorrect trees are expected to be tied as the preferred tree. Here we extend their results, and demonstrate the existence of a tree shape (which we refer to as a "twisted Farris-zone" tree) for which a single incorrect tree topology will be guaranteed to be preferred if the corrected distance function is convex. We also report that the standard practice of treating gaps in sequence alignments as missing data is sufficient to produce non-linear corrected distance functions if the substitution process is not independent of the insertion/deletion process. Taken together, these results imply inconsistent tree inference under mild conditions. For example, if some positions in a sequence are constrained to be free of substitutions and insertion/deletion events while the remaining sites evolve with independent substitutions and insertion/deletion events, then the distances obtained by treating gaps as missing data can support an incorrect tree topology even given an unlimited amount of data.
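The concave case described in the abstract can be illustrated with a small numerical sketch. Everything in it is an assumption chosen for illustration rather than taken from the paper: the quartet ((A,B),(C,D)), the branch lengths, the saturating function f(d) = 1 - exp(-d) standing in for a concave corrected distance, and the use of the four-point criterion (pick the split with the smallest sum of within-pair distances) as the tree estimator.

```python
import math

# Hypothetical quartet ((A,B),(C,D)): A and C are on long branches,
# B, D and the internal branch are short. Values are illustrative only.
LONG, SHORT = 2.0, 0.1

# True path-length distances on that tree.
true_dist = {
    ("A", "B"): LONG + SHORT,          # A -> node1 -> B
    ("C", "D"): LONG + SHORT,          # C -> node2 -> D
    ("A", "C"): LONG + SHORT + LONG,   # crosses the internal branch
    ("B", "D"): SHORT + SHORT + SHORT,
    ("A", "D"): LONG + SHORT + SHORT,
    ("B", "C"): SHORT + SHORT + LONG,
}


def saturate(d):
    """An assumed concave 'corrected' distance: grows slower than d."""
    return 1.0 - math.exp(-d)


def best_split(dist):
    """Four-point criterion: return the split with the smallest pair sum."""
    sums = {
        "AB|CD": dist[("A", "B")] + dist[("C", "D")],
        "AC|BD": dist[("A", "C")] + dist[("B", "D")],
        "AD|BC": dist[("A", "D")] + dist[("B", "C")],
    }
    return min(sums, key=sums.get)


concave_dist = {pair: saturate(d) for pair, d in true_dist.items()}

print("true distances    ->", best_split(true_dist))
print("concave distances ->", best_split(concave_dist))
```

With exact distances the criterion recovers the generating split AB|CD. With the concave transform it selects AC|BD, grouping the two long branches together, which is the long branch attraction pattern the abstract attributes to concave distance functions. Because the input here is the exact distance matrix, the error does not shrink with more data.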
Affiliation(s)
- Emily Jane McTavish
- Heidelberg Institute for Theoretical Studies, Heidelberg, Germany; Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS, USA.
- Mike Steel
- Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand
- Mark T Holder
- Heidelberg Institute for Theoretical Studies, Heidelberg, Germany; Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS, USA
|
14
|
Naim S, Brown JK, Nibert ML. Genetic diversification of penaeid shrimp infectious myonecrosis virus between Indonesia and Brazil. Virus Res 2014; 189:97-105. [PMID: 24874195 PMCID: PMC7114510 DOI: 10.1016/j.virusres.2014.05.013] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 04/16/2014] [Revised: 05/15/2014] [Accepted: 05/16/2014] [Indexed: 11/26/2022]
Abstract
Infectious myonecrosis virus (IMNV) is a pathogen of penaeid shrimp, most notably the whiteleg shrimp Litopenaeus vannamei. First discovered in L. vannamei from Brazilian aquaculture farms in 2003, IMNV was additionally confirmed in L. vannamei from Indonesian farms in 2006 and has since been found in numerous provinces there. Only two complete sequences of IMNV strains have been reported to date, one strain from the Brazilian state of Piauí collected in 2003 and another from the Indonesian province of East Java collected in 2006. In this study, we determined the complete sequences of two additional Indonesian strains, one from Lampung province collected in 2011 and another from East Java province collected in 2012. We also determined partial sequences for six other strains to enhance phylogenetic comparisons, which have heretofore been limited by the small number of reported sequences, including only one for an Indonesian strain. The new results demonstrate clear genetic diversification of IMNV between Indonesia and Brazil, as well as within Indonesia. Analyses of conserved sequence motifs suggest a revised RNA pseudoknot prediction for ribosomal frameshifting.
Affiliation(s)
- Sidrotun Naim
- Department of Microbiology & Immunobiology, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115, USA; Center for Sustainable Aquaculture & Pathology Studies, Surya University, Banten 15810, Indonesia.
- Judith K Brown
- School of Plant Sciences, University of Arizona, 1140 E. South Campus Drive, Tucson, AZ 85721, USA.
- Max L Nibert
- Department of Microbiology & Immunobiology, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115, USA.
|
15
|
Bertels F, Silander OK, Pachkov M, Rainey PB, van Nimwegen E. Automated reconstruction of whole-genome phylogenies from short-sequence reads. Mol Biol Evol 2014; 31:1077-88. [PMID: 24600054 PMCID: PMC3995342 DOI: 10.1093/molbev/msu088] [Citation(s) in RCA: 324] [Impact Index Per Article: 32.4] [Indexed: 01/01/2023]
Abstract
Studies of microbial evolutionary dynamics are being transformed by the availability of affordable high-throughput sequencing technologies, which allow whole-genome sequencing of hundreds of related taxa in a single study. Reconstructing a phylogenetic tree of these taxa is generally a crucial step in any evolutionary analysis. Instead of constructing genome assemblies for all taxa, annotating these assemblies, and aligning orthologous genes, many recent studies 1) directly map raw sequencing reads to a single reference sequence, 2) extract single nucleotide polymorphisms (SNPs), and 3) infer the phylogenetic tree using maximum likelihood methods from the aligned SNP positions. However, here we show that, when using such methods to reconstruct phylogenies from sets of simulated sequences, both the exclusion of nonpolymorphic positions and the alignment to a single reference genome introduce systematic biases and errors in phylogeny reconstruction. To address these problems, we developed a new method that combines alignments from mappings to multiple reference sequences and show that this successfully removes biases from the reconstructed phylogenies. We implemented this method as a web server named REALPHY (Reference sequence Alignment-based Phylogeny builder), which fully automates phylogenetic reconstruction from raw sequencing reads.
Affiliation(s)
- Frederic Bertels
- Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Basel, Switzerland
|