1
Khodaei M, Edwards SV, Beerli P. Estimating Genome-wide Phylogenies Using Probabilistic Topic Modeling. bioRxiv: the preprint server for biology 2024:2023.12.20.572577. [PMID: 39605625 PMCID: PMC11601389 DOI: 10.1101/2023.12.20.572577] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
Methods for rapidly inferring the evolutionary history of species or populations with genome-wide data are progressing, but computational constraints still limit our abilities in this area. We developed an alignment-free method to infer genome-wide phylogenies and implemented it in the Python package TopicContml. The method uses probabilistic topic modeling (specifically, Latent Dirichlet Allocation or LDA) to extract 'topic' frequencies from k-mers, which are derived from multilocus DNA sequences. These extracted frequencies then serve as an input for the program Contml in the PHYLIP package, which is used to generate a species tree. We evaluated the performance of TopicContml on simulated datasets with gaps and three biological datasets: (1) 14 DNA sequence loci from two Australian bird species distributed across nine populations, (2) 5162 loci from 80 mammal species, and (3) raw, unaligned, non-orthologous PacBio sequences from 12 bird species. Our empirical results and simulated data suggest that our method is efficient and statistically robust. We also assessed the uncertainty of the estimated relationships among clades using a bootstrap procedure.
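As an illustration of the first step described in this abstract, the sketch below turns DNA sequences into a k-mer count matrix of the kind a topic model such as LDA takes as input. It is a minimal sketch under stated assumptions, not the TopicContml implementation: the function names, the toy sequences, and the choice to skip k-mers containing gap or ambiguity characters are all hypothetical, and the LDA and Contml steps are not shown.

```python
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Return all overlapping k-mers of seq that contain only A, C, G, T."""
    seq = seq.upper()
    words = (seq[i:i + k] for i in range(len(seq) - k + 1))
    # Assumption: words touching a gap or ambiguity code are simply skipped.
    return [w for w in words if set(w) <= set("ACGT")]


def kmer_count_matrix(sequences, k):
    """Build a document-term matrix: one row per taxon, one column per k-mer."""
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    matrix = {}
    for name, seq in sequences.items():
        counts = Counter(kmers(seq, k))
        matrix[name] = [counts[w] for w in vocabulary]
    return vocabulary, matrix


# Hypothetical toy sequences, not data from the paper.
toy = {"pop_A": "ACGTAC-GT", "pop_B": "ACGTTTGT"}
vocab, rows = kmer_count_matrix(toy, k=2)
```

Each row of `rows` plays the role of a "document" whose "words" are k-mers; a topic model would then reduce each row to a short vector of topic frequencies.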
2
Zhao J, Han J, Lin YW, Zhu Y, Aichem M, Garkov D, Bergen PJ, Nang SC, Ye JZ, Zhou T, Velkov T, Song J, Schreiber F, Li J. PhageGE: an interactive web platform for exploratory analysis and visualization of bacteriophage genomes. Gigascience 2024; 13:giae074. [PMID: 39320317 PMCID: PMC11423353 DOI: 10.1093/gigascience/giae074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2024] [Revised: 06/29/2024] [Accepted: 09/02/2024] [Indexed: 09/26/2024]
Abstract
BACKGROUND Antimicrobial resistance is a serious threat to global health. Due to the stagnant antibiotic discovery pipeline, bacteriophages (phages) have been proposed as an alternative therapy for the treatment of infections caused by multidrug-resistant pathogens. Genomic features play an important role in phage pharmacology. However, our knowledge of phage genomics is sparse, and the use of existing bioinformatic pipelines and tools requires considerable bioinformatic expertise. These challenges have substantially limited the clinical translation of phage therapy. FINDINGS We have developed PhageGE (Phage Genome Explorer), a user-friendly graphical interface application for the interactive analysis of phage genomes. PhageGE enables users to perform key analyses, including phylogenetic analysis, visualization of phylogenetic trees, prediction of phage life cycle, and comparative analysis of phage genome annotations. The new R Shiny web server, PhageGE, integrates existing R packages and combines them with several newly developed functions to facilitate these analyses. Additionally, the web server provides interactive visualization capabilities and allows users to directly export publication-quality images. CONCLUSIONS PhageGE is a valuable tool that simplifies the analysis of phage genome data and may expedite the development and clinical translation of phage therapy. PhageGE is publicly available at https://jason-zhao.shinyapps.io/PhageGE_Update/.
Affiliation(s)
- Jinxin Zhao
  - Infection Program and Department of Microbiology, Biomedicine Discovery Institute, Monash University, Clayton 3800, Australia
  - Monash Biomedicine Discovery Institute-Wenzhou Medical University Alliance in Clinical and Experimental Biomedicine, Monash University, Clayton 3800, Australia
- Jiru Han
  - Population Health and Immunity Division, The Walter and Eliza Hall Institute of Medical Research, Parkville 3052, Australia
- Yu-Wei Lin
  - Infection Program and Department of Microbiology, Biomedicine Discovery Institute, Monash University, Clayton 3800, Australia
- Yan Zhu
  - Infection Program and Department of Microbiology, Biomedicine Discovery Institute, Monash University, Clayton 3800, Australia
  - Systems Biology Center, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
- Michael Aichem
  - Department of Computer and Information Science, University of Konstanz, Konstanz 78457, Germany
- Dimitar Garkov
  - Department of Computer and Information Science, University of Konstanz, Konstanz 78457, Germany
- Phillip J Bergen
  - Infection Program and Department of Microbiology, Biomedicine Discovery Institute, Monash University, Clayton 3800, Australia
- Sue C Nang
  - Infection Program and Department of Microbiology, Biomedicine Discovery Institute, Monash University, Clayton 3800, Australia
- Jian-Zhong Ye
  - Key Laboratory of Clinical Laboratory Diagnosis and Translational Research of Zhejiang Province, Department of Clinical Laboratory, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325015, China
  - Wenzhou Medical University–Monash Biomedicine Discovery Institute Alliance in Clinical and Experimental Biomedicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325015, China
- Tieli Zhou
  - Key Laboratory of Clinical Laboratory Diagnosis and Translational Research of Zhejiang Province, Department of Clinical Laboratory, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325015, China
  - Wenzhou Medical University–Monash Biomedicine Discovery Institute Alliance in Clinical and Experimental Biomedicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325015, China
- Tony Velkov
  - Department of Pharmacology, Biomedicine Discovery Institute, Monash University, Clayton 3800, Australia
- Jiangning Song
  - Monash Biomedicine Discovery Institute-Wenzhou Medical University Alliance in Clinical and Experimental Biomedicine, Monash University, Clayton 3800, Australia
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Clayton 3800, Australia
- Falk Schreiber
  - Department of Computer and Information Science, University of Konstanz, Konstanz 78457, Germany
  - Faculty of Information Technology, Monash University, Clayton 3800, Australia
- Jian Li
  - Infection Program and Department of Microbiology, Biomedicine Discovery Institute, Monash University, Clayton 3800, Australia
  - Monash Biomedicine Discovery Institute-Wenzhou Medical University Alliance in Clinical and Experimental Biomedicine, Monash University, Clayton 3800, Australia
3
Bhandari M, Poelstra JW, Kauffman M, Varghese B, Helmy YA, Scaria J, Rajashekara G. Genomic Diversity, Antimicrobial Resistance, Plasmidome, and Virulence Profiles of Salmonella Isolated from Small Specialty Crop Farms Revealed by Whole-Genome Sequencing. Antibiotics (Basel) 2023; 12:1637. [PMID: 37998839 PMCID: PMC10668983 DOI: 10.3390/antibiotics12111637] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/14/2023] [Revised: 11/10/2023] [Accepted: 11/15/2023] [Indexed: 11/25/2023]
Abstract
Salmonella is the leading cause of death associated with foodborne illnesses in the USA. Difficulty in treating human salmonellosis is attributed to the development of antimicrobial resistance and the pathogenicity of Salmonella strains. Therefore, it is important to study the genetic landscape of Salmonella, such as the diversity, plasmids, and presence of antimicrobial resistance genes (AMRs) and virulence genes. To this end, we isolated Salmonella from environmental samples from small specialty crop farms (SSCFs) in Northeast Ohio from 2016 to 2021; 80 Salmonella isolates from 29 Salmonella-positive samples were subjected to whole-genome sequencing (WGS). In silico serotyping revealed the presence of 15 serotypes. AMR genes were detected in 15% of the samples, with 75% exhibiting phenotypic and genotypic multidrug resistance (MDR). Plasmid analysis demonstrated the presence of nine different types of plasmids, and 75% of AMR genes were located on plasmids. Interestingly, five Salmonella Newport isolates and one Salmonella Dublin isolate carried the ACSSuT gene cassette on a plasmid, which confers resistance to ampicillin, chloramphenicol, streptomycin, sulfonamide, and tetracycline. Overall, our results show that SSCFs are a potential reservoir of Salmonella with MDR genes. Thus, regular monitoring is needed to prevent the transmission of MDR Salmonella from SSCFs to humans.
Affiliation(s)
- Menuka Bhandari
  - Center for Food Animal Health, Department of Animal Sciences, College of Food, Agricultural, and Environmental Sciences, The Ohio State University, Wooster, OH 44691, USA
- Jelmer W. Poelstra
  - Molecular and Cellular Imaging Center, College of Food, Agricultural, and Environmental Sciences, The Ohio State University, Wooster, OH 44691, USA
- Michael Kauffman
  - Center for Food Animal Health, Department of Animal Sciences, College of Food, Agricultural, and Environmental Sciences, The Ohio State University, Wooster, OH 44691, USA
- Binta Varghese
  - Department of Veterinary Pathobiology, Oklahoma State University, Stillwater, OK 74074, USA
- Yosra A. Helmy
  - Department of Veterinary Science, Martin-Gatton College of Agriculture, Food and Environment, University of Kentucky, Lexington, KY 40546, USA
- Joy Scaria
  - Department of Veterinary Pathobiology, Oklahoma State University, Stillwater, OK 74074, USA
- Gireesh Rajashekara
  - Center for Food Animal Health, Department of Animal Sciences, College of Food, Agricultural, and Environmental Sciences, The Ohio State University, Wooster, OH 44691, USA
4
Van Etten J, Stephens TG, Bhattacharya D. A k-mer-Based Approach for Phylogenetic Classification of Taxa in Environmental Genomic Data. Syst Biol 2023; 72:1101-1118. [PMID: 37314057 DOI: 10.1093/sysbio/syad037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Revised: 03/20/2023] [Accepted: 06/12/2023] [Indexed: 06/15/2023]
Abstract
In the age of genome sequencing, whole-genome data is readily and frequently generated, leading to a wealth of new information that can be used to advance various fields of research. New approaches, such as alignment-free phylogenetic methods that utilize k-mer-based distance scoring, are becoming increasingly popular given their ability to rapidly generate phylogenetic information from whole-genome data. However, these methods have not yet been tested using environmental data, which often tends to be highly fragmented and incomplete. Here, we compare the results of one alignment-free approach (which utilizes the D2 statistic) to traditional multi-gene maximum likelihood trees in 3 algal groups that have high-quality genome data available. In addition, we simulate lower-quality, fragmented genome data using these algae to test method robustness to genome quality and completeness. Finally, we apply the alignment-free approach to environmental metagenome assembled genome data of unclassified Saccharibacteria and Trebouxiophyte algae, and single-cell amplified data from uncultured marine stramenopiles to demonstrate its utility with real datasets. We find that in all instances, the alignment-free method produces phylogenies that are comparable to, and often more informative than, those created using the traditional multi-gene approach. The k-mer-based method performs well even when there are significant missing data that include marker genes traditionally used for tree reconstruction. Our results demonstrate the value of alignment-free approaches for classifying novel, often cryptic or rare, species that may not be culturable or are difficult to access using single-cell methods, but fill important gaps in the tree of life.
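For readers unfamiliar with the D2 statistic mentioned in this abstract, the following sketch computes the basic, unnormalized D2 score: the sum, over all k-mers, of the product of that k-mer's counts in the two sequences. The abstract does not say which D2 variant the authors used, so this is only an illustration of the underlying idea; the function names are hypothetical, and normalized variants and the conversion of scores into tree distances are not shown.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_x, seq_y, k):
    """Plain D2: sum over shared k-mers of the product of their counts."""
    cx, cy = kmer_counts(seq_x, k), kmer_counts(seq_y, k)
    # k-mers absent from either sequence contribute zero, so only shared ones matter.
    return sum(n * cy[w] for w, n in cx.items() if w in cy)


# Hypothetical toy input: identical sequences share all three 2-mers once each.
score = d2("ACGT", "ACGT", k=2)
```

Because the score only needs k-mer counts, it can be computed on fragmented or incomplete assemblies without any alignment step, which is the property the abstract relies on.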
Affiliation(s)
- Julia Van Etten
  - Graduate Program in Ecology and Evolution, Rutgers, The State University of New Jersey, 14 College Farm Road, New Brunswick, NJ 08901, USA
- Timothy G Stephens
  - Department of Biochemistry and Microbiology, Rutgers, The State University of New Jersey, 59 Dudley Road, New Brunswick, NJ 08901, USA
- Debashish Bhattacharya
  - Department of Biochemistry and Microbiology, Rutgers, The State University of New Jersey, 59 Dudley Road, New Brunswick, NJ 08901, USA
5
King KM, Rajadhyaksha EV, Tobey IG, Van Doorslaer K. Synonymous nucleotide changes drive papillomavirus evolution. Tumour Virus Res 2022; 14:200248. [PMID: 36265836 PMCID: PMC9589209 DOI: 10.1016/j.tvr.2022.200248] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/29/2022] [Revised: 10/11/2022] [Accepted: 10/12/2022] [Indexed: 11/06/2022]
Abstract
Papillomaviruses have been evolving alongside their hosts for at least 450 million years. This review will discuss some of the insights gained into the evolution of this diverse family of viruses. Papillomavirus evolution is constrained by pervasive purifying selection to maximize viral fitness. Yet these viruses need to adapt to changes in their environment, e.g., the host immune system. It has long been known that these viruses evolved a codon usage that doesn't match the infected host. Here we discuss how papillomavirus genomes evolve by acquiring synonymous changes that allow the virus to avoid detection by the host innate immune system without changing the encoded proteins and incurring the associated fitness loss. We discuss the implications of studying viral evolution, lifecycle, and cancer progression.
Affiliation(s)
- Kelly M King
  - School of Animal and Comparative Biomedical Sciences, University of Arizona, Tucson, AZ, USA
- Esha Vikram Rajadhyaksha
  - School of Animal and Comparative Biomedical Sciences, University of Arizona, Tucson, AZ, USA; Department of Physiology and Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA
- Isabelle G Tobey
  - Cancer Biology Graduate Interdisciplinary Program, University of Arizona, Tucson, AZ, USA
- Koenraad Van Doorslaer
  - School of Animal and Comparative Biomedical Sciences, University of Arizona, Tucson, AZ, USA; Cancer Biology Graduate Interdisciplinary Program, University of Arizona, Tucson, AZ, USA; The BIO5 Institute, The Department of Immunobiology, Genetics Graduate Interdisciplinary Program, UA Cancer Center, University of Arizona Tucson, Arizona, USA.
6
Lo R, Dougan KE, Chen Y, Shah S, Bhattacharya D, Chan CX. Alignment-Free Analysis of Whole-Genome Sequences From Symbiodiniaceae Reveals Different Phylogenetic Signals in Distinct Regions. Frontiers in Plant Science 2022; 13:815714. [PMID: 35557718 PMCID: PMC9087856 DOI: 10.3389/fpls.2022.815714] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/15/2021] [Accepted: 04/04/2022] [Indexed: 05/24/2023]
Abstract
Dinoflagellates of the family Symbiodiniaceae are predominantly essential symbionts of corals and other marine organisms. Recent research reveals extensive genome sequence divergence among Symbiodiniaceae taxa and high phylogenetic diversity hidden behind subtly different cell morphologies. Using an alignment-free phylogenetic approach based on sub-sequences of fixed length k (i.e. k-mers), we assessed the phylogenetic signal among whole-genome sequences from 16 Symbiodiniaceae taxa (including the genera of Symbiodinium, Breviolum, Cladocopium, Durusdinium and Fugacium) and two strains of Polarella glacialis as outgroup. Based on phylogenetic trees inferred from k-mers in distinct genomic regions (i.e. repeat-masked genome sequences, protein-coding sequences, introns and repeats) and in protein sequences, the phylogenetic signal associated with protein-coding DNA and the encoded amino acids is largely consistent with the Symbiodiniaceae phylogeny based on established markers, such as large subunit rRNA. The other genome sequences (introns and repeats) exhibit distinct phylogenetic signals, supporting the expected differential evolutionary pressure acting on these regions. Our analysis of conserved core k-mers revealed the prevalence of conserved k-mers (>95% core 23-mers among all 18 genomes) in annotated repeats and non-genic regions of the genomes. We observed 180 distinct repeat types that are significantly enriched in genomes of the symbiotic versus free-living Symbiodinium taxa, suggesting an enhanced activity of transposable elements linked to the symbiotic lifestyle. We provide evidence that representation of alignment-free phylogenies as dynamic networks enhances the ability to generate new hypotheses about genome evolution in Symbiodiniaceae. 
These results demonstrate the potential of alignment-free phylogenetic methods as a scalable approach for inferring comprehensive, unbiased whole-genome phylogenies of dinoflagellates and more broadly of microbial eukaryotes.
Affiliation(s)
- Rosalyn Lo
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Katherine E. Dougan
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Yibi Chen
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Sarah Shah
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Debashish Bhattacharya
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, United States
- Cheong Xin Chan
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
7
Was the Last Bacterial Common Ancestor a Monoderm after All? Genes (Basel) 2022; 13:genes13020376. [PMID: 35205421 PMCID: PMC8871954 DOI: 10.3390/genes13020376] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/31/2021] [Revised: 02/09/2022] [Accepted: 02/15/2022] [Indexed: 12/20/2022]
Abstract
The very nature of the last bacterial common ancestor (LBCA), in particular the characteristics of its cell wall, is a critical issue to understand the evolution of life on earth. Although knowledge of the relationships between bacterial phyla has made progress with the advent of phylogenomics, many questions remain, including on the appearance or disappearance of the outer membrane of diderm bacteria (also called Gram-negative bacteria). The phylogenetic transition between monoderm (Gram-positive bacteria) and diderm bacteria, and the associated peptidoglycan expansion or reduction, requires clarification. Herein, using a phylogenomic tree of cultivated and characterized bacteria as an evolutionary framework and a literature review of their cell-wall characteristics, we used Bayesian ancestral state reconstruction to infer the cell-wall architecture of the LBCA. With the same phylogenomic tree, we further revisited the evolution of the division and cell-wall synthesis (dcw) gene cluster using homology- and model-based methods. Finally, extensive similarity searches were carried out to determine the phylogenetic distribution of the genes involved with the biosynthesis of the outer membrane in diderm bacteria. Quite unexpectedly, our analyses suggest that all cultivated and characterized bacteria might have evolved from a common ancestor with a monoderm cell-wall architecture. If true, this would indicate that the appearance of the outer membrane was not a unique event and that selective forces have led to the repeated adoption of such an architecture. Due to the lack of phenotypic information, our methodology cannot be applied to all extant bacteria. Consequently, our conclusion might change once enough information is made available to allow the use of an even more diverse organism selection.
8
Dougan KE, González-Pech RA, Stephens TG, Shah S, Chen Y, Ragan MA, Bhattacharya D, Chan CX. Genome-powered classification of microbial eukaryotes: focus on coral algal symbionts. Trends Microbiol 2022; 30:831-840. [DOI: 10.1016/j.tim.2022.02.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/10/2021] [Revised: 01/20/2022] [Accepted: 02/01/2022] [Indexed: 12/20/2022]
9
Cattaneo G, Ferraro Petrillo U, Giancarlo R, Palini F, Romualdi C. The power of word-frequency-based alignment-free functions: a comprehensive large-scale experimental analysis. Bioinformatics 2022; 38:925-932. [PMID: 34718420 DOI: 10.1093/bioinformatics/btab747] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/02/2021] [Revised: 10/07/2021] [Accepted: 10/26/2021] [Indexed: 02/03/2023]
Abstract
MOTIVATION Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of false positive rate (Type I error). However, assessment of their power, i.e. their ability to identify true similarity, has been limited to some members of the D2 family. The corresponding experimental studies have concentrated on short sequences, a scenario no longer adequate for current applications, where sequence lengths may vary considerably. Such a State of the Art is methodologically problematic, since information regarding a key feature such as power is either missing or limited. RESULTS By concentrating on a representative set of word-frequency-based AF functions, we perform the first coherent and uniform evaluation of the power, involving also Type I error for completeness. Two alternative models of important genomic features (CIS Regulatory Modules and Horizontal Gene Transfer), a wide range of sequence lengths from a few thousand to millions, and different values of k have been used. As a result, we provide a characterization of those AF functions that is novel and informative. Indeed, we identify weak and strong points of each function considered, which may be used as a guide to choose one for analysis tasks. Remarkably, of the 15 functions that we have considered, only four stand out, with small differences between small and short sequence length scenarios. Finally, to encourage the use of our methodology for validation of future AF functions, the Big Data platform supporting it is public. AVAILABILITY AND IMPLEMENTATION The software is available at: https://github.com/pipp8/power_statistics. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Giuseppe Cattaneo
  - Dipartimento di Informatica, Università di Salerno, Fisciano, SA 84084, Italy
- Raffaele Giancarlo
  - Dipartimento di Matematica ed Informatica, Università di Palermo, 90133 Palermo, Italy
- Francesco Palini
  - Dipartimento di Scienze Statistiche, Università di Roma-La Sapienza, 00185 Rome, Italy
- Chiara Romualdi
  - Dipartimento di Biologia, Università di Padova, 35131 Padova, Italy
10
Bansal MS. Deciphering Microbial Gene Family Evolution Using Duplication-Transfer-Loss Reconciliation and RANGER-DTL. Methods Mol Biol 2022; 2569:233-252. [PMID: 36083451 DOI: 10.1007/978-1-0716-2691-7_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/24/2023]
Abstract
Phylogenetic reconciliation has emerged as a principled, highly effective technique for investigating the origin, spread, and evolutionary history of microbial gene families. Proper application of phylogenetic reconciliation requires a clear understanding of potential pitfalls and sources of error, and knowledge of the most effective reconciliation-based tools and protocols to use to maximize accuracy. In this book chapter, we provide a brief overview of Duplication-Transfer-Loss (DTL) reconciliation, the standard reconciliation model used to study microbial gene families, and provide a step-by-step computational protocol to maximize the accuracy of DTL reconciliation and minimize false-positive evolutionary inferences.
Affiliation(s)
- Mukul S Bansal
  - Department of Computer Science & Engineering, University of Connecticut, Storrs, CT, USA.
11
Lu YY, Bai J, Wang Y, Wang Y, Sun F. CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase. Bioinformatics 2021; 37:155-161. [PMID: 32766810 DOI: 10.1093/bioinformatics/btaa699] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/18/2019] [Revised: 03/11/2020] [Accepted: 07/28/2020] [Indexed: 01/02/2023]
Abstract
MOTIVATION Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. RESULTS We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10²-10⁴-fold reduction of storage space, CRAFT performs fast queries on gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. AVAILABILITY AND IMPLEMENTATION CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
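The storage problem raised in this abstract follows from simple arithmetic: a dense frequency vector over DNA k-mers needs one slot for each of the 4^k possible words. The sketch below works that out for a few values of k. It is illustrative only; the 4-byte counter size and the function name are assumptions, and it says nothing about how CRAFT's embedding is actually built.

```python
def dense_kmer_vector_bytes(k, bytes_per_count=4, alphabet_size=4):
    """Size in bytes of a dense count vector with one slot per possible k-mer."""
    return alphabet_size ** k * bytes_per_count


# Assumed 4-byte counters; sizes are per sequence, before any compression.
sizes_gib = {k: dense_kmer_vector_bytes(k) / 2**30 for k in (8, 12, 16)}
for k, gib in sizes_gib.items():
    print(f"k={k}: {gib} GiB")
```

Under these assumptions the vector grows fourfold with every extra base in k, which is why a compact representation becomes attractive once many genomes have to be archived and compared.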
Affiliation(s)
- Yang Young Lu
  - Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Jiaxing Bai
  - Department of Automation, Xiamen University, Xiamen 361000, China
- Yiwen Wang
  - Department of Automation, Xiamen University, Xiamen 361000, China
- Ying Wang
  - Department of Automation, Xiamen University, Xiamen 361000, China
  - Xiamen Key Lab. of Big Data Intelligent Analysis and Decision, Xiamen 361000, China
- Fengzhu Sun
  - Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
12
Jacobus AP, Stephens TG, Youssef P, González-Pech R, Ciccotosto-Camp MM, Dougan KE, Chen Y, Basso LC, Frazzon J, Chan CX, Gross J. Comparative Genomics Supports That Brazilian Bioethanol Saccharomyces cerevisiae Comprise a Unified Group of Domesticated Strains Related to Cachaça Spirit Yeasts. Front Microbiol 2021; 12:644089. [PMID: 33936002 PMCID: PMC8082247 DOI: 10.3389/fmicb.2021.644089] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 12/19/2020] [Accepted: 03/08/2021] [Indexed: 01/05/2023]
Abstract
Ethanol production from sugarcane is a key renewable fuel industry in Brazil. Major drivers of this alcoholic fermentation are Saccharomyces cerevisiae strains that originally were contaminants to the system and yet prevail in the industrial process. Here we present newly sequenced genomes (using Illumina short-read and PacBio long-read data) of two monosporic isolates (H3 and H4) of the S. cerevisiae PE-2, a predominant bioethanol strain in Brazil. The assembled genomes of H3 and H4, together with 42 draft genomes of sugarcane-fermenting (fuel ethanol plus cachaça) strains, were compared against those of the reference S288C and diverse S. cerevisiae. All genomes of bioethanol yeasts have amplified SNO2(3)/SNZ2(3) gene clusters for vitamin B1/B6 biosynthesis, and display ubiquitous presence of a particular family of SAM-dependent methyl transferases, rare in S. cerevisiae. Widespread amplifications of quinone oxidoreductases YCR102C/YLR460C/YNL134C, and the structural or punctual variations among aquaporins and components of the iron homeostasis system, likely represent adaptations to industrial fermentation. Interesting is the pervasive presence among the bioethanol/cachaça strains of a five-gene cluster (Region B) that is a known phylogenetic signature of European wine yeasts. Combining genomes of H3, H4, and 195 yeast strains, we comprehensively assessed whole-genome phylogeny of these taxa using an alignment-free approach. The 197-genome phylogeny substantiates that bioethanol yeasts are monophyletic and closely related to the cachaça and wine strains. Our results support the hypothesis that biofuel-producing yeasts in Brazil may have been co-opted from a pool of yeasts that were pre-adapted to alcoholic fermentation of sugarcane for the distillation of cachaça spirit, which historically is a much older industry than the large-scale fuel ethanol production.
Collapse
Affiliation(s)
- Ana Paula Jacobus
- Laboratory for Genomics and Experimental Evolution of Yeasts, Institute for Bioenergy Research, São Paulo State University, Rio Claro, Brazil
| | - Timothy G Stephens
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Pierre Youssef
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Raul González-Pech
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Michael M Ciccotosto-Camp
- Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia.,School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
| | - Katherine E Dougan
- Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia.,School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
| | - Yibi Chen
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia.,Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia.,School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
| | - Luiz Carlos Basso
- Biological Science Department, Escola Superior de Agricultura Luiz de Queiroz, University of São Paulo (USP), Piracicaba, Brazil
| | - Jeverson Frazzon
- Institute of Food Science and Technology, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil
| | - Cheong Xin Chan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia.,Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia.,School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
| | - Jeferson Gross
- Laboratory for Genomics and Experimental Evolution of Yeasts, Institute for Bioenergy Research, São Paulo State University, Rio Claro, Brazil
|
13
|
Bize A, Midoux C, Mariadassou M, Schbath S, Forterre P, Da Cunha V. Exploring short k-mer profiles in cells and mobile elements from Archaea highlights the major influence of both the ecological niche and evolutionary history. BMC Genomics 2021; 22:186. [PMID: 33726663 PMCID: PMC7962313 DOI: 10.1186/s12864-021-07471-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 10/13/2020] [Accepted: 02/24/2021] [Indexed: 12/16/2022]
Abstract
BACKGROUND K-mer-based methods have greatly advanced in recent years, largely driven by the realization of their biological significance and by the advent of next-generation sequencing. Their speed and their independence from the annotation process are major advantages. Their utility in the study of the mobilome has recently emerged and they seem a priori adapted to the patchy gene distribution and the lack of universal marker genes of viruses and plasmids. To provide a framework for the interpretation of results from k-mer based methods applied to archaea or their mobilome, we analyzed the 5-mer DNA profiles of close to 600 archaeal cells, viruses and plasmids. Archaea is one of the three domains of life. Archaea seem enriched in extremophiles and are associated with a high diversity of viral and plasmid families, many of which are specific to this domain. We explored the dataset structure by multivariate and statistical analyses, seeking to identify the underlying factors. RESULTS For cells, the 5-mer profiles were inconsistent with the phylogeny of archaea. At a finer taxonomic level, the influence of the taxonomy and the environmental constraints on 5-mer profiles was very strong. These two factors were interdependent to a significant extent, and the respective weights of their contributions varied according to the clade. A convergent adaptation was observed for the class Halobacteria, for which a strong 5-mer signature was identified. For mobile elements, coevolution with the host had a clear influence on their 5-mer profile. This enabled us to identify one previously known and one new case of recent host transfer based on the atypical composition of the mobile elements involved. Beyond the effect of coevolution, extrachromosomal elements strikingly retain the specific imprint of their own viral or plasmid taxonomic family in their 5-mer profile. 
CONCLUSION This specific imprint confirms that the evolution of extrachromosomal elements is driven by multiple parameters and is not restricted to host adaptation. In addition, we detected only recent host transfer events, suggesting the fast evolution of short k-mer profiles. This calls for caution when using k-mers for host prediction, metagenomic binning or phylogenetic reconstruction.
Affiliation(s)
- Ariane Bize
- Université Paris-Saclay, INRAE, PROSE, F-92761, Antony, France.
| | - Cédric Midoux
- Université Paris-Saclay, INRAE, PROSE, F-92761, Antony, France.,Université Paris-Saclay, INRAE, MaIAGE, F-78350, Jouy-en-Josas, France.,Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, F-78350, Jouy-en-Josas, France
| | - Mahendra Mariadassou
- Université Paris-Saclay, INRAE, MaIAGE, F-78350, Jouy-en-Josas, France.,Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, F-78350, Jouy-en-Josas, France
| | - Sophie Schbath
- Université Paris-Saclay, INRAE, MaIAGE, F-78350, Jouy-en-Josas, France.,Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, F-78350, Jouy-en-Josas, France
| | - Patrick Forterre
- Institut Pasteur, Unité de Virologie des Archées, Département de Microbiologie, 25 Rue du Docteur Roux, 75015, Paris, France.,Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France.
| | - Violette Da Cunha
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
|
14
|
Abstract
Inferring phylogenetic relationships among hundreds or thousands of microbial genomes is an increasingly common task. The conventional phylogenetic approach adopts multiple sequence alignment to compare gene-by-gene, concatenated multigene or whole-genome sequences, from which a phylogenetic tree would be inferred. These alignments follow the implicit assumption of full-length contiguity among homologous sequences. However, common events in microbial genome evolution (e.g., structural rearrangements and genetic recombination) violate this assumption. Moreover, aligning hundreds or thousands of sequences is computationally intensive and not scalable to the rate at which genome data are generated. Therefore, alignment-free methods present an attractive alternative strategy. Here we describe a scalable alignment-free strategy to infer phylogenetic relationships using complete genome sequences of bacteria and archaea, based on short subsequences of length k (k-mers). We describe how this strategy can be extended to infer evolutionary relationships beyond a tree-like structure, to better capture both vertical and lateral signals of microbial evolution.
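As a minimal sketch of the k-mer idea described in this abstract (not the authors' pipeline), the Python below collects the overlapping k-mers of each sequence and computes a pairwise Jaccard distance from the k-mer sets. The function names, the choice of Jaccard distance, and the toy sequences are illustrative assumptions; a real analysis would use whole genomes, a much larger k, and a dedicated tool.

```python
from itertools import combinations


def kmer_set(seq, k):
    """Return the set of overlapping k-mers (substrings of length k) in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(genomes, k):
    """Map each unordered pair of genome names to its k-mer Jaccard distance."""
    sets = {name: kmer_set(seq, k) for name, seq in genomes.items()}
    return {(x, y): jaccard_distance(sets[x], sets[y])
            for x, y in combinations(sorted(sets), 2)}


# Hypothetical toy sequences, for illustration only.
toy = {"A": "ACGTACGTAC", "B": "ACGTACGTAA", "C": "TTTTGGGGCC"}
dists = distance_matrix(toy, k=4)
```

The resulting pairwise distances could then be passed to a distance-based tree builder such as neighbor-joining; that step is omitted here.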
|
15
|
Pornputtapong N, Acheampong DA, Patumcharoenpol P, Jenjaroenpun P, Wongsurawat T, Jun SR, Yongkiettrakul S, Chokesajjawatee N, Nookaew I. KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis. Front Bioeng Biotechnol 2020; 8:556413. [PMID: 33072720 PMCID: PMC7538862 DOI: 10.3389/fbioe.2020.556413] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/28/2020] [Accepted: 08/24/2020] [Indexed: 12/22/2022]
Abstract
Genomic DNA is the best “unique identifier” for organisms. Alignment-free phylogenomic analysis, a simple, fast, and efficient method to compare genome sequences, relies on the distribution of short DNA sequences of a particular length, referred to as k-mers. The k-mer approach has been explored as a basis for sequence analysis applications, including assembly, phylogenetic tree inference, and classification. Although this approach is not novel, the choice of k-mer length is often rather arbitrary, even though it is a very important parameter for achieving the appropriate resolution for genome/sequence distances to infer biologically meaningful phylogenetic relationships. Thus, there is a need for a systematic approach to identify the appropriate k-mer length from whole-genome sequences. We present K-mer–length Iterative Selection for UNbiased Ecophylogenomics (KITSUNE), a tool for assessing the empirically optimal k-mer length of any given set of genomes of interest for phylogenomic analysis via a three-step approach based on (1) cumulative relative entropy (CRE), (2) average number of common features (ACF), and (3) observed common features (OCF). Using KITSUNE, we demonstrated the feasibility and reliability of these measurements to obtain empirically optimal k-mer lengths of 11, 17, and ∼34 from large genome datasets of viruses, bacteria, and fungi, respectively. Moreover, we demonstrated a feature of KITSUNE for accurate species identification for two de novo assembled bacterial genomes derived from error-prone long-read sequences, and for a published yeast genome. In addition, KITSUNE was used to identify the shortest species-specific k-mer that accurately identifies viruses. KITSUNE is freely available at https://github.com/natapol/kitsune.
Affiliation(s)
- Natapol Pornputtapong
- Department of Biochemistry and Microbiology, Faculty of Pharmaceutical Sciences, and Research Unit of DNA Barcoding of Thai Medicinal Plants, Chulalongkorn University, Bangkok, Thailand
| | - Daniel A Acheampong
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States.,Joint Graduate Program in Bioinformatics, University of Arkansas at Little Rock and University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Preecha Patumcharoenpol
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Piroon Jenjaroenpun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Thidathip Wongsurawat
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Se-Ran Jun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Suganya Yongkiettrakul
- National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency, Pathum Thani, Thailand
| | - Nipa Chokesajjawatee
- National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency, Pathum Thani, Thailand
| | - Intawat Nookaew
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
|
16
|
Brock DA, Noh S, Hubert AN, Haselkorn TS, DiSalvo S, Suess MK, Bradley AS, Tavakoli-Nezhad M, Geist KS, Queller DC, Strassmann JE. Endosymbiotic adaptations in three new bacterial species associated with Dictyostelium discoideum: Paraburkholderia agricolaris sp. nov., Paraburkholderia hayleyella sp. nov., and Paraburkholderia bonniea sp. nov. PeerJ 2020; 8:e9151. [PMID: 32509456 PMCID: PMC7247526 DOI: 10.7717/peerj.9151] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 02/20/2019] [Accepted: 04/17/2020] [Indexed: 12/24/2022]
Abstract
Here we give names to three new species of Paraburkholderia that can remain in symbiosis indefinitely in the spores of a soil dwelling eukaryote, Dictyostelium discoideum. The new species P. agricolaris sp. nov., P. hayleyella sp. nov., and P. bonniea sp. nov. are widespread across the eastern USA and were isolated as internal symbionts of wild-collected D. discoideum. We describe these sp. nov. using several approaches. Evidence that they are each a distinct new species comes from their phylogenetic position, average nucleotide identity, genome-genome distance, carbon usage, reduced length, cooler optimal growth temperature, metabolic tests, and their previously described ability to invade D. discoideum amoebae and form a symbiotic relationship. All three of these new species facilitate the prolonged carriage of food bacteria by D. discoideum, though they themselves are not food. Further studies of the interactions of these three new species with D. discoideum should be fruitful for understanding the ecology and evolution of symbioses.
Affiliation(s)
- Debra A. Brock
- Department of Biology, Washington University in St. Louis, St Louis, MO, United States of America
| | - Suegene Noh
- Department of Biology, Colby College, Waterville, ME, United States of America
| | - Alicia N.M. Hubert
- Department of Biology, Washington University in St. Louis, St Louis, MO, United States of America
| | - Tamara S. Haselkorn
- Department of Biology, University of Central Arkansas, Conway, AR, United States of America
| | - Susanne DiSalvo
- Department of Biological Sciences, Southern Illinois University at Edwardsville, Edwardsville, IL, United States of America
| | - Melanie K. Suess
- Department of Earth and Planetary Sciences, Washington University in St. Louis, St Louis, MO, United States of America
| | - Alexander S. Bradley
- Department of Earth and Planetary Sciences, Division of Biology and Biomedical Sciences, Washington University in St. Louis, St Louis, MO, United States of America
| | | | - Katherine S. Geist
- Department of Biology, Washington University in St. Louis, St Louis, MO, United States of America
| | - David C. Queller
- Department of Biology, Washington University in St. Louis, St Louis, MO, United States of America
| | - Joan E. Strassmann
- Department of Biology, Washington University in St. Louis, St Louis, MO, United States of America
|
17
|
Kaufer A, Stark D, Ellis J. A review of the systematics, species identification and diagnostics of the Trypanosomatidae using the maxicircle kinetoplast DNA: from past to present. Int J Parasitol 2020; 50:449-460. [PMID: 32333942 DOI: 10.1016/j.ijpara.2020.03.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 10/19/2019] [Revised: 02/28/2020] [Accepted: 03/09/2020] [Indexed: 11/25/2022]
Abstract
The Trypanosomatid family are a diverse and widespread group of protozoan parasites that belong to the higher order class Kinetoplastida. Containing predominantly monoxenous species (i.e. those having only a single host) that are confined to invertebrate hosts, this class is primarily known for its pathogenic dixenous species (i.e. those that have two hosts), serving as the aetiological agents of the important neglected tropical diseases including leishmaniasis, American trypanosomiasis (Chagas disease) and human African trypanosomiasis. Over the past few decades, a multitude of studies have investigated the diversity, classification and evolutionary history of the trypanosomatid family using different approaches and molecular targets. The mitochondrial-like DNA of the trypanosomatid parasites, also known as the kinetoplast, has emerged as a unique taxonomic and diagnostic target for exploring the evolution of this diverse group of parasitic eukaryotes. This review discusses recent advancements and important developments that have made a significant impact in the field of trypanosomatid systematics and diagnostics in recent years.
Affiliation(s)
- Alexa Kaufer
- School of Life Sciences, University of Technology Sydney, Ultimo, NSW 2007, Australia.
| | - Damien Stark
- Department of Microbiology, St Vincent's Hospital Sydney, Darlinghurst, NSW 2010, Australia
| | - John Ellis
- School of Life Sciences, University of Technology Sydney, Ultimo, NSW 2007, Australia
|
18
|
Kalendar R, Raskina O, Belyayev A, Schulman AH. Long Tandem Arrays of Cassandra Retroelements and Their Role in Genome Dynamics in Plants. Int J Mol Sci 2020; 21:ijms21082931. [PMID: 32331257 PMCID: PMC7215508 DOI: 10.3390/ijms21082931] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 02/16/2020] [Revised: 04/15/2020] [Accepted: 04/17/2020] [Indexed: 02/07/2023]
Abstract
Retrotransposable elements are widely distributed and diverse in eukaryotes. Their copy number increases through reverse-transcription-mediated propagation, while they can be lost through recombinational processes, generating genomic rearrangements. We previously identified extensive structurally uniform retrotransposon groups in which no member contains the gag, pol, or env internal domains. Because of the lack of protein-coding capacity, these groups are non-autonomous in replication, even if transcriptionally active. The Cassandra element belongs to the non-autonomous group called terminal-repeat retrotransposons in miniature (TRIM). It carries 5S RNA sequences with conserved RNA polymerase (pol) III promoters and terminators in its long terminal repeats (LTRs). Here, we identified multiple extended tandem arrays of Cassandra retrotransposons within different plant species, including ferns. At least 12 copies of repeated LTRs (as the tandem unit) and internal domain (as a spacer), giving a pattern that resembles the cellular 5S rRNA genes, were identified. A cytogenetic analysis revealed the specific chromosomal pattern of the Cassandra retrotransposon with prominent clustering at and around 5S rDNA loci. The secondary structure of the Cassandra retroelement RNA is predicted to form super-loops, in which the two LTRs are complementary to each other and can initiate local recombination, leading to the tandem arrays of Cassandra elements. The array structures are conserved for Cassandra retroelements of different species. We speculate that recombination events similar to those of 5S rRNA genes may explain the wide variation in Cassandra copy number. Likewise, the organization of 5S rRNA gene sequences is very variable in flowering plants; part of what is taken for 5S gene copy variation may be variation in Cassandra number. The role of the Cassandra 5S sequences remains to be established.
Affiliation(s)
- Ruslan Kalendar
- Department of Agricultural Sciences, University of Helsinki, P.O. Box 27 (Latokartanonkaari 5), FI-00014 Helsinki, Finland
- RSE “National Center for Biotechnology”, Korgalzhyn Highway 13/5, Nur-Sultan 010000, Kazakhstan
- Correspondence: (R.K.); (A.H.S.)
| | - Olga Raskina
- Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel;
| | - Alexander Belyayev
- Laboratory of Molecular Cytogenetics and Karyology, Institute of Botany of the ASCR, Zámek 1, CZ-252 43 Průhonice, Czech Republic;
| | - Alan H. Schulman
- Natural Resources Institute Finland (Luke), Latokartanonkaari 9, FI-00790 Helsinki, Finland
- Institute of Biotechnology and Viikki Plant Science Centre, University of Helsinki, P.O. Box 65, FI-00014 Helsinki, Finland
- Correspondence: (R.K.); (A.H.S.)
|
19
|
Dencker T, Leimeister CA, Gerth M, Bleidorn C, Snir S, Morgenstern B. 'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees. NAR Genom Bioinform 2020; 2:lqz013. [PMID: 33575565 PMCID: PMC7671388 DOI: 10.1093/nargab/lqz013] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 04/30/2019] [Revised: 07/31/2019] [Accepted: 10/13/2019] [Indexed: 02/03/2023]
Abstract
Word-based or 'alignment-free' methods for phylogeny inference have become popular in recent years. These methods are much faster than traditional, alignment-based approaches, but they are generally less accurate. Most alignment-free methods calculate 'pairwise' distances between nucleic-acid or protein sequences; these distance values can then be used as input for tree-reconstruction programs such as neighbor-joining. In this paper, we propose the first word-based phylogeny approach that is based on 'multiple' sequence comparison and 'maximum likelihood'. Our algorithm first samples small, gap-free alignments involving four taxa each. For each of these alignments, it then calculates a quartet tree and, finally, the program 'Quartet MaxCut' is used to infer a super tree for the full set of input taxa from the calculated quartet trees. Experimental results show that trees produced with our approach are of high quality.
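The abstract does not define a spaced-word match. In the alignment-free literature, a spaced word is usually specified by a binary pattern of match ("1") and don't-care ("0") positions, and two positions match when the characters at the "1" positions agree. The Python below is a minimal pairwise sketch under that assumption. The pattern, function names, and toy sequences are hypothetical, and the sketch does not reproduce Multi-SpaM's four-taxon sampling, quartet-tree inference, or supertree step.

```python
from collections import defaultdict


def spaced_words(seq, pattern):
    """Yield (start, word), where word keeps only characters at '1' positions."""
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    for start in range(len(seq) - len(pattern) + 1):
        yield start, "".join(seq[start + i] for i in keep)


def spaced_word_matches(seq_a, seq_b, pattern):
    """Return sorted (pos_a, pos_b) pairs whose spaced words are identical."""
    # Index every spaced word of seq_b by its start positions.
    index = defaultdict(list)
    for pos, word in spaced_words(seq_b, pattern):
        index[word].append(pos)
    # Look up each spaced word of seq_a in that index.
    return sorted((pa, pb)
                  for pa, word in spaced_words(seq_a, pattern)
                  for pb in index.get(word, []))


# Toy example: the sequences differ at the don't-care position of the first window.
matches = spaced_word_matches("ACGTAC", "ACCTAC", "1101")
```

Because the mismatch falls on a don't-care position, the first windows of the two toy sequences still count as a match, which is the property that makes spaced words more tolerant of substitutions than contiguous k-mers.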
Affiliation(s)
- Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Chris-André Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Michael Gerth
- Institute for Integrative Biology, University of Liverpool, Biosciences Building, Crown Street, L69 7ZB Liverpool, UK
| | - Christoph Bleidorn
- Department of Animal Evolution and Biodiversity, Universität Göttingen, Untere Karspüle 2, 37073 Göttingen, Germany
- Museo Nacional de Ciencias Naturales, Spanish National Research Council (CSIC), 28006 Madrid, Spain
| | - Sagi Snir
- Institute of Evolution, Department of Evolutionary and Environmental Biology, University of Haifa, 199 Aba Khoushy Ave. Mount Carmel, Haifa, Israel
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany
|
20
|
Panyukov VV, Kiselev SS, Ozoline ON. Unique k-mers as Strain-Specific Barcodes for Phylogenetic Analysis and Natural Microbiome Profiling. Int J Mol Sci 2020; 21:ijms21030944. [PMID: 32023871 PMCID: PMC7037511 DOI: 10.3390/ijms21030944] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/29/2019] [Revised: 01/21/2020] [Accepted: 01/28/2020] [Indexed: 02/07/2023]
Abstract
The need for a comparative analysis of natural metagenomes stimulated the development of new methods for their taxonomic profiling. Alignment-free approaches based on the search for marker k-mers turned out to be capable of identifying not only species, but also strains of microorganisms with known genomes. Here, we evaluated the ability of genus-specific k-mers to distinguish eight phylogroups of Escherichia coli (A, B1, C, E, D, F, G, B2) and assessed the presence of their unique 22-mers in clinical samples from microbiomes of four healthy people and four patients with Crohn's disease. We found that a phylogenetic tree inferred from the pairwise distance matrix for unique 18-mers and 22-mers of 124 genomes was fully consistent with the topology of the tree, obtained with concatenated aligned sequences of orthologous genes. Therefore, we propose strain-specific "barcodes" for rapid phylotyping. Using unique 22-mers for taxonomic analysis, we detected microbes of all groups in human microbiomes; however, their presence in the five samples was significantly different. Pointing to the intraspecies heterogeneity of E. coli in the natural microflora, this also indicates the feasibility of further studies of the role of this heterogeneity in maintaining population homeostasis.
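The barcode idea in this abstract can be illustrated with a small Python sketch: a k-mer is treated as unique to a genome if it occurs in that genome and in no other genome of the set, and reads are then assigned to genomes by the unique k-mers they contain. This is an assumed simplification rather than the authors' procedure; the function names and toy data are hypothetical, k is tiny for readability (the abstract uses 18-mers and 22-mers), and reverse complements and sequencing errors are ignored.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(genomes, k):
    """For each genome, return the k-mers found in it and in no other genome."""
    sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    # Count in how many genomes each k-mer occurs.
    occurrences = Counter(km for s in sets.values() for km in s)
    return {name: {km for km in s if occurrences[km] == 1}
            for name, s in sets.items()}


def assign_reads(reads, barcodes, k):
    """Count, per genome, the reads carrying at least one of its unique k-mers."""
    hits = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        for name, codes in barcodes.items():
            if read_kmers & codes:
                hits[name] += 1
    return hits


# Hypothetical toy genomes and reads, for illustration only.
toy_genomes = {"s1": "AAACCC", "s2": "AAAGGG"}
barcodes = unique_kmers(toy_genomes, k=3)
hits = assign_reads(["ACCC", "AAAG", "AAAA"], barcodes, k=3)
```

In the toy data the shared k-mer AAA is excluded from both barcode sets, so the read AAAA is assigned to neither genome.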
Affiliation(s)
- Valery V. Panyukov
- Institute of Mathematical Problems of Biology RAS—the Branch of Keldysh Institute of Applied Mathematics of Russian Academy of Sciences, 142290 Pushchino, Russia;
- Structural and Functional Genomics Group, Federal Research Center “Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences”, 142290 Pushchino, Russia;
| | - Sergey S. Kiselev
- Structural and Functional Genomics Group, Federal Research Center “Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences”, 142290 Pushchino, Russia;
- Institute of Cell Biophysics of the Russian Academy of Sciences, 142290 Pushchino, Russia
| | - Olga N. Ozoline
- Structural and Functional Genomics Group, Federal Research Center “Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences”, 142290 Pushchino, Russia;
- Institute of Cell Biophysics of the Russian Academy of Sciences, 142290 Pushchino, Russia
- Correspondence:
|
21
|
Heinzinger M, Elnaggar A, Wang Y, Dallago C, Nechaev D, Matthes F, Rost B. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 2019; 20:723. [PMID: 31847804 PMCID: PMC6918593 DOI: 10.1186/s12859-019-3220-8] [Citation(s) in RCA: 260] [Impact Index Per Article: 43.3] [Received: 05/03/2019] [Accepted: 11/13/2019] [Indexed: 12/15/2022]
Abstract
BACKGROUND Predicting protein function and structure from sequence is one important challenge for computational biology. For 26 years, most state-of-the-art approaches combined machine learning and evolutionary information. However, for some applications retrieving related proteins is becoming too time-consuming. Additionally, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome. Both these problems are addressed by the new methodology introduced here. RESULTS We introduced a novel way to represent protein sequences as continuous vectors (embeddings) by using the language model ELMo taken from natural language processing. By modeling protein sequences, ELMo effectively captured the biophysical properties of the language of life from unlabeled big data (UniRef50). We refer to these new embeddings as SeqVec (Sequence-to-Vector) and demonstrate their effectiveness by training simple neural networks for two different tasks. At the per-residue level, secondary structure (Q3 = 79% ± 1, Q8 = 68% ± 1) and regions with intrinsic disorder (MCC = 0.59 ± 0.03) were predicted significantly better than through one-hot encoding or through Word2vec-like approaches. At the per-protein level, subcellular localization was predicted in ten classes (Q10 = 68% ± 1) and membrane-bound were distinguished from water-soluble proteins (Q2 = 87% ± 1). Although SeqVec embeddings generated the best predictions from single sequences, no solution improved over the best existing method using evolutionary information. Nevertheless, our approach improved over some popular methods using evolutionary information and for some proteins even did beat the best. Thus, they prove to condense the underlying principles of protein sequences. Overall, the important novelty is speed: where the lightning-fast HHblits needed on average about two minutes to generate the evolutionary information for a target protein, SeqVec created embeddings on average in 0.03 s. As this speed-up is independent of the size of growing sequence databases, SeqVec provides a highly scalable approach for the analysis of big data in proteomics, i.e. microbiome or metaproteome analysis.
CONCLUSION Transfer-learning succeeded to extract information from unlabeled sequence databases relevant for various protein prediction tasks. SeqVec modeled the language of life, namely the principles underlying protein sequences better than any features suggested by textbooks and prediction methods. The exception is evolutionary information, however, that information is not available on the level of a single sequence.
Affiliation(s)
- Michael Heinzinger
- Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany.
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany.
| | - Ahmed Elnaggar
- Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
| | - Yu Wang
- Leibniz Supercomputing Centre, Boltzmannstr. 1, 85748, Garching/Munich, Germany
| | - Christian Dallago
- Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
| | - Dmitrii Nechaev
- Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
| | - Florian Matthes
- TUM Department of Informatics, Software Engineering and Business Information Systems, Boltzmannstr. 1, 85748, Garching/Munich, Germany
| | - Burkhard Rost
- Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
- Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany
- TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
- Department of Biochemistry and Molecular Biophysics & New York Consortium on Membrane Protein Structure (NYCOMPS), Columbia University, 701 West, 168th Street, New York, NY, 10032, USA
| |
Collapse
|
22
|
Evolutionary Insight into the Trypanosomatidae Using Alignment-Free Phylogenomics of the Kinetoplast. Pathogens 2019; 8:pathogens8030157. [PMID: 31540520 PMCID: PMC6789588 DOI: 10.3390/pathogens8030157] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 08/23/2019] [Revised: 09/10/2019] [Accepted: 09/13/2019] [Indexed: 12/12/2022]
Abstract
Advancements in next-generation sequencing techniques have led to a substantial increase in the genomic information available for analyses in evolutionary biology. As such, these data require exponential growth in the bioinformatic methods and expertise needed to understand such vast quantities of genomic data. Alignment-free phylogenomics offers an alternative approach for large-scale analyses that may have the potential to address these challenges. The evolutionary relationships between various species within the trypanosomatid family, specifically members belonging to the genera Leishmania and Trypanosoma, have been extensively studied over the last 30 years. However, there is a need for a more exhaustive analysis of the Trypanosomatidae, summarising the evolutionary patterns amongst the entire family of these important protists. The mitochondrial DNA of the trypanosomatids, better known as the kinetoplast, represents a valuable taxonomic marker given its unique presence across all kinetoplastid protozoans. The aim of this study was to validate the reliability and robustness of alignment-free approaches for phylogenomic analyses and their applicability to reconstructing the evolutionary relationships within the trypanosomatid family. In the present study, alignment-free analyses demonstrated the strength of these methods, particularly when dealing with large datasets compared to the traditional phylogenetic approaches. We present a maxicircle genome phylogeny of 46 species spanning the trypanosomatid family, demonstrating the superiority of the maxicircle for the analysis and taxonomic resolution of the Trypanosomatidae.
|
23
|
Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, Maetschke SR, Ragan MA. Alignment-free inference of hierarchical and reticulate phylogenomic relationships. Brief Bioinform 2019; 20:426-435. [PMID: 28673025 PMCID: PMC6433738 DOI: 10.1093/bib/bbx067] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Received: 02/08/2017] [Revised: 05/04/2017] [Indexed: 11/22/2022]
Abstract
We are amidst an ongoing flood of sequence data arising from the application of high-throughput technologies, and a concomitant fundamental revision in our understanding of how genomes evolve individually and within the biosphere. Workflows for phylogenomic inference must accommodate data that are not only much larger than before, but often more error prone and perhaps misassembled, or not assembled in the first place. Moreover, genomes of microbes, viruses and plasmids evolve not only by tree-like descent with modification but also by incorporating stretches of exogenous DNA. Thus, next-generation phylogenomics must address computational scalability while rethinking the nature of orthogroups, the alignment of multiple sequences and the inference and comparison of trees. New phylogenomic workflows have begun to take shape based on so-called alignment-free (AF) approaches. Here, we review the conceptual foundations of AF phylogenetics for the hierarchical (vertical) and reticulate (lateral) components of genome evolution, focusing on methods based on k-mers. We reflect on what seems to be successful, and on where further development is needed.
|
24
|
Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, Lau AK, Röhling S, Choi JJ, Waterman MS, Comin M, Kim SH, Vinga S, Almeida JS, Chan CX, James BT, Sun F, Morgenstern B, Karlowski WM. Benchmarking of alignment-free sequence comparison methods. Genome Biol 2019; 20:144. [PMID: 31345254 PMCID: PMC6659240 DOI: 10.1186/s13059-019-1755-7] [Citation(s) in RCA: 101] [Impact Index Per Article: 16.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2019] [Accepted: 07/03/2019] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland
| | - Hani Z Girgis
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
| | | | - Chris-Andre Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Kujin Tang
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
| | - Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Anna Katharina Lau
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Sophie Röhling
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Jae Jin Choi
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Michael S Waterman
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
| | - Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
| | - Sung-Hou Kim
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Susana Vinga
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
| | - Jonas S Almeida
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NIH/NCI), Bethesda, USA
| | - Cheong Xin Chan
- Institute for Molecular Bioscience, and School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia
| | - Benjamin T James
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
| | - Fengzhu Sun
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland.
| |
|
25
|
Lu YY, Tang K, Ren J, Fuhrman JA, Waterman MS, Sun F. CAFE: aCcelerated Alignment-FrEe sequence analysis. Nucleic Acids Res 2017; 45:W554-W559. [PMID: 28472388 PMCID: PMC5793812 DOI: 10.1093/nar/gkx351] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2017] [Accepted: 04/20/2017] [Indexed: 12/13/2022] Open
Abstract
Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, $d_2^*$ and $d_2^S$ are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report a standalone software, aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28 alignment-free dissimilarity measures. CAFE allows for both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in a standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures, including dendrograms, heatmap, principal coordinate analysis and network display. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes, and is freely available at https://github.com/younglululu/CAFE.
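For orientation, a measure "based solely on the k-mer frequencies" might be sketched as below. This is a minimal Python illustration under stated assumptions, not CAFE's implementation: the function names are hypothetical, the normalised d2-style formula is one common convention chosen for the example, and the background-adjusted $d_2^*$ and $d_2^S$ measures are not computed here.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (hypothetical helper)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_dissimilarity(seq_a, seq_b, k=3):
    """Normalised d2-style dissimilarity between two sequences.

    Assumed form: 0.5 * (1 - D2 / (|X| * |Y|)), where D2 is the inner
    product of the two k-mer count vectors and |X|, |Y| are their
    Euclidean norms. Returns a value in [0, 0.5].
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Only k-mers present in both sequences contribute to the inner product.
    shared = counts_a.keys() & counts_b.keys()
    d2 = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each sequence must contain at least one k-mer")
    return 0.5 * (1 - d2 / (norm_a * norm_b))
```

Identical sequences give a value of approximately 0, and sequences that share no k-mers give 0.5. A pairwise matrix of such values is the kind of input a distance-based tree-building program would consume.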
Affiliation(s)
- Yang Young Lu
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA
| | - Kujin Tang
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA
| | - Jie Ren
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA
| | - Jed A Fuhrman
- Department of Biological Sciences and Wrigley Institute for Environmental Studies, University of Southern California, Los Angeles, CA 90089, USA
| | - Michael S Waterman
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA.,Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, 200433 Shanghai, China
| | - Fengzhu Sun
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA.,Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, 200433 Shanghai, China
| |
|
26
|
Bernard G, Greenfield P, Ragan MA, Chan CX. k-mer Similarity, Networks of Microbial Genomes, and Taxonomic Rank. mSystems 2018; 3:e00257-18. [PMID: 30505941 PMCID: PMC6247013 DOI: 10.1128/msystems.00257-18] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2018] [Accepted: 11/02/2018] [Indexed: 01/27/2023] Open
Abstract
Microbial genomes have been shaped by parent-to-offspring (vertical) descent and lateral genetic transfer. These processes can be distinguished by alignment-based inference and comparison of phylogenetic trees for individual gene families, but this approach is not scalable to whole-genome sequences, and a tree-like structure does not adequately capture how these processes impact microbial physiology. Here we adopted alignment-free approaches based on k-mer statistics to infer phylogenomic networks involving 2,783 completely sequenced bacterial and archaeal genomes and compared the contributions of rRNA, protein-coding, and plasmid sequences to these networks. Our results show that the phylogenomic signal arising from ribosomal RNAs is strong and extends broadly across all taxa, whereas that from plasmids is strong but restricted to closely related groups, particularly Proteobacteria. However, the signal from the other chromosomal regions is restricted in breadth. We show that mean k-mer similarity can correlate with taxonomic rank. We also link the implicated k-mers to genome annotation (thus, functions) and define core k-mers (thus, core functions) in specific phyletic groups. Highly conserved functions in most phyla include amino acid metabolism and transport as well as energy production and conversion. Intracellular trafficking and secretion are the most prominent core functions among Spirochaetes, whereas energy production and conversion are not highly conserved among the largely parasitic or commensal Tenericutes. These observations suggest that differential conservation of functions relates to niche specialization and evolutionary diversification of microbes. Our results demonstrate that k-mer approaches can be used to efficiently identify phylogenomic signals and conserved core functions at the multigenome scale. IMPORTANCE Genome evolution of microbes involves parent-to-offspring descent, and lateral genetic transfer that convolutes the phylogenomic signal. 
This study investigated phylogenomic signals among thousands of microbial genomes based on short subsequences without using multiple-sequence alignment. The signal from ribosomal RNAs is strong across all taxa, and the signal of plasmids is strong only in closely related groups, particularly Proteobacteria. However, the signal from other chromosomal regions (∼99% of the genomes) is remarkably restricted in breadth. The similarity of subsequences is found to correlate with taxonomic rank and informs on conserved and differential core functions relative to niche specialization and evolutionary diversification of microbes. These results provide a comprehensive, alignment-free view of microbial genome evolution as a network, beyond a tree-like structure.
Affiliation(s)
- Guillaume Bernard
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Paul Greenfield
- Commonwealth Scientific and Industrial Research Organisation (CSIRO), North Ryde, NSW, Australia
| | - Mark A. Ragan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Cheong Xin Chan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
- School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
| |
|
27
|
Fan H, Ives AR, Surget-Groba Y. Reconstructing phylogeny from reduced-representation genome sequencing data without assembly or alignment. Mol Ecol Resour 2018; 18:1482-1491. [PMID: 29939475 DOI: 10.1111/1755-0998.12921] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2016] [Revised: 05/18/2018] [Accepted: 05/29/2018] [Indexed: 12/21/2022]
Abstract
Reduced-representation genome sequencing such as RADseq aids the analysis of genomes by reducing the quantity of data, thereby lowering both sequencing costs and computational burdens. RADseq was initially designed for studying genetic variation across genomes at the population level, but has also proved to be suitable for interspecific phylogeny reconstruction. RADseq data pose challenges for standard phylogenomic methods, however, due to incomplete coverage of the genome and large amounts of missing data. Alignment-free methods are both efficient and accurate for phylogenetic reconstructions with whole genomes and are especially practical for nonmodel organisms; nonetheless, alignment-free methods have not been applied with reduced genome sequencing data. Here, we test a full-genome assembly- and alignment-free method, AAF, in application to RADseq data and propose two procedures for reads selection to remove reads from restriction sites that were not found in taxa being compared. We validate these methods using both simulations and real data sets. Reads selection improved the accuracy of phylogenetic construction in every simulated scenario and the two real data sets, making AAF as good or better than a comparable alignment-based method, even though AAF had much lower computational burdens. We also investigated the sources of missing data in RADseq and their effects on phylogeny reconstruction using AAF. The AAF pipeline modified for RADseq or other reduced-representation sequencing data, phyloRAD, is available on github (https://github.com/fanhuan/phyloRAD).
Affiliation(s)
- Huan Fan
- Department of Integrative Biology, University of Wisconsin-Madison, Madison, Wisconsin
| | - Anthony R Ives
- Department of Integrative Biology, University of Wisconsin-Madison, Madison, Wisconsin
| | - Yann Surget-Groba
- Institut des Sciences de la Forêt Tempérée, Université du Québec en Outaouais, Ripon, Quebec, Canada
| |
|
28
|
Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F. Alignment-Free Sequence Analysis and Applications. Annu Rev Biomed Data Sci 2018; 1:93-114. [PMID: 31828235 PMCID: PMC6905628 DOI: 10.1146/annurev-biodatasci-080917-013431] [Citation(s) in RCA: 58] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Genome and metagenome comparisons based on large amounts of next generation sequencing (NGS) data pose significant challenges for alignment-based approaches due to the huge data size and the relatively short length of the reads. Alignment-free approaches based on the counts of word patterns in NGS data do not depend on the complete genome and are generally computationally efficient. Thus, they contribute significantly to genome and metagenome comparison. Recently, novel statistical approaches have been developed for the comparison of both long and shotgun sequences. These approaches have been applied to many problems including the comparison of gene regulatory regions, genome sequences, metagenomes, binning contigs in metagenomic data, identification of virus-host interactions, and detection of horizontal gene transfers. We provide an updated review of these applications and other related developments of word-count based approaches for alignment-free sequence analysis.
Affiliation(s)
- Jie Ren
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Xin Bai
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| | - Yang Young Lu
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Kujin Tang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen, Fujian, China
| | - Gesine Reinert
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Fengzhu Sun
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| |
|
29
|
Kiu R, Caim S, Alexander S, Pachori P, Hall LJ. Probing Genomic Aspects of the Multi-Host Pathogen Clostridium perfringens Reveals Significant Pangenome Diversity, and a Diverse Array of Virulence Factors. Front Microbiol 2017; 8:2485. [PMID: 29312194 PMCID: PMC5733095 DOI: 10.3389/fmicb.2017.02485] [Citation(s) in RCA: 56] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2017] [Accepted: 11/29/2017] [Indexed: 01/08/2023] Open
Abstract
Clostridium perfringens is an important cause of animal and human infections; however, information about the genetic makeup of this pathogenic bacterium is currently limited. In this study, we sought to understand and characterise the genomic variation, pangenomic diversity, and key virulence traits of 56 C. perfringens strains which included 51 public, and 5 newly sequenced and annotated genomes using Whole Genome Sequencing. Our investigation revealed that C. perfringens has an "open" pangenome comprising 11667 genes and 12.6% of core genes, identified as the most divergent single-species Gram-positive bacterial pangenome currently reported. Our computational analyses also defined C. perfringens phylogeny (16S rRNA gene) in relation to some 25 Clostridium species, with C. baratii and C. sardiniense determined to be the closest relatives. Profiling virulence-associated factors confirmed presence of well-characterised C. perfringens-associated exotoxin genes including α-toxin (plc), enterotoxin (cpe), and Perfringolysin O (pfo or pfoA), although interestingly there did not appear to be a close correlation between encoded toxin type and disease phenotype. Furthermore, genomic analysis indicated significant horizontal gene transfer events as defined by presence of prophage genomes, and notably absence of CRISPR defence systems in >70% (40/56) of the strains. In relation to antimicrobial resistance mechanisms, tetracycline resistance genes (tet) and anti-defensin genes (mprF) were consistently detected in silico (tet: 75%; mprF: 100%). However, pre-antibiotic era strain genomes did not encode tet, thus implying antimicrobial selective pressures in C. perfringens evolutionary history over the past 80 years. This study provides new genomic understanding of this genetically divergent multi-host bacterium, and further expands our knowledge of this medically and veterinarily important pathogen.
Affiliation(s)
- Raymond Kiu
- Gut Health and Food Safety, Quadram Institute Bioscience, Norwich Research Park, Norwich, United Kingdom
- Norwich Medical School, University of East Anglia, Norwich Research Park, Norwich, United Kingdom
| | - Shabhonam Caim
- Gut Health and Food Safety, Quadram Institute Bioscience, Norwich Research Park, Norwich, United Kingdom
| | | | - Purnima Pachori
- Earlham Institute, Norwich Research Park, Norwich, United Kingdom
| | - Lindsay J. Hall
- Gut Health and Food Safety, Quadram Institute Bioscience, Norwich Research Park, Norwich, United Kingdom
| |
|
30
|
Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017; 18:186. [PMID: 28974235 PMCID: PMC5627421 DOI: 10.1186/s13059-017-1319-7] [Citation(s) in RCA: 267] [Impact Index Per Article: 33.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023] Open
Abstract
Alignment-free sequence analyses have been applied to problems ranging from whole-genome phylogeny to the classification of protein families, identification of horizontally transferred genes, and detection of recombined sequences. The strength of these methods makes them particularly useful for next-generation sequencing data processing and analysis. However, many researchers are unclear about how these methods work, how they compare to alignment-based methods, and what their potential is for use for their research. We address these questions and provide a guide to the currently available alignment-free sequence analysis tools.
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland
| | - Susana Vinga
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
| | - Jonas Almeida
- Stony Brook University (SUNY), 101 Nicolls Road, Stony Brook, NY, 11794, USA
| | - Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland.
| |
|
31
|
Bhardwaj T, Somvanshi P. Pan-genome analysis of Clostridium botulinum reveals unique targets for drug development. Gene 2017; 623:48-62. [DOI: 10.1016/j.gene.2017.04.019] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2016] [Revised: 03/29/2017] [Accepted: 04/12/2017] [Indexed: 10/19/2022]
|
32
|
Seo H, Cho DH. A new alignment free genome comparison algorithm based on statistically estimated feature frequency profile. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2017; 2017:4265-4268. [PMID: 29060839 DOI: 10.1109/embc.2017.8037798] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Sequence comparison is an important part of bioinformatics for understanding the biological properties of genomes. Although alignment-based sequence comparison is a traditional and reliable approach, alignment-free methods have been actively researched because of their advantage in terms of computational complexity. In this paper, we suggest a new alignment-free genome comparison scheme based on a statistical approach. From sequence components, the word frequency information of the sequence is estimated. By investigating the relationship between the estimated frequency information and the actual word frequencies, the characteristics of the sequence are numerically represented. A phylogenetic tree and a sequence classification of mammalian sequences are provided to demonstrate the performance of our statistical algorithm.
|
33
|
Leimeister CA, Sohrabi-Jahromi S, Morgenstern B. Fast and accurate phylogeny reconstruction using filtered spaced-word matches. Bioinformatics 2017; 33:971-979. [PMID: 28073754 PMCID: PMC5409309 DOI: 10.1093/bioinformatics/btw776] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2016] [Accepted: 12/02/2016] [Indexed: 11/13/2022] Open
Abstract
Motivation: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods. Results: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don’t-care positions, FSWM rapidly identifies spaced word-matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don’t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don’t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes. Availability and Implementation: The program source code for FSWM, including documentation, as well as the software that we used to generate artificial genome sequences, are freely available at http://fswm.gobics.de/ Supplementary information: Supplementary data are available at Bioinformatics online.
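The procedure described in this abstract can be illustrated with a small Python sketch. It is a simplified reading of the text, not the FSWM program: the function name, the default pattern, the +1/-1 scoring used for filtering, the `min_score` parameter and the Jukes-Cantor correction are all assumptions made for the example.

```python
from collections import defaultdict
from math import log


def spaced_word_distance(seq1, seq2, pattern="1100101", min_score=0):
    """Estimate substitutions per site from spaced-word matches.

    pattern: '1' marks a match position, '0' a don't-care position.
    min_score: hypothetical filter threshold; a match is scored +1 per
    agreeing and -1 per disagreeing don't-care position.
    Returns None when no spaced-word match survives the filter.
    """
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    care_pos = [i for i, c in enumerate(pattern) if c == "0"]
    width = len(pattern)

    # Index seq1 windows by the nucleotides at the match positions.
    index = defaultdict(list)
    for i in range(len(seq1) - width + 1):
        index[tuple(seq1[i + p] for p in match_pos)].append(i)

    mismatches = compared = 0
    for j in range(len(seq2) - width + 1):
        key = tuple(seq2[j + p] for p in match_pos)
        for i in index.get(key, ()):
            diffs = sum(seq1[i + p] != seq2[j + p] for p in care_pos)
            score = (len(care_pos) - diffs) - diffs
            if score < min_score:
                continue  # treat as a spurious random match and discard
            mismatches += diffs
            compared += len(care_pos)

    if compared == 0:
        return None
    p = mismatches / compared
    if p >= 0.75:
        return float("inf")  # correction is undefined at saturation
    # Jukes-Cantor correction (an assumption of this sketch).
    return -0.75 * log(1 - 4 * p / 3)
```

Indexing one sequence by its match-position nucleotides lets every window of the other sequence be looked up directly, so this toy version avoids comparing all pairs of positions explicitly.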
Affiliation(s)
- Chris-André Leimeister
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Salma Sohrabi-Jahromi
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Burkhard Morgenstern
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077 Göttingen, Germany.,University of Göttingen, Center for Computational Sciences, Goldschmidtstr. 1, 37077 Göttingen, Germany
| |
|
34
|
Cong Y, Chan YB, Phillips CA, Langston MA, Ragan MA. Robust Inference of Genetic Exchange Communities from Microbial Genomes Using TF-IDF. Front Microbiol 2017; 8:21. [PMID: 28154557 PMCID: PMC5243798 DOI: 10.3389/fmicb.2017.00021] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2016] [Accepted: 01/04/2017] [Indexed: 11/13/2022] Open
Abstract
Bacteria and archaea can exchange genetic material across lineages through processes of lateral genetic transfer (LGT). Collectively, these exchange relationships can be modeled as a network and analyzed using concepts from graph theory. In particular, densely connected regions within an LGT network have been defined as genetic exchange communities (GECs). However, it has been problematic to construct networks in which edges solely represent LGT. Here we apply term frequency-inverse document frequency (TF-IDF), an alignment-free method originating from document analysis, to infer regions of lateral origin in bacterial genomes. We examine four empirical datasets of different size (number of genomes) and phyletic breadth, varying a key parameter (word length k) within bounds established in previous work. We map the inferred lateral regions to genes in recipient genomes, and construct networks in which the nodes are groups of genomes, and the edges natively represent LGT. We then extract maximum and maximal cliques (i.e., GECs) from these graphs, and identify nodes that belong to GECs across a wide range of k. Most surviving lateral transfer has happened within these GECs. Using Gene Ontology enrichment tests we demonstrate that biological processes associated with metabolism, regulation and transport are often over-represented among the genes affected by LGT within these communities. These enrichments are largely robust to change of k.
Affiliation(s)
- Yingnan Cong
- Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, University of Queensland, St Lucia QLD, Australia
| | - Yao-Ban Chan
- School of Mathematics and Statistics, University of Melbourne, Parkville VIC, Australia
| | - Charles A Phillips
- Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville TN, USA
| | - Michael A Langston
- Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville TN, USA
| | - Mark A Ragan
- Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, University of Queensland, St Lucia QLD, Australia
| |
|
35
|
Naushad S, Barkema HW, Luby C, Condas LAZ, Nobrega DB, Carson DA, De Buck J. Comprehensive Phylogenetic Analysis of Bovine Non-aureus Staphylococci Species Based on Whole-Genome Sequencing. Front Microbiol 2016; 7:1990. [PMID: 28066335 PMCID: PMC5168469 DOI: 10.3389/fmicb.2016.01990] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2016] [Accepted: 11/28/2016] [Indexed: 11/19/2022] Open
Abstract
Non-aureus staphylococci (NAS), a heterogeneous group of a large number of species and subspecies, are the most frequently isolated pathogens from intramammary infections in dairy cattle. Phylogenetic relationships among bovine NAS species are controversial and have mostly been determined based on single-gene trees. Herein, we analyzed phylogeny of bovine NAS species using whole-genome sequencing (WGS) of 441 distinct isolates. In addition, evolutionary relationships among bovine NAS were estimated from multilocus data of 16S rRNA, hsp60, rpoB, sodA, and tuf genes and sequences from these and numerous other single genes/proteins. All phylogenies were created with FastTree, Maximum-Likelihood, Maximum-Parsimony, and Neighbor-Joining methods. Regardless of methodology, WGS-trees clearly separated bovine NAS species into five monophyletic coherent clades. Furthermore, there were consistent interspecies relationships within clades in all WGS phylogenetic reconstructions. Except for the Maximum-Parsimony tree, multilocus data analysis similarly produced five clades. There were large variations in determining clades and interspecies relationships in single gene/protein trees, under different methods of tree constructions, highlighting limitations of using single genes for determining bovine NAS phylogeny. However, based on WGS data, we established a robust phylogeny of bovine NAS species, unaffected by method or model of evolutionary reconstructions. Therefore, it is now possible to determine associations between phylogeny and many biological traits, such as virulence, antimicrobial resistance, environmental niche, geographical distribution, and host specificity.
Affiliation(s)
- Sohail Naushad
- Department of Production Animal Health, Faculty of Veterinary Medicine, University of Calgary, Calgary, AB, Canada; Canadian Bovine Mastitis and Milk Quality Research Network, St-Hyacinthe, QC, Canada
- Herman W Barkema
- Department of Production Animal Health, Faculty of Veterinary Medicine, University of Calgary, Calgary, AB, Canada; Canadian Bovine Mastitis and Milk Quality Research Network, St-Hyacinthe, QC, Canada
- Christopher Luby
- Canadian Bovine Mastitis and Milk Quality Research Network, St-Hyacinthe, QC, Canada; Department of Large Animal Clinical Sciences, Western College of Veterinary Medicine, University of Saskatchewan, Saskatoon, SK, Canada
- Larissa A Z Condas
- Department of Production Animal Health, Faculty of Veterinary Medicine, University of Calgary, Calgary, AB, Canada; Canadian Bovine Mastitis and Milk Quality Research Network, St-Hyacinthe, QC, Canada
- Diego B Nobrega
- Department of Production Animal Health, Faculty of Veterinary Medicine, University of Calgary, Calgary, AB, Canada; Canadian Bovine Mastitis and Milk Quality Research Network, St-Hyacinthe, QC, Canada
- Domonique A Carson
- Department of Production Animal Health, Faculty of Veterinary Medicine, University of Calgary, Calgary, AB, Canada; Canadian Bovine Mastitis and Milk Quality Research Network, St-Hyacinthe, QC, Canada
- Jeroen De Buck
- Department of Production Animal Health, Faculty of Veterinary Medicine, University of Calgary, Calgary, AB, Canada; Canadian Bovine Mastitis and Milk Quality Research Network, St-Hyacinthe, QC, Canada
|
36
|
Abstract
Ernst Haeckel based his landmark Tree of Life on the supposed ontogenic recapitulation of phylogeny, i.e. that successive embryonic stages during the development of an organism re-trace the morphological forms of its ancestors over the course of evolution. Much of this idea has since been discredited. Today, phylogenies are often based on families of molecular sequences. The standard approach starts with a multiple sequence alignment, in which the sequences are arranged relative to each other in a way that maximises a measure of similarity position-by-position along their entire length. A tree (or sometimes a network) is then inferred. Rigorous multiple sequence alignment is computationally demanding, and evolutionary processes that shape the genomes of many microbes (bacteria, archaea and some morphologically simple eukaryotes) can add further complications. In particular, recombination, genome rearrangement and lateral genetic transfer undermine the assumptions that underlie multiple sequence alignment, and imply that a tree-like structure may be too simplistic. Here, using genome sequences of 143 bacterial and archaeal genomes, we construct a network of phylogenetic relatedness based on the number of shared k-mers (subsequences at fixed length k). Our findings suggest that the network captures not only key aspects of microbial genome evolution as inferred from a tree, but also features that are not treelike. The method is highly scalable, allowing for investigation of genome evolution across a large number of genomes. Instead of using specific regions or sequences from genome sequences, or indeed Haeckel’s idea of ontogeny, we argue that genome phylogenies can be inferred using k-mers from whole-genome sequences. Representing these networks dynamically allows biological questions of interest to be formulated and addressed quickly and in a visually intuitive manner.
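As a rough illustration of the idea of relating genomes by shared k-mers, the sketch below counts the distinct k-mers two sequences have in common and keeps an edge when that count reaches a threshold. It is a minimal sketch only: the function names, the toy sequences, and the `min_shared` threshold are assumptions made for illustration, and the plain shared-count used here is not claimed to be the statistic the authors applied.

```python
from itertools import combinations


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_network(genomes: dict[str, str], k: int, min_shared: int):
    """Link two genomes when they share at least min_shared distinct k-mers.

    Returns a dict mapping (name_a, name_b) to the shared k-mer count.
    The threshold rule is a hypothetical simplification.
    """
    profiles = {name: kmers(seq, k) for name, seq in genomes.items()}
    edges = {}
    for a, b in combinations(sorted(profiles), 2):
        shared = len(profiles[a] & profiles[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges


# Toy sequences, invented for illustration only (not real genomes).
toy = {
    "g1": "ACGTACGTGA",
    "g2": "ACGTACGTCC",
    "g3": "TTTTGGGGCC",
}
network = shared_kmer_network(toy, k=4, min_shared=2)
print(network)
```

With these toy inputs only g1 and g2 are linked, since g3 shares no 4-mers with either. Lowering `min_shared` or `k` yields a denser network, which is one way such a representation can show non-treelike relatedness.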
Affiliation(s)
- Guillaume Bernard
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
- Mark A Ragan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
- Cheong Xin Chan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
|