1
Boumajdi N, Bendani H, Belyamani L, Ibrahimi A. TreeWave: command line tool for alignment-free phylogeny reconstruction based on graphical representation of DNA sequences and genomic signal processing. BMC Bioinformatics 2024; 25:367. [PMID: 39604838 PMCID: PMC11600722 DOI: 10.1186/s12859-024-05992-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2024] [Accepted: 11/18/2024] [Indexed: 11/29/2024]
Abstract
BACKGROUND Genomic sequence similarity comparison is a crucial research area in bioinformatics. Multiple Sequence Alignment (MSA) is the basic technique used to identify regions of similarity between sequences. Although MSA tools are widely used and highly accurate, they are often limited by computational complexity and by inaccuracies when handling highly divergent sequences, which has led to the development of alignment-free (AF) algorithms. RESULTS This paper presents TreeWave, a novel AF approach based on frequency chaos game representation and discrete wavelet transform of sequences for phylogeny inference. We validate our method on various genomic datasets such as complete virus genome sequences, bacteria genome sequences, human mitochondrial genome sequences, and rRNA gene sequences. Compared to classical methods, our tool demonstrates a significant reduction in running time, especially when analyzing large datasets. The resulting phylogenetic trees show that TreeWave has similar classification accuracy to the classical MSA methods based on the normalized Robinson-Foulds distances and Baker's Gamma coefficients. CONCLUSIONS TreeWave is an open source and user-friendly command line tool for phylogeny reconstruction. It is a faster and more scalable tool that prioritizes computational efficiency while maintaining accuracy. TreeWave is freely available at https://github.com/nasmaB/TreeWave.
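The two building blocks named in this abstract, frequency chaos game representation (FCGR) and the discrete wavelet transform, can be illustrated with a minimal sketch. It is not TreeWave's implementation: the corner layout in `CORNER`, the choice of a single Haar level, and the helper names `fcgr`, `haar_step`, `descriptor` and `distance` are all assumptions made for illustration.

```python
import math

# Assumed corner layout; other CGR conventions place the bases differently.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def fcgr(seq, k):
    """Count k-mers into a 2^k x 2^k grid, one cell per k-mer."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNER for base in kmer):
            continue  # skip k-mers containing N or other ambiguity codes
        x = y = 0
        for base in kmer:
            bx, by = CORNER[base]
            x, y = 2 * x + bx, 2 * y + by  # each base adds one bit per axis
        grid[y][x] += 1
    return grid

def haar_step(signal):
    """One level of the orthonormal Haar DWT on an even-length signal."""
    root2 = math.sqrt(2)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / root2 for a, b in pairs]
    detail = [(a - b) / root2 for a, b in pairs]
    return approx, detail

def descriptor(seq, k=3):
    """Flatten the FCGR row by row, convert to frequencies, keep the Haar approximation."""
    flat = [count for row in fcgr(seq, k) for count in row]
    total = sum(flat) or 1
    approx, _ = haar_step([count / total for count in flat])
    return approx

def distance(seq_a, seq_b, k=3):
    """Euclidean distance between two descriptors (an assumed choice of metric)."""
    return math.dist(descriptor(seq_a, k), descriptor(seq_b, k))
```

A pairwise matrix of `distance` values over all sequences could then be passed to any distance-based tree builder. Because the descriptor has a fixed length of 4^k / 2 regardless of sequence length, no alignment step is needed.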
Affiliation(s)
- Nasma Boumajdi
  - Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco
- Houda Bendani
  - Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco
- Lahcen Belyamani
  - Mohammed VI Center for Research and Innovation (CM6), Rabat, Morocco
  - Mohammed VI University of Sciences and Health (UM6SS), Casablanca, Morocco
  - Emergency Department, Military Hospital Mohammed V, Rabat Medical and Pharmacy School, Mohammed V University, Rabat, Morocco
- Azeddine Ibrahimi
  - Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco
2
Ghosh S, Pal J, Cattani C, Maji B, Bhattacharya DK. Protein sequence comparison based on representation on a finite dimensional unit hypercube. J Biomol Struct Dyn 2024; 42:6425-6439. [PMID: 37837426 DOI: 10.1080/07391102.2023.2268719] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/16/2023] [Accepted: 07/01/2023] [Indexed: 10/16/2023]
Abstract
Numerous techniques are used to compare protein sequences based on the values of the physiochemical properties of amino acids. In this work, a single physical/chemical property value based non-binary representation of protein sequences is obtained on a 20 × 20-dimensional unit hypercube. The represented vector expressed in the matrix form is taken as the descriptor. The generalized NTV metric, which is an extension of the NTV metric used for polynucleotide space, is taken as a distance measure. Based on this distance measure, a distance matrix is obtained for protein sequence comparison. Using this distance matrix, phylogenetic trees are drawn by using Molecular Evolutionary Genetics Analysis 11 (MEGA11) software applying the neighbor-joining method. Data sets used in this current work are 9-ND4, 9-ND5, 9-ND6, 24 TF-LF proteins, 27 different viruses and 127 proteins from the protein kinase C (PKC) family. Two sets of phylogenetic trees are obtained - one based on property value of polarity and the other based on property value of molecular weight. They are found to be exactly the same. Similar results also hold for other single property value based representation. The present trees are individually tested for efficiency based on the criterion of rationalized perception and computational time. The results of the present method are compared with those obtained earlier by other methods on the same protein sequences using assessment criteria of Symmetric distance (SD), Correlation coefficient, and Rationalized perception. In all the cases, the present results are found to be better than the results of other methods under comparison. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Soumen Ghosh
  - Electronics & Communication Engineering, National Institute of Technology, Durgapur, West Bengal, India
  - Information Technology, Narula Institute of Technology, Kolkata, West Bengal, India
- Jayanta Pal
  - Computer Science & Engineering, Narula Institute of Technology, Kolkata, West Bengal, India
- Carlo Cattani
  - DEIM, University of Tuscia, Largo dell'Universita, Viterbo, Italy
- Bansibadan Maji
  - Electronics & Communication Engineering, National Institute of Technology, Durgapur, West Bengal, India
3
Islam R, Rahman A. An alignment-free method for detection of missing regions for phylogenetic analysis. Heliyon 2024; 10:e32227. [PMID: 38933968 PMCID: PMC11200290 DOI: 10.1016/j.heliyon.2024.e32227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2024] [Revised: 05/17/2024] [Accepted: 05/29/2024] [Indexed: 06/28/2024]
Abstract
Phylogenetic tree estimation using conventional approaches usually requires pairwise or multiple sequence alignment. However, sequence alignment has difficulties related to scalability and accuracy in case of long sequences such as whole genomes, low sequence identity, and in presence of genomic rearrangements. To address these issues, alignment-free approaches have been proposed. While these methods have demonstrated promising results, many of these lead to errors when regions are missing from the sequences of one or more species that are trivially detected in alignment-based methods. Here, we present an alignment-free method for detecting missing regions in sequences of species for which phylogeny is to be estimated. It is based on counts of k-mers and can be used to filter out k-mers belonging to regions in one species that are missing in one or more of the other species. We perform experiments with real and simulated datasets containing missing regions and find that it can successfully detect a large fraction of such k-mers and can lead to improvements in the estimated phylogenies. Our method can be used in k-mer based alignment-free phylogeny estimation methods to filter out k-mers corresponding to missing regions.
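The abstract does not state how missing regions are detected from k-mer counts, so the following is only a crude presence/absence sketch of the general idea of filtering: a k-mer is kept only if it occurs in every sequence under comparison. The function names `kmer_counts` and `keep_shared_kmers` are hypothetical, and the authors' actual criterion is likely more refined.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def keep_shared_kmers(seqs, k):
    """Drop k-mers that are absent from at least one sequence.

    seqs maps a species name to its sequence. This simple rule is an
    assumption for illustration, not the published detection method.
    """
    counts = {name: kmer_counts(seq, k) for name, seq in seqs.items()}
    shared = set.intersection(*(set(c) for c in counts.values()))
    return {
        name: {kmer: n for kmer, n in c.items() if kmer in shared}
        for name, c in counts.items()
    }
```

For example, with `{"a": "ACGTACGT", "b": "ACGTTTTT"}` and `k=3`, only `ACG` and `CGT` survive the filter. A real filter would need to tolerate point mutations, which this exact-match rule does not.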
Affiliation(s)
- Rubyeat Islam
  - Department of Computer Science and Engineering, Military Institute of Science and Technology, Dhaka, Bangladesh
- Atif Rahman
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
4
Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024]
Abstract
Traditional alignment-based methods meet serious challenges in genome sequence comparison and phylogeny reconstruction due to their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. Then for each DNA sequence or protein sequence in a dataset, our method converts the sequence into a feature vector that represents the sequence information based on CGR weighted by the DL model to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences of 8 datasets of viruses to construct the phylogenetic trees. We compared the Robinson-Foulds (RF) distance between the phylogenetic tree constructed by CGRWDL and the reference tree by other advanced methods for each dataset. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and the RF scores between the trees and the reference trees are smaller than that with other methods.
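The Robinson-Foulds (RF) distance used for evaluation here counts the groupings that appear in one tree but not in the other. The sketch below is a simplified rooted variant that compares sets of clades; the standard definition works on bipartitions of unrooted trees. Representing trees as nested tuples of leaf names, and the function names, are assumptions for illustration.

```python
def clades(tree):
    """Return (all leaves, set of non-trivial clades) of a rooted tree.

    A tree is a leaf name (str) or a tuple of subtrees.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)  # the root clade is shared by all trees on these taxa
    return all_leaves, found

def rf_distance(t1, t2):
    """Number of clades present in exactly one of the two trees."""
    leaves1, c1 = clades(t1)
    leaves2, c2 = clades(t2)
    if leaves1 != leaves2:
        raise ValueError("trees must have the same leaf set")
    return len(c1 ^ c2)

def normalised_rf(t1, t2):
    """Divide by the maximum for rooted binary trees, 2 * (n - 2); needs n >= 3 leaves."""
    n = len(clades(t1)[0])
    return rf_distance(t1, t2) / (2 * (n - 2))
```

With `t1 = ((("A", "B"), "C"), "D")` and `t2 = ((("A", "C"), "B"), "D")`, the trees differ in one clade each (`{A, B}` versus `{A, C}`), so `rf_distance` is 2 and `normalised_rf` is 0.5. A smaller score means the reconstructed tree is closer to the reference.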
Affiliation(s)
- Ting Wang
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Zu-Guo Yu
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Jinyan Li
  - School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China
5
Pal J, Ghosh S, Maji B, Bhattacharya DK. MMV method: a new approach to compare protein sequences under binary representation. J Biomol Struct Dyn 2024:1-7. [PMID: 38375605 DOI: 10.1080/07391102.2024.2317982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Accepted: 02/07/2024] [Indexed: 02/21/2024]
Abstract
In the present work, a new form of descriptor using minimal moment vector (MMV) is introduced to compare protein sequences in the frequency domain under their component wise binary representations. From every sequence, 20 different binary component sequences are formed, each corresponding to one of the 20 amino acids. Each such vector is now shifted from the time domain to the frequency domain by applying the Fast Fourier Transform (FFT). Next, the power spectrum calculated from the FFT values for each component sequence is so normalized that the sum of the components equals 1. The descriptor is defined as a 20-component vector composed of the 20 second-order minimal moments calculated from the normalized spectrum of the 20 component sequences. Once the descriptor is known, the distance matrix is created by applying the Euclidean Distance measure. The phylogenetic tree is generated by applying the unweighted pair group method with the arithmetic mean (UPGMA) algorithm using Molecular Evolutionary Genetics Analysis 11 (MEGA11) software. In this work, the datasets used for similarity studies are 9 NADH dehydrogenase 5 (ND5), 12 Baculoviruses, 24 Transferrins (TF) proteins, and 50 Spike Protein of coronavirus. A qualitative measure using rationalized perception is used to compare the effectiveness of the proposed method. Quantitative measure based on symmetric distance (SD) is used to compare the phylogenetic trees of the present method with those obtained by other methods. It is observed that the phylogenetic trees generated by the proposed technique are at par with their known biological references, and they produce results better than those of the earlier methods. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Jayanta Pal
  - Department of ECE, National Institute of Technology, Durgapur, India
  - Department of CSE, Narula Institute of Technology, Kolkata, India
- Soumen Ghosh
  - Department of ECE, National Institute of Technology, Durgapur, India
  - Department of IT, Narula Institute of Technology, Kolkata, India
- Bansibadan Maji
  - Department of ECE, National Institute of Technology, Durgapur, India
6
Lebatteux D, Soudeyns H, Boucoiran I, Gantt S, Diallo AB. Machine learning-based approach KEVOLVE efficiently identifies SARS-CoV-2 variant-specific genomic signatures. PLoS One 2024; 19:e0296627. [PMID: 38241279 PMCID: PMC10798494 DOI: 10.1371/journal.pone.0296627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2022] [Accepted: 12/07/2023] [Indexed: 01/21/2024]
Abstract
Machine learning was shown to be effective at identifying distinctive genomic signatures among viral sequences. These signatures are defined as pervasive motifs in the viral genome that allow discrimination between species or variants. In the context of SARS-CoV-2, the identification of these signatures can assist in taxonomic and phylogenetic studies, improve the recognition and definition of emerging variants, and aid in the characterization of functional properties of polymorphic gene products. In this paper, we assess KEVOLVE, an approach based on a genetic algorithm with a machine-learning kernel, to identify multiple genomic signatures based on minimal sets of k-mers. In a comparative study, in which we analyzed a large SARS-CoV-2 genome dataset, KEVOLVE was more effective at identifying variant-discriminative signatures than several gold-standard statistical tools. Subsequently, these signatures were characterized using a new extension of KEVOLVE (KANALYZER) to highlight variations of the discriminative signatures among different classes of variants, their genomic location, and the mutations involved. The majority of identified signatures were associated with known mutations among the different variants, in terms of functional and pathological impact based on available literature. Here we showed that KEVOLVE is a robust machine learning approach to identify discriminative signatures among SARS-CoV-2 variants, which are frequently also biologically relevant, while bypassing multiple sequence alignments. The source code of the method and additional resources are available at: https://github.com/bioinfoUQAM/KEVOLVE.
Affiliation(s)
- Dylan Lebatteux
  - Department of Computer Science, Université du Québec à Montréal, Montréal, Québec, Canada
- Hugo Soudeyns
  - CHU Sainte-Justine Research Centre, Montréal, Québec, Canada
  - Department of Microbiology, Infectious Diseases and Immunology, Faculty of Medicine, Université de Montréal, Montréal, Québec, Canada
  - Department of Pediatrics, Faculty of Medicine, Université du Québec à Montréal, Montréal, Québec, Canada
- Isabelle Boucoiran
  - Department of Obstetrics and Gynecology, Faculty of Medicine, Université de Montréal, Montreal, Quebec, Canada
- Soren Gantt
  - CHU Sainte-Justine Research Centre, Montréal, Québec, Canada
  - Department of Microbiology, Infectious Diseases and Immunology, Faculty of Medicine, Université de Montréal, Montréal, Québec, Canada
7
Van Etten J, Stephens TG, Bhattacharya D. A k-mer-Based Approach for Phylogenetic Classification of Taxa in Environmental Genomic Data. Syst Biol 2023; 72:1101-1118. [PMID: 37314057 DOI: 10.1093/sysbio/syad037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Revised: 03/20/2023] [Accepted: 06/12/2023] [Indexed: 06/15/2023]
Abstract
In the age of genome sequencing, whole-genome data is readily and frequently generated, leading to a wealth of new information that can be used to advance various fields of research. New approaches, such as alignment-free phylogenetic methods that utilize k-mer-based distance scoring, are becoming increasingly popular given their ability to rapidly generate phylogenetic information from whole-genome data. However, these methods have not yet been tested using environmental data, which often tends to be highly fragmented and incomplete. Here, we compare the results of one alignment-free approach (which utilizes the D2 statistic) to traditional multi-gene maximum likelihood trees in 3 algal groups that have high-quality genome data available. In addition, we simulate lower-quality, fragmented genome data using these algae to test method robustness to genome quality and completeness. Finally, we apply the alignment-free approach to environmental metagenome assembled genome data of unclassified Saccharibacteria and Trebouxiophyte algae, and single-cell amplified data from uncultured marine stramenopiles to demonstrate its utility with real datasets. We find that in all instances, the alignment-free method produces phylogenies that are comparable, and often more informative, than those created using the traditional multi-gene approach. The k-mer-based method performs well even when there are significant missing data that include marker genes traditionally used for tree reconstruction. Our results demonstrate the value of alignment-free approaches for classifying novel, often cryptic or rare, species, that may not be culturable or are difficult to access using single-cell methods, but fill important gaps in the tree of life.
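The D2 statistic mentioned in this abstract is, in its basic form, the inner product of the k-mer count vectors of two sequences. The sketch below computes that basic form; turning it into a dissimilarity through a cosine-style normalisation (`d2_dissimilarity`) is an assumption for illustration, since the abstract does not say which D2 variant or distance conversion the authors used.

```python
import math
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(seq_a, seq_b, k):
    """Basic D2: sum over k-mers of count_in_a * count_in_b."""
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(n * cb[kmer] for kmer, n in ca.items())

def d2_dissimilarity(seq_a, seq_b, k):
    """1 minus the cosine similarity of the count vectors (assumed normalisation)."""
    denom = math.sqrt(d2(seq_a, seq_a, k) * d2(seq_b, seq_b, k))
    return 1.0 - d2(seq_a, seq_b, k) / denom
```

For `"ACGTACGT"` and `"ACGTTTTT"` with `k=3`, the shared k-mers `ACG` and `CGT` contribute 2 × 1 each, so `d2` returns 4. Because only k-mer counts are compared, fragmented assemblies can still be scored as long as enough k-mers are recovered.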
Affiliation(s)
- Julia Van Etten
  - Graduate Program in Ecology and Evolution, Rutgers, The State University of New Jersey, 14 College Farm Road, New Brunswick, NJ 08901, USA
- Timothy G Stephens
  - Department of Biochemistry and Microbiology, Rutgers, The State University of New Jersey, 59 Dudley Road, New Brunswick, NJ 08901, USA
- Debashish Bhattacharya
  - Department of Biochemistry and Microbiology, Rutgers, The State University of New Jersey, 59 Dudley Road, New Brunswick, NJ 08901, USA
8
Dey S, Ghosh P, Das S. Positional difference and Frequency (PdF) based alignment-free technique for genome sequence comparison. J Biomol Struct Dyn 2023; 42:12660-12688. [PMID: 37885236 DOI: 10.1080/07391102.2023.2272748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 09/19/2023] [Indexed: 10/28/2023]
Abstract
In the field of computational biology, genome sequence comparison among different species is essential and has applications in both the research and scientific fields. Owing to the lengthy processing time and large number of data sets, the alignment-based approaches are unsuitable and ineffective. Therefore, alignment-free techniques have gained popularity for acquiring proper sequence clustering and evolutionary relationship among species. In this paper, a complete bipartite graph based Positional difference and Frequency (PdF) vector descriptor is introduced. Positional difference and Frequency, two parameters, are applied to the genome sequence to create a 16-dimensional vector descriptor using the di-nucleotide representation of genome sequence. Subsequently, a distance matrix is calculated to construct the phylogenetic trees for different data sets of mammals and viruses. The achieved outcomes are compared with the phylogenetic trees of the earlier methods viz. the FFP method, the ClustalW method, the MEV method, the PCNV method and the FIS method. In most instances, the proposed method produces more precise outcomes than the preceding techniques and has potential for genome sequence comparison on both the equal and unequal length of data sets. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Sudeshna Dey
  - Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
- Papri Ghosh
  - Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
- Subhram Das
  - Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
9
Liu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Brief Bioinform 2023; 24:bbad320. [PMID: 37668049 DOI: 10.1093/bib/bbad320] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 05/16/2023] [Revised: 08/16/2023] [Accepted: 08/18/2023] [Indexed: 09/06/2023]
Abstract
The Sequence Alignment/Map (SAM) format file is the text file used to record alignment information. Alignment is the core of sequencing analysis, and downstream tasks accept mapping results for further processing. Given the rapid development of the sequencing industry today, a comprehensive understanding of the SAM format and related tools is necessary to meet the challenges of data processing and analysis. This paper is devoted to retrieving knowledge in the broad field of SAM. First, the format of SAM is introduced to understand the overall process of the sequencing analysis. Then, existing work is systematically classified in accordance with generation, compression and application, and the involved SAM tools are specifically mined. Lastly, a summary and some thoughts on future directions are provided.
Affiliation(s)
- Yuansheng Liu
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Xiangzhen Shen
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Yongshun Gong
  - School of Software, Shandong University, 250100, Jinan, China
- Yiping Liu
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Bosheng Song
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Xiangxiang Zeng
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
10
Frith MC, Shaw J, Spouge JL. How to optimally sample a sequence for rapid analysis. Bioinformatics 2023; 39:btad057. [PMID: 36702468 PMCID: PMC9907223 DOI: 10.1093/bioinformatics/btad057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2022] [Accepted: 01/24/2023] [Indexed: 01/28/2023]
Abstract
MOTIVATION We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. RESULTS We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. AVAILABILITY AND IMPLEMENTATION Source code is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
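For context, minimizers are one of the earlier sampling schemes that this abstract contrasts with its own approach. The sketch below shows the classic window minimizer, not the authors' method: in every window of `w` consecutive k-mers, the smallest k-mer is sampled. Using lexicographic order and leftmost tie-breaking is a simplifying assumption; practical tools usually order k-mers by a hash.

```python
def minimizers(seq, k, w):
    """Return sorted start positions of window minimizers.

    Each window covers w consecutive k-mers; the lexicographically
    smallest k-mer in the window is sampled (leftmost on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda j: (window[j], j))
        picked.add(start + best)
    return sorted(picked)
```

On `"ACGTACGT"` with `k=3` and `w=3`, the sampled positions are 0, 1 and 4. By construction every window contains at least one sampled k-mer, which is the guarantee that universal hitting sets formalise.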
Affiliation(s)
- Martin C Frith
  - Artificial Intelligence Research Center, AIST, Tokyo 135-0064, Japan
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba 277-8568, Japan
  - Computational Bio Big-Data Open Innovation Laboratory, AIST, Tokyo 169-8555, Japan
- Jim Shaw
  - Department of Mathematics, University of Toronto, Toronto, ON M5S 2E4, Canada
- John L Spouge
  - National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
11
Dey S, Das S, Bhattacharya DK. Biochemical Property Based Positional Matrix: A New Approach Towards Genome Sequence Comparison. J Mol Evol 2023; 91:93-131. [PMID: 36587178 PMCID: PMC9805373 DOI: 10.1007/s00239-022-10082-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2022] [Accepted: 12/01/2022] [Indexed: 01/01/2023]
Abstract
The growth of the genome sequence has become one of the emerging areas in the study of bioinformatics. It has led to an excessive demand for researchers to develop advanced methodologies for evolutionary relationships among species. The alignment-free methods have been proved to be more efficient and appropriate related to time and space than existing alignment-based methods for sequence analysis. In this study, a new alignment-free genome sequence comparison technique is proposed based on the biochemical properties of nucleotides. Each genome sequence can be distributed in four parameters to represent a 21-dimensional numerical descriptor using the Positional Matrix. To substantiate the proposed method, phylogenetic trees are constructed on the viral and mammalian datasets by applying the UPGMA/NJ clustering method. Further, the results of this method are compared with the results of the Feature Frequency Profiles method, the Positional Correlation Natural Vector method, the Graph-theoretic method, the Multiple Encoding Vector method, and the Fuzzy Integral Similarity method. In most cases, it is found that the present method produces more accurate results than the prior methods. Also, in the present method, the execution time for computation is comparatively small.
Affiliation(s)
- Sudeshna Dey
  - Computer Science and Engineering, Narula Institute of Technology, Kolkata, 700109 India
- Subhram Das
  - Computer Science and Engineering, Narula Institute of Technology, Kolkata, 700109 India
- D. K. Bhattacharya
  - Pure Mathematics, Calcutta University, Kolkata, 700019 India
12
Bohnsack KS, Kaden M, Abel J, Villmann T. Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:119-135. [PMID: 34990369 DOI: 10.1109/tcbb.2022.3140873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
The encounter of large amounts of biological sequence data generated during the last decades and the algorithmic and hardware improvements have offered the possibility to apply machine learning techniques in bioinformatics. While the machine learning community is aware of the necessity to rigorously distinguish data transformation from data comparison and adopt reasonable combinations thereof, this awareness is often lacking in the field of comparative sequence analysis. With realization of the disadvantages of alignments for sequence comparison, some typical applications use more and more so-called alignment-free approaches. In light of this development, we present a conceptual framework for alignment-free sequence comparison, which highlights the delineation of: 1) the sequence data transformation comprising adequate mathematical sequence coding and feature generation, from 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. We consider coding to be an information-loss free data transformation in order to get an appropriate representation, whereas feature generation is inevitably information-lossy with the intention to extract just the task-relevant information. This distinction sheds light on the plethora of methods available and assists in identifying suitable methods in machine learning and data analysis to compare the sequences under these premises.
13
King KM, Rajadhyaksha EV, Tobey IG, Van Doorslaer K. Synonymous nucleotide changes drive papillomavirus evolution. Tumour Virus Res 2022; 14:200248. [PMID: 36265836 PMCID: PMC9589209 DOI: 10.1016/j.tvr.2022.200248] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/29/2022] [Revised: 10/11/2022] [Accepted: 10/12/2022] [Indexed: 11/06/2022]
Abstract
Papillomaviruses have been evolving alongside their hosts for at least 450 million years. This review will discuss some of the insights gained into the evolution of this diverse family of viruses. Papillomavirus evolution is constrained by pervasive purifying selection to maximize viral fitness. Yet these viruses need to adapt to changes in their environment, e.g., the host immune system. It has long been known that these viruses evolved a codon usage that doesn't match the infected host. Here we discuss how papillomavirus genomes evolve by acquiring synonymous changes that allow the virus to avoid detection by the host innate immune system without changing the encoded proteins and associated fitness loss. We discuss the implications of studying viral evolution, lifecycle, and cancer progression.
Affiliation(s)
- Kelly M King
  - School of Animal and Comparative Biomedical Sciences, University of Arizona, Tucson, AZ, USA
- Esha Vikram Rajadhyaksha
  - School of Animal and Comparative Biomedical Sciences, University of Arizona, Tucson, AZ, USA
  - Department of Physiology and Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA
- Isabelle G Tobey
  - Cancer Biology Graduate Interdisciplinary Program, University of Arizona, Tucson, AZ, USA
- Koenraad Van Doorslaer
  - School of Animal and Comparative Biomedical Sciences, University of Arizona, Tucson, AZ, USA
  - Cancer Biology Graduate Interdisciplinary Program, University of Arizona, Tucson, AZ, USA
  - The BIO5 Institute, The Department of Immunobiology, Genetics Graduate Interdisciplinary Program, UA Cancer Center, University of Arizona Tucson, Arizona, USA
14
Kaur S, Payne M, Luo L, Octavia S, Tanaka MM, Sintchenko V, Lan R. MGTdb: a web service and database for studying the global and local genomic epidemiology of bacterial pathogens. Database 2022; 2022:6823527. [PMID: 36367311 PMCID: PMC9650772 DOI: 10.1093/database/baac094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2022] [Revised: 09/30/2022] [Accepted: 10/17/2022] [Indexed: 11/13/2022]
Abstract
Multilevel genome typing (MGT) enables the genomic characterization of bacterial isolates and the relationships among them. The MGT system describes an isolate using multiple multilocus sequence typing (MLST) schemes, referred to as levels. Thus, for a new isolate, sequence types (STs) assigned at multiple precisely defined levels can be used to type isolates at multiple resolutions. The MGT designation for isolates is stable, and the assignment is faster than the existing approaches. MGT’s utility has been demonstrated in multiple species. This paper presents a publicly accessible web service called MGTdb, which enables the assignment of MGT STs to isolates, along with their storage, retrieval and analysis. The MGTdb web service enables upload of genome data as sequence reads or alleles, which are processed and assigned MGT identifiers. Additionally, any newly sequenced isolates deposited in the National Center for Biotechnology Information’s Sequence Read Archive are also regularly retrieved (currently daily), processed, assigned MGT identifiers and made publicly available in MGTdb. Interactive visualization tools are presented to assist analysis, along with capabilities to download publicly available isolates and assignments for use with external software. MGTdb is currently available for Salmonella enterica serovars Typhimurium and Enteritidis and Vibrio cholerae. We demonstrate the usability of MGTdb through three case studies — to study the long-term national surveillance of S. Typhimurium, the local epidemiology and outbreaks of S. Typhimurium, and the global epidemiology of V. cholerae. Thus, MGTdb enables epidemiological and microbiological investigations at multiple levels of resolution for all publicly available isolates of these pathogens. Database URL: https://mgtdb.unsw.edu.au
Affiliation(s)
- Sandeep Kaur
- School of Biotechnology and Biomolecular Sciences, University of New South Wales , New South Wales 2052, Australia
- School of Computer Science and Engineering, University of New South Wales , New South Wales 2052, Australia
| | - Michael Payne
- School of Biotechnology and Biomolecular Sciences, University of New South Wales , New South Wales 2052, Australia
| | - Lijuan Luo
- School of Biotechnology and Biomolecular Sciences, University of New South Wales , New South Wales 2052, Australia
| | - Sophie Octavia
- School of Biotechnology and Biomolecular Sciences, University of New South Wales , New South Wales 2052, Australia
| | - Mark M Tanaka
- School of Biotechnology and Biomolecular Sciences, University of New South Wales , New South Wales 2052, Australia
| | - Vitali Sintchenko
- Centre for Infectious Diseases and Microbiology—Public Health, Institute of Clinical Pathology and Medical Research—NSW Health Pathology, Westmead Hospital , New South Wales 2145, Australia
- Marie Bashir Institute for Infectious Diseases and Biosecurity, Sydney Medical School, University of Sydney , New South Wales 2006, Australia
| | - Ruiting Lan
- School of Biotechnology and Biomolecular Sciences, University of New South Wales , New South Wales 2052, Australia
| |
|
15
|
Pal J, Ghosh S, Maji B, Bhattacharya DK. Mathematical Approach to Protein Sequence Comparison Based on Physiochemical Properties. ACS OMEGA 2022; 7:39446-39455. [PMID: 36340165 PMCID: PMC9631895 DOI: 10.1021/acsomega.2c06103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Accepted: 09/27/2022] [Indexed: 06/16/2023]
Abstract
The difficult part of developing new protein sequence comparison techniques is finding a method that can handle huge data sets of sequences of various lengths quickly and effectively. In this work, we first obtain two numerical representations of protein sequences separately, one based on a physical property and one on a chemical property of amino acids. The lengths of all the sequences under comparison are made equal by appending the required number of zeroes. Then, fast Fourier transform is applied to this numerical time series to obtain the corresponding spectrum. Next, the spectrum values are reduced by the standard inter coefficient difference method. Finally, the corresponding normalized values of the reduced spectrum are selected as the descriptors for protein sequence comparison. Using these descriptors, the distance matrices are obtained using Euclidean distance. They are subsequently used to draw the phylogenetic trees using the UPGMA algorithm. Phylogenetic trees are first constructed for 9 ND4, 9 ND5, and 9 ND6 proteins using the polarity value as the chemical property and the molecular weight as the physical property. They are compared, and it is seen that polarity is a better choice than molecular weight in protein sequence comparison. Next, using the polarity property, phylogenetic trees are obtained for 12 baculovirus and 24 transferrin proteins. The results are compared with those obtained earlier on the identical sequences by other methods. Three assessment criteria are considered for comparison of the results: quality based on rationalized perception, quantitative measures based on symmetric distance, and computational speed. In all the cases, the results are found to be more satisfactory.
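The descriptor pipeline in this abstract (numeric encoding, zero-padding, Fourier spectrum, normalization, Euclidean distance) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `TOY_PROPERTY` holds made-up placeholder numbers rather than real polarity or molecular-weight values, a naive discrete Fourier transform stands in for the FFT, and the inter coefficient difference reduction and the UPGMA tree-building step are omitted.

```python
import cmath
import math

# Placeholder property values for illustration only; NOT real polarity data.
TOY_PROPERTY = {"A": 1.0, "C": 2.0, "D": 3.0, "E": 4.0}


def encode(seq, length):
    """Map residues to numbers and zero-pad to a common length."""
    values = [TOY_PROPERTY[r] for r in seq]
    return values + [0.0] * (length - len(values))


def power_spectrum(signal):
    """Naive O(n^2) discrete Fourier transform; returns |X_k|^2 per frequency."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def normalize(vec):
    """Scale a vector so that its entries sum to 1."""
    total = sum(vec)
    return [v / total for v in vec] if total else vec


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(seqs):
    """Pairwise Euclidean distances between normalized spectra."""
    length = max(len(s) for s in seqs)
    descs = [normalize(power_spectrum(encode(s, length))) for s in seqs]
    return [[euclidean(a, b) for b in descs] for a in descs]


matrix = distance_matrix(["ACDE", "ACD", "EEDC"])
```

The resulting matrix is symmetric with a zero diagonal, which is the form a distance-based tree method such as UPGMA expects as input.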
Affiliation(s)
- Jayanta Pal
- Department of ECE, National Institute of Technology, Durgapur 713209, India
- Department of CSE, Narula Institute of Technology, Kolkata 700109, India
| | - Soumen Ghosh
- Department of IT, Narula Institute of Technology, Kolkata 700109, India
| | - Bansibadan Maji
- Department of ECE, National Institute of Technology, Durgapur 713209, India
| | | |
|
16
|
Ma Z, Lu YY, Wang Y, Lin R, Yang Z, Zhang F, Wang Y. Metric learning for comparing genomic data with triplet network. Brief Bioinform 2022; 23:6679451. [DOI: 10.1093/bib/bbac345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2022] [Revised: 07/20/2022] [Accepted: 07/26/2022] [Indexed: 11/13/2022] Open
Abstract
Many biological applications are essentially pairwise comparison problems, such as inferring evolutionary relationships among genomic sequences, binning contigs in metagenomic data, and identifying cell types from single-cell gene expression profiles. Pairwise comparison requires a suitable dissimilarity metric. However, no single metric suits every biological application, so the metric must be learned from data in a way that adapts to the application of interest. Therefore, in this study, we proposed MEtric Learning with Triplet network (MELT), which learns a nonlinear mapping from the original space to an embedding space in order to keep similar data close together and dissimilar data far apart. MELT is a weakly supervised and data-driven comparison framework that offers more adaptive and accurate dissimilarity learned in the absence of the label information when the supervised methods are not applicable. We applied MELT in three typical applications of genomic data comparison, including hierarchical genomic sequences, longitudinal microbiome samples and longitudinal single-cell gene expression profiles, which have no distinctive grouping information. In the experiments, MELT demonstrated its empirical utility in comparison to many widely used dissimilarity metrics. MELT is also expected to accommodate a more extensive set of applications in large-scale genomic comparisons. MELT is available at https://github.com/Ying-Lab/MELT.
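The goal of keeping similar data close and dissimilar data far apart in the embedding space is commonly expressed as a hinge-form triplet loss. The sketch below shows that standard loss on already-embedded points. It is an assumption that MELT uses a loss of this general shape; the abstract does not give the exact loss, margin, or network architecture, and the function names here are illustrative.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge triplet loss on embedded points.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(0.0, gap)


# Positive is near the anchor, negative is far: constraint satisfied.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])

# Roles swapped: the "positive" is farther than the "negative".
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

In training, this quantity would be averaged over many triplets and minimized with respect to the parameters of the embedding network.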
Affiliation(s)
- Zhi Ma
- Department of Automation, Xiamen University , China
- National Institute for Data Science in Health and Medicine, Xiamen University
| | - Yang Young Lu
- Cheriton School of Computer Science, University of Waterloo , Waterloo, Ontario , Canada
| | - Yiwen Wang
- Department of Automation, Xiamen University , China
| | - Renhao Lin
- Department of Automation, Xiamen University , China
| | - Zizi Yang
- Department of Automation, Xiamen University , China
| | - Fang Zhang
- Cheriton School of Computer Science, University of Waterloo , Waterloo, Ontario , Canada
| | - Ying Wang
- Department of Automation, Xiamen University , China
- National Institute for Data Science in Health and Medicine, Xiamen University
- Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision , Xiamen, Fujian 361005 , China
- Fujian Key Laboratory of Genetics and Breeding of Marine Organisms , Xiamen, 361100 , China
| |
|
17
|
Birth N, Dencker T, Morgenstern B. Insertions and deletions as phylogenetic signal in an alignment-free context. PLoS Comput Biol 2022; 18:e1010303. [PMID: 35939516 PMCID: PMC9387925 DOI: 10.1371/journal.pcbi.1010303] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/18/2021] [Revised: 08/18/2022] [Accepted: 06/14/2022] [Indexed: 11/18/2022] Open
Abstract
Most methods for phylogenetic tree reconstruction are based on sequence alignments; they infer phylogenies from substitutions that may have occurred at the aligned sequence positions. Gaps in alignments are usually not employed as phylogenetic signal. In this paper, we explore an alignment-free approach that uses insertions and deletions (indels) as an additional source of information for phylogeny inference. For a set of four or more input sequences, we generate so-called quartet blocks of four putative homologous segments each. For pairs of such quartet blocks involving the same four sequences, we compare the distances between the two blocks in these sequences, to obtain hints about indels that may have happened between the blocks since the respective four sequences have evolved from their last common ancestor. A prototype implementation that we call Gap-SpaM is presented to infer phylogenetic trees from these data, using a quartet-tree approach or, alternatively, under the maximum-parsimony paradigm. This approach should not be regarded as an alternative to established methods, but rather as a complementary source of phylogenetic information. Interestingly, however, our software is able to produce phylogenetic trees from putative indels alone that are comparable to trees obtained with existing alignment-free methods.
Phylogenetic tree inference based on DNA or protein sequence comparison is a fundamental task in computational biology. Given a multiple alignment of a set of input sequences, most approaches compare aligned sequence positions to each other, to find a suitable tree, based on a model of molecular evolution. Insertions and deletions that may have happened since the input sequences evolved from their last common ancestor are ignored by most phylogeny methods.
Herein, we show that insertions and deletions can provide an additional source of information for phylogeny inference, and that such information can be obtained with a simple alignment-free approach. We provide an implementation of this idea that we call Gap-SpaM. The proposed approach is complementary to existing phylogeny methods since it is based on a completely different source of information. It is, thus, not meant to be an alternative to those existing methods but rather as a possible additional source of information for tree inference.
Affiliation(s)
- Niklas Birth
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany
| | - Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Göttingen, Germany
- Campus-Institute Data Science (CIDAS), Göttingen, Germany
| |
|
18
|
Swain MT, Vickers M. Interpreting alignment-free sequence comparison: what makes a score a good score? NAR Genom Bioinform 2022; 4:lqac062. [PMID: 36071721 PMCID: PMC9442500 DOI: 10.1093/nargab/lqac062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Revised: 07/01/2022] [Accepted: 08/16/2022] [Indexed: 11/13/2022] Open
Abstract
Alignment-free methods are alternatives to alignment-based methods when searching sequence data sets. The output from an alignment-free sequence comparison is a similarity score, the interpretation of which is not straightforward. We propose objective functions to interpret and calibrate outputs from alignment-free searches, noting that different objective functions are necessary for different biological contexts. This leads to advantages: visualising and comparing score distributions, including those from true positives, may be a relatively simple method to gain insight into the performance of different metrics. Using an empirical approach with both DNA and protein sequences, we characterise different similarity score distributions generated under different parameters. In particular, we demonstrate how sequence length can affect the scores. We show that scores of true positive sequence pairs may correlate significantly with their mean length; and even if the correlation is weak, the relative difference in length of the sequence pair may significantly reduce the effectiveness of alignment-free metrics. Importantly, we show how objective functions can be used with test data to accurately estimate the probability of true positives. This can significantly increase the utility of alignment-free approaches. Finally, we have developed a general-purpose software tool called KAST for use in high-throughput workflows on Linux clusters.
Affiliation(s)
- Martin T Swain
- Department of Life Sciences, Aberystwyth University , Penglais, Aberystwyth, Ceredigion, SY23 3DA, UK
| | - Martin Vickers
- The John Innes Centre, Norwich Research Park , Norwich NR4 7UH, UK
| |
|
19
|
Lo R, Dougan KE, Chen Y, Shah S, Bhattacharya D, Chan CX. Alignment-Free Analysis of Whole-Genome Sequences From Symbiodiniaceae Reveals Different Phylogenetic Signals in Distinct Regions. FRONTIERS IN PLANT SCIENCE 2022; 13:815714. [PMID: 35557718 PMCID: PMC9087856 DOI: 10.3389/fpls.2022.815714] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/15/2021] [Accepted: 04/04/2022] [Indexed: 05/24/2023]
Abstract
Dinoflagellates of the family Symbiodiniaceae are predominantly essential symbionts of corals and other marine organisms. Recent research reveals extensive genome sequence divergence among Symbiodiniaceae taxa and high phylogenetic diversity hidden behind subtly different cell morphologies. Using an alignment-free phylogenetic approach based on sub-sequences of fixed length k (i.e. k-mers), we assessed the phylogenetic signal among whole-genome sequences from 16 Symbiodiniaceae taxa (including the genera of Symbiodinium, Breviolum, Cladocopium, Durusdinium and Fugacium) and two strains of Polarella glacialis as outgroup. Based on phylogenetic trees inferred from k-mers in distinct genomic regions (i.e. repeat-masked genome sequences, protein-coding sequences, introns and repeats) and in protein sequences, the phylogenetic signal associated with protein-coding DNA and the encoded amino acids is largely consistent with the Symbiodiniaceae phylogeny based on established markers, such as large subunit rRNA. The other genome sequences (introns and repeats) exhibit distinct phylogenetic signals, supporting the expected differential evolutionary pressure acting on these regions. Our analysis of conserved core k-mers revealed the prevalence of conserved k-mers (>95% core 23-mers among all 18 genomes) in annotated repeats and non-genic regions of the genomes. We observed 180 distinct repeat types that are significantly enriched in genomes of the symbiotic versus free-living Symbiodinium taxa, suggesting an enhanced activity of transposable elements linked to the symbiotic lifestyle. We provide evidence that representation of alignment-free phylogenies as dynamic networks enhances the ability to generate new hypotheses about genome evolution in Symbiodiniaceae. 
These results demonstrate the potential of alignment-free phylogenetic methods as a scalable approach for inferring comprehensive, unbiased whole-genome phylogenies of dinoflagellates and more broadly of microbial eukaryotes.
Affiliation(s)
- Rosalyn Lo
- Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
| | - Katherine E. Dougan
- Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
| | - Yibi Chen
- Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
| | - Sarah Shah
- Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
| | - Debashish Bhattacharya
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, United States
| | - Cheong Xin Chan
- Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
| |
|
20
|
Comparative analysis of alignment-free genome clustering and whole genome alignment-based phylogenomic relationship of coronaviruses. PLoS One 2022; 17:e0264640. [PMID: 35259178 PMCID: PMC8903263 DOI: 10.1371/journal.pone.0264640] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/23/2021] [Accepted: 02/14/2022] [Indexed: 12/12/2022] Open
Abstract
SARS-CoV-2 is the third coronavirus, after SARS-CoV and MERS-CoV, that causes severe respiratory syndrome in humans. All three are of zoonotic origin and likely crossed the species barrier between animals and humans. The origin and evolution of viruses and their phylogenetic relationships are of great importance for the study of their pathogenicity and the development of antiviral drugs and vaccines. The main objective of the presented study was to compare two methods for identifying relationships between coronavirus genomes, using 69 coronavirus genomes selected from two public databases: a phylogenetic method based on whole-genome alignment followed by molecular phylogenetic tree inference, and alignment-free clustering of triplet frequencies. Both approaches resulted in well-resolved robust classifications. In general, but not always, the clusters identified by the first approach were in good agreement with the classes identified by the second using K-means and the elastic map method; the exceptions still need to be explained. Both approaches also demonstrated a significant divergence of genomes on a taxonomic level, but there was less correspondence between genomes regarding the types of diseases they caused, which may be due to the individual characteristics of the host. This research showed that alignment-free methods are efficient in combination with alignment-based methods. They have a significant advantage in computational complexity and provide valuable additional information on the relationships between genomes.
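The alignment-free side of this comparison starts from a triplet-frequency vector per genome. A minimal sketch of that feature extraction is shown below. It assumes overlapping triplets read with a sliding window of step 1 and skips windows that contain ambiguous bases; the abstract does not say whether the study counted overlapping or codon-style non-overlapping triplets, and the clustering step (K-means, elastic maps) is not shown.

```python
from itertools import product

# All 64 nucleotide triplets in a fixed order, so every genome maps to
# a vector with the same coordinates.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]


def triplet_frequencies(genome):
    """Return the 64-dimensional relative frequency vector of triplets."""
    counts = dict.fromkeys(TRIPLETS, 0)
    for i in range(len(genome) - 2):
        word = genome[i:i + 3]
        if word in counts:  # ignore windows with N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in TRIPLETS]


vec = triplet_frequencies("ACGTACGTNN")
```

Because every genome maps to a fixed-length vector regardless of its length, any standard clustering method can then be applied to the set of vectors.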
|
21
|
Chappell T, Geva S, Hogan JM, Lovell D, Trotman A, Perrin D. Metagenomic Geolocation Using Read Signatures. Front Genet 2022; 13:643592. [PMID: 35295949 PMCID: PMC8918732 DOI: 10.3389/fgene.2022.643592] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2020] [Accepted: 01/21/2022] [Indexed: 12/02/2022] Open
Abstract
We present a novel approach to the Metagenomic Geolocation Challenge based on random projection of the sample reads from each location. This approach explores the direct use of k-mer composition to characterise samples so that we can avoid the computationally demanding step of aligning reads to available microbial reference sequences. Each variable-length read is converted into a fixed-length, k-mer-based read signature. Read signatures are then clustered into location signatures which provide a more compact characterisation of the reads at each location. Classification is then treated as a problem in ranked retrieval of locations, where signature similarity is used as a measure of similarity in microbial composition. We evaluate our approach using the CAMDA 2020 Challenge dataset and obtain promising results based on nearest neighbour classification. The main findings of this study are that k-mer representations carry sufficient information to reveal the origin of many of the CAMDA 2020 Challenge metagenomic samples, and that this reference-free approach can be achieved with much less computation than methods that need reads to be assigned to operational taxonomic units—advantages which become clear through comparison to previously published work on the CAMDA 2019 Challenge data.
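One common way to turn a variable-length read into a fixed-length, k-mer-based signature by random projection is sketched below: each k-mer is hashed to a pseudo-random +1/-1 vector, the vectors are summed, and the sign of each coordinate gives one signature bit. This is an illustration under assumptions, not the authors' code; the values of `k` and `width`, the use of SHA-256 as the source of pseudo-randomness, and all function names are choices made here.

```python
import hashlib


def kmer_vector(kmer, width):
    """Deterministic pseudo-random +1/-1 vector derived from a hash of the k-mer."""
    bits = []
    counter = 0
    while len(bits) < width:
        digest = hashlib.sha256(f"{kmer}:{counter}".encode()).digest()
        for byte in digest:
            for shift in range(8):
                bits.append(1 if (byte >> shift) & 1 else -1)
        counter += 1
    return bits[:width]


def read_signature(read, k=5, width=64):
    """Sum the k-mer vectors of a read and keep only the sign of each coordinate."""
    totals = [0] * width
    for i in range(len(read) - k + 1):
        for j, value in enumerate(kmer_vector(read[i:i + k], width)):
            totals[j] += value
    return [1 if t > 0 else 0 for t in totals]


def hamming(a, b):
    """Number of differing bits; smaller means more similar k-mer content."""
    return sum(x != y for x, y in zip(a, b))


sig_a = read_signature("ACGTACGTAGCTAGCTAGGA")
sig_b = read_signature("ACGTACGTAGCTAGCTAGGA")
sig_short = read_signature("ACGTACG")
```

Reads of any length yield signatures of the same width, so they can be compared or clustered with simple bit operations and no reference alignment.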
Affiliation(s)
- Timothy Chappell
- School of Computer Science, Faculty of Science, Queensland University of Technology, Brisbane, QLD, Australia
- Centre for Data Science, Queensland University of Technology, Brisbane, QLD, Australia
| | - Shlomo Geva
- School of Computer Science, Faculty of Science, Queensland University of Technology, Brisbane, QLD, Australia
| | - James M. Hogan
- School of Computer Science, Faculty of Science, Queensland University of Technology, Brisbane, QLD, Australia
- Centre for Data Science, Queensland University of Technology, Brisbane, QLD, Australia
| | - David Lovell
- School of Computer Science, Faculty of Science, Queensland University of Technology, Brisbane, QLD, Australia
- Centre for Data Science, Queensland University of Technology, Brisbane, QLD, Australia
| | - Andrew Trotman
- Department of Computer Science, University of Otago, Dunedin, New Zealand
| | - Dimitri Perrin
- School of Computer Science, Faculty of Science, Queensland University of Technology, Brisbane, QLD, Australia
- Centre for Data Science, Queensland University of Technology, Brisbane, QLD, Australia
- *Correspondence: Dimitri Perrin
| |
|
22
|
Dougan KE, González-Pech RA, Stephens TG, Shah S, Chen Y, Ragan MA, Bhattacharya D, Chan CX. Genome-powered classification of microbial eukaryotes: focus on coral algal symbionts. Trends Microbiol 2022; 30:831-840. [DOI: 10.1016/j.tim.2022.02.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/10/2021] [Revised: 01/20/2022] [Accepted: 02/01/2022] [Indexed: 12/20/2022]
|
23
|
He L, Sun S, Zhang Q, Bao X, Li PK. Alignment-free sequence comparison for virus genomes based on location correlation coefficient. INFECTION, GENETICS AND EVOLUTION : JOURNAL OF MOLECULAR EPIDEMIOLOGY AND EVOLUTIONARY GENETICS IN INFECTIOUS DISEASES 2021; 96:105106. [PMID: 34626822 PMCID: PMC8493760 DOI: 10.1016/j.meegid.2021.105106] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/20/2021] [Revised: 09/08/2021] [Accepted: 10/03/2021] [Indexed: 12/18/2022]
Abstract
Coronaviruses (especially SARS-CoV-2) are characterized by rapid mutation and wide spread. As these characteristics easily lead to global pandemics, studying the evolutionary relationship between viruses is essential for clinical diagnosis. DNA sequencing has played an important role in evolutionary analysis. Recent alignment-free methods can overcome the problems of traditional alignment-based methods, which consume both time and space. This paper proposes a novel alignment-free method called the correlation coefficient feature vector (CCFV), which defines a correlation measure of the L-step delay of a nucleotide location from its location in the original DNA sequence. The numerical feature is a 16×L-dimensional numerical vector describing the distribution characteristics of the nucleotide positions in a DNA sequence. The proposed L-step delay correlation measure is interestingly related to some types of L+1 spaced mers. Unlike traditional gene comparison, our method avoids the computational complexity of multiple sequence alignment, and hence improves the speed of sequence comparison. Our method is applied to evolutionary analysis of the common human viruses including SARS-CoV-2, Dengue virus, Hepatitis B virus, and human rhinovirus and achieves the same or even better results than alignment-based methods. Especially for SARS-CoV-2, our method also confirms that bats are potential intermediate hosts of SARS-CoV-2.
Affiliation(s)
- Lily He
- School of Science, Beijing University of Civil Engineering and Architecture, Beijing 102616, PR China.
| | - Siyang Sun
- The High School Affiliated to Renmin University of China, Beijing 100080, PR China
| | - Qianyue Zhang
- The High School Affiliated to Renmin University of China, Beijing 100080, PR China
| | - Xiaona Bao
- School of Science, Beijing University of Civil Engineering and Architecture, Beijing 102616, PR China
| | - Peter K Li
- School of Life Sciences, Tsinghua University, Beijing 100084, PR China.
| |
|
24
|
Bussi Y, Kapon R, Reich Z. Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. PLoS One 2021; 16:e0258693. [PMID: 34648558 PMCID: PMC8516232 DOI: 10.1371/journal.pone.0258693] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 04/30/2021] [Accepted: 10/02/2021] [Indexed: 12/24/2022] Open
Abstract
Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
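The pairwise k-mer Jaccard similarity used for the clustering above is the size of the intersection of two k-mer sets divided by the size of their union. The sketch below is a simplified illustration: it uses plain k-mers on one strand only, whereas genome-scale tools typically also merge each k-mer with its reverse complement and use sketching to keep memory manageable. The sequences and `k=3` are toy values.

```python
def kmer_set(seq, k):
    """Set of all distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of the two k-mer sets."""
    set_a, set_b = kmer_set(a, k), kmer_set(b, k)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Toy sequences: they share ACG and CGT out of six distinct 3-mers.
similarity = jaccard("ACGTAC", "ACGTTT", k=3)
```

A distance such as 1 minus this similarity can be fed to hierarchical clustering to obtain a tree over the genomes.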
Affiliation(s)
- Yuval Bussi
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
- Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel
| | - Ruti Kapon
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
| | - Ziv Reich
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
| |
|
25
|
Cao K, Yang X, Li Y, Zhu G, Fang W, Chen C, Wang X, Wu J, Wang L. New high-quality peach (Prunus persica L. Batsch) genome assembly to analyze the molecular evolutionary mechanism of volatile compounds in peach fruits. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2021; 108:281-295. [PMID: 34309935 DOI: 10.1111/tpj.15439] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 12/17/2020] [Revised: 07/16/2021] [Accepted: 07/20/2021] [Indexed: 06/13/2023]
Abstract
Peach (Prunus persica L. Batsch) is an economically important fruit crop worldwide. Although a high-quality peach genome has previously been published, Sanger sequencing was used for its assembly, which generated short contigs. Here, we report a chromosome-level genome assembly and sequence analysis of Chinese Cling, an important founder cultivar for peach breeding programs worldwide. The assembled genome contained 247.33 Mb with a contig N50 of 4.13 Mb and a scaffold N50 of 29.68 Mb, representing 99.8% of the estimated genome. Comparisons between this genome and the recently published one (Lovell peach) uncovered 685 407 single nucleotide polymorphisms, 162 655 insertions and deletions, and 16 248 structural variants. Gene family analysis highlighted the contraction of the gene families involved in flavone, flavonol, flavonoid, and monoterpenoid biosynthesis. Subsequently, the volatile compounds of 256 peach varieties were quantitated in mature fruits in 2015 and 2016 to perform a genome-wide association analysis. A comparison with the identified domestication genomic regions allowed us to identify 25 quantitative trait loci, associated with seven volatile compounds, in the domestication region, which is consistent with the differences in volatile compounds between wild and cultivated peaches. Finally, a gene encoding terpene synthase, located within a previously reported quantitative trait loci region, was identified to be associated with linalool synthesis. Such findings highlight the importance of this new assembly for the analysis of evolutionary mechanisms and gene identification in peach species. Furthermore, this high-quality peach genome provides valuable information for future fruit improvement.
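The contig and scaffold N50 values quoted above follow the usual definition: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. A small sketch of that calculation is given below, using made-up toy lengths rather than data from the peach assembly.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Toy contig lengths (arbitrary units); total is 54, half is 27.
toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Here the three longest contigs (10, 9, 8) sum to 27, so the N50 is 8. A higher N50 for the same total length indicates a more contiguous assembly.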
Affiliation(s)
- Ke Cao
- The Key Laboratory of Biology and Genetic Improvement of Horticultural Crops (Fruit Tree Breeding Technology), Ministry of Agriculture, Zhengzhou Fruit Research Institute, Chinese Academy of Agricultural Sciences, Hanghaidong, Guancheng district, Zhengzhou, Henan, 450009, China
| | - Xuanwen Yang
- The Key Laboratory of Biology and Genetic Improvement of Horticultural Crops (Fruit Tree Breeding Technology), Ministry of Agriculture, Zhengzhou Fruit Research Institute, Chinese Academy of Agricultural Sciences, Hanghaidong, Guancheng district, Zhengzhou, Henan, 450009, China
| | - Yong Li
- The Key Laboratory of Biology and Genetic Improvement of Horticultural Crops (Fruit Tree Breeding Technology), Ministry of Agriculture, Zhengzhou Fruit Research Institute, Chinese Academy of Agricultural Sciences, Hanghaidong, Guancheng district, Zhengzhou, Henan, 450009, China
| | - Gengrui Zhu
- The Key Laboratory of Biology and Genetic Improvement of Horticultural Crops (Fruit Tree Breeding Technology), Ministry of Agriculture, Zhengzhou Fruit Research Institute, Chinese Academy of Agricultural Sciences, Hanghaidong, Guancheng district, Zhengzhou, Henan, 450009, China
| | - Weichao Fang
- The Key Laboratory of Biology and Genetic Improvement of Horticultural Crops (Fruit Tree Breeding Technology), Ministry of Agriculture, Zhengzhou Fruit Research Institute, Chinese Academy of Agricultural Sciences, Hanghaidong, Guancheng district, Zhengzhou, Henan, 450009, China
| | - Changwen Chen
- The Key Laboratory of Biology and Genetic Improvement of Horticultural Crops (Fruit Tree Breeding Technology), Ministry of Agriculture, Zhengzhou Fruit Research Institute, Chinese Academy of Agricultural Sciences, Hanghaidong, Guancheng district, Zhengzhou, Henan, 450009, China
| | - Xinwei Wang
- The Key Laboratory of Biology and Genetic Improvement of Horticultural Crops (Fruit Tree Breeding Technology), Ministry of Agriculture, Zhengzhou Fruit Research Institute, Chinese Academy of Agricultural Sciences, Hanghaidong, Guancheng district, Zhengzhou, Henan, 450009, China
| | - Jinlong Wu
- The Key Laboratory of Biology and Genetic Improvement of Horticultural Crops (Fruit Tree Breeding Technology), Ministry of Agriculture, Zhengzhou Fruit Research Institute, Chinese Academy of Agricultural Sciences, Hanghaidong, Guancheng district, Zhengzhou, Henan, 450009, China
| | - Lirong Wang
- The Key Laboratory of Biology and Genetic Improvement of Horticultural Crops (Fruit Tree Breeding Technology), Ministry of Agriculture, Zhengzhou Fruit Research Institute, Chinese Academy of Agricultural Sciences, Hanghaidong, Guancheng district, Zhengzhou, Henan, 450009, China
| |
|
26
|
Pechlivanis N, Togkousidis A, Tsagiopoulou M, Sgardelis S, Kappas I, Psomopoulos F. A Computational Framework for Pattern Detection on Unaligned Sequences: An Application on SARS-CoV-2 Data. Front Genet 2021; 12:618170. [PMID: 34122498 PMCID: PMC8194296 DOI: 10.3389/fgene.2021.618170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2020] [Accepted: 05/04/2021] [Indexed: 11/13/2022] Open
Abstract
The exponential growth of genome sequences available has spurred research on pattern detection with the aim of extracting evolutionary signal. Traditional approaches, such as multiple sequence alignment, rely on positional homology in order to reconstruct the phylogenetic history of taxa. Yet, mining information from the plethora of biological data and delineating species on a genetic basis still proves to be an extremely difficult problem. Multiple algorithms and techniques have been developed in order to approach the problem multidimensionally. Here, we propose a computational framework for identifying potentially meaningful features based on k-mers retrieved from unaligned sequence data. Specifically, we have developed a process which makes use of unsupervised learning techniques in order to identify characteristic k-mers of the input dataset across a range of different k-values and within a reasonable time frame. We use these k-mers as features for clustering the input sequences and identifying differences between the distributions of k-mers across the dataset. The developed algorithm is part of an innovative and highly promising approach both to the problem of grouping sequence data based on their inherent characteristic features and to the study of changes in the distributions of k-mers as the k-value varies within a range of values. Our framework is fully developed in Python as open source software licensed under the MIT License, and is freely available at https://github.com/BiodataAnalysisGroup/kmerAnalyzer.
Affiliation(s)
- Nikolaos Pechlivanis
  - Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece
  - Department of Genetics, Development and Molecular Biology, School of Biology, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Anastasios Togkousidis
  - Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece
- Maria Tsagiopoulou
  - Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece
- Stefanos Sgardelis
  - Department of Ecology, School of Biology, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Ilias Kappas
  - Department of Genetics, Development and Molecular Biology, School of Biology, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Fotis Psomopoulos
  - Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece

27
Jacobus AP, Stephens TG, Youssef P, González-Pech R, Ciccotosto-Camp MM, Dougan KE, Chen Y, Basso LC, Frazzon J, Chan CX, Gross J. Comparative Genomics Supports That Brazilian Bioethanol Saccharomyces cerevisiae Comprise a Unified Group of Domesticated Strains Related to Cachaça Spirit Yeasts. Front Microbiol 2021; 12:644089. [PMID: 33936002 PMCID: PMC8082247 DOI: 10.3389/fmicb.2021.644089] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/19/2020] [Accepted: 03/08/2021] [Indexed: 01/05/2023] Open
Abstract
Ethanol production from sugarcane is a key renewable fuel industry in Brazil. Major drivers of this alcoholic fermentation are Saccharomyces cerevisiae strains that originally were contaminants to the system and yet prevail in the industrial process. Here we present newly sequenced genomes (using Illumina short-read and PacBio long-read data) of two monosporic isolates (H3 and H4) of the S. cerevisiae PE-2, a predominant bioethanol strain in Brazil. The assembled genomes of H3 and H4, together with 42 draft genomes of sugarcane-fermenting (fuel ethanol plus cachaça) strains, were compared against those of the reference S288C and diverse S. cerevisiae. All genomes of bioethanol yeasts have amplified SNO2(3)/SNZ2(3) gene clusters for vitamin B1/B6 biosynthesis, and display ubiquitous presence of a particular family of SAM-dependent methyl transferases, rare in S. cerevisiae. Widespread amplifications of quinone oxidoreductases YCR102C/YLR460C/YNL134C, and the structural or punctual variations among aquaporins and components of the iron homeostasis system, likely represent adaptations to industrial fermentation. Interesting is the pervasive presence among the bioethanol/cachaça strains of a five-gene cluster (Region B) that is a known phylogenetic signature of European wine yeasts. Combining genomes of H3, H4, and 195 yeast strains, we comprehensively assessed whole-genome phylogeny of these taxa using an alignment-free approach. The 197-genome phylogeny substantiates that bioethanol yeasts are monophyletic and closely related to the cachaça and wine strains. Our results support the hypothesis that biofuel-producing yeasts in Brazil may have been co-opted from a pool of yeasts that were pre-adapted to alcoholic fermentation of sugarcane for the distillation of cachaça spirit, which historically is a much older industry than the large-scale fuel ethanol production.
Affiliation(s)
- Ana Paula Jacobus
  - Laboratory for Genomics and Experimental Evolution of Yeasts, Institute for Bioenergy Research, São Paulo State University, Rio Claro, Brazil
- Timothy G Stephens
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
- Pierre Youssef
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
- Raul González-Pech
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
- Michael M Ciccotosto-Camp
  - Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
- Katherine E Dougan
  - Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
- Yibi Chen
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
  - Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
- Luiz Carlos Basso
  - Biological Science Department, Escola Superior de Agricultura Luiz de Queiroz, University of São Paulo (USP), Piracicaba, Brazil
- Jeverson Frazzon
  - Institute of Food Science and Technology, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil
- Cheong Xin Chan
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
  - Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
- Jeferson Gross
  - Laboratory for Genomics and Experimental Evolution of Yeasts, Institute for Bioenergy Research, São Paulo State University, Rio Claro, Brazil

28
Jonkheer EM, Brankovics B, Houwers IM, van der Wolf JM, Bonants PJM, Vreeburg RAM, Bollema R, de Haan JR, Berke L, Smit S, de Ridder D, van der Lee TAJ. The Pectobacterium pangenome, with a focus on Pectobacterium brasiliense, shows a robust core and extensive exchange of genes from a shared gene pool. BMC Genomics 2021; 22:265. [PMID: 33849459 PMCID: PMC8045196 DOI: 10.1186/s12864-021-07583-5] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 12/21/2020] [Accepted: 03/26/2021] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND Bacterial plant pathogens of the Pectobacterium genus are responsible for a wide spectrum of diseases in plants, including important crops such as potato, tomato, lettuce, and banana. Investigation of the genetic diversity underlying virulence and host specificity can be performed at genome level by using a comprehensive comparative approach called pangenomics. A pangenomic approach, using newly developed functionalities in PanTools, was applied to analyze the complex phylogeny of the Pectobacterium genus. We specifically used the pangenome to investigate genetic differences between virulent and avirulent strains of P. brasiliense, a potato blackleg-causing species predominantly present in Western Europe. RESULTS Here we generated a multilevel pangenome for Pectobacterium, comprising 197 strains across 19 species, including type strains, with a focus on P. brasiliense. The extensive phylogenetic analysis of the Pectobacterium genus showed robust distinct clades, with most detail provided by 452,388 parsimony-informative single-nucleotide polymorphisms identified in single-copy orthologs. The average Pectobacterium genome consists of 47% core genes, 1% unique genes, and 52% accessory genes. Using the pangenome, we zoomed in on differences between virulent and avirulent P. brasiliense strains and identified 86 genes associated with virulent strains. We found that the organization of genes is highly structured and linked with gene conservation, function, and transcriptional orientation. CONCLUSION The pangenome analysis demonstrates that evolution in Pectobacteria is a highly dynamic process, including gene acquisitions partly in clusters, genome rearrangements, and loss of genes. Pectobacterium species are typically not characterized by a set of species-specific genes, but instead present themselves using new gene combinations from the shared gene pool. A multilevel pangenomic approach, fusing DNA, protein, biological function, taxonomic group, and phenotypes, facilitates studies in a flexible taxonomic context.
Affiliation(s)
- Eef M Jonkheer
  - Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708 PB, Wageningen, The Netherlands
  - Biointeractions and Plant Health, Wageningen Plant Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, The Netherlands
- Balázs Brankovics
  - Biointeractions and Plant Health, Wageningen Plant Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, The Netherlands
- Ilse M Houwers
  - Biointeractions and Plant Health, Wageningen Plant Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, The Netherlands
- Jan M van der Wolf
  - Biointeractions and Plant Health, Wageningen Plant Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, The Netherlands
- Peter J M Bonants
  - Biointeractions and Plant Health, Wageningen Plant Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, The Netherlands
- Robert A M Vreeburg
  - Nederlandse Algemene Keuringsdienst voor zaaizaad en pootgoed van landbouwgewassen, Randweg 14, 8304 AS, Emmeloord, The Netherlands
- Robert Bollema
  - Nederlandse Algemene Keuringsdienst voor zaaizaad en pootgoed van landbouwgewassen, Randweg 14, 8304 AS, Emmeloord, The Netherlands
- Jorn R de Haan
  - Genetwister Technologies B.V, Nieuwe Kanaal 7b, 6709 PA, Wageningen, The Netherlands
- Lidija Berke
  - Genetwister Technologies B.V, Nieuwe Kanaal 7b, 6709 PA, Wageningen, The Netherlands
- Sandra Smit
  - Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708 PB, Wageningen, The Netherlands
- Dick de Ridder
  - Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708 PB, Wageningen, The Netherlands
- Theo A J van der Lee
  - Biointeractions and Plant Health, Wageningen Plant Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, The Netherlands

29
Sequence Comparison Without Alignment: The SpaM Approaches. Methods in Molecular Biology (Clifton, N.J.) 2021; 2231:121-134. [PMID: 33289890 DOI: 10.1007/978-1-0716-1036-7_8] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 10/22/2022]
Abstract
Sequence alignment is at the heart of DNA and protein sequence analysis. For the data volumes that are nowadays produced by massively parallel sequencing technologies, however, pairwise and multiple alignment methods are often too slow. Therefore, fast alignment-free approaches to sequence comparison have become popular in recent years. Most of these approaches are based on word frequencies, for words of a fixed length, or on word-matching statistics. Other approaches use the length of maximal word matches. While these methods are very fast, most of them rely on ad hoc measures of sequence similarity or dissimilarity that are hard to interpret. In this chapter, I describe a number of alignment-free methods that we developed in recent years. Our approaches are based on spaced-word matches ("SpaM"), i.e. on inexact word matches that are allowed to contain mismatches at certain pre-defined positions. Unlike most previous alignment-free approaches, our approaches are able to accurately estimate phylogenetic distances between DNA or protein sequences using a stochastic model of molecular evolution.
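To make the idea of a spaced-word match concrete, here is a minimal sketch in Python. It is not the chapter's implementation: the function names, the binary pattern notation ('1' for a match position, '0' for a don't-care position), and the use of the Jukes-Cantor correction as the "stochastic model" are all assumptions chosen for illustration, and no filtering of spurious matches is attempted.

```python
import math

def spaced_words(seq, pattern):
    """Return the spaced word at every window of seq.

    pattern is a string of '1' (match position) and '0' (don't-care position);
    this notation is an assumption made for the sketch.
    """
    care = [i for i, c in enumerate(pattern) if c == "1"]
    width = len(pattern)
    return ["".join(seq[s + i] for i in care)
            for s in range(len(seq) - width + 1)]

def spaced_word_matches(a, b, pattern):
    """Return (i, j) window starts where a and b agree at all match positions."""
    index = {}
    for i, word in enumerate(spaced_words(a, pattern)):
        index.setdefault(word, []).append(i)
    return [(i, j)
            for j, word in enumerate(spaced_words(b, pattern))
            for i in index.get(word, [])]

def mismatch_fraction(a, b, pattern):
    """Fraction of differing bases at don't-care positions over all matches."""
    dont_care = [i for i, c in enumerate(pattern) if c == "0"]
    matches = spaced_word_matches(a, b, pattern)
    compared = len(matches) * len(dont_care)
    if compared == 0:
        return None  # no spaced-word match, nothing to estimate from
    diffs = sum(a[i + k] != b[j + k] for i, j in matches for k in dont_care)
    return diffs / compared

def jukes_cantor(p):
    """Jukes-Cantor distance for mismatch fraction p (assumed model)."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For example, `mismatch_fraction("ACGT", "ATGT", "101")` finds one spaced-word match at the first window and one mismatch at its single don't-care position. On real genomes, many matches arise by chance between unrelated positions, so a practical tool would need some way to discard them before estimating distances.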
30
González-Pech RA, Stephens TG, Chen Y, Mohamed AR, Cheng Y, Shah S, Dougan KE, Fortuin MDA, Lagorce R, Burt DW, Bhattacharya D, Ragan MA, Chan CX. Comparison of 15 dinoflagellate genomes reveals extensive sequence and structural divergence in family Symbiodiniaceae and genus Symbiodinium. BMC Biol 2021; 19:73. [PMID: 33849527 PMCID: PMC8045281 DOI: 10.1186/s12915-021-00994-6] [Citation(s) in RCA: 54] [Impact Index Per Article: 18.0] [Received: 11/24/2020] [Accepted: 02/25/2021] [Indexed: 02/07/2023] Open
Abstract
Background Dinoflagellates in the family Symbiodiniaceae are important photosynthetic symbionts in cnidarians (such as corals) and other coral reef organisms. Breakdown of the coral-dinoflagellate symbiosis due to environmental stress (i.e. coral bleaching) can lead to coral death and the potential collapse of reef ecosystems. However, evolution of Symbiodiniaceae genomes, and its implications for the coral, is little understood. Genome sequences of Symbiodiniaceae remain scarce due in part to their large genome sizes (1–5 Gbp) and idiosyncratic genome features. Results Here, we present de novo genome assemblies of seven members of the genus Symbiodinium, of which two are free-living, one is an opportunistic symbiont, and the remainder are mutualistic symbionts. Integrating other available data, we compare 15 dinoflagellate genomes revealing high sequence and structural divergence. Divergence among some Symbiodinium isolates is comparable to that among distinct genera of Symbiodiniaceae. We also recovered hundreds of gene families specific to each lineage, many of which encode unknown functions. An in-depth comparison between the genomes of the symbiotic Symbiodinium tridacnidorum (isolated from a coral) and the free-living Symbiodinium natans reveals a greater prevalence of transposable elements, genetic duplication, structural rearrangements, and pseudogenisation in the symbiotic species. Conclusions Our results underscore the potential impact of lifestyle on lineage-specific gene-function innovation, genome divergence, and the diversification of Symbiodinium and Symbiodiniaceae. The divergent features we report, and their putative causes, may also apply to other microbial eukaryotes that have undergone symbiotic phases in their evolutionary history. Supplementary Information The online version contains supplementary material available at 10.1186/s12915-021-00994-6.
Affiliation(s)
- Raúl A González-Pech
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, 4072, Australia
  - Present address: Department of Integrative Biology, University of South Florida, Tampa, FL, 33620, USA
- Timothy G Stephens
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, 4072, Australia
  - Present address: Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, 08901, USA
- Yibi Chen
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, 4072, Australia
  - Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, 4072, Australia
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia
- Amin R Mohamed
  - Commonwealth Scientific and Industrial Research Organisation (CSIRO) Agriculture and Food, Queensland Bioscience Precinct, St Lucia, QLD, 4072, Australia
  - Present address: Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, 4072, Australia
- Yuanyuan Cheng
  - UQ Genomics Initiative, The University of Queensland, Brisbane, QLD, 4072, Australia
  - Present address: School of Life and Environmental Sciences, The University of Sydney, Sydney, NSW, 2006, Australia
- Sarah Shah
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, 4072, Australia
  - Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, 4072, Australia
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia
- Katherine E Dougan
  - Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, 4072, Australia
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia
- Michael D A Fortuin
  - Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, 4072, Australia
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia
- Rémi Lagorce
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, 4072, Australia
  - École Polytechnique Universitaire de l'Université de Nice, Université Nice-Sophia-Antipolis, 06410, Nice, Provence-Alpes-Côte d'Azur, France
- David W Burt
  - UQ Genomics Initiative, The University of Queensland, Brisbane, QLD, 4072, Australia
- Debashish Bhattacharya
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, 08901, USA
- Mark A Ragan
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, 4072, Australia
- Cheong Xin Chan
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, 4072, Australia
  - Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, 4072, Australia
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia

31
Peakall R, Wong DCJ, Phillips RD, Ruibal M, Eyles R, Rodriguez-Delgado C, Linde CC. A multitiered sequence capture strategy spanning broad evolutionary scales: Application for phylogenetic and phylogeographic studies of orchids. Mol Ecol Resour 2021; 21:1118-1140. [PMID: 33453072 DOI: 10.1111/1755-0998.13327] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 08/26/2020] [Revised: 12/22/2020] [Accepted: 01/05/2021] [Indexed: 11/30/2022]
Abstract
With over 25,000 species, the drivers of diversity in the Orchidaceae remain to be fully understood. Here, we outline a multitiered sequence capture strategy aimed at capturing hundreds of loci to enable phylogenetic resolution from subtribe to subspecific levels in orchids of the tribe Diurideae. For the probe design, we mined subsets of 18 transcriptomes, to give five target sequence sets aimed at the tribe (Sets 1 & 2), subtribe (Set 3), and within subtribe levels (Sets 4 & 5). Analysis included alternative de novo and reference-guided assembly, before target sequence extraction, annotation and alignment, and application of a homology-aware k-mer block phylogenomic approach, prior to maximum likelihood and coalescence-based phylogenetic inference. Our evaluation considered 87 taxa in two test data sets: 67 samples spanning the tribe, and 72 samples involving 24 closely related Caladenia species. The tiered design achieved high target loci recovery (>89%), with the median number of recovered loci in Sets 1-5 as follows: 212, 219, 816, 1024, and 1009, respectively. Interestingly, as a first test of the homologous k-mer approach for targeted sequence capture data, our study revealed its potential for enabling robust phylogenetic species tree inferences. Specifically, we found matching, and in one case improved phylogenetic resolution within species complexes, compared to conventional phylogenetic analysis involving target gene extraction. Our findings indicate that a customized multitiered sequence capture strategy, in combination with promising yet underutilized phylogenomic approaches, will be effective for groups where interspecific divergence is recent, but information on deeper phylogenetic relationships is also required.
Affiliation(s)
- Rod Peakall
  - Ecology and Evolution, Research School of Biology, The Australian National University, Canberra, ACT, Australia
- Darren C J Wong
  - Ecology and Evolution, Research School of Biology, The Australian National University, Canberra, ACT, Australia
- Ryan D Phillips
  - Ecology and Evolution, Research School of Biology, The Australian National University, Canberra, ACT, Australia
  - Department of Ecology, Environment and Evolution, La Trobe University, Melbourne, Vic., Australia
- Monica Ruibal
  - Ecology and Evolution, Research School of Biology, The Australian National University, Canberra, ACT, Australia
- Rodney Eyles
  - Ecology and Evolution, Research School of Biology, The Australian National University, Canberra, ACT, Australia
- Claudia Rodriguez-Delgado
  - Ecology and Evolution, Research School of Biology, The Australian National University, Canberra, ACT, Australia
- Celeste C Linde
  - Ecology and Evolution, Research School of Biology, The Australian National University, Canberra, ACT, Australia

32
Chakraborty A, Morgenstern B, Bandyopadhyay S. S-conLSH: alignment-free gapped mapping of noisy long reads. BMC Bioinformatics 2021; 22:64. [PMID: 33573603 PMCID: PMC7879691 DOI: 10.1186/s12859-020-03918-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 07/22/2020] [Accepted: 12/02/2020] [Indexed: 11/16/2022] Open
Abstract
Background The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate. Results We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the recently developed method lordFAST. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as SAM-file, if it is required for any downstream processing. Conclusions S-conLSH is one of the first alignment-free reference genome mapping tools achieving a high level of sensitivity. The spaced-context is especially suitable for extracting distant similarities. The variable-length spaced-seeds or patterns add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be considered as a prominent direction towards alignment-free sequence analysis.
Affiliation(s)
- Angana Chakraborty
  - Department of Computer Science, West Bengal Education Service, Kolkata, India
- Burkhard Morgenstern
  - Department of Bioinformatics (IMG), University of Göttingen, 37077, Göttingen, Germany

33
Abstract
Inferring phylogenetic relationships among hundreds or thousands of microbial genomes is an increasingly common task. The conventional phylogenetic approach adopts multiple sequence alignment to compare gene-by-gene, concatenated multigene or whole-genome sequences, from which a phylogenetic tree would be inferred. These alignments follow the implicit assumption of full-length contiguity among homologous sequences. However, common events in microbial genome evolution (e.g., structural rearrangements and genetic recombination) violate this assumption. Moreover, aligning hundreds or thousands of sequences is computationally intensive and not scalable to the rate at which genome data are generated. Therefore, alignment-free methods present an attractive alternative strategy. Here we describe a scalable alignment-free strategy to infer phylogenetic relationships using complete genome sequences of bacteria and archaea, based on short subsequences of length k (k-mers). We describe how this strategy can be extended to infer evolutionary relationships beyond a tree-like structure, to better capture both vertical and lateral signals of microbial evolution.
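As a rough illustration of the k-mer strategy described in this abstract, the sketch below builds a k-mer frequency profile per genome and derives a pairwise distance matrix. The abstract does not name a distance measure, so the cosine distance used here is an assumption chosen for simplicity, and the function names are hypothetical.

```python
import math
from collections import Counter

def kmer_profile(seq, k):
    """Relative frequency of every k-mer observed in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def cosine_distance(p, q):
    """1 - cosine similarity of two sparse k-mer profiles (assumed measure)."""
    dot = sum(v * q.get(word, 0.0) for word, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / (norm_p * norm_q)

def distance_matrix(genomes, k):
    """Pairwise distances for a {name: sequence} mapping, rows in sorted name order."""
    names = sorted(genomes)
    profiles = {name: kmer_profile(genomes[name], k) for name in names}
    matrix = [[cosine_distance(profiles[a], profiles[b]) for b in names]
              for a in names]
    return names, matrix
```

A matrix of this kind can be passed to a distance-based tree-building method such as neighbor-joining. Each genome is processed once to build its profile, and each pairwise comparison touches only the k-mers present, which is where the scalability over alignment comes from.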
34
Muggia L, Ametrano CG, Sterflinger K, Tesei D. An Overview of Genomics, Phylogenomics and Proteomics Approaches in Ascomycota. Life (Basel) 2020; 10:E356. [PMID: 33348904 PMCID: PMC7765829 DOI: 10.3390/life10120356] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 11/12/2020] [Revised: 12/10/2020] [Accepted: 12/12/2020] [Indexed: 12/26/2022] Open
Abstract
Fungi are among the most successful eukaryotes on Earth: they have evolved strategies to survive in the most diverse environments and stressful conditions and have been selected and exploited for multiple aims by humans. The characteristic features intrinsic to Fungi have required evolutionary changes and adaptations at deep molecular levels. Omics approaches, nowadays including genomics, metagenomics, phylogenomics, transcriptomics, metabolomics, and proteomics, have enormously advanced the understanding of fungal diversity at diverse taxonomic levels, under changeable conditions and in still under-investigated environments. These approaches can be applied both to environmental communities and to individual organisms, either in nature or in axenic culture, and have led traditional morphology-based fungal systematics to increasingly implement molecular-based approaches. The advent of next-generation sequencing technologies was key to boosting advances in fungal genomics and proteomics research. Much effort has also been directed towards the development of methodologies for optimal genomic DNA and protein extraction and separation. To date, the number of proteomics investigations in Ascomycetes exceeds that carried out in any other fungal group. This is primarily due to the preponderance of their involvement in plant and animal diseases and multiple industrial applications, and therefore the need to understand the biological basis of the infectious process to develop mechanisms for biologic control, as well as to detect key proteins with roles in stress survival. Here we present as comprehensive an overview as possible of the major advances, mainly of the past decade, in the fields of genomics (including phylogenomics) and proteomics of Ascomycota, focusing particularly on those reporting on opportunistic pathogenic, extremophilic, polyextremotolerant and lichenized fungi. We also present a review of the most widely used genome sequencing technologies and methods for DNA sequence and protein analyses applied so far to fungi.
Affiliation(s)
- Lucia Muggia
  - Department of Life Sciences, University of Trieste, 34127 Trieste, Italy
- Claudio G. Ametrano
  - Grainger Bioinformatics Center, Department of Science and Education, The Field Museum, Chicago, IL 60605, USA
- Katja Sterflinger
  - Academy of Fine Arts Vienna, Institute of Natural Sciences and Technology in the Arts, 1090 Vienna, Austria
- Donatella Tesei
  - Department of Biotechnology, University of Natural Resources and Life Sciences, 1190 Vienna, Austria

35
Pornputtapong N, Acheampong DA, Patumcharoenpol P, Jenjaroenpun P, Wongsurawat T, Jun SR, Yongkiettrakul S, Chokesajjawatee N, Nookaew I. KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis. Front Bioeng Biotechnol 2020; 8:556413. [PMID: 33072720 PMCID: PMC7538862 DOI: 10.3389/fbioe.2020.556413] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/28/2020] [Accepted: 08/24/2020] [Indexed: 12/22/2022] Open
Abstract
Genomic DNA is the best “unique identifier” for organisms. Alignment-free phylogenomic analysis, a simple, fast, and efficient method to compare genome sequences, relies on looking at the distribution of short DNA sequences of a particular length, referred to as k-mers. The k-mer approach has been explored as a basis for sequence analysis applications, including assembly, phylogenetic tree inference, and classification. Although this approach is not novel, the selection of a k-mer length is often rather arbitrary. However, it is a very important parameter for achieving the appropriate resolution for genome/sequence distances to infer biologically meaningful phylogenetic relationships. Thus, there is a need for a systematic approach to identify the appropriate k-mer length from whole-genome sequences. We present K-mer–length Iterative Selection for UNbiased Ecophylogenomics (KITSUNE), a tool for assessing the empirically optimal k-mer length of any given set of genomes of interest for phylogenomic analysis via a three-step approach based on (1) cumulative relative entropy (CRE), (2) average number of common features (ACF), and (3) observed common features (OCF). Using KITSUNE, we demonstrated the feasibility and reliability of these measurements to obtain empirically optimal k-mer lengths of 11, 17, and ∼34 from large genome datasets of viruses, bacteria, and fungi, respectively. Moreover, we demonstrated a feature of KITSUNE for accurate species identification for two de novo assembled bacterial genomes derived from error-prone long-read sequences, and for a published yeast genome. In addition, KITSUNE was used to identify the shortest species-specific k-mer that accurately identifies viruses. KITSUNE is freely available at https://github.com/natapol/kitsune.
Collapse
Affiliation(s)
- Natapol Pornputtapong
- Department of Biochemistry and Microbiology, Faculty of Pharmaceutical Sciences, and Research Unit of DNA Barcoding of Thai Medicinal Plants, Chulalongkorn University, Bangkok, Thailand
| | - Daniel A Acheampong
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States.,Joint Graduate Program in Bioinformatics, University of Arkansas at Little Rock and University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Preecha Patumcharoenpol
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Piroon Jenjaroenpun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Thidathip Wongsurawat
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Se-Ran Jun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Suganya Yongkiettrakul
- National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency, Pathum Thani, Thailand
| | - Nipa Chokesajjawatee
- National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency, Pathum Thani, Thailand
| | - Intawat Nookaew
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| |
|
36
|
Mapping sequence to feature vector using numerical representation of codons targeted to amino acids for alignment-free sequence analysis. Gene 2020; 766:145096. [PMID: 32919006 DOI: 10.1016/j.gene.2020.145096] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/25/2020] [Revised: 08/16/2020] [Accepted: 08/24/2020] [Indexed: 12/17/2022]
Abstract
The phylogenetic analysis based on sequence similarity targeted to real biological taxa is one of the major challenging tasks. In this paper, we propose a novel alignment-free method, CoFASA (Codon Feature based Amino acid Sequence Analyser), for similarity analysis of nucleotide sequences. At first, we assign numerical weights to the four nucleotides. We then calculate a score of each codon based on the numerical value of the constituent nucleotides, termed as degree of codons. Accordingly, we obtain the degree of each amino acid based on the degree of codons targeted towards a specific amino acid. Utilizing the degree of twenty amino acids and their relative abundance within a given sequence, we generate 20-dimensional features for every coding DNA sequence or protein sequence. We use the features for performing phylogenetic analysis of the set of candidate sequences. We use multiple protein sequences derived from Beta-globin (BG), NADH dehydrogenase subunit 5 (ND5), Transferrins (TFs), Xylanases, low identity (<40%) and high identity (⩾40%) protein sequences (encompassing 533 and 1064 protein families) for experimental assessments. We compare our results with sixteen (16) well-known methods, including both alignment-based and alignment-free methods. Various assessment indices are used, such as the Pearson correlation coefficient, RF (Robinson-Foulds) distance and ROC score for performance analysis. While comparing the performance of CoFASA with alignment-based methods (ClustalW, ClustalΩ, MAFFT, and MUSCLE), it shows very similar results. Further, CoFASA shows better performance in comparison to well-known alignment-free methods, including LZW-Kernal, jD2Stat, FFP, spaced, and AFKS-D2s in predicting taxonomic relationship among candidate taxa. Overall, we observe that the features derived by CoFASA are very much useful in isolating the sequences according to their taxonomic labels. 
Our method is cost-effective and, at the same time, produces consistent and satisfactory outcomes.
|
37
|
A new graph-theoretic approach to determine the similarity of genome sequences based on nucleotide triplets. Genomics 2020; 112:4701-4714. [PMID: 32827671 PMCID: PMC7437474 DOI: 10.1016/j.ygeno.2020.08.023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/09/2019] [Revised: 07/15/2020] [Accepted: 08/17/2020] [Indexed: 11/22/2022]
Abstract
Methods for finding sequence similarity play a significant role in computational biology. Owing to the rapid increase of genome sequences in public databases, determining the evolutionary relationships of species has become more challenging, and traditional alignment-based methods are poorly suited to the task because they are time-consuming. Therefore, a faster method that applies to species phylogeny is needed. In this paper, a new graph-theory based alignment-free sequence comparison method is proposed. A complete bipartite graph is used to represent each genome sequence based on its nucleotide triplets. Subsequently, a vector descriptor is formed from the weights of the edges of the graph. Finally, the phylogenetic tree is drawn using the UPGMA algorithm. The datasets used for comparison cover mammals, viruses, and bacteria. In most cases, the resulting phylogeny is found to be more satisfactory than that of earlier methods. Highlights: a new graph-theory based alignment-free genome sequence comparison; use of a complete bipartite graph to represent genome sequences; a descriptor based on the weights of the edges of the graph; comparison of the phylogenetic trees of different mammals, viruses, and bacteria; lower time complexity than earlier methods.
|
38
|
Positional Correlation Natural Vector: A Novel Method for Genome Comparison. Int J Mol Sci 2020; 21:ijms21113859. [PMID: 32485813 PMCID: PMC7312176 DOI: 10.3390/ijms21113859] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/13/2020] [Revised: 05/17/2020] [Accepted: 05/26/2020] [Indexed: 12/17/2022] Open
Abstract
Advances in sequencing technology have made large amounts of biological data available. Evolutionary analysis of data such as DNA sequences is highly important in biological studies. As alignment methods are ineffective for analyzing large-scale data due to their inherently high costs, alignment-free methods have recently attracted attention in the field of bioinformatics. In this paper, we introduce a new positional correlation natural vector (PCNV) method that involves converting a DNA sequence into an 18-dimensional numerical feature vector. Using frequency and position correlation to represent the nucleotide distribution, it is possible to obtain a PCNV for a DNA sequence. This new numerical vector design uses six suitable features to characterize the correlation among nucleotide positions in sequences. PCNV is also very easy to compute and can be used for rapid genome comparison. To test our novel method, we performed phylogenetic analysis with several viral and bacterial genome datasets with PCNV. For comparison, an alignment-based method, Bayesian inference, and two alignment-free methods, feature frequency profile and natural vector, were performed using the same datasets. We found that the PCNV technique is fast and accurate when used for phylogenetic analysis and classification of viruses and bacteria.
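To make the notion of a fixed-length numerical feature vector concrete, the following Python sketch computes the classical 12-dimensional natural vector (counts, mean positions, and normalized second central moments of each nucleotide), which is one of the comparison methods named in the abstract. It is not the 18-dimensional PCNV itself: the six positional-correlation features are not specified in the abstract and are left out. The function names are illustrative.

```python
import math


def natural_vector(seq):
    """Classical 12-dimensional natural vector of a DNA sequence.

    Layout: [n_A, n_C, n_G, n_T, mu_A, ..., mu_T, D2_A, ..., D2_T],
    using 1-based positions. Bases absent from seq contribute zeros.
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in "ACGT":
        positions = [i + 1 for i, c in enumerate(seq) if c == base]
        count = len(positions)
        if count == 0:
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mean = sum(positions) / count
        # Second central moment, normalized by count * sequence length.
        d2 = sum((p - mean) ** 2 for p in positions) / (count * n)
        counts.append(count)
        means.append(mean)
        moments.append(d2)
    return counts + means + moments


def euclidean_distance(u, v):
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    print(natural_vector("AACA"))
    print(euclidean_distance(natural_vector("AACA"), natural_vector("ACGT")))
```

Because every sequence maps to a vector of the same length, pairwise distances can be computed directly without alignment and passed to any distance-based tree-building method.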
|
39
|
Acman M, van Dorp L, Santini JM, Balloux F. Large-scale network analysis captures biological features of bacterial plasmids. Nat Commun 2020; 11:2452. [PMID: 32415210 PMCID: PMC7229196 DOI: 10.1038/s41467-020-16282-w] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 10/03/2019] [Accepted: 04/23/2020] [Indexed: 11/30/2022] Open
Abstract
Many bacteria can exchange genetic material through horizontal gene transfer (HGT) mediated by plasmids and plasmid-borne transposable elements. Here, we study the population structure and dynamics of over 10,000 bacterial plasmids, by quantifying their genetic similarities and reconstructing a network based on their shared k-mer content. We use a community detection algorithm to assign plasmids into cliques, which correlate with plasmid gene content, bacterial host range, GC content, and existing classifications based on replicon and mobility (MOB) types. Further analysis of plasmid population structure allows us to uncover candidates for yet undescribed replicon genes, and to identify transposable elements as the main drivers of HGT at broad phylogenetic scales. Our work illustrates the potential of network-based analyses of the bacterial 'mobilome' and opens up the prospect of a natural, exhaustive classification framework for bacterial plasmids.
Affiliation(s)
- Mislav Acman
- UCL Genetics Institute, University College London, Gower Street, London, WC1E 6BT, UK.
| | - Lucy van Dorp
- UCL Genetics Institute, University College London, Gower Street, London, WC1E 6BT, UK
| | - Joanne M Santini
- Institute of Structural and Molecular Biology, University College London, Gower Street, London, WC1E 6BT, UK
| | - Francois Balloux
- UCL Genetics Institute, University College London, Gower Street, London, WC1E 6BT, UK.
| |
|
40
|
Kaufer A, Stark D, Ellis J. A review of the systematics, species identification and diagnostics of the Trypanosomatidae using the maxicircle kinetoplast DNA: from past to present. Int J Parasitol 2020; 50:449-460. [PMID: 32333942 DOI: 10.1016/j.ijpara.2020.03.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/19/2019] [Revised: 02/28/2020] [Accepted: 03/09/2020] [Indexed: 11/25/2022]
Abstract
The Trypanosomatid family are a diverse and widespread group of protozoan parasites that belong to the higher order class Kinetoplastida. Containing predominantly monoxenous species (i.e. those having only a single host) that are confined to invertebrate hosts, this class is primarily known for its pathogenic dixenous species (i.e. those that have two hosts), serving as the aetiological agents of the important neglected tropical diseases including leishmaniasis, American trypanosomiasis (Chagas disease) and human African trypanosomiasis. Over the past few decades, a multitude of studies have investigated the diversity, classification and evolutionary history of the trypanosomatid family using different approaches and molecular targets. The mitochondrial-like DNA of the trypanosomatid parasites, also known as the kinetoplast, has emerged as a unique taxonomic and diagnostic target for exploring the evolution of this diverse group of parasitic eukaryotes. This review discusses recent advancements and important developments that have made a significant impact in the field of trypanosomatid systematics and diagnostics in recent years.
Affiliation(s)
- Alexa Kaufer
- School of Life Sciences, University of Technology Sydney, Ultimo, NSW 2007, Australia.
| | - Damien Stark
- Department of Microbiology, St Vincent's Hospital Sydney, Darlinghurst, NSW 2010, Australia
| | - John Ellis
- School of Life Sciences, University of Technology Sydney, Ultimo, NSW 2007, Australia
| |
|
41
|
Nwaiwu O, Aduba CC. An in silico analysis of acquired antimicrobial resistance genes in Aeromonas plasmids. AIMS Microbiol 2020; 6:75-91. [PMID: 32226916 PMCID: PMC7099201 DOI: 10.3934/microbiol.2020005] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/12/2020] [Accepted: 03/13/2020] [Indexed: 12/17/2022] Open
Abstract
Sequences of 105 Aeromonas species plasmids were probed for acquired anti-microbial resistance (AMR) genes using a bioinformatics approach. The plasmids showed no positive linear correlation between size and GC content and up to 55 acquired AMR genes were found in 39 (37%) plasmids after in silico screening for resistance against 15 antibiotic drug classes. Overall, potential multiple antibiotic resistance (p-MAR) index ranged from 0.07 to 0.53. Up to 18 plasmids were predicted to mediate multiple drug resistance (MDR). Plasmids pS121-1a (A. salmonicida), pWCX23_1 (A. hydrophila) and pASP-a58 (A. veronii) harboured 18, 15 and 14 AMR genes respectively. The five most occurring drug classes for which AMR genes were detected were aminoglycosides (27%), followed by beta-lactams (17%), sulphonamides (13%), fluoroquinolones (13%), and phenicols (10%). The most prevalent genes were a sulphonamide resistant gene Sul1, the gene aac (6')-Ib-cr (aminoglycoside 6'-N-acetyl transferase type Ib-cr) resistant to aminoglycosides and the blaKPC-2 gene, which encodes carbapenemase-production. Plasmid acquisition of AMR genes was mainly inter-genus rather than intra-genus. Eighteen plasmids showed template or host genes acquired from Pseudomonas monteilii, Salmonella enterica or Escherichia coli. The most occurring antimicrobial resistance determinants (ARDs) were beta-lactamase, followed by aminoglycosides acetyl-transferases, and then efflux pumps. Screening of new isolates in vitro and in vivo is required to ascertain the level of phenotypic expression of colistin and other acquired AMR genes detected.
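The reported p-MAR range can be related to drug-class counts with a small calculation. The sketch assumes that the p-MAR index is the number of drug classes with at least one detected AMR gene divided by the 15 classes screened. The abstract does not state the formula, so this definition and the function name `p_mar_index` are assumptions.

```python
def p_mar_index(classes_with_resistance, classes_screened=15):
    """Assumed definition: resistant drug classes / drug classes screened."""
    if classes_screened <= 0:
        raise ValueError("classes_screened must be positive")
    return classes_with_resistance / classes_screened


if __name__ == "__main__":
    for classes in (1, 8):
        print(classes, round(p_mar_index(classes), 2))
```

Under that assumption, the reported bounds of 0.07 and 0.53 would correspond to 1 and 8 of the 15 drug classes.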
Affiliation(s)
- Ogueri Nwaiwu
- School of Biosciences, University of Nottingham, Sutton Bonington Campus, United Kingdom
| | - Chiugo Claret Aduba
- Department of Science Laboratory Technology, University of Nigeria, Nsukka, Nigeria
| |
|
42
|
Dencker T, Leimeister CA, Gerth M, Bleidorn C, Snir S, Morgenstern B. 'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees. NAR Genom Bioinform 2020; 2:lqz013. [PMID: 33575565 PMCID: PMC7671388 DOI: 10.1093/nargab/lqz013] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/30/2019] [Revised: 07/31/2019] [Accepted: 10/13/2019] [Indexed: 02/03/2023] Open
Abstract
Word-based or 'alignment-free' methods for phylogeny inference have become popular in recent years. These methods are much faster than traditional, alignment-based approaches, but they are generally less accurate. Most alignment-free methods calculate 'pairwise' distances between nucleic-acid or protein sequences; these distance values can then be used as input for tree-reconstruction programs such as neighbor-joining. In this paper, we propose the first word-based phylogeny approach that is based on 'multiple' sequence comparison and 'maximum likelihood'. Our algorithm first samples small, gap-free alignments involving four taxa each. For each of these alignments, it then calculates a quartet tree and, finally, the program 'Quartet MaxCut' is used to infer a super tree for the full set of input taxa from the calculated quartet trees. Experimental results show that trees produced with our approach are of high quality.
Affiliation(s)
- Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Chris-André Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Michael Gerth
- Institute for Integrative Biology, University of Liverpool, Biosciences Building, Crown Street, L69 7ZB Liverpool, UK
| | - Christoph Bleidorn
- Department of Animal Evolution and Biodiversity, Universität Göttingen, Untere Karspüle 2, 37073 Göttingen, Germany
- Museo Nacional de Ciencias Naturales, Spanish National Research Council (CSIC), 28006 Madrid, Spain
| | - Sagi Snir
- Institute of Evolution, Department of Evolutionary and Environmental Biology, University of Haifa, 199 Aba Khoushy Ave. Mount Carmel, Haifa, Israel
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany
| |
|
43
|
Accelerating Binary String Comparisons with a Scalable, Streaming-Based System Architecture Based on FPGAs. ALGORITHMS 2020. [DOI: 10.3390/a13020047] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/16/2022]
Abstract
This paper is concerned with Field Programmable Gate Array (FPGA)-based systems for energy-efficient, high-throughput string comparison. Modern applications that involve comparisons across large data sets, such as large sequence sets in molecular biology, are by their nature computationally intensive. In this work, we present a scalable FPGA-based system architecture to accelerate the comparison of binary strings. The current architecture supports arbitrary lengths in the range of 16 to 2048 bits, covering a wide range of possible applications. In our example application, we consider DNA sequences embedded in a binary vector space through Locality Sensitive Hashing (LSH), one of several possible encodings that enable us to avoid more costly character-based operations. Here the resulting encoding is a 512-bit binary signature with comparisons based on the Hamming distance. In this approach, most of the load arises from the calculation of the O(m × n) Hamming distances between the signatures, where m is the number of queries and n is the number of signatures contained in the database. Signature generation only needs to be performed once, and we do not consider it further, focusing instead on accelerating the signature comparisons. The proposed FPGA-based architecture is optimized for high throughput using hundreds of computing elements, arranged in a systolic array. These core computing elements can be adapted to support other string comparison algorithms with little effort, while the other infrastructure stays the same. On a Xilinx Virtex UltraScale+ FPGA (XCVU9P-2), a peak throughput of 75.4 billion comparisons per second on 512-bit signatures was achieved, using a design with 384 parallel processing elements and a clock frequency of 200 MHz. This makes our FPGA design 86 times faster than a highly optimized CPU implementation. Compared to a GPU design, executed on an NVIDIA GTX1060, it performs nearly five times faster.
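The core operation being accelerated, the Hamming distance between fixed-width binary signatures over all query/database pairs, can be sketched in software as follows. This is a plain Python reference version for illustration only, not the FPGA design; signatures are modeled as Python integers, and the function names are hypothetical.

```python
SIGNATURE_BITS = 512  # signature width used in the example application


def hamming_distance(a, b):
    """Number of bit positions at which two signatures differ."""
    # XOR leaves a 1 wherever the bits differ; count those ones.
    return bin(a ^ b).count("1")


def all_pairs_hamming(queries, database):
    """m x n table of distances: one row per query, one column per database entry."""
    return [[hamming_distance(q, s) for s in database] for q in queries]


if __name__ == "__main__":
    all_ones = (1 << SIGNATURE_BITS) - 1
    queries = [0, all_ones]
    database = [0, 1, all_ones]
    for row in all_pairs_hamming(queries, database):
        print(row)
```

The nested loop makes the O(m × n) cost explicit. As a rough plausibility check, and assuming each processing element completes one comparison per clock cycle, 384 elements at 200 MHz would bound throughput at 76.8 billion comparisons per second, which is in line with the reported peak of 75.4 billion.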
|
44
|
Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B. The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLoS One 2020; 15:e0228070. [PMID: 32040534 PMCID: PMC7010260 DOI: 10.1371/journal.pone.0228070] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 01/07/2020] [Accepted: 01/08/2020] [Indexed: 12/14/2022] Open
Abstract
We study the number N_k of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences, i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor, can be estimated from the slope of a function F that depends on N_k and that is affine-linear within a certain range of k. Integers k_min and k_max can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(k_min) and F(k_max). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies.
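The slope idea can be illustrated with a simplified Python sketch. It assumes that homologous positions match independently with probability p, so that the number of homologous length-k matches falls off roughly like p to the power k and its logarithm is linear in k with slope ln p. The sketch uses the plain logarithm of the match count and ignores random background matches; it does not reproduce the paper's function F or its choice of the k range, and all function names are hypothetical.

```python
import math
from collections import Counter


def kmer_match_count(a, b, k):
    """Number of position pairs (i, j) with a[i:i+k] == b[j:j+k]."""
    counts_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    return sum(counts_a[b[j:j + k]] for j in range(len(b) - k + 1))


def match_probability_from_slope(a, b, k_low, k_high):
    """Estimate the per-site match probability from two match counts."""
    n_low = kmer_match_count(a, b, k_low)
    n_high = kmer_match_count(a, b, k_high)
    if n_low == 0 or n_high == 0:
        raise ValueError("no k-mer matches at one of the chosen k values")
    slope = (math.log(n_high) - math.log(n_low)) / (k_high - k_low)
    return min(1.0, math.exp(slope))


def jukes_cantor_distance(p_match):
    """Jukes-Cantor substitutions per site for a given match probability."""
    mismatch = 1.0 - p_match
    if mismatch >= 0.75:
        raise ValueError("mismatch rate too high for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * mismatch / 3.0)


if __name__ == "__main__":
    print(kmer_match_count("ACGTTGCAAG", "ACGTTGCAAG", 3))
    print(jukes_cantor_distance(0.9))
```

On real genomes the background matches matter for small k, which is why the paper restricts the estimate to a range of k derived from the sequence lengths; the sketch leaves that choice to the caller.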
Affiliation(s)
- Sophie Röhling
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Alexander Linne
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Jendrik Schellhorn
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | | | - Thomas Dencker
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Burkhard Morgenstern
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Göttingen, Germany
| |
|
45
|
Dlamini GS, Muller SJ, Meraba RL, Young RA, Mashiyane J, Chiwewe T, Mapiye DS. Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach. IEEE ACCESS : PRACTICAL INNOVATIONS, OPEN SOLUTIONS 2020; 8:195263-195273. [PMID: 34976561 PMCID: PMC8675546 DOI: 10.1109/access.2020.3031387] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/02/2020] [Accepted: 10/04/2020] [Indexed: 05/08/2023]
Abstract
The world is grappling with the COVID-19 pandemic caused by the 2019 novel SARS-CoV-2. To better understand this novel virus and its relationship with other pathogens, new methods for analyzing the genome are required. In this study, intrinsic dinucleotide genomic signatures were analyzed for whole genome sequence data of eight pathogenic species, including SARS-CoV-2. The genome sequences were transformed into dinucleotide relative frequencies and classified using the extreme gradient boosting (XGBoost) model. The classification models were trained to a) distinguish between the sequences of all eight species and b) distinguish between sequences of SARS-CoV-2 that originate from different geographic regions. Our method attained 100% in all performance metrics and for all tasks in the eight-species classification problem. Moreover, the models achieved 67% balanced accuracy for the task of classifying the SARS-CoV-2 sequences into the six continental regions and achieved 86% balanced accuracy for the task of classifying SARS-CoV-2 samples as either originating from Asia or not. Analysis of the dinucleotide genomic profiles of the eight species revealed a similarity between the SARS-CoV-2 and MERS-CoV viral sequences. Further analysis of SARS-CoV-2 viral sequences from the six continents revealed that samples from Oceania had the highest frequency of TT dinucleotides as well as the lowest CG frequency compared to the other continents. The dinucleotide signatures of AC, AG, CA, CT, GA, GT, TC, and TG were well conserved across most genomes, while the frequencies of other dinucleotide signatures varied considerably. Altogether, the results from this study demonstrate the utility of dinucleotide relative frequencies for discriminating and identifying similar species.
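The feature-extraction step can be sketched in Python as follows. The sketch assumes that "dinucleotide relative frequencies" means the count of each of the 16 overlapping dinucleotides divided by the total number of valid dinucleotides; the abstract does not give the formula, so the study's exact normalization may differ. The classifier itself (XGBoost) is not shown.

```python
from collections import Counter
from itertools import product

# Fixed feature order: AA, AC, AG, AT, CA, ..., TT.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_frequencies(seq):
    """16-dimensional vector of overlapping dinucleotide frequencies.

    Windows containing characters other than A, C, G, T (for example N)
    are ignored in both the counts and the normalizing total.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DINUCLEOTIDES)
    if total == 0:
        return [0.0] * len(DINUCLEOTIDES)
    return [counts[d] / total for d in DINUCLEOTIDES]


if __name__ == "__main__":
    features = dinucleotide_frequencies("ACGTTGCAAG")
    for name, value in zip(DINUCLEOTIDES, features):
        print(name, round(value, 3))
```

Each genome becomes one row of 16 numbers, which is the kind of fixed-length input a gradient-boosting classifier expects.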
|
46
|
Read-SpaM: assembly-free and alignment-free comparison of bacterial genomes with low sequencing coverage. BMC Bioinformatics 2019; 20:638. [PMID: 31842735 PMCID: PMC6916211 DOI: 10.1186/s12859-019-3205-7] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In many fields of biomedical research, it is important to estimate phylogenetic distances between taxa based on low-coverage sequencing reads. Major applications are, for example, phylogeny reconstruction, species identification from small sequencing samples, or bacterial strain typing in medical diagnostics. RESULTS We adapted our previously developed software program Filtered Spaced-Word Matches (FSWM) for alignment-free phylogeny reconstruction to take unassembled reads as input; we call this implementation Read-SpaM. CONCLUSIONS Test runs on simulated reads from semi-artificial and real-world bacterial genomes show that our approach can estimate phylogenetic distances with high accuracy, even for large evolutionary distances and for very low sequencing coverage.
|
47
|
Roddy AC, Jurek-Loughrey A, Souza J, Gilmore A, O’Reilly PG, Stupnikov A, Gonzalez de Castro D, Prise KM, Salto-Tellez M, McArt DG. NUQA: Estimating Cancer Spatial and Temporal Heterogeneity and Evolution through Alignment-Free Methods. Mol Biol Evol 2019; 36:2883-2889. [PMID: 31424551 PMCID: PMC6878956 DOI: 10.1093/molbev/msz182] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/22/2022] Open
Abstract
Longitudinal next-generation sequencing of cancer patient samples has enhanced our understanding of the evolution and progression of various cancers. As a result, and due to our increasing knowledge of heterogeneity, such sampling is becoming increasingly common in research and clinical trial sample collections. Traditionally, the evolutionary analysis of these cohorts involves the use of an aligner followed by stringent downstream analyses. However, this can lead to large levels of information loss due to the vast mutational landscape that characterizes tumor samples. Here, we propose an alignment-free approach for sequence comparison, a well-established approach in a range of biological applications including typical phylogenetic classification. Such methods could be used to compare information collated in raw sequence files to allow an unsupervised assessment of the evolutionary trajectory of patient genomic profiles. In order to highlight this utility in cancer research, we have applied our alignment-free approach using a previously established metric, Jensen-Shannon divergence, and a metric novel to this area, Hellinger distance, to two longitudinal cancer patient cohorts in glioma and clear cell renal cell carcinoma using our software, NUQA. We hypothesize that this approach has the potential to reveal novel information about the heterogeneity and evolutionary trajectory of spatiotemporal tumor samples, potentially revealing early events in tumorigenesis and the origins of metastases and recurrences.
Key words: alignment-free, Hellinger distance, exome-seq, evolution, phylogenetics, longitudinal.
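The two metrics named in the abstract have standard definitions, which the following Python sketch applies to k-mer frequency distributions. It is not the NUQA software: the k-mer profiling step, the choice of k, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def kmer_distribution(seq, k):
    """Relative frequency of each k-mer observed in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits; ranges from 0 to 1."""
    jsd = 0.0
    for key in set(p) | set(q):
        pi, qi = p.get(key, 0.0), q.get(key, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            jsd += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * math.log2(qi / mi)
    return jsd


def hellinger_distance(p, q):
    """Hellinger distance; ranges from 0 to 1."""
    total = sum(
        (math.sqrt(p.get(key, 0.0)) - math.sqrt(q.get(key, 0.0))) ** 2
        for key in set(p) | set(q)
    )
    return math.sqrt(0.5 * total)


if __name__ == "__main__":
    a = kmer_distribution("ACGTACGTAC", 3)
    b = kmer_distribution("ACGTTTGTAC", 3)
    print(jensen_shannon_divergence(a, b), hellinger_distance(a, b))
```

Both measures are symmetric and bounded, so pairwise values across samples can be assembled into a distance matrix for clustering or tree building.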
Affiliation(s)
- Aideen C Roddy
- Centre for Cancer Research and Cell Biology, Queen’s University Belfast, Belfast, United Kingdom
| | - Anna Jurek-Loughrey
- School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, United Kingdom
| | - Jose Souza
- Centre for Cancer Research and Cell Biology, Queen’s University Belfast, Belfast, United Kingdom
| | - Alan Gilmore
- Centre for Cancer Research and Cell Biology, Queen’s University Belfast, Belfast, United Kingdom
| | - Paul G O’Reilly
- Centre for Cancer Research and Cell Biology, Queen’s University Belfast, Belfast, United Kingdom
| | - Alexey Stupnikov
- Centre for Cancer Research and Cell Biology, Queen’s University Belfast, Belfast, United Kingdom
- Department of Oncology, School of Medicine, Johns Hopkins University, Baltimore, MD
| | - David Gonzalez de Castro
- Centre for Cancer Research and Cell Biology, Queen’s University Belfast, Belfast, United Kingdom
| | - Kevin M Prise
- Centre for Cancer Research and Cell Biology, Queen’s University Belfast, Belfast, United Kingdom
| | - Manuel Salto-Tellez
- Centre for Cancer Research and Cell Biology, Queen’s University Belfast, Belfast, United Kingdom
| | - Darragh G McArt
- Centre for Cancer Research and Cell Biology, Queen’s University Belfast, Belfast, United Kingdom
| |
|
48
|
Evolutionary Insight into the Trypanosomatidae Using Alignment-Free Phylogenomics of the Kinetoplast. Pathogens 2019; 8:pathogens8030157. [PMID: 31540520 PMCID: PMC6789588 DOI: 10.3390/pathogens8030157] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 08/23/2019] [Revised: 09/10/2019] [Accepted: 09/13/2019] [Indexed: 12/12/2022] Open
Abstract
Advancements in next-generation sequencing techniques have led to a substantial increase in the genomic information available for analyses in evolutionary biology. Understanding such vast quantities of genomic data requires a matching growth in bioinformatic methods and expertise. Alignment-free phylogenomics offers an alternative approach for large-scale analyses that may have the potential to address these challenges. The evolutionary relationships between various species within the trypanosomatid family, specifically members belonging to the genera Leishmania and Trypanosoma, have been extensively studied over the last 30 years. However, there is a need for a more exhaustive analysis of the Trypanosomatidae, summarising the evolutionary patterns amongst the entire family of these important protists. The mitochondrial DNA of the trypanosomatids, better known as the kinetoplast, represents a valuable taxonomic marker given its unique presence across all kinetoplastid protozoans. The aim of this study was to validate the reliability and robustness of alignment-free approaches for phylogenomic analyses and their applicability to reconstructing the evolutionary relationships within the trypanosomatid family. In the present study, alignment-free analyses demonstrated the strength of these methods, particularly when dealing with large datasets, compared to the traditional phylogenetic approaches. We present a maxicircle genome phylogeny of 46 species spanning the trypanosomatid family, demonstrating the superiority of the maxicircle for the analysis and taxonomic resolution of the Trypanosomatidae.
|
49
|
Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, Lau AK, Röhling S, Choi JJ, Waterman MS, Comin M, Kim SH, Vinga S, Almeida JS, Chan CX, James BT, Sun F, Morgenstern B, Karlowski WM. Benchmarking of alignment-free sequence comparison methods. Genome Biol 2019; 20:144. [PMID: 31345254 PMCID: PMC6659240 DOI: 10.1186/s13059-019-1755-7] [Citation(s) in RCA: 101] [Impact Index Per Article: 20.2] [Received: 05/16/2019] [Accepted: 07/03/2019] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland
- Hani Z Girgis
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
- Chris-Andre Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Kujin Tang
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Anna Katharina Lau
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Sophie Röhling
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Jae Jin Choi
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Michael S Waterman
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
- Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
- Sung-Hou Kim
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Susana Vinga
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- Jonas S Almeida
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NIH/NCI), Bethesda, USA
- Cheong Xin Chan
- Institute for Molecular Bioscience, and School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia
- Benjamin T James
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
- Fengzhu Sun
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
- Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland
|
50
|
Forsdyke DR. Success of alignment-free oligonucleotide (k-mer) analysis confirms relative importance of genomes not genes in speciation and phylogeny. Biol J Linn Soc Lond 2019. [DOI: 10.1093/biolinnean/blz096] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
The utility of DNA sequence substrings (k-mers) in alignment-free phylogenetic classification, including that of bacteria and viruses, is increasingly recognized. However, its biological basis eludes many 21st century practitioners. A path from the 19th century recognition of the informational basis of heredity to the modern era can be discerned. Crick’s DNA ‘unpairing postulate’ predicted that recombinational pairing of homologous DNAs during meiosis would be mediated by short k-mers in the loops of stem-loop structures extruded from classical duplex helices. The complementary ‘kissing’ duplex loops – like tRNA anticodon–codon k-mer duplexes – would seed a more extensive pairing that would then extend until limited by lack of homology or other factors. Indeed, this became the principle behind alignment-based methods that assessed similarity by degree of DNA–DNA reassociation in vitro. These are now seen as less sensitive than alignment-free methods that are closely consistent, both theoretically and mechanistically, with chromosomal anti-recombination models for the initiation of divergence into new species. The analytical power of k-mer differences supports the theses that evolutionary advance sometimes serves the needs of nucleic acids (genomes) rather than proteins (genes), and that such differences can play a role in early speciation events.
Affiliation(s)
- Donald R Forsdyke
- Department of Biomedical and Molecular Sciences, Queen’s University, Kingston, Ontario, Canada
|