1
Chow CFW, Ghosh S, Hadarovich A, Toth-Petroczy A. SHARK enables sensitive detection of evolutionary homologs and functional analogs in unalignable and disordered sequences. Proc Natl Acad Sci U S A 2024; 121:e2401622121. [PMID: 39383002 DOI: 10.1073/pnas.2401622121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Accepted: 08/30/2024] [Indexed: 10/11/2024] Open
Abstract
Intrinsically disordered regions (IDRs) are structurally flexible protein segments with regulatory functions in multiple contexts, such as in the assembly of biomolecular condensates. Since IDRs undergo more rapid evolution than ordered regions, identifying homology of such poorly conserved regions remains challenging for state-of-the-art alignment-based methods that rely on position-specific conservation of residues. Thus, systematic functional annotation and evolutionary analysis of IDRs have been limited, despite them comprising ~21% of proteins. To accurately assess homology between unalignable sequences, we developed an alignment-free sequence comparison algorithm, SHARK (Similarity/Homology Assessment by Relating K-mers). We trained SHARK-dive, a machine learning homology classifier, which achieved superior performance to standard alignment-based approaches in assessing evolutionary homology in unalignable sequences. Furthermore, it correctly identified dissimilar but functionally analogous IDRs in IDR-replacement experiments reported in the literature, whereas alignment-based tools were incapable of detecting such functional relationships. SHARK-dive not only predicts functionally similar IDRs at a proteome-wide scale but also identifies cryptic sequence properties and motifs that drive remote homology and analogy, thereby providing interpretable and experimentally verifiable hypotheses of the sequence determinants that underlie such relationships. SHARK-dive acts as an alternative to alignment to facilitate systematic analysis and functional annotation of the unalignable protein universe.
Affiliation(s)
- Chi Fung Willis Chow
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden 01307, Germany
  - Center for Systems Biology Dresden, Dresden 01307, Germany
  - Cluster of Excellence Physics of Life, Technische Universität Dresden, Dresden 01062, Germany
- Soumyadeep Ghosh
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden 01307, Germany
  - Center for Systems Biology Dresden, Dresden 01307, Germany
- Anna Hadarovich
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden 01307, Germany
  - Center for Systems Biology Dresden, Dresden 01307, Germany
- Agnes Toth-Petroczy
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden 01307, Germany
  - Center for Systems Biology Dresden, Dresden 01307, Germany
  - Cluster of Excellence Physics of Life, Technische Universität Dresden, Dresden 01062, Germany
2
Yang H, Lu X, Chang J, Chang Q, Zheng W, Chen Z, Yi H. Kssdtree: an interactive Python package for phylogenetic analysis based on sketching technique. Bioinformatics 2024; 40:btae566. [PMID: 39298462 PMCID: PMC11467128 DOI: 10.1093/bioinformatics/btae566] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2024] [Revised: 09/14/2024] [Accepted: 09/18/2024] [Indexed: 09/21/2024] Open
Abstract
SUMMARY Sketching technologies have recently emerged as a promising solution for real-time, large-scale phylogenetic analysis. However, existing sketching-based phylogenetic tools exhibit drawbacks, including platform restrictions, deficiencies in tree visualization, and inherent distance estimation bias. These limitations collectively impede the overall convenience and efficiency of the analysis. In this study, we introduce Kssdtree, an interactive Python package designed to address these challenges. Kssdtree surpasses other sketching-based tools by demonstrating superior performance in terms of both accuracy and time efficiency on comprehensive benchmarking datasets. Notably, Kssdtree offers key advantages such as intra-species phylogenomic analysis and GTDB-based phylogenetic placement analysis, significantly enhancing the scope and depth of phylogenetic investigations. Through extensive evaluations and comparisons, Kssdtree stands out as an efficient and versatile method for real-time, large-scale phylogenetic analysis. AVAILABILITY AND IMPLEMENTATION The Kssdtree Python package is freely accessible at https://pypi.org/project/kssdtree and source code is available at https://github.com/yhlink/kssdtree. The documentation and instantiation for the software is available at https://kssdtree.readthedocs.io/en/latest. The video tutorial is available at https://youtu.be/_6hg59Yn-Ws.
Affiliation(s)
- Hang Yang
  - College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Jinzhong 030600, China
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518055, China
- Xiaoxin Lu
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518055, China
- Jiaxing Chang
  - College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Jinzhong 030600, China
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518055, China
- Qing Chang
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518055, China
- Wen Zheng
  - College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Jinzhong 030600, China
- Zehua Chen
  - College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Jinzhong 030600, China
- Huiguang Yi
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518055, China
3
Li X, Zhou T, Feng X, Yau ST, Yau SST. Exploring geometry of genome space via Grassmann manifolds. Innovation (N Y) 2024; 5:100677. [PMID: 39206218 PMCID: PMC11350263 DOI: 10.1016/j.xinn.2024.100677] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Accepted: 07/18/2024] [Indexed: 09/04/2024] Open
Abstract
It is important to understand the geometry of genome space in biology. After transforming genome sequences into frequency matrices of the chaos game representation (FCGR), we regard a genome sequence as a point in a suitable Grassmann manifold by analyzing the column space of the corresponding FCGR. To assess the sequence similarity, we employ the generalized Grassmannian distance, an intrinsic geometric distance that differs from the traditional Euclidean distance used in the classical k-mer frequency-based methods. With this method, we constructed phylogenetic trees for various genome datasets, including influenza A virus hemagglutinin gene, Orthocoronavirinae genome, and SARS-CoV-2 complete genome sequences. Our comparative analysis with multiple sequence alignment and alignment-free methods for large-scale sequences revealed that our method, which employs the subspace distance between the column spaces of different FCGRs (FCGR-SD), outperformed its competitors in terms of both speed and accuracy. In addition, we used low-dimensional visualization of the SARS-CoV-2 genome sequences and spike protein nucleotide sequences with our methods, resulting in some intriguing findings. We not only propose a novel and efficient algorithm for comparing genome sequences but also demonstrate that genome data have some intrinsic manifold structures, providing a new geometric perspective for molecular biology studies.
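As a rough illustration of the first step described in the abstract above, the following Python sketch builds a frequency matrix of the chaos game representation for a nucleotide sequence. The corner layout in `CORNER`, the function name `fcgr`, and the choice to skip k-mers containing ambiguous bases are assumptions made for this example, not details taken from the paper. The Grassmannian subspace distance is not reproduced here.

```python
# Assumed corner layout; published CGR variants assign bases to corners differently.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k):
    """Return a 2^k x 2^k matrix of k-mer counts indexed by chaos game coordinates."""
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNER for base in kmer):
            continue  # assumption: ignore k-mers with ambiguous bases such as N
        x = y = 0
        # Walk the k-mer right to left so the last base selects the coarsest
        # quadrant, matching the halving step of the iterated chaos game.
        for base in reversed(kmer):
            bx, by = CORNER[base]
            x = (x << 1) | bx
            y = (y << 1) | by
        grid[y][x] += 1
    return grid


if __name__ == "__main__":
    for row in fcgr("ACGTACGTTGCA", 2):
        print(row)
```

Each cell holds the count of one k-mer, so the matrix carries the same information as a k-mer frequency vector but arranged so that k-mers sharing a suffix fall in the same quadrant.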
Affiliation(s)
- Xiaoguang Li
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
- Tao Zhou
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Xingdong Feng
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
- Shing-Tung Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
  - Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China
- Stephen S.-T. Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
  - Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China
4
Stock M, Van Criekinge W, Boeckaerts D, Taelman S, Van Haeverbeke M, Dewulf P, De Baets B. Hyperdimensional computing: A fast, robust, and interpretable paradigm for biological data. PLoS Comput Biol 2024; 20:e1012426. [PMID: 39316621 PMCID: PMC11421772 DOI: 10.1371/journal.pcbi.1012426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2024] Open
Abstract
Advances in bioinformatics are primarily due to new algorithms for processing diverse biological data sources. While sophisticated alignment algorithms have been pivotal in analyzing biological sequences, deep learning has substantially transformed bioinformatics, addressing sequence, structure, and functional analyses. However, these methods are incredibly data-hungry, compute-intensive, and hard to interpret. Hyperdimensional computing (HDC) has recently emerged as an exciting alternative. The key idea is that random vectors of high dimensionality can represent concepts such as sequence identity or phylogeny. These vectors can then be combined using simple operators for learning, reasoning, or querying by exploiting the peculiar properties of high-dimensional spaces. Our work reviews and explores HDC's potential for bioinformatics, emphasizing its efficiency, interpretability, and adeptness in handling multimodal and structured data. HDC holds great potential for various omics data searching, biosignal analysis, and health applications.
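The key idea in the abstract above, random high-dimensional vectors combined with simple operators, can be sketched in a few lines of Python. This example assumes one common variant (bipolar vectors, binding by elementwise multiplication, bundling by majority vote); the review covers other encodings too, and all names here (`DIM`, `random_hv`, `bind`, `bundle`, `similarity`) are illustrative rather than taken from the paper.

```python
import random

DIM = 10_000              # assumed dimensionality for the illustration
rng = random.Random(0)    # fixed seed so the example is repeatable


def random_hv():
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def bind(a, b):
    """Bind two hypervectors by elementwise multiplication (self-inverse)."""
    return [x * y for x, y in zip(a, b)]


def bundle(vectors):
    """Bundle hypervectors by elementwise majority vote (ties go to +1)."""
    return [1 if sum(column) >= 0 else -1 for column in zip(*vectors)]


def similarity(a, b):
    """Cosine similarity, which for bipolar vectors is the dot product over DIM."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


a, b, c, d = (random_hv() for _ in range(4))
proto = bundle([a, b, c])

if __name__ == "__main__":
    print("member:   ", round(similarity(proto, a), 3))
    print("unrelated:", round(similarity(proto, d), 3))
```

A bundled vector stays measurably similar to each of its members while remaining nearly orthogonal to unrelated random vectors, which is the property that querying and classification schemes of this kind rely on.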
Affiliation(s)
- Michiel Stock
  - KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Wim Van Criekinge
  - Biobix Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Dimitri Boeckaerts
  - KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
  - Laboratory of Applied Biotechnology, Department of Biotechnology, Ghent University, Ghent, Belgium
- Steff Taelman
  - KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
  - Biobix Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
  - BioLizard nv, Ghent, Belgium
- Maxime Van Haeverbeke
  - KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Pieter Dewulf
  - KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Bernard De Baets
  - KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
5
Islam R, Rahman A. An alignment-free method for detection of missing regions for phylogenetic analysis. Heliyon 2024; 10:e32227. [PMID: 38933968 PMCID: PMC11200290 DOI: 10.1016/j.heliyon.2024.e32227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2024] [Revised: 05/17/2024] [Accepted: 05/29/2024] [Indexed: 06/28/2024] Open
Abstract
Phylogenetic tree estimation using conventional approaches usually requires pairwise or multiple sequence alignment. However, sequence alignment has difficulties related to scalability and accuracy in case of long sequences such as whole genomes, low sequence identity, and in presence of genomic rearrangements. To address these issues, alignment-free approaches have been proposed. While these methods have demonstrated promising results, many of these lead to errors when regions are missing from the sequences of one or more species that are trivially detected in alignment-based methods. Here, we present an alignment-free method for detecting missing regions in sequences of species for which phylogeny is to be estimated. It is based on counts of k-mers and can be used to filter out k-mers belonging to regions in one species that are missing in one or more of the other species. We perform experiments with real and simulated datasets containing missing regions and find that it can successfully detect a large fraction of such k-mers and can lead to improvements in the estimated phylogenies. Our method can be used in k-mer based alignment-free phylogeny estimation methods to filter out k-mers corresponding to missing regions.
Affiliation(s)
- Rubyeat Islam
  - Department of Computer Science and Engineering, Military Institute of Science and Technology, Dhaka, Bangladesh
- Atif Rahman
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
6
Prusokiene A, Boonham N, Fox A, Howard TP. Mottle: Accurate pairwise substitution distance at high divergence through the exploitation of short-read mappers and gradient descent. PLoS One 2024; 19:e0298834. [PMID: 38512939 PMCID: PMC10956839 DOI: 10.1371/journal.pone.0298834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Accepted: 01/30/2024] [Indexed: 03/23/2024] Open
Abstract
Current tools for estimating the substitution distance between two related sequences struggle to remain accurate at a high divergence. Difficulties at distant homologies, such as false seeding and over-alignment, create a high barrier for the development of a stable estimator. This is especially true for viral genomes, which carry a high rate of mutation, small size, and sparse taxonomy. Developing an accurate substitution distance measure would help to elucidate the relationship between highly divergent sequences, interrogate their evolutionary history, and better facilitate the discovery of new viral genomes. To tackle these problems, we propose an approach that uses short-read mappers to create whole-genome maps, and gradient descent to isolate the homologous fraction and calculate the final distance value. We implement this approach as Mottle. With the use of simulated and biological sequences, Mottle was able to remain stable to 0.66-0.96 substitutions per base pair and identify viral outgroup genomes with 95% accuracy at the family-order level. Our results indicate that Mottle performs as well as existing programs in identifying taxonomic relationships, with more accurate numerical estimation of genomic distance over greater divergences. By contrast, one limitation is a reduced numerical accuracy at low divergences, and on genomes where insertions and deletions are uncommon, when compared to alternative approaches. We propose that Mottle may therefore be of particular interest in the study of viruses, viral relationships, and notably for viral discovery platforms, helping in benchmarking of homology search tools and defining the limits of taxonomic classification methods. The code for Mottle is available at https://github.com/tphoward/Mottle_Repo.
Affiliation(s)
- Alisa Prusokiene
  - Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
- Neil Boonham
  - Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
- Adrian Fox
  - Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
  - Fera Ltd., Biotech Campus, York, United Kingdom
- Thomas P. Howard
  - Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
7
Bonnie JK, Ahmed OY, Langmead B. DandD: Efficient measurement of sequence growth and similarity. iScience 2024; 27:109054. [PMID: 38361606 PMCID: PMC10867639 DOI: 10.1016/j.isci.2024.109054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2023] [Revised: 01/11/2024] [Accepted: 01/23/2024] [Indexed: 02/17/2024] Open
Abstract
Genome assembly databases are growing rapidly. The redundancy of sequence content between a new assembly and previous ones is neither conceptually nor algorithmically easy to measure. We introduce pertinent methods and DandD, a tool addressing how much new sequence is gained when a sequence collection grows. DandD can describe how much structural variation is discovered in each new human genome assembly and when discoveries will level off in the future. DandD uses a measure called δ ("delta"), developed initially for data compression and chiefly dependent on k-mer counts. DandD rapidly estimates δ using genomic sketches. We propose δ as an alternative to k-mer-specific cardinalities when computing the Jaccard coefficient, thereby avoiding the pitfalls of a poor choice of k. We demonstrate the utility of DandD's functions for estimating δ, characterizing the rate of pangenome growth, and computing all-pairs similarities using k-independent Jaccard.
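To make the quantities in the abstract above concrete, the following Python sketch computes an exact k-mer Jaccard coefficient and an exact δ for short strings. It takes δ to be the maximum over k of the number of distinct k-mers divided by k, which is the definition used in the data-compression literature; the abstract does not spell it out, so treat that as an assumption. DandD estimates δ from sketches, which this example does not attempt, and the function names are illustrative.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Exact Jaccard coefficient between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def delta(seq):
    """Assumed definition: max over k of (number of distinct k-mers) / k."""
    if not seq:
        return 0.0
    return max(len(kmers(seq, k)) / k for k in range(1, len(seq) + 1))


if __name__ == "__main__":
    x, y = "ACGTACGTGG", "ACGTACGAGG"
    for k in (2, 4, 6):
        print(k, round(jaccard(x, y, k), 3))  # the value shifts with k
    print(delta(x), delta(y))
```

The loop over k shows why a single fixed k can be a poor choice: the same pair of sequences yields different Jaccard values as k changes, whereas δ summarizes the k-mer counts across all k in one number.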
Affiliation(s)
- Jessica K. Bonnie
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Omar Y. Ahmed
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
8
Pal J, Ghosh S, Maji B, Bhattacharya DK. MMV method: a new approach to compare protein sequences under binary representation. J Biomol Struct Dyn 2024:1-7. [PMID: 38375605 DOI: 10.1080/07391102.2024.2317982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Accepted: 02/07/2024] [Indexed: 02/21/2024]
Abstract
In the present work, a new form of descriptor using minimal moment vector (MMV) is introduced to compare protein sequences in the frequency domain under their component-wise binary representations. From every sequence, 20 different binary component sequences are formed, one for each of the 20 amino acids. Each such vector is then shifted from the time domain to the frequency domain by applying the Fast Fourier Transform (FFT). Next, the power spectrum calculated from the FFT values for each component sequence is normalized so that the sum of the components equals 1. The descriptor is defined as a 20-component vector composed of the 20 second-order minimal moments calculated from the normalized spectrum of the 20 component sequences. Once the descriptor is known, the distance matrix is created by applying the Euclidean distance measure. The phylogenetic tree is generated by applying the unweighted pair group method with arithmetic mean (UPGMA) algorithm using Molecular Evolutionary Genetics Analysis 11 (MEGA11) software. In this work, the datasets used for similarity studies are 9 NADH dehydrogenase 5 (ND5) proteins, 12 baculoviruses, 24 transferrin (TF) proteins, and 50 coronavirus spike proteins. A qualitative measure using rationalized perception is used to compare the effectiveness of the proposed method. A quantitative measure based on symmetric distance (SD) is used to compare the phylogenetic trees of the present method with those obtained by other methods. It is observed that the phylogenetic trees generated by the proposed technique are on par with their known biological references, and they produce better results than those of the earlier methods. Communicated by Ramaswamy H. Sarma.
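The pipeline in the abstract above (binary indicator sequences, Fourier transform, normalized power spectrum, one moment per amino acid, Euclidean distance) can be sketched in Python as follows. Two points are assumptions made for this example: the abstract does not define the "minimal moment", so the sketch substitutes a plain second-order moment (the sum of squared normalized spectrum values), and it uses a direct discrete Fourier transform instead of an FFT library to stay dependency-free. Function names are illustrative.

```python
import cmath
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def power_spectrum(bits):
    """Power spectrum of a binary sequence via a direct DFT (O(n^2))."""
    n = len(bits)
    spectrum = []
    for f in range(n):
        coeff = sum(b * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t, b in enumerate(bits))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def descriptor(seq):
    """20-component vector: one spectral moment per amino-acid indicator sequence."""
    vec = []
    for aa in AMINO_ACIDS:
        bits = [1 if residue == aa else 0 for residue in seq]
        spectrum = power_spectrum(bits)
        total = sum(spectrum)
        if total == 0:
            vec.append(0.0)  # amino acid absent from the sequence
            continue
        normalized = [value / total for value in spectrum]  # sums to 1
        # Stand-in moment (assumption): the paper's "minimal moment" is not
        # defined in the abstract, so a plain second-order moment is used.
        vec.append(sum(p * p for p in normalized))
    return vec


def euclidean(u, v):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    print(euclidean(descriptor("MKVLAAGIV"), descriptor("MKVLSAGIT")))
```

Because every sequence maps to a fixed-length 20-component vector, sequences of different lengths can be compared directly, and the resulting distance matrix can be passed to any UPGMA implementation.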
Affiliation(s)
- Jayanta Pal
  - Department of ECE, National Institute of Technology, Durgapur, India
  - Department of CSE, Narula Institute of Technology, Kolkata, India
- Soumen Ghosh
  - Department of ECE, National Institute of Technology, Durgapur, India
  - Department of IT, Narula Institute of Technology, Kolkata, India
- Bansibadan Maji
  - Department of ECE, National Institute of Technology, Durgapur, India
9
Fruzangohar M, Moolhuijzen P, Bakaj N, Taylor J. CoreDetector: a flexible and efficient program for core-genome alignment of evolutionary diverse genomes. Bioinformatics 2023; 39:btad628. [PMID: 37878789 PMCID: PMC10663985 DOI: 10.1093/bioinformatics/btad628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2023] [Revised: 09/20/2023] [Accepted: 10/23/2023] [Indexed: 10/27/2023] Open
Abstract
MOTIVATION Whole genome alignment of eukaryote species remains an important method for the determination of sequence and structural variations and can also be used to ascertain the representative non-redundant core-genome sequence of a population. Many whole genome alignment tools were first developed for the more mature analysis of prokaryote species with few current tools containing the functionality to process larger genomes of eukaryotes as well as genomes of more divergent species. In addition, the functionality of these tools becomes computationally prohibitive due to the significant compute resources needed to handle larger genomes. RESULTS In this research, we present CoreDetector, an easy-to-use general-purpose program that can align the core-genome sequences for a range of genome sizes and divergence levels. To illustrate the flexibility of CoreDetector, we conducted alignments of a large set of closely related fungal pathogen and hexaploid wheat cultivar genomes as well as more divergent fly and rodent species genomes. In all cases, compared to existing multiple genome alignment tools, CoreDetector exhibited improved flexibility, efficiency, and competitive accuracy in tested cases. AVAILABILITY AND IMPLEMENTATION CoreDetector was developed in the cross platform, and easily deployable, Java language. A packaged pipeline is readily executable in a bash terminal without any external need for Perl or Python environments. Installation, example data, and usage instructions for CoreDetector are freely available from https://github.com/mfruzan/CoreDetector.
Affiliation(s)
- Mario Fruzangohar
  - The Biometry Hub, School of Agriculture, Food and Wine, University of Adelaide, Urrbrae, South Australia 5064, Australia
- Paula Moolhuijzen
  - Centre for Crop Disease Management, School of Molecular and Life Sciences, Curtin University, Bentley, Western Australia 6102, Australia
- Nicolette Bakaj
  - The Biometry Hub, School of Agriculture, Food and Wine, University of Adelaide, Urrbrae, South Australia 5064, Australia
- Julian Taylor
  - The Biometry Hub, School of Agriculture, Food and Wine, University of Adelaide, Urrbrae, South Australia 5064, Australia
10
Wei ZG, Chen X, Zhang XD, Zhang H, Fan XG, Gao HY, Liu F, Qian Y. Comparison of Methods for Biological Sequence Clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:2874-2888. [PMID: 37028305 DOI: 10.1109/tcbb.2023.3253138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Recent advances in sequencing technology have considerably promoted genomics research by providing high-throughput sequencing economically. This great advancement has resulted in a huge amount of sequencing data. Clustering analysis is a powerful way to study and probe large-scale sequence data. A number of clustering methods have been developed in the last decade. Despite numerous comparison studies being published, we noticed that they have two main limitations: only traditional alignment-based clustering methods are compared, and the evaluation metrics rely heavily on labeled sequence data. In this study, we present a comprehensive benchmark study for sequence clustering methods. Specifically, i) alignment-based clustering algorithms including classical (e.g., CD-HIT, UCLUST, VSEARCH) and recently proposed methods (e.g., MMseqs2, Linclust, edClust) are assessed; ii) two alignment-free methods (LZW-Kernel and Mash) are included to compare with alignment-based methods; and iii) different evaluation measures based on the true labels (supervised metrics) and the input data itself (unsupervised metrics) are applied to quantify their clustering results. The aims of this study are to help biological analysts choose a reasonable clustering algorithm for processing their collected sequences and, furthermore, to motivate algorithm designers to develop more efficient sequence clustering approaches.
11
Guerrini V, Conte A, Grossi R, Liti G, Rosone G, Tattini L. phyBWT2: phylogeny reconstruction via eBWT positional clustering. Algorithms Mol Biol 2023; 18:11. [PMID: 37537624 PMCID: PMC10399073 DOI: 10.1186/s13015-023-00232-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Accepted: 06/10/2023] [Indexed: 08/05/2023] Open
Abstract
BACKGROUND Molecular phylogenetics studies the evolutionary relationships among the individuals of a population through their biological sequences. It may provide insights about the origin and the evolution of viral diseases, or highlight complex evolutionary trajectories. A key task is inferring phylogenetic trees from any type of sequencing data, including raw short reads. Yet, several tools require pre-processed input data, e.g. from complex computational pipelines based on de novo assembly or from mappings against a reference genome. As sequencing technologies keep becoming cheaper, this puts increasing pressure on designing methods that perform analysis directly on their outputs. From this viewpoint, there is a growing interest in alignment-, assembly-, and reference-free methods that could work on several kinds of data, including raw reads. RESULTS We present phyBWT2, a newly improved version of phyBWT (Guerrini et al. in 22nd International Workshop on Algorithms in Bioinformatics (WABI) 242:23:1-23:19, 2022). Both of them directly reconstruct phylogenetic trees bypassing both the alignment against a reference genome and de novo assembly. They exploit the combinatorial properties of the extended Burrows-Wheeler Transform (eBWT) and the corresponding eBWT positional clustering framework to detect relevant blocks of the longest shared substrings of varying length (unlike the k-mer-based approaches that need to fix the length k a priori). As a result, they provide novel alignment-, assembly-, and reference-free methods that build partition trees without relying on the pairwise comparison of sequences, thus avoiding the use of a distance matrix to infer phylogeny. In addition, phyBWT2 outperforms phyBWT in terms of running time, as the former reconstructs phylogenetic trees step-by-step by considering multiple partitions, instead of just one partition at a time, as previously done by the latter.
CONCLUSIONS Based on the results of the experiments on sequencing data, we conclude that our method can produce trees of quality comparable to the benchmark phylogeny by handling datasets of different types (short reads, contigs, or entire genomes). Overall, the experiments confirm the effectiveness of phyBWT2 that improves the performance of its previous version phyBWT, while preserving the accuracy of the results.
Affiliation(s)
- Alessio Conte
  - Dipartimento di Informatica, University of Pisa, Pisa, Italy
- Roberto Grossi
  - Dipartimento di Informatica, University of Pisa, Pisa, Italy
- Gianni Liti
  - CNRS UMR 7284, INSERM U1081 Université Côte d'Azur, Nice, France
- Giovanna Rosone
  - Dipartimento di Informatica, University of Pisa, Pisa, Italy
- Lorenzo Tattini
  - CNRS UMR 7284, INSERM U1081 Université Côte d'Azur, Nice, France
12
de Andrade AAS, Grivet M, Brustolini O, Vasconcelos ATR. (m, n)-mer - a simple statistical feature for sequence classification. Bioinformatics Advances 2023; 3:vbad088. [PMID: 37448814 PMCID: PMC10338135 DOI: 10.1093/bioadv/vbad088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/11/2023] [Revised: 06/22/2023] [Accepted: 07/10/2023] [Indexed: 07/15/2023]
Abstract
Summary The (m, n)-mer is a simple alternative classification feature based on conditional probability distributions. In this application note, we compared k-mer and (m, n)-mer frequency features in 11 distinct datasets used for binary, multiclass and clustering classifications. Our findings show that the (m, n)-mer frequency features are related to the highest performance metrics and often statistically outperformed the k-mers. Here, the (m, n)-mer frequencies improved performance for classifying smaller sequence lengths (as short as 300 bp) and yielded higher metrics when using short values of k (ranging from 2 to 4). Therefore, we present the (m, n)-mers frequencies to the scientific community as a feature that seems to be quite effective in identifying complex discriminatory patterns and classifying polyphyletic sequence groups. Availability and implementation The (m, n)-mer algorithm is released as an R package within the CRAN project (https://cran.r-project.org/web/packages/mnmer) and is also available at https://github.com/labinfo-lncc/mnmer. Supplementary information Supplementary data are available at Bioinformatics Advances online.
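The abstract above describes the (m, n)-mer only as a feature based on conditional probability distributions. One plausible reading, assumed here, is the frequency of each n-mer conditioned on the m-mer that immediately precedes it, with m + n playing the role of k. The Python sketch below follows that reading. It is not the API of the `mnmer` R package, and the function name is illustrative.

```python
from collections import Counter


def mn_mer_frequencies(seq, m, n):
    """Estimate P(next n-mer | preceding m-mer) from overlapping windows of seq."""
    joint = Counter()    # counts of (m-mer, following n-mer) pairs
    context = Counter()  # counts of each m-mer seen as a window prefix
    for i in range(len(seq) - (m + n) + 1):
        prefix = seq[i:i + m]
        suffix = seq[i + m:i + m + n]
        joint[(prefix, suffix)] += 1
        context[prefix] += 1
    return {pair: count / context[pair[0]] for pair, count in joint.items()}


if __name__ == "__main__":
    for pair, freq in sorted(mn_mer_frequencies("ACACAG", 1, 1).items()):
        print(pair, round(freq, 3))
```

Unlike plain k-mer frequencies, which sum to 1 over all k-mers, these values sum to 1 within each m-mer context, so they describe how a sequence continues after a given prefix.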
Affiliation(s)
- Amanda Araújo Serrão de Andrade
  - Bioinformatics Laboratory (LABINFO), National Laboratory for Scientific Computing, Av. Getulio Vargas, 333 - Quitandinha, 25651-076, Rio de Janeiro, Brazil
- Marco Grivet
  - Pontifícia Universidade Católica do Rio de Janeiro, Rua Marquês de São Vicente 225, Gávea, 22451-900, Rio de Janeiro, Brazil
- Otávio Brustolini
  - Bioinformatics Laboratory (LABINFO), National Laboratory for Scientific Computing, Av. Getulio Vargas, 333 - Quitandinha, 25651-076, Rio de Janeiro, Brazil
13
|
Garza AB, Garcia R, Halfon MS, Girgis HZ. Evaluation of metric and representation learning approaches: Effects of representations driven by relative distance on the performance. 2023 INTELLIGENT METHODS, SYSTEMS, AND APPLICATIONS 2023; 2023:545-550. [PMID: 37822849 PMCID: PMC10566582 DOI: 10.1109/imsa58542.2023.10217475] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/13/2023]
Abstract
Several deep neural network architectures have emerged recently for metric learning. We asked which architecture is the most effective in measuring the similarity or dissimilarity among images. To this end, we evaluated six networks on a standard image set. We evaluated variational autoencoders, Siamese networks, triplet networks, and variational auto-encoders combined with Siamese or triplet networks. These networks were compared to a baseline network consisting of multiple separable convolutional layers. Our study revealed the following: (i) the triplet architecture proved the most effective one due to learning a relative distance - not an absolute distance; (ii) combining auto-encoders with networks that learn metrics (e.g., Siamese or triplet networks) is unwarranted; and (iii) an architecture based on separable convolutional layers is a reasonable simple alternative to triplet networks. These results can potentially impact our field by encouraging architects to develop advanced networks that take advantage of separable convolution and relative distance.
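The distinction between a relative and an absolute distance can be illustrated with the two standard loss functions usually associated with these architectures. The sketch below is a plain-Python illustration, assuming Euclidean distance and a margin of 1.0; the abstract does not state which distance, margin, or loss variants the authors used, so all of those are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: only the difference d(a, p) - d(a, n) matters."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


def contrastive_loss(a, b, same_class, margin=1.0):
    """Pairwise (Siamese-style) loss: penalises the absolute distance."""
    d = euclidean(a, b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


if __name__ == "__main__":
    anchor, positive = (0.0, 0.0), (0.0, 1.0)
    print(triplet_loss(anchor, positive, (3.0, 4.0)))   # negative is far away
    print(triplet_loss(anchor, positive, (0.0, 1.5)))   # negative is too close
    print(contrastive_loss(anchor, positive, True))
```

The triplet loss is zero whenever the negative is at least `margin` farther from the anchor than the positive, regardless of how large either distance is on its own; the pairwise loss keeps pushing same-class pairs toward distance zero.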
Affiliation(s)
- Anthony B. Garza
  - Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX, USA
- Rolando Garcia
  - Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX, USA
- Marc S. Halfon
  - Department of Biochemistry, State University of New York at Buffalo, Buffalo, NY, USA
- Hani Z. Girgis
  - Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX, USA
|
14
|
Schmidt S, Khan S, Alanko JN, Pibiri GE, Tomescu AI. Matchtigs: minimum plain text representation of k-mer sets. Genome Biol 2023; 24:136. [PMID: 37296461 PMCID: PMC10251615 DOI: 10.1186/s13059-023-02968-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2021] [Accepted: 05/10/2023] [Indexed: 06/12/2023] Open
Abstract
We propose a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.
Affiliation(s)
- Sebastian Schmidt
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
- Shahbaz Khan
  - Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, India
- Jarno N. Alanko
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
  - Faculty of Computer Science, Dalhousie University, Halifax, Canada
- Giulio E. Pibiri
  - Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, Venice, Italy
  - ISTI-CNR, Pisa, Italy
|
15
|
Nawaz MS, Fournier-Viger P, He Y, Zhang Q. PSAC-PDB: Analysis and classification of protein structures. Comput Biol Med 2023; 158:106814. [PMID: 36989742 DOI: 10.1016/j.compbiomed.2023.106814] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2022] [Revised: 03/09/2023] [Accepted: 03/20/2023] [Indexed: 03/29/2023]
Abstract
This paper presents a novel framework, called PSAC-PDB, for analyzing and classifying protein structures from the Protein Data Bank (PDB). PSAC-PDB first finds, analyzes, and identifies protein structures in PDB that are similar to a protein structure of interest using a protein structure comparison tool. Second, the amino acid (AA) sequences of the identified protein structures (obtained from PDB), their aligned amino acids (AAA) and aligned secondary structure elements (ASSE) (obtained by structural alignment), and frequent AA (FAA) patterns (discovered by sequential pattern mining) are used for the reliable detection/classification of protein structures. Eleven classifiers are used and their performance is compared using six evaluation metrics. Results show that three classifiers perform well overall, and that FAA patterns can be used to efficiently classify protein structures in place of providing the whole AA sequences, AAA or ASSE. Furthermore, better classification results are obtained using AAA of protein structures rather than AA sequences. PSAC-PDB also performed better than state-of-the-art approaches for SARS-CoV-2 genome sequences classification.
|
16
|
Redshaw J, Ting DSJ, Brown A, Hirst JD, Gärtner T. Krein support vector machine classification of antimicrobial peptides. DIGITAL DISCOVERY 2023; 2:502-511. [PMID: 37065679 PMCID: PMC10087059 DOI: 10.1039/d3dd00004d] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/19/2023] [Accepted: 02/22/2023] [Indexed: 03/02/2023]
Abstract
Antimicrobial peptides (AMPs) represent a potential solution to the growing problem of antimicrobial resistance, yet their identification through wet-lab experiments is a costly and time-consuming process. Accurate computational predictions would allow rapid in silico screening of candidate AMPs, thereby accelerating the discovery process. Kernel methods are a class of machine learning algorithms that utilise a kernel function to transform input data into a new representation. When appropriately normalised, the kernel function can be regarded as a notion of similarity between instances. However, many expressive notions of similarity are not valid kernel functions, meaning they cannot be used with standard kernel methods such as the support-vector machine (SVM). The Kreĭn-SVM represents a generalisation of the standard SVM that admits a much larger class of similarity functions. In this study, we propose and develop Kreĭn-SVM models for AMP classification and prediction by employing the Levenshtein distance and local alignment score as sequence similarity functions. Utilising two datasets from the literature, each containing more than 3000 peptides, we train models to predict general antimicrobial activity. Our best models achieve an AUC of 0.967 and 0.863 on the test sets of each respective dataset, outperforming the in-house and literature baselines in both cases. We also curate a dataset of experimentally validated peptides, measured against Staphylococcus aureus and Pseudomonas aeruginosa, in order to evaluate the applicability of our methodology in predicting microbe-specific activity. In this case, our best models achieve an AUC of 0.982 and 0.891, respectively. Models to predict both general and microbe-specific activities are made available as web applications.
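As an illustration of one of the two similarity functions named in the abstract, the following Python sketch computes the Levenshtein distance with the usual dynamic-programming recurrence and builds a pairwise similarity matrix from it. Using the negated distance as the similarity is an assumption made here for illustration; the abstract does not say how the authors convert distances into similarities, and the Kreĭn-SVM itself is not implemented.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]


def similarity_matrix(peptides):
    """Pairwise similarity as negated edit distance (illustrative choice)."""
    return [[-levenshtein(p, q) for q in peptides] for p in peptides]


if __name__ == "__main__":
    # Hypothetical short peptide strings, used only as example input.
    peptides = ["GLFDIVKK", "GLFDIIKK", "KWKLFKKI"]
    for row in similarity_matrix(peptides):
        print(row)
```

A matrix built this way is symmetric but is not guaranteed to be positive semidefinite, which is the situation the abstract describes: the similarity is expressive but not a valid kernel for a standard SVM.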
Affiliation(s)
- Joseph Redshaw
  - School of Chemistry, University of Nottingham, University Park Nottingham NG7 2RD UK
- Darren S J Ting
  - Academic Ophthalmology, School of Medicine, University of Nottingham Nottingham NG7 2UH UK
  - Academic Unit of Ophthalmology, Institute of Inflammation and Ageing, University of Birmingham Birmingham UK
  - Birmingham and Midland Eye Centre Birmingham UK
- Alex Brown
  - Artificial Intelligence and Machine Learning, GSK Medicines Research Centre Gunnels Wood Road Stevenage SG1 2NY UK
- Jonathan D Hirst
  - School of Chemistry, University of Nottingham, University Park Nottingham NG7 2RD UK
- Thomas Gärtner
  - Machine Learning Group, TU Wien Informatics Vienna Austria
|
17
|
Simons AL, Theroux S, Osborne M, Nuzhdin S, Mazor R, Steele J. Zeta diversity patterns in metabarcoded lotic algal assemblages as a tool for bioassessment. ECOLOGICAL APPLICATIONS : A PUBLICATION OF THE ECOLOGICAL SOCIETY OF AMERICA 2023; 33:e2812. [PMID: 36708145 DOI: 10.1002/eap.2812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/21/2022] [Revised: 12/07/2022] [Accepted: 12/20/2022] [Indexed: 06/18/2023]
Abstract
Assessments of the ecological health of algal assemblages in streams typically focus on measures of their local diversity and classify individuals by morphotaxonomy. Such assemblages are often connected through various ecological processes, such as dispersal, and may be more accurately assessed as components of regional-, rather than local-scale assemblages. With recent declines in the costs of sequencing and computation, it has also become increasingly feasible to use metabarcoding to more accurately classify algal species and perform regional-scale bioassessments. Recently, zeta diversity has been explored as a novel method of constructing regional bioassessments for groups of streams. Here, we model the use of zeta diversity to investigate whether stream health can be determined by the landscape diversity of algal assemblages. We also compare the use of DNA metabarcoding and morphotaxonomy classifications in these zeta diversity-based bioassessments of regional stream health. From 96 stream samples in California, we used various orders of zeta diversity to construct models of biotic integrity for multiple assemblages of diatoms, as well as hybrid assemblages of diatoms in combination with soft-bodied algae, using taxonomy data generated with both DNA sequencing as well as traditional morphotaxonomic approaches. We compared our ability to evaluate the ecological health of streams with the performance of multiple algal indices of biological condition. Our zeta diversity-based models of regional biotic integrity were more strongly correlated with existing indices for algal assemblages classified using metabarcoding compared to morphotaxonomy. Metabarcoding for diatoms and hybrid algal assemblages involved rbcL and 18S V9 primers, respectively. Importantly, we also found that these algal assemblages, independent of the classification method, are more likely to be assembled under a process of niche differentiation rather than stochastically. 
Taken together, these results suggest the potential for zeta diversity patterns of algal assemblages classified using metabarcoding to inform stream bioassessments.
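For readers unfamiliar with the measure, zeta diversity of order i is commonly defined as the mean number of taxa shared by every combination of i sites. The Python sketch below computes it exhaustively under that definition; the function name `zeta_diversity` and the toy site data are hypothetical, and the exhaustive enumeration is a simplification that would need subsampling for many sites.

```python
from itertools import combinations


def zeta_diversity(site_taxa, order):
    """Mean number of taxa shared across all combinations of `order` sites.

    site_taxa: iterable of collections of taxon names, one per site.
    """
    sites = [set(taxa) for taxa in site_taxa]
    combos = list(combinations(sites, order))
    if not combos:
        raise ValueError("order must be between 1 and the number of sites")
    shared_counts = [len(set.intersection(*combo)) for combo in combos]
    return sum(shared_counts) / len(combos)


if __name__ == "__main__":
    # Toy presence data for three hypothetical stream samples.
    streams = [
        {"diatom_a", "diatom_b", "diatom_c"},
        {"diatom_b", "diatom_c", "diatom_d"},
        {"diatom_c", "diatom_d", "diatom_e"},
    ]
    for order in (1, 2, 3):
        print(order, zeta_diversity(streams, order))
```

Order 1 reduces to mean richness per site, and higher orders decline as fewer taxa are shared by larger groups of sites; the shape of that decline is what zeta-based analyses examine.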
Affiliation(s)
- Ariel Levi Simons
  - Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Susanna Theroux
  - Southern California Coastal Water Research Project, Costa Mesa, California, USA
- Melisa Osborne
  - Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Sergey Nuzhdin
  - Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Raphael Mazor
  - Southern California Coastal Water Research Project, Costa Mesa, California, USA
- Joshua Steele
  - Southern California Coastal Water Research Project, Costa Mesa, California, USA
|
18
|
In silico environmental sampling of emerging fungal pathogens via big data analysis. FUNGAL ECOL 2023. [DOI: 10.1016/j.funeco.2022.101212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
|
19
|
de Souza LC, Azevedo KS, de Souza JG, Barbosa RDM, Fernandes MAC. New proposal of viral genome representation applied in the classification of SARS-CoV-2 with deep learning. BMC Bioinformatics 2023; 24:92. [PMID: 36906520 PMCID: PMC10007673 DOI: 10.1186/s12859-023-05188-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2022] [Accepted: 02/15/2023] [Indexed: 03/13/2023] Open
Abstract
BACKGROUND: In December 2019, the first case of COVID-19 was described in Wuhan, China, and by July 2022, there were already 540 million confirmed cases. Due to the rapid spread of the virus, the scientific community has made efforts to develop techniques for the viral classification of SARS-CoV-2. RESULTS: In this context, we developed a new proposal for gene sequence representation based on Genomic Signal Processing techniques. First, we applied the mapping approach to samples of six viral species of the Coronaviridae family, to which SARS-CoV-2 belongs. We then fed the downsized sequences obtained by the proposed method into a deep learning architecture for viral classification, achieving an accuracy of 98.35%, 99.08%, and 99.69% for viral signatures of size 64, 128, and 256, respectively, and obtaining 99.95% precision for the vectors of size 256. CONCLUSIONS: The classification results obtained, in comparison to the results produced using other state-of-the-art representation techniques, demonstrate that the proposed mapping can provide satisfactory performance with low memory and processing time costs.
Affiliation(s)
- Luísa C. de Souza
  - Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
- Karolayne S. Azevedo
  - Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
- Jackson G. de Souza
  - Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
- Raquel de M. Barbosa
  - Department of Pharmacy and Pharmaceutical Technology, University of Granada, Granada, Spain
- Marcelo A. C. Fernandes
  - Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
  - Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
  - Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
|
20
|
Raiyemo DA, Bobadilla LK, Tranel PJ. Genomic profiling of dioecious Amaranthus species provides novel insights into species relatedness and sex genes. BMC Biol 2023; 21:37. [PMID: 36804015 PMCID: PMC9940365 DOI: 10.1186/s12915-023-01539-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2022] [Accepted: 02/08/2023] [Indexed: 02/21/2023] Open
Abstract
BACKGROUND: Amaranthus L. is a diverse genus consisting of domesticated, weedy, and non-invasive species distributed around the world. Nine species are dioecious, of which Amaranthus palmeri S. Watson and Amaranthus tuberculatus (Moq.) J.D. Sauer are troublesome weeds of agronomic crops in the USA and elsewhere. Shallow relationships among the dioecious Amaranthus species and the conservation of candidate genes within previously identified A. palmeri and A. tuberculatus male-specific regions of the Y (MSYs) in other dioecious species are poorly understood. In this study, seven genomes of dioecious amaranths were obtained by paired-end short-read sequencing and combined with short reads of seventeen species in the family Amaranthaceae from the NCBI database. The species were phylogenomically analyzed to understand their relatedness. Genome characteristics for the dioecious species were evaluated and coverage analysis was used to investigate the conservation of sequences within the MSY regions. RESULTS: We provide genome size, heterozygosity, and ploidy level inference for seven newly sequenced dioecious Amaranthus species and two additional dioecious species from the NCBI database. We report a pattern of transposable element proliferation in the species, in which seven species had more Ty3 elements than copia elements while A. palmeri and A. watsonii had more copia elements than Ty3 elements, similar to the TE pattern in some monoecious amaranths. Using a Mash-based phylogenomic analysis, we accurately recovered taxonomic relationships among the dioecious Amaranthus species that were previously identified based on comparative morphology. Coverage analysis revealed eleven candidate gene models within the A. palmeri MSY region with male-enriched coverages, as well as regions on scaffold 19 with female-enriched coverage, based on A. watsonii read alignments. A previously reported FLOWERING LOCUS T (FT) within the A. tuberculatus MSY contig was also found to exhibit male-enriched coverages for three species closely related to A. tuberculatus but not for A. watsonii reads. Additional characterization of the A. palmeri MSY region revealed that 78% of the region is made of repetitive elements, typical of a sex determination region with reduced recombination. CONCLUSIONS: The results of this study further increase our understanding of the relationships among the dioecious species of the Amaranthus genus and reveal genes with potential roles in sex function in the species.
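The Mash-based analysis mentioned in the abstract rests on MinHash sketches of k-mer sets and on the Mash distance, D = -(1/k) ln(2j / (1 + j)), where j is the estimated Jaccard index. The Python sketch below illustrates the idea only: it hashes with SHA-1 and ignores reverse complements, whereas the real Mash tool uses its own hash function and canonical k-mers, and the k and sketch size chosen here are arbitrary assumptions.

```python
import hashlib
import math


def kmers(seq, k):
    """Set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k, size):
    """Bottom-`size` MinHash sketch: the smallest hash values of the k-mers."""
    hashes = sorted(
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmers(seq, k)
    )
    return hashes[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate the Jaccard index from two bottom-k sketches.

    Assumes both sequences are at least k long, so the sketches are non-empty.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Mash distance from a Jaccard estimate; 1.0 when nothing is shared."""
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k


if __name__ == "__main__":
    a = "ACGTACGTTGCATGCAAGCTTAGC"
    b = "ACGTACGTTGCATGCAAGCTTAGG"
    k, size = 5, 16
    j = jaccard_estimate(sketch(a, k, size), sketch(b, k, size), size)
    print(j, mash_distance(j, k))
```

A pairwise matrix of such distances is what a distance-based tree method would then take as input.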
Affiliation(s)
- Damilola A Raiyemo
  - Department of Crop Sciences, University of Illinois, Urbana, IL, 61801, USA
- Lucas K Bobadilla
  - Department of Crop Sciences, University of Illinois, Urbana, IL, 61801, USA
- Patrick J Tranel
  - Department of Crop Sciences, University of Illinois, Urbana, IL, 61801, USA.
|
21
|
Bonnie JK, Ahmed O, Langmead B. DandD: efficient measurement of sequence growth and similarity. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.02.02.526837. [PMID: 36778393 PMCID: PMC9915590 DOI: 10.1101/2023.02.02.526837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
Genome assembly databases are growing rapidly. The sequence content in each new assembly can be largely redundant with previous ones, but this is neither conceptually nor algorithmically easy to measure. We propose new methods and a new tool called DandD that addresses the question of how much new sequence is gained when a sequence collection grows. DandD can describe how much human structural variation is being discovered in each new human genome assembly and when discoveries will level off in the future. DandD uses a measure called δ ("delta"), developed initially for data compression. Computing δ directly requires counting k-mers, but DandD can rapidly estimate it using genomic sketches. We also propose δ as an alternative to k-mer-specific cardinalities when computing the Jaccard coefficient, avoiding the pitfalls of a poor choice of k. We demonstrate the utility of DandD's functions for estimating δ, characterizing the rate of pangenome growth, and computing all-pairs similarities using k-independent Jaccard. DandD is open source software available at: https://github.com/jessicabonnie/dandd.
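The two quantities the abstract contrasts can be computed exactly on small inputs. The Python sketch below assumes the definition of δ used in the data-compression literature, δ = max over k of d_k / k, where d_k is the number of distinct k-mers; whether DandD applies further conventions (such as canonical k-mers) is not stated in the abstract, and the exact counting here stands in for the sketch-based estimation the tool is described as using.

```python
def distinct_kmers(seqs, k):
    """Set of distinct k-mers across a collection of sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def delta(seqs, max_k=None):
    """Exact delta = max_k d_k / k (assumed definition, see lead-in)."""
    if max_k is None:
        max_k = max(len(s) for s in seqs)
    return max(len(distinct_kmers(seqs, k)) / k for k in range(1, max_k + 1))


def kmer_jaccard(a, b, k):
    """Jaccard coefficient of the k-mer sets of two sequences, for one fixed k."""
    ka, kb = distinct_kmers([a], k), distinct_kmers([b], k)
    return len(ka & kb) / len(ka | kb)


if __name__ == "__main__":
    collection = ["ACGTACGT"]
    print(delta(collection))
    # Adding a redundant sequence leaves delta unchanged.
    print(delta(collection + ["ACGTACGT"]))
    # The fixed-k Jaccard value depends on the choice of k.
    for k in (2, 3):
        print(k, kmer_jaccard("ACGTAC", "ACGAAC", k))
```

Because the k-mer Jaccard value shifts with k, a measure that maximises over k removes that tuning decision, which is the motivation the abstract gives.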
Affiliation(s)
- Omar Ahmed
  - Department of Computer Science, Johns Hopkins University
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University
|
22
|
Tang R, Yu Z, Li J. KINN: An alignment-free accurate phylogeny reconstruction method based on inner distance distributions of k-mer pairs in biological sequences. Mol Phylogenet Evol 2023; 179:107662. [PMID: 36375789 DOI: 10.1016/j.ympev.2022.107662] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2022] [Revised: 10/10/2022] [Accepted: 11/02/2022] [Indexed: 11/13/2022]
Abstract
Alignment-based methods have faced disadvantages in sequence comparison and phylogeny reconstruction due to their high computational complexity. Alignment-free methods for sequence comparison and phylogeny inference have attracted a great deal of attention in recent years. Here, we explore an alignment-free approach that uses inner distance distributions of k-mer pairs in biological sequences for phylogeny inference. For every sequence in a dataset, our method transforms the sequence into a numeric feature vector consisting of features each representing a specific k-mer pair's contribution to the characterization of the sequentiality uniqueness of the sequence. This newly defined k-mer pair's contribution is an integration of the reverse Kullback-Leibler divergence, pseudo mode and the classic entropy of an inner distance distribution of the k-mer pair in the sequence. Our method has been tested on datasets of complete genome sequences, complete protein sequences, and gene sequences of rRNA of various lengths. Our method achieves the best performance in comparison with state-of-the-art alignment-free methods as measured by the Robinson-Foulds distance between the reference and the constructed phylogeny trees.
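A rough idea of an inner distance distribution can be given in a few lines of Python. The sketch below assumes that the inner distance for an ordered k-mer pair (u, v) is the gap from each occurrence of u to the next downstream occurrence of v, and it computes only the Shannon entropy of the resulting distribution. Both the distance definition and the function names are assumptions; the reverse Kullback-Leibler divergence and pseudo mode components mentioned in the abstract are not reproduced.

```python
import math
from bisect import bisect_right
from collections import Counter


def occurrences(seq, kmer):
    """Start positions of every (possibly overlapping) occurrence of kmer."""
    k = len(kmer)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer]


def inner_distances(seq, first, second):
    """Distance from each occurrence of `first` to the next start of `second`
    strictly downstream of it (assumed definition)."""
    second_starts = occurrences(seq, second)
    distances = []
    for i in occurrences(seq, first):
        j = bisect_right(second_starts, i)
        if j < len(second_starts):
            distances.append(second_starts[j] - i)
    return distances


def shannon_entropy(values):
    """Shannon entropy (bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    seq = "ACGACTTG"
    d = inner_distances(seq, "AC", "G")
    print(d, shannon_entropy(d))
```

Repeating this over all k-mer pairs would give one number per pair, which is the general shape of the numeric feature vector the abstract describes.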
Affiliation(s)
- Runbin Tang
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China; School of Mathematical Sciences, Chongqing Normal University, Chongqing 401331, China
- Zuguo Yu
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China.
- Jinyan Li
  - Data Science Institute, University of Technology Sydney, Ultimo, NSW 2007, Australia.
|
23
|
Simmonds P, Adriaenssens EM, Zerbini FM, Abrescia NGA, Aiewsakun P, Alfenas-Zerbini P, Bao Y, Barylski J, Drosten C, Duffy S, Duprex WP, Dutilh BE, Elena SF, García ML, Junglen S, Katzourakis A, Koonin EV, Krupovic M, Kuhn JH, Lambert AJ, Lefkowitz EJ, Łobocka M, Lood C, Mahony J, Meier-Kolthoff JP, Mushegian AR, Oksanen HM, Poranen MM, Reyes-Muñoz A, Robertson DL, Roux S, Rubino L, Sabanadzovic S, Siddell S, Skern T, Smith DB, Sullivan MB, Suzuki N, Turner D, Van Doorslaer K, Vandamme AM, Varsani A, Vasilakis N. Four principles to establish a universal virus taxonomy. PLoS Biol 2023; 21:e3001922. [PMID: 36780432 PMCID: PMC9925010 DOI: 10.1371/journal.pbio.3001922] [Citation(s) in RCA: 37] [Impact Index Per Article: 37.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/15/2023] Open
Abstract
A universal taxonomy of viruses is essential for a comprehensive view of the virus world and for communicating the complicated evolutionary relationships among viruses. However, there are major differences in the conceptualisation and approaches to virus classification and nomenclature among virologists, clinicians, agronomists, and other interested parties. Here, we provide recommendations to guide the construction of a coherent and comprehensive virus taxonomy, based on expert scientific consensus. Firstly, assignments of viruses should be congruent with the best attainable reconstruction of their evolutionary histories, i.e., taxa should be monophyletic. This fundamental principle for classification of viruses is currently included in the International Committee on Taxonomy of Viruses (ICTV) code only for the rank of species. Secondly, phenotypic and ecological properties of viruses may inform, but not override, evolutionary relatedness in the placement of ranks. Thirdly, alternative classifications that consider phenotypic attributes, such as being vector-borne (e.g., "arboviruses"), infecting a certain type of host (e.g., "mycoviruses," "bacteriophages") or displaying specific pathogenicity (e.g., "human immunodeficiency viruses"), may serve important clinical and regulatory purposes but often create polyphyletic categories that do not reflect evolutionary relationships. Nevertheless, such classifications ought to be maintained if they serve the needs of specific communities or play a practical clinical or regulatory role. However, they should not be considered or called taxonomies. Finally, while an evolution-based framework enables viruses discovered by metagenomics to be incorporated into the ICTV taxonomy, there are essential requirements for quality control of the sequence data used for these assignments. 
Combined, these four principles will enable future development and expansion of virus taxonomy as the true evolutionary diversity of viruses becomes apparent.
Collapse
Affiliation(s)
- Peter Simmonds
- Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
| | | | - F. Murilo Zerbini
- Departamento de Fitopatologia/BIOAGRO, Universidade Federal de Viçosa, Viçosa, Brazil
| | - Nicola G. A. Abrescia
- Structure and Cell Biology of Viruses Lab, Center for Cooperative Research in Biosciences—BRTA, Derio, Spain
- Basque Foundation for Science, IKERBASQUE, Bilbao, Spain
| | - Pakorn Aiewsakun
- Department of Microbiology, Faculty of Science, Mahidol University, Bangkok, Thailand
| | | | - Yiming Bao
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Jakub Barylski
- Department of Molecular Virology, Adam Mickiewicz University, Poznan, Poland
| | - Christian Drosten
- Institute of Virology, Charité-Universitätsmedizin Berlin, corporate member of Free University Berlin, Humboldt University, Berlin, Germany
- Berlin Institute of Health, Berlin, Germany
| | - Siobain Duffy
- Department of Ecology, Evolution and Natural Resources, School of Environmental and Biological Sciences, Rutgers The State University of New Jersey, New Brunswick, New Jersey, United States of America
| | - W. Paul Duprex
- The Center for Vaccine Research, University of Pittsburgh School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Bas E. Dutilh
- Institute of Biodiversity, Faculty of Biological Sciences, Cluster of Excellence Balance of the Microverse, Friedrich-Schiller-University, Jena, Germany
- Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, the Netherlands
| | - Santiago F. Elena
- Instituto de Biología Integrativa de Sistemas (I2SysBio), CSIC-Universitat de València, Valencia, Spain
- Santa Fe Institute, Santa Fe, New Mexico, United States of America
| | - Maria Laura García
- Instituto de Biotecnología y Biología Molecular, CCT-La Plata, CONICET, UNLP, La Plata, Argentina
| | - Sandra Junglen
- Institute of Virology, Charité-Universitätsmedizin Berlin, corporate member of Free University Berlin, Humboldt University, Berlin, Germany
- Berlin Institute of Health, Berlin, Germany
| | - Aris Katzourakis
- Department of Biology, University of Oxford, Oxford, United Kingdom
| | - Eugene V. Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Mart Krupovic
- Institut Pasteur, Université Paris Cité, CNRS UMR6047, Archaeal Virology Unit, Paris, France
| | - Jens H. Kuhn
- Integrated Research Facility at Fort Detrick (IRF-Frederick), National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, Maryland, United States of America
| | - Amy J. Lambert
- Division of Vector-Borne Diseases, National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Fort Collins, Colorado, United States of America
| | - Elliot J. Lefkowitz
- Department of Microbiology, University of Alabama at Birmingham, Birmingham, Alabama, United States of America
| | - Małgorzata Łobocka
- Institute of Biochemistry and Biophysics of the Polish Academy of Sciences, Warsaw, Poland
| | - Cédric Lood
- Department of Biosystems, KU Leuven, Leuven, Belgium
| | - Jennifer Mahony
- School of Microbiology and APC Microbiome Ireland, University College Cork, Cork, Ireland
| | - Jan P. Meier-Kolthoff
- Department of Bioinformatics and Databases, Leibniz Institute DSMZ—German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany
| | - Arcady R. Mushegian
- Division of Molecular and Cellular Biosciences, National Science Foundation, Alexandria, Virginia, United States of America
| | - Hanna M. Oksanen
- Molecular and Integrative Biosciences Research Programme, Faculty of Biological and Environmental Sciences, University of Helsinki, Helsinki, Finland
| | - Minna M. Poranen
- Molecular and Integrative Biosciences Research Programme, Faculty of Biological and Environmental Sciences, University of Helsinki, Helsinki, Finland
| | - Alejandro Reyes-Muñoz
- Max Planck Tandem Group in Computational Biology, Departamento de Ciencias Biológicas, Universidad de los Andes, Bogotá, Colombia
| | - David L. Robertson
- MRC-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom
| | - Simon Roux
- Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
| | - Luisa Rubino
- Istituto per la Protezione Sostenibile delle Piante, CNR, UOS Bari, Bari, Italy
| | - Sead Sabanadzovic
- Department of Biochemistry, Molecular Biology, Entomology and Plant Pathology, Mississippi State University, Mississippi State, Mississippi, United States of America
| | - Stuart Siddell
- School of Cellular and Molecular Medicine, Faculty of Life Sciences, University of Bristol, Bristol, United Kingdom
| | - Tim Skern
- Medical University of Vienna, Max Perutz Labs, Vienna Biocenter, Vienna, Austria
| | - Donald B. Smith
- Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
| | - Matthew B. Sullivan
- Departments of Microbiology and Civil, Environmental, and Geodetic Engineering, Ohio State University, Columbus, Ohio, United States of America
| | - Nobuhiro Suzuki
- Institute of Plant Science and Resources, Okayama University, Kurashiki, Okayama, Japan
| | - Dann Turner
- School of Applied Sciences, College of Health, Science and Society, University of the West of England, Bristol, United Kingdom
| | - Koenraad Van Doorslaer
- School of Animal and Comparative Biomedical Sciences, Department of Immunobiology, BIO5 Institute, and University of Arizona Cancer Center, Tucson, Arizona, United States of America
| | - Anne-Mieke Vandamme
- KU Leuven, Department of Microbiology, Immunology and Transplantation, Rega Institute for Medical Research, Leuven, Belgium
- Center for Global Health and Tropical Medicine, Instituto de Higiene e Medicina Tropical, Universidade Nova de Lisboa, Lisbon, Portugal
| | - Arvind Varsani
- The Biodesign Center for Fundamental and Applied Microbiomics, School of Life Sciences, Center for Evolution and Medicine, Arizona State University, Tempe, Arizona, United States of America
| | - Nikos Vasilakis
- Department of Pathology, Center of Vector-Borne and Zoonotic Diseases, Institute for Human Infection and Immunity and World Reference Center for Emerging Viruses and Arboviruses, The University of Texas Medical Branch, Galveston, Texas, United States of America
| |
|
24
|
Dey S, Das S, Bhattacharya DK. Biochemical Property Based Positional Matrix: A New Approach Towards Genome Sequence Comparison. J Mol Evol 2023; 91:93-131. [PMID: 36587178 PMCID: PMC9805373 DOI: 10.1007/s00239-022-10082-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2022] [Accepted: 12/01/2022] [Indexed: 01/01/2023]
Abstract
The growth of genome sequence data has become one of the emerging areas in the study of bioinformatics. It has created strong demand for advanced methodologies for inferring evolutionary relationships among species. Alignment-free methods have proved more efficient in time and space than existing alignment-based methods for sequence analysis. In this study, a new alignment-free genome sequence comparison technique is proposed based on the biochemical properties of nucleotides. Each genome sequence is distributed over four parameters to give a 21-dimensional numerical descriptor using the Positional Matrix. To substantiate the proposed method, phylogenetic trees are constructed on viral and mammalian datasets by applying the UPGMA/NJ clustering methods. Further, the results of this method are compared with the results of the Feature Frequency Profiles method, the Positional Correlation Natural Vector method, the Graph-theoretic method, the Multiple Encoding Vector method, and the Fuzzy Integral Similarity method. In most cases, the present method produces more accurate results than the prior methods, and its execution time is comparatively small.
Affiliation(s)
- Sudeshna Dey
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, 700109 India (grid.440742.1; 0000 0004 1799 6713)
| | - Subhram Das
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, 700109 India (grid.440742.1; 0000 0004 1799 6713)
| | - D. K. Bhattacharya
- Pure Mathematics, Calcutta University, Kolkata, 700019 India (grid.59056.3f; 0000 0001 0664 9773)
| |
|
25
|
Anjum N, Nabil RL, Rafi RI, Bayzid MS, Rahman MS. CD-MAWS: An Alignment-Free Phylogeny Estimation Method Using Cosine Distance on Minimal Absent Word Sets. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:196-205. [PMID: 34928803 DOI: 10.1109/tcbb.2021.3136792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Multiple sequence alignment has been the traditional and well-established approach to sequence analysis and comparison, though it is time- and memory-consuming. As the scale of sequencing data keeps increasing, faster yet accurate alignment-free methods are becoming more important. Several alignment-free sequence analysis methods have been established in the literature in recent years, which extract numerical features from genomic data to analyze sequences and to estimate phylogenetic relationships among genes and species. The Minimal Absent Word (MAW) is an effective concept for representing characteristics of a sequence in an alignment-free manner. In this study, we present CD-MAWS, a distance measure based on the cosine of the angle between composition vectors constructed using minimal absent words, for sequence analysis in a computationally inexpensive manner. We have benchmarked CD-MAWS using several AFProject datasets, such as the Fish mtDNA, E. coli, Plants, Shigella and Yersinia datasets, and found it to perform quite well. Applied to several other biological datasets such as mammal mtDNA, bacterial genomes and viral genomes, CD-MAWS resolved phylogenetic relationships similar to or better than state-of-the-art alignment-free methods such as Mash, Skmer, Co-phylog and kSNP3.
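The idea of a cosine distance over minimal absent words can be illustrated with a small Python sketch. It is not the CD-MAWS implementation: the brute-force enumeration, the `max_len` cutoff, and the use of binary presence vectors as "composition vectors" are all assumptions made for illustration, and the function names are hypothetical. A word is treated as a minimal absent word when it does not occur in the sequence but both its longest proper prefix and its longest proper suffix do.

```python
from math import sqrt

def substrings(seq, m):
    """Set of all length-m substrings of seq."""
    return {seq[i:i + m] for i in range(len(seq) - m + 1)}

def minimal_absent_words(seq, max_len=6, alphabet="ACGT"):
    """Brute-force MAWs of length 2..max_len (illustrative, not optimized)."""
    maws = set()
    for m in range(2, max_len + 1):
        shorter = substrings(seq, m - 1)   # present words of length m-1
        present = substrings(seq, m)       # present words of length m
        for prefix in shorter:             # the prefix must occur in seq
            for letter in alphabet:
                word = prefix + letter
                # absent as a whole, but its suffix occurs in seq
                if word not in present and word[1:] in shorter:
                    maws.add(word)
    return maws

def cosine_distance(a, b):
    """Cosine distance between binary presence vectors of two MAW sets.

    Assumption: each sequence is represented by a 0/1 vector over the
    union of MAWs, so the dot product is simply the intersection size.
    """
    if not a or not b:
        return 1.0
    return 1.0 - len(a & b) / sqrt(len(a) * len(b))
```

For example, `minimal_absent_words("AAC", max_len=3)` yields `{"CA", "CC", "AAA"}`; a pairwise matrix of `cosine_distance` values could then be passed to any standard tree-building method.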
|
26
|
Bohnsack KS, Kaden M, Abel J, Villmann T. Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:119-135. [PMID: 34990369 DOI: 10.1109/tcbb.2022.3140873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
The large amounts of biological sequence data generated during the last decades, together with algorithmic and hardware improvements, have made it possible to apply machine learning techniques in bioinformatics. While the machine learning community is aware of the necessity to rigorously distinguish data transformation from data comparison and to adopt reasonable combinations thereof, this awareness is often lacking in the field of comparative sequence analysis. With the disadvantages of alignments for sequence comparison now recognized, typical applications increasingly use so-called alignment-free approaches. In light of this development, we present a conceptual framework for alignment-free sequence comparison, which highlights the delineation of: 1) the sequence data transformation, comprising adequate mathematical sequence coding and feature generation, from 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. We consider coding to be a data transformation free of information loss that yields an appropriate representation, whereas feature generation is inevitably information-lossy, with the intention to extract just the task-relevant information. This distinction sheds light on the plethora of methods available and assists in identifying suitable methods in machine learning and data analysis to compare the sequences under these premises.
|
27
|
Munagala NVTS, Amanchi PK, Balasubramanian K, Panicker A, Nagaraj N. Compression-Complexity Measures for Analysis and Classification of Coronaviruses. ENTROPY (BASEL, SWITZERLAND) 2022; 25:81. [PMID: 36673224 PMCID: PMC9857615 DOI: 10.3390/e25010081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2022] [Revised: 12/10/2022] [Accepted: 12/18/2022] [Indexed: 06/17/2023]
Abstract
Finding a vaccine or specific antiviral treatment for a global viral pandemic (such as the ongoing COVID-19 pandemic) requires rapid analysis, annotation and evaluation of metagenomic libraries to enable quick and efficient screening of nucleotide sequences. Traditional sequence alignment methods are not suitable, and there is a need for fast alignment-free techniques for sequence analysis. Information theory and data compression algorithms provide a rich set of mathematical and computational tools to capture essential patterns in biological sequences. In this study, we investigate the use of compression-complexity (Effort-to-Compress or ETC and Lempel-Ziv or LZ complexity) based distance measures for analyzing genomic sequences. The proposed distance measure is used to successfully reproduce the phylogenetic trees for a mammalian dataset consisting of eight species clusters, a set of coronaviruses belonging to group I, group II, group III, and SARS-CoV-1 coronaviruses, and a set of coronaviruses causing COVID-19 (SARS-CoV-2) and those not causing COVID-19. Having demonstrated the usefulness of these compression-complexity measures, we employ them for the automatic classification of COVID-19-causing genome sequences using machine learning techniques. Two flavors of SVM (linear and quadratic), along with linear discriminant and fine K Nearest Neighbors classifiers, are used for classification. Using a data set comprising 1001 coronavirus sequences (causing COVID-19 and those not causing COVID-19), a classification accuracy of 98% is achieved with a sensitivity of 95% and a specificity of 99.8%. This work could be extended further to enable medical practitioners to automatically identify and characterize coronavirus strains and their rapidly growing mutants in a fast and efficient fashion.
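A compression-complexity distance of the kind described above can be sketched with Lempel-Ziv complexity alone. The Python below is a minimal illustration under stated assumptions, not the authors' measure: `lz_complexity` counts phrases in the classic LZ76 exhaustive parsing, and `lz_distance` uses one commonly used normalized form built from the complexities of the two sequences and their concatenations. The ETC measure from the abstract is not reproduced, and both function names are hypothetical.

```python
def lz_complexity(s):
    """Number of phrases in the LZ76 exhaustive-history parsing of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from earlier text
        # (the search window may overlap the phrase itself, minus its last symbol).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def lz_distance(s, q):
    """Assumed normalized distance: extra phrases needed to describe one
    sequence given the other, scaled by the larger individual complexity."""
    cs, cq = lz_complexity(s), lz_complexity(q)
    extra = max(lz_complexity(s + q) - cs, lz_complexity(q + s) - cq)
    return extra / max(cs, cq)
```

The quadratic substring search keeps the sketch short; genome-scale inputs would need a suffix-structure-based parser. Note that this distance is not exactly zero for identical inputs, because the concatenation still costs one extra phrase.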
Affiliation(s)
- Naga Venkata Trinath Sai Munagala
- Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India
| | - Prem Kumar Amanchi
- Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India
| | - Karthi Balasubramanian
- Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India
| | - Athira Panicker
- Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India
| | - Nithin Nagaraj
- Consciousness Studies Programme, National Institute of Advanced Studies, Bengaluru 560012, Karnataka, India
| |
|
28
|
Silva JM, Qi W, Pinho AJ, Pratas D. AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data. Gigascience 2022; 12:giad101. [PMID: 38091509 PMCID: PMC10716826 DOI: 10.1093/gigascience/giad101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Revised: 09/29/2023] [Accepted: 11/07/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND: Low-complexity data analysis is the area that addresses the search and quantification of regions in sequences of elements that contain low-complexity or repetitive elements. For example, these can be tandem repeats, inverted repeats, homopolymer tails, GC-biased regions, similar genes, and hairpins, among many others. Identifying these regions is crucial because of their association with regulatory and structural characteristics. Moreover, their identification provides positional and quantity information where standard assembly methodologies face significant difficulties because of substantially higher depth of coverage (mountains), ambiguous read mapping, or where sequencing or reconstruction defects may occur. However, the capability to distinguish low-complexity regions (LCRs) in genomic and proteomic sequences is a challenge that depends on the model's ability to find them automatically. Low-complexity patterns can be implicit through specific or combined sources, such as algorithmic or probabilistic, and can span different spatial distances, namely local, medium, or distant associations. FINDINGS: This article addresses the challenge of automatically modeling and distinguishing LCRs, providing a new method and tool (AlcoR) for efficient and accurate segmentation and visualization of these regions in genomic and proteomic sequences. The method enables the use of models with different memories, providing the ability to distinguish local from distant low-complexity patterns. The method is reference and alignment free, providing additional methodologies for testing, including a highly flexible simulation method for generating biological sequences (DNA or protein) with different complexity levels, sequence masking, and a visualization tool for automatic computation of the LCR maps into an ideogram style. We provide illustrative demonstrations using synthetic, nearly synthetic, and natural sequences showing the high efficiency and accuracy of AlcoR. As large-scale results, we use AlcoR to provide, for the first time, a whole-chromosome low-complexity map of a recent complete human genome and the haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar. CONCLUSIONS: The AlcoR method provides fast sequence characterization through data complexity analysis, ideal for scenarios involving new or unknown sequences. AlcoR is implemented in C using multithreading to increase the computational speed, is flexible for multiple applications, and does not contain external dependencies. The tool accepts any sequence in FASTA format. The source code is freely provided at https://github.com/cobilab/alcor.
Affiliation(s)
- Jorge M Silva
- IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
| | - Weihong Qi
- Functional Genomics Center Zurich, ETH Zurich and University of Zurich, Winterthurerstrasse, 190, 8057, Zurich, Switzerland
- SIB, Swiss Institute of Bioinformatics, 1202, Geneva, Switzerland
| | - Armando J Pinho
- IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
| | - Diogo Pratas
- IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
- Department of Virology, University of Helsinki, Haartmaninkatu, 3, 00014 Helsinki, Finland
| |
|
29
|
Chen J, Yang L, Li L, Goodison S, Sun Y. Alignment-free comparison of metagenomics sequences via approximate string matching. BIOINFORMATICS ADVANCES 2022; 2:vbac077. [PMID: 36388153 PMCID: PMC9645238 DOI: 10.1093/bioadv/vbac077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2022] [Revised: 09/16/2022] [Accepted: 10/19/2022] [Indexed: 11/11/2022]
Abstract
Summary: Quantifying pairwise sequence similarities is a key step in metagenomics studies. Alignment-free methods provide a computationally efficient alternative to alignment-based methods for large-scale sequence analysis. Several neural network-based methods have recently been developed for this purpose. However, existing methods do not perform well on sequences of varying lengths and are sensitive to the presence of insertions and deletions. In this article, we describe the development of a new method, referred to as AsMac, that addresses the aforementioned issues. We propose a novel neural network structure for approximate string matching that extracts pertinent information from biological sequences, and we develop an efficient gradient computation algorithm for training the constructed neural network. We performed a large-scale benchmark study using real-world data that demonstrated the effectiveness and potential utility of the proposed method. Availability and implementation: The open-source software for the proposed method and trained neural-network models for some commonly used metagenomics marker genes were developed and are freely available at www.acsu.buffalo.edu/~yijunsun/lab/AsMac.html. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
| | | | - Lu Li
- Department of Oral Biology, University at Buffalo, Buffalo, NY 14215, USA
| | - Steve Goodison
- Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, FL 32224, USA
| | - Yijun Sun
- To whom correspondence should be addressed.
| |
|
30
|
Rachtman E, Sarmashghi S, Bafna V, Mirarab S. Quantifying the uncertainty of assembly-free genome-wide distance estimates and phylogenetic relationships using subsampling. Cell Syst 2022; 13:817-829.e3. [PMID: 36265468 PMCID: PMC9589918 DOI: 10.1016/j.cels.2022.06.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2021] [Revised: 03/14/2022] [Accepted: 06/28/2022] [Indexed: 01/26/2023]
Abstract
Computing the distance between two genomes without alignments or even access to assemblies enables many downstream analyses. However, alignment-free methods, including in the fast-growing field of genome skimming, are hampered by a significant methodological gap. While accurate methods (many k-mer-based) for assembly-free distance calculation exist, measuring the uncertainty of estimated distances has not been sufficiently studied. In this paper, we show that bootstrapping, the standard non-parametric method of measuring estimator uncertainty, is not accurate for k-mer-based methods that rely on k-mer frequency profiles. Instead, we propose using subsampling (with no replacement) in combination with a correction step to reduce the variance of the inferred distribution. We show that the distribution of distances using our procedure matches the true uncertainty of the estimator. The resulting phylogenetic support values effectively differentiate between correct and incorrect branches and identify controversial branches that change across alignment-free and alignment-based phylogenies reported in the literature.
Affiliation(s)
- Eleonora Rachtman
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, San Diego, CA 92093, USA
| | - Shahab Sarmashghi
- Department of Electrical and Computer Engineering, UC San Diego, San Diego, CA 92093, USA
| | - Vineet Bafna
- Department of Computer Science and Engineering, UC San Diego, San Diego, CA 92093, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, San Diego, CA 92093, USA.
| |
|
31
|
Uddin M, Islam MK, Hassan MR, Jahan F, Baek JH. A fast and efficient algorithm for DNA sequence similarity identification. COMPLEX INTELL SYST 2022; 9:1265-1280. [PMID: 36035628 PMCID: PMC9395857 DOI: 10.1007/s40747-022-00846-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2021] [Accepted: 08/05/2022] [Indexed: 11/22/2022]
Abstract
DNA sequence similarity analysis is necessary for numerous purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships of species. There are two types of sequence analysis: alignment-based (AB) and alignment-free (AF). AB is effective for small homologous sequences but becomes an NP-hard problem for long sequences. AF algorithms can overcome the major limitations of AB, but most existing AF methods show high time complexity and memory consumption, lower precision, and lower performance on benchmark datasets. To minimize these limitations, we develop an AF algorithm using a 2D k-mer count matrix inspired by the CGR approach. Then we shrink the matrix by analyzing the neighbors and measure similarities using the best combinations of pairwise distance (PD) and phylogenetic tree methods. We also dynamically choose the value of k for the k-mers. We develop an efficient system for finding the positions of k-mers in the count matrix. We apply our system to six different datasets. We achieve the top rank for two benchmark datasets from AFproject, 100% accuracy for two datasets (16S Ribosomal, 18 Eutherian), and achieve a milestone for time complexity and memory consumption in comparison to the existing study datasets (HEV, HIV-1). Therefore, the comparative results of the benchmark datasets and existing studies demonstrate that our method is highly effective, efficient, and accurate. Thus, our method can be used with the top level of authenticity for DNA sequence similarity measurement.
|
32
|
Balaban M, Bristy NA, Faisal A, Bayzid MS, Mirarab S. Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model. BIOINFORMATICS ADVANCES 2022; 2:vbac055. [PMID: 35992043 PMCID: PMC9383262 DOI: 10.1093/bioadv/vbac055] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/21/2022] [Accepted: 08/09/2022] [Indexed: 01/27/2023]
Abstract
While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes-Cantor. If we can estimate frequencies of base substitutions in an alignment-free setting, we can compute pairwise distances under more complex models. However, since the strand of DNA sequences is unknown for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the most complex models that one can use are the so-called no strand-bias models. We show how to calculate distances under a four-parameter no strand-bias model called TK4 without relying on alignments or assemblies. The main idea is to replace letters in the input sequences and recompute Jaccard indices between k-mer sets. However, on larger genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance as opposed to homology. We show in simulation that alignment-free distances can be highly accurate when genomes evolve under the assumed models and study the accuracy on assembled and unassembled biological data. Availability and implementation: Our software is available open source at https://github.com/nishatbristy007/NSB. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
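The building blocks named in this abstract, Jaccard indices between k-mer sets recomputed after replacing letters, can be sketched in a few lines of Python. This is only an illustration under assumptions and not the TK4 estimator: the purine/pyrimidine recoding is a hypothetical example of a letter replacement, strand is ignored (no canonical k-mers), the random-match correction for large genomes is omitted, and the conversion from Jaccard index to distance is the simple Mash-style transform rather than the paper's model. All function names are hypothetical.

```python
from math import log

def kmer_set(seq, k):
    """Set of all k-mers in seq (strand is ignored in this sketch)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard index of two sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def mash_style_distance(j, k):
    """Mash-style transform of a Jaccard index into a per-site distance."""
    if j == 0:
        return float("inf")
    return -log(2 * j / (1 + j)) / k

def recode(seq, mapping):
    """Replace letters of seq according to a single-character mapping."""
    return seq.translate(str.maketrans(mapping))

# Hypothetical recoding: collapse purines to R and pyrimidines to Y.
PURINE_PYRIMIDINE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}
```

Comparing the Jaccard index before and after a recoding shows which substitutions the recoding hides: k-mers that differ only by a transition become identical once recoded, so the index rises.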
Affiliation(s)
| | | | - Ahnaf Faisal
- Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | - Md Shamsuzzoha Bayzid
- Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | | |
|
33
|
Birth N, Dencker T, Morgenstern B. Insertions and deletions as phylogenetic signal in an alignment-free context. PLoS Comput Biol 2022; 18:e1010303. [PMID: 35939516 PMCID: PMC9387925 DOI: 10.1371/journal.pcbi.1010303] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/18/2021] [Revised: 08/18/2022] [Accepted: 06/14/2022] [Indexed: 11/18/2022] Open
Abstract
Most methods for phylogenetic tree reconstruction are based on sequence alignments; they infer phylogenies from substitutions that may have occurred at the aligned sequence positions. Gaps in alignments are usually not employed as phylogenetic signal. In this paper, we explore an alignment-free approach that uses insertions and deletions (indels) as an additional source of information for phylogeny inference. For a set of four or more input sequences, we generate so-called quartet blocks of four putative homologous segments each. For pairs of such quartet blocks involving the same four sequences, we compare the distances between the two blocks in these sequences, to obtain hints about indels that may have happened between the blocks since the respective four sequences have evolved from their last common ancestor. A prototype implementation that we call Gap-SpaM is presented to infer phylogenetic trees from these data, using a quartet-tree approach or, alternatively, under the maximum-parsimony paradigm. This approach should not be regarded as an alternative to established methods, but rather as a complementary source of phylogenetic information. Interestingly, however, our software is able to produce phylogenetic trees from putative indels alone that are comparable to trees obtained with existing alignment-free methods. Phylogenetic tree inference based on DNA or protein sequence comparison is a fundamental task in computational biology. Given a multiple alignment of a set of input sequences, most approaches compare aligned sequence positions to each other, to find a suitable tree, based on a model of molecular evolution. Insertions and deletions that may have happened since the input sequences evolved from their last common ancestor are ignored by most phylogeny methods. 
Herein, we show that insertions and deletions can provide an additional source of information for phylogeny inference, and that such information can be obtained with a simple alignment-free approach. We provide an implementation of this idea that we call Gap-SpaM. The proposed approach is complementary to existing phylogeny methods since it is based on a completely different source of information. It is, thus, not meant to be an alternative to those existing methods but rather a possible additional source of information for tree inference.
Affiliation(s)
- Niklas Birth
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany
| | - Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Göttingen, Germany
- Campus-Institute Data Science (CIDAS), Göttingen, Germany
| |
|
34
|
Câmara GBM, Coutinho MGF, da Silva LMD, Gadelha WVDN, Torquato MF, Barbosa RDM, Fernandes MAC. Convolutional Neural Network Applied to SARS-CoV-2 Sequence Classification. SENSORS (BASEL, SWITZERLAND) 2022; 22:5730. [PMID: 35957287 PMCID: PMC9371030 DOI: 10.3390/s22155730] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2022] [Revised: 07/28/2022] [Accepted: 07/28/2022] [Indexed: 06/15/2023]
Abstract
COVID-19, the illness caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a virus of the Coronaviridae family with a single-stranded positive-sense RNA genome, has been spreading around the world and has been declared a pandemic by the World Health Organization. On 17 January 2022, there were more than 329 million cases, with more than 5.5 million deaths. Although COVID-19 has a low mortality rate, its high capacities for contamination, spread, and mutation worry the authorities, especially after the emergence of the Omicron variant, which has a high transmission capacity and can more easily contaminate even vaccinated people. Such outbreaks require elucidation of the taxonomic classification and origin of the virus (SARS-CoV-2) from the genomic sequence for strategic planning, containment, and treatment of the disease. Thus, this work proposes a high-accuracy technique to classify viruses and other organisms from a genome sequence using a deep learning convolutional neural network (CNN). Unlike other approaches in the literature, the proposed approach does not limit the length of the genome sequence. The results show that the novel proposal accurately distinguishes SARS-CoV-2 from the sequences of other viruses. The results were obtained from 1557 instances of SARS-CoV-2 from the National Center for Biotechnology Information (NCBI) and 14,684 different viruses from the Virus-Host DB. As a CNN has several changeable parameters, the tests were performed with forty-eight different architectures; the best of these had an accuracy of 91.94 ± 2.62% in classifying viruses into their realms correctly, in addition to 100% accuracy in classifying SARS-CoV-2 into its respective realm, Riboviria. For the subsequent classifications (family, genus, and subgenus), this accuracy increased, which shows that the proposed architecture may be viable in the classification of the virus that causes COVID-19.
Affiliation(s)
- Gabriel B. M. Câmara
- Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
| | - Maria G. F. Coutinho
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
| | - Lucileide M. D. da Silva
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Federal Institute of Education, Science and Technology of Rio Grande do Norte, Paraiso, Santa Cruz 59200-000, RN, Brazil
| | - Walter V. do N. Gadelha
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
| | - Matheus F. Torquato
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
| | - Raquel de M. Barbosa
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Department of Pharmacy and Pharmaceutical Technology, University of Granada, 18071 Granada, Spain
| | - Marcelo A. C. Fernandes
- Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
35
Swain MT, Vickers M. Interpreting alignment-free sequence comparison: what makes a score a good score? NAR Genom Bioinform 2022; 4:lqac062. [PMID: 36071721 PMCID: PMC9442500 DOI: 10.1093/nargab/lqac062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Revised: 07/01/2022] [Accepted: 08/16/2022] [Indexed: 11/13/2022] [Open Access]
Abstract
Alignment-free methods are alternatives to alignment-based methods when searching sequence data sets. The output from an alignment-free sequence comparison is a similarity score, the interpretation of which is not straightforward. We propose objective functions to interpret and calibrate outputs from alignment-free searches, noting that different objective functions are necessary for different biological contexts. This leads to advantages: visualising and comparing score distributions, including those from true positives, may be a relatively simple method to gain insight into the performance of different metrics. Using an empirical approach with both DNA and protein sequences, we characterise different similarity score distributions generated under different parameters. In particular, we demonstrate how sequence length can affect the scores. We show that scores of true positive sequence pairs may correlate significantly with their mean length; and even if the correlation is weak, the relative difference in length of the sequence pair may significantly reduce the effectiveness of alignment-free metrics. Importantly, we show how objective functions can be used with test data to accurately estimate the probability of true positives. This can significantly increase the utility of alignment-free approaches. Finally, we have developed a general-purpose software tool called KAST for use in high-throughput workflows on Linux clusters.
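The length effect mentioned in this abstract can be illustrated with a small sketch. The code below is not KAST and does not reproduce the paper's objective functions; it assumes a plain D2-style score (the dot product of k-mer count vectors), and the function names and toy sequences are hypothetical. It only shows that an uncalibrated raw score grows with sequence length even when the underlying composition is unchanged.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(a, b, k=3):
    """Raw D2-style score: dot product of the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Counter returns 0 for k-mers absent from b, so no KeyError here.
    return sum(n * cb[kmer] for kmer, n in ca.items())


# Hypothetical toy sequences: the long one is the short one repeated.
short = "ACGTACGTTGCA"
long_ = short * 4

# Same composition, very different raw scores.
print(d2_score(short, short), d2_score(long_, long_))
```

Because the raw score depends on length, scores from pairs of different lengths are not directly comparable without some form of normalisation or calibration, which is the problem the abstract addresses.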
Affiliation(s)
- Martin T Swain
  - Department of Life Sciences, Aberystwyth University, Penglais, Aberystwyth, Ceredigion, SY23 3DA, UK
- Martin Vickers
  - The John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
36
Cay SB, Cinar YU, Kuralay SC, Inal B, Zararsiz G, Ciftci A, Mollman R, Obut O, Eldem V, Bakir Y, Erol O. Genome skimming approach reveals the gene arrangements in the chloroplast genomes of the highly endangered Crocus L. species: Crocus istanbulensis (B.Mathew) Rukšāns. PLoS One 2022; 17:e0269747. [PMID: 35704623 PMCID: PMC9200356 DOI: 10.1371/journal.pone.0269747] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/17/2021] [Accepted: 05/27/2022] [Indexed: 11/19/2022] [Open Access]
Abstract
Crocus istanbulensis (B.Mathew) Rukšāns is one of the most endangered Crocus species in the world and has an extremely limited distribution range in Istanbul. Our recent field work indicates that no more than one hundred individuals remain in the wild. In the present study, we used genome skimming to determine the complete chloroplast (cp) genome sequences of six C. istanbulensis individuals collected from the locus classicus. The cp genome of C. istanbulensis has 151,199 base pairs (bp), with a large single-copy (LSC) (81,197 bp), small single copy (SSC) (17,524 bp) and two inverted repeat (IR) regions of 26,236 bp each. The cp genome contains 132 genes, of which 86 are protein-coding (PCGs), 8 are rRNA and 38 are tRNA genes. Most of the repeats are found in intergenic spacers of Crocus species. Mononucleotide repeats were most abundant, accounting for over 80% of total repeats. The cp genome contained four palindrome repeats and one forward repeat. Comparative analyses among other Iridaceae species identified one inversion in the terminal positions of LSC region and three different gene (psbA, rps3 and rpl22) arrangements in C. istanbulensis that were not reported previously. To measure selective pressure in the exons of chloroplast coding sequences, we performed a sequence analysis of plastome-encoded genes. A total of seven genes (accD, rpoC2, psbK, rps12, ccsA, clpP and ycf2) were detected under positive selection in the cp genome. Alignment-free sequence comparison showed an extremely low sequence diversity across naturally occurring C. istanbulensis specimens. All six sequenced individuals shared the same cp haplotype. In summary, this study will aid further research on the molecular evolution and development of ex situ conservation strategies of C. istanbulensis.
Affiliation(s)
- Selahattin Baris Cay
  - Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
- Yusuf Ulas Cinar
  - Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
- Selim Can Kuralay
  - Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
- Behcet Inal
  - Department of Agricultural Biotechnology, Faculty of Agriculture, University of Siirt, Siirt, Turkey
- Gokmen Zararsiz
  - Department of Biostatistics, Erciyes University, Kayseri, Turkey
  - Drug Application and Research Center (ERFARMA), Erciyes University, Kayseri, Turkey
- Almila Ciftci
  - Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
- Rachel Mollman
  - Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
- Onur Obut
  - Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
- Vahap Eldem (corresponding author)
  - Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
- Yakup Bakir
  - Department of Plant Bioactive Metabolites, ACTV Biotechnology, Inc., Istanbul, Turkey
- Osman Erol
  - Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
37
Linheiro R, Sabatino S, Lobo D, Archer J. CView: A network based tool for enhanced alignment visualization. PLoS One 2022; 17:e0259726. [PMID: 35696379 PMCID: PMC9191720 DOI: 10.1371/journal.pone.0259726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2021] [Accepted: 05/31/2022] [Indexed: 11/19/2022] [Open Access]
Abstract
To date, basic visualization of sequence alignments has largely focused on displaying per-site columns of nucleotide, or amino acid, residues along with associated frequency summarizations. The persistence of this tendency in recent tools designed for viewing mapped read data indicates that such a perspective not only provides a reliable visualization of per-site alterations, but also offers implicit reassurance to the end-user in relation to data accessibility. However, the initial insight gained is limited, something that is especially true when viewing alignments consisting of many sequences representing differing factors such as location, date and subtype. A basic alignment viewer has the potential to increase initial insight through visual enhancement, whilst not delving into the realms of complex sequence analysis. We present CView, a visualizer that expands on the per-site representation of residues through the incorporation of a dynamic network that is based on the summarization of diversity present across different regions of the alignment. Within the network, nodes are based on the clustering of sequence fragments that span windows placed consecutively along the alignment. Edges are placed between nodes of neighbouring windows where they share sequence identification(s), i.e. different regions of the same sequence(s). Thus, if a node is selected on the network, then the relationship that sequences passing through that node have to other regions of diversity within the alignment can be observed through path tracing. In addition to augmenting visual insight, CView provides export features including variant summarization, per-site residue and kmer frequencies, consensus sequence, alignment dissection as well as clustering; each useful across a range of research areas. The software has been designed to be user friendly, intuitive and interactive.
It is open source and an executable jar, source code, quick start, usage tutorial and test data are available (under the GNU General Public License) from https://sourceforge.net/projects/cview/.
Affiliation(s)
- Raquel Linheiro
  - CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, InBIO Laboratório Associado, Campus de Vairão, Universidade do Porto, Vairão, Portugal
- Stephen Sabatino
  - CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, InBIO Laboratório Associado, Campus de Vairão, Universidade do Porto, Vairão, Portugal
  - BIOPOLIS, Program in Genomics, Biodiversity and Land Planning, CIBIO, Campus de Vairão, Vairão, Portugal
- Diana Lobo
  - CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, InBIO Laboratório Associado, Campus de Vairão, Universidade do Porto, Vairão, Portugal
  - BIOPOLIS, Program in Genomics, Biodiversity and Land Planning, CIBIO, Campus de Vairão, Vairão, Portugal
  - Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Porto, Portugal
- John Archer (corresponding author)
  - CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, InBIO Laboratório Associado, Campus de Vairão, Universidade do Porto, Vairão, Portugal
  - BIOPOLIS, Program in Genomics, Biodiversity and Land Planning, CIBIO, Campus de Vairão, Vairão, Portugal
38
Girgis HZ. MeShClust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores. BMC Genomics 2022; 23:423. [PMID: 35668366 PMCID: PMC9171953 DOI: 10.1186/s12864-022-08619-0] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 01/15/2022] [Accepted: 05/11/2022] [Indexed: 11/22/2022] [Open Access]
Abstract
Background: Tools for accurately clustering biological sequences are among the most important tools in computational biology. Two pioneering tools for clustering sequences are CD-HIT and UCLUST, both of which are fast and consume reasonable amounts of memory; however, there is considerable room for improvement in terms of cluster quality. Motivated by this opportunity for improving cluster quality, we applied the mean shift algorithm in MeShClust v1.0. The mean shift algorithm is an instance of unsupervised learning. Its strong theoretical foundation guarantees the convergence to the true cluster centers. Our implementation of the mean shift algorithm in MeShClust v1.0 was a step forward. In this work, we scale up the algorithm by adapting an out-of-core strategy while utilizing alignment-free identity scores in a new tool: MeShClust v3.0.
Results: We evaluated CD-HIT, MeShClust v1.0, MeShClust v3.0, and UCLUST on 22 synthetic sets and five real sets. These data sets were designed or selected for testing the tools in terms of scalability and different similarity levels among sequences comprising clusters. On the synthetic data sets, MeShClust v3.0 outperformed the related tools on all sets in terms of cluster quality. On two real data sets obtained from human microbiome and maize transposons, MeShClust v3.0 outperformed the related tools by wide margins, achieving 55%–300% improvement in cluster quality. On another set that includes degenerate viral sequences, MeShClust v3.0 came third. On two bacterial sets, MeShClust v3.0 was the only applicable tool because of the long sequences in these sets. MeShClust v3.0 requires more time and memory than the related tools; almost all personal computers at the time of this writing can accommodate such requirements. MeShClust v3.0 can estimate an important parameter that controls cluster membership with high accuracy.
Conclusions: These results demonstrate the high quality of clusters produced by MeShClust v3.0 and its ability to apply the mean shift algorithm to large data sets and long sequences. Because clustering tools are utilized in many studies, providing high-quality clusters will help with deriving accurate biological knowledge.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-022-08619-0.
Affiliation(s)
- Hani Z Girgis
  - Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX, USA
39
Wang R, Wu J, Jiang N, Lin H, An F, Wu C, Yue X, Shi H, Wu R. Recent developments in horizontal gene transfer with the adaptive innovation of fermented foods. Crit Rev Food Sci Nutr 2022; 63:569-584. [PMID: 35647734 DOI: 10.1080/10408398.2022.2081127] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 02/03/2023]
Abstract
Horizontal gene transfer (HGT) has contributed significantly to the adaptability of bacteria, yeast and mold in fermented foods, and evidence of it has been found in several fermented foods. Although not every HGT event has biological significance, HGT plays an important role in improving the quality of fermented foods. This review discusses how HGT has facilitated microbial domestication and adaptive evolution in fermented foods. HGT can assist in the industrial innovation of fermented foods, and this adaptive evolution strategy can improve their quality. Additionally, the mechanisms underlying HGT in fermented foods are analyzed. Furthermore, the critical bottlenecks in optimizing HGT during the production of fermented foods are identified, and strategies for optimizing HGT are proposed. Finally, the prospects of HGT for promoting the industrial innovation of fermented foods are highlighted. This comprehensive report on HGT in fermented foods points to a new direction for domesticating preferable starters for food fermentation, thus optimizing the quality and improving the industrial production of fermented foods.
Affiliation(s)
- Ruhong Wang
  - College of Food Science, Shenyang Agricultural University, Shenyang, P.R. China
- Junrui Wu
  - College of Food Science, Shenyang Agricultural University, Shenyang, P.R. China
  - Liaoning Engineering Research Center of Food Fermentation Technology, Shenyang Agricultural University, Shenyang, P.R. China
  - Shenyang Key Laboratory of Microbial Fermentation Technology Innovation, Shenyang Agricultural University, Shenyang, P.R. China
- Nan Jiang
  - College of Food Science, Shenyang Agricultural University, Shenyang, P.R. China
- Hao Lin
  - College of Food Science, Shenyang Agricultural University, Shenyang, P.R. China
- Feiyu An
  - College of Food Science, Shenyang Agricultural University, Shenyang, P.R. China
- Chen Wu
  - College of Food Science, Shenyang Agricultural University, Shenyang, P.R. China
- Xiqing Yue
  - College of Food Science, Shenyang Agricultural University, Shenyang, P.R. China
  - Liaoning Engineering Research Center of Food Fermentation Technology, Shenyang Agricultural University, Shenyang, P.R. China
  - Shenyang Key Laboratory of Microbial Fermentation Technology Innovation, Shenyang Agricultural University, Shenyang, P.R. China
- Haisu Shi
  - College of Food Science, Shenyang Agricultural University, Shenyang, P.R. China
  - Liaoning Engineering Research Center of Food Fermentation Technology, Shenyang Agricultural University, Shenyang, P.R. China
  - Shenyang Key Laboratory of Microbial Fermentation Technology Innovation, Shenyang Agricultural University, Shenyang, P.R. China
- Rina Wu
  - College of Food Science, Shenyang Agricultural University, Shenyang, P.R. China
  - Liaoning Engineering Research Center of Food Fermentation Technology, Shenyang Agricultural University, Shenyang, P.R. China
  - Shenyang Key Laboratory of Microbial Fermentation Technology Innovation, Shenyang Agricultural University, Shenyang, P.R. China
40
Pérez-Losada M, Narayanan DB, Kolbe AR, Ramos-Tapia I, Castro-Nallar E, Crandall KA, Domínguez J. Comparative Analysis of Metagenomics and Metataxonomics for the Characterization of Vermicompost Microbiomes. Front Microbiol 2022; 13:854423. [PMID: 35620097 PMCID: PMC9127802 DOI: 10.3389/fmicb.2022.854423] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2022] [Accepted: 04/21/2022] [Indexed: 11/21/2022] [Open Access]
Abstract
The study of microbial communities or microbiotas in animals and environments is important because of their impact in a broad range of industrial applications, diseases and ecological roles. High throughput sequencing (HTS) is the best strategy to characterize microbial composition and function. Microbial profiles can be obtained either by shotgun sequencing of genomes, or through amplicon sequencing of target genes (e.g., 16S rRNA for bacteria and ITS for fungi). Here, we compared both HTS approaches at assessing taxonomic and functional diversity of bacterial and fungal communities during vermicomposting of white grape marc. We applied specific HTS workflows to the same 12 microcosms, with and without earthworms, sampled at two distinct phases of the vermicomposting process occurring at 21 and 63 days. Metataxonomic profiles were inferred in DADA2, with bacterial metabolic pathways predicted via PICRUSt2. Metagenomic taxonomic profiles were inferred in PathoScope, while bacterial functional profiles were inferred in Humann2. Microbial profiles inferred by metagenomics and metataxonomics showed similarities and differences in composition, structure, and metabolic function at different taxonomic levels. Microbial composition and abundance estimated by both HTS approaches agreed reasonably well at the phylum level, but larger discrepancies were observed at lower taxonomic ranks. Shotgun HTS identified ~1.8 times more bacterial genera than 16S rRNA HTS, while ITS HTS identified two times more fungal genera than shotgun HTS. This is mainly a consequence of the difference in resolution and reference richness between amplicon and genome sequencing approaches and databases, respectively. Our study also revealed great differences and even opposite trends in alpha- and beta-diversity between amplicon and shotgun HTS. Interestingly, amplicon PICRUSt2-imputed functional repertoires overlapped ~50% with shotgun Humann2 profiles. 
Finally, both approaches indicated that although bacteria and fungi are the main drivers of biochemical decomposition, earthworms also play a key role in plant vermicomposting. In summary, our study highlights the strengths and weaknesses of metagenomics and metataxonomics and provides new insights on the vermicomposting of white grape marc. Since both approaches may target different biological aspects of the communities, combining them will provide a better understanding of the microbiotas under study.
Affiliation(s)
- Marcos Pérez-Losada
  - Computational Biology Institute, The George Washington University, Washington, DC, United States
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, United States
  - CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Vairão, Portugal
- Dhatri Badri Narayanan
  - Computational Biology Institute, The George Washington University, Washington, DC, United States
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, United States
- Allison R. Kolbe
  - Computational Biology Institute, The George Washington University, Washington, DC, United States
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, United States
- Ignacio Ramos-Tapia
  - Instituto de Investigación Interdisciplinaria (I3), Universidad de Talca, Talca, Chile
- Eduardo Castro-Nallar
  - CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Vairão, Portugal
  - Instituto de Investigación Interdisciplinaria (I3), Universidad de Talca, Talca, Chile
  - Departamento de Microbiología, Facultad de Ciencias de la Salud, Universidad de Talca, Talca, Chile
- Keith A. Crandall
  - Computational Biology Institute, The George Washington University, Washington, DC, United States
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, United States
- Jorge Domínguez
  - Grupo de Ecoloxía Animal (GEA), Universidade de Vigo, Vigo, Spain
41
Revisiting the recombinant history of HIV-1 group M with dynamic network community detection. Proc Natl Acad Sci U S A 2022; 119:e2108815119. [PMID: 35500121 PMCID: PMC9171507 DOI: 10.1073/pnas.2108815119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022] [Open Access]
Abstract
Recombination is a major mechanism through which HIV type 1 (HIV-1) maintains genetic diversity and interferes with viral eradication efforts. There is growing evidence demonstrating a recombinant origin of primate lentiviruses including HIV-1 group M (HIV-1/M). Inferring the extent of recombination across the entire HIV-1/M genome is of great importance as it provides deeper insights into the origin, dynamics, and evolution of the global pandemic. Here we propose an alternative method that can reconstruct the extent of genome-wide recombination in HIV-1, uncover reticulate patterns, and serve as a framework for HIV-1 classification. Our method provides an alternative approach for understanding the roles of virus recombination in the early evolutionary history of zoonosis for other emerging viruses.
The prevailing abundance of full-length HIV type 1 (HIV-1) genome sequences provides an opportunity to revisit the standard model of HIV-1 group M (HIV-1/M) diversity that clusters genomes into largely nonrecombinant subtypes, which is not consistent with recent evidence of deep recombinant histories for simian immunodeficiency virus (SIV) and other HIV-1 groups. Here we develop an unsupervised nonparametric clustering approach, which does not rely on predefined nonrecombinant genomes, by adapting a community detection method developed for dynamic social network analysis. We show that this method (dynamic stochastic block model [DSBM]) attains a significantly lower mean error rate in detecting recombinant breakpoints in simulated data (quasibinomial generalized linear model [GLM], P < 8 × 10⁻⁸), compared to other reference-free recombination detection programs (genetic algorithm for recombination detection [GARD], recombination detection program 4 [RDP4], and RDP5).
When this method was applied to a representative sample of n = 525 actual HIV-1 genomes, we determined k = 29 as the optimal number of DSBM clusters and used change-point detection to estimate that at least 95% of these genomes are recombinant. Further, we identified both known and undocumented recombination hotspots in the HIV-1 genome and evidence of intersubtype recombination in HIV-1 subtype reference genomes. We propose that clusters generated by DSBM can provide an informative framework for HIV-1 classification.
42
Aledo JC. Phylogenies from unaligned proteomes using sequence environments of amino acid residues. Sci Rep 2022; 12:7497. [PMID: 35523825 PMCID: PMC9076898 DOI: 10.1038/s41598-022-11370-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2022] [Accepted: 04/21/2022] [Indexed: 11/09/2022] [Open Access]
Abstract
Alignment-free methods for sequence comparison and phylogeny inference have attracted a great deal of attention in recent years. Several algorithms have been implemented in diverse software packages. Despite the great number of existing methods, most of them are based on word statistics. Although they propose different filtering and weighting strategies and explore different metrics, their performance may be limited by the phylogenetic signal preserved in these words. Herein, we present a different approach based on the species-specific amino acid neighborhood preferences. These differential preferences can be assessed in the context of vector spaces. In this way, a distance-based method to build phylogenies has been developed and implemented into an easy-to-use R package. Tests run on real-world datasets show that this method can reconstruct phylogenetic relationships with high accuracy, and often outperforms other alignment-free approaches. Furthermore, we present evidence that the new method can perform reliably on datasets formed by non-orthologous protein sequences, that is, the method not only does not require the identification of orthologous proteins, but also does not require their presence in the analyzed dataset. These results suggest that the neighborhood preference of amino acids conveys a phylogenetic signal that may be of great utility in phylogenomics.
Affiliation(s)
- Juan Carlos Aledo
  - Department of Molecular Biology and Biochemistry, University of Málaga, 29071, Málaga, Spain
43
Lo R, Dougan KE, Chen Y, Shah S, Bhattacharya D, Chan CX. Alignment-Free Analysis of Whole-Genome Sequences From Symbiodiniaceae Reveals Different Phylogenetic Signals in Distinct Regions. Front Plant Sci 2022; 13:815714. [PMID: 35557718 PMCID: PMC9087856 DOI: 10.3389/fpls.2022.815714] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/15/2021] [Accepted: 04/04/2022] [Indexed: 05/24/2023]
Abstract
Dinoflagellates of the family Symbiodiniaceae are predominantly essential symbionts of corals and other marine organisms. Recent research reveals extensive genome sequence divergence among Symbiodiniaceae taxa and high phylogenetic diversity hidden behind subtly different cell morphologies. Using an alignment-free phylogenetic approach based on sub-sequences of fixed length k (i.e. k-mers), we assessed the phylogenetic signal among whole-genome sequences from 16 Symbiodiniaceae taxa (including the genera of Symbiodinium, Breviolum, Cladocopium, Durusdinium and Fugacium) and two strains of Polarella glacialis as outgroup. Based on phylogenetic trees inferred from k-mers in distinct genomic regions (i.e. repeat-masked genome sequences, protein-coding sequences, introns and repeats) and in protein sequences, the phylogenetic signal associated with protein-coding DNA and the encoded amino acids is largely consistent with the Symbiodiniaceae phylogeny based on established markers, such as large subunit rRNA. The other genome sequences (introns and repeats) exhibit distinct phylogenetic signals, supporting the expected differential evolutionary pressure acting on these regions. Our analysis of conserved core k-mers revealed the prevalence of conserved k-mers (>95% core 23-mers among all 18 genomes) in annotated repeats and non-genic regions of the genomes. We observed 180 distinct repeat types that are significantly enriched in genomes of the symbiotic versus free-living Symbiodinium taxa, suggesting an enhanced activity of transposable elements linked to the symbiotic lifestyle. We provide evidence that representation of alignment-free phylogenies as dynamic networks enhances the ability to generate new hypotheses about genome evolution in Symbiodiniaceae. 
These results demonstrate the potential of alignment-free phylogenetic methods as a scalable approach for inferring comprehensive, unbiased whole-genome phylogenies of dinoflagellates and more broadly of microbial eukaryotes.
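The notion of "core k-mers" used in this abstract (k-mers present in every genome of a set) can be sketched in a few lines. This is an illustration under simplifying assumptions, not the authors' pipeline: the function names and toy sequences are hypothetical, k is set to 4 rather than 23 so that the example stays readable, and reverse complements are not collapsed into canonical k-mers as a real genome-scale analysis typically would.

```python
def kmers(seq, k):
    """Return the set of distinct overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def core_kmers(genomes, k):
    """Return the k-mers shared by every genome in a name -> sequence mapping."""
    sets = [kmers(seq, k) for seq in genomes.values()]
    return set.intersection(*sets) if sets else set()


# Hypothetical toy "genomes" that share the stretch ACGTAC.
toy = {
    "taxonA": "ACGTACGGA",
    "taxonB": "TTACGTAC",
    "taxonC": "GGACGTACC",
}

core = core_kmers(toy, 4)
print(sorted(core))  # ['ACGT', 'CGTA', 'GTAC']
```

Set intersection keeps the sketch short; at genome scale a dedicated k-mer counter would normally be used instead of in-memory Python sets.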
Affiliation(s)
- Rosalyn Lo
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Katherine E. Dougan
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Yibi Chen
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Sarah Shah
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Debashish Bhattacharya
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, United States
- Cheong Xin Chan
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
44
Inferring Species Compositions of Complex Fungal Communities from Long- and Short-Read Sequence Data. mBio 2022; 13:e0244421. [PMID: 35404122 PMCID: PMC9040722 DOI: 10.1128/mbio.02444-21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] [Open Access]
Abstract
Our study is unique in that it provides an in-depth comparative study of a real-life complex fungal community analyzed with multiple long- and short-read sequencing approaches. These technologies and their application are currently of great interest to diverse biologists as they seek to characterize the community compositions of microbiomes.
45
Qi W, Lim YW, Patrignani A, Schläpfer P, Bratus-Neuenschwander A, Grüter S, Chanez C, Rodde N, Prat E, Vautrin S, Fustier MA, Pratas D, Schlapbach R, Gruissem W. The haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar reveal novel pan-genome and allele-specific transcriptome features. Gigascience 2022; 11:giac028. [PMID: 35333302 PMCID: PMC8952263 DOI: 10.1093/gigascience/giac028] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 10/16/2021] [Revised: 01/11/2022] [Accepted: 02/22/2022] [Indexed: 01/01/2023] [Open Access]
Abstract
BACKGROUND Cassava (Manihot esculenta) is an important clonally propagated food crop in tropical and subtropical regions worldwide. Genetic gain by molecular breeding has been limited, partially because cassava is a highly heterozygous crop with a repetitive and difficult-to-assemble genome. FINDINGS Here we demonstrate that Pacific Biosciences high-fidelity (HiFi) sequencing reads, in combination with the assembler hifiasm, produced genome assemblies at near complete haplotype resolution with higher continuity and accuracy compared to conventional long sequencing reads. We present 2 chromosome-scale haploid genomes phased with Hi-C technology for the diploid African cassava variety TME204. With consensus accuracy >QV46, contig N50 >18 Mb, BUSCO completeness of 99%, and 35k phased gene loci, it is the most accurate, continuous, complete, and haplotype-resolved cassava genome assembly so far. Ab initio gene prediction with RNA-seq data and Iso-Seq transcripts identified abundant novel gene loci, with enriched functionality related to chromatin organization, meristem development, and cell responses. During tissue development, differentially expressed transcripts of different haplotype origins were enriched for different functionality. In each tissue, 20-30% of transcripts showed allele-specific expression (ASE) differences. ASE bias was often tissue specific and inconsistent across different tissues. Direction-shifting was observed in <2% of the ASE transcripts. Despite high gene synteny, the HiFi genome assembly revealed extensive chromosome rearrangements and abundant intra-genomic and inter-genomic divergent sequences, with large structural variations mostly related to LTR retrotransposons. We use the reference-quality assemblies to build a cassava pan-genome and demonstrate its importance in representing the genetic diversity of cassava for downstream reference-guided omics analysis and breeding. 
CONCLUSIONS The phased and annotated chromosome pairs allow a systematic view of the heterozygous diploid genome organization in cassava with improved accuracy, completeness, and haplotype resolution. They will be a valuable resource for cassava breeding and research. Our study may also provide insights into developing cost-effective and efficient strategies for resolving complex genomes with high resolution, accuracy, and continuity.
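The assembly metrics quoted in this abstract (contig N50, consensus QV) can be illustrated with a minimal Python sketch. It assumes the usual conventions: N50 is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly length, and QV is the Phred-scaled error probability, QV = -10 log10(p). The function names and the toy contig lengths are illustrative assumptions, not values or code from the TME204 study.

```python
import math


def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (standard definition)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(p):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]  # made-up lengths, arbitrary units
    print("N50:", n50(toy_contigs))
    print("error rate at QV46:", qv_to_error_rate(46))
```

Under this convention, a consensus accuracy above QV46 corresponds to fewer than roughly 2.5 errors per 100,000 bases.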
Affiliation(s)
- Weihong Qi
  - Functional Genomics Center Zurich, ETH Zurich and University of Zurich, Winterthurerstrasse 190, 8057, Zurich, Switzerland
  - Department of Biology, Institute of Molecular Plant Biology, ETH Zurich, Universitätstrasse 2, 8092, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1202, Geneva, Switzerland
- Yi-Wen Lim
  - Department of Biology, Institute of Molecular Plant Biology, ETH Zurich, Universitätstrasse 2, 8092, Zurich, Switzerland
- Andrea Patrignani
  - Functional Genomics Center Zurich, ETH Zurich and University of Zurich, Winterthurerstrasse 190, 8057, Zurich, Switzerland
- Pascal Schläpfer
  - Department of Biology, Institute of Molecular Plant Biology, ETH Zurich, Universitätstrasse 2, 8092, Zurich, Switzerland
- Anna Bratus-Neuenschwander
  - Functional Genomics Center Zurich, ETH Zurich and University of Zurich, Winterthurerstrasse 190, 8057, Zurich, Switzerland
- Simon Grüter
  - Functional Genomics Center Zurich, ETH Zurich and University of Zurich, Winterthurerstrasse 190, 8057, Zurich, Switzerland
- Christelle Chanez
  - Department of Biology, Institute of Molecular Plant Biology, ETH Zurich, Universitätstrasse 2, 8092, Zurich, Switzerland
- Nathalie Rodde
  - INRAE, CNRGV French Plant Genomic Resource Center, F-31320, Castanet Tolosan, France
- Elisa Prat
  - INRAE, CNRGV French Plant Genomic Resource Center, F-31320, Castanet Tolosan, France
- Sonia Vautrin
  - INRAE, CNRGV French Plant Genomic Resource Center, F-31320, Castanet Tolosan, France
- Diogo Pratas
  - Department of Electronics, Telecommunications and Informatics and Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
  - Department of Virology, University of Helsinki, Haartmaninkatu 3, 00014 Helsinki, Finland
- Ralph Schlapbach
  - Functional Genomics Center Zurich, ETH Zurich and University of Zurich, Winterthurerstrasse 190, 8057, Zurich, Switzerland
- Wilhelm Gruissem
  - Department of Biology, Institute of Molecular Plant Biology, ETH Zurich, Universitätstrasse 2, 8092, Zurich, Switzerland
  - Biotechnology Center, National Chung Hsing University, 145 Xingda Road, Taichung 40227, Taiwan
|
46
|
Irinyi L, Rope M, Meyer W. In depth search of the Sequence Read Archive database reveals global distribution of the emerging pathogenic fungus Scedosporium aurantiacum. Med Mycol 2022; 60:6542442. [PMID: 35244718 PMCID: PMC8994208 DOI: 10.1093/mmy/myac019] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/01/2021] [Revised: 01/30/2022] [Accepted: 03/01/2022] [Indexed: 11/24/2022]
Abstract
Scedosporium species are emerging opportunistic fungal pathogens causing various infections mainly in immunocompromised patients, but also in immunocompetent individuals, following traumatic injuries. Clinical manifestations range from local infections, such as subcutaneous mycetoma or bone and joint infections, to pulmonary colonization and severe disseminated diseases. They are commonly found in soil and other environmental sources. To date, S. aurantiacum has been reported only from a handful of countries. To identify the worldwide distribution of this species, we screened publicly available sequencing data from fungal metabarcoding studies in the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) by multiple BLAST searches. S. aurantiacum was found in 26 countries and two islands, throughout every climatic region. This distribution is similar to that of other Scedosporium species. Several new environmental sources of S. aurantiacum including human and bovine milk, chicken and canine gut, freshwater, and feces of the giant white-tailed rat (Uromys caudimaculatus) were identified. This study demonstrated that raw sequence data stored in the SRA database can be repurposed using a big data analysis approach to answer biological questions of interest.
Affiliation(s)
- Laszlo Irinyi
  - Molecular Mycology Research Laboratory, Centre for Infectious Diseases and Microbiology, Faculty of Medicine and Health, Sydney Medical School, Westmead Clinical School, The University of Sydney, Sydney, NSW, Australia
  - Marie Bashir Institute for Infectious Diseases and Biosecurity, The University of Sydney, Sydney, NSW, Australia
  - Westmead Institute for Medical Research, Westmead, NSW, Australia
- Michael Rope
  - Division of Biomedical Science and Biochemistry, Australian National University, Canberra, ACT, Australia
- Wieland Meyer
  - Molecular Mycology Research Laboratory, Centre for Infectious Diseases and Microbiology, Faculty of Medicine and Health, Sydney Medical School, Westmead Clinical School, The University of Sydney, Sydney, NSW, Australia
  - Marie Bashir Institute for Infectious Diseases and Biosecurity, The University of Sydney, Sydney, NSW, Australia
  - Westmead Institute for Medical Research, Westmead, NSW, Australia
  - Westmead Hospital (Research and Education Network), Westmead, NSW, Australia
|
47
|
Muñoz-Baena L, Poon AFY. Using networks to analyze and visualize the distribution of overlapping genes in virus genomes. PLoS Pathog 2022; 18:e1010331. [PMID: 35202429 PMCID: PMC8903798 DOI: 10.1371/journal.ppat.1010331] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/02/2021] [Revised: 03/08/2022] [Accepted: 02/02/2022] [Indexed: 11/19/2022]
Abstract
Gene overlap occurs when two or more genes are encoded by the same nucleotides. This phenomenon is found in all taxonomic domains, but is particularly common in viruses, where it may increase the information content of compact genomes or influence the creation of new genes. Here we report a global comparative study of overlapping open reading frames (OvRFs) of 12,609 virus reference genomes in the NCBI database. We retrieved metadata associated with all annotated open reading frames (ORFs) in each genome record to calculate the number, length, and frameshift of OvRFs. Our results show that while the number of OvRFs increases with genome length, they tend to be shorter in longer genomes. The majority of overlaps involve +2 frameshifts, predominantly found in dsDNA viruses. Antisense overlaps in which one of the ORFs was encoded in the same frame on the opposite strand (−0) tend to be longer. Next, we develop a new graph-based representation of the distribution of overlaps among the ORFs of genomes in a given virus family. In the absence of an unambiguous partition of ORFs by homology at this taxonomic level, we used an alignment-free k-mer based approach to cluster protein coding sequences by similarity. We connect these clusters with two types of directed edges to indicate (1) that constituent ORFs are adjacent in one or more genomes, and (2) that these ORFs overlap. These adjacency graphs not only provide a natural visualization scheme, but also a novel statistical framework for analyzing the effects of gene- and genome-level attributes on the frequencies of overlaps.
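The overlap bookkeeping described in this abstract (number, length, and frameshift of overlapping ORFs) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: ORFs are given as 1-based inclusive coordinates with a strand, every ORF length is a multiple of three, and the frame offset for a same-strand pair is simply the difference of start coordinates modulo 3. The paper's own frameshift labels (such as +2 or −0) follow its own convention, which is not reproduced here; the `ORF` type and `find_overlaps` function are hypothetical names.

```python
from itertools import combinations
from typing import NamedTuple, Optional


class ORF(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: int  # +1 or -1


def overlap_length(a: ORF, b: ORF) -> int:
    """Number of nucleotides shared by two ORFs (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def find_overlaps(orfs):
    """Return (name_a, name_b, overlap_nt, frame_offset) for each overlapping pair.

    frame_offset is (start_b - start_a) mod 3 for same-strand pairs and None
    for opposite-strand pairs (an assumption made for this sketch).
    """
    results = []
    ordered = sorted(orfs, key=lambda o: o.start)
    for a, b in combinations(ordered, 2):
        shared = overlap_length(a, b)
        if shared == 0:
            continue
        offset: Optional[int] = None
        if a.strand == b.strand:
            offset = (b.start - a.start) % 3
        results.append((a.name, b.name, shared, offset))
    return results


if __name__ == "__main__":
    toy_genome = [
        ORF("A", 1, 300, +1),
        ORF("B", 290, 502, +1),
        ORF("C", 600, 899, -1),
    ]
    for row in find_overlaps(toy_genome):
        print(row)
```

The resulting pairs could then serve as the "overlap" edges of an adjacency graph, with ORFs first grouped into clusters as the abstract describes.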
Affiliation(s)
- Laura Muñoz-Baena
  - Department of Microbiology and Immunology, Western University, London, ON, Canada
- Art F. Y. Poon
  - Department of Microbiology and Immunology, Western University, London, ON, Canada
  - Department of Pathology and Laboratory Medicine, Western University, London, ON, Canada
|
48
|
Dougan KE, González-Pech RA, Stephens TG, Shah S, Chen Y, Ragan MA, Bhattacharya D, Chan CX. Genome-powered classification of microbial eukaryotes: focus on coral algal symbionts. Trends Microbiol 2022; 30:831-840. [DOI: 10.1016/j.tim.2022.02.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/10/2021] [Revised: 01/20/2022] [Accepted: 02/01/2022] [Indexed: 12/20/2022]
|
49
|
Cattaneo G, Ferraro Petrillo U, Giancarlo R, Palini F, Romualdi C. The power of word-frequency-based alignment-free functions: a comprehensive large-scale experimental analysis. Bioinformatics 2022; 38:925-932. [PMID: 34718420 DOI: 10.1093/bioinformatics/btab747] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/02/2021] [Revised: 10/07/2021] [Accepted: 10/26/2021] [Indexed: 02/03/2023]
Abstract
MOTIVATION Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of false positive rate (Type I error). However, assessment of their power, i.e. their ability to identify true similarity, has been limited to some members of the D2 family. The corresponding experimental studies have concentrated on short sequences, a scenario no longer adequate for current applications, where sequence lengths may vary considerably. Such a state of the art is methodologically problematic, since information regarding a key feature such as power is either missing or limited. RESULTS By concentrating on a representative set of word-frequency-based AF functions, we perform the first coherent and uniform evaluation of the power, involving also Type I error for completeness. Two alternative models of important genomic features (CIS Regulatory Modules and Horizontal Gene Transfer), a wide range of sequence lengths from a few thousand to millions, and different values of k have been used. As a result, we provide a characterization of those AF functions that is novel and informative. Indeed, we identify weak and strong points of each function considered, which may be used as a guide to choose one for analysis tasks. Remarkably, of the 15 functions that we have considered, only four stand out, with small differences between short and long sequence length scenarios. Finally, to encourage the use of our methodology for validation of future AF functions, the Big Data platform supporting it is public. AVAILABILITY AND IMPLEMENTATION The software is available at: https://github.com/pipp8/power_statistics. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
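For readers unfamiliar with the D2 family mentioned in this abstract, the basic D2 statistic is the inner product of the k-mer (word) count vectors of two sequences. The sketch below is a minimal Python illustration of that one function only; it is not taken from the linked repository, and the normalised D2 variants, which additionally require a background model, are not shown. The function names are assumptions chosen for this example.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count all overlapping words of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x: str, y: str, k: int) -> int:
    """Basic D2 statistic: sum over words w of count_x(w) * count_y(w)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # A Counter returns 0 for words absent from y, so only shared words contribute.
    return sum(count * cy[word] for word, count in cx.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "ACGTTTTT", k=3))
```

A larger D2 value indicates more shared words; because raw D2 grows with sequence length, comparisons across very different lengths are where normalisation and power analysis of the kind evaluated in the paper become relevant.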
Affiliation(s)
- Giuseppe Cattaneo
  - Dipartimento di Informatica, Università di Salerno, Fisciano, SA 84084, Italy
- Raffaele Giancarlo
  - Dipartimento di Matematica ed Informatica, Università di Palermo, 90133 Palermo, Italy
- Francesco Palini
  - Dipartimento di Scienze Statistiche, Università di Roma-La Sapienza, 00185 Rome, Italy
- Chiara Romualdi
  - Dipartimento di Biologia, Università di Padova, 35131 Padova, Italy
|
50
|
Millán Arias P, Alipour F, Hill KA, Kari L. DeLUCS: Deep learning for unsupervised clustering of DNA sequences. PLoS One 2022; 17:e0261531. [PMID: 35061715 PMCID: PMC8782307 DOI: 10.1371/journal.pone.0261531] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 05/03/2021] [Accepted: 12/06/2021] [Indexed: 11/25/2022]
Abstract
We present a novel Deep Learning method for the Unsupervised Clustering of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Frequency Chaos Game Representations (FCGR) of primary DNA sequences, and generates "mimic" sequence FCGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster assignment for each sequence. The clusters learned by DeLUCS match true taxonomic groups for large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genera; 3,200 randomly selected 400 kbp-long bacterial genome segments, into clusters corresponding to bacterial families; three viral genome and gene datasets, averaging 1,300 sequences each, into clusters corresponding to virus subtypes. DeLUCS significantly outperforms two classic clustering methods (K-means++ and Gaussian Mixture Models) for unlabelled data, by as much as 47%. DeLUCS is highly effective: it can cluster datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence clustering for previously intractable datasets.
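A Frequency Chaos Game Representation, as named in this abstract, is a 2^k by 2^k grid in which each cell counts the occurrences of one k-mer, with the cell chosen by the chaos-game rule of moving halfway toward the corner assigned to each successive nucleotide. The Python sketch below is a minimal illustration under stated assumptions: the corner assignment (A, C, G, T to the four corners as coded in `CORNERS`), the choice to skip k-mers containing other symbols, and the function name `fcgr` are all choices made for this example and are not taken from the DeLUCS code, whose conventions and value of k are not given in the abstract.

```python
# Corner bits (x, y) for each nucleotide; this assignment is an assumption.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq: str, k: int):
    """Return a 2**k x 2**k grid (list of rows) of k-mer counts.

    In the chaos game each step halves the distance to a corner, so the last
    nucleotide of a k-mer decides the coarsest split of the grid. That is why
    later positions are written into more significant bits of the cell index.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip words with ambiguous symbols such as N
        x = y = 0
        for pos, base in enumerate(kmer):
            bx, by = CORNERS[base]
            x |= bx << pos
            y |= by << pos
        grid[y][x] += 1
    return grid


if __name__ == "__main__":
    for row in fcgr("ACGTACGTTTGA", k=2):
        print(row)
```

The resulting grid can be treated as a small image or flattened into a vector, which is the kind of fixed-size input a neural network can consume regardless of the original sequence length.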
Affiliation(s)
- Pablo Millán Arias
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Fatemeh Alipour
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Kathleen A. Hill
  - Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
|