1
|
Stock M, Van Criekinge W, Boeckaerts D, Taelman S, Van Haeverbeke M, Dewulf P, De Baets B. Hyperdimensional computing: A fast, robust, and interpretable paradigm for biological data. PLoS Comput Biol 2024; 20:e1012426. [PMID: 39316621 PMCID: PMC11421772 DOI: 10.1371/journal.pcbi.1012426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/26/2024] Open
Abstract
Advances in bioinformatics are primarily due to new algorithms for processing diverse biological data sources. While sophisticated alignment algorithms have been pivotal in analyzing biological sequences, deep learning has substantially transformed bioinformatics, addressing sequence, structure, and functional analyses. However, these methods are incredibly data-hungry, compute-intensive, and hard to interpret. Hyperdimensional computing (HDC) has recently emerged as an exciting alternative. The key idea is that random vectors of high dimensionality can represent concepts such as sequence identity or phylogeny. These vectors can then be combined using simple operators for learning, reasoning, or querying by exploiting the peculiar properties of high-dimensional spaces. Our work reviews and explores HDC's potential for bioinformatics, emphasizing its efficiency, interpretability, and adeptness in handling multimodal and structured data. HDC holds great potential for various omics data searching, biosignal analysis, and health applications.
Collapse
Affiliation(s)
- Michiel Stock
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
| | - Wim Van Criekinge
- Biobix Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
| | - Dimitri Boeckaerts
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Laboratory of Applied Biotechnology, Department of Biotechnology, Ghent University, Ghent, Belgium
| | - Steff Taelman
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Biobix Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- BioLizard nv, Ghent, Belgium
| | - Maxime Van Haeverbeke
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
| | - Pieter Dewulf
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
| | - Bernard De Baets
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
| |
Collapse
|
2
|
Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024] Open
Abstract
Traditional alignment-based methods meet serious challenges in genome sequence comparison and phylogeny reconstruction due to their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. Then for each DNA sequence or protein sequence in a dataset, our method converts the sequence into a feature vector that represents the sequence information based on CGR weighted by the DL model to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences of 8 datasets of viruses to construct the phylogenetic trees. We compared the Robinson-Foulds (RF) distance between the phylogenetic tree constructed by CGRWDL and the reference tree by other advanced methods for each dataset. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and the RF scores between the trees and the reference trees are smaller than that with other methods.
Collapse
Affiliation(s)
- Ting Wang
- National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
| | - Zu-Guo Yu
- National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
| | - Jinyan Li
- School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China
| |
Collapse
|
3
|
Cahuantzi R, Lythgoe KA, Hall I, Pellis L, House T. Unsupervised identification of significant lineages of SARS-CoV-2 through scalable machine learning methods. Proc Natl Acad Sci U S A 2024; 121:e2317284121. [PMID: 38478692 PMCID: PMC10962941 DOI: 10.1073/pnas.2317284121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2023] [Accepted: 02/05/2024] [Indexed: 03/21/2024] Open
Abstract
Since its emergence in late 2019, SARS-CoV-2 has diversified into a large number of lineages and caused multiple waves of infection globally. Novel lineages have the potential to spread rapidly and internationally if they have higher intrinsic transmissibility and/or can evade host immune responses, as has been seen with the Alpha, Delta, and Omicron variants of concern. They can also cause increased mortality and morbidity if they have increased virulence, as was seen for Alpha and Delta. Phylogenetic methods provide the "gold standard" for representing the global diversity of SARS-CoV-2 and to identify newly emerging lineages. However, these methods are computationally expensive, struggle when datasets get too large, and require manual curation to designate new lineages. These challenges provide a motivation to develop complementary methods that can incorporate all of the genetic data available without down-sampling to extract meaningful information rapidly and with minimal curation. In this paper, we demonstrate the utility of using algorithmic approaches based on word-statistics to represent whole sequences, bringing speed, scalability, and interpretability to the construction of genetic topologies. While not serving as a substitute for current phylogenetic analyses, the proposed methods can be used as a complementary, and fully automatable, approach to identify and confirm new emerging variants.
Collapse
Affiliation(s)
- Roberto Cahuantzi
- Department of Mathematics, The University of Manchester, ManchesterM13 9PL, United Kingdom
- United Kingdom Health Security Agency, University of Oxford, OxfordOX3 7LF, United Kingdom
| | - Katrina A. Lythgoe
- Department of Biology, University of Oxford, OxfordOX1 3SZ, United Kingdom
- Big Data Institute, University of Oxford, OxfordOX3 7LF, United Kingdom
- Pandemic Sciences Institute, University of Oxford, OxfordOX3 7LF, United Kingdom
| | - Ian Hall
- Department of Mathematics, The University of Manchester, ManchesterM13 9PL, United Kingdom
| | - Lorenzo Pellis
- Department of Mathematics, The University of Manchester, ManchesterM13 9PL, United Kingdom
| | - Thomas House
- Department of Mathematics, The University of Manchester, ManchesterM13 9PL, United Kingdom
| |
Collapse
|
4
|
Dey S, Ghosh P, Das S. Positional difference and Frequency (PdF) based alignment-free technique for genome sequence comparison. J Biomol Struct Dyn 2023; 42:12660-12688. [PMID: 37885236 DOI: 10.1080/07391102.2023.2272748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2023] [Accepted: 09/19/2023] [Indexed: 10/28/2023]
Abstract
In the field of computational biology, genome sequence comparison among different species is essential and has applications in both the research and scientific fields. Owing to the lengthy processing time and large number of data sets, the alignment-based approaches are unsuitable and ineffective. Therefore, alignment-free techniques have obtained popularity for acquiring proper sequence clustering and evolutionary relationship among species. In this paper, a complete bipartite graph based Positional difference and Frequency (PdF) vector descriptor is introduced. Positional difference and Frequency, two parameters, are applied to the genome sequence to create a 16- dimensional vector descriptor using the di-nucleotide representation of genome sequence. Subsequently, a distance matrix is calculated to construct the phylogenetic trees for different data sets of mammals and viruses. The achieved outcomes are compared with the phylogenetic trees of the earlier methods viz. the FFP method, the ClustalW method, the MEV method, the PCNV method and the FIS method. In most instances, the proposed method produces more precise outcomes than the preceding techniques and has potential for genome sequence comparison on both the equal and unequal length of data-sets.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Sudeshna Dey
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
| | - Papri Ghosh
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
| | - Subhram Das
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
| |
Collapse
|
5
|
Akbari Rokn Abadi S, Mohammadi A, Koohi S. A new profiling approach for DNA sequences based on the nucleotides' physicochemical features for accurate analysis of SARS-CoV-2 genomes. BMC Genomics 2023; 24:266. [PMID: 37202721 DOI: 10.1186/s12864-023-09373-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2022] [Accepted: 05/11/2023] [Indexed: 05/20/2023] Open
Abstract
BACKGROUND The prevalence of the COVID-19 disease in recent years and its widespread impact on mortality, as well as various aspects of life around the world, has made it important to study this disease and its viral cause. However, very long sequences of this virus increase the processing time, complexity of calculation, and memory consumption required by the available tools to compare and analyze the sequences. RESULTS We present a new encoding method, named PC-mer, based on the k-mer and physic-chemical properties of nucleotides. This method minimizes the size of encoded data by around 2 k times compared to the classical k-mer based profiling method. Moreover, using PC-mer, we designed two tools: 1) a machine-learning-based classification tool for coronavirus family members with the ability to recive input sequences from the NCBI database, and 2) an alignment-free computational comparison tool for calculating dissimilarity scores between coronaviruses at the genus and species levels. CONCLUSIONS PC-mer achieves 100% accuracy despite the use of very simple classification algorithms based on Machine Learning. Assuming dynamic programming-based pairwise alignment as the ground truth approach, we achieved a degree of convergence of more than 98% for coronavirus genus-level sequences and 93% for SARS-CoV-2 sequences using PC-mer in the alignment-free classification method. This outperformance of PC-mer suggests that it can serve as a replacement for alignment-based approaches in certain sequence analysis applications that rely on similarity/dissimilarity scores, such as searching sequences, comparing sequences, and certain types of phylogenetic analysis methods that are based on sequence comparison.
Collapse
Affiliation(s)
| | | | - Somayyeh Koohi
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.
| |
Collapse
|
6
|
Generating Minimal Models of H1N1 NS1 Gene Sequences Using Alignment-Based and Alignment-Free Algorithms. Genes (Basel) 2023; 14:genes14010186. [PMID: 36672928 PMCID: PMC9858667 DOI: 10.3390/genes14010186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2022] [Revised: 01/03/2023] [Accepted: 01/05/2023] [Indexed: 01/13/2023] Open
Abstract
For virus classification and tracing, one idea is to generate minimal models from the gene sequences of each virus group for comparative analysis within and between classes, as well as classification and tracing of new sequences. The starting point of defining a minimal model for a group of gene sequences is to find their longest common sequence (LCS), but this is a non-deterministic polynomial-time hard (NP-hard) problem. Therefore, we applied some heuristic approaches of finding LCS, as well as some of the newer methods of treating gene sequences, including multiple sequence alignment (MSA) and k-mer natural vector (NV) encoding. To evaluate our algorithms, a five-fold cross validation classification scheme on a dataset of H1N1 virus non-structural protein 1 (NS1) gene was analyzed. The results indicate that the MSA-based algorithm has the best performance measured by classification accuracy, while the NV-based algorithm exhibits advantages in the time complexity of generating minimal models.
Collapse
|
7
|
Bohnsack KS, Kaden M, Abel J, Villmann T. Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:119-135. [PMID: 34990369 DOI: 10.1109/tcbb.2022.3140873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
The encounter of large amounts of biological sequence data generated during the last decades and the algorithmic and hardware improvements have offered the possibility to apply machine learning techniques in bioinformatics. While the machine learning community is aware of the necessity to rigorously distinguish data transformation from data comparison and adopt reasonable combinations thereof, this awareness is often lacking in the field of comparative sequence analysis. With realization of the disadvantages of alignments for sequence comparison, some typical applications use more and more so-called alignment-free approaches. In light of this development, we present a conceptual framework for alignment-free sequence comparison, which highlights the delineation of: 1) the sequence data transformation comprising of adequate mathematical sequence coding and feature generation, from 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. We consider coding to be an information-loss free data transformation in order to get an appropriate representation, whereas feature generation is inevitably information-lossy with the intention to extract just the task-relevant information. This distinction sheds light on the plethora of methods available and assists in identifying suitable methods in machine learning and data analysis to compare the sequences under these premises.
Collapse
|
8
|
Munagala NVTS, Amanchi PK, Balasubramanian K, Panicker A, Nagaraj N. Compression-Complexity Measures for Analysis and Classification of Coronaviruses. ENTROPY (BASEL, SWITZERLAND) 2022; 25:81. [PMID: 36673224 PMCID: PMC9857615 DOI: 10.3390/e25010081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/05/2022] [Revised: 12/10/2022] [Accepted: 12/18/2022] [Indexed: 06/17/2023]
Abstract
Finding a vaccine or specific antiviral treatment for a global pandemic of virus diseases (such as the ongoing COVID-19) requires rapid analysis, annotation and evaluation of metagenomic libraries to enable a quick and efficient screening of nucleotide sequences. Traditional sequence alignment methods are not suitable and there is a need for fast alignment-free techniques for sequence analysis. Information theory and data compression algorithms provide a rich set of mathematical and computational tools to capture essential patterns in biological sequences. In this study, we investigate the use of compression-complexity (Effort-to-Compress or ETC and Lempel-Ziv or LZ complexity) based distance measures for analyzing genomic sequences. The proposed distance measure is used to successfully reproduce the phylogenetic trees for a mammalian dataset consisting of eight species clusters, a set of coronaviruses belonging to group I, group II, group III, and SARS-CoV-1 coronaviruses, and a set of coronaviruses causing COVID-19 (SARS-CoV-2), and those not causing COVID-19. Having demonstrated the usefulness of these compression complexity measures, we employ them for the automatic classification of COVID-19-causing genome sequences using machine learning techniques. Two flavors of SVM (linear and quadratic) along with linear discriminant and fine K Nearest Neighbors classifer are used for classification. Using a data set comprising 1001 coronavirus sequences (causing COVID-19 and those not causing COVID-19), a classification accuracy of 98% is achieved with a sensitivity of 95% and a specificity of 99.8%. This work could be extended further to enable medical practitioners to automatically identify and characterize coronavirus strains and their rapidly growing mutants in a fast and efficient fashion.
Collapse
Affiliation(s)
- Naga Venkata Trinath Sai Munagala
- Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India
| | - Prem Kumar Amanchi
- Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India
| | - Karthi Balasubramanian
- Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India
| | - Athira Panicker
- Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India
| | - Nithin Nagaraj
- Consciousness Studies Programme, National Institute of Advanced Studies, Bengaluru 560012, Karnataka, India
| |
Collapse
|
9
|
WalkIm: Compact image-based encoding for high-performance classification of biological sequences using simple tuning-free CNNs. PLoS One 2022; 17:e0267106. [PMID: 35427371 PMCID: PMC9012348 DOI: 10.1371/journal.pone.0267106] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2021] [Accepted: 04/01/2022] [Indexed: 11/28/2022] Open
Abstract
The classification of biological sequences is an open issue for a variety of data sets, such as viral and metagenomics sequences. Therefore, many studies utilize neural network tools, as the well-known methods in this field, and focus on designing customized network structures. However, a few works focus on more effective factors, such as input encoding method or implementation technology, to address accuracy and efficiency issues in this area. Therefore, in this work, we propose an image-based encoding method, called as WalkIm, whose adoption, even in a simple neural network, provides competitive accuracy and superior efficiency, compared to the existing classification methods (e.g. VGDC, CASTOR, and DLM-CNN) for a variety of biological sequences. Using WalkIm for classifying various data sets (i.e. viruses whole-genome data, metagenomics read data, and metabarcoding data), it achieves the same performance as the existing methods, with no enforcement of parameter initialization or network architecture adjustment for each data set. It is worth noting that even in the case of classifying high-mutant data sets, such as Coronaviruses, it achieves almost 100% accuracy for classifying its various types. In addition, WalkIm achieves high-speed convergence during network training, as well as reduction of network complexity. Therefore WalkIm method enables us to execute the classifying neural networks on a normal desktop system in a short time interval. Moreover, we addressed the compatibility of WalkIm encoding method with free-space optical processing technology. Taking advantages of optical implementation of convolutional layers, we illustrated that the training time can be reduced by up to 500 time. In addition to all aforementioned advantages, this encoding method preserves the structure of generated images in various modes of sequence transformation, such as reverse complement, complement, and reverse modes.
Collapse
|
10
|
High-Throughput Genotyping Technologies in Plant Taxonomy. Methods Mol Biol 2021; 2222:149-166. [PMID: 33301093 DOI: 10.1007/978-1-0716-0997-2_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023]
Abstract
Molecular markers provide researchers with a powerful tool for variation analysis between plant genomes. They are heritable and widely distributed across the genome and for this reason have many applications in plant taxonomy and genotyping. Over the last decade, molecular marker technology has developed rapidly and is now a crucial component for genetic linkage analysis, trait mapping, diversity analysis, and association studies. This chapter focuses on molecular marker discovery, its application, and future perspectives for plant genotyping through pangenome assemblies. Included are descriptions of automated methods for genome and sequence distance estimation, genome contaminant analysis in sequence reads, genome structural variation, and SNP discovery methods.
Collapse
|
11
|
Sengupta DC, Hill MD, Benton KR, Banerjee HN. Similarity Studies of Corona Viruses through Chaos Game Representation. COMPUTATIONAL MOLECULAR BIOSCIENCE 2020; 10:61-72. [PMID: 32953249 PMCID: PMC7497811 DOI: 10.4236/cmb.2020.103004] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
Abstract
The novel coronavirus (SARS-COV-2) is generally referred to as Covid-19 virus has spread to 213 countries with nearly 7 million confirmed cases and nearly 400,000 deaths. Such major outbreaks demand classification and origin of the virus genomic sequence, for planning, containment, and treatment. Motivated by the above need, we report two alignment-free methods combing with CGR to perform clustering analysis and create a phylogenetic tree based on it. To each DNA sequence we associate a matrix then define distance between two DNA sequences to be the distance between their associated matrix. These methods are being used for phylogenetic analysis of coronavirus sequences. Our approach provides a powerful tool for analyzing and annotating genomes and their phylogenetic relationships. We also compare our tool to ClustalX algorithm which is one of the most popular alignment methods. Our alignment-free methods are shown to be capable of finding closest genetic relatives of coronaviruses.
Collapse
Affiliation(s)
- Dipendra C Sengupta
- Department of Mathematics, Computer Science & Engineering Technology, Elizabeth City State University, Elizabeth City, North Carolina, USA
| | - Matthew D Hill
- Department of Mathematics, Computer Science & Engineering Technology, Elizabeth City State University, Elizabeth City, North Carolina, USA
| | - Kevin R Benton
- Department of Mathematics, Computer Science & Engineering Technology, Elizabeth City State University, Elizabeth City, North Carolina, USA
| | - Hirendra N Banerjee
- Department Natural Sciences, Elizabeth City State University, Elizabeth City, North Carolina, USA
| |
Collapse
|
12
|
A new graph-theoretic approach to determine the similarity of genome sequences based on nucleotide triplets. Genomics 2020; 112:4701-4714. [PMID: 32827671 PMCID: PMC7437474 DOI: 10.1016/j.ygeno.2020.08.023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2019] [Revised: 07/15/2020] [Accepted: 08/17/2020] [Indexed: 11/22/2022]
Abstract
Methods of finding sequence similarity play a significant role in computational biology. Owing to the rapid increase of genome sequences in public databases, the evolutionary relationship of species becomes more challenging. But traditional alignment-based methods are found inappropriate due to their time-consuming nature. Therefore, it is necessary to find a faster method, which applies to species phylogeny. In this paper, a new graph-theory based alignment-free sequence comparison method is proposed. A complete-bipartite graph is used to represent each genome sequence based on its nucleotide triplets. Subsequently, with the help of the weights of edges of the graph, a vector descriptor is formed. Finally, the phylogenetic tree is drawn using the UPGMA algorithm. In the present case, the datasets for comparison are related to mammals, viruses, and bacteria. In most of the cases, the phylogeny in the present case is found to be more satisfactory as compared to earlier methods. A new graph-theory based alignment-free genome sequence comparison. Use of complete bipartite graph to represent genome sequences. Descriptor based on the weights of the edges of the graph. Comparison of the phylogenetic trees of different mammals, viruses, and bacteria. Less time complexity compared to that of earlier methods.
Collapse
|
13
|
Positional Correlation Natural Vector: A Novel Method for Genome Comparison. Int J Mol Sci 2020; 21:ijms21113859. [PMID: 32485813 PMCID: PMC7312176 DOI: 10.3390/ijms21113859] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2020] [Revised: 05/17/2020] [Accepted: 05/26/2020] [Indexed: 12/17/2022] Open
Abstract
Advances in sequencing technology have made large amounts of biological data available. Evolutionary analysis of data such as DNA sequences is highly important in biological studies. As alignment methods are ineffective for analyzing large-scale data due to their inherently high costs, alignment-free methods have recently attracted attention in the field of bioinformatics. In this paper, we introduce a new positional correlation natural vector (PCNV) method that involves converting a DNA sequence into an 18-dimensional numerical feature vector. Using frequency and position correlation to represent the nucleotide distribution, it is possible to obtain a PCNV for a DNA sequence. This new numerical vector design uses six suitable features to characterize the correlation among nucleotide positions in sequences. PCNV is also very easy to compute and can be used for rapid genome comparison. To test our novel method, we performed phylogenetic analysis with several viral and bacterial genome datasets with PCNV. For comparison, an alignment-based method, Bayesian inference, and two alignment-free methods, feature frequency profile and natural vector, were performed using the same datasets. We found that the PCNV technique is fast and accurate when used for phylogenetic analysis and classification of viruses and bacteria.
Collapse
|
14
|
De Pierri CR, Voyceik R, Santos de Mattos LGC, Kulik MG, Camargo JO, Repula de Oliveira AM, de Lima Nichio BT, Marchaukoski JN, da Silva Filho AC, Guizelini D, Ortega JM, Pedrosa FO, Raittz RT. SWeeP: representing large biological sequences datasets in compact vectors. Sci Rep 2020; 10:91. [PMID: 31919449 PMCID: PMC6952362 DOI: 10.1038/s41598-019-55627-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2019] [Accepted: 12/02/2019] [Indexed: 12/25/2022] Open
Abstract
Vectoral and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data. Even so, most current methods involve sequence comparisons via alignment-based heuristics and fail when applied to the analysis of large data sets. Here, we present “Spaced Words Projection (SWeeP)”, a method for representing biological sequences using relatively small vectors while preserving intersequence comparability. SWeeP uses spaced-words by scanning the sequences and generating indices to create a higher-dimensional vector that is later projected onto a smaller randomly oriented orthonormal base. We constructed phylogenetic trees for all organisms with mitochondrial and bacterial protein data in the NCBI database. SWeeP quickly built complete and accurate trees for these organisms with low computational cost. We compared SWeeP to other alignment-free methods and Sweep was 10 to 100 times quicker than the other techniques. A tool to build SWeeP vectors is available at https://sourceforge.net/projects/spacedwordsprojection/.
Collapse
Affiliation(s)
- Camilla Reginatto De Pierri
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil
| | - Ricardo Voyceik
- Federal University of Minas Gerais, Institute of Biological Sciences (ICB), Belo Horizonte, Minas Gerais, Brazil
| | | | - Mariane Gonçalves Kulik
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil
| | - Josué Oliveira Camargo
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil
| | - Aryel Marlus Repula de Oliveira
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Genetics, Curitiba, Paraná, Brazil
| | - Bruno Thiago de Lima Nichio
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil
| | | | - Antonio Camilo da Silva Filho
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Pharmaceutical Sciences, Curitiba, Paraná, Brazil
| | - Dieval Guizelini
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil
| | - J Miguel Ortega
- Federal University of Minas Gerais, Institute of Biological Sciences (ICB), Belo Horizonte, Minas Gerais, Brazil
| | - Fabio O Pedrosa
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil
| | - Roberto Tadeu Raittz
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil. .,Federal University of Minas Gerais, Institute of Biological Sciences (ICB), Belo Horizonte, Minas Gerais, Brazil. .,Federal University of Paraná, Department of Genetics, Curitiba, Paraná, Brazil.
| |
Collapse
|
15
|
Lichtblau D. Alignment-free genomic sequence comparison using FCGR and signal processing. BMC Bioinformatics 2019; 20:742. [PMID: 31888438 PMCID: PMC6937637 DOI: 10.1186/s12859-019-3330-3] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2019] [Accepted: 12/17/2019] [Indexed: 01/14/2023] Open
Abstract
Background Alignment-free methods of genomic comparison offer the possibility of scaling to large data sets of nucleotide sequences comprised of several thousand or more base pairs. Such methods can be used for purposes of deducing “nearby” species in a reference data set, or for constructing phylogenetic trees. Results We describe one such method that gives quite strong results. We use the Frequency Chaos Game Representation (FCGR) to create images from such sequences, We then reduce dimension, first using a Fourier trig transform, followed by a Singular Values Decomposition (SVD). This gives vectors of modest length. These in turn are used for fast sequence lookup, construction of phylogenetic trees, and classification of virus genomic data. We illustrate the accuracy and scalability of this approach on several benchmark test sets. Conclusions The tandem of FCGR and dimension reductions using Fourier-type transforms and SVD provides a powerful approach for alignment-free genomic comparison. Results compare favorably and often surpass best results reported in prior literature. Good scalability is also observed.
Collapse
|
16
|
Das S, Das A, Mondal B, Dey N, Bhattacharya DK, Tibarewala DN. Genome sequence comparison under a new form of tri-nucleotide representation based on bio-chemical properties of nucleotides. Gene 2019; 730:144257. [PMID: 31759983 DOI: 10.1016/j.gene.2019.144257] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2019] [Revised: 11/01/2019] [Accepted: 11/05/2019] [Indexed: 10/25/2022]
Abstract
Genetic sequence analysis, classification of genome sequence and evolutionary relationship between species using their biological sequences, are the emerging research domain in Bioinformatics. Several methods have already been applied to DNA sequence comparison under tri-nucleotide representation. In this paper, a new form of tri-nucleotide representation is proposed for sequence comparison. The comparison does not depend on the alignment of the sequences. In this representation, the bio-chemical properties of the nucleotides are considered. The novelty of this method is that the sequences of unequal lengths are represented by vectors of the same length and each of the tri-nucleotide formed out of the given sequence has its unique representation. To validate the proposed method, it is verified on several data sets related to mammalians, viruses and bacteria. The results of this method are further compared with those obtained by methods such as probabilistic method, natural vector method, Fourier power spectrum method, multiple encoding vector method, and feature frequency profiles method. Moreover, this method produces accurate phylogeny in all the cases. It is also proved that the time complexity of the present method is less.
Collapse
Affiliation(s)
- Subhram Das
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India.
| | - Arijit Das
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
| | - Bingshati Mondal
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
| | - Nilanjan Dey
- Department of Information Technology, Techno India College of Technology, Kolkata, India
| | - D K Bhattacharya
- Department of Pure Mathematics, Calcutta University, Kolkata, India
| | - D N Tibarewala
- Department of Bio-Science and Engineering, Jadavpur University, Kolkata, India
| |
Collapse
|
17
|
Magnus representation of genome sequences. J Theor Biol 2019; 480:104-111. [DOI: 10.1016/j.jtbi.2019.08.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2019] [Revised: 07/30/2019] [Accepted: 08/05/2019] [Indexed: 11/24/2022]
|
18
|
Li J, Zhang L, Li H, Ping Y, Xu Q, Wang R, Tan R, Wang Z, Liu B, Wang Y. Integrated entropy-based approach for analyzing exons and introns in DNA sequences. BMC Bioinformatics 2019; 20:283. [PMID: 31182012 PMCID: PMC6557737 DOI: 10.1186/s12859-019-2772-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Numerous essential algorithms and methods, including entropy-based quantitative methods, have been developed to analyze complex DNA sequences since the last decade. Exons and introns are the most notable components of DNA and their identification and prediction are always the focus of state-of-the-art research. RESULTS In this study, we designed an integrated entropy-based analysis approach, which involves modified topological entropy calculation, genomic signal processing (GSP) method and singular value decomposition (SVD), to investigate exons and introns in DNA sequences. We optimized and implemented the topological entropy and the generalized topological entropy to calculate the complexity of DNA sequences, highlighting the characteristics of repetition sequences. By comparing digitalizing entropy values of exons and introns, we observed that they are significantly different. After we converted DNA data to numerical topological entropy value, we applied SVD method to effectively investigate exon and intron regions on a single gene sequence. Additionally, several genes across five species are used for exon predictions. CONCLUSIONS Our approach not only helps to explore the complexity of DNA sequence and its functional elements, but also provides an entropy-based GSP method to analyze exon and intron regions. Our work is feasible across different species and extendable to analyze other components in both coding and noncoding region of DNA sequences.
Collapse
Affiliation(s)
- Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Li Zhang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Huinian Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Yuan Ping
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Qingzhe Xu
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Rongjie Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| | - Renjie Tan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| | - Zhen Wang
- CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031 China
| | - Bo Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| |
Collapse
|
19
|
Criscuolo A. A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies. RESEARCH IDEAS AND OUTCOMES 2019. [DOI: 10.3897/rio.5.e36178] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022] Open
Abstract
This paper describes a novel alignment-free distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools. For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution. These pairwise evolutionary distances are then used to infer a phylogenetic tree and assess a confidence support for each internal branch. Analyses of both simulated and real genome datasets show that this bioinformatics procedure allows accurate phylogenetic trees to be reconstructed with fast running times, especially when launched on multiple threads. Implemented in a publicly available script, named JolyTree, this procedure is a useful approach for quickly inferring species trees without the burden and potential biases of multiple sequence alignments.
Collapse
|
20
|
Randhawa GS, Hill KA, Kari L. ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels. BMC Genomics 2019; 20:267. [PMID: 30943897 PMCID: PMC6448311 DOI: 10.1186/s12864-019-5571-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2018] [Accepted: 02/27/2019] [Indexed: 11/11/2022] Open
Abstract
Background Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods. Results We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed, on two small benchmark datasets and one large 4322 vertebrate mtDNA genomes dataset. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy, and is overall faster. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4,710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the “Purine/Pyrimidine”, “Just-A” and “Real” numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes. Conclusions Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
Collapse
Affiliation(s)
- Gurjit S Randhawa
- Department of Computer Science, University of Western Ontario, London, ON, Canada.
| | - Kathleen A Hill
- Department of Biology, University of Western Ontario, London, ON, Canada
| | - Lila Kari
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
| |
Collapse
|
21
|
Saw AK, Raj G, Das M, Talukdar NC, Tripathy BC, Nandi S. Alignment-free method for DNA sequence clustering using Fuzzy integral similarity. Sci Rep 2019; 9:3753. [PMID: 30842590 PMCID: PMC6403383 DOI: 10.1038/s41598-019-40452-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2018] [Accepted: 01/28/2019] [Indexed: 12/28/2022] Open
Abstract
A larger amount of sequence data in private and public databases produced by next-generation sequencing put new challenges due to limitation associated with the alignment-based method for sequence comparison. So, there is a high need for faster sequence analysis algorithms. In this study, we developed an alignment-free algorithm for faster sequence analysis. The novelty of our approach is the inclusion of fuzzy integral with Markov chain for sequence analysis in the alignment-free model. The method estimate the parameters of a Markov chain by considering the frequencies of occurrence of all possible nucleotide pairs from each DNA sequence. These estimated Markov chain parameters were used to calculate similarity among all pairwise combinations of DNA sequences based on a fuzzy integral algorithm. This matrix is used as an input for the neighbor program in the PHYLIP package for phylogenetic tree construction. Our method was tested on eight benchmark datasets and on in-house generated datasets (18 s rDNA sequences from 11 arbuscular mycorrhizal fungi (AMF) and 16 s rDNA sequences of 40 bacterial isolates from plant interior). The results indicate that the fuzzy integral algorithm is an efficient and feasible alignment-free method for sequence analysis on the genomic scale.
Collapse
Affiliation(s)
- Ajay Kumar Saw
- Institute of Advanced Study in Science and Technology, Mathematical Sciences Division, Guwahati, 781035, India
| | - Garima Raj
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
| | - Manashi Das
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
| | - Narayan Chandra Talukdar
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
| | | | - Soumyadeep Nandi
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India.
| |
Collapse
|