1
Çiftçi B, Tekin R. Prediction of viral families and hosts of single-stranded RNA viruses based on K-Mer coding from phylogenetic gene sequences. Comput Biol Chem 2024; 112:108114. [PMID: 38852362 DOI: 10.1016/j.compbiolchem.2024.108114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2023] [Revised: 05/06/2024] [Accepted: 05/25/2024] [Indexed: 06/11/2024]
Abstract
There are billions of virus species worldwide, and viruses, the smallest parasitic entities, pose a serious threat. Therefore, fighting associated disorders requires an understanding of the genetic structure of viruses. Considering the wide diversity and rapid evolution of viruses, there is a critical need to quickly and accurately classify viral species and their potential hosts to better understand transmission dynamics, facilitating the development of targeted therapies. Accordingly, this study investigated the classes of RNA viruses based on their genomic sequences using Machine Learning (ML) and Deep Learning (DL) models. The PhyVirus dataset, consisting of pathogenic single-stranded RNA viruses of Baltimore groups four (+ssRNA) and five (-ssRNA) with different hosts and species, was analyzed. The viral gene sequences in the dataset were encoded using the K-Mer coding technique, which is based on base words of various lengths. The study used classical ML algorithms (Random Forest, Gradient Boosting and Extra Trees) and a DL algorithm, the Fully Connected Deep Neural Network, to predict viral families and hosts. Detailed analyses were performed on classifier performance in scenarios with different train-test ratios and different word lengths (k-values) for K-Mer. The results show that the Fully Connected Deep Neural Network has a high success rate of 99.60% in predicting virus families. In predicting virus hosts, the Extra Trees classifier achieved the highest success rate of 81.53%. This study is considered to be the first classification study in the literature on this dataset, which has a very large family and host diversity consisting of gene sequences of single-stranded RNA viruses. Our detailed investigations of how varying word lengths based on K-Mer coding in gene sequences affect the classification into viral families and hosts make this study particularly valuable. This study shows that ML and DL methods have the potential to produce valuable results in phylogenetic studies. In addition, the results and high-performance values show that these methods can be successfully used in regenerative applications of gene sequences or in studies such as the elimination of losses in gene sequences.
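As a rough illustration of the K-Mer coding mentioned in the abstract above, the sketch below turns a nucleotide sequence into a fixed-length vector of overlapping k-mer frequencies. The function name `kmer_frequencies`, the `ACGT` alphabet, and the choice to skip windows containing ambiguous bases are assumptions made for illustration; the paper's own preprocessing may differ.

```python
from itertools import product

def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Relative frequencies of all overlapping k-mers, ordered lexicographically by `alphabet`."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # assumption: ignore windows with symbols such as N
            counts[word] += 1
            total += 1
    # The vector always has len(alphabet)**k entries, whatever the sequence length.
    return [counts[w] / total if total else 0.0 for w in kmers]

vec = kmer_frequencies("ACGTACGTNA", k=2)  # hypothetical toy sequence
```

The vector length depends only on k (4^k for nucleotides), which is why the word length is a natural parameter to vary when comparing classifiers.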
Affiliation(s)
- Bahar Çiftçi
  - Batman University, Institute of Graduate Studies, Department of Electrical and Electronic Engineering, Turkey; Siirt University, Distance Education Application and Research Center, Turkey.
- Ramazan Tekin
  - Batman University, Faculty of Engineering and Architecture, Department of Computer Engineering, Turkey.
2
Qayyum A, Benzinou A, Saidani O, Alhayan F, Khan MA, Masood A, Mazher M. Assessment and classification of COVID-19 DNA sequence using pairwise features concatenation from multi-transformer and deep features with machine learning models. SLAS Technol 2024; 29:100147. [PMID: 38796034 DOI: 10.1016/j.slast.2024.100147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2024] [Revised: 03/31/2024] [Accepted: 05/22/2024] [Indexed: 05/28/2024]
Abstract
The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such a major viral outbreak demands early elucidation of the taxonomic classification and origin of the virus genomic sequence for strategic planning, containment, and treatment. The emerging global infectious disease COVID-19, caused by the novel Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), has presented critical threats to global public health and the economy since it was identified in late December 2019 in China. The virus has gone through various pathways of evolution. Due to the continued evolution of SARS-CoV-2, researchers worldwide are working to mitigate and suppress its spread and to better understand it by deploying deep learning and machine learning approaches. In a general computational context for biomedical data analysis, DNA sequence classification is a crucial challenge. Several machine and deep learning techniques have been used in recent years to complete this task with some success. The classification of DNA sequences is a key research area in bioinformatics as it enables researchers to conduct genomic analysis and detect possible diseases. In this paper, three state-of-the-art deep learning-based models are proposed using two DNA sequence conversion methods. We also propose a novel multi-transformer deep learning model and a pairwise features fusion technique for DNA sequence classification. Furthermore, deep features are extracted from the last layer of the multi-transformer and used in machine-learning models for DNA sequence classification. The k-mer and one-hot encoding sequence conversion techniques are presented. The proposed multi-transformer achieved the highest performance in COVID DNA sequence classification. Automatic identification and classification of viruses are essential to avoid an outbreak like COVID-19. It also helps in detecting the effects of viruses and in drug design.
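The abstract above names one-hot encoding as one of its two sequence conversion methods. A minimal sketch of that encoding is given here; the function name `one_hot_encode`, the `ACGT` column order, and the all-zero row for ambiguous bases are illustrative assumptions rather than the authors' implementation.

```python
def one_hot_encode(seq, alphabet="ACGT"):
    """Map each base to a 0/1 row with a single 1 in the column of that base."""
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        if base in index:  # assumption: unknown symbols (e.g. N) stay all-zero
            row[index[base]] = 1
        rows.append(row)
    return rows

encoded = one_hot_encode("ACGN")  # hypothetical toy sequence
```

Unlike a k-mer frequency vector, the output has one row per base, so its size grows with the sequence length and inputs usually need padding or truncation before being fed to a fixed-size model.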
Affiliation(s)
- Abdul Qayyum
  - National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Oumaima Saidani
  - Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O.Box 84428, Riyadh 11671, Saudi Arabia.
- Fatimah Alhayan
  - Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O.Box 84428, Riyadh 11671, Saudi Arabia.
- Muhammad Attique Khan
  - Department of Computer Science and Mathematics, Lebanese American University, Beirut, Lebanon
- Anum Masood
  - Department of Physics, Norwegian University of Science and Technology, Trondheim NO-7491, Norway.
- Moona Mazher
  - Centre for Medical Image Computing, Department of Computer Science, University College London, London, United Kingdom
3
Duchen D, Clipman SJ, Vergara C, Thio CL, Thomas DL, Duggal P, Wojcik GL. A hepatitis B virus (HBV) sequence variation graph improves alignment and sample-specific consensus sequence construction. PLoS One 2024; 19:e0301069. [PMID: 38669259 PMCID: PMC11051683 DOI: 10.1371/journal.pone.0301069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2023] [Accepted: 03/09/2024] [Indexed: 04/28/2024] Open
Abstract
Nearly 300 million individuals live with chronic hepatitis B virus (HBV) infection (CHB), for which no curative therapy is available. As viral diversity is associated with pathogenesis and immunological control of infection, improved methods to characterize this diversity could aid drug development efforts. Conventionally, viral sequencing data are mapped/aligned to a reference genome, and only the aligned sequences are retained for analysis. Thus, reference selection is critical, yet selecting the most representative reference a priori remains difficult. We investigate an alternative pangenome approach which can combine multiple reference sequences into a graph which can be used during alignment. Using simulated short-read sequencing data generated from publicly available HBV genomes and real sequencing data from an individual living with CHB, we demonstrate alignment to a phylogenetically representative 'genome graph' can improve alignment, avoid issues of reference ambiguity, and facilitate the construction of sample-specific consensus sequences more genetically similar to the individual's infection. Graph-based methods can, therefore, improve efforts to characterize the genetics of viral pathogens, including HBV, and have broader implications in host-pathogen research.
Affiliation(s)
- Dylan Duchen
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States of America
  - Center for Biomedical Data Science, Yale School of Medicine, New Haven, CT, United States of America
- Steven J. Clipman
  - Division of Infectious Diseases, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America
- Candelaria Vergara
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States of America
- Chloe L. Thio
  - Division of Infectious Diseases, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America
- David L. Thomas
  - Division of Infectious Diseases, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America
- Priya Duggal
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States of America
- Genevieve L. Wojcik
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States of America
4
Lebatteux D, Soudeyns H, Boucoiran I, Gantt S, Diallo AB. Machine learning-based approach KEVOLVE efficiently identifies SARS-CoV-2 variant-specific genomic signatures. PLoS One 2024; 19:e0296627. [PMID: 38241279 PMCID: PMC10798494 DOI: 10.1371/journal.pone.0296627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2022] [Accepted: 12/07/2023] [Indexed: 01/21/2024] Open
Abstract
Machine learning has been shown to be effective at identifying distinctive genomic signatures among viral sequences. These signatures are defined as pervasive motifs in the viral genome that allow discrimination between species or variants. In the context of SARS-CoV-2, the identification of these signatures can assist in taxonomic and phylogenetic studies, improve the recognition and definition of emerging variants, and aid in the characterization of functional properties of polymorphic gene products. In this paper, we assess KEVOLVE, an approach based on a genetic algorithm with a machine-learning kernel, to identify multiple genomic signatures based on minimal sets of k-mers. In a comparative study, in which we analyzed a large SARS-CoV-2 genome dataset, KEVOLVE was more effective at identifying variant-discriminative signatures than several gold-standard statistical tools. Subsequently, these signatures were characterized using a new extension of KEVOLVE (KANALYZER) to highlight variations of the discriminative signatures among different classes of variants, their genomic location, and the mutations involved. The majority of identified signatures were associated with known mutations among the different variants, in terms of functional and pathological impact based on available literature. Here we showed that KEVOLVE is a robust machine learning approach to identify discriminative signatures among SARS-CoV-2 variants, which are frequently also biologically relevant, while bypassing multiple sequence alignments. The source code of the method and additional resources are available at: https://github.com/bioinfoUQAM/KEVOLVE.
Affiliation(s)
- Dylan Lebatteux
  - Department of Computer Science, Université du Québec à Montréal, Montréal, Québec, Canada
- Hugo Soudeyns
  - CHU Sainte-Justine Research Centre, Montréal, Québec, Canada
  - Department of Microbiology, Infectious Diseases and Immunology, Faculty of Medicine, Université de Montréal, Montréal, Québec, Canada
  - Department of Pediatrics, Faculty of Medicine, Université du Québec à Montréal, Montréal, Québec, Canada
- Isabelle Boucoiran
  - Department of Obstetrics and Gynecology, Faculty of Medicine, Université de Montréal, Montreal, Quebec, Canada
- Soren Gantt
  - CHU Sainte-Justine Research Centre, Montréal, Québec, Canada
  - Department of Microbiology, Infectious Diseases and Immunology, Faculty of Medicine, Université de Montréal, Montréal, Québec, Canada
5
Alipour F, Holmes C, Lu YY, Hill KA, Kari L. Leveraging machine learning for taxonomic classification of emerging astroviruses. Front Mol Biosci 2024; 10:1305506. [PMID: 38274100 PMCID: PMC10808839 DOI: 10.3389/fmolb.2023.1305506] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2023] [Accepted: 12/12/2023] [Indexed: 01/27/2024] Open
Abstract
Astroviruses are a family of genetically diverse viruses associated with disease in humans and birds with significant health effects and economic burdens. Astrovirus taxonomic classification includes two genera, Avastrovirus and Mamastrovirus. However, with next-generation sequencing, broader interspecies transmission has been observed, necessitating a reexamination of the current host-based taxonomic classification approach. In this study, a novel taxonomic classification method is presented for emergent and as yet unclassified astroviruses, based on whole genome sequence k-mer composition in addition to host information. An optional component responsible for identifying recombinant sequences was added to the method's pipeline, to counteract the impact of genetic recombination on viral classification. The proposed three-pronged classification method consists of a supervised machine learning method, an unsupervised machine learning method, and the consideration of host species. Using this three-pronged approach, we propose genus labels for 191 as yet unclassified astrovirus genomes. Genus labels are also suggested for an additional eight as yet unclassified astrovirus genomes for which incompatibility was observed with the host species, suggesting cross-species infection. Lastly, our machine learning-based approach, augmented by a principal component analysis (PCA), provides evidence supporting the hypothesis of the existence of a human astrovirus (HAstV) subgenus of the genus Mamastrovirus, and a goose astrovirus (GoAstV) subgenus of the genus Avastrovirus. Overall, this multipronged machine learning approach provides a fast, reliable, and scalable prediction method of taxonomic labels, able to keep pace with emerging viruses and the exponential increase in the output of modern genome sequencing technologies.
Affiliation(s)
- Fatemeh Alipour
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Connor Holmes
  - Department of Biology, University of Western Ontario, London, ON, Canada
- Yang Young Lu
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Kathleen A. Hill
  - Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
6
Thind AS, Sinha S. Using Chaos-Game-Representation for Analysing the SARS-CoV-2 Lineages, Newly Emerging Strains and Recombinants. Curr Genomics 2023; 24:187-195. [PMID: 38178984 PMCID: PMC10761335 DOI: 10.2174/0113892029264990231013112156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Revised: 08/09/2023] [Accepted: 09/15/2023] [Indexed: 01/06/2024] Open
Abstract
Background: Viruses have high mutation rates, facilitating rapid evolution and the emergence of new species, subspecies, strains and recombinant forms. Accurate classification of these forms is crucial for understanding viral evolution and developing therapeutic applications. Phylogenetic classification is typically performed by analyzing molecular differences at the genomic and sub-genomic levels. This involves aligning homologous proteins or genes. However, there is growing interest in developing alignment-free methods for whole-genome comparisons that are computationally efficient.
Methods: Here we elaborate on the Chaos Game Representation (CGR) method, based on concepts of statistical physics and free of sequence alignment assumptions. We adopt the CGR method for classification of the closely related clades/lineages A and B of the SARS-Corona virus 2019 (SARS-CoV-2), which is one of the fastest evolving viruses.
Results: Our study shows that the CGR approach can easily yield the SARS-CoV-2 phylogeny from the available whole genomes of lineage A and lineage B sequences. It also shows an accurate classification of eight different strains and the newly evolved XBB variant from its parental strains. Compared to alignment-based methods (Neighbour-Joining and Maximum Likelihood), the CGR method requires low computational resources, is fast and accurate for long sequences, and, being a K-mer based approach, allows simultaneous comparison of a large number of closely-related sequences of different sizes. Further, we developed an R pipeline CGRphylo, available on GitHub, which integrates the CGR module with various other R packages to create phylogenetic trees and visualize them.
Conclusion: Our findings demonstrate the efficacy of the CGR method for accurate classification and tracking of rapidly evolving viruses, offering valuable insights into the evolution and emergence of new SARS-CoV-2 strains and recombinants.
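For readers unfamiliar with the Chaos Game Representation, the following is a minimal Python sketch of the standard construction, not the CGRphylo R pipeline itself. Each base moves the current point halfway toward that base's corner of the unit square, and binning the points on a 2^k by 2^k grid yields counts that correspond to k-mers. The corner assignment, the names `cgr_points` and `fcgr`, and the skipping of ambiguous bases are assumptions chosen for illustration.

```python
# Assumed corner layout; other conventions permute the corners.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Return the CGR trajectory: one (x, y) point per unambiguous base."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:  # assumption: ambiguous symbols are skipped
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway to the base's corner
        points.append((x, y))
    return points

def fcgr(seq, k):
    """Count CGR points per cell of a 2**k x 2**k grid (frequency CGR)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    # The first k-1 points do not yet encode a full k-mer, so drop them.
    for x, y in cgr_points(seq)[k - 1:]:
        grid[min(int(y * size), size - 1)][min(int(x * size), size - 1)] += 1
    return grid
```

Because the grid size depends only on k, sequences of different lengths map to matrices of the same shape, which is what makes distance-based comparison of many genomes straightforward.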
Affiliation(s)
- Amarinder Singh Thind
  - Department of Biological Sciences, Indian Institute of Science Education & Research, Mohali, India
  - Illawarra Shoalhaven Local Health District (ISLHD), NSW Health, Australia
- Somdatta Sinha
  - Department of Biological Sciences, Indian Institute of Science Education & Research, Mohali, India
7
Arias PM, Butler J, Randhawa GS, Soltysiak MPM, Hill KA, Kari L. Environment and taxonomy shape the genomic signature of prokaryotic extremophiles. Sci Rep 2023; 13:16105. [PMID: 37752120 PMCID: PMC10522608 DOI: 10.1038/s41598-023-42518-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Accepted: 09/11/2023] [Indexed: 09/28/2023] Open
Abstract
This study provides comprehensive quantitative evidence suggesting that adaptations to extreme temperatures and pH imprint a discernible environmental component in the genomic signature of microbial extremophiles. Both supervised and unsupervised machine learning algorithms were used to analyze genomic signatures, each computed as the k-mer frequency vector of a 500 kbp DNA fragment arbitrarily selected to represent a genome. Computational experiments classified/clustered genomic signatures extracted from a curated dataset of [Formula: see text] extremophile (temperature, pH) bacteria and archaea genomes, at multiple scales of analysis, [Formula: see text]. The supervised learning resulted in high accuracies for taxonomic classifications at [Formula: see text], and medium to medium-high accuracies for environment category classifications of the same datasets at [Formula: see text]. For [Formula: see text], our findings were largely consistent with amino acid compositional biases and codon usage patterns in coding regions, previously attributed to extreme environment adaptations. The unsupervised learning of unlabelled sequences identified several exemplars of hyperthermophilic organisms with large similarities in their genomic signatures, in spite of belonging to different domains in the Tree of Life.
Affiliation(s)
- Pablo Millán Arias
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada.
- Joseph Butler
  - Department of Biology, University of Western Ontario, London, ON, Canada
- Gurjit S Randhawa
  - School of Mathematical and Computational Sciences, University of Prince Edward Island, Charlottetown, PE, Canada
- Kathleen A Hill
  - Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
8
Naorem LD, Sharma N, Raghava GPS. A web server for predicting and scanning of IL-5 inducing peptides using alignment-free and alignment-based method. Comput Biol Med 2023; 158:106864. [PMID: 37058758 DOI: 10.1016/j.compbiomed.2023.106864] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2022] [Revised: 03/06/2023] [Accepted: 03/30/2023] [Indexed: 04/16/2023]
Abstract
Interleukin-5 (IL-5) is an attractive therapeutic target due to its pivotal role in several eosinophil-mediated diseases. The aim of this study is to develop a model for predicting IL-5 inducing antigenic regions in a protein with high precision. All models in this study were trained, tested and validated on 1907 experimentally validated IL-5 inducing peptides and 7759 non-IL-5 inducing peptides obtained from IEDB. Our primary analysis indicates that IL-5 inducing peptides are dominated by certain residues such as Ile, Asn, and Tyr. It was also observed that binders of a wide range of HLA alleles can induce IL-5. Initially, alignment-based methods were developed using similarity and motif search. These alignment-based methods provide high precision but poor coverage. To overcome this limitation, we explored alignment-free methods, which are mainly machine learning-based models. Firstly, models were developed using binary profiles, and an eXtreme Gradient Boosting-based model achieved a maximum AUC of 0.59. Secondly, composition-based models were developed, and our dipeptide-based random forest model achieved a maximum AUC of 0.74. Thirdly, a random forest model developed using 250 selected dipeptides achieved AUC 0.75 and MCC 0.29 on the validation dataset, the best among the alignment-free models. To improve the performance, we developed an ensemble or hybrid method that combines alignment-based and alignment-free methods. Our hybrid method achieved AUC 0.94 with MCC 0.60 on a validation/independent dataset. The best hybrid model developed in this study has been incorporated into a user-friendly web server and a standalone package named 'IL5pred' (https://webs.iiitd.edu.in/raghava/il5pred/).
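The abstract above reports MCC alongside AUC. For reference, the Matthews correlation coefficient is computed from the four confusion-matrix counts as sketched below. The function name `mcc` and the choice to return 0.0 when the denominator is zero are conventions assumed here, not details taken from the paper.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when a row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from -1 to 1 and uses all four counts, so it is less flattering than accuracy on imbalanced data such as the 1907 positive versus 7759 negative peptides described above.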
Affiliation(s)
- Leimarembi Devi Naorem
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India.
- Neelam Sharma
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India.
- Gajendra P S Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India.
9
Abadi SAR, Mohammadi A, Koohi S. An automated ultra-fast, memory-efficient, and accurate method for viral genome classification. J Biomed Inform 2023; 139:104316. [PMID: 36781036 DOI: 10.1016/j.jbi.2023.104316] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2022] [Revised: 01/30/2023] [Accepted: 02/08/2023] [Indexed: 02/13/2023]
Abstract
The classification of different organisms into subtypes is one of the most important tools of organism studies, and the classification of viruses in particular has been the focus of many studies due to its use in virology and epidemiology. Many methods have been proposed to classify viruses, some of which are designed for a specific family of organisms and some of which are more general. Still, especially for certain categories such as Influenza and HIV, classification faces performance challenges as well as processing and memory bottlenecks. To address this, we designed an automated classifier, called PC-mer, that is based on k-mer and physicochemical characteristics of nucleotides. It reduces the number of features by about 2^k times compared to the alternative methods based on k-mer and, unlike integer and one-hot encoding methods, keeps the number of features constant as the sequence length grows. It also increases the training speed by an average of 17.93 times. This improvement in processing complexity is achieved while PC-mer also improves classification performance for a variety of virus families.
Affiliation(s)
- Amirhossein Mohammadi
  - No 717, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Somayyeh Koohi
  - No 717, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.
10
Chourasia P, Ali S, Ciccolella S, Vedova GD, Patterson M. Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads Data. J Comput Biol 2023; 30:469-491. [PMID: 36730750 DOI: 10.1089/cmb.2022.0424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
Abstract
The massive amount of genomic data appearing for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of samples of SARS-CoV-2 currently available, have appeared. Such a tool is tailored to take as input assembled, aligned, and curated full-length sequences, such as those found in the GISAID database. As high-throughput sequencing technologies continue to advance, such assembly, alignment, and curation may become a bottleneck, creating a need for methods that can process raw sequencing reads directly. In this article, we propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from the raw sequencing reads without requiring assembly. Furthermore, since such an embedding is a numerical representation, it may be applied to highly optimized classification and clustering algorithms. Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties compared to existing alignment-free baselines. In a study on real data, we show that alignment-free embeddings have better clustering properties than the Pangolin tool and that the spike region of the SARS-CoV-2 genome heavily informs the alignment-free clusterings, which is consistent with current biological knowledge of SARS-CoV-2.
Affiliation(s)
- Prakash Chourasia
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Sarwan Ali
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Simone Ciccolella
  - Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, Milan, Italy
- Gianluca Della Vedova
  - Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, Milan, Italy
- Murray Patterson
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
11
Bhattacharya D, Kleeblatt DC, Statt A, Reinhart WF. Predicting aggregate morphology of sequence-defined macromolecules with recurrent neural networks. Soft Matter 2022; 18:5037-5051. [PMID: 35748651 DOI: 10.1039/d2sm00452f] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 06/15/2023]
Abstract
Self-assembly of dilute sequence-defined macromolecules is a complex phenomenon in which the local arrangement of chemical moieties can lead to the formation of long-range structure. The dependence of this structure on the sequence necessarily implies that a mapping between the two exists, yet it has been difficult to model so far. Predicting the aggregation behavior of these macromolecules is challenging due to the lack of effective order parameters, a vast design space, inherent variability, and high computational costs associated with currently available simulation techniques. Here, we accurately predict the morphology of aggregates self-assembled from sequence-defined macromolecules using supervised machine learning. We find that regression models with implicit representation learning perform significantly better than those based on engineered features such as k-mer counting, and a recurrent-neural-network-based regressor performs the best out of nine model architectures we tested. Furthermore, we demonstrate the high-throughput screening of monomer sequences using the regression model to identify candidates for self-assembly into selected morphologies. Our strategy is shown to successfully identify multiple suitable sequences in every test we performed, so we hope the insights gained here can be extended to other increasingly complex design scenarios in the future, such as the design of sequences under polydispersity and at varying environmental conditions.
Affiliation(s)
- Debjyoti Bhattacharya
  - Materials Science and Engineering, Pennsylvania State University, University Park, PA 16802, USA.
- Devon C Kleeblatt
  - Materials Science and Engineering, Pennsylvania State University, University Park, PA 16802, USA.
- Antonia Statt
  - Materials Science and Engineering, Grainger College of Engineering, University of Illinois, Urbana-Champaign, IL 61801, USA
- Wesley F Reinhart
  - Materials Science and Engineering, Pennsylvania State University, University Park, PA 16802, USA.
  - Institute for Computational and Data Sciences, Pennsylvania State University, University Park, PA 16802, USA
12
Sarkar J, Saha I, Ghosh N, Maity D, Plewczynski D. Online Predictor Using Machine Learning to Predict Novel Coronavirus and Other Pathogenic Viruses. ACS Omega 2022; 7:23069-23074. [PMID: 35847318 PMCID: PMC9280959 DOI: 10.1021/acsomega.2c00215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
The problem of virus classification is always a subject of concern for virology or epidemiology over the decades. In this regard, a machine learning technique can be used to predict the novel coronavirus by considering its sequence. Thus, we are proposing a machine learning-based novel coronavirus prediction technique, called COVID-Predictor, where 1000 sequences of SARS-CoV-1, MERS-CoV, SARS-CoV-2, and other viruses are used to train a Naive Bayes classifier so that it can predict any unknown sequences of these viruses. The model has been validated using 10-fold cross-validation in comparison with other machine learning techniques. The results show the superiority of our predictor by achieving an average 99.7% accuracy on an unseen validation set of viruses. The same pre-trained model has been used to design a web-based application where sequences of unknown viruses can be uploaded to predict the novel coronavirus.
Affiliation(s)
- Jnanendra Prasad Sarkar
- Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, West Bengal, India
| | - Indrajit Saha
- Department of Computer Science and Engineering, National Institute of Technical Teachers’ Training and Research, Kolkata 700106, West Bengal, India
| | - Nimisha Ghosh
- Department of Computer Science and Information Technology, Institute of Technical Education and Research, Siksha ‘O’ Anusandhan (Deemed to be University), Bhubaneswar, Odisha 751030, India
- Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw 02-097, Poland
| | - Debasree Maity
- Department of Electronics and Communication Engineering, MCKV Institute of Engineering, Howrah, West Bengal 711204, India
| | - Dariusz Plewczynski
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, 02-097 Warsaw, Poland
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, 00-927 Warsaw, Poland
|
13
|
McElhinney JMWR, Catacutan MK, Mawart A, Hasan A, Dias J. Interfacing Machine Learning and Microbial Omics: A Promising Means to Address Environmental Challenges. Front Microbiol 2022; 13:851450. [PMID: 35547145 PMCID: PMC9083327 DOI: 10.3389/fmicb.2022.851450] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/09/2022] [Accepted: 03/14/2022] [Indexed: 11/13/2022]
Abstract
Microbial communities are ubiquitous and carry an exceptionally broad metabolic capability. Upon environmental perturbation, microbes are also amongst the first natural responsive elements with perturbation-specific cues and markers. These communities are thereby uniquely positioned to inform on the status of environmental conditions. The advent of microbial omics has led to an unprecedented volume of complex microbiological data sets. Importantly, these data sets are rich in biological information with potential for predictive environmental classification and forecasting. However, the patterns in this information are often hidden amongst the inherent complexity of the data. There has been a continued rise in the development and adoption of machine learning (ML) and deep learning architectures for solving research challenges of this sort. Indeed, the interface between molecular microbial ecology and artificial intelligence (AI) appears to show considerable potential for significantly advancing environmental monitoring and management practices through their application. Here, we provide a primer for ML, highlight the notion of retaining biological sample information for supervised ML, discuss workflow considerations, and review the state of the art of the exciting, yet nascent, interdisciplinary field of ML-driven microbial ecology. Current limitations in this sphere of research are also addressed to frame a forward-looking perspective toward the realization of what we anticipate will become a pivotal toolkit for addressing environmental monitoring and management challenges in the years ahead.
Affiliation(s)
- James M. W. R. McElhinney
- Applied Genomics Laboratory, Center for Membranes and Advanced Water Technology, Khalifa University, Abu Dhabi, United Arab Emirates
| | | | - Aurelie Mawart
- Applied Genomics Laboratory, Center for Membranes and Advanced Water Technology, Khalifa University, Abu Dhabi, United Arab Emirates
| | - Ayesha Hasan
- Applied Genomics Laboratory, Center for Membranes and Advanced Water Technology, Khalifa University, Abu Dhabi, United Arab Emirates
- Department of Biomedical Engineering, Khalifa University, Abu Dhabi, United Arab Emirates
| | - Jorge Dias
- EECS, Center for Autonomous Robotic Systems, Khalifa University, Abu Dhabi, United Arab Emirates
|
14
|
WalkIm: Compact image-based encoding for high-performance classification of biological sequences using simple tuning-free CNNs. PLoS One 2022; 17:e0267106. [PMID: 35427371 PMCID: PMC9012348 DOI: 10.1371/journal.pone.0267106] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/06/2021] [Accepted: 04/01/2022] [Indexed: 11/28/2022]
Abstract
The classification of biological sequences is an open issue for a variety of data sets, such as viral and metagenomics sequences. Therefore, many studies utilize neural network tools, as the well-known methods in this field, and focus on designing customized network structures. However, few works focus on more effective factors, such as the input encoding method or implementation technology, to address accuracy and efficiency issues in this area. Therefore, in this work, we propose an image-based encoding method, called WalkIm, whose adoption, even in a simple neural network, provides competitive accuracy and superior efficiency compared to the existing classification methods (e.g. VGDC, CASTOR, and DLM-CNN) for a variety of biological sequences. When WalkIm is used to classify various data sets (i.e. virus whole-genome data, metagenomics read data, and metabarcoding data), it achieves the same performance as the existing methods, with no enforcement of parameter initialization or network architecture adjustment for each data set. It is worth noting that even when classifying high-mutation data sets, such as Coronaviruses, it achieves almost 100% accuracy for classifying its various types. In addition, WalkIm achieves high-speed convergence during network training, as well as a reduction of network complexity. Therefore, the WalkIm method enables us to run the classifying neural networks on a normal desktop system in a short time interval. Moreover, we addressed the compatibility of the WalkIm encoding method with free-space optical processing technology. Taking advantage of an optical implementation of convolutional layers, we illustrated that the training time can be reduced by up to 500 times. In addition to all the aforementioned advantages, this encoding method preserves the structure of generated images in various modes of sequence transformation, such as reverse complement, complement, and reverse modes.
|
15
|
Cacciabue M, Aguilera P, Gismondi MI, Taboga O. Covidex: An ultrafast and accurate tool for SARS-CoV-2 subtyping. INFECTION, GENETICS AND EVOLUTION : JOURNAL OF MOLECULAR EPIDEMIOLOGY AND EVOLUTIONARY GENETICS IN INFECTIOUS DISEASES 2022; 99:105261. [PMID: 35231666 PMCID: PMC8881885 DOI: 10.1016/j.meegid.2022.105261] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/17/2021] [Revised: 12/20/2021] [Accepted: 02/23/2022] [Indexed: 11/29/2022]
Abstract
The epidemiological surveillance of SARS-CoV-2 by means of whole-genome sequencing has revealed the emergence and co-existence of multiple viral lineages or subtypes throughout the world. Moreover, it has been shown that several subtypes of this virus display particular phenotypes, such as increased transmissibility or reduced susceptibility to neutralizing antibodies, leading to the denomination of Variants of Interest (VOI) or Variants of Concern (VOC). Thus, subtyping of SARS-CoV-2 is a crucial step for the surveillance of this pathogen. Here, we present Covidex, an open-source, alignment-free machine learning subtyping tool. It is a shiny web app that allows an ultra-fast and accurate classification of SARS-CoV-2 genome sequences into the three most used nomenclature systems (GISAID, Nextstrain, Pango lineages). It also categorizes input sequences as VOI or VOC, according to current definitions. The program is cross-platform compatible and it is available via Source-Forge https://sourceforge.net/projects/covidex or via the web application http://covidex.unlu.edu.ar.
Affiliation(s)
- Marco Cacciabue
- Instituto de Agrobiotecnología y Biología Molecular (IABIMO), Instituto Nacional de Tecnología Agropecuaria (INTA), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), De los Reseros y N. Repetto s/n, Hurlingham B1686IGC, Buenos Aires, Argentina; Universidad Nacional de Luján, Departamento de Ciencias Básicas, Av. Constitución y RN 5, 6700 Luján, Buenos Aires, Argentina.
| | - Pablo Aguilera
- Instituto de Agrobiotecnología y Biología Molecular (IABIMO), Instituto Nacional de Tecnología Agropecuaria (INTA), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), De los Reseros y N. Repetto s/n, Hurlingham B1686IGC, Buenos Aires, Argentina; Universidad Nacional de Luján, Departamento de Ciencias Básicas, Av. Constitución y RN 5, 6700 Luján, Buenos Aires, Argentina
| | - María Inés Gismondi
- Instituto de Agrobiotecnología y Biología Molecular (IABIMO), Instituto Nacional de Tecnología Agropecuaria (INTA), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), De los Reseros y N. Repetto s/n, Hurlingham B1686IGC, Buenos Aires, Argentina; Universidad Nacional de Luján, Departamento de Ciencias Básicas, Av. Constitución y RN 5, 6700 Luján, Buenos Aires, Argentina
| | - Oscar Taboga
- Instituto de Agrobiotecnología y Biología Molecular (IABIMO), Instituto Nacional de Tecnología Agropecuaria (INTA), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), De los Reseros y N. Repetto s/n, Hurlingham B1686IGC, Buenos Aires, Argentina
|
16
|
PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences. BIOLOGY 2022; 11:biology11030418. [PMID: 35336792 PMCID: PMC8945605 DOI: 10.3390/biology11030418] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/04/2022] [Revised: 02/24/2022] [Accepted: 03/07/2022] [Indexed: 01/14/2023]
Abstract
Simple Summary: The family of coronaviruses comprises a diverse set of strains and variants which cause diseases from the common cold to COVID-19. Moreover, they infect a wide array of hosts, from bats, camels, and birds to humans. Studying coronaviruses through the lens of host specificity provides a unique perspective for understanding the evolution, diversity and dynamics of this family. In particular, this can reveal groups of different hosts infected by similar strains, giving clues on strains which were more likely to have evolved to jump from one host to another. In this work, we frame host specificity as a classification task, designing a very compact numerical representation of the spike sequences of different coronaviruses. Based on this numerical representation, classification methods are able to detect the target host with high accuracy. Such an approach can be used to efficiently scale to large volumes of sequences, in order to unveil trends in the host specificity of different coronavirus strains.
Abstract: The study of host specificity has important connections to the question about the origin of SARS-CoV-2 in humans which led to the COVID-19 pandemic, an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is important in determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among birds, bats, camels, swine, humans, and weasels, to name a few. We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and we use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function and identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs from viral sequences to generate fixed-length feature vector representations, and use them in the context of host classification. The results on real world data show that when using PWM2Vec, machine learning classifiers are able to perform comparably to the baseline models in terms of predictive performance and runtime; in some cases, the performance is better. We also measure the importance of different amino acids using information gain to show the amino acids which are important for predicting the host of a given coronavirus. Finally, we perform some statistical analyses on these results to show that our embedding is more compact than the embeddings of the baseline models.
|
17
|
Ekpenyong ME, Adegoke AA, Edoho ME, Inyang UG, Udo IJ, Ekaidem IS, Osang F, Uto NP, Geoffery JI. Collaborative Mining of Whole Genome Sequences for Intelligent HIV-1 Sub-Strain(s) Discovery. Curr HIV Res 2022; 20:163-183. [PMID: 35142269 DOI: 10.2174/1570162x20666220210142209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2021] [Revised: 11/30/2021] [Accepted: 12/20/2021] [Indexed: 11/22/2022]
Abstract
BACKGROUND: Effective global antiretroviral vaccines and therapeutic strategies depend on the diversity, evolution, and epidemiology of their various strains as well as their transmission and pathogenesis. Most viral disease-causing particles are clustered into a taxonomy of subtypes to suggest pointers toward nucleotide-specific vaccines or therapeutic applications of clinical significance sufficient for sequence-specific diagnosis and homologous viral studies. These are very useful to formulate predictors to induce cross-resistance to some retroviral control drugs being used across study areas.
OBJECTIVE: This research proposed a collaborative framework of hybridized (Machine Learning and Natural Language Processing) techniques to discover hidden genome patterns and feature predictors for mining HIV-1 genome sequences.
METHOD: 630 human HIV-1 genome sequences above 8500 bps were retrieved from the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov) for 21 countries across different continents, excluding Antarctica. These sequences were transformed and learned using a self-organizing map (SOM). To discriminate emerging/new sub-strain(s), the HIV-1 reference genome was included as part of the input isolates/samples during the training. After training the SOM, component planes defining pattern clusters of the input datasets were generated, for cognitive knowledge mining and subsequent labelling of the datasets. Additional genome features, including dinucleotide transmission recurrences, codon recurrences, and mutation recurrences, were finally extracted from the raw genomes to construct output classification targets for supervised learning.
RESULTS: SOM training explains the inherent pattern diversity of HIV-1 genomes as well as inter- and intra-country transmissions in which mobility might play an active role, as corroborated by literature. Nine sub-strains were discovered after disassembling the SOM correlation hunting matrix space attributed to disparate clusters. Cognitive knowledge mining separated similar pattern clusters bounded by a certain degree of correlation range, discovered by the SOM. A Kruskal-Wallis rank-sum test and Wilcoxon rank-sum test showed statistically significant variations in dinucleotide, codon, and mutation patterns.
CONCLUSION: Results of the discovered sub-strains and response clusters visualizations corroborate existing literature, with significant haplotype variations. The proposed framework would assist in the development of decision support systems for easy contact tracing, infectious disease surveillance, and studying the progressive evolution of the reference HIV-1 genome.
Affiliation(s)
- Moses E Ekpenyong
- Department of Computer Science, Faculty of Science, University of Uyo, Uyo, Nigeria
- Centre for Research and Development, University of Uyo, Uyo, Nigeria
| | - Anthony A Adegoke
- Department of Microbiology, Faculty of Science, University of Uyo, Uyo, Nigeria
| | - Mercy E Edoho
- Department of Computer Science, Faculty of Science, University of Uyo, Uyo, Nigeria
| | - Udoinyang G Inyang
- Department of Computer Science, Faculty of Science, University of Uyo, Uyo, Nigeria
| | - Ifiok J Udo
- Department of Computer Science, Faculty of Science, University of Uyo, Uyo, Nigeria
| | - Itemobong S Ekaidem
- Department of Chemical Pathology, College of Health Sciences, University of Uyo, Uyo, Nigeria
| | - Francis Osang
- Department of Computer Science, Faculty of Science, National Open University, Abuja, Nigeria
| | - Nseobong P Uto
- School of Mathematics and Statistics, University of St Andrews, Scotland, United Kingdom
| | - Joseph I Geoffery
- Department of Computer Science, Faculty of Science, University of Uyo, Uyo, Nigeria
|
18
|
Millán Arias P, Alipour F, Hill KA, Kari L. DeLUCS: Deep learning for unsupervised clustering of DNA sequences. PLoS One 2022; 17:e0261531. [PMID: 35061715 PMCID: PMC8782307 DOI: 10.1371/journal.pone.0261531] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 05/03/2021] [Accepted: 12/06/2021] [Indexed: 11/25/2022]
Abstract
We present a novel Deep Learning method for the Unsupervised Clustering of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Frequency Chaos Game Representations (FCGR) of primary DNA sequences, and generates “mimic” sequence FCGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster assignment for each sequence. The clusters learned by DeLUCS match true taxonomic groups for large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genera; 3,200 randomly selected 400 kbp-long bacterial genome segments, into clusters corresponding to bacterial families; three viral genome and gene datasets, averaging 1,300 sequences each, into clusters corresponding to virus subtypes. DeLUCS significantly outperforms two classic clustering methods (K-means++ and Gaussian Mixture Models) for unlabelled data, by as much as 47%. DeLUCS is highly effective, it is able to cluster datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence clustering for previously intractable datasets.
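The Frequency Chaos Game Representation mentioned in this abstract can be sketched in a few lines. The version below counts k-mers into a 2^k by 2^k grid; the corner assignment for A, C, G, T, the row/column orientation, and the handling of ambiguous bases are assumptions made for illustration and may differ from what DeLUCS actually uses.

```python
# Corner assignment is one common CGR convention, assumed here for illustration.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k):
    """Return a 2^k x 2^k grid of k-mer counts laid out by the chaos game."""
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    x = y = 0
    valid = 0  # number of consecutive unambiguous bases seen so far
    for base in sequence.upper():
        corner = CORNERS.get(base)
        if corner is None:
            valid = 0  # an ambiguous base (e.g. N) breaks the current k-mer
            continue
        cx, cy = corner
        # Move halfway toward the corner: shift the old bits down and
        # place the new base in the most significant bit.
        x = (x >> 1) | (cx << (k - 1))
        y = (y >> 1) | (cy << (k - 1))
        valid += 1
        if valid >= k:  # the cell now depends only on the last k bases
            grid[y][x] += 1
    return grid
```

Each cell holds the number of occurrences of one k-mer, so the grid can be treated as a grey-scale image or flattened into a fixed-length feature vector regardless of the sequence length.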
Affiliation(s)
- Pablo Millán Arias
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- * E-mail: (PMA); (FA)
| | - Fatemeh Alipour
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- * E-mail: (PMA); (FA)
| | - Kathleen A. Hill
- Department of Biology, University of Western Ontario, London, ON, Canada
| | - Lila Kari
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
|
19
|
Singh OP, Vallejo M, El-Badawy IM, Aysha A, Madhanagopal J, Mohd Faudzi AA. Classification of SARS-CoV-2 and non-SARS-CoV-2 using machine learning algorithms. Comput Biol Med 2021; 136:104650. [PMID: 34329865 PMCID: PMC8294595 DOI: 10.1016/j.compbiomed.2021.104650] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 03/28/2021] [Revised: 07/08/2021] [Accepted: 07/13/2021] [Indexed: 11/28/2022]
Abstract
Due to the continued evolution of the SARS-CoV-2 pandemic, researchers worldwide are working to mitigate and suppress its spread, and to better understand it, by deploying digital signal processing (DSP) and machine learning approaches. This study presents an alignment-free approach to classify SARS-CoV-2 using complementary DNA, which is DNA synthesized from the single-stranded RNA virus. Herein, a total of 1582 samples, with different lengths of genome sequences from different regions, were collected from various data sources and divided into a SARS-CoV-2 and a non-SARS-CoV-2 group. We extracted eight biomarkers based on three-base periodicity, using DSP techniques, and ranked them with a filter-based feature selection method. The ranked biomarkers were fed into k-nearest neighbor, support vector machine, decision tree, and random forest classifiers for the classification of SARS-CoV-2 from other coronaviruses. The training dataset was used to test the performance of the classifiers based on accuracy and F-measure via 10-fold cross-validation. Kappa scores were estimated to check the influence of unbalanced data. Further, a 10 × 10 cross-validation paired t-test was utilized to test the best model with unseen data. Random forest was selected as the best model, differentiating the SARS-CoV-2 coronavirus from other coronaviruses and a control group with an accuracy of 97.4 %, sensitivity of 96.2 %, and specificity of 98.2 %, when tested with unseen samples. Moreover, the proposed algorithm was computationally efficient, taking only 0.31 s to compute the genome biomarkers, outperforming previous studies.
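Three-base periodicity is usually measured as spectral power at a frequency of one cycle per three bases. The sketch below computes that single quantity from binary indicator sequences for each nucleotide; it is not a reconstruction of the eight biomarkers used in the study, which the abstract does not define, and the function name `period3_power` and the normalisation by sequence length are assumptions.

```python
import cmath


def period3_power(sequence):
    """Spectral power at frequency 1/3 cycles per base, summed over A, C, G, T.

    Each nucleotide is turned into a 0/1 indicator sequence and a single
    DFT coefficient is evaluated at frequency 1/3. Dividing by the sequence
    length is an assumed normalisation.
    """
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, b in enumerate(seq)
            if b == base
        )
        total += abs(coeff) ** 2
    return total / n
```

A sequence with a strict codon-like repeat such as `"ATG" * 10` gives a large value, while a sequence with period four such as `"ACGT" * 3` gives a value close to zero, which is why this kind of measure is used to characterise coding regions.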
Affiliation(s)
| | - Marta Vallejo
- School of Engineering & Physical Sciences, Heriot-Watt University, Edinburgh, UK
| | - Ismail M El-Badawy
- Electronics and Communications Engineering Department, Arab Academy for Science and Technology, Cairo, Egypt
| | - Ali Aysha
- School of Chemistry, University of Edinburgh, Edinburgh, UK
| | - Jagannathan Madhanagopal
- School of Physiotherapy, Faculty of Allied Health Professional, AIMST University, Semeling Campus, Bedong, Kedah, Malaysia
|
20
|
Analysis of DNA Sequence Classification Using CNN and Hybrid Models. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:1835056. [PMID: 34306171 PMCID: PMC8285202 DOI: 10.1155/2021/1835056] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 05/24/2021] [Accepted: 06/25/2021] [Indexed: 12/23/2022]
Abstract
In a general computational context for biomedical data analysis, DNA sequence classification is a crucial challenge. Several machine learning techniques have been used successfully to complete this task in recent years. Identification and classification of viruses are essential to avoid an outbreak like COVID-19; they also help in detecting the effect of viruses and in drug design. Regardless, the feature selection process remains the most challenging aspect of the issue. The most commonly used representations worsen the problem of high dimensionality, and sequences lack explicit features. Recently, deep learning (DL) models have made it possible to extract features automatically from the input. In this work, we employed CNN, CNN-LSTM, and CNN-Bidirectional LSTM architectures using Label and K-mer encoding for DNA sequence classification. The models are evaluated on different classification metrics. From the experimental results, the CNN and CNN-Bidirectional LSTM with K-mer encoding offer high accuracy of 93.16% and 93.13%, respectively, on testing data.
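The two encodings named in this abstract can be illustrated with a minimal sketch. The integer mapping in `LABELS`, the choice to skip ambiguous bases, and the fixed-vocabulary count vector are assumptions made for illustration; the abstract does not say how the authors turned k-mers into network inputs.

```python
from collections import Counter
from itertools import product

# Hypothetical integer codes for label encoding.
LABELS = {"A": 0, "C": 1, "G": 2, "T": 3}


def label_encode(sequence):
    """Map each base to an integer; ambiguous bases are skipped (an assumption)."""
    return [LABELS[b] for b in sequence.upper() if b in LABELS]


def kmers(sequence, k):
    """Split a sequence into overlapping words of length k."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_count_vector(sequence, k):
    """Fixed-length vector of k-mer counts over all 4^k possible words."""
    counts = Counter(kmers(sequence, k))
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[w] for w in vocab]
```

Label encoding keeps the sequence length and order, whereas the k-mer view turns a sequence into a list of short "words" that can be counted or passed to an embedding layer; the count vector grows as 4^k, which is one source of the dimensionality problem the abstract refers to.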
|
21
|
Tang R, Yu Z, Ma Y, Wu Y, Phoebe Chen YP, Wong L, Li J. Genetic source completeness of HIV-1 circulating recombinant forms (CRFs) predicted by multi-label learning. Bioinformatics 2021; 37:750-758. [PMID: 33063094 DOI: 10.1093/bioinformatics/btaa887] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/27/2020] [Revised: 08/12/2020] [Accepted: 09/30/2020] [Indexed: 12/18/2022]
Abstract
MOTIVATION Infection with strains of different subtypes and the subsequent crossover reading between the two strands of genomic RNAs by host cells' reverse transcriptase are the main causes of the vast HIV-1 sequence diversity. Such inter-subtype genomic recombinants can become circulating recombinant forms (CRFs) after widespread transmissions in a population. Complete prediction of all the subtype sources of a CRF strain is a complicated machine learning problem. It is also difficult to understand whether a strain is an emerging new subtype and if so, how to accurately identify the new components of the genetic source. RESULTS We introduce a multi-label learning algorithm for the complete prediction of multiple sources of a CRF sequence as well as the prediction of its chronological number. The prediction is strengthened by a voting of various multi-label learning methods to avoid biased decisions. In our steps, frequency and position features of the sequences are both extracted to capture signature patterns of pure subtypes and CRFs. The method was applied to 7185 HIV-1 sequences, comprising 5530 pure subtype sequences and 1655 CRF sequences. Results have demonstrated that the method can achieve very high accuracy (reaching 99%) in the prediction of the complete set of labels of HIV-1 recombinant forms. A few wrong predictions are actually incomplete predictions, very close to the complete set of genuine labels. AVAILABILITY AND IMPLEMENTATION https://github.com/Runbin-tang/The-source-of-HIV-CRFs-prediction. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Runbin Tang
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan 411105, China.,Advanced Analytics Institute, University of Technology Sydney, Sydney, NSW 2007, Australia
| | - Zuguo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan 411105, China.,School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, QLD 4001, Australia
| | - Yuanlin Ma
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan 411105, China
| | - Yaoqun Wu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan 411105, China
| | - Yi-Ping Phoebe Chen
- Department of Computer Science and Information Technology, La Trobe University, Melbourne, VIC 3086, Australia
| | - Limsoon Wong
- School of Computing, National University of Singapore, Singapore 117417, Singapore
| | - Jinyan Li
- Advanced Analytics Institute, University of Technology Sydney, Sydney, NSW 2007, Australia
|
22
|
Hufsky F, Lamkiewicz K, Almeida A, Aouacheria A, Arighi C, Bateman A, Baumbach J, Beerenwinkel N, Brandt C, Cacciabue M, Chuguransky S, Drechsel O, Finn RD, Fritz A, Fuchs S, Hattab G, Hauschild AC, Heider D, Hoffmann M, Hölzer M, Hoops S, Kaderali L, Kalvari I, von Kleist M, Kmiecinski R, Kühnert D, Lasso G, Libin P, List M, Löchel HF, Martin MJ, Martin R, Matschinske J, McHardy AC, Mendes P, Mistry J, Navratil V, Nawrocki EP, O’Toole ÁN, Ontiveros-Palacios N, Petrov AI, Rangel-Pineros G, Redaschi N, Reimering S, Reinert K, Reyes A, Richardson L, Robertson DL, Sadegh S, Singer JB, Theys K, Upton C, Welzel M, Williams L, Marz M. Computational strategies to combat COVID-19: useful tools to accelerate SARS-CoV-2 and coronavirus research. Brief Bioinform 2021; 22:642-663. [PMID: 33147627 PMCID: PMC7665365 DOI: 10.1093/bib/bbaa232] [Citation(s) in RCA: 78] [Impact Index Per Article: 26.0] [Received: 05/27/2020] [Revised: 07/28/2020] [Accepted: 08/26/2020] [Indexed: 12/16/2022]
Abstract
SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to get insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@unj-jena.de.
Affiliation(s)
| | | | | | | | | | | | | | | | - Christian Brandt
- Institute of Infectious Disease and Infection Control at Jena University Hospital, Germany
| | - Marco Cacciabue
- Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) working on FMDV virology at the Instituto de Agrobiotecnología y Biología Molecular (IABiMo, INTA-CONICET) and at the Departamento de Ciencias Básicas, Universidad Nacional de Luján (UNLu), Argentina
| | | | - Oliver Drechsel
- bioinformatics department at the Robert Koch-Institute, Germany
| | | | - Adrian Fritz
- Computational Biology of Infection Research group of Alice C. McHardy at the Helmholtz Centre for Infection Research, Germany
| | - Stephan Fuchs
- bioinformatics department at the Robert Koch-Institute, Germany
| | - Georges Hattab
- Bioinformatics Division at Philipps-University Marburg, Germany
| | | | - Dominik Heider
- Data Science in Biomedicine at the Philipps-University of Marburg, Germany
| | | | | | - Stefan Hoops
- Biocomplexity Institute and Initiative at the University of Virginia, USA
| | - Lars Kaderali
- Bioinformatics and head of the Institute of Bioinformatics at University Medicine Greifswald, Germany
| | | | - Max von Kleist
- bioinformatics department at the Robert Koch-Institute, Germany
| | - René Kmiecinski
- bioinformatics department at the Robert Koch-Institute, Germany
| | | | - Gorka Lasso
- Chandran Lab, Albert Einstein College of Medicine, USA
| | | | | | | | | | | | | | - Alice C McHardy
- Computational Biology of Infection Research Lab at the Helmholtz Centre for Infection Research in Braunschweig, Germany
| | - Pedro Mendes
- Center for Quantitative Medicine of the University of Connecticut School of Medicine, USA
| | | | - Vincent Navratil
- Bioinformatics and Systems Biology at the Rhône Alpes Bioinformatics core facility, Universitó de Lyon, France
| | | | | | | | | | | | - Nicole Redaschi
- Development of the Swiss-Prot group at the SIB for UniProt and SIB resources that cover viral biology (ViralZone)
| | - Susanne Reimering
- Computational Biology of Infection Research group of Alice C. McHardy at the Helmholtz Centre for Infection Research
| | | | | | | | | | - Sepideh Sadegh
- Chair of Experimental Bioinformatics at Technical University of Munich, Germany
| | - Joshua B Singer
- MRC-University of Glasgow Centre for Virus Research, Glasgow, Scotland, UK
| | | | - Chris Upton
- Department of Biochemistry and Microbiology, University of Victoria, Canada
| | | | | | - Manja Marz
- Friedrich Schiller University Jena, Germany
| |
Collapse
|
23
|
Auslander N, Gussow AB, Koonin EV. Incorporating Machine Learning into Established Bioinformatics Frameworks. Int J Mol Sci 2021; 22:2903. [PMID: 33809353 PMCID: PMC8000113 DOI: 10.3390/ijms22062903] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2021] [Revised: 03/08/2021] [Accepted: 03/10/2021] [Indexed: 12/23/2022] Open
Abstract
The exponential growth of biomedical data in recent years has urged the application of numerous machine learning techniques to address emerging problems in biology and clinical research. By enabling the automatic feature extraction, selection, and generation of predictive models, these methods can be used to efficiently study complex biological systems. Machine learning techniques are frequently integrated with bioinformatic methods, as well as curated databases and biological networks, to enhance training and validation, identify the best interpretable features, and enable feature and model investigation. Here, we review recently developed methods that incorporate machine learning within the same framework with techniques from molecular evolution, protein structure analysis, systems biology, and disease genomics. We outline the challenges posed for machine learning, and, in particular, deep learning in biomedicine, and suggest unique opportunities for machine learning techniques integrated with established bioinformatics approaches to overcome some of these challenges.
Affiliation(s)
- Eugene V. Koonin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
24
|
Saha I, Ghosh N, Maity D, Seal A, Plewczynski D. COVID-DeepPredictor: Recurrent Neural Network to Predict SARS-CoV-2 and Other Pathogenic Viruses. Front Genet 2021; 12:569120. [PMID: 33643375 PMCID: PMC7906283 DOI: 10.3389/fgene.2021.569120] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2020] [Accepted: 01/13/2021] [Indexed: 11/13/2022] Open
Abstract
The COVID-19 disease for Novel coronavirus (SARS-CoV-2) has turned out to be a global pandemic. The high transmission rate of this pathogenic virus demands an early prediction and proper identification for the subsequent treatment. However, the polymorphic nature of this virus allows it to adapt and sustain in different kinds of environment, which makes it difficult to predict. On the other hand, there are other pathogens like SARS-CoV-1, MERS-CoV, Ebola, Dengue, and Influenza as well, so that a predictor is highly required to distinguish them with the use of their genomic information. To mitigate this problem, in this work COVID-DeepPredictor is proposed on the framework of deep learning to identify an unknown sequence of these pathogens. COVID-DeepPredictor uses Long Short Term Memory as Recurrent Neural Network for the underlying prediction with an alignment-free technique. In this regard, k-mer technique is applied to create Bag-of-Descriptors (BoDs) in order to generate Bag-of-Unique-Descriptors (BoUDs) as vocabulary and subsequently embedded representation is prepared for the given virus sequences. This predictor is not only validated for the dataset using K-fold cross-validation but also for unseen test datasets of SARS-CoV-2 sequences and sequences from other viruses as well. To verify the efficacy of COVID-DeepPredictor, it has been compared with other state-of-the-art prediction techniques based on Linear Discriminant Analysis, Random Forests, and Gradient Boosting Method. COVID-DeepPredictor achieves 100% prediction accuracy on validation dataset while on test datasets, the accuracy ranges from 99.51 to 99.94%. It shows superior results over other prediction techniques as well.
In addition to this, accuracy and runtime of COVID-DeepPredictor are considered simultaneously to determine the value of k in k-mer, a comparative study among k values in k-mer, Bag-of-Descriptors (BoDs), and Bag-of-Unique-Descriptors (BoUDs) and a comparison between COVID-DeepPredictor and Nucleotide BLAST have also been performed. The code, training, and test datasets used for COVID-DeepPredictor are available at http://www.nitttrkol.ac.in/indrajit/projects/COVID-DeepPredictor/.
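The k-mer vocabulary step described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the COVID-DeepPredictor code: the function names (`kmers`, `build_vocabulary`, `encode`), the toy sequences, the choice of k = 3, and the use of id 0 for unseen k-mers are all assumptions made for the example.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order (the 'bag of descriptors')."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocabulary(sequences, k):
    """Map each unique k-mer in the corpus to an integer id.

    Id 0 is reserved for k-mers never seen during vocabulary building
    (an assumption of this sketch).
    """
    unique = sorted({word for s in sequences for word in kmers(s, k)})
    return {word: idx + 1 for idx, word in enumerate(unique)}


def encode(seq, vocab, k):
    """Turn a sequence into a list of integer ids, ready for an embedding layer."""
    return [vocab.get(word, 0) for word in kmers(seq, k)]


# Hypothetical toy corpus, for illustration only.
corpus = ["ATGCGA", "ATGTTA"]
vocab = build_vocabulary(corpus, k=3)
print(encode("ATGCGA", vocab, k=3))
```

The integer lists produced by `encode` are the kind of input an embedding layer followed by an LSTM would typically consume; padding to a common length is left out here.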
Affiliation(s)
- Indrajit Saha
  - Department of Computer Science and Engineering, National Institute of Technical Teachers' Training and Research, Kolkata, India
- Nimisha Ghosh
  - Department of Computer Science and Information Technology, Institute of Technical Education and Research, Siksha ‘O’ Anusandhan (Deemed to Be University), Bhubaneswar, India
- Debasree Maity
  - Department of Electronics and Communication Engineering, MCKV Institute of Engineering, Howrah, India
- Arijit Seal
  - Cognizant Technology Solutions Pvt. Ltd., Kolkata, India
- Dariusz Plewczynski
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
  - Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Warsaw, Poland
|
25
|
Kapaata A, Balinda SN, Xu R, Salazar MG, Herard K, Brooks K, Laban K, Hare J, Dilernia D, Kamali A, Ruzagira E, Mukasa F, Gilmour J, Salazar-Gonzalez JF, Yue L, Cotten M, Hunter E, Kaleebu P. HIV-1 Gag-Pol Sequences from Ugandan Early Infections Reveal Sequence Variants Associated with Elevated Replication Capacity. Viruses 2021; 13:v13020171. [PMID: 33498793 PMCID: PMC7912664 DOI: 10.3390/v13020171] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2020] [Revised: 01/04/2021] [Accepted: 01/06/2021] [Indexed: 01/05/2023] Open
Abstract
The ability to efficiently establish a new infection is a critical property for human immunodeficiency virus type 1 (HIV-1). Although the envelope protein of the virus plays an essential role in receptor binding and internalization of the infecting virus, the structural proteins, the polymerase and the assembly of new virions may also play a role in establishing and spreading viral infection in a new host. We examined Ugandan viruses from newly infected patients and focused on the contribution of the Gag-Pol genes to replication capacity. A panel of Gag-Pol sequences generated using single genome amplification from incident HIV-1 infections were cloned into a common HIV-1 NL4.3 pol/env backbone and the influence of Gag-Pol changes on replication capacity was monitored. Using a novel protein domain approach, we then documented diversity in the functional protein domains across the Gag-Pol region and identified differences in the Gag-p6 domain that were frequently associated with higher in vitro replication.
Affiliation(s)
- Anne Kapaata
  - Medical Research Council, UVRI & LSHTM Uganda Research Unit, Plot 51–59, Entebbe, Uganda
- Sheila N. Balinda
  - Medical Research Council, UVRI & LSHTM Uganda Research Unit, Plot 51–59, Entebbe, Uganda
- Rui Xu
  - Emory University, Atlanta, GA 30322, USA
- Maria G. Salazar
  - Medical Research Council, UVRI & LSHTM Uganda Research Unit, Plot 51–59, Entebbe, Uganda
- Kimberly Herard
  - Emory University, Atlanta, GA 30322, USA
- Kelsie Brooks
  - Emory University, Atlanta, GA 30322, USA
- Kato Laban
  - Medical Research Council, UVRI & LSHTM Uganda Research Unit, Plot 51–59, Entebbe, Uganda
- Jonathan Hare
  - Imperial College London, London SW7 2AZ, UK
  - International AIDS Vaccine Initiative (IAVI), New York, NY 10004, USA
- Dario Dilernia
  - Emory University, Atlanta, GA 30322, USA
- Eugene Ruzagira
  - Medical Research Council, UVRI & LSHTM Uganda Research Unit, Plot 51–59, Entebbe, Uganda
- Freddie Mukasa
  - Medical Research Council, UVRI & LSHTM Uganda Research Unit, Plot 51–59, Entebbe, Uganda
- Jill Gilmour
  - Imperial College London, London SW7 2AZ, UK
  - International AIDS Vaccine Initiative (IAVI), New York, NY 10004, USA
- Jesus F. Salazar-Gonzalez
  - Medical Research Council, UVRI & LSHTM Uganda Research Unit, Plot 51–59, Entebbe, Uganda
- Ling Yue
  - Emory University, Atlanta, GA 30322, USA
- Matthew Cotten
  - Medical Research Council, UVRI & LSHTM Uganda Research Unit, Plot 51–59, Entebbe, Uganda
  - Centre for Virus Research, MRC-University of Glasgow, Glasgow G61 1QH, UK
  - Correspondence: ; Tel.: +25-6701-509-685
- Eric Hunter
  - Emory University, Atlanta, GA 30322, USA
- Pontiano Kaleebu
  - Medical Research Council, UVRI & LSHTM Uganda Research Unit, Plot 51–59, Entebbe, Uganda
|
26
|
Sarkar JP, Saha I, Seal A, Maity D, Maulik U. Topological Analysis for Sequence Variability: Case Study on more than 2K SARS-CoV-2 sequences of COVID-19 infected 54 countries in comparison with SARS-CoV-1 and MERS-CoV. INFECTION GENETICS AND EVOLUTION 2021; 88:104708. [PMID: 33421654 PMCID: PMC7787073 DOI: 10.1016/j.meegid.2021.104708] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/28/2020] [Revised: 10/27/2020] [Accepted: 12/31/2020] [Indexed: 12/11/2022]
Abstract
The pandemic due to novel coronavirus, SARS-CoV-2 is a serious global concern now. More than thousand new COVID-19 infections are getting reported daily for this virus across the globe. Thus, the medical research communities are trying to find the remedy to restrict the spreading of this virus, while the vaccine development work is still under research in parallel. In such critical situation, not only the medical research community, but also the scientists in different fields like microbiology, pharmacy, bioinformatics and data science are also sharing effort to accelerate the process of vaccine development, virus prediction, forecasting the transmissible probability and reproduction cases of virus for social awareness. With the similar context, in this article, we have studied sequence variability of the virus primarily focusing on three aspects: (a) sequence variability among SARS-CoV-1, MERS-CoV and SARS-CoV-2 in human host, which are in the same coronavirus family, (b) sequence variability of SARS-CoV-2 in human host for 54 different countries and (c) sequence variability between coronavirus family and country specific SARS-CoV-2 sequences in human host. For this purpose, as a case study, we have performed topological analysis of 2391 global genomic sequences of SARS-CoV-2 in association with SARS-CoV-1 and MERS-CoV using an integrated semi-alignment based computational technique. The results of the semi-alignment based technique are experimentally and statistically found similar to alignment based technique and computationally faster. Moreover, the outcome of this analysis can help to identify the nations with homogeneous SARS-CoV-2 sequences, so that same vaccine can be applied to their heterogeneous human population.
Affiliation(s)
- Jnanendra Prasad Sarkar
  - Larsen & Toubro Infotech Ltd., Pune, Maharashtra, India; Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
- Indrajit Saha
  - Department of Computer Science and Engineering, National Institute of Technical Teachers' Training & Research, Kolkata, West Bengal, India
- Arijit Seal
  - Cognizant Technology Solutions, Kolkata, West Bengal, India
- Debasree Maity
  - Department of Electronics and Communication Engineering, MCKV Institute of Engineering, Howrah, West Bengal, India
- Ujjwal Maulik
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
|
27
|
Abstract
K-mer based comparisons have emerged as powerful complements to BLAST-like alignment algorithms, particularly when the sequences being compared lack direct evolutionary relationships. In this chapter, we describe methods to compare k-mer content between groups of long noncoding RNAs (lncRNAs), to identify communities of lncRNAs with related k-mer contents, to identify the enrichment of protein-binding motifs in lncRNAs, and to scan for domains of related k-mer contents in lncRNAs. Our step-by-step instructions are complemented by Python code deposited in Github. Though our chapter focuses on lncRNAs, the methods we describe could be applied to any set of nucleic acid sequences.
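Comparing k-mer content between two sequences, as this abstract describes, can be illustrated with a small sketch. It is not the chapter's deposited code: the helper names `kmer_profile` and `pearson`, the toy sequences, and the use of plain Pearson correlation on raw frequencies (without any length normalisation or standardisation the chapter may apply) are assumptions for illustration.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k):
    """Frequency of every possible k-mer over ACGT, in a fixed order.

    RNA input is accepted by mapping U to T; windows containing other
    symbols (e.g. N) are skipped.
    """
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    seq = seq.upper().replace("U", "T")
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in words]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical toy sequences, for illustration only.
profile_a = kmer_profile("ACGUACGUACGU", k=2)
profile_b = kmer_profile("AAAAUUUUAAAA", k=2)
print(round(pearson(profile_a, profile_b), 3))
```

A matrix of such pairwise correlations is one possible starting point for grouping sequences into communities with related k-mer content.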
Affiliation(s)
- Jessime M Kirk
  - Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Invitae Corporation, San Francisco, CA, USA
- Daniel Sprague
  - Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Curriculum in Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Flagship Pioneering, Boston, MA, USA
- J Mauro Calabrese
  - Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Curriculum in Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
|
28
|
MirLocPredictor: A ConvNet-Based Multi-Label MicroRNA Subcellular Localization Predictor by Incorporating k-Mer Positional Information. Genes (Basel) 2020; 11:genes11121475. [PMID: 33316943 PMCID: PMC7763197 DOI: 10.3390/genes11121475] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2020] [Revised: 11/23/2020] [Accepted: 11/25/2020] [Indexed: 02/06/2023] Open
Abstract
MicroRNAs (miRNA) are small noncoding RNA sequences consisting of about 22 nucleotides that are involved in the regulation of almost 60% of mammalian genes. Presently, there are very limited approaches for the visualization of miRNA locations present inside cells to support the elucidation of pathways and mechanisms behind miRNA function, transport, and biogenesis. MIRLocator, a state-of-the-art tool for the prediction of subcellular localization of miRNAs makes use of a sequence-to-sequence model along with pretrained k-mer embeddings. Existing pretrained k-mer embedding generation methodologies focus on the extraction of semantics of k-mers. However, in RNA sequences, positional information of nucleotides is more important because distinct positions of the four nucleotides define the function of an RNA molecule. Considering the importance of the nucleotide position, we propose a novel approach (kmerPR2vec) which is a fusion of positional information of k-mers with randomly initialized neural k-mer embeddings. In contrast to existing k-mer-based representation, the proposed kmerPR2vec representation is much more rich in terms of semantic information and has more discriminative power. Using novel kmerPR2vec representation, we further present an end-to-end system (MirLocPredictor) which couples the discriminative power of kmerPR2vec with Convolutional Neural Networks (CNNs) for miRNA subcellular location prediction. The effectiveness of the proposed kmerPR2vec approach is evaluated with deep learning-based topologies (i.e., Convolutional Neural Networks (CNN) and Recurrent Neural Network (RNN)) and by using 9 different evaluation measures. Analysis of the results reveals that MirLocPredictor outperform state-of-the-art methods with a significant margin of 18% and 19% in terms of precision and recall.
|
29
|
Sengupta DC, Hill MD, Benton KR, Banerjee HN. Similarity Studies of Corona Viruses through Chaos Game Representation. COMPUTATIONAL MOLECULAR BIOSCIENCE 2020; 10:61-72. [PMID: 32953249 PMCID: PMC7497811 DOI: 10.4236/cmb.2020.103004] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
Abstract
The novel coronavirus (SARS-CoV-2), generally referred to as the COVID-19 virus, has spread to 213 countries with nearly 7 million confirmed cases and nearly 400,000 deaths. Such major outbreaks demand classification and origin of the virus genomic sequence, for planning, containment, and treatment. Motivated by this need, we report two alignment-free methods combined with chaos game representation (CGR) to perform clustering analysis and create a phylogenetic tree based on it. To each DNA sequence we associate a matrix, then define the distance between two DNA sequences to be the distance between their associated matrices. These methods are being used for phylogenetic analysis of coronavirus sequences. Our approach provides a powerful tool for analyzing and annotating genomes and their phylogenetic relationships. We also compare our tool to the ClustalX algorithm, which is one of the most popular alignment methods. Our alignment-free methods are shown to be capable of finding the closest genetic relatives of coronaviruses.
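The idea of associating a matrix with each sequence and comparing matrices can be sketched with a basic chaos game representation. The abstract does not specify the two methods, so everything below is an assumption for illustration: the corner assignment, the grid resolution, the normalisation to frequencies, the Euclidean matrix distance, and the names `cgr_matrix` and `matrix_distance`.

```python
from math import sqrt

# One conventional corner assignment for the unit square (an assumption here).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_matrix(seq, resolution=3):
    """Build a 2^resolution x 2^resolution frequency matrix from the CGR of seq.

    Starting at the centre of the unit square, each nucleotide moves the
    current point halfway towards that nucleotide's corner; the visited
    points are binned into grid cells and normalised to sum to 1.
    """
    size = 2 ** resolution
    grid = [[0.0] * size for _ in range(size)]
    x, y = 0.5, 0.5
    visited = 0
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        grid[int(y * size)][int(x * size)] += 1
        visited += 1
    if visited:
        grid = [[v / visited for v in row] for row in grid]
    return grid


def matrix_distance(a, b):
    """Euclidean (Frobenius) distance between two equally sized matrices."""
    return sqrt(sum((u - v) ** 2 for ra, rb in zip(a, b) for u, v in zip(ra, rb)))


# Hypothetical toy sequences, for illustration only.
m1 = cgr_matrix("ACGTACGTTTGA")
m2 = cgr_matrix("ACGTACGTTTGC")
print(round(matrix_distance(m1, m2), 4))
```

A table of such pairwise distances could then be passed to any standard clustering or tree-building routine.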
Affiliation(s)
- Dipendra C Sengupta
  - Department of Mathematics, Computer Science & Engineering Technology, Elizabeth City State University, Elizabeth City, North Carolina, USA
- Matthew D Hill
  - Department of Mathematics, Computer Science & Engineering Technology, Elizabeth City State University, Elizabeth City, North Carolina, USA
- Kevin R Benton
  - Department of Mathematics, Computer Science & Engineering Technology, Elizabeth City State University, Elizabeth City, North Carolina, USA
- Hirendra N Banerjee
  - Department of Natural Sciences, Elizabeth City State University, Elizabeth City, North Carolina, USA
|
30
|
Nugent CM, Adamowicz SJ. Alignment-free classification of COI DNA barcode data with the Python package Alfie. METABARCODING AND METAGENOMICS 2020. [DOI: 10.3897/mbmg.4.55815] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open
Abstract
Characterization of biodiversity from environmental DNA samples and bulk metabarcoding data is hampered by off-target sequences that can confound conclusions about a taxonomic group of interest. Existing methods for isolation of target sequences rely on alignment to existing reference barcodes, but this can bias results against novel genetic variants. Effectively parsing targeted DNA barcode data from off-target noise improves the quality of biodiversity estimates and biological conclusions by limiting subsequent analyses to a relevant subset of available data. Here, we present Alfie, a Python package for the alignment-free classification of cytochrome c oxidase subunit I (COI) DNA barcode sequences to taxonomic kingdoms. The package determines k-mer frequencies of DNA sequences, and the frequencies serve as input for a neural network classifier that was trained and tested using ~58,000 publicly available COI sequences. The classifier was designed and optimized through a series of tests that allowed for the optimal set of DNA k-mer features and optimal machine learning algorithm to be selected. The neural network classifier rapidly assigns COI sequences of varying lengths to kingdoms with greater than 99% accuracy and is shown to generalize effectively and make accurate predictions about data from previously unseen taxonomic classes. The package contains an application programming interface that allows the Alfie package’s functionality to be extended to different DNA sequence classification tasks to suit a user’s need, including classification of different genes and barcodes, and classification to different taxonomic levels. Alfie is free and publicly available through GitHub (https://github.com/CNuge/alfie) and the Python package index (https://pypi.org/project/alfie/).
|
31
|
Randhawa GS, Soltysiak MPM, El Roz H, de Souza CPE, Hill KA, Kari L. Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study. PLoS One 2020; 15:e0232391. [PMID: 32330208 PMCID: PMC7182198 DOI: 10.1371/journal.pone.0232391] [Citation(s) in RCA: 195] [Impact Index Per Article: 48.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2020] [Accepted: 04/14/2020] [Indexed: 12/24/2022] Open
Abstract
The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such major viral outbreaks demand early elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 virus genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 virus genomes. The proposed method combines supervised machine learning with digital signal processing (MLDSP) for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman's rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp, including the 29 COVID-19 virus sequences available on January 27, 2020. Our results support a hypothesis of a bat origin and classify the COVID-19 virus as Sarbecovirus, within Betacoronavirus. Our method achieves 100% accurate classification of the COVID-19 virus sequences, and discovers the most relevant relationships among over 5000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
Affiliation(s)
- Gurjit S. Randhawa
  - Department of Computer Science, The University of Western Ontario, London, ON, Canada
- Hadi El Roz
  - Department of Biology, The University of Western Ontario, London, ON, Canada
- Camila P. E. de Souza
  - Department of Statistical and Actuarial Sciences, The University of Western Ontario, London, ON, Canada
- Kathleen A. Hill
  - Department of Biology, The University of Western Ontario, London, ON, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
|
32
|
Phylogenetic Analysis of HIV-1 Genomes Based on the Position-Weighted K-mers Method. ENTROPY 2020; 22:e22020255. [PMID: 33286029 PMCID: PMC7516702 DOI: 10.3390/e22020255] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/17/2020] [Revised: 02/07/2020] [Accepted: 02/20/2020] [Indexed: 12/31/2022]
Abstract
HIV-1 viruses, which are predominant in the family of HIV viruses, have strong pathogenicity and infectivity. They can evolve into many different variants in a very short time. In this study, we propose a new and effective alignment-free method for the phylogenetic analysis of HIV-1 viruses using complete genome sequences. Our method combines the position distribution information and the counts of the k-mers together. We also propose a metric to determine the optimal k value. We name our method the Position-Weighted k-mers (PWkmer) method. Validation and comparison with the Robinson-Foulds distance method and the modified bootstrap method on a benchmark dataset show that our method is reliable for the phylogenetic analysis of HIV-1 viruses. PWkmer can resolve within-group variations for different known subtypes of Group M of HIV-1 viruses. This method is simple and computationally fast for whole genome phylogenetic analysis.
|
33
|
Borrayo E, May-Canche I, Paredes O, Morales JA, Romo-Vázquez R, Vélez-Pérez H. Whole-Genome k-mer Topic Modeling Associates Bacterial Families. Genes (Basel) 2020; 11:genes11020197. [PMID: 32075081 PMCID: PMC7074292 DOI: 10.3390/genes11020197] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2020] [Revised: 02/07/2020] [Accepted: 02/09/2020] [Indexed: 11/16/2022] Open
Abstract
Alignment-free k-mer-based algorithms in whole genome sequence comparisons remain an ongoing challenge. Here, we explore the possibility to use Topic Modeling for organism whole-genome comparisons. We analyzed 30 complete genomes from three bacterial families by topic modeling. For this, each genome was considered as a document and 13-mer nucleotide representations as words. Latent Dirichlet allocation was used as the probabilistic modeling of the corpus. We were able to identify the topic distribution among analyzed genomes, which is highly consistent with traditional hierarchical classification. It is possible that topic modeling may be applied to establish relationships between genome's composition and biological phenomena.
Affiliation(s)
- Ernesto Borrayo
  - Electronics Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
- Isaias May-Canche
  - Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
  - Instituto Tecnológico de Chetumal, Quintana Roo 77000, Mexico
- Omar Paredes
  - Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
- J. Alejandro Morales
  - Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
- Rebeca Romo-Vázquez
  - Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
- Hugo Vélez-Pérez
  - Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico
  - Correspondence:
|
34
|
Dlamini GS, Muller SJ, Meraba RL, Young RA, Mashiyane J, Chiwewe T, Mapiye DS. Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach. IEEE ACCESS : PRACTICAL INNOVATIONS, OPEN SOLUTIONS 2020; 8:195263-195273. [PMID: 34976561 PMCID: PMC8675546 DOI: 10.1109/access.2020.3031387] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/02/2020] [Accepted: 10/04/2020] [Indexed: 05/08/2023]
Abstract
The world is grappling with the COVID-19 pandemic caused by the 2019 novel SARS-CoV-2. To better understand this novel virus and its relationship with other pathogens, new methods for analyzing the genome are required. In this study, intrinsic dinucleotide genomic signatures were analyzed for whole genome sequence data of eight pathogenic species, including SARS-CoV-2. The genome sequences were transformed into dinucleotide relative frequencies and classified using the extreme gradient boosting (XGBoost) model. The classification models were trained to a) distinguish between the sequences of all eight species and b) distinguish between sequences of SARS-CoV-2 that originate from different geographic regions. Our method attained 100% in all performance metrics and for all tasks in the eight-species classification problem. Moreover, the models achieved 67% balanced accuracy for the task of classifying the SARS-CoV-2 sequences into the six continental regions and achieved 86% balanced accuracy for the task of classifying SARS-CoV-2 samples as either originating from Asia or not. Analysis of the dinucleotide genomic profiles of the eight species revealed a similarity between the SARS-CoV-2 and MERS-CoV viral sequences. Further analysis of SARS-CoV-2 viral sequences from the six continents revealed that samples from Oceania had the highest frequency of TT dinucleotides as well as the lowest CG frequency compared to the other continents. The dinucleotide signatures of AC, AG, CA, CT, GA, GT, TC, and TG were well conserved across most genomes, while the frequencies of other dinucleotide signatures varied considerably. Altogether, the results from this study demonstrate the utility of dinucleotide relative frequencies for discriminating and identifying similar species.
|
35
|
Randhawa GS, Hill KA, Kari L. MLDSP-GUI: an alignment-free standalone tool with an interactive graphical user interface for DNA sequence comparison and analysis. Bioinformatics 2019; 36:2258-2259. [DOI: 10.1093/bioinformatics/btz918] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2019] [Revised: 11/22/2019] [Accepted: 12/11/2019] [Indexed: 11/14/2022] Open
Abstract
Summary
Machine Learning with Digital Signal Processing and Graphical User Interface (MLDSP-GUI) is an open-source, alignment-free, ultrafast, computationally lightweight, and standalone software tool with an interactive GUI for comparison and analysis of DNA sequences. MLDSP-GUI is a general-purpose tool that can be used for a variety of applications such as taxonomic classification, disease classification, virus subtype classification, evolutionary analyses, among others.
Availability and implementation
MLDSP-GUI is open-source, cross-platform compatible, and is available under the terms of the Creative Commons Attribution 4.0 International license (http://creativecommons.org/licenses/by/4.0/). The executable and dataset files are available at https://sourceforge.net/projects/mldsp-gui/.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gurjit S Randhawa: Department of Computer Science, University of Western Ontario, London, ON N6A 5B7, Canada
- Kathleen A Hill: Department of Biology, University of Western Ontario, London, ON N6A 5B7, Canada
- Lila Kari: School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada
36
|
He L, Dong R, He RL, Yau SST. A novel alignment-free method for HIV-1 subtype classification. Infect Genet Evol 2019; 77:104080. [PMID: 31683009 DOI: 10.1016/j.meegid.2019.104080] [Received: 07/03/2019] [Revised: 10/08/2019] [Accepted: 10/20/2019] [Indexed: 11/16/2022]
Abstract
HIV-1 is the most common and pathogenic strain of human immunodeficiency virus consisting of many subtypes. To study the difference among HIV-1 subtypes in infection, diagnosis and drug design, it is important to identify HIV-1 subtypes from clinical HIV-1 samples. In this work, we propose an effective numeric representation called Subsequence Natural Vector (SNV) to encode HIV-1 sequences. Using the representation, we introduce an improved linear discriminant analysis method to classify HIV-1 viruses correctly. SNV is based on distribution of nucleotides in HIV-1 viral sequences. It not only computes the number of nucleotides, but also describes the position and variance of nucleotides in viruses. To validate our alignment-free method, 6902 complete genomes and 11,668 pol gene sequences of HIV-1 subtypes were collected from the up-to-date Los Alamos HIV database. SNV outperforms the three popular methods, Kameris, Comet and REGA, with almost 100% Sensitivity and Specificity, also with much less time. Our subtyping algorithm especially works better for circulating recombinant forms (CRFs) consisting of a few sequences. Our approach is also powerful to separate unique recombinant forms (URFs) from other subtypes with 100% Sensitivity and Specificity. Moreover, phylogenetic trees based on SNV representation are constructed using full-length HIV-1 genomes and pol genes respectively, where viruses from the same subtype are clustered together correctly.
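The idea of encoding a sequence by the number, position and variance of each nucleotide can be illustrated with a minimal sketch. It assumes the usual natural-vector convention: per nucleotide, the count, the mean 1-based position, and the second central moment of the positions scaled by count times sequence length. How the published SNV partitions a sequence into subsequences is not described in the abstract, so this sketch covers only the per-sequence statistics, and the function name `natural_vector` is hypothetical.

```python
def natural_vector(sequence: str) -> list[float]:
    """Return [count, mean position, scaled variance] for A, C, G and T.

    Assumptions: positions are 1-based, and the variance term is divided
    by (count * sequence length), a common natural-vector scaling.
    """
    seq = sequence.upper()
    length = len(seq)
    vector: list[float] = []
    for base in "ACGT":
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        count = len(positions)
        if count == 0:
            # A base that never occurs contributes three zeros.
            vector.extend([0.0, 0.0, 0.0])
            continue
        mean = sum(positions) / count
        spread = sum((p - mean) ** 2 for p in positions) / (count * length)
        vector.extend([float(count), mean, spread])
    return vector


if __name__ == "__main__":
    print(natural_vector("AACG"))
```

Every sequence maps to a 12-value vector regardless of its length, which is what lets a classifier such as linear discriminant analysis compare genomes without aligning them.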
Affiliation(s)
- Lily He: Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
- Rui Dong: Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
- Rong Lucy He: Department of Biological Sciences, Chicago State University, Chicago, United States of America
- Stephen S-T Yau: Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
37
|
Forsdyke DR. Success of alignment-free oligonucleotide (k-mer) analysis confirms relative importance of genomes not genes in speciation and phylogeny. Biol J Linn Soc Lond 2019. [DOI: 10.1093/biolinnean/blz096] [Indexed: 11/13/2022]
Abstract
The utility of DNA sequence substrings (k-mers) in alignment-free phylogenetic classification, including that of bacteria and viruses, is increasingly recognized. However, its biological basis eludes many 21st century practitioners. A path from the 19th century recognition of the informational basis of heredity to the modern era can be discerned. Crick’s DNA ‘unpairing postulate’ predicted that recombinational pairing of homologous DNAs during meiosis would be mediated by short k-mers in the loops of stem-loop structures extruded from classical duplex helices. The complementary ‘kissing’ duplex loops – like tRNA anticodon–codon k-mer duplexes – would seed a more extensive pairing that would then extend until limited by lack of homology or other factors. Indeed, this became the principle behind alignment-based methods that assessed similarity by degree of DNA–DNA reassociation in vitro. These are now seen as less sensitive than alignment-free methods that are closely consistent, both theoretically and mechanistically, with chromosomal anti-recombination models for the initiation of divergence into new species. The analytical power of k-mer differences supports the theses that evolutionary advance sometimes serves the needs of nucleic acids (genomes) rather than proteins (genes), and that such differences can play a role in early speciation events.
Affiliation(s)
- Donald R Forsdyke: Department of Biomedical and Molecular Sciences, Queen’s University, Kingston, Ontario, Canada
38
|
Randhawa GS, Hill KA, Kari L. ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels. BMC Genomics 2019; 20:267. [PMID: 30943897 PMCID: PMC6448311 DOI: 10.1186/s12864-019-5571-y] [Received: 09/04/2018] [Accepted: 02/27/2019] [Indexed: 11/11/2022]
Abstract
Background
Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods.
Results
We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed, on two small benchmark datasets and one large 4322 vertebrate mtDNA genomes dataset. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy, and is overall faster. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4,710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the “Purine/Pyrimidine”, “Just-A” and “Real” numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes.
Conclusions
Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
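The numerical representations named in the abstract can be illustrated with a minimal sketch that turns a DNA string into a discrete signal and then takes its Fourier magnitude spectrum. The specific numeric values in `REPRESENTATIONS` are illustrative assumptions rather than values taken from the abstract, and the use of a discrete Fourier transform is assumed here as a typical digital signal processing step. The names `to_signal` and `magnitude_spectrum` are hypothetical.

```python
import cmath

# Assumed base-to-number mappings for the three representations named
# in the abstract; the exact values used by ML-DSP may differ.
REPRESENTATIONS = {
    "purine_pyrimidine": {"A": -1.0, "G": -1.0, "C": 1.0, "T": 1.0},
    "just_a": {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    "real": {"A": -1.5, "C": 0.5, "G": -0.5, "T": 1.5},
}


def to_signal(sequence: str, scheme: str = "purine_pyrimidine") -> list[float]:
    """Map each unambiguous base to a number; other symbols are dropped."""
    table = REPRESENTATIONS[scheme]
    return [table[base] for base in sequence.upper() if base in table]


def magnitude_spectrum(signal: list[float]) -> list[float]:
    """Return |DFT| of the signal using a direct O(n^2) transform."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


if __name__ == "__main__":
    signal = to_signal("ACGT")
    print(signal)
    print([round(m, 6) for m in magnitude_spectrum(signal)])
```

The direct transform keeps the sketch dependency-free; for real genomes a fast Fourier transform such as `numpy.fft.fft` would be the practical choice.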
Affiliation(s)
- Gurjit S Randhawa: Department of Computer Science, University of Western Ontario, London, ON, Canada
- Kathleen A Hill: Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari: School of Computer Science, University of Waterloo, Waterloo, ON, Canada