Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Solis-Reyes S, Avino M, Poon A, Kari L. An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes. PLoS One 2018;13:e0206409. [PMID: 30427878 DOI: 10.1371/journal.pone.0206409] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 07/24/2018] [Accepted: 10/14/2018] [Indexed: 01/11/2023] Open

Cited by Other Article(s)

Çiftçi B, Tekin R. Prediction of viral families and hosts of single-stranded RNA viruses based on K-Mer coding from phylogenetic gene sequences. Comput Biol Chem 2024;112:108114. [PMID: 38852362 DOI: 10.1016/j.compbiolchem.2024.108114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2023] [Revised: 05/06/2024] [Accepted: 05/25/2024] [Indexed: 06/11/2024]

Abstract

There are billions of virus species worldwide, and viruses, the smallest parasitic entities, pose a serious threat. Therefore, fighting associated disorders requires an understanding of the genetic structure of viruses. Considering the wide diversity and rapid evolution of viruses, there is a critical need to quickly and accurately classify viral species and their potential hosts to better understand transmission dynamics, facilitating the development of targeted therapies. Recognizing this, this study has investigated the classes of RNA viruses based on their genomic sequences using Machine Learning (ML) and Deep Learning (DL) models. The PhyVirus dataset, consisting of pathogenic Single-stranded RNA viruses of Baltimore group four (+ssRNA) and five (-ssRNA) with different hosts and species, was analyzed. The dataset containing viral gene sequences was analyzed using the K-Mer coding technique, which is based on base words of various lengths. The study used classical ML algorithms (Random Forest, Gradient Boosting and Extra Trees) and the Fully Connected Deep Neural Network, a Deep Learning algorithm, to predict viral families and hosts. Detailed analyses were performed on the classifier performance in scenarios with different train-test ratios and different word lengths (k-values) for K-Mer. The observed results show that Fully Connected Deep Neural Network has a high success rate of 99.60 % in predicting virus families. In predicting virus hosts, the Extra Trees classifier achieved the highest success rate of 81.53 %. This study is considered to be the first classification study in the literature on this dataset, which has a very large family and host diversity consisting of gene sequences of Single-stranded RNA viruses. Our detailed investigations on how varying word lengths based on K-Mer coding in gene sequences affect the classification into viral families and hosts make this study particularly valuable. 
This study shows that ML and DL methods have the potential to produce valuable results in phylogenetic studies. In addition, the results and high-performance values show that these methods can be successfully used in regenerative applications of gene sequences or in studies such as the elimination of losses in gene sequences.
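The K-Mer coding mentioned in this abstract can be illustrated with a minimal sketch. It assumes a DNA alphabet (`ACGT`), overlapping windows, and normalized frequencies; the function names `kmer_counts` and `kmer_vector` are hypothetical and are not taken from the cited study, whose exact encoding may differ.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count overlapping k-mers, skipping windows that contain non-ACGT symbols."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_vector(sequence, k):
    """Return normalized k-mer frequencies in lexicographic order (length 4**k)."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]
```

A vector such as `kmer_vector(genome, 3)` has a fixed length of 64 regardless of sequence length, which is what allows classifiers like Random Forest or a fully connected network to consume genomes of different sizes.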

Qayyum A, Benzinou A, Saidani O, Alhayan F, Khan MA, Masood A, Mazher M. Assessment and classification of COVID-19 DNA sequence using pairwise features concatenation from multi-transformer and deep features with machine learning models. SLAS Technol 2024;29:100147. [PMID: 38796034 DOI: 10.1016/j.slast.2024.100147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2024] [Revised: 03/31/2024] [Accepted: 05/22/2024] [Indexed: 05/28/2024]

Abstract

The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such a major viral outbreak demands early elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. The emerging global infectious COVID-19 disease by novel Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) presents critical threats to global public health and the economy since it was identified in late December 2019 in China. The virus has gone through various pathways of evolution. Due to the continued evolution of the SARS-CoV-2 pandemic, researchers worldwide are working to mitigate, suppress its spread, and better understand it by deploying deep learning and machine learning approaches. In a general computational context for biomedical data analysis, DNA sequence classification is a crucial challenge. Several machine and deep learning techniques have been used in recent years to complete this task with some success. The classification of DNA sequences is a key research area in bioinformatics as it enables researchers to conduct genomic analysis and detect possible diseases. In this paper, three state-of-the-art deep learning-based models are proposed using two DNA sequence conversion methods. We also proposed a novel multi-transformer deep learning model and pairwise features fusion technique for DNA sequence classification. Furthermore, deep features are extracted from the last layer of the multi-transformer and used in machine-learning models for DNA sequence classification. The k-mer and one-hot encoding sequence conversion techniques have been presented. The proposed multi-transformer achieved the highest performance in COVID DNA sequence classification. Automatic identification and classification of viruses are essential to avoid an outbreak like COVID-19. 
It also helps in detecting the effect of viruses and drug design.
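Of the two sequence conversion methods named in this abstract, one-hot encoding can be sketched as follows. This is an illustrative sketch only: the channel order A, C, G, T, the all-zero row for ambiguous symbols, and the name `one_hot_encode` are assumptions, not details reported by the authors.

```python
# Assumed channel order: A, C, G, T. Any other symbol maps to an all-zero row.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    return [list(ONE_HOT.get(base, (0, 0, 0, 0))) for base in sequence.upper()]
```

Unlike a k-mer frequency vector, this representation keeps positional information, but its length grows with the sequence, so inputs usually need padding or truncation before being fed to a network.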

Duchen D, Clipman SJ, Vergara C, Thio CL, Thomas DL, Duggal P, Wojcik GL. A hepatitis B virus (HBV) sequence variation graph improves alignment and sample-specific consensus sequence construction. PLoS One 2024;19:e0301069. [PMID: 38669259 PMCID: PMC11051683 DOI: 10.1371/journal.pone.0301069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2023] [Accepted: 03/09/2024] [Indexed: 04/28/2024] Open

Lebatteux D, Soudeyns H, Boucoiran I, Gantt S, Diallo AB. Machine learning-based approach KEVOLVE efficiently identifies SARS-CoV-2 variant-specific genomic signatures. PLoS One 2024;19:e0296627. [PMID: 38241279 PMCID: PMC10798494 DOI: 10.1371/journal.pone.0296627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2022] [Accepted: 12/07/2023] [Indexed: 01/21/2024] Open

Alipour F, Holmes C, Lu YY, Hill KA, Kari L. Leveraging machine learning for taxonomic classification of emerging astroviruses. Front Mol Biosci 2024;10:1305506. [PMID: 38274100 PMCID: PMC10808839 DOI: 10.3389/fmolb.2023.1305506] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2023] [Accepted: 12/12/2023] [Indexed: 01/27/2024] Open

Abstract

Astroviruses are a family of genetically diverse viruses associated with disease in humans and birds with significant health effects and economic burdens. Astrovirus taxonomic classification includes two genera, Avastrovirus and Mamastrovirus. However, with next-generation sequencing, broader interspecies transmission has been observed, necessitating a reexamination of the current host-based taxonomic classification approach. In this study, a novel taxonomic classification method is presented for emergent and as yet unclassified astroviruses, based on whole genome sequence k-mer composition in addition to host information. An optional component responsible for identifying recombinant sequences was added to the method's pipeline, to counteract the impact of genetic recombination on viral classification. The proposed three-pronged classification method consists of a supervised machine learning method, an unsupervised machine learning method, and the consideration of host species. Using this three-pronged approach, we propose genus labels for 191 as yet unclassified astrovirus genomes. Genus labels are also suggested for an additional eight as yet unclassified astrovirus genomes for which incompatibility was observed with the host species, suggesting cross-species infection. Lastly, our machine learning-based approach augmented by a principal component analysis (PCA) provides evidence supporting the hypothesis of the existence of a human astrovirus (HAstV) subgenus of the genus Mamastrovirus, and a goose astrovirus (GoAstV) subgenus of the genus Avastrovirus. Overall, this multipronged machine learning approach provides a fast, reliable, and scalable prediction method of taxonomic labels, able to keep pace with emerging viruses and the exponential increase in the output of modern genome sequencing technologies.

Thind AS, Sinha S. Using Chaos-Game-Representation for Analysing the SARS-CoV-2 Lineages, Newly Emerging Strains and Recombinants. Curr Genomics 2023;24:187-195. [PMID: 38178984 PMCID: PMC10761335 DOI: 10.2174/0113892029264990231013112156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Revised: 08/09/2023] [Accepted: 09/15/2023] [Indexed: 01/06/2024] Open

Abstract

Background

Viruses have high mutation rates, facilitating rapid evolution and the emergence of new species, subspecies, strains and recombinant forms. Accurate classification of these forms is crucial for understanding viral evolution and developing therapeutic applications. Phylogenetic classification is typically performed by analyzing molecular differences at the genomic and sub-genomic levels. This involves aligning homologous proteins or genes. However, there is growing interest in developing alignment-free methods for whole-genome comparisons that are computationally efficient.

Methods

Here we elaborate on the Chaos Game Representation (CGR) method, based on concepts of statistical physics and free of sequence alignment assumptions. We adopt the CGR method for classification of the closely related clades/lineages A and B of the SARS-Corona virus 2019 (SARS-CoV-2), which is one of the fastest evolving viruses.

Results

Our study shows that the CGR approach can easily yield the SARS-CoV-2 phylogeny from the available whole genomes of lineage A and lineage B sequences. It also shows an accurate classification of eight different strains and the newly evolved XBB variant from its parental strains. Compared to alignment-based methods (Neighbour-Joining and Maximum Likelihood), the CGR method requires low computational resources, is fast and accurate for long sequences, and, being a K-mer based approach, allows simultaneous comparison of a large number of closely-related sequences of different sizes. Further, we developed an R pipeline CGRphylo, available on GitHub, which integrates the CGR module with various other R packages to create phylogenetic trees and visualize them.

Conclusion

Our findings demonstrate the efficacy of the CGR method for accurate classification and tracking of rapidly evolving viruses, offering valuable insights into the evolution and emergence of new SARS-CoV-2 strains and recombinants.
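The Chaos Game Representation named in this abstract maps a sequence to points in the unit square by repeatedly moving halfway toward the corner assigned to each base. The sketch below shows only that core iteration. The corner assignment (A bottom-left, C top-left, G top-right, T bottom-right) is a common convention assumed here, the name `cgr_points` is hypothetical, and none of this reproduces the authors' CGRphylo pipeline.

```python
# Assumed corner assignment; other layouts appear in the literature.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """Return the CGR trajectory of a DNA string as a list of (x, y) points."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points
```

Binning these points on a 2^k by 2^k grid yields k-mer frequencies, which is why CGR is commonly described as a k-mer based, alignment-free approach.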

Arias PM, Butler J, Randhawa GS, Soltysiak MPM, Hill KA, Kari L. Environment and taxonomy shape the genomic signature of prokaryotic extremophiles. Sci Rep 2023;13:16105. [PMID: 37752120 PMCID: PMC10522608 DOI: 10.1038/s41598-023-42518-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Accepted: 09/11/2023] [Indexed: 09/28/2023] Open

Naorem LD, Sharma N, Raghava GPS. A web server for predicting and scanning of IL-5 inducing peptides using alignment-free and alignment-based method. Comput Biol Med 2023;158:106864. [PMID: 37058758 DOI: 10.1016/j.compbiomed.2023.106864] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2022] [Revised: 03/06/2023] [Accepted: 03/30/2023] [Indexed: 04/16/2023]

Abadi SAR, Mohammadi A, Koohi S. An automated ultra-fast, memory-efficient, and accurate method for viral genome classification. J Biomed Inform 2023;139:104316. [PMID: 36781036 DOI: 10.1016/j.jbi.2023.104316] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2022] [Revised: 01/30/2023] [Accepted: 02/08/2023] [Indexed: 02/13/2023]

Chourasia P, Ali S, Ciccolella S, Vedova GD, Patterson M. Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads Data. J Comput Biol 2023;30:469-491. [PMID: 36730750 DOI: 10.1089/cmb.2022.0424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]

Bhattacharya D, Kleeblatt DC, Statt A, Reinhart WF. Predicting aggregate morphology of sequence-defined macromolecules with recurrent neural networks. Soft Matter 2022;18:5037-5051. [PMID: 35748651 DOI: 10.1039/d2sm00452f] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 06/15/2023]

Sarkar J, Saha I, Ghosh N, Maity D, Plewczynski D. Online Predictor Using Machine Learning to Predict Novel Coronavirus and Other Pathogenic Viruses. ACS Omega 2022;7:23069-23074. [PMID: 35847318 PMCID: PMC9280959 DOI: 10.1021/acsomega.2c00215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]

McElhinney JMWR, Catacutan MK, Mawart A, Hasan A, Dias J. Interfacing Machine Learning and Microbial Omics: A Promising Means to Address Environmental Challenges. Front Microbiol 2022;13:851450. [PMID: 35547145 PMCID: PMC9083327 DOI: 10.3389/fmicb.2022.851450] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/09/2022] [Accepted: 03/14/2022] [Indexed: 11/13/2022] Open

WalkIm: Compact image-based encoding for high-performance classification of biological sequences using simple tuning-free CNNs. PLoS One 2022;17:e0267106. [PMID: 35427371 PMCID: PMC9012348 DOI: 10.1371/journal.pone.0267106] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/06/2021] [Accepted: 04/01/2022] [Indexed: 11/28/2022] Open

Abstract

The classification of biological sequences is an open issue for a variety of data sets, such as viral and metagenomics sequences. Therefore, many studies utilize neural network tools, as the well-known methods in this field, and focus on designing customized network structures. However, few works focus on more effective factors, such as input encoding method or implementation technology, to address accuracy and efficiency issues in this area. Therefore, in this work, we propose an image-based encoding method, called WalkIm, whose adoption, even in a simple neural network, provides competitive accuracy and superior efficiency compared to the existing classification methods (e.g. VGDC, CASTOR, and DLM-CNN) for a variety of biological sequences. When WalkIm is used to classify various data sets (i.e. viruses whole-genome data, metagenomics read data, and metabarcoding data), it achieves the same performance as the existing methods, with no enforcement of parameter initialization or network architecture adjustment for each data set. It is worth noting that even in the case of classifying high-mutant data sets, such as Coronaviruses, it achieves almost 100% accuracy for classifying its various types. In addition, WalkIm achieves high-speed convergence during network training, as well as reduction of network complexity. Therefore, the WalkIm method enables us to execute the classifying neural networks on a normal desktop system in a short time interval. Moreover, we addressed the compatibility of the WalkIm encoding method with free-space optical processing technology. Taking advantage of optical implementation of convolutional layers, we illustrated that the training time can be reduced by up to 500 times. In addition to all aforementioned advantages, this encoding method preserves the structure of generated images in various modes of sequence transformation, such as reverse complement, complement, and reverse modes.

Cacciabue M, Aguilera P, Gismondi MI, Taboga O. Covidex: An ultrafast and accurate tool for SARS-CoV-2 subtyping. Infect Genet Evol 2022;99:105261. [PMID: 35231666 PMCID: PMC8881885 DOI: 10.1016/j.meegid.2022.105261] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/17/2021] [Revised: 12/20/2021] [Accepted: 02/23/2022] [Indexed: 11/29/2022]

PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences. Biology 2022;11:418. [PMID: 35336792 PMCID: PMC8945605 DOI: 10.3390/biology11030418] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/04/2022] [Revised: 02/24/2022] [Accepted: 03/07/2022] [Indexed: 01/14/2023]

Abstract

Simple Summary

The family of coronaviruses comprises a diverse set of strains and variants which cause diseases from the common cold to COVID-19. Moreover, they infect a wide array of hosts, from bats, camels, and birds to humans. Studying coronaviruses through the lens of host specificity provides a unique perspective to understanding the evolution, diversity and dynamics of this family. In particular, this can reveal groups of different hosts infected by similar strains, giving clues on strains which were more likely to have evolved to jump from one host to another. In this work, we frame host specificity as a classification task, in designing a very compact numerical representation of the spike sequences of different coronaviruses. Based on this numerical representation, classification methods are able to detect the target host with high accuracy. Such an approach can be used to efficiently scale to large volumes of sequences, in order to unveil trends in the host specificity of different coronavirus strains.

Abstract

The study of host specificity has important connections to the question about the origin of SARS-CoV-2 in humans which led to the COVID-19 pandemic—an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is important in determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among birds, bats, camels, swine, humans, and weasels, to name a few. We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and we use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function and identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs from viral sequences to generate fixed-length feature vector representations, and use them in the context of host classification. The results on real world data show that when using PWM2Vec, machine learning classifiers are able to perform comparably to the baseline models in terms of predictive performance and runtime—in some cases, the performance is better. We also measure the importance of different amino acids using information gain to show the amino acids which are important for predicting the host of a given coronavirus. 
Finally, we perform some statistical analyses on these results to show that our embedding is more compact than the embeddings of the baseline models.
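The position weight matrix on which this embedding builds can be sketched in a few lines. This is a generic log-odds PWM under stated assumptions (uniform background, a pseudocount of 1, equal-length k-mers as input); the names `position_weight_matrix` and `score_kmer` are hypothetical, and the sketch is not the PWM2Vec procedure itself.

```python
import math


def position_weight_matrix(kmers, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds PWM (one dict per position) from equal-length k-mers."""
    k = len(kmers[0])
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    pwm = []
    for pos in range(k):
        column = [kmer[pos] for kmer in kmers]
        total = len(column) + pseudocount * len(alphabet)
        pwm.append({
            symbol: math.log2(((column.count(symbol) + pseudocount) / total) / background)
            for symbol in alphabet
        })
    return pwm


def score_kmer(kmer, pwm):
    """Sum the per-position log-odds weights of a k-mer under the PWM."""
    return sum(pwm[pos][symbol] for pos, symbol in enumerate(kmer))
```

For protein data such as spike sequences, `alphabet` would be set to the 20 amino-acid letters. Scoring each k-mer of a sequence against its PWM gives one number per window, which is one way to obtain a compact numerical representation.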

Ekpenyong ME, Adegoke AA, Edoho ME, Inyang UG, Udo IJ, Ekaidem IS, Osang F, Uto NP, Geoffery JI. Collaborative Mining of Whole Genome Sequences for Intelligent HIV-1 Sub-Strain(s) Discovery. Curr HIV Res 2022;20:163-183. [PMID: 35142269 DOI: 10.2174/1570162x20666220210142209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2021] [Revised: 11/30/2021] [Accepted: 12/20/2021] [Indexed: 11/22/2022]

Abstract

BACKGROUND

Effective global antiretroviral vaccines and therapeutic strategies depend on the diversity, evolution, and epidemiology of their various strains as well as their transmission and pathogenesis. Most viral disease-causing particles are clustered into a taxonomy of subtypes to suggest pointers toward nucleotide-specific vaccines or therapeutic applications of clinical significance sufficient for sequence-specific diagnosis and homologous viral studies. These are very useful to formulate predictors to induce cross-resistance to some retroviral control drugs being used across study areas.

OBJECTIVE

This research proposed a collaborative framework of hybridized (Machine Learning and Natural Language Processing) techniques to discover hidden genome patterns and feature predictors for mining HIV-1 genome sequences.

METHOD

630 human HIV-1 genome sequences longer than 8500 bp were retrieved from the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov) for 21 countries across different continents, excluding Antarctica. These sequences were transformed and learned using a self-organizing map (SOM). To discriminate emerging/new sub-strain(s), the HIV-1 reference genome was included as part of the input isolates/samples during the training. After training the SOM, component planes defining pattern clusters of the input datasets were generated, for cognitive knowledge mining and subsequent labelling of the datasets. Additional genome features, including dinucleotide transmission recurrences, codon recurrences, and mutation recurrences, were finally extracted from the raw genomes to construct output classification targets for supervised learning.

RESULTS

SOM training explains the inherent pattern diversity of HIV-1 genomes as well as inter- and intra-country transmissions in which mobility might play an active role, as corroborated by literature. Nine sub-strains were discovered after disassembling the SOM correlation hunting matrix space attributed to disparate clusters. Cognitive knowledge mining separated similar pattern clusters bounded by a certain degree of correlation range, discovered by the SOM. A Kruskal-Wallis rank-sum test and Wilcoxon rank-sum test showed statistically significant variations in dinucleotide, codon, and mutation patterns.

CONCLUSION

Results of the discovered sub-strains and response clusters visualizations corroborate existing literature, with significant haplotype variations. The proposed framework would assist in the development of decision support systems for easy contact tracing, infectious disease surveillance, and studying the progressive evolution of the reference HIV-1 genome.

Millán Arias P, Alipour F, Hill KA, Kari L. DeLUCS: Deep learning for unsupervised clustering of DNA sequences. PLoS One 2022;17:e0261531. [PMID: 35061715 PMCID: PMC8782307 DOI: 10.1371/journal.pone.0261531] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 05/03/2021] [Accepted: 12/06/2021] [Indexed: 11/25/2022] Open

Singh OP, Vallejo M, El-Badawy IM, Aysha A, Madhanagopal J, Mohd Faudzi AA. Classification of SARS-CoV-2 and non-SARS-CoV-2 using machine learning algorithms. Comput Biol Med 2021;136:104650. [PMID: 34329865 PMCID: PMC8294595 DOI: 10.1016/j.compbiomed.2021.104650] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 03/28/2021] [Revised: 07/08/2021] [Accepted: 07/13/2021] [Indexed: 11/28/2022]

Analysis of DNA Sequence Classification Using CNN and Hybrid Models. Comput Math Methods Med 2021;2021:1835056. [PMID: 34306171 PMCID: PMC8285202 DOI: 10.1155/2021/1835056] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 05/24/2021] [Accepted: 06/25/2021] [Indexed: 12/23/2022]

Tang R, Yu Z, Ma Y, Wu Y, Phoebe Chen YP, Wong L, Li J. Genetic source completeness of HIV-1 circulating recombinant forms (CRFs) predicted by multi-label learning. Bioinformatics 2021;37:750-758. [PMID: 33063094 DOI: 10.1093/bioinformatics/btaa887] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/27/2020] [Revised: 08/12/2020] [Accepted: 09/30/2020] [Indexed: 12/18/2022] Open

Hufsky F, Lamkiewicz K, Almeida A, Aouacheria A, Arighi C, Bateman A, Baumbach J, Beerenwinkel N, Brandt C, Cacciabue M, Chuguransky S, Drechsel O, Finn RD, Fritz A, Fuchs S, Hattab G, Hauschild AC, Heider D, Hoffmann M, Hölzer M, Hoops S, Kaderali L, Kalvari I, von Kleist M, Kmiecinski R, Kühnert D, Lasso G, Libin P, List M, Löchel HF, Martin MJ, Martin R, Matschinske J, McHardy AC, Mendes P, Mistry J, Navratil V, Nawrocki EP, O’Toole ÁN, Ontiveros-Palacios N, Petrov AI, Rangel-Pineros G, Redaschi N, Reimering S, Reinert K, Reyes A, Richardson L, Robertson DL, Sadegh S, Singer JB, Theys K, Upton C, Welzel M, Williams L, Marz M. Computational strategies to combat COVID-19: useful tools to accelerate SARS-CoV-2 and coronavirus research. Brief Bioinform 2021;22:642-663. [PMID: 33147627 PMCID: PMC7665365 DOI: 10.1093/bib/bbaa232] [Citation(s) in RCA: 78] [Impact Index Per Article: 26.0] [Received: 05/27/2020] [Revised: 07/28/2020] [Accepted: 08/26/2020] [Indexed: 12/16/2022] Open

Affiliation(s)

Franziska Hufsky Friedrich-Schiller-University Jena, Germany
Kevin Lamkiewicz Friedrich-Schiller-University Jena, Germany
Alexandre Almeida EMBL-EBI and the Wellcome Sanger Institute, UK
Abdel Aouacheria CNRS, France
Cecilia Arighi Biocuration and Literature Access at PIR, USA
Alex Bateman Protein Sequence Resources at EMBL-EBI, UK
Jan Baumbach Technical University of Munich, Germany
Niko Beerenwinkel Computational Biology at ETH Zurich, Switzerland
Christian Brandt Institute of Infectious Disease and Infection Control at Jena University Hospital, Germany
Marco Cacciabue Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) working on FMDV virology at the Instituto de Agrobiotecnología y Biología Molecular (IABiMo, INTA-CONICET) and at the Departamento de Ciencias Básicas, Universidad Nacional de Luján (UNLu), Argentina
Sara Chuguransky Pfam and InterPro databases, at the EMBL-EBI, UK
Oliver Drechsel bioinformatics department at the Robert Koch-Institute, Germany
Robert D Finn Pfam and MGnify
Adrian Fritz Computational Biology of Infection Research group of Alice C. McHardy at the Helmholtz Centre for Infection Research, Germany
Stephan Fuchs bioinformatics department at the Robert Koch-Institute, Germany
Georges Hattab Bioinformatics Division at Philipps-University Marburg, Germany
Anne-Christin Hauschild Philipps-University Marburg, Germany
Dominik Heider Data Science in Biomedicine at the Philipps-University of Marburg, Germany
Marie Hoffmann Freie Universität Berlin, Germany
Martin Hölzer Friedrich Schiller University Jena, Germany
Stefan Hoops Biocomplexity Institute and Initiative at the University of Virginia, USA
Lars Kaderali Bioinformatics and head of the Institute of Bioinformatics at University Medicine Greifswald, Germany
Ioanna Kalvari Senior Software Developer
Max von Kleist bioinformatics department at the Robert Koch-Institute, Germany
René Kmiecinski bioinformatics department at the Robert Koch-Institute, Germany
Denise Kühnert Max Planck Institute for the Science of Human History
Gorka Lasso Chandran Lab, Albert Einstein College of Medicine, USA
Pieter Libin University of Hasselt, Belgium
Markus List Technical University of Munich, Germany
Hannah F Löchel Philipps-University Marburg, Germany
Maria J Martin EMBL-EBI, UK
Roman Martin Philipps-University Marburg, Germany
Julian Matschinske Chair of Experimental Bioinformatics at TU Munich, Germany
Alice C McHardy Computational Biology of Infection Research Lab at the Helmholtz Centre for Infection Research in Braunschweig, Germany
Pedro Mendes Center for Quantitative Medicine of the University of Connecticut School of Medicine, USA
Jaina Mistry EMBL-EBI, UK
Vincent Navratil Bioinformatics and Systems Biology at the Rhône Alpes Bioinformatics core facility, Université de Lyon, France
Eric P Nawrocki National Center for Biotechnology Information (NCBI)
Áine Niamh O’Toole Rambaut group at Edinburgh University, UK
Nancy Ontiveros-Palacios EMBL-EBI, UK
Anton I Petrov EMBL-EBI, UK
Guillermo Rangel-Pineros GLOBE Institute in the University of Copenhagen, Denmark
Nicole Redaschi Development of the Swiss-Prot group at the SIB for UniProt and SIB resources that cover viral biology (ViralZone)
Susanne Reimering Computational Biology of Infection Research group of Alice C. McHardy at the Helmholtz Centre for Infection Research
Knut Reinert Freie Universität Berlin, Germany
Alejandro Reyes Universidad de los Andes, Colombia
Lorna Richardson Sequence Families team at EMBL-EBI, UK
David L Robertson MRC-University of Glasgow Centre for Virus Research, UK
Sepideh Sadegh Chair of Experimental Bioinformatics at Technical University of Munich, Germany
Joshua B Singer MRC-University of Glasgow Centre for Virus Research, Glasgow, Scotland, UK
Kristof Theys Rega institute of the University of Leuven, Belgium
Chris Upton Department of Biochemistry and Microbiology, University of Victoria, Canada
Marius Welzel Philipps-University Marburg, Germany
Lowri Williams Pfam and InterPro databases, at EMBL-EBI, UK
Manja Marz Friedrich Schiller University Jena, Germany

Auslander N, Gussow AB, Koonin EV. Incorporating Machine Learning into Established Bioinformatics Frameworks. Int J Mol Sci 2021;22:2903. [PMID: 33809353 PMCID: PMC8000113 DOI: 10.3390/ijms22062903] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2021] [Revised: 03/08/2021] [Accepted: 03/10/2021] [Indexed: 12/23/2022] Open

Saha I, Ghosh N, Maity D, Seal A, Plewczynski D. COVID-DeepPredictor: Recurrent Neural Network to Predict SARS-CoV-2 and Other Pathogenic Viruses. Front Genet 2021;12:569120. [PMID: 33643375 PMCID: PMC7906283 DOI: 10.3389/fgene.2021.569120] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2020] [Accepted: 01/13/2021] [Indexed: 11/13/2022] Open

Abstract

The COVID-19 disease for Novel coronavirus (SARS-CoV-2) has turned out to be a global pandemic. The high transmission rate of this pathogenic virus demands an early prediction and proper identification for the subsequent treatment. However, polymorphic nature of this virus allows it to adapt and sustain in different kinds of environment which makes it difficult to predict. On the other hand, there are other pathogens like SARS-CoV-1, MERS-CoV, Ebola, Dengue, and Influenza as well, so that a predictor is highly required to distinguish them with the use of their genomic information. To mitigate this problem, in this work COVID-DeepPredictor is proposed on the framework of deep learning to identify an unknown sequence of these pathogens. COVID-DeepPredictor uses Long Short Term Memory as Recurrent Neural Network for the underlying prediction with an alignment-free technique. In this regard, k-mer technique is applied to create Bag-of-Descriptors (BoDs) in order to generate Bag-of-Unique-Descriptors (BoUDs) as vocabulary and subsequently embedded representation is prepared for the given virus sequences. This predictor is not only validated for the dataset using K-fold cross-validation but also for unseen test datasets of SARS-CoV-2 sequences and sequences from other viruses as well. To verify the efficacy of COVID-DeepPredictor, it has been compared with other state-of-the-art prediction techniques based on Linear Discriminant Analysis, Random Forests, and Gradient Boosting Method. COVID-DeepPredictor achieves 100% prediction accuracy on validation dataset while on test datasets, the accuracy ranges from 99.51 to 99.94%. It shows superior results over other prediction techniques as well.
In addition to this, accuracy and runtime of COVID-DeepPredictor are considered simultaneously to determine the value of k in k-mer, a comparative study among k values in k-mer, Bag-of-Descriptors (BoDs), and Bag-of-Unique-Descriptors (BoUDs) and a comparison between COVID-DeepPredictor and Nucleotide BLAST have also been performed. The code, training, and test datasets used for COVID-DeepPredictor are available at http://www.nitttrkol.ac.in/indrajit/projects/COVID-DeepPredictor/.
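
The k-mer vocabulary idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `kmers`, `build_vocabulary`, and `encode` are hypothetical, and reserving id 0 for k-mers outside the vocabulary is an assumption made here for illustration. The sketch only shows how overlapping k-mers (a "bag of descriptors") might be reduced to a set of unique k-mers (a vocabulary) and then mapped to integer ids suitable for an embedding layer.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order of occurrence."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocabulary(sequences, k):
    """Map each unique k-mer seen in the sequences to an integer id.

    Ids start at 1; id 0 is reserved for unseen k-mers (an assumption
    of this sketch, not something stated in the abstract).
    """
    unique = sorted({km for s in sequences for km in kmers(s, k)})
    return {km: i + 1 for i, km in enumerate(unique)}


def encode(seq, k, vocab):
    """Encode a sequence as a list of vocabulary ids, one per k-mer."""
    return [vocab.get(km, 0) for km in kmers(seq, k)]


# Toy sequences, invented purely for illustration.
toy = ["ACGTAC", "CGTACG"]
vocab = build_vocabulary(toy, k=3)
encoded = encode("ACGTT", 3, vocab)  # "GTT" is not in the vocabulary
```

Integer sequences of this kind are a common input format for recurrent models; how the published tool pads, truncates, or embeds them is not stated in the abstract.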

Kapaata A, Balinda SN, Xu R, Salazar MG, Herard K, Brooks K, Laban K, Hare J, Dilernia D, Kamali A, Ruzagira E, Mukasa F, Gilmour J, Salazar-Gonzalez JF, Yue L, Cotten M, Hunter E, Kaleebu P. HIV-1 Gag-Pol Sequences from Ugandan Early Infections Reveal Sequence Variants Associated with Elevated Replication Capacity. Viruses 2021;13:v13020171. [PMID: 33498793 PMCID: PMC7912664 DOI: 10.3390/v13020171] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2020] [Revised: 01/04/2021] [Accepted: 01/06/2021] [Indexed: 01/05/2023] Open

Affiliation(s)

Anne Kapaata Medical Research Council, UVRI & LSTHM Uganda Research Unit, Plot 51–59, Entebbe, Uganda; (A.K.); (S.N.B.); (M.G.S.); (K.L.); (E.R.); (F.M.); (J.F.S.-G.); (P.K.)
Sheila N. Balinda Medical Research Council, UVRI & LSTHM Uganda Research Unit, Plot 51–59, Entebbe, Uganda; (A.K.); (S.N.B.); (M.G.S.); (K.L.); (E.R.); (F.M.); (J.F.S.-G.); (P.K.)
Rui Xu Emory University, Atlanta, GA 30322, USA; (R.X.); (K.H.); (K.B.); (D.D.); (L.Y.); (E.H.)
Maria G. Salazar Medical Research Council, UVRI & LSTHM Uganda Research Unit, Plot 51–59, Entebbe, Uganda; (A.K.); (S.N.B.); (M.G.S.); (K.L.); (E.R.); (F.M.); (J.F.S.-G.); (P.K.)
Kimberly Herard Emory University, Atlanta, GA 30322, USA; (R.X.); (K.H.); (K.B.); (D.D.); (L.Y.); (E.H.)
Kelsie Brooks Emory University, Atlanta, GA 30322, USA; (R.X.); (K.H.); (K.B.); (D.D.); (L.Y.); (E.H.)
Kato Laban Medical Research Council, UVRI & LSTHM Uganda Research Unit, Plot 51–59, Entebbe, Uganda; (A.K.); (S.N.B.); (M.G.S.); (K.L.); (E.R.); (F.M.); (J.F.S.-G.); (P.K.)
Jonathan Hare Imperial College London, London SW7 2AZ, UK; (J.H.); (J.G.) International AIDS Vaccine Initiative (IAVI), New York, NY 10004, USA
Dario Dilernia Emory University, Atlanta, GA 30322, USA; (R.X.); (K.H.); (K.B.); (D.D.); (L.Y.); (E.H.)
Anatoli Kamali IAVI, Nairobi 00202, Kenya;
Eugene Ruzagira Medical Research Council, UVRI & LSTHM Uganda Research Unit, Plot 51–59, Entebbe, Uganda; (A.K.); (S.N.B.); (M.G.S.); (K.L.); (E.R.); (F.M.); (J.F.S.-G.); (P.K.)
Freddie Mukasa Medical Research Council, UVRI & LSTHM Uganda Research Unit, Plot 51–59, Entebbe, Uganda; (A.K.); (S.N.B.); (M.G.S.); (K.L.); (E.R.); (F.M.); (J.F.S.-G.); (P.K.)
Jill Gilmour Imperial College London, London SW7 2AZ, UK; (J.H.); (J.G.) International AIDS Vaccine Initiative (IAVI), New York, NY 10004, USA
Jesus F. Salazar-Gonzalez Medical Research Council, UVRI & LSTHM Uganda Research Unit, Plot 51–59, Entebbe, Uganda; (A.K.); (S.N.B.); (M.G.S.); (K.L.); (E.R.); (F.M.); (J.F.S.-G.); (P.K.)
Ling Yue Emory University, Atlanta, GA 30322, USA; (R.X.); (K.H.); (K.B.); (D.D.); (L.Y.); (E.H.)
Matthew Cotten Medical Research Council, UVRI & LSTHM Uganda Research Unit, Plot 51–59, Entebbe, Uganda; (A.K.); (S.N.B.); (M.G.S.); (K.L.); (E.R.); (F.M.); (J.F.S.-G.); (P.K.) Centre for Virus Research, MRC-University of Glasgow, Glasgow G61 1QH, UK Correspondence: ; Tel.: +25-6701-509-685
Eric Hunter Emory University, Atlanta, GA 30322, USA; (R.X.); (K.H.); (K.B.); (D.D.); (L.Y.); (E.H.)
Pontiano Kaleebu Medical Research Council, UVRI & LSTHM Uganda Research Unit, Plot 51–59, Entebbe, Uganda; (A.K.); (S.N.B.); (M.G.S.); (K.L.); (E.R.); (F.M.); (J.F.S.-G.); (P.K.)

Sarkar JP, Saha I, Seal A, Maity D, Maulik U. Topological Analysis for Sequence Variability: Case Study on more than 2K SARS-CoV-2 sequences of COVID-19 infected 54 countries in comparison with SARS-CoV-1 and MERS-CoV. INFECTION GENETICS AND EVOLUTION 2021;88:104708. [PMID: 33421654 PMCID: PMC7787073 DOI: 10.1016/j.meegid.2021.104708] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/28/2020] [Revised: 10/27/2020] [Accepted: 12/31/2020] [Indexed: 12/11/2022]

Kirk JM, Sprague D, Calabrese JM. Classification of Long Noncoding RNAs by k-mer Content. Methods Mol Biol 2021;2254:41-60. [PMID: 33326069 PMCID: PMC7850294 DOI: 10.1007/978-1-0716-1158-6_4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/18/2023]

MirLocPredictor: A ConvNet-Based Multi-Label MicroRNA Subcellular Localization Predictor by Incorporating k-Mer Positional Information. Genes (Basel) 2020;11:genes11121475. [PMID: 33316943 PMCID: PMC7763197 DOI: 10.3390/genes11121475] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2020] [Revised: 11/23/2020] [Accepted: 11/25/2020] [Indexed: 02/06/2023] Open

Abstract

MicroRNAs (miRNA) are small noncoding RNA sequences consisting of about 22 nucleotides that are involved in the regulation of almost 60% of mammalian genes. Presently, there are very limited approaches for the visualization of miRNA locations present inside cells to support the elucidation of pathways and mechanisms behind miRNA function, transport, and biogenesis. MIRLocator, a state-of-the-art tool for the prediction of subcellular localization of miRNAs makes use of a sequence-to-sequence model along with pretrained k-mer embeddings. Existing pretrained k-mer embedding generation methodologies focus on the extraction of semantics of k-mers. However, in RNA sequences, positional information of nucleotides is more important because distinct positions of the four nucleotides define the function of an RNA molecule. Considering the importance of the nucleotide position, we propose a novel approach (kmerPR2vec) which is a fusion of positional information of k-mers with randomly initialized neural k-mer embeddings. In contrast to existing k-mer-based representation, the proposed kmerPR2vec representation is much more rich in terms of semantic information and has more discriminative power. Using novel kmerPR2vec representation, we further present an end-to-end system (MirLocPredictor) which couples the discriminative power of kmerPR2vec with Convolutional Neural Networks (CNNs) for miRNA subcellular location prediction. The effectiveness of the proposed kmerPR2vec approach is evaluated with deep learning-based topologies (i.e., Convolutional Neural Networks (CNN) and Recurrent Neural Network (RNN)) and by using 9 different evaluation measures. Analysis of the results reveals that MirLocPredictor outperforms state-of-the-art methods with a significant margin of 18% and 19% in terms of precision and recall.
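
As a rough illustration of combining k-mer position with randomly initialized k-mer vectors, the sketch below may help. It is not kmerPR2vec itself: the abstract does not say how the fusion is performed, so appending a relative position in [0, 1] to each vector is purely an assumption, as are the function name `positional_kmer_features` and the parameters `dim` and `seed`.

```python
import random


def positional_kmer_features(seq, k, dim=4, seed=0):
    """Return one feature vector per k-mer occurrence in seq.

    Each vector is a randomly initialized k-mer vector (shared by all
    occurrences of the same k-mer) with the occurrence's relative
    position appended. Concatenation is an assumed fusion scheme.
    """
    rng = random.Random(seed)
    table = {}                      # k-mer -> random vector
    n = len(seq) - k + 1            # number of k-mer occurrences
    features = []
    for i in range(n):
        km = seq[i:i + k]
        if km not in table:
            table[km] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        rel_pos = i / (n - 1) if n > 1 else 0.0
        features.append(table[km] + [rel_pos])
    return features


# "ACG" occurs at the first and last positions of this toy sequence.
feats = positional_kmer_features("ACGACG", k=3)
```

In this toy example the two occurrences of "ACG" share the same random vector but differ in their final (position) component, which is the kind of distinction a purely k-mer-based representation cannot make.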

Sengupta DC, Hill MD, Benton KR, Banerjee HN. Similarity Studies of Corona Viruses through Chaos Game Representation. COMPUTATIONAL MOLECULAR BIOSCIENCE 2020;10:61-72. [PMID: 32953249 PMCID: PMC7497811 DOI: 10.4236/cmb.2020.103004] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]

Nugent CM, Adamowicz SJ. Alignment-free classification of COI DNA barcode data with the Python package Alfie. METABARCODING AND METAGENOMICS 2020. [DOI: 10.3897/mbmg.4.55815] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open

Randhawa GS, Soltysiak MPM, El Roz H, de Souza CPE, Hill KA, Kari L. Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study. PLoS One 2020;15:e0232391. [PMID: 32330208 PMCID: PMC7182198 DOI: 10.1371/journal.pone.0232391] [Citation(s) in RCA: 195] [Impact Index Per Article: 48.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2020] [Accepted: 04/14/2020] [Indexed: 12/24/2022] Open

Phylogenetic Analysis of HIV-1 Genomes Based on the Position-Weighted K-mers Method. ENTROPY 2020;22:e22020255. [PMID: 33286029 PMCID: PMC7516702 DOI: 10.3390/e22020255] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/17/2020] [Revised: 02/07/2020] [Accepted: 02/20/2020] [Indexed: 12/31/2022]

Borrayo E, May-Canche I, Paredes O, Morales JA, Romo-Vázquez R, Vélez-Pérez H. Whole-Genome k-mer Topic Modeling Associates Bacterial Families. Genes (Basel) 2020;11:genes11020197. [PMID: 32075081 PMCID: PMC7074292 DOI: 10.3390/genes11020197] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2020] [Revised: 02/07/2020] [Accepted: 02/09/2020] [Indexed: 11/16/2022] Open

Dlamini GS, Muller SJ, Meraba RL, Young RA, Mashiyane J, Chiwewe T, Mapiye DS. Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach. IEEE ACCESS : PRACTICAL INNOVATIONS, OPEN SOLUTIONS 2020;8:195263-195273. [PMID: 34976561 PMCID: PMC8675546 DOI: 10.1109/access.2020.3031387] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/02/2020] [Accepted: 10/04/2020] [Indexed: 05/08/2023]

Randhawa GS, Hill KA, Kari L. MLDSP-GUI: an alignment-free standalone tool with an interactive graphical user interface for DNA sequence comparison and analysis. Bioinformatics 2019;36:2258-2259. [DOI: 10.1093/bioinformatics/btz918] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2019] [Revised: 11/22/2019] [Accepted: 12/11/2019] [Indexed: 11/14/2022] Open

He L, Dong R, He RL, Yau SST. A novel alignment-free method for HIV-1 subtype classification. INFECTION GENETICS AND EVOLUTION 2019;77:104080. [PMID: 31683009 DOI: 10.1016/j.meegid.2019.104080] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Subscribe] [Scholar Register] [Received: 07/03/2019] [Revised: 10/08/2019] [Accepted: 10/20/2019] [Indexed: 11/16/2022]

Forsdyke DR. Success of alignment-free oligonucleotide (k-mer) analysis confirms relative importance of genomes not genes in speciation and phylogeny. Biol J Linn Soc Lond 2019. [DOI: 10.1093/biolinnean/blz096] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]

Randhawa GS, Hill KA, Kari L. ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels. BMC Genomics 2019;20:267. [PMID: 30943897 PMCID: PMC6448311 DOI: 10.1186/s12864-019-5571-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2018] [Accepted: 02/27/2019] [Indexed: 11/11/2022] Open

Abstract

Background

Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods.

Results

We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%.

A quantitative comparison with state-of-the-art classification software tools is performed, on two small benchmark datasets and one large 4322 vertebrate mtDNA genomes dataset. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy, and is overall faster.

We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4,710 bacterial genomes into phyla with 95.5% accuracy.

Lastly, our analysis shows that the “Purine/Pyrimidine”, “Just-A” and “Real” numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes.

Conclusions

Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
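
The general idea of pairing a numerical DNA representation with Digital Signal Processing can be sketched as follows. This is not the ML-DSP implementation: the numeric values assigned to each nucleotide below are assumptions chosen for illustration (purines mapped to -1 and pyrimidines to +1; A mapped to 1 and all other bases to 0), and the helper names `to_signal` and `magnitude_spectrum` are hypothetical. The sketch converts a sequence into a discrete signal and computes the magnitude of its discrete Fourier transform using only the standard library.

```python
import cmath

# Assumed numeric encodings, for illustration only.
PURINE_PYRIMIDINE = {"A": -1.0, "G": -1.0, "C": 1.0, "T": 1.0}
JUST_A = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}


def to_signal(seq, mapping):
    """Convert a DNA string into a list of numbers via the given mapping.

    Raises KeyError for symbols outside A, C, G, T.
    """
    return [mapping[base] for base in seq.upper()]


def magnitude_spectrum(signal):
    """Return |DFT(signal)| computed directly from the definition (O(n^2))."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                for t, x in enumerate(signal)))
        for f in range(n)
    ]


# Toy sequence: alternating purine/pyrimidine gives a single spectral peak.
signal = to_signal("ACGT", PURINE_PYRIMIDINE)
spectrum = magnitude_spectrum(signal)
```

For real genomes a fast Fourier transform would be preferable to this direct computation, and sequences of different lengths would need to be brought to a common length before their spectra are compared or passed to a classifier; the abstract does not describe how the tool handles either step.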
