1
Li X, Zhou T, Feng X, Yau ST, Yau SST. Exploring geometry of genome space via Grassmann manifolds. Innovation (N Y) 2024; 5:100677. [PMID: 39206218 PMCID: PMC11350263 DOI: 10.1016/j.xinn.2024.100677] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Accepted: 07/18/2024] [Indexed: 09/04/2024] Open
Abstract
It is important to understand the geometry of genome space in biology. After transforming genome sequences into frequency matrices of the chaos game representation (FCGR), we regard a genome sequence as a point in a suitable Grassmann manifold by analyzing the column space of the corresponding FCGR. To assess the sequence similarity, we employ the generalized Grassmannian distance, an intrinsic geometric distance that differs from the traditional Euclidean distance used in the classical k-mer frequency-based methods. With this method, we constructed phylogenetic trees for various genome datasets, including influenza A virus hemagglutinin gene, Orthocoronavirinae genome, and SARS-CoV-2 complete genome sequences. Our comparative analysis with multiple sequence alignment and alignment-free methods for large-scale sequences revealed that our method, which employs the subspace distance between the column spaces of different FCGRs (FCGR-SD), outperformed its competitors in terms of both speed and accuracy. In addition, we used low-dimensional visualization of the SARS-CoV-2 genome sequences and spike protein nucleotide sequences with our methods, resulting in some intriguing findings. We not only propose a novel and efficient algorithm for comparing genome sequences but also demonstrate that genome data have some intrinsic manifold structures, providing a new geometric perspective for molecular biology studies.
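For orientation, the classical geodesic distance on a Grassmann manifold can be written in terms of principal angles. The block below is only a sketch for two subspaces of equal dimension k; the symbols A, B, \theta_i, and d are notation introduced here, and the generalized distance that FCGR-SD applies to column spaces of possibly different dimensions is defined in the paper itself, not in this abstract.

```latex
% A, B: n x k matrices whose orthonormal columns span the two subspaces (assumed notation)
A^{\top} B = U \,\operatorname{diag}(\cos\theta_1, \dots, \cos\theta_k)\, V^{\top},
\qquad 0 \le \theta_1 \le \dots \le \theta_k \le \tfrac{\pi}{2},
\\
d(\operatorname{span} A, \operatorname{span} B) = \Bigl( \sum_{i=1}^{k} \theta_i^{2} \Bigr)^{1/2}.
```

Unlike a Euclidean distance between k-mer frequency vectors, this quantity depends only on the subspaces, not on the particular bases chosen for them.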
Affiliation(s)
- Xiaoguang Li
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
- Tao Zhou
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Xingdong Feng
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
- Shing-Tung Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
  - Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China
- Stephen S.-T. Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
  - Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China
2
Ha AD, Aylward FO. Automated classification of giant virus genomes using a random forest model built on trademark protein families. bioRxiv: the preprint server for biology 2023:2023.11.10.566645. [PMID: 38014039 PMCID: PMC10680617 DOI: 10.1101/2023.11.10.566645] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Viruses of the phylum Nucleocytoviricota, often referred to as "giant viruses," are prevalent in various environments around the globe and play significant roles in shaping eukaryotic diversity and activities in global ecosystems. Given the extensive phylogenetic diversity within this viral group and the highly complex composition of their genomes, taxonomic classification of giant viruses, particularly incomplete metagenome-assembled genomes (MAGs), can present a considerable challenge. Here we developed TIGTOG (Taxonomic Information of Giant viruses using Trademark Orthologous Groups), a machine learning-based approach to predict the taxonomic classification of novel giant virus MAGs based on profiles of protein family content. We applied a random forest algorithm to a training set of 1,531 quality-checked, phylogenetically diverse Nucleocytoviricota genomes using pre-selected sets of giant virus orthologous groups (GVOGs). The classification models were predictive of viral taxonomic assignments with a cross-validation accuracy of 99.6% to the order level and 97.3% to the family level. We found that no individual GVOGs or genome features significantly influenced the algorithm's performance or the models' predictions, indicating that classification predictions were based on a comprehensive genomic signature, which reduced the necessity of a fixed set of marker genes for taxonomic assigning purposes. Our classification models were validated with an independent test set of 823 giant virus genomes with varied genomic completeness and taxonomy and demonstrated an accuracy of 98.6% and 95.9% to the order and family level, respectively. Our results indicate that protein family profiles can be used to accurately classify large DNA viruses at different taxonomic levels and provide a fast and accurate method for the classification of giant viruses. This approach could easily be adapted to other viral groups.
3
de la Fuente R, Díaz-Villanueva W, Arnau V, Moya A. Genomic Signature in Evolutionary Biology: A Review. Biology 2023; 12:322. [PMID: 36829597 PMCID: PMC9953303 DOI: 10.3390/biology12020322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Revised: 02/11/2023] [Accepted: 02/13/2023] [Indexed: 02/19/2023]
Abstract
Organisms are unique physical entities in which information is stored and continuously processed. The digital nature of DNA sequences enables the construction of a dynamic information reservoir. However, the distinction between the hardware and software components in the information flow is crucial to identify the mechanisms generating specific genomic signatures. In this work, we perform a bibliometric analysis to identify the different purposes of looking for particular patterns in DNA sequences associated with a given phenotype. This study has enabled us to make a conceptual breakdown of the genomic signature and differentiate the leading applications. On the one hand, it refers to gene expression profiling associated with a biological function, which may be shared across taxa. This signature is the focus of study in precision medicine. On the other hand, it also refers to characteristic patterns in species-specific DNA sequences. This interpretation plays a key role in comparative genomics, identifying evolutionary relationships. Looking at the relevant studies in our bibliographic database, we highlight the main factors causing heterogeneities in genome composition and how they can be quantified. All these findings lead us to reformulate some questions relevant to evolutionary biology.
Affiliation(s)
- Rebeca de la Fuente
  - Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
  - Correspondence:
- Wladimiro Díaz-Villanueva
  - Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Vicente Arnau
  - Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Andrés Moya
  - Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
  - Foundation for the Promotion of Sanitary and Biomedical Research of the Valencian Community (FISABIO), 46020 Valencia, Spain
  - CIBER in Epidemiology and Public Health (CIBEResp), 28029 Madrid, Spain
4
Millán Arias P, Alipour F, Hill KA, Kari L. DeLUCS: Deep learning for unsupervised clustering of DNA sequences. PLoS One 2022; 17:e0261531. [PMID: 35061715 PMCID: PMC8782307 DOI: 10.1371/journal.pone.0261531] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 05/03/2021] [Accepted: 12/06/2021] [Indexed: 11/25/2022] Open
Abstract
We present a novel Deep Learning method for the Unsupervised Clustering of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Frequency Chaos Game Representations (FCGR) of primary DNA sequences, and generates "mimic" sequence FCGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster assignment for each sequence. The clusters learned by DeLUCS match true taxonomic groups for large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genera; 3,200 randomly selected 400 kbp-long bacterial genome segments, into clusters corresponding to bacterial families; three viral genome and gene datasets, averaging 1,300 sequences each, into clusters corresponding to virus subtypes. DeLUCS significantly outperforms two classic clustering methods (K-means++ and Gaussian Mixture Models) for unlabelled data, by as much as 47%. DeLUCS is highly effective: it is able to cluster datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence clustering for previously intractable datasets.
Affiliation(s)
- Pablo Millán Arias
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Fatemeh Alipour
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Kathleen A. Hill
  - Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
5
Hatsuda Y, Maki S, Ishizaka T, Omotani S, Koizumi N, Yasui Y, Saito T, Myotoku M, Okada A, Imaizumi T. Visualization of cross-resistance between antimicrobial agents by asymmetric multidimensional scaling. J Clin Pharm Ther 2021; 47:345-359. [PMID: 34818683 PMCID: PMC9298725 DOI: 10.1111/jcpt.13564] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2021] [Revised: 10/09/2021] [Accepted: 10/25/2021] [Indexed: 12/01/2022]
Abstract
What is known and objective: In our previous studies, we developed a cross-resistance rate (CRR) correlation diagram (CRR diagram) that visually captures the magnitude of CRRs between antimicrobials using scatter plots. We used asymmetric multidimensional scaling (MDS) to transform cross-resistance similarities between antimicrobials into a 2-dimensional map and attempted to visually express them. We also explored the antibiograms of Pseudomonas aeruginosa before and after the transfer to newly built hospitals, and we determined by the CRR diagram that the CRRs among β-lactam antimicrobials other than carbapenems decreased substantially with the facility transfer. The present study tests whether the analysis of CRRs by asymmetric MDS can be used as new visual information that is easy for healthcare professionals to understand. Method: We tested the impact of changes in the nosocomial environment due to institutional transfers on CRRs among antimicrobials in asymmetric MDS, as well as contrasted the asymmetric MDS map and CRR diagram. Results and Discussion: In the asymmetric MDS map, antimicrobial groups with the same mechanism of action were displayed close together, and antimicrobial groups with different mechanisms of action were displayed separately. The asymmetric MDS map drawn solely for antimicrobials belonging to the group with the same mechanism of action showed similarities to the CRR diagram. Also, the distance of each antimicrobial to other antimicrobials shown in the asymmetric MDS map was negatively correlated with the CRRs for them against that antimicrobial. What is new and conclusion: The asymmetric MDS map expresses the dissimilarity as distances between agents, and there are no meanings or units on the ordinate and abscissa axes of the output map. In contrast, the CRR diagram expresses the antimicrobials' resistance status as values, such as resistance rate and CRR. By analysing the CRRs in the asymmetric MDS, it is feasible to visually recognize cross-resistance similarities between antimicrobial groups as distances. The use of the asymmetric MDS combined with the CRR diagram allows us to visually understand the resistance and cross-resistance status of each antimicrobial agent as a 2-dimensional map, as well as to understand the trends and characteristics of the data by means of quantitative values.
Affiliation(s)
- Syou Maki
  - Institute of Frontier Science and Technology, Okayama University of Science, Okayama, Japan
- Sachiko Omotani
  - Faculty of Pharmacy, Osaka Ohtani University, Osaka, Japan
  - Sakai City Medical Center, Osaka, Japan
- Tadashi Imaizumi
  - Faculty of Management and Information Sciences, Tama University, Tokyo, Japan
6
A new COVID-19 detection method from human genome sequences using CpG island features and KNN classifier. Engineering Science and Technology, an International Journal 2021; 24. [PMCID: PMC8064761 DOI: 10.1016/j.jestch.2020.12.026] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 05/03/2023]
Abstract
Various viral epidemics, such as the severe acute respiratory syndrome coronavirus and the Middle East respiratory syndrome coronavirus, have been detected in the last two decades. The coronavirus disease 2019 (COVID-19) is a pandemic caused by a novel betacoronavirus called severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2). After the rapid spread of COVID-19, many researchers quickly began investigating diagnosis and treatment for this terrifying disease. Distinguishing COVID-19 from the other types of coronaviruses is a difficult problem due to their genetic similarity. In this study, we propose a new efficient COVID-19 detection method based on the K-nearest neighbors (KNN) classifier using the complete genome sequences of human coronaviruses in the dataset recorded in the 2019 Novel Coronavirus Resource. We also describe two features based on CpG islands that efficiently detect COVID-19 cases. Thus, genome sequences including approximately 30,000 nucleotides can be represented by only two real numbers. The KNN method is a simple and effective non-parametric technique for solving classification problems. However, the performance of the KNN depends on the distance measure used. We evaluate 19 distance metrics, grouped into five categories, to improve the performance of the KNN algorithm. Several performance measures are computed to evaluate the proposed method. The proposed method achieves 98.4% precision, 99.2% recall, 98.8% F-measure, and 98.4% accuracy in a few seconds when any L1 type metric is used as a distance measure in the KNN.
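The idea of reducing a genome to two numbers and classifying it with a KNN under an L1 metric can be sketched as follows. The abstract does not define the two CpG island features, so the pair used here (GC fraction and the CpG observed/expected ratio) is an assumption chosen for illustration, and the function names `cpg_features`, `manhattan`, and `knn_predict` are hypothetical.

```python
from collections import Counter


def cpg_features(seq):
    """Return (GC fraction, CpG observed/expected ratio).

    These two features are illustrative assumptions, not the paper's definitions.
    """
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return (gc_fraction, obs_exp)


def manhattan(p, q):
    """L1 (city-block) distance between two feature tuples."""
    return sum(abs(a - b) for a, b in zip(p, q))


def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to `query` in L1 distance.

    `train` is a list of (feature_tuple, label) pairs.
    """
    nearest = sorted(train, key=lambda item: manhattan(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A typical call would build `train` from `cpg_features(genome)` for each labelled genome and then pass `cpg_features(new_genome)` as the query.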
7
Tang R, Yu Z, Ma Y, Wu Y, Phoebe Chen YP, Wong L, Li J. Genetic source completeness of HIV-1 circulating recombinant forms (CRFs) predicted by multi-label learning. Bioinformatics 2021; 37:750-758. [PMID: 33063094 DOI: 10.1093/bioinformatics/btaa887] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/27/2020] [Revised: 08/12/2020] [Accepted: 09/30/2020] [Indexed: 12/18/2022] Open
Abstract
MOTIVATION: Infection with strains of different subtypes and the subsequent crossover reading between the two strands of genomic RNAs by host cells' reverse transcriptase are the main causes of the vast HIV-1 sequence diversity. Such inter-subtype genomic recombinants can become circulating recombinant forms (CRFs) after widespread transmissions in a population. Complete prediction of all the subtype sources of a CRF strain is a complicated machine learning problem. It is also difficult to understand whether a strain is an emerging new subtype and if so, how to accurately identify the new components of the genetic source. RESULTS: We introduce a multi-label learning algorithm for the complete prediction of multiple sources of a CRF sequence as well as the prediction of its chronological number. The prediction is strengthened by a voting of various multi-label learning methods to avoid biased decisions. In our steps, frequency and position features of the sequences are both extracted to capture signature patterns of pure subtypes and CRFs. The method was applied to 7185 HIV-1 sequences, comprising 5530 pure subtype sequences and 1655 CRF sequences. Results have demonstrated that the method can achieve very high accuracy (reaching 99%) in the prediction of the complete set of labels of HIV-1 recombinant forms. A few wrong predictions are actually incomplete predictions, very close to the complete set of genuine labels. AVAILABILITY AND IMPLEMENTATION: https://github.com/Runbin-tang/The-source-of-HIV-CRFs-prediction. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Runbin Tang
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan 411105, China
  - Advanced Analytics Institute, University of Technology Sydney, Sydney, NSW 2007, Australia
- Zuguo Yu
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan 411105, China
  - School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, QLD 4001, Australia
- Yuanlin Ma
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan 411105, China
- Yaoqun Wu
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan 411105, China
- Yi-Ping Phoebe Chen
  - Department of Computer Science and Information Technology, La Trobe University, Melbourne, VIC 3086, Australia
- Limsoon Wong
  - School of Computing, National University of Singapore, Singapore 117417, Singapore
- Jinyan Li
  - Advanced Analytics Institute, University of Technology Sydney, Sydney, NSW 2007, Australia
8
Saha I, Ghosh N, Maity D, Seal A, Plewczynski D. COVID-DeepPredictor: Recurrent Neural Network to Predict SARS-CoV-2 and Other Pathogenic Viruses. Front Genet 2021; 12:569120. [PMID: 33643375 PMCID: PMC7906283 DOI: 10.3389/fgene.2021.569120] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/03/2020] [Accepted: 01/13/2021] [Indexed: 11/13/2022] Open
Abstract
The COVID-19 disease caused by the novel coronavirus (SARS-CoV-2) has turned out to be a global pandemic. The high transmission rate of this pathogenic virus demands early prediction and proper identification for subsequent treatment. However, the polymorphic nature of this virus allows it to adapt and persist in different kinds of environments, which makes it difficult to predict. There are also other pathogens, such as SARS-CoV-1, MERS-CoV, Ebola, Dengue, and Influenza, so a predictor is needed to distinguish them using their genomic information. To address this problem, this work proposes COVID-DeepPredictor, a deep learning framework that identifies an unknown sequence of these pathogens. COVID-DeepPredictor uses Long Short Term Memory as the Recurrent Neural Network for the underlying prediction with an alignment-free technique. In this regard, the k-mer technique is applied to create Bag-of-Descriptors (BoDs) in order to generate Bag-of-Unique-Descriptors (BoUDs) as vocabulary, and subsequently an embedded representation is prepared for the given virus sequences. This predictor is validated not only on the dataset using K-fold cross-validation but also on unseen test datasets of SARS-CoV-2 sequences and sequences from other viruses. To verify the efficacy of COVID-DeepPredictor, it has been compared with other state-of-the-art prediction techniques based on Linear Discriminant Analysis, Random Forests, and Gradient Boosting Method. COVID-DeepPredictor achieves 100% prediction accuracy on the validation dataset, while on the test datasets the accuracy ranges from 99.51 to 99.94%. It shows superior results over other prediction techniques as well. In addition, accuracy and runtime of COVID-DeepPredictor are considered simultaneously to determine the value of k in k-mer. A comparative study among k values in k-mer, Bag-of-Descriptors (BoDs), and Bag-of-Unique-Descriptors (BoUDs), and a comparison between COVID-DeepPredictor and Nucleotide BLAST, have also been performed. The code, training, and test datasets used for COVID-DeepPredictor are available at http://www.nitttrkol.ac.in/indrajit/projects/COVID-DeepPredictor/.
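The step from k-mers to a vocabulary and then to an integer sequence that an embedding layer could consume might look like the sketch below. It is a minimal illustration under stated assumptions: overlapping k-mers, a sorted vocabulary, and id 0 reserved for padding and unseen k-mers are choices made here, not details given in the abstract, and the names `descriptors`, `build_vocabulary`, and `encode` are hypothetical.

```python
def descriptors(seq, k):
    """Bag-of-Descriptors: all overlapping k-mers of one sequence (overlap is an assumption)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocabulary(sequences, k):
    """Bag-of-Unique-Descriptors: map each distinct k-mer to an integer id.

    Id 0 is reserved here for padding and unseen k-mers (an illustrative choice).
    """
    unique = sorted({d for s in sequences for d in descriptors(s, k)})
    return {d: i + 1 for i, d in enumerate(unique)}


def encode(seq, vocab, k, length):
    """Turn a sequence into a fixed-length list of vocabulary ids, truncating or zero-padding."""
    ids = [vocab.get(d, 0) for d in descriptors(seq, k)][:length]
    return ids + [0] * (length - len(ids))
```

The fixed-length id lists are the kind of input a recurrent network with an embedding layer typically expects; the network itself is outside the scope of this sketch.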
Affiliation(s)
- Indrajit Saha
  - Department of Computer Science and Engineering, National Institute of Technical Teachers' Training and Research, Kolkata, India
- Nimisha Ghosh
  - Department of Computer Science and Information Technology, Institute of Technical Education and Research, Siksha ‘O’ Anusandhan (Deemed to Be University), Bhubaneswar, India
- Debasree Maity
  - Department of Electronics and Communication Engineering, MCKV Institute of Engineering, Howrah, India
- Arjit Seal
  - Cognizant Technology Solutions Pvt. Ltd., Kolkata, India
- Dariusz Plewczynski
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
  - Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Warsaw, Poland
9
Abstract
K-mer based comparisons have emerged as powerful complements to BLAST-like alignment algorithms, particularly when the sequences being compared lack direct evolutionary relationships. In this chapter, we describe methods to compare k-mer content between groups of long noncoding RNAs (lncRNAs), to identify communities of lncRNAs with related k-mer contents, to identify the enrichment of protein-binding motifs in lncRNAs, and to scan for domains of related k-mer contents in lncRNAs. Our step-by-step instructions are complemented by Python code deposited in GitHub. Though our chapter focuses on lncRNAs, the methods we describe could be applied to any set of nucleic acid sequences.
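Comparing the k-mer content of two sequences can be illustrated with a short sketch: each sequence becomes a length-normalized k-mer profile, and two profiles are compared by Pearson correlation. This is not the authors' deposited code; the normalization, the choice of Pearson correlation, and the function names `kmer_profile` and `pearson` are assumptions made for illustration.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k):
    """Frequency of every possible k-mer over ACGT, normalized by the number of windows.

    RNA input is accepted by treating U as T; windows containing other symbols are skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper().replace("U", "T")
    windows = max(len(seq) - k + 1, 0)
    for i in range(windows):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    return [counts[km] / windows if windows else 0.0 for km in kmers]


def pearson(u, v):
    """Pearson correlation of two equal-length profiles; 0.0 if either is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0
```

Pairwise correlations computed this way form a similarity matrix, which is the usual starting point for grouping sequences into communities of related k-mer content.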
Affiliation(s)
- Jessime M Kirk
  - Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Invitae Corporation, San Francisco, CA, USA
- Daniel Sprague
  - Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Curriculum in Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Flagship Pioneering, Boston, MA, USA
- J Mauro Calabrese
  - Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Curriculum in Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
10
Sengupta DC, Hill MD, Benton KR, Banerjee HN. Similarity Studies of Corona Viruses through Chaos Game Representation. Computational Molecular Bioscience 2020; 10:61-72. [PMID: 32953249 PMCID: PMC7497811 DOI: 10.4236/cmb.2020.103004] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/20/2022]
Abstract
The novel coronavirus (SARS-CoV-2), generally referred to as the COVID-19 virus, has spread to 213 countries with nearly 7 million confirmed cases and nearly 400,000 deaths. Such major outbreaks demand classification and origin of the virus genomic sequence, for planning, containment, and treatment. Motivated by the above need, we report two alignment-free methods combined with CGR to perform clustering analysis and create a phylogenetic tree based on it. To each DNA sequence we associate a matrix, then define the distance between two DNA sequences to be the distance between their associated matrices. These methods are being used for phylogenetic analysis of coronavirus sequences. Our approach provides a powerful tool for analyzing and annotating genomes and their phylogenetic relationships. We also compare our tool to the ClustalX algorithm, which is one of the most popular alignment methods. Our alignment-free methods are shown to be capable of finding the closest genetic relatives of coronaviruses.
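Associating a matrix with each sequence and measuring the distance between matrices can be sketched with a frequency chaos game representation. The abstract does not say which matrix or which distance the authors use, so the corner layout, the k-mer count matrix, and the Frobenius (entrywise Euclidean) distance below are assumptions, and the names `fcgr`, `normalize`, and `matrix_distance` are hypothetical.

```python
from math import sqrt

# Corner bits (x, y) for each nucleotide; this layout is one common convention (assumed).
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k):
    """Return a 2^k x 2^k matrix of k-mer counts laid out as in a chaos game representation."""
    size = 2 ** k
    m = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(c not in CORNER for c in kmer):
            continue  # skip windows with ambiguous symbols such as N
        x = y = 0
        # The last nucleotide selects the coarsest quadrant, mirroring the chaos game iteration.
        for c in reversed(kmer):
            bx, by = CORNER[c]
            x = (x << 1) | bx
            y = (y << 1) | by
        m[y][x] += 1
    return m


def normalize(m):
    """Scale counts to frequencies so sequences of different lengths are comparable."""
    total = sum(sum(row) for row in m)
    return [[v / total for v in row] for row in m] if total else m


def matrix_distance(a, b):
    """Frobenius distance between two equally sized matrices."""
    return sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

Computing `matrix_distance` for every pair of sequences gives a distance matrix that a standard tree-building method (for example neighbor joining) could take as input.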
Affiliation(s)
- Dipendra C Sengupta
  - Department of Mathematics, Computer Science & Engineering Technology, Elizabeth City State University, Elizabeth City, North Carolina, USA
- Matthew D Hill
  - Department of Mathematics, Computer Science & Engineering Technology, Elizabeth City State University, Elizabeth City, North Carolina, USA
- Kevin R Benton
  - Department of Mathematics, Computer Science & Engineering Technology, Elizabeth City State University, Elizabeth City, North Carolina, USA
- Hirendra N Banerjee
  - Department Natural Sciences, Elizabeth City State University, Elizabeth City, North Carolina, USA
11
Randhawa GS, Soltysiak MPM, El Roz H, de Souza CPE, Hill KA, Kari L. Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study. PLoS One 2020; 15:e0232391. [PMID: 32330208 PMCID: PMC7182198 DOI: 10.1371/journal.pone.0232391] [Citation(s) in RCA: 195] [Impact Index Per Article: 48.8] [Received: 02/20/2020] [Accepted: 04/14/2020] [Indexed: 12/24/2022] Open
Abstract
The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such major viral outbreaks demand early elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 virus genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 virus genomes. The proposed method combines supervised machine learning with digital signal processing (MLDSP) for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman's rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp, including the 29 COVID-19 virus sequences available on January 27, 2020. Our results support a hypothesis of a bat origin and classify the COVID-19 virus as Sarbecovirus, within Betacoronavirus. Our method achieves 100% accurate classification of the COVID-19 virus sequences, and discovers the most relevant relationships among over 5000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
Affiliation(s)
- Gurjit S. Randhawa
  - Department of Computer Science, The University of Western Ontario, London, ON, Canada
- Hadi El Roz
  - Department of Biology, The University of Western Ontario, London, ON, Canada
- Camila P. E. de Souza
  - Department of Statistical and Actuarial Sciences, The University of Western Ontario, London, ON, Canada
- Kathleen A. Hill
  - Department of Biology, The University of Western Ontario, London, ON, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
12
Randhawa GS, Hill KA, Kari L. MLDSP-GUI: an alignment-free standalone tool with an interactive graphical user interface for DNA sequence comparison and analysis. Bioinformatics 2019; 36:2258-2259. [DOI: 10.1093/bioinformatics/btz918] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 09/02/2019] [Revised: 11/22/2019] [Accepted: 12/11/2019] [Indexed: 11/14/2022] Open
Abstract
Summary
Machine Learning with Digital Signal Processing and Graphical User Interface (MLDSP-GUI) is an open-source, alignment-free, ultrafast, computationally lightweight, and standalone software tool with an interactive GUI for comparison and analysis of DNA sequences. MLDSP-GUI is a general-purpose tool that can be used for a variety of applications such as taxonomic classification, disease classification, virus subtype classification, evolutionary analyses, among others.
Availability and implementation
MLDSP-GUI is open-source, cross-platform compatible, and is available under the terms of the Creative Commons Attribution 4.0 International license (http://creativecommons.org/licenses/by/4.0/). The executable and dataset files are available at https://sourceforge.net/projects/mldsp-gui/.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gurjit S Randhawa
  - Department of Computer Science, University of Western Ontario, London, ON N6A 5B7, Canada
- Kathleen A Hill
  - Department of Biology, University of Western Ontario, London, ON N6A 5B7, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada
13
Smith KN, Miller SC, Varani G, Calabrese JM, Magnuson T. Multimodal Long Noncoding RNA Interaction Networks: Control Panels for Cell Fate Specification. Genetics 2019; 213:1093-1110. [PMID: 31796550 PMCID: PMC6893379 DOI: 10.1534/genetics.119.302661] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 09/11/2019] [Accepted: 10/03/2019] [Indexed: 12/20/2022] Open
Abstract
Lineage specification in early development is the basis for the exquisitely precise body plan of multicellular organisms. It is therefore critical to understand cell fate decisions in early development. Moreover, for regenerative medicine, the accurate specification of cell types to replace damaged/diseased tissue is strongly dependent on identifying determinants of cell identity. Long noncoding RNAs (lncRNAs) have been shown to regulate cellular plasticity, including pluripotency establishment and maintenance, differentiation and development, yet broad phenotypic analysis and the mechanistic basis of their function remain lacking. As components of molecular condensates, lncRNAs interact with almost all classes of cellular biomolecules, including proteins, DNA, mRNAs, and microRNAs. With functions ranging from controlling alternative splicing of mRNAs to providing scaffolding upon which chromatin modifiers are assembled, it is clear that at least a subset of lncRNAs are far from the transcriptional noise they were once deemed. This review highlights the diversity of lncRNA interactions in the context of cell fate specification, and provides examples of each type of interaction in relevant developmental contexts. Also highlighted are experimental and computational approaches to study lncRNAs.
Affiliation(s)
- Keriayn N Smith
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina 27599
- Sarah C Miller
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina 27599
- Gabriele Varani
- Department of Chemistry, University of Washington, Seattle, Washington 98195
- J Mauro Calabrese
- Department of Pharmacology, University of North Carolina, Chapel Hill, North Carolina 27599
- Terry Magnuson
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina 27599
|
14
|
Randhawa GS, Hill KA, Kari L. ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels. BMC Genomics 2019; 20:267. [PMID: 30943897 PMCID: PMC6448311 DOI: 10.1186/s12864-019-5571-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 09/04/2018] [Accepted: 02/27/2019] [Indexed: 11/11/2022] Open
Abstract
Background
Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods.
Results
We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed on two small benchmark datasets and one large dataset of 4322 vertebrate mtDNA genomes. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy, and is overall faster. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the “Purine/Pyrimidine”, “Just-A” and “Real” numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes.
Conclusions
Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
Affiliation(s)
- Gurjit S Randhawa
- Department of Computer Science, University of Western Ontario, London, ON, Canada
- Kathleen A Hill
- Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
|
15
|
Yin C. Encoding and Decoding DNA Sequences by Integer Chaos Game Representation. J Comput Biol 2018; 26:143-151. [PMID: 30517021 DOI: 10.1089/cmb.2018.0173] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 11/12/2022] Open
Abstract
DNA sequences are fundamental for encoding genetic information. The genetic information may be understood not only from symbolic sequences but also from the hidden signals inside the sequences. The symbolic sequences need to be transformed into numerical sequences so the hidden signals can be revealed by signal processing techniques. All current transformation methods encode DNA sequences into numerical values of the same length. These representations have limitations in the applications of genomic signal compression, encryption, and steganography. We propose a novel integer chaos game representation (inter-CGR or iCGR) of DNA sequences and a lossless encoding method for DNA sequences based on the iCGR. In the iCGR method, a DNA sequence is represented by the iterated function of the nucleotides and their positions in the sequence. Then the DNA sequence can be uniquely encoded and recovered using three integers from iCGR. One integer is the sequence length and the other two integers represent the accumulated distributions of nucleotides in the sequence. The integer encoding scheme can compress a DNA sequence by 2 bits per nucleotide. The integer representation of DNA sequences provides a prospective tool for sequence analysis and operations.
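The idea of recovering a sequence from three integers can be illustrated with a small sketch. This is not taken from the paper: the corner assignment in `CORNERS` and the power-of-two weighting are assumptions chosen so that the encoding is provably invertible, and the function names are hypothetical.

```python
# Assumed corner assignment (illustrative only; the paper may use another one).
CORNERS = {"A": (1, 1), "T": (-1, 1), "C": (-1, -1), "G": (1, -1)}
BASE_AT = {corner: base for base, corner in CORNERS.items()}


def encode(seq):
    """Map a DNA string to (length, x, y) using position-weighted corner sums."""
    x = y = 0
    for i, base in enumerate(seq):
        ax, ay = CORNERS[base]
        # Position i contributes +/- 2**i to each coordinate.
        x += ax * (1 << i)
        y += ay * (1 << i)
    return len(seq), x, y


def decode(n, x, y):
    """Invert encode(): peel off the highest-weight position first."""
    bases = []
    for i in range(n - 1, -1, -1):
        # All lower positions sum to at most 2**i - 1 in absolute value,
        # so the sign of each coordinate reveals the corner used at position i.
        ax = 1 if x > 0 else -1
        ay = 1 if y > 0 else -1
        bases.append(BASE_AT[(ax, ay)])
        x -= ax * (1 << i)
        y -= ay * (1 << i)
    return "".join(reversed(bases))


if __name__ == "__main__":
    triple = encode("ACGTTGCA")
    print(triple, decode(*triple))
```

Python integers are arbitrary precision, so the sketch works for long sequences, although the two coordinates grow by one bit each per nucleotide.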
Affiliation(s)
- Changchuan Yin
- Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago, Chicago, Illinois
|
16
|
An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes. PLoS One 2018; 13:e0206409. [PMID: 30427878 PMCID: PMC6235296 DOI: 10.1371/journal.pone.0206409] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 07/24/2018] [Accepted: 10/14/2018] [Indexed: 01/11/2023] Open
Abstract
For many disease-causing virus species, global diversity is clustered into a taxonomy of subtypes with clinical significance. In particular, the classification of infections among the subtypes of human immunodeficiency virus type 1 (HIV-1) is a routine component of clinical management, and there are now many classification algorithms available for this purpose. Although several of these algorithms are similar in accuracy and speed, the majority are proprietary and require laboratories to transmit HIV-1 sequence data over the network to remote servers. This potentially exposes sensitive patient data to unauthorized access, and makes it impossible to determine how classifications are made and to maintain the data provenance of clinical bioinformatic workflows. We propose an open-source supervised and alignment-free subtyping method (Kameris) that operates on k-mer frequencies in HIV-1 sequences. We performed a detailed study of the accuracy and performance of subtype classification in comparison to four state-of-the-art programs. Based on our testing data set of manually curated real-world HIV-1 sequences (n = 2,784), Kameris obtained an overall accuracy of 97%, which matches or exceeds all other tested software, with a processing rate of over 1,500 sequences per second. Furthermore, our fully standalone general-purpose software provides key advantages in terms of data security and privacy, transparency and reproducibility. Finally, we show that our method is readily adaptable to subtype classification of other viruses including dengue, influenza A, and hepatitis B and C virus.
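As a rough illustration of what "operating on k-mer frequencies" can mean, the sketch below turns a sequence into a normalized k-mer frequency vector. It covers only the feature-extraction step, not the supervised classifier, and it is not Kameris code: the function name, the default `k`, and the choice to skip k-mers containing ambiguous bases are all assumptions.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return the k-mer frequency vector of seq over ACGT, in lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made of A, C, G, T are counted (assumption: ambiguous bases are skipped).
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


if __name__ == "__main__":
    vec = kmer_frequencies("ACGTACGT", k=2)
    print(len(vec), round(sum(vec), 6))
```

Vectors of this kind have a fixed length of 4^k regardless of sequence length, which is what lets sequences be compared or fed to a classifier without alignment.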
|
17
|
Nagahashi M, Shimada Y, Ichikawa H, Nakagawa S, Sato N, Kaneko K, Homma K, Kawasaki T, Kodama K, Lyle S, Takabe K, Wakai T. Formalin-fixed paraffin-embedded sample conditions for deep next generation sequencing. J Surg Res 2017; 220:125-132. [PMID: 29180174 PMCID: PMC5726294 DOI: 10.1016/j.jss.2017.06.077] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 02/27/2017] [Revised: 06/26/2017] [Accepted: 06/28/2017] [Indexed: 12/16/2022]
Abstract
Introduction
Precision medicine is only possible in oncology practice if targetable genes in fragmented DNA, such as DNA from formalin-fixed paraffin-embedded (FFPE) samples, can be sequenced using next generation sequencing (NGS). The aim of this study was to examine the quality and quantity of DNA from FFPE cancerous tissue samples from surgically resected and biopsy specimens.
Methods
DNA was extracted from unstained FFPE tissue sections prepared from surgically resected specimens of breast, colorectal and gastric cancer, and biopsy specimens of breast cancer. A total quantity of DNA ≥60 ng from a sample was considered adequate for NGS. The DNA quality was assessed by Q-ratios, with a Q-ratio >0.1 considered sufficient for NGS.
Results
The Q-ratio for DNA from FFPE tissue processed with neutral-buffered formalin was significantly better than that processed with unbuffered formalin. All Q-ratios for DNA from breast, colorectal and gastric cancer samples indicated DNA levels sufficient for NGS. DNA extracted from gastric cancer FFPE samples prepared within the last 7 years is suitable for NGS analysis, whereas those older than 7 years may not be suitable. Our data suggested that adequate amounts of DNA can be extracted from FFPE samples, not only of surgically resected tissue but also of biopsy specimens.
Conclusions
The type of formalin used for fixation and the time since FFPE sample preparation affect DNA quality. Sufficient amounts of DNA can be extracted from FFPE samples of both surgically resected and biopsy tissue, thus expanding the potential diagnostic uses of NGS in a clinical setting.
Affiliation(s)
- Masayuki Nagahashi
- Division of Digestive and General Surgery, Niigata University Graduate School of Medical and Dental Sciences, Niigata City, Niigata, Japan
- Yoshifumi Shimada
- Division of Digestive and General Surgery, Niigata University Graduate School of Medical and Dental Sciences, Niigata City, Niigata, Japan
- Hiroshi Ichikawa
- Division of Digestive and General Surgery, Niigata University Graduate School of Medical and Dental Sciences, Niigata City, Niigata, Japan
- Satoru Nakagawa
- Department of Surgery, Niigata Cancer Center Hospital, Niigata City, Niigata, Japan
- Nobuaki Sato
- Department of Breast Oncology, Niigata Cancer Center Hospital, Niigata City, Niigata, Japan
- Koji Kaneko
- Department of Breast Oncology, Niigata Cancer Center Hospital, Niigata City, Niigata, Japan
- Keiichi Homma
- Department of Pathology, Niigata Cancer Center Hospital, Niigata City, Niigata, Japan
- Takashi Kawasaki
- Department of Pathology, Niigata Cancer Center Hospital, Niigata City, Niigata, Japan
- Keisuke Kodama
- Diagnostics Research Department, Life innovation Research Institute, Denka innovation Center, Denka Co., Ltd, Machida City, Tokyo, Japan
- Stephen Lyle
- Department of Molecular, Cell and Cancer Biology, University of Massachusetts Medical School, Boston, Massachusetts; KEW, Inc, Boston, Massachusetts
- Kazuaki Takabe
- Breast Surgery, Department of Surgical Oncology, Roswell Park Cancer Institute, Buffalo, New York; Department of Surgery, University at Buffalo Jacobs School of Medicine and Biomedical Sciences, The State University of New York, Buffalo, New York
- Toshifumi Wakai
- Division of Digestive and General Surgery, Niigata University Graduate School of Medical and Dental Sciences, Niigata City, Niigata, Japan
|
18
|
Karamichalis R, Kari L, Konstantinidis S, Kopecki S, Solis-Reyes S. Additive methods for genomic signatures. BMC Bioinformatics 2016; 17:313. [PMID: 27549194 PMCID: PMC4994249 DOI: 10.1186/s12859-016-1157-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 05/13/2016] [Accepted: 07/19/2016] [Indexed: 01/09/2023] Open
Abstract
Background
Studies exploring the potential of Chaos Game Representations (CGR) of genomic sequences to act as “genomic signatures” (to be species- and genome-specific) showed that CGR patterns of nuclear and organellar DNA sequences of the same organism can be very different. While the hypothesis that CGRs of mitochondrial DNA sequences can act as genomic signatures was validated for a snapshot of all sequenced mitochondrial genomes available in the NCBI GenBank sequence database, to our knowledge no such extensive analysis of CGRs of nuclear DNA sequences exists to date.
Results
We analyzed an extensive dataset, totalling 1.45 gigabase pairs, of nuclear/nucleoid genomic sequences (nDNA) from 42 different organisms, spanning all major kingdoms of life. Our computational experiments indicate that CGR signatures of nDNA of two different origins cannot always be differentiated, especially if they originate from closely-related species such as H. sapiens and P. troglodytes or E. coli and E. fergusonii. To address this issue, we propose the general concept of additive DNA signature of a set (collection) of DNA sequences. One particular instance, the composite DNA signature, combines information from nDNA fragments and organellar (mitochondrial, chloroplast, or plasmid) genomes. We demonstrate that, in this dataset, composite DNA signatures originating from two different organisms can be differentiated in all cases, including those where the use of CGR signatures of nDNA failed or was inconclusive. Another instance, the assembled DNA signature, combines information from many short DNA subfragments (e.g., 100 basepairs) of a given DNA fragment, to produce its signature. We show that an assembled DNA signature has the same distinguishing power as a conventionally computed CGR signature, while using shorter contiguous sequences and potentially less sequence information.
Conclusions
Our results suggest that, while CGR signatures of nDNA cannot always play the role of genomic signatures, composite and assembled DNA signatures (separately or in combination) could potentially be used instead. Such additive signatures could be used, e.g., with raw unassembled next-generation sequencing (NGS) read data, when high-quality sequencing data is not available, or to complement information obtained by other methods of species identification or classification.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-016-1157-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Rallis Karamichalis
- Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada
- Lila Kari
- School of Computing Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada; Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada
- Stavros Konstantinidis
- Department of Mathematics and Computing Science, Saint Mary's University, Halifax NS, Canada
- Steffen Kopecki
- Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada; Department of Mathematics and Computing Science, Saint Mary's University, Halifax NS, Canada
- Stephen Solis-Reyes
- Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada
|
19
|
Karamichalis R, Kari L, Konstantinidis S, Kopecki S. An investigation into inter- and intragenomic variations of graphic genomic signatures. BMC Bioinformatics 2015; 16:246. [PMID: 26249837 PMCID: PMC4527362 DOI: 10.1186/s12859-015-0655-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 12/19/2014] [Accepted: 06/30/2015] [Indexed: 11/30/2022] Open
Abstract
Background
Motivated by the general need to identify and classify species based on molecular evidence, genome comparisons have been proposed that are based on measuring mostly Euclidean distances between Chaos Game Representation (CGR) patterns of genomic DNA sequences.
Results
We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli (Bacteria - full genome), and P. furiosus (Archaea - full genome). To maximize the diversity within each species, we also analyze the interrelationships within a set of over five hundred 150,000 bp genomic sequences sampled from the entire aforementioned genomes. Lastly, we provide some preliminary evidence of this method’s ability to classify genomic DNA sequences at lower taxonomic levels by comparing sequences sampled from the entire genome of H. sapiens (class Mammalia, order Primates) and of M. musculus (class Mammalia, order Rodentia), for a total length of approximately 174 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps, which visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships.
Conclusion
Our analysis confirms, for this dataset, that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our assessment of the performance of the six distances analyzed uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies.
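To make "pairwise distances between CGRs" concrete, here is a minimal sketch that builds a frequency CGR grid for a sequence and compares two grids with the Euclidean distance, the baseline distance the abstract refers to. It is not the authors' code: the corner layout in `CORNER_BITS`, the resolution parameter `k`, the normalization, and the function names are assumptions, and the other five distances studied in the paper are not shown.

```python
import math

# Assumed corner layout as (x bit, y bit): A and T on the bottom row, C and G on the top.
CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k=3):
    """Return a 2**k x 2**k grid of normalized k-mer counts arranged as a CGR."""
    size = 1 << k
    grid = [[0.0] * size for _ in range(size)]
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNER_BITS for base in kmer):
            continue  # skip k-mers containing ambiguous bases
        x = y = 0
        # The last base of the k-mer selects the coarsest quadrant,
        # so it supplies the most significant bit of each coordinate.
        for base in reversed(kmer):
            bx, by = CORNER_BITS[base]
            x = (x << 1) | bx
            y = (y << 1) | by
        grid[y][x] += 1
        total += 1
    if total:
        grid = [[cell / total for cell in row] for row in grid]
    return grid


def euclidean(a, b):
    """Euclidean distance between two grids of equal shape."""
    return math.sqrt(sum((p - q) ** 2
                         for row_a, row_b in zip(a, b)
                         for p, q in zip(row_a, row_b)))


if __name__ == "__main__":
    g1 = fcgr("ACGTACGTACGGTACC", k=2)
    g2 = fcgr("TTTTACGTTTTAAAAC", k=2)
    print(euclidean(g1, g2))
```

A full distance matrix over many sequences, computed this way, is the kind of input that a multidimensional scaling step would need to place sequences as points in two or three dimensions.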
Affiliation(s)
- Rallis Karamichalis
- Department of Computer Science, University of Western Ontario, London, ON, Canada
- Lila Kari
- Department of Computer Science, University of Western Ontario, London, ON, Canada
- Stavros Konstantinidis
- Department of Mathematics and Computing Science, Saint Mary's University, Halifax, NS, Canada
- Steffen Kopecki
- Department of Computer Science, University of Western Ontario, London, ON, Canada; Department of Mathematics and Computing Science, Saint Mary's University, Halifax, NS, Canada
|