1
Cahuantzi R, Lythgoe KA, Hall I, Pellis L, House T. Unsupervised identification of significant lineages of SARS-CoV-2 through scalable machine learning methods. Proc Natl Acad Sci U S A 2024; 121:e2317284121. [PMID: 38478692 PMCID: PMC10962941 DOI: 10.1073/pnas.2317284121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 02/05/2024] [Indexed: 03/21/2024] Open
Abstract
Since its emergence in late 2019, SARS-CoV-2 has diversified into a large number of lineages and caused multiple waves of infection globally. Novel lineages have the potential to spread rapidly and internationally if they have higher intrinsic transmissibility and/or can evade host immune responses, as has been seen with the Alpha, Delta, and Omicron variants of concern. They can also cause increased mortality and morbidity if they have increased virulence, as was seen for Alpha and Delta. Phylogenetic methods provide the "gold standard" for representing the global diversity of SARS-CoV-2 and for identifying newly emerging lineages. However, these methods are computationally expensive, struggle when datasets get too large, and require manual curation to designate new lineages. These challenges motivate the development of complementary methods that can incorporate all of the available genetic data without down-sampling, so as to extract meaningful information rapidly and with minimal curation. In this paper, we demonstrate the utility of algorithmic approaches based on word statistics to represent whole sequences, bringing speed, scalability, and interpretability to the construction of genetic topologies. While not a substitute for current phylogenetic analyses, the proposed methods can be used as a complementary, and fully automatable, approach to identify and confirm newly emerging variants.
Affiliation(s)
- Roberto Cahuantzi
  - Department of Mathematics, The University of Manchester, Manchester M13 9PL, United Kingdom
  - United Kingdom Health Security Agency, University of Oxford, Oxford OX3 7LF, United Kingdom
- Katrina A. Lythgoe
  - Department of Biology, University of Oxford, Oxford OX1 3SZ, United Kingdom
  - Big Data Institute, University of Oxford, Oxford OX3 7LF, United Kingdom
  - Pandemic Sciences Institute, University of Oxford, Oxford OX3 7LF, United Kingdom
- Ian Hall
  - Department of Mathematics, The University of Manchester, Manchester M13 9PL, United Kingdom
- Lorenzo Pellis
  - Department of Mathematics, The University of Manchester, Manchester M13 9PL, United Kingdom
- Thomas House
  - Department of Mathematics, The University of Manchester, Manchester M13 9PL, United Kingdom
2
Pal J, Ghosh S, Maji B, Bhattacharya DK. MMV method: a new approach to compare protein sequences under binary representation. J Biomol Struct Dyn 2024:1-7. [PMID: 38375605 DOI: 10.1080/07391102.2024.2317982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Accepted: 02/07/2024] [Indexed: 02/21/2024]
Abstract
In the present work, a new form of descriptor using the minimal moment vector (MMV) is introduced to compare protein sequences in the frequency domain under their component-wise binary representations. From every sequence, 20 different binary component sequences are formed, each corresponding to one of the 20 amino acids. Each such sequence is then transformed from the time domain to the frequency domain by applying the Fast Fourier Transform (FFT). Next, the power spectrum calculated from the FFT values for each component sequence is normalized so that the sum of its components equals 1. The descriptor is defined as a 20-component vector composed of the 20 second-order minimal moments calculated from the normalized spectra of the 20 component sequences. Once the descriptor is known, the distance matrix is created by applying the Euclidean distance measure. The phylogenetic tree is generated by applying the unweighted pair group method with arithmetic mean (UPGMA) algorithm using Molecular Evolutionary Genetics Analysis 11 (MEGA11) software. In this work, the datasets used for similarity studies are 9 NADH dehydrogenase 5 (ND5) proteins, 12 baculovirus proteins, 24 transferrin (TF) proteins, and 50 coronavirus spike proteins. A qualitative measure using rationalized perception is used to compare the effectiveness of the proposed method. A quantitative measure based on symmetric distance (SD) is used to compare the phylogenetic trees of the present method with those obtained by other methods. It is observed that the phylogenetic trees generated by the proposed technique are on par with their known biological references, and that they produce better results than those of the earlier methods. Communicated by Ramaswamy H. Sarma.
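The pipeline described in this abstract (binary component sequences, power spectrum, normalization, a moment-based descriptor, Euclidean distance) can be illustrated with a minimal Python sketch. It is not the authors' implementation. A naive discrete Fourier transform stands in for the FFT so that the block needs only the standard library, and because the abstract does not define the "second-order minimal moment", the hypothetical `second_moment` below uses a plain second central moment as a placeholder. All function names are illustrative assumptions.

```python
import cmath
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def indicator(seq, residue):
    """Binary component sequence: 1 where `residue` occurs, else 0."""
    return [1.0 if ch == residue else 0.0 for ch in seq]


def power_spectrum(signal):
    """Power spectrum via a naive O(n^2) DFT (stand-in for an FFT)."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def normalise(values):
    """Scale values so they sum to 1; an all-zero input is returned as is."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


def second_moment(p):
    """ASSUMPTION: placeholder second central moment, not the paper's
    'minimal moment', whose definition the abstract does not give."""
    mean = sum(p) / len(p)
    return sum((v - mean) ** 2 for v in p) / len(p)


def descriptor(seq):
    """20-component descriptor, one value per amino acid."""
    return [second_moment(normalise(power_spectrum(indicator(seq, aa))))
            for aa in AMINO_ACIDS]


def euclidean(u, v):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Under these assumptions, `euclidean(descriptor(s1), descriptor(s2))` fills one cell of the distance matrix. Because the descriptor always has 20 components, sequences of different lengths can be compared directly.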
Affiliation(s)
- Jayanta Pal
  - Department of ECE, National Institute of Technology, Durgapur, India
  - Department of CSE, Narula Institute of Technology, Kolkata, India
- Soumen Ghosh
  - Department of ECE, National Institute of Technology, Durgapur, India
  - Department of IT, Narula Institute of Technology, Kolkata, India
- Bansibadan Maji
  - Department of ECE, National Institute of Technology, Durgapur, India
3
Alipour F, Holmes C, Lu YY, Hill KA, Kari L. Leveraging machine learning for taxonomic classification of emerging astroviruses. Front Mol Biosci 2024; 10:1305506. [PMID: 38274100 PMCID: PMC10808839 DOI: 10.3389/fmolb.2023.1305506] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2023] [Accepted: 12/12/2023] [Indexed: 01/27/2024] Open
Abstract
Astroviruses are a family of genetically diverse viruses associated with disease in humans and birds, with significant health effects and economic burdens. Astrovirus taxonomic classification includes two genera, Avastrovirus and Mamastrovirus. However, with next-generation sequencing, broader interspecies transmission has been observed, necessitating a reexamination of the current host-based taxonomic classification approach. In this study, a novel taxonomic classification method is presented for emergent and as yet unclassified astroviruses, based on whole-genome sequence k-mer composition in addition to host information. An optional component responsible for identifying recombinant sequences was added to the method's pipeline, to counteract the impact of genetic recombination on viral classification. The proposed three-pronged classification method consists of a supervised machine learning method, an unsupervised machine learning method, and the consideration of host species. Using this three-pronged approach, we propose genus labels for 191 as yet unclassified astrovirus genomes. Genus labels are also suggested for an additional eight as yet unclassified astrovirus genomes for which incompatibility was observed with the host species, suggesting cross-species infection. Lastly, our machine learning-based approach, augmented by principal component analysis (PCA), provides evidence supporting the hypothesis of the existence of a human astrovirus (HAstV) subgenus of the genus Mamastrovirus, and a goose astrovirus (GoAstV) subgenus of the genus Avastrovirus. Overall, this multipronged machine learning approach provides a fast, reliable, and scalable prediction method of taxonomic labels, able to keep pace with emerging viruses and the exponential increase in the output of modern genome sequencing technologies.
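As an illustration of what "k-mer composition" means in practice, the following Python sketch turns a nucleotide sequence into a fixed-length frequency vector. The function name, the default `k=3`, and the choice to skip windows containing ambiguous bases are assumptions made for this example and are not taken from the paper.

```python
from itertools import product


def kmer_composition(seq, k=3, alphabet="ACGT"):
    """Relative frequency of every possible k-mer, in lexicographic order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        # Windows with symbols outside the alphabet (e.g. N) are skipped.
        if word in counts:
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]
```

Every genome maps to a vector of length `len(alphabet) ** k` regardless of its own length, which is what allows such vectors to be fed to supervised or unsupervised learners without prior alignment.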
Affiliation(s)
- Fatemeh Alipour
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Connor Holmes
  - Department of Biology, University of Western Ontario, London, ON, Canada
- Yang Young Lu
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Kathleen A. Hill
  - Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
4
Shaukat MA, Nguyen TT, Hsu EB, Yang S, Bhatti A. Comparative study of encoded and alignment-based methods for virus taxonomy classification. Sci Rep 2023; 13:18662. [PMID: 37907535 PMCID: PMC10618506 DOI: 10.1038/s41598-023-45461-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2023] [Accepted: 10/19/2023] [Indexed: 11/02/2023] Open
Abstract
The emergence of viruses and their variants has made virus taxonomy more important than ever before in controlling the spread of diseases. The creation of efficient treatments and cures that target particular virus properties can be aided by understanding virus taxonomy. Alignment-based methods are commonly used for this task, but are computationally expensive and time-consuming, especially when dealing with large datasets or when detecting new virus variants is time sensitive. An alternative approach, the encoded method, has been developed that does not require prior sequence alignment and provides faster results. However, each encoded method has its own claimed accuracy. Therefore, careful evaluation and comparison of the performance of different encoded methods are essential to identify the most accurate and reliable approach for virus taxonomy classification. This study aims to address this issue by providing a comprehensive and comparative analysis of the potential of encoded methods for virus classification and phylogenetics. We compared the vectors generated for each encoded method using distance metrics to determine their similarity to alignment-based methods. The results and their validation show that the K-merNV encoded method, followed by CgrDft, performs similarly to state-of-the-art multi-sequence alignment methods. This is the first study to incorporate and compare encoded methods, and it will help future research make more informed decisions regarding the selection of a suitable method for virus taxonomy.
Affiliation(s)
- Muhammad Arslan Shaukat
  - Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Victoria, Australia
- Thanh Thi Nguyen
  - Faculty of Information Technology, Monash University, Victoria, Australia
- Edbert B Hsu
  - Department of Emergency Medicine, Johns Hopkins University, Maryland, USA
- Samuel Yang
  - Department of Emergency Medicine, Stanford University, California, USA
- Asim Bhatti
  - Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Victoria, Australia
5
Abadeh R, Aminafshar M, Ghaderi-Zefrehei M, Chamani M. A new gene tree algorithm employing DNA sequences of bovine genome using discrete Fourier transformation. PLoS One 2023; 18:e0277480. [PMID: 36893167 PMCID: PMC9997877 DOI: 10.1371/journal.pone.0277480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2022] [Accepted: 10/28/2022] [Indexed: 03/10/2023] Open
Abstract
Within the realm of human thought on nature, Fourier analysis is considered one of the greatest ideas put forward to date. The Fourier transform shows that any periodic function can be rewritten as a sum of sinusoidal functions. Taking a Fourier transform view of real-world problems, such as the DNA sequences of genes, can make them intuitively simpler to understand than in their original domain. In this study we used the discrete Fourier transform (DFT) on DNA sequences of a set of genes in the bovine genome known to govern milk production, in order to develop a new gene clustering algorithm. The implementation of this algorithm is very user-friendly and requires only simple routine mathematical operations. By transforming the configuration of gene sequences into the frequency domain, we sought to elucidate important features and reveal hidden gene properties. This is biologically appealing since no information is lost via this transformation and we are therefore not reducing the number of degrees of freedom. The results from different clustering methods were integrated using evidence accumulation algorithms to provide in silico validation of our results. We propose using candidate gene sequences accompanied by other genes of biologically unknown function. These will then be assigned some degree of relevant annotation by using our proposed algorithm. Current knowledge in biological gene clustering investigation is also lacking, and so DFT-based methods will help shed light on the use of these algorithms for biological insight.
Affiliation(s)
- Roxana Abadeh
  - Department of Animal Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
- Mehdi Aminafshar
  - Department of Animal Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
- Mohammad Chamani
  - Department of Animal Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
6
Grigoriadis D, Perdikopanis N, Georgakilas GK, Hatzigeorgiou AG. DeepTSS: multi-branch convolutional neural network for transcription start site identification from CAGE data. BMC Bioinformatics 2022; 23:395. [PMID: 36510136 PMCID: PMC9743497 DOI: 10.1186/s12859-022-04945-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Accepted: 09/16/2022] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND: The widespread usage of Cap Analysis of Gene Expression (CAGE) has led to numerous breakthroughs in understanding the transcription mechanisms. Recent evidence in the literature, however, suggests that CAGE suffers from transcriptional and technical noise. Regardless of the sample quality, there is a significant number of CAGE peaks that are not associated with transcription initiation events. This type of signal is typically attributed to technical noise and more frequently to random five-prime capping or transcription bioproducts. Thus, there is a need for computational methods that can increase the signal-to-noise ratio in CAGE data, resulting in error-free transcription start site (TSS) annotation and quantification of regulatory region usage. In this study, we present DeepTSS, a novel computational method for processing CAGE samples that combines genomic signal processing (GSP), structural DNA features, evolutionary conservation evidence and raw DNA sequence with Deep Learning (DL) to provide single-nucleotide TSS predictions with unprecedented levels of performance.
RESULTS: To evaluate DeepTSS, we utilized experimental data, protein-coding gene annotations and computationally derived genome segmentations by chromatin states. DeepTSS was found to outperform existing algorithms on all benchmarks, achieving 98% precision and 96% sensitivity (accuracy 95.4%) on the protein-coding gene strategy, with 96.66% of its positive predictions overlapping active chromatin, and 98.27% and 92.04% co-localized with at least one transcription factor and H3K4me3 peak, respectively.
CONCLUSIONS: CAGE is a key protocol in deciphering the language of transcription; however, like every experimental protocol, it suffers from biological and technical noise that can severely affect downstream analyses. DeepTSS is a novel DL-based method for effectively removing noisy CAGE signal. In contrast to existing software, DeepTSS does not require feature selection, since the embedded convolutional layers can readily identify patterns and only utilize the important ones for the classification task. This study highlights the key role that DL can play in molecular biology, by removing the inherent flaws of the experimental protocols that form the backbone of contemporary research. Here, we show how DeepTSS can unleash the full potential of an already popular and mature method such as CAGE, and push the boundaries of coding and non-coding gene expression regulator research even further.
Affiliation(s)
- Dimitris Grigoriadis
  - Hellenic Pasteur Institute, 11521 Athens, Greece
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35131 Lamia, Greece
- Nikos Perdikopanis
  - Hellenic Pasteur Institute, 11521 Athens, Greece
  - Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, 15784 Athens, Greece
  - Department of Electrical and Computer Engineering, University of Thessaly, 38221 Volos, Greece
- Georgios K. Georgakilas
  - Department of Electrical and Computer Engineering, University of Thessaly, 38221 Volos, Greece
  - ommAI Technologies, Tallinn, Estonia
- Artemis G. Hatzigeorgiou
  - Hellenic Pasteur Institute, 11521 Athens, Greece
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35131 Lamia, Greece
7
Pal J, Ghosh S, Maji B, Bhattacharya DK. Mathematical Approach to Protein Sequence Comparison Based on Physiochemical Properties. ACS Omega 2022; 7:39446-39455. [PMID: 36340165 PMCID: PMC9631895 DOI: 10.1021/acsomega.2c06103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Accepted: 09/27/2022] [Indexed: 06/16/2023]
Abstract
The difficult aspect of developing new protein sequence comparison techniques is coming up with a method that can quickly and effectively handle huge data sets of sequences of various lengths. In this work, we first obtain two numerical representations of protein sequences separately, based on one physical property and one chemical property of amino acids. The lengths of all the sequences under comparison are made equal by appending the required number of zeroes. Then, the fast Fourier transform is applied to this numerical time series to obtain the corresponding spectrum. Next, the spectrum values are reduced by the standard inter-coefficient difference method. Finally, the corresponding normalized values of the reduced spectrum are selected as the descriptors for protein sequence comparison. Using these descriptors, the distance matrices are obtained using the Euclidean distance. They are subsequently used to draw the phylogenetic trees using the UPGMA algorithm. Phylogenetic trees are first constructed for 9 ND4, 9 ND5, and 9 ND6 proteins using the polarity value as the chemical property and the molecular weight as the physical property. They are compared, and it is seen that polarity is a better choice than molecular weight in protein sequence comparison. Next, using the polarity property, phylogenetic trees are obtained for 12 baculovirus and 24 transferrin proteins. The results are compared with those obtained earlier on the identical sequences by other methods. Three assessment criteria are considered for comparison of the results: quality based on rationalized perception, quantitative measures based on symmetric distance, and computational speed. In all the cases, the results are found to be more satisfactory.
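The final step mentioned in this abstract, turning a distance matrix into a tree with UPGMA, can be sketched in a few lines of Python. This is a generic textbook version of the algorithm rather than the software used by the authors. The function name `upgma` and the nested-tuple output are choices made for this example, and branch lengths are omitted to keep it short.

```python
def upgma(labels, dist):
    """Build a UPGMA tree topology from a symmetric distance matrix.

    labels: list of taxon names; dist: list of lists of pairwise distances.
    Returns the tree as nested tuples, e.g. (("A", "B"), "C").
    """
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (tree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), _ = min(d.items(), key=lambda kv: kv[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted mean, equivalent to the unweighted average
            # over all original taxa in the two merged clusters.
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]
```

For example, with taxa A, B, C, D where A and B are close and C and D are close, the sketch groups each close pair first and joins the two groups last.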
Affiliation(s)
- Jayanta Pal
  - Department of ECE, National Institute of Technology, Durgapur 713209, India
  - Department of CSE, Narula Institute of Technology, Kolkata 700109, India
- Soumen Ghosh
  - Department of IT, Narula Institute of Technology, Kolkata 700109, India
- Bansibadan Maji
  - Department of ECE, National Institute of Technology, Durgapur 713209, India
8
Li W, Yang L, Qiu Y, Yuan Y, Li X, Meng Z. FFP: joint Fast Fourier transform and fractal dimension in amino acid property-aware phylogenetic analysis. BMC Bioinformatics 2022; 23:347. [PMID: 35986255 PMCID: PMC9392226 DOI: 10.1186/s12859-022-04889-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Accepted: 08/11/2022] [Indexed: 11/10/2022] Open
Abstract
Background
Amino acid property-aware phylogenetic analysis (APPA) refers to the phylogenetic analysis method based on amino acid property encoding, which is used for understanding and inferring evolutionary relationships between species from the molecular perspective. Fast Fourier transform (FFT) and Higuchi’s fractal dimension (HFD) have excellent performance in describing sequences’ structural and complexity information for APPA. However, with the exponential growth of protein sequence data, it is very important to develop a reliable APPA method for protein sequence analysis.
Results
Consequently, we propose a new method named FFP, which combines FFT and HFD. First, FFP encodes protein sequences on the basis of an important physicochemical property of amino acids, the dissociation constant, which determines the acidity and basicity of protein molecules. Second, FFT and HFD are used to generate the feature vectors of the encoded sequences, after which the distance matrix is calculated from the cosine function, which describes the degree of similarity between species. The smaller the distance between two species, the more similar they are. Finally, the phylogenetic tree is constructed. When FFP is tested for phylogenetic analysis on four groups of protein sequences, the results are clearly better than those of the compared methods, with the highest accuracy reaching more than 97%.
Conclusion
FFP has higher accuracy in APPA and multi-sequence alignment. It can also measure protein sequence similarity effectively, and it is hoped that it will play a role in APPA-related research.
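Two of the building blocks named in this abstract, Higuchi's fractal dimension and a cosine-based distance, can be sketched in Python as follows. This follows the standard published definition of Higuchi's method and is not the FFP code itself. The parameter `k_max=5` and the formula `1 - cosine similarity` for the distance are assumptions, since the abstract specifies neither.

```python
import math


def higuchi_fd(series, k_max=5):
    """Estimate Higuchi's fractal dimension of a numeric series."""
    n = len(series)
    if k_max < 2 or n <= k_max:
        raise ValueError("need k_max >= 2 and len(series) > k_max")
    xs, ys = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(k):  # starting offset of each sub-series
            steps = (n - 1 - m) // k
            if steps < 1:
                continue
            path = sum(abs(series[m + i * k] - series[m + (i - 1) * k])
                       for i in range(1, steps + 1))
            # Higuchi's normalisation of the sub-series curve length.
            lengths.append(path * (n - 1) / (steps * k) / k)
        mean_len = sum(lengths) / len(lengths)
        if mean_len <= 0:
            raise ValueError("constant series: curve length is zero")
        xs.append(math.log(1.0 / k))
        ys.append(math.log(mean_len))
    # Least-squares slope of log L(k) against log(1/k).
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
            / sum((a - mx) ** 2 for a in xs))


def cosine_distance(u, v):
    """ASSUMPTION: distance defined as 1 minus cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm
```

A smooth straight-line series gives a dimension of 1 under this definition, and more irregular series give values closer to 2, which is why the measure is used as a complexity feature.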
9
nTreeClus: A tree-based sequence encoder for clustering categorical series. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.04.076] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
10
Millán Arias P, Alipour F, Hill KA, Kari L. DeLUCS: Deep learning for unsupervised clustering of DNA sequences. PLoS One 2022; 17:e0261531. [PMID: 35061715 PMCID: PMC8782307 DOI: 10.1371/journal.pone.0261531] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 05/03/2021] [Accepted: 12/06/2021] [Indexed: 11/25/2022] Open
Abstract
We present a novel Deep Learning method for the Unsupervised Clustering of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Frequency Chaos Game Representations (FCGR) of primary DNA sequences, and generates "mimic" sequence FCGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster assignment for each sequence. The clusters learned by DeLUCS match true taxonomic groups for large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genera; 3,200 randomly selected 400 kbp-long bacterial genome segments, into clusters corresponding to bacterial families; three viral genome and gene datasets, averaging 1,300 sequences each, into clusters corresponding to virus subtypes. DeLUCS significantly outperforms two classic clustering methods (K-means++ and Gaussian Mixture Models) for unlabelled data, by as much as 47%. DeLUCS is highly effective: it is able to cluster datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence clustering for previously intractable datasets.
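To make the Frequency Chaos Game Representation mentioned in this abstract concrete, here is a minimal Python sketch that counts k-mers into a 2^k by 2^k grid. It is not the DeLUCS implementation. The assignment of nucleotides to corners and the order in which letters refine the position are conventions that differ between tools, so the mapping in `bits` is an assumption made for this example.

```python
def fcgr(seq, k=3):
    """Count k-mers into a (2**k x 2**k) grid, one cell per k-mer."""
    # ASSUMPTION: one possible corner assignment, as (row bit, column bit).
    bits = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if any(ch not in bits for ch in word):
            continue  # skip windows containing ambiguous bases such as N
        row = col = 0
        for ch in word:
            r, c = bits[ch]
            # Each letter halves the current square, refining row and column.
            row = 2 * row + r
            col = 2 * col + c
        grid[row][col] += 1
    return grid
```

The resulting grid can be treated as a small grayscale image of fixed size for any sequence length, which is what makes it a convenient input for neural networks.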
Affiliation(s)
- Pablo Millán Arias
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Fatemeh Alipour
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Kathleen A. Hill
  - Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
11
Canessa E. Uncovering Signals from the Coronavirus Genome. Genes (Basel) 2021; 12:973. [PMID: 34202172 PMCID: PMC8303286 DOI: 10.3390/genes12070973] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/30/2021] [Revised: 06/16/2021] [Accepted: 06/22/2021] [Indexed: 01/21/2023] Open
Abstract
A signal analysis of the complete genome sequences of coronavirus variants of concern, B.1.1.7 (Alpha), B.1.351 (Beta) and P.1 (Gamma), and coronavirus variants of interest, B.1.429–B.1.427 (Epsilon) and B.1.525 (Eta), is presented using open GISAID data. We deal with a new type of finite alternating sum series having independently distributed terms associated with binary (0,1) indicators for the nucleotide bases. Our method provides additional information to conventional similarity comparisons via alignment methods and Fourier power spectrum approaches. It uncovers distinctive patterns in the intrinsic data organization of complete genomic sequences according to their progression along the nucleotide base positions. The present method could be useful for the bioinformatics surveillance and dynamics of coronavirus genome variants.
Affiliation(s)
- Enrique Canessa
  - The Abdus Salam International Centre for Theoretical Physics (ICTP), Science Dissemination Unit (SDU), 34151 Trieste, Italy
12
A novel entropy-based mapping method for determining the protein-protein interactions in viral genomes by using coevolution analysis. Biomed Signal Process Control 2021. [DOI: 10.1016/j.bspc.2020.102359] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/23/2022]
13
A new graph-theoretic approach to determine the similarity of genome sequences based on nucleotide triplets. Genomics 2020; 112:4701-4714. [PMID: 32827671 PMCID: PMC7437474 DOI: 10.1016/j.ygeno.2020.08.023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/09/2019] [Revised: 07/15/2020] [Accepted: 08/17/2020] [Indexed: 11/22/2022]
Abstract
Methods of finding sequence similarity play a significant role in computational biology. Owing to the rapid increase of genome sequences in public databases, determining the evolutionary relationships of species is becoming more challenging. Traditional alignment-based methods are found to be inappropriate due to their time-consuming nature. Therefore, it is necessary to find a faster method that applies to species phylogeny. In this paper, a new graph-theory based alignment-free sequence comparison method is proposed. A complete bipartite graph is used to represent each genome sequence based on its nucleotide triplets. Subsequently, with the help of the weights of the edges of the graph, a vector descriptor is formed. Finally, the phylogenetic tree is drawn using the UPGMA algorithm. In the present case, the datasets for comparison are related to mammals, viruses, and bacteria. In most of the cases, the phylogeny obtained is found to be more satisfactory than that of earlier methods.
- A new graph-theory based alignment-free genome sequence comparison.
- Use of complete bipartite graph to represent genome sequences.
- Descriptor based on the weights of the edges of the graph.
- Comparison of the phylogenetic trees of different mammals, viruses, and bacteria.
- Less time complexity compared to that of earlier methods.
14
Sun N, Dong R, Pei S, Yin C, Yau SST. A New Method Based on Coding Sequence Density to Cluster Bacteria. J Comput Biol 2020; 27:1688-1698. [PMID: 32392428 DOI: 10.1089/cmb.2019.0509] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/31/2022] Open
Abstract
Bacterial evolution is an important field of study, and biological sequences are often used to construct phylogenetic relationships. Multiple sequence alignment is very time-consuming and cannot deal with large numbers of bacterial genome sequences in a reasonable time. Hence, a new mathematical method, the joining density vector method, is proposed to cluster bacteria; it characterizes the features of the coding sequences (CDS) in a DNA sequence. Coding sequences carry the genetic information used to synthesize proteins. The correspondence between a genomic sequence and its joining density vector (JDV) is one-to-one. The JDV reflects the statistical characteristics of a genomic sequence, and large amounts of data can be analyzed using this new approach. We apply the novel method to phylogenetic analysis of four bacterial data sets at the genus and species levels. The phylogenetic trees show that our new method accurately describes the evolutionary relationships of bacterial coding sequences, and is faster than ClustalW and the existing alignment-free methods.
Affiliation(s)
- Nan Sun
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Rui Dong
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Shaojun Pei
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Changchuan Yin
  - Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago, Chicago, Illinois, USA
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
15
Abo-Elkhier MM, Abd Elwahaab MA, Abo El Maaty MI. Measuring Similarity among Protein Sequences Using a New Descriptor. BioMed Research International 2019; 2019:2796971. [PMID: 31886192 PMCID: PMC6893242 DOI: 10.1155/2019/2796971] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/06/2019] [Revised: 09/03/2019] [Accepted: 10/28/2019] [Indexed: 12/01/2022]
Abstract
The comparison of protein sequences according to similarity is a fundamental aspect of today's biomedical research. With developments in sequencing technologies, the number of protein sequences in public databases is increasing exponentially. Well-known sequence comparison methods are alignment based. They generally give excellent results when the sequences under study are closely related, but they are time consuming. Herein, a new alignment-free method is introduced. Our technique depends on a new graphical representation and descriptor. The graphical representation is a simple way to visualize protein sequences. The descriptor compresses the primary sequence into a single vector composed of only two values. Our approach gives good results with both short and long sequences with little computation time. It is applied to nine beta globin, nine ND5 (NADH dehydrogenase subunit 5), and 24 spike protein sequences. Correlation and significance analyses are also introduced to compare our similarity/dissimilarity results with the results of other approaches and with sequence homology.
Affiliation(s)
- Mervat M. Abo-Elkhier
- Department of Engineering Mathematics and Physics, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt
- Marwa A. Abd Elwahaab
- Department of Engineering Mathematics and Physics, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt
- Moheb I. Abo El Maaty
- Department of Engineering Mathematics and Physics, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt
|
16
|
Das S, Das A, Mondal B, Dey N, Bhattacharya DK, Tibarewala DN. Genome sequence comparison under a new form of tri-nucleotide representation based on bio-chemical properties of nucleotides. Gene 2019; 730:144257. [PMID: 31759983 DOI: 10.1016/j.gene.2019.144257] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/29/2019] [Revised: 11/01/2019] [Accepted: 11/05/2019] [Indexed: 10/25/2022]
Abstract
Genetic sequence analysis, the classification of genome sequences, and the inference of evolutionary relationships between species from their biological sequences are emerging research domains in bioinformatics. Several methods have already been applied to DNA sequence comparison under tri-nucleotide representation. In this paper, a new form of tri-nucleotide representation is proposed for sequence comparison. The comparison does not depend on the alignment of the sequences. In this representation, the bio-chemical properties of the nucleotides are considered. The novelty of this method is that sequences of unequal lengths are represented by vectors of the same length, and each tri-nucleotide formed from the given sequence has its own unique representation. To validate the proposed method, it is tested on several data sets of mammals, viruses and bacteria. The results of this method are further compared with those obtained by methods such as the probabilistic method, natural vector method, Fourier power spectrum method, multiple encoding vector method, and feature frequency profiles method. Moreover, this method produces accurate phylogeny in all the cases. It is also shown that the time complexity of the present method is lower.
Affiliation(s)
- Subhram Das
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
- Arijit Das
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
- Bingshati Mondal
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
- Nilanjan Dey
- Department of Information Technology, Techno India College of Technology, Kolkata, India
- D K Bhattacharya
- Department of Pure Mathematics, Calcutta University, Kolkata, India
- D N Tibarewala
- Department of Bio-Science and Engineering, Jadavpur University, Kolkata, India
|
17
|
Dougan TJ, Quake SR. Viral taxonomy derived from evolutionary genome relationships. PLoS One 2019; 14:e0220440. [PMID: 31412051 PMCID: PMC6693820 DOI: 10.1371/journal.pone.0220440] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 11/21/2018] [Accepted: 07/16/2019] [Indexed: 11/23/2022] Open
Abstract
We describe a new genome alignment-based model for understanding the diversity of viruses based on evolutionary genetic relationships. This approach uses information theory and a physical model to determine the information shared by the genes in two genomes. Pairwise comparisons of genes from the viruses are created from alignments using NCBI BLAST, and their match scores are combined to produce a metric between genomes, which is in turn used to determine a global classification using the 5,817 viruses on RefSeq. In cases where there is no measurable alignment between any genes, the method falls back to a coarser measure of genome relationship: the mutual information of 4-mer frequency. This results in a principled model which depends only on the genome sequence, which captures many interesting relationships between viral families, and which creates clusters which correlate well with both the Baltimore and ICTV classifications. The incremental computational cost of classifying a novel virus is low and therefore newly discovered viruses can be quickly identified and classified. The model goes beyond alignment-free classifications by producing a full phylogeny similar to those constructed by virologists using qualitative features, while relying only on objective genes. These results bolster the case for mathematical models in microbiology which can characterize organisms using only their genetic material and provide an independent check for phylogenies constructed by humans, considerably faster and more cheaply than less modern approaches.
Affiliation(s)
- Tyler J Dougan
- Department of Physics, Stanford University, Stanford, California, United States of America
- Stephen R Quake
- Departments of Bioengineering and Applied Physics, Stanford University and Chan Zuckerberg Biohub, Stanford, California, United States of America
|
18
|
Pei S, Dong R, He RL, Yau SST. Large-Scale Genome Comparison Based on Cumulative Fourier Power and Phase Spectra: Central Moment and Covariance Vector. Comput Struct Biotechnol J 2019; 17:982-994. [PMID: 31384399 PMCID: PMC6661692 DOI: 10.1016/j.csbj.2019.07.003] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 05/01/2019] [Revised: 06/24/2019] [Accepted: 07/10/2019] [Indexed: 01/04/2023] Open
Abstract
Genome comparison is a vital research area of bioinformatics. For large-scale genome comparisons, Multiple Sequence Alignment (MSA) methods have been impractical because of their algorithmic complexity. In this study, we propose a novel alignment-free method based on the one-to-one correspondence between a DNA sequence and its complete central moment vector of the cumulative Fourier power and phase spectra. In addition, the covariance between the four nucleotides in the power and phase spectra is included. We use the cumulative Fourier power and phase spectra to define a 28-dimensional vector for each DNA sequence. Euclidean distances between the vectors can measure the dissimilarity between DNA sequences. We perform testing with datasets of different sizes and types, including simulated DNA sequences, exon-intron sequences and complete genomes. The results show that our method is more accurate and efficient for performing hierarchical clustering than other alignment-free methods and MSA methods.
Affiliation(s)
- Shaojun Pei
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
- Rui Dong
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
- Rong Lucy He
- Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
- Stephen S.-T. Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
|
19
|
Farkaš T, Sitarčík J, Brejová B, Lucká M. SWSPM: A Novel Alignment-Free DNA Comparison Method Based on Signal Processing Approaches. Evol Bioinform Online 2019; 15:1176934319849071. [PMID: 31210725 PMCID: PMC6545658 DOI: 10.1177/1176934319849071] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/12/2019] [Accepted: 04/12/2019] [Indexed: 11/16/2022] Open
Abstract
Computing similarity between 2 nucleotide sequences is one of the fundamental problems in bioinformatics. Current methods are based mainly on 2 major approaches: (1) sequence alignment, which is computationally expensive, and (2) faster, but less accurate, alignment-free methods based on various statistical summaries, for example, short word counts. We propose a new distance measure based on mathematical transforms from the domain of signal processing. To tolerate large-scale rearrangements in the sequences, the transform is computed across sliding windows. We compare our method on several data sets with current state-of-art alignment-free methods. Our method compares favorably in terms of accuracy and outperforms other methods in running time and memory requirements. In addition, it is massively scalable up to dozens of processing units without the loss of performance due to communication overhead. Source files and sample data are available at https://bitbucket.org/fiitstubioinfo/swspm/src.
Affiliation(s)
- Tomáš Farkaš
- Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, Bratislava, Slovakia
- Jozef Sitarčík
- Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, Bratislava, Slovakia
- Broňa Brejová
- Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
- Mária Lucká
- Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, Bratislava, Slovakia
|
20
|
Dong R, He L, He RL, Yau SST. A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance. Front Genet 2019; 10:234. [PMID: 31024610 PMCID: PMC6465635 DOI: 10.3389/fgene.2019.00234] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 09/07/2018] [Accepted: 03/04/2019] [Indexed: 11/30/2022] Open
Abstract
Classification of DNA sequences is an important issue in bioinformatics, yet most existing methods for phylogenetic analysis, including Multiple Sequence Alignment (MSA), are time-consuming and computationally expensive. Alignment-free methods are popular nowadays, but the manual intervention in those methods usually decreases the accuracy. Also, the interactions among nucleotides are neglected in most methods. Here we propose a new Accumulated Natural Vector (ANV) method which represents each DNA sequence by a point in ℝ¹⁸. By calculating the Accumulated Indicator Functions of nucleotides, we can further find an Accumulated Natural Vector for each sequence. This new Accumulated Natural Vector not only captures the distribution of each nucleotide, but also provides the covariance among nucleotides. Thus global comparison of DNA sequences or genomes can be done easily in ℝ¹⁸. Tests of ANV on datasets of different sizes and types have demonstrated the accuracy and time-efficiency of the proposed method.
Affiliation(s)
- Rui Dong
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Lily He
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Rong Lucy He
- Department of Biological Sciences, Chicago State University, Chicago, IL, United States
- Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
|
21
|
Randhawa GS, Hill KA, Kari L. ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels. BMC Genomics 2019; 20:267. [PMID: 30943897 PMCID: PMC6448311 DOI: 10.1186/s12864-019-5571-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 09/04/2018] [Accepted: 02/27/2019] [Indexed: 11/11/2022] Open
Abstract
Background: Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods.
Results: We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed on two small benchmark datasets and one large dataset of 4322 vertebrate mtDNA genomes. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy, and is overall faster. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the “Purine/Pyrimidine”, “Just-A” and “Real” numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes.
Conclusions: Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
Affiliation(s)
- Gurjit S Randhawa
- Department of Computer Science, University of Western Ontario, London, ON, Canada
- Kathleen A Hill
- Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
|
22
|
Saw AK, Raj G, Das M, Talukdar NC, Tripathy BC, Nandi S. Alignment-free method for DNA sequence clustering using Fuzzy integral similarity. Sci Rep 2019; 9:3753. [PMID: 30842590 PMCID: PMC6403383 DOI: 10.1038/s41598-019-40452-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/31/2018] [Accepted: 01/28/2019] [Indexed: 12/28/2022] Open
Abstract
The growing volume of sequence data in private and public databases produced by next-generation sequencing poses new challenges because of the limitations of alignment-based methods for sequence comparison. There is therefore a strong need for faster sequence analysis algorithms. In this study, we developed an alignment-free algorithm for faster sequence analysis. The novelty of our approach is the inclusion of a fuzzy integral with a Markov chain for sequence analysis in the alignment-free model. The method estimates the parameters of a Markov chain by considering the frequencies of occurrence of all possible nucleotide pairs from each DNA sequence. These estimated Markov chain parameters were used to calculate similarity among all pairwise combinations of DNA sequences based on a fuzzy integral algorithm. The resulting similarity matrix is used as input for the neighbor program in the PHYLIP package for phylogenetic tree construction. Our method was tested on eight benchmark datasets and on in-house generated datasets (18S rDNA sequences from 11 arbuscular mycorrhizal fungi (AMF) and 16S rDNA sequences of 40 bacterial isolates from plant interior). The results indicate that the fuzzy integral algorithm is an efficient and feasible alignment-free method for sequence analysis on the genomic scale.
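As a rough illustration of the first step described in this abstract, the sketch below estimates first-order Markov transition probabilities from the nucleotide-pair counts of a single DNA sequence. It is a minimal sketch, not the authors' implementation: the function name `transition_matrix` and the example sequence are hypothetical, pairs containing non-ACGT symbols are simply skipped (an assumption), and the fuzzy integral similarity step is not shown.

```python
from collections import Counter

BASES = "ACGT"


def transition_matrix(seq):
    """Estimate P(next base | current base) from overlapping nucleotide pairs."""
    seq = seq.upper()
    # Count every adjacent pair; pairs with ambiguous symbols are skipped (assumption).
    pair_counts = Counter(
        (a, b) for a, b in zip(seq, seq[1:]) if a in BASES and b in BASES
    )
    matrix = {}
    for a in BASES:
        row_total = sum(pair_counts[(a, b)] for b in BASES)
        # A base that never has a successor gets an all-zero row.
        matrix[a] = {
            b: (pair_counts[(a, b)] / row_total if row_total else 0.0)
            for b in BASES
        }
    return matrix


# Toy sequence, for illustration only.
P = transition_matrix("ACGTACGGAC")
```

Each row of `P` sums to 1 whenever the corresponding base is followed by at least one other base in the sequence, so the 16 entries give a fixed-length description of a sequence of any length.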
Affiliation(s)
- Ajay Kumar Saw
- Institute of Advanced Study in Science and Technology, Mathematical Sciences Division, Guwahati, 781035, India
- Garima Raj
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
- Manashi Das
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
- Narayan Chandra Talukdar
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
- Soumyadeep Nandi
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
|
23
|
Huang HH, Girimurugan SB. Discrete Wavelet Packet Transform Based Discriminant Analysis for Whole Genome Sequences. Stat Appl Genet Mol Biol 2019; 18:/j/sagmb.ahead-of-print/sagmb-2018-0045/sagmb-2018-0045.xml. [PMID: 30772870 DOI: 10.1515/sagmb-2018-0045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
In recent years, alignment-free methods have been widely applied in comparing genome sequences, as these methods compute efficiently and provide desirable phylogenetic analysis results. These methods have been successfully combined with hierarchical clustering methods for finding phylogenetic trees. However, it may not be suitable to apply these alignment-free methods directly to existing statistical classification methods, because an appropriate statistical classification theory for integrating with the alignment-free representation methods is still lacking. In this article, we propose a discriminant analysis method which uses the discrete wavelet packet transform to classify whole genome sequences. The proposed alignment-free representation statistics of features follow a joint normal distribution asymptotically. The data analysis results indicate that the proposed method provides satisfactory classification results in real time.
Affiliation(s)
- Hsin-Hsiung Huang
- University of Central Florida, Department of Statistics, Orlando, FL, USA
|
24
|
Using a Classifier Fusion Strategy to Identify Anti-angiogenic Peptides. Sci Rep 2018; 8:14062. [PMID: 30218091 PMCID: PMC6138733 DOI: 10.1038/s41598-018-32443-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 06/25/2018] [Accepted: 09/07/2018] [Indexed: 12/27/2022] Open
Abstract
Anti-angiogenic peptides perform distinct physiological functions and are potential therapies for angiogenesis-related diseases. Accurate identification of anti-angiogenic peptides may provide significant clues for understanding the essential angiogenic homeostasis within tissues and for developing antineoplastic therapies. In this study, an ensemble predictor is proposed for anti-angiogenic peptide prediction by fusing the individual classifier with the best sensitivity and the individual classifier with the best specificity. We investigate the predictive capabilities of various feature spaces with respect to the corresponding optimal individual classifiers and ensemble classifiers. The accuracy and Matthews correlation coefficient (MCC) of the ensemble classifier trained on Bi-profile Bayes (BpB) features are 0.822 and 0.649, respectively, the best results among the investigated prediction models. Discriminative features are obtained from BpB using the Relief algorithm followed by the Incremental Feature Selection (IFS) method. The sensitivity, specificity, accuracy, and MCC of the ensemble classifier trained on the discriminative features reach 0.776, 0.888, 0.832, and 0.668, respectively. Experimental results indicate that the proposed method is far superior to that of the previous study for anti-angiogenic peptide prediction.
|
25
|
Phylogenetic analysis of protein sequences based on a novel k-mer natural vector method. Genomics 2018; 111:1298-1305. [PMID: 30195069 DOI: 10.1016/j.ygeno.2018.08.010] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 07/25/2018] [Revised: 08/19/2018] [Accepted: 08/27/2018] [Indexed: 11/22/2022]
Abstract
Based on the k-mer model for protein sequences, a novel k-mer natural vector method is proposed to characterize the features of k-mers in a protein sequence, in which the numbers and distributions of k-mers are considered. It is proved that the relationship between a protein sequence and its k-mer natural vector is one-to-one. Phylogenetic analysis of protein sequences can therefore be performed easily without requiring evolutionary models or human intervention. However, there is no criterion for choosing a suitable k, and k strongly influences both the results and the computational complexity. In this paper, a compound k-mer natural vector is therefore utilized to quantify each protein sequence. The results obtained from phylogenetic analysis of three protein datasets demonstrate that our new method can precisely describe the evolutionary relationships of proteins and greatly improve computing efficiency.
|
26
|
Dong R, Zhu Z, Yin C, He RL, Yau SST. A new method to cluster genomes based on cumulative Fourier power spectrum. Gene 2018; 673:239-250. [PMID: 29935353 DOI: 10.1016/j.gene.2018.06.042] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 03/07/2018] [Revised: 06/12/2018] [Accepted: 06/14/2018] [Indexed: 11/27/2022]
Abstract
Analyzing phylogenetic relationships using mathematical methods has always been of importance in bioinformatics. Quantitative research may interpret the raw biological data in a precise way. Multiple Sequence Alignment (MSA) is used frequently to analyze biological evolution, but is very time-consuming. When the scale of data is large, alignment methods cannot finish calculation in reasonable time. Therefore, we present a new method that uses moments of the cumulative Fourier power spectrum to cluster DNA sequences. Each sequence is translated into a vector in Euclidean space. Distances between the vectors can reflect the relationships between sequences. The mapping between the spectra and the moment vector is one-to-one, which means that no information in the power spectra is lost during the calculation. We cluster and classify several datasets, including Influenza A, primate, and human rhinovirus (HRV) datasets, to build phylogenetic trees. Results show that the proposed cumulative Fourier power spectrum method is much faster and more accurate than MSA and another alignment-free method known as k-mer. The research provides new insights into the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes. The computer programs of the cumulative Fourier power spectrum are available at GitHub (https://github.com/YaulabTsinghua/cumulative-Fourier-power-spectrum).
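The starting point of this method, the Fourier power spectrum of a nucleotide indicator sequence and its running sum, can be sketched as follows. This is a minimal illustration assuming NumPy; the function names are hypothetical, and the moment vector that the paper builds from the cumulative spectrum (and any normalisation it applies) is not reproduced here. For the authors' own programs, see the GitHub repository cited in the abstract.

```python
import numpy as np


def indicator(seq, base):
    """Binary indicator sequence: 1.0 where `seq` has `base`, else 0.0."""
    return np.array([1.0 if c == base else 0.0 for c in seq.upper()])


def cumulative_power_spectrum(seq, base):
    """Running sum of the DFT power spectrum |U(k)|^2 of one indicator sequence."""
    power = np.abs(np.fft.fft(indicator(seq, base))) ** 2
    return np.cumsum(power)


# Toy sequence, for illustration only.
seq = "ACGTAC"
cps = {b: cumulative_power_spectrum(seq, b) for b in "ACGT"}
```

By Parseval's theorem, the last entry of each cumulative spectrum equals the sequence length times the number of occurrences of that base, which gives a quick sanity check on the transform. Because the spectrum has as many entries as the sequence has positions, a fixed-length summary such as the paper's moments is still needed before sequences of different lengths can be compared.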
Affiliation(s)
- Rui Dong
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Ziyue Zhu
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, IL 60607, USA
- Rong L He
- Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
- Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
|
27
|
Mendizabal-Ruiz G, Román-Godínez I, Torres-Ramos S, Salido-Ruiz RA, Vélez-Pérez H, Morales JA. Genomic signal processing for DNA sequence clustering. PeerJ 2018; 6:e4264. [PMID: 29379686 PMCID: PMC5786891 DOI: 10.7717/peerj.4264] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 10/17/2017] [Accepted: 12/24/2017] [Indexed: 11/20/2022] Open
Abstract
Genomic signal processing (GSP) methods which convert DNA data to numerical values have recently been proposed, which would offer the opportunity of employing existing digital signal processing methods for genomic data. One of the most used methods for exploring data is cluster analysis which refers to the unsupervised classification of patterns in data. In this paper, we propose a novel approach for performing cluster analysis of DNA sequences that is based on the use of GSP methods and the K-means algorithm. We also propose a visualization method that facilitates the easy inspection and analysis of the results and possible hidden behaviors. Our results support the feasibility of employing the proposed method to find and easily visualize interesting features of sets of DNA data.
Affiliation(s)
- Israel Román-Godínez
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
- Sulema Torres-Ramos
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
- Ricardo A Salido-Ruiz
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
- Hugo Vélez-Pérez
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
- J Alejandro Morales
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
|
28
|
Huang HH, Girimurugan SB. A Novel Real-Time Genome Comparison Method Using Discrete Wavelet Transform. J Comput Biol 2017; 25:405-416. [PMID: 29272149 DOI: 10.1089/cmb.2017.0115] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022] Open
Abstract
Real-time genome comparison is important for identifying unknown species and clustering organisms. We propose a novel method that can represent genome sequences of different lengths as a 12-dimensional numerical vector in real time for this purpose. Given a genome sequence, a binary indicator sequence of each nucleotide base location is computed, and then discrete wavelet transform is applied to these four binary indicator sequences to attain the respective power spectra. Afterward, moments of the power spectra are calculated. Consequently, the 12-dimensional numerical vectors are constructed from the first three order moments. Our experimental results on various data sets show that the proposed method is efficient and effective to cluster genes and genomes. It runs significantly faster than other alignment-free and alignment-based methods.
Affiliation(s)
- Hsin-Hsiung Huang
- Department of Statistics, University of Central Florida, Orlando, Florida
|
29
|
He L, Li Y, He RL, Yau SST. A novel alignment-free vector method to cluster protein sequences. J Theor Biol 2017; 427:41-52. [DOI: 10.1016/j.jtbi.2017.06.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 02/17/2017] [Revised: 05/04/2017] [Accepted: 06/02/2017] [Indexed: 11/29/2022]
|
30
|
Yu C, Baune BT, Licinio J, Wong ML. Whole-genome single nucleotide variant distribution on genomic regions and its relationship to major depression. Psychiatry Res 2017; 252:75-79. [PMID: 28258043 PMCID: PMC5730269 DOI: 10.1016/j.psychres.2017.02.041] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 09/15/2016] [Revised: 02/06/2017] [Accepted: 02/19/2017] [Indexed: 11/22/2022]
Abstract
Recent advances in DNA technologies have provided unprecedented opportunities for biological and medical research. In contrast to current popular genotyping platforms which identify specific variations, whole-genome sequencing (WGS) allows for the detection of all private mutations within an individual. Major depressive disorder (MDD) is a chronic condition with enormous medical, social and economic impacts. Genetic analysis, by identifying risk variants and thereby increasing our understanding of how MDD arises, could lead to improved prevention and the development of new and more effective treatments. Here we investigated the distributions of whole-genome single nucleotide variants (SNVs) on 12 different genomic regions for 25 human subjects using the symmetrised Kullback-Leibler divergence to measure the similarity between their SNV distributions. We performed cluster analysis for MDD patients and ethnically matched healthy controls. The results showed that Mexican-American controls grouped closer; in contrast depressed Mexican-American participants grouped away from their ethnically matched controls. This implies that whole-genome SNV distribution on the genomic regions may be related to major depression.
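The symmetrised Kullback-Leibler divergence used in this study to compare SNV distributions can be sketched in a few lines. The sketch assumes the common definition D(P‖Q) + D(Q‖P) (some authors halve this sum), and the region counts are made-up numbers, not data from the study; all function names are hypothetical.

```python
import math


def to_distribution(counts):
    """Normalise non-negative counts so that they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """D(p || q) in nats; requires q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetrised_kl(p, q):
    """Symmetric dissimilarity: D(p || q) + D(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical SNV counts over four genomic regions for two subjects.
subject_a = to_distribution([2, 2, 4, 8])
subject_b = to_distribution([1, 3, 4, 8])
d_ab = symmetrised_kl(subject_a, subject_b)
```

The value is zero for identical distributions and does not depend on the order of its arguments, which is what makes it usable as a pairwise dissimilarity for cluster analysis.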
Affiliation(s)
- Chenglong Yu
- Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia; School of Medicine, Flinders University, Bedford Park, SA 5042, Australia
- Bernhard T Baune
- Discipline of Psychiatry, School of Medicine, University of Adelaide, Adelaide, SA 5005, Australia
- Julio Licinio
- Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia; School of Medicine, Flinders University, Bedford Park, SA 5042, Australia
- Ma-Li Wong
- Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia; School of Medicine, Flinders University, Bedford Park, SA 5042, Australia
|
31
|
Yu C, Arcos-Burgos M, Licinio J, Wong ML. A latent genetic subtype of major depression identified by whole-exome genotyping data in a Mexican-American cohort. Transl Psychiatry 2017; 7:e1134. [PMID: 28509902 PMCID: PMC5534938 DOI: 10.1038/tp.2017.102] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 11/09/2016] [Revised: 04/04/2017] [Accepted: 04/10/2017] [Indexed: 02/07/2023] Open
Abstract
Identifying data-driven subtypes of major depressive disorder (MDD) is an important topic of psychiatric research. Currently, MDD subtypes are based on clinically defined depression symptom patterns. Although a few data-driven attempts have been made to identify more homogenous subgroups within MDD, other studies have not focused on using human genetic data for MDD subtyping. Here we used a computational strategy to identify MDD subtypes based on single-nucleotide polymorphism genotyping data from MDD cases and controls using Hamming distance and cluster analysis. We examined a cohort of Mexican-American participants from Los Angeles, including MDD patients (n=203) and healthy controls (n=196). The results in cluster trees indicate that a significant latent subtype exists in the Mexican-American MDD group. The individuals in this hidden subtype have increased common genetic substrates related to major depression and they also have more anxiety and less middle insomnia, depersonalization and derealisation, and paranoid symptoms. Advances in this line of research to validate this strategy in other patient groups of different ethnicities will have the potential to eventually be translated to clinical practice, with the tantalising possibility that in the future it may be possible to refine MDD diagnosis based on genetic data.
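The distance step of this strategy can be illustrated with a pairwise Hamming distance matrix over genotype vectors. The 0/1/2 genotype coding, the sample identifiers, and the function names below are hypothetical, and the cluster analysis applied to the resulting distances in the study is not reproduced.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length genotype vectors differ."""
    if len(a) != len(b):
        raise ValueError("genotype vectors must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(genotypes):
    """Symmetric dict of pairwise Hamming distances keyed by (sample, sample)."""
    dist = {(s, s): 0 for s in genotypes}
    for s, t in combinations(genotypes, 2):
        d = hamming(genotypes[s], genotypes[t])
        dist[(s, t)] = dist[(t, s)] = d
    return dist


# Hypothetical samples with four SNPs each, coded as minor-allele counts (assumption).
genotypes = {
    "s1": [0, 1, 2, 0],
    "s2": [0, 1, 1, 0],
    "s3": [2, 1, 1, 2],
}
dist = distance_matrix(genotypes)
```

A matrix of this form is the usual input to a hierarchical clustering routine, which would then produce cluster trees of the kind the abstract refers to.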
Affiliation(s)
- C Yu
  - Mind and Brain Theme, South Australian Health and Medical Research Institute, Adelaide, SA, Australia
  - School of Medicine, Flinders University, Bedford Park, Adelaide, SA, Australia
- M Arcos-Burgos
  - Department of Genome Sciences, John Curtin School of Medical Research, Australian National University, Canberra, ACT, Australia
  - University of Rosario International Institute of Translational Medicine, Bogota, Colombia
- J Licinio
  - Mind and Brain Theme, South Australian Health and Medical Research Institute, Adelaide, SA, Australia
  - School of Medicine, Flinders University, Bedford Park, Adelaide, SA, Australia
  - South Ural State University Biomedical School, Chelyabinsk, Russia
- M-L Wong
  - Mind and Brain Theme, South Australian Health and Medical Research Institute, Adelaide, SA, Australia
  - School of Medicine, Flinders University, Bedford Park, Adelaide, SA, Australia
|
32
|
Yin C, Yau SST. A coevolution analysis for identifying protein-protein interactions by Fourier transform. PLoS One 2017; 12:e0174862. [PMID: 28430779 PMCID: PMC5400233 DOI: 10.1371/journal.pone.0174862] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 11/23/2016] [Accepted: 03/16/2017] [Indexed: 12/29/2022] Open
Abstract
Protein-protein interactions (PPIs) play key roles in life processes such as signal transduction, transcription regulation, and immune response. Identification of PPIs enables better understanding of the functional networks within a cell. Common experimental methods for identifying PPIs are time-consuming and expensive. However, recent developments in computational approaches for inferring PPIs from protein sequences based on coevolution theory avoid these problems. In the coevolution theory model, interacting proteins may show coevolutionary mutations and have similar phylogenetic trees. The existing coevolution methods depend on multiple sequence alignments (MSA); however, the MSA-based coevolution methods often produce many false-positive interactions. In this paper, we present a computational method using an alignment-free approach to accurately detect PPIs and reduce false positives. In the method, protein sequences are numerically represented by biochemical properties of amino acids, which reflect the structural and functional differences of proteins. The Fourier transform is applied to the numerical representation of protein sequences to capture the dissimilarities of protein sequences in a biophysical context. The method is assessed for predicting PPIs in Ebola virus. The results indicate strong coevolution between the protein pairs (NP-VP24, NP-VP30, NP-VP40, VP24-VP30, VP24-VP40, and VP30-VP40). The method is also validated for PPIs in influenza and E. coli genomes. Since our method can reduce false positives and increase the specificity of PPI prediction, it offers an effective tool to understand mechanisms of disease pathogens and find potential targets for drug design. The Python programs in this study are publicly available at https://github.com/cyinbox/PPI.
Affiliation(s)
- Changchuan Yin
  - Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago, Chicago, IL 60607-7045, United States of America
- Stephen S.-T. Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
|
33
|
Pal J, Ghosh S, Maji B, Bhattacharya DK. WITHDRAWN: A Novel Way of Comparing Protein Sequences Represented Under Physio-Chemical Properties of their Amino Acids. Comput Biol Chem 2017. [DOI: 10.1016/j.compbiolchem.2017.04.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
34
|
Yu C, Baune BT, Licinio J, Wong ML. A novel strategy for clustering major depression individuals using whole-genome sequencing variant data. Sci Rep 2017; 7:44389. [PMID: 28287625 PMCID: PMC5347377 DOI: 10.1038/srep44389] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 11/02/2016] [Accepted: 02/07/2017] [Indexed: 12/01/2022] Open
Abstract
Major depressive disorder (MDD) is highly prevalent, resulting in an exceedingly high disease burden. The identification of generic risk factors could advance prevention and therapeutics. Current approaches examine genotyping data to identify specific variations between cases and controls. Compared to genotyping, whole-genome sequencing (WGS) allows for the detection of private mutations. In this proof-of-concept study, we establish a conceptually novel computational approach that clusters subjects based on the entirety of their WGS. Those clusters predicted MDD diagnosis. This strategy yielded encouraging results, showing that depressed Mexican-American participants were grouped closer together; in contrast, ethnically matched controls grouped away from MDD patients. This implies that, within the same ancestry, the WGS data of an individual can be used to check whether this individual is within or closer to MDD subjects or to controls. We propose a novel strategy to apply WGS data to clinical medicine by facilitating diagnosis through genetic clustering. Further studies utilising our method should examine larger WGS datasets on other ethnic groups.
Affiliation(s)
- Chenglong Yu
  - Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia
  - School of Medicine, Flinders University, Bedford Park, SA 5042, Australia
- Bernhard T. Baune
  - Discipline of Psychiatry, School of Medicine, University of Adelaide, Adelaide, SA 5005, Australia
- Julio Licinio
  - Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia
  - School of Medicine, Flinders University, Bedford Park, SA 5042, Australia
- Ma-Li Wong
  - Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia
  - School of Medicine, Flinders University, Bedford Park, SA 5042, Australia
|
35
|
Hou W, Pan Q, Peng Q, He M. A new method to analyze protein sequence similarity using Dynamic Time Warping. Genomics 2016; 109:123-130. [PMID: 27974244 PMCID: PMC7125777 DOI: 10.1016/j.ygeno.2016.12.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 08/12/2016] [Revised: 12/06/2016] [Accepted: 12/10/2016] [Indexed: 12/05/2022]
Abstract
Sequence similarity analysis is one of the major topics in bioinformatics. It helps researchers reveal the evolutionary relationships of different species. In this paper, we outline a new method to analyze the similarity of proteins by Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW). The original symbol sequences are converted to numerical sequences according to their physicochemical properties. We obtain the power spectra of the sequences from the DFT and extend the spectra to the same length to calculate the distance between different sequences by DTW. Our method is tested on different datasets and the results are compared with those of other software algorithms. In the comparison, we find that our scheme could correct some misclassifications that appear in other software. The comparison shows our approach is reasonable and effective. We propose a novel method to extract the features of the sequences based on the physicochemical properties of proteins. We apply the Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW) to analyze the similarity of proteins. Different datasets are used to demonstrate our model's effectiveness.
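The two building blocks named in this abstract, a DFT power spectrum of a numerically encoded protein and a DTW distance between spectra, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the abstract does not say which physicochemical properties were used, so the Kyte-Doolittle hydropathy scale stands in as one example; the step that extends spectra to a common length is omitted, because plain DTW already accepts inputs of different lengths; and the function names and toy peptides are hypothetical.

```python
import cmath

# Kyte-Doolittle hydropathy values, used here only as an example scale
# (assumption: the abstract does not name the properties the authors chose).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def power_spectrum(protein):
    """Encode residues numerically, then return |DFT|^2 at every frequency."""
    x = [HYDROPATHY[residue] for residue in protein]
    n = len(x)
    spectrum = []
    for k in range(n):
        coeff = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def dtw_distance(a, b):
    """Classic dynamic time warping with absolute difference as local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


# Hypothetical toy peptides of different lengths.
d = dtw_distance(power_spectrum("ACDEFGHIK"), power_spectrum("ACDEFGHK"))
print(d)
```

The naive DFT here costs quadratic time in the sequence length; an FFT routine would be the usual choice for real proteins.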
Affiliation(s)
- Wenbing Hou
  - School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
- Qiuhui Pan
  - School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian 116024, PR China; School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
- Qianying Peng
  - Department of Academics, Dalian Naval Academy, Dalian 116001, PR China
- Mingfeng He
  - School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China.
|
36
|
Hoang T, Yin C, Yau SST. Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison. Genomics 2016; 108:134-142. [PMID: 27538895 DOI: 10.1016/j.ygeno.2016.08.002] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Received: 03/07/2016] [Revised: 08/04/2016] [Accepted: 08/12/2016] [Indexed: 11/19/2022]
Abstract
Numerical encoding plays an important role in DNA sequence analysis via computational methods, in which numerical values are associated with corresponding symbolic characters. After numerical representation, digital signal processing methods can be exploited to analyze DNA sequences. To reflect the biological properties of the original sequence, it is vital that the representation is one-to-one. Chaos Game Representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane, allowing the DNA sequence to be depicted as an image. Using CGR, a biological sequence can be transformed one-to-one to a numerical sequence that preserves the main features of the original sequence. In this research, we propose to encode DNA sequences by considering 2D CGR coordinates as complex numbers, and apply digital signal processing methods to analyze their evolutionary relationship. Computational experiments indicate that this approach gives comparable results to the state-of-the-art multiple sequence alignment method, Clustal Omega, and is significantly faster. The MATLAB code for our method can be accessed from: www.mathworks.com/matlabcentral/fileexchange/57152.
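The CGR encoding described in this abstract can be sketched in a few lines of Python. This is a hedged illustration, not the authors' MATLAB code: the corner assignment (A, C, G, T on the unit square), the starting point at the centre, and the names `cgr_complex` and `decode` are assumptions chosen for the example. Each step moves halfway from the current point toward the corner of the next nucleotide, and the point is stored as a complex number.

```python
# Assumed corner layout on the unit square, written as complex numbers.
CORNERS = {
    "A": complex(0, 0),
    "C": complex(0, 1),
    "G": complex(1, 1),
    "T": complex(1, 0),
}


def cgr_complex(seq, start=complex(0.5, 0.5)):
    """Return the CGR trajectory of a DNA sequence as complex numbers."""
    z = start
    points = []
    for base in seq.upper():
        z = (z + CORNERS[base]) / 2  # move halfway toward the base's corner
        points.append(z)
    return points


def decode(points):
    """Recover the sequence: each point lies in the quadrant of its base."""
    quadrant = {
        (False, False): "A",
        (False, True): "C",
        (True, True): "G",
        (True, False): "T",
    }
    return "".join(quadrant[(z.real > 0.5, z.imag > 0.5)] for z in points)


points = cgr_complex("GATTACA")
print(points[:2])
print(decode(points))
```

The `decode` helper illustrates the one-to-one property the abstract stresses: the quadrant of every point identifies its nucleotide, so the trajectory determines the sequence. The resulting complex series is the kind of input to which a Fourier transform could then be applied.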
Affiliation(s)
- Tung Hoang
  - Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
- Changchuan Yin
  - Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.
|
37
|
Huang HH. An ensemble distance measure of k-mer and Natural Vector for the phylogenetic analysis of multiple-segmented viruses. J Theor Biol 2016; 398:136-44. [DOI: 10.1016/j.jtbi.2016.03.004] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 02/02/2016] [Revised: 02/25/2016] [Accepted: 03/02/2016] [Indexed: 11/29/2022]
|
38
|
Yin C, Wang J. Periodic power spectrum with applications in detection of latent periodicities in DNA sequences. J Math Biol 2016; 73:1053-1079. [PMID: 26942584 DOI: 10.1007/s00285-016-0982-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 05/21/2015] [Revised: 02/19/2016] [Indexed: 12/27/2022]
Abstract
Periodic elements play important roles in genomic structures and functions, yet some complex periodic elements in genomes are difficult to detect by conventional methods such as digital signal processing and statistical analysis. We propose a periodic power spectrum (PPS) method for analyzing periodicities of DNA sequences. The PPS method employs periodic nucleotide distributions of DNA sequences and directly calculates power spectra at specific periodicities. The magnitude of a PPS reflects the strength of a signal on periodic positions. In comparison with the Fourier transform, the PPS method avoids spectral leakage and reduces the background noise that appears high in the Fourier power spectrum. Thus, the PPS method can effectively capture hidden periodicities in DNA sequences. Using a sliding window approach, the PPS method can precisely locate periodic regions in DNA sequences. We apply the PPS method for detection of hidden periodicities in different genome elements, including exons, microsatellite DNA sequences, and whole genomes. The results show that the PPS method can minimize the impact of spectral leakage and thus capture true hidden periodicities in genomes. In addition, performance tests indicate that the PPS method is more effective and efficient than a fast Fourier transform. The computational complexity of the PPS algorithm is [Formula: see text]. Therefore, the PPS method may have a broad range of applications in genomic analysis. The MATLAB programs for implementing the PPS method are available from MATLAB Central (http://www.mathworks.com/matlabcentral/fileexchange/55298).
Affiliation(s)
- Changchuan Yin
  - Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL, 60607-7045, USA.
- Jiasong Wang
  - Department of Mathematics, Nanjing University, Nanjing, Jiangsu, 210093, China
|
39
|
An estimator for local analysis of genome based on the minimal absent word. J Theor Biol 2016; 395:23-30. [PMID: 26829314 DOI: 10.1016/j.jtbi.2016.01.023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/12/2015] [Revised: 01/17/2016] [Accepted: 01/19/2016] [Indexed: 11/22/2022]
Abstract
This study presents an alternative alignment-free relative feature analysis method based on the minimal absent word, which has potential advantages over the local alignment method in local analysis. A smooth local-analysis curve and a similarity distribution are constructed for a fast, efficient, and visual comparison. Moreover, when multi-sequence comparison is needed, the local-analysis curves can illustrate some interesting zones.
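The abstract builds on the notion of a minimal absent word: a word that does not occur in a sequence although every proper factor of it does. A brute-force Python sketch of that definition is given below for small examples. It does not reproduce the paper's estimator or its curves; the function name, the `max_len` cutoff, and the toy inputs are assumptions made for illustration.

```python
from itertools import product


def minimal_absent_words(seq, alphabet="ACGT", max_len=4):
    """Enumerate minimal absent words of `seq` up to length `max_len`.

    A word is minimal absent if it is missing from `seq` while its longest
    proper prefix and longest proper suffix both occur in `seq`.
    """
    # Collect every substring up to max_len, plus the empty word.
    factors = {""}
    for i in range(len(seq)):
        for j in range(i + 1, min(len(seq), i + max_len) + 1):
            factors.add(seq[i:j])

    maws = set()
    for length in range(1, max_len + 1):
        for letters in product(alphabet, repeat=length):
            word = "".join(letters)
            if word not in factors and word[1:] in factors and word[:-1] in factors:
                maws.add(word)
    return maws


print(sorted(minimal_absent_words("AAC", alphabet="AC", max_len=3)))
```

For the toy sequence "AAC" over the two-letter alphabet, the words CA, CC, and AAA are absent while all of their shorter factors are present. The enumeration grows exponentially with `max_len`, so it is only suitable for demonstrating the definition.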
|
40
|
Aflitos SA, Severing E, Sanchez-Perez G, Peters S, de Jong H, de Ridder D. Cnidaria: fast, reference-free clustering of raw and assembled genome and transcriptome NGS data. BMC Bioinformatics 2015; 16:352. [PMID: 26525298 PMCID: PMC4630969 DOI: 10.1186/s12859-015-0806-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 07/11/2015] [Accepted: 10/29/2015] [Indexed: 12/05/2022] Open
Abstract
Background: Identification of biological specimens is a requirement for a range of applications. Reference-free methods analyse unprocessed sequencing data without relying on prior knowledge, but generally do not scale to arbitrarily large genomes and arbitrarily large phylogenetic distances. Results: We present Cnidaria, a practical tool for clustering genomic and transcriptomic data with no limitation on genome size or phylogenetic distances. We successfully simultaneously clustered 169 genomic and transcriptomic datasets from 4 kingdoms, achieving 100% identification accuracy at supra-species level and 78% accuracy at the species level. Conclusion: CNIDARIA allows for fast, resource-efficient comparison and identification of both raw and assembled genome and transcriptome data. This can help answer both fundamental (e.g. in phylogeny, ecological diversity analysis) and practical questions (e.g. sequencing quality control, primer design).
Affiliation(s)
- Saulo Alves Aflitos
  - Applied Bioinformatics, Plant Research International, Wageningen, The Netherlands.
  - Bioinformatics Group, Department of Plant Sciences, Wageningen University, Wageningen, The Netherlands.
- Edouard Severing
  - Laboratory of Genetics, Wageningen University, Wageningen, The Netherlands.
- Gabino Sanchez-Perez
  - Applied Bioinformatics, Plant Research International, Wageningen, The Netherlands.
  - Bioinformatics Group, Department of Plant Sciences, Wageningen University, Wageningen, The Netherlands.
- Sander Peters
  - Applied Bioinformatics, Plant Research International, Wageningen, The Netherlands.
- Hans de Jong
  - Laboratory of Genetics, Wageningen University, Wageningen, The Netherlands.
- Dick de Ridder
  - Bioinformatics Group, Department of Plant Sciences, Wageningen University, Wageningen, The Netherlands.
|
41
|
Progressive alignment of genomic signals by multiple dynamic time warping. J Theor Biol 2015; 385:20-30. [PMID: 26300069 DOI: 10.1016/j.jtbi.2015.08.007] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 11/17/2014] [Revised: 07/21/2015] [Accepted: 08/03/2015] [Indexed: 11/22/2022]
Abstract
This paper presents the use of the progressive alignment principle for the positional adjustment of a set of genomic signals with different lengths. The new method for multiple alignment of signals, based on dynamic time warping, is tested for evaluating the similarity of genes of different lengths in phylogenetic studies. Two sets of phylogenetic markers were used to demonstrate the effectiveness of the evaluation of intraspecies and interspecies genetic variability. Part of the proposed method is a modification of the pairwise alignment of two signals by dynamic time warping that uses correlation in a sliding window. The correlation-based dynamic time warping allows more accurate alignment dependent on local homologies in sequences without the need for a scoring matrix or evolutionary models, because mutual similarities of residues are included in the numerical code of the signals.
|