1
Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024]
Abstract
Traditional alignment-based methods face serious challenges in genome sequence comparison and phylogeny reconstruction because of their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. For each DNA or protein sequence in a dataset, our method converts the sequence into a feature vector, based on CGR weighted by the DL model, that represents the sequence information and is used to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences from 8 virus datasets by constructing phylogenetic trees. For each dataset, we compared the Robinson-Foulds (RF) distance between the tree constructed by CGRWDL and the reference tree with the RF distances obtained by other advanced methods. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and that the RF scores between these trees and the reference trees are smaller than those obtained with other methods.
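As a rough illustration of the chaos game representation named in this abstract, the sketch below maps a DNA sequence to points in the unit square by repeatedly moving halfway toward the corner assigned to each nucleotide. The corner assignment, the starting point, and the function name `cgr_points` are assumptions made for illustration; the DL-model weighting used by CGRWDL is not reproduced here.

```python
# Corner assignment is one common convention and is an assumption here.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return the CGR trajectory of a DNA sequence as a list of (x, y) points."""
    x, y = start
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the nucleotide's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    print(cgr_points("AG"))
```

Each point depends on the whole prefix read so far, which is why CGR is often described as carrying context information about k-mers.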
Affiliation(s)
- Ting Wang
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Zu-Guo Yu
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Jinyan Li
  - School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China
2
Ferreira LM, Sáfadi T, Ferreira JL. K-mer applied in Mycobacterium tuberculosis genome cluster analysis. Braz J Biol 2024; 84:e258258. [DOI: 10.1590/1519-6984.258258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Accepted: 05/26/2022] [Indexed: 11/22/2022]
Abstract
According to previous studies, approximately 10 million people developed tuberculosis in 2018, and 1.5 million of them died from the disease. To study the behavior of the genome sequences of Mycobacterium tuberculosis (MTB), the bacterium responsible for tuberculosis (TB), an analysis was performed using k-mers (DNA word frequency). The k values ranged from 1 to 10. Because the analysis was performed on the full length of the sequences, each composed of approximately 4 million base pairs, the analysis is interrupted for k values above 10 as a consequence of the program's capacity. The aim of this work was to examine the phylogenetic tree formed for each k-mer analyzed. The results showed the formation of distinct groups for some of the k-mers analyzed, taking into account the threshold line. However, in all groups, the multidrug-resistant (MDR) and extensively drug-resistant (XDR) strains remained together and separated from the other strains.
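The k-mer (DNA word frequency) profile described above can be sketched as follows. This is a minimal illustration, not the program used in the study; the function name `kmer_profile` and the dense-vector layout are assumptions. It also shows that a dense profile has 4^k entries, so its size grows quickly with k.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k):
    """Return relative frequencies of all 4**k DNA words, in lexicographic order."""
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing symbols outside ACGT are ignored in the total.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


if __name__ == "__main__":
    profile = kmer_profile("ACGTACGT", 2)
    print(len(profile), profile[1])  # 16 words; index 1 is "AC"
    print(4 ** 10)  # number of entries in a dense profile for k = 10
```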
3
Ayad LAK, Chikhi R, Pissis SP. Seedability: optimizing alignment parameters for sensitive sequence comparison. Bioinformatics Advances 2023; 3:vbad108. [PMID: 37621456 PMCID: PMC10444664 DOI: 10.1093/bioadv/vbad108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Revised: 08/02/2023] [Accepted: 08/10/2023] [Indexed: 08/26/2023]
Abstract
Motivation: Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences. Results: The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments. Availability and implementation: https://github.com/lorrainea/Seedability (distributed under GPL v3.0).
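To make the notion of seeds concrete, the sketch below lists exact k-mer hits shared by two sequences. It is only an illustration of why the choice of k matters for divergent sequences; it is not Seedability's estimation procedure, and the function name `shared_seeds` is hypothetical.

```python
def shared_seeds(a, b, k):
    """Return (i, j) pairs where a[i:i+k] == b[j:j+k], i.e. exact k-mer hits."""
    # Index every k-mer of the first sequence by its start positions.
    index = {}
    for i in range(len(a) - k + 1):
        index.setdefault(a[i:i + k], []).append(i)
    hits = []
    # Look up every k-mer of the second sequence in that index.
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    a, b = "ACGTACGT", "ACGAACGT"  # one mismatch at position 3
    print(shared_seeds(a, b, 4))  # some seeds survive the mismatch
    print(shared_seeds(a, b, 5))  # a longer k leaves no exact hits
```

In this toy pair, lowering k from 5 to 4 is enough to recover anchors, which is the kind of sensitivity trade-off the abstract refers to.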
Affiliation(s)
- Lorraine A K Ayad
  - Department of Computer Science, Brunel University London, London UB8 3PH, UK
- Rayan Chikhi
  - G5 Sequence Bioinformatics, Institut Pasteur, Université Paris Cité, 75015 Paris, France
- Solon P Pissis
  - Networks & Optimization, CWI, 1098 XG Amsterdam, The Netherlands
  - Department of Computer Science, Vrije Universiteit, 1081 HV Amsterdam, The Netherlands
4
de Souza LC, Azevedo KS, de Souza JG, Barbosa RDM, Fernandes MAC. New proposal of viral genome representation applied in the classification of SARS-CoV-2 with deep learning. BMC Bioinformatics 2023; 24:92. [PMID: 36906520 PMCID: PMC10007673 DOI: 10.1186/s12859-023-05188-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2022] [Accepted: 02/15/2023] [Indexed: 03/13/2023]
Abstract
BACKGROUND: In December 2019, the first case of COVID-19 was described in Wuhan, China, and by July 2022, there were already 540 million confirmed cases. Due to the rapid spread of the virus, the scientific community has made efforts to develop techniques for the viral classification of SARS-CoV-2. RESULTS: In this context, we developed a new proposal for gene sequence representation with Genomic Signal Processing techniques for the work presented in this paper. First, we applied the mapping approach to samples of six viral species of the Coronaviridae family, to which SARS-CoV-2 belongs. We then used the downsized sequences obtained by the proposed method in a deep learning architecture for viral classification, achieving an accuracy of 98.35%, 99.08%, and 99.69% for the 64, 128, and 256 sizes of the viral signatures, respectively, and obtaining 99.95% precision for the vectors with size 256. CONCLUSIONS: The classification results obtained, in comparison to the results produced using other state-of-the-art representation techniques, demonstrate that the proposed mapping can provide satisfactory performance with low computational memory and processing time costs.
Affiliation(s)
- Luísa C. de Souza
  - Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
- Karolayne S. Azevedo
  - Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
- Jackson G. de Souza
  - Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
- Raquel de M. Barbosa
  - Department of Pharmacy and Pharmaceutical Technology, University of Granada, Granada, Spain
- Marcelo A. C. Fernandes
  - Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
  - Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
  - Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
5
Alakus TB, Baykara M. Comparison of Monkeypox and Wart DNA Sequences with Deep Learning Model. Applied Sciences 2022; 12:10216. [DOI: 10.3390/app122010216] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 09/01/2023]
Abstract
Following COVID-19, monkeypox has emerged and, within a short time, has been seen almost everywhere in the world. Monkeypox causes symptoms such as fever, chills, and headache, along with skin rashes and lumps. Because monkeypox is contagious, early diagnosis and treatment are of great importance. Detecting monkeypox usually requires expert interpretation and clinical examination, which may slow the treatment process. Furthermore, monkeypox is sometimes confused with warts, which leads to incorrect diagnosis and treatment. Because of these disadvantages, in this study, the DNA sequences of HPV, which causes warts, and MPV, which causes monkeypox, were analyzed, and these sequences were classified with a deep learning algorithm. The study consisted of four stages. In the first stage, DNA sequences of viruses that cause warts and monkeypox were obtained. In the second stage, these sequences were mapped using various DNA-mapping methods. In the third stage, the mapped sequences were classified using a deep learning algorithm. In the last stage, the performances of the DNA-mapping methods were compared by calculating accuracy and F1-score. At the end of the study, an average accuracy of 96.08% and an F1-score of 99.83% were obtained. These results showed that these two diseases can be effectively classified according to their DNA sequences.
6
Li W, Yang L, Qiu Y, Yuan Y, Li X, Meng Z. FFP: joint Fast Fourier transform and fractal dimension in amino acid property-aware phylogenetic analysis. BMC Bioinformatics 2022; 23:347. [PMID: 35986255 PMCID: PMC9392226 DOI: 10.1186/s12859-022-04889-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Accepted: 08/11/2022] [Indexed: 11/10/2022]
Abstract
Background
Amino acid property-aware phylogenetic analysis (APPA) refers to phylogenetic analysis based on amino acid property encoding, which is used for understanding and inferring evolutionary relationships between species from the molecular perspective. Fast Fourier transform (FFT) and Higuchi’s fractal dimension (HFD) perform well in describing the structural and complexity information of sequences for APPA. With the exponential growth of protein sequence data, however, it is very important to develop a reliable APPA method for protein sequence analysis.
Results
Consequently, we propose a new method named FFP, which combines FFT and HFD. First, FFP encodes protein sequences on the basis of an important physicochemical property of amino acids, the dissociation constant, which determines the acidity and basicity of protein molecules. Second, FFT and HFD are used to generate the feature vectors of the encoded sequences, and the distance matrix, which describes the degree of similarity between species, is then calculated from the cosine function. The smaller the distance between two species, the more similar they are. Finally, the phylogenetic tree is constructed. When FFP is tested for phylogenetic analysis on four groups of protein sequences, the results are clearly better than those of the compared methods, with the highest accuracy exceeding 97%.
Conclusion
FFP has higher accuracy in APPA and multi-sequence alignment. It can also measure protein sequence similarity effectively, and it is hoped that it will play a role in APPA-related research.
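The distance-matrix step mentioned in the Results can be sketched as below. The abstract only says the matrix is "calculated from the cosine function"; defining the distance as one minus the cosine similarity is an assumption here, and the feature vectors in the example are made-up placeholders rather than FFT or HFD output.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Return the pairwise cosine-distance matrix as a list of lists."""
    n = len(vectors)
    return [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Placeholder feature vectors, one per species (illustrative values only).
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in distance_matrix(features):
        print(row)
```

A matrix of this kind is what a tree-building step such as UPGMA or neighbor joining would take as input.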
7
Li W, Yang L, Meng Z, Qiu Y, Wang PSP, Li X. Phylogenetic Analysis: A Novel Method of Protein Sequence Similarity Analysis. Int J Pattern Recogn 2022. [DOI: 10.1142/s0218001422580071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Protein sequence similarity analysis (PSSA) is a significant task in bioinformatics, which can provide information about unknown sequences such as protein structures and homology relationships. A protein sequence is a series of amino acids with rich physical and chemical properties, namely the basic structure of a protein. However, sequence similarity analysis and phylogenetic analysis between different species that have complex amino acid sequences is a challenging problem. In this paper, nine properties of amino acids were considered and each sequence was converted into numerical values by principal component analysis (PCA). With the Haar wavelet transform and the Higuchi fractal dimension (HFD), a new feature vector was constructed to represent the sequence. The Spearman distance was selected to calculate the distance matrix, and the phylogenetic tree was constructed. Two representative sets of protein sequences (9 ND5 (NADH dehydrogenase 5) and 8 ND6 (NADH dehydrogenase 6)) were selected for similarity analysis and phylogenetic analysis, and the results were compared with MEGA software and other existing methods. The extensive results show that our method outperforms the others and that its results are consistent with known facts.
Affiliation(s)
- Wei Li
  - School of Computer, Electronics and Information, Guangxi University, Nanning, P. R. China
- Lina Yang
  - School of Computer, Electronics and Information, Guangxi University, Nanning, P. R. China
- Zuqiang Meng
  - School of Computer, Electronics and Information, Guangxi University, Nanning, P. R. China
- Yu Qiu
  - School of Computer, Electronics and Information, Guangxi University, Nanning, P. R. China
- Xichun Li
  - Guangxi Normal University for Nationalities, Chongzuo 532200, China
8
Thornton M, Mcgee M. Use of DFT Distance Metrics for Classification of SARS-CoV-2 Genomes. J Comput Biol 2022; 29:453-464. [PMID: 35325549 DOI: 10.1089/cmb.2021.0229] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
Abstract
In this work, we investigate using Fourier coefficients (FCs) for capturing useful information about viral sequences in a computationally efficient and compact manner. Specifically, we extract geographic submission location from SARS-CoV-2 sequence headers submitted to the GISAID Initiative, calculate corresponding FCs, and use the FCs to classify these sequences according to geographic location. We show that the FCs serve as useful numerical summaries for sequences that allow manipulation, identification, and differentiation via classical mathematical and statistical methods that are not readily applicable to character strings. Further, we argue that subsets of the FCs may be usable for the same purposes, which results in a reduction in storage requirements. We conclude by offering extensions of the research and potential future directions for subsequent analyses, such as the use of other series transforms for discretely indexed signals such as genomes.
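As a hedged sketch of how Fourier coefficients can summarize a nucleotide string, the code below converts a sequence into four binary indicator signals (one per base), takes a plain discrete Fourier transform of each, and sums the squared magnitudes into a power spectrum. The indicator encoding and the function names are assumptions; the abstract does not state which numerical mapping the authors use, and a production version would use an FFT library rather than this O(n^2) transform.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))
            for k in range(n)]


def indicator(seq, base):
    """Binary signal that is 1.0 where the sequence equals the given base."""
    return [1.0 if c == base else 0.0 for c in seq.upper()]


def power_spectrum(seq):
    """Sum of squared DFT magnitudes over the four base indicator signals."""
    spectra = [dft(indicator(seq, b)) for b in "ACGT"]
    return [sum(abs(s[k]) ** 2 for s in spectra) for k in range(len(seq))]


if __name__ == "__main__":
    print(power_spectrum("ACGTACGT"))
```

Keeping only the first few coefficients of such a spectrum is one way to obtain the reduced-storage summaries the abstract alludes to.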
Affiliation(s)
- Micah Thornton
  - Department of Statistical Science, Southern Methodist University, Dallas, Texas, USA
- Monnie Mcgee
  - Department of Statistical Science, Southern Methodist University, Dallas, Texas, USA
9
VanWallendael A, Alvarez M. Alignment-free methods for polyploid genomes: Quick and reliable genetic distance estimation. Mol Ecol Resour 2021; 22:612-622. [PMID: 34478242 DOI: 10.1111/1755-0998.13499] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/04/2020] [Accepted: 08/20/2021] [Indexed: 01/10/2023]
Abstract
Polyploid genomes pose several inherent challenges to population genetic analyses. While alignment-based methods are fundamentally limited in their applicability to polyploids, alignment-free methods bypass most of these limits. We investigated the use of Mash, a k-mer analysis tool that uses the MinHash method to reduce complexity in large genomic data sets, for basic population genetic analyses of polyploid sequences. We measured the degree to which Mash correctly estimated pairwise genetic distance in simulated haploid and polyploid short-read sequences with various levels of missing data. Mash-based estimates of genetic distance were comparable to alignment-based estimates, and were less impacted by missing data. We also used Mash to analyse publicly available short-read data for three polyploid and one diploid species, then compared Mash results to published results. For both simulated and real data, Mash accurately estimated pairwise genetic differences for polyploids as well as diploids as much as 476 times faster than alignment-based methods, though we found that Mash genetic distance estimates could be biased by per-sample read depth. Mash may be a particularly useful addition to the toolkit of polyploid geneticists for rapid confirmation of alignment-based results and for basic population genetics in reference-free systems or those with only poor-quality sequence data available.
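The MinHash idea behind Mash can be sketched in a few lines: keep only the smallest hash values of each sequence's k-mers, estimate the Jaccard index from the merged sketch, and convert it to a distance. This is an illustrative approximation under stated assumptions, not the Mash implementation: SHA-1 stands in for the hash function Mash actually uses, canonical (strand-independent) k-mers are not handled, and the function names are hypothetical. The distance formula follows the published Mash definition, D = -ln(2j / (1 + j)) / k.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer):
    # SHA-1 truncated to 64 bits; a stand-in chosen only for determinism.
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def sketch(seq, k, size):
    """Bottom-s sketch: the `size` smallest k-mer hash values."""
    return sorted(_hash(km) for km in kmers(seq, k))[:size]


def jaccard_estimate(sk_a, sk_b, size):
    """Estimate the Jaccard index from two bottom-s sketches."""
    a, b = set(sk_a), set(sk_b)
    merged = sorted(a | b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in a and h in b)
    return shared / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate into a Mash-style distance."""
    if j <= 0.0:
        return 1.0
    return -math.log(2.0 * j / (1.0 + j)) / k


if __name__ == "__main__":
    k, size = 4, 16
    s1 = sketch("ACGTACGTTGCAACGT", k, size)
    s2 = sketch("ACGTACGTTGCATCGT", k, size)
    j = jaccard_estimate(s1, s2, size)
    print(j, mash_distance(j, k))
```

Because only the sketches are compared, the cost per pair depends on the sketch size rather than on genome length, which is consistent with the speed-ups reported in the abstract.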
Affiliation(s)
- Acer VanWallendael
  - Department of Plant Biology, Michigan State University, East Lansing, MI, USA
- Mariano Alvarez
  - Biology Department, Wesleyan University, Middletown, CT, USA
10
Deformity Index: A Semi-Reference Clade-Based Quality Metric of Phylogenetic Trees. J Mol Evol 2021; 89:302-312. [PMID: 33811501 DOI: 10.1007/s00239-021-10006-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2020] [Accepted: 03/20/2021] [Indexed: 10/21/2022]
Abstract
Measuring the dissimilarity of a phylogenetic tree with respect to a reference tree or a set of hypotheses is a fundamental task in phylogenetic studies. A large number of methods have been proposed to compute the distance between the reference tree and the target tree. Due to the presence of unresolved relationships among the species, it is challenging to obtain a precise and accurate reference tree for a selected dataset. As a result, the existing tree comparison methods may behave unexpectedly in various scenarios. In this paper, we introduce a novel scoring function, called the deformity index, to quantify the dissimilarity of a tree based on the list of clades of a reference tree. The strength of our proposed method is that it depends only on a list of clades, which can be acquired either from the reference tree or from the hypotheses. We investigate the distributions of different modules of the deformity index and perform different goodness-of-fit tests to understand the cumulative distribution. Then, we examine, in detail, the robustness as well as the scalability of our measure by performing different statistical tests under various models. Finally, we experiment on different biological datasets and show that our proposed scoring function overcomes the limitations of the conventional methods.
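For context on the conventional tree comparison that the abstract contrasts with, the sketch below computes a Robinson-Foulds-style distance between two rooted trees as the number of clades found in one tree but not the other. It is not the deformity index, whose definition is not given in the abstract. Representing trees as nested tuples of leaf names is an assumption made to keep the example self-contained.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree given as nested tuples.

    Each clade is a frozenset of leaf names; single leaves and the full
    leaf set are excluded because every tree on the same taxa shares them.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)
    return found


def rf_distance(t1, t2):
    """Size of the symmetric difference between the two clade sets."""
    return len(clades(t1) ^ clades(t2))


if __name__ == "__main__":
    reference = ((("A", "B"), "C"), "D")
    target = ((("A", "C"), "B"), "D")
    print(rf_distance(reference, target))
```

The same `clades` helper could score a target tree against a bare list of expected clades, which is closer in spirit to the semi-reference setting described above.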
11
Kania A, Sarapata K. The robustness of the chaos game representation to mutations and its application in free-alignment methods. Genomics 2021; 113:1428-1437. [PMID: 33713823 DOI: 10.1016/j.ygeno.2021.03.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/22/2020] [Revised: 01/22/2021] [Accepted: 03/05/2021] [Indexed: 02/06/2023]
Abstract
Numerical representation of biological sequences plays an important role in bioinformatics and has many practical applications. One of the most popular approaches is the chaos game representation. In this paper, the authors propose a novel look into chaos game construction: an analytical description of this procedure. This description makes it possible to build more general number sequences using different weight functions. The authors suggest three conditions that these functions should satisfy. Additionally, they present some criteria to compare them and check whether they provide a unique representation. One of the most important advantages of the approach is the possibility of constructing a description that is less sensitive to mutations and, as a result, gives more reliable values for the construction of free-alignment phylogenetic trees. Finally, the authors applied the DFT method using four types of functions and compared the obtained results using the BLAST tool.
Affiliation(s)
- Adrian Kania
  - Department of Computational Biophysics and Bioinformatics, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, Gronostajowa 7, 30-387 Cracow, Poland
- Krzysztof Sarapata
  - Department of Computational Biophysics and Bioinformatics, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, Gronostajowa 7, 30-387 Cracow, Poland
12
A novel entropy-based mapping method for determining the protein-protein interactions in viral genomes by using coevolution analysis. Biomed Signal Process Control 2021. [DOI: 10.1016/j.bspc.2020.102359] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022]
13
A new graph-theoretic approach to determine the similarity of genome sequences based on nucleotide triplets. Genomics 2020; 112:4701-4714. [PMID: 32827671 PMCID: PMC7437474 DOI: 10.1016/j.ygeno.2020.08.023] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 09/09/2019] [Revised: 07/15/2020] [Accepted: 08/17/2020] [Indexed: 11/22/2022]
Abstract
Methods of finding sequence similarity play a significant role in computational biology. Owing to the rapid increase of genome sequences in public databases, determining the evolutionary relationships of species is becoming more challenging. Traditional alignment-based methods are found inappropriate due to their time-consuming nature. Therefore, it is necessary to find a faster method that applies to species phylogeny. In this paper, a new graph-theory based alignment-free sequence comparison method is proposed. A complete bipartite graph is used to represent each genome sequence based on its nucleotide triplets. Subsequently, with the help of the weights of the edges of the graph, a vector descriptor is formed. Finally, the phylogenetic tree is drawn using the UPGMA algorithm. The datasets used for comparison relate to mammals, viruses, and bacteria. In most cases, the resulting phylogeny is found to be more satisfactory than that of earlier methods.
- A new graph-theory based alignment-free genome sequence comparison.
- Use of a complete bipartite graph to represent genome sequences.
- Descriptor based on the weights of the edges of the graph.
- Comparison of the phylogenetic trees of different mammals, viruses, and bacteria.
- Less time complexity compared to that of earlier methods.
14
Positional Correlation Natural Vector: A Novel Method for Genome Comparison. Int J Mol Sci 2020; 21:ijms21113859. [PMID: 32485813 PMCID: PMC7312176 DOI: 10.3390/ijms21113859] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/13/2020] [Revised: 05/17/2020] [Accepted: 05/26/2020] [Indexed: 12/17/2022]
Abstract
Advances in sequencing technology have made large amounts of biological data available. Evolutionary analysis of data such as DNA sequences is highly important in biological studies. As alignment methods are ineffective for analyzing large-scale data due to their inherently high costs, alignment-free methods have recently attracted attention in the field of bioinformatics. In this paper, we introduce a new positional correlation natural vector (PCNV) method that involves converting a DNA sequence into an 18-dimensional numerical feature vector. Using frequency and position correlation to represent the nucleotide distribution, it is possible to obtain a PCNV for a DNA sequence. This new numerical vector design uses six suitable features to characterize the correlation among nucleotide positions in sequences. PCNV is also very easy to compute and can be used for rapid genome comparison. To test our novel method, we performed phylogenetic analysis with several viral and bacterial genome datasets with PCNV. For comparison, an alignment-based method, Bayesian inference, and two alignment-free methods, feature frequency profile and natural vector, were performed using the same datasets. We found that the PCNV technique is fast and accurate when used for phylogenetic analysis and classification of viruses and bacteria.
15
Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, Lau AK, Röhling S, Choi JJ, Waterman MS, Comin M, Kim SH, Vinga S, Almeida JS, Chan CX, James BT, Sun F, Morgenstern B, Karlowski WM. Benchmarking of alignment-free sequence comparison methods. Genome Biol 2019; 20:144. [PMID: 31345254 PMCID: PMC6659240 DOI: 10.1186/s13059-019-1755-7] [Citation(s) in RCA: 113] [Impact Index Per Article: 18.8] [Received: 05/16/2019] [Accepted: 07/03/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS: Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION: The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.
Affiliation(s)
- Andrzej Zielezinski
  - Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland
- Hani Z Girgis
  - Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
- Chris-Andre Leimeister
  - Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Kujin Tang
  - Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Thomas Dencker
  - Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Anna Katharina Lau
  - Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Sophie Röhling
  - Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Jae Jin Choi
  - Department of Chemistry, University of California, Berkeley, CA, 94720, USA
  - Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Michael S Waterman
  - Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
  - Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
- Matteo Comin
  - Department of Information Engineering, University of Padova, Padova, Italy
- Sung-Hou Kim
  - Department of Chemistry, University of California, Berkeley, CA, 94720, USA
  - Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Susana Vinga
  - INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
  - IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- Jonas S Almeida
  - Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NIH/NCI), Bethesda, USA
- Cheong Xin Chan
  - Institute for Molecular Bioscience, and School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia
- Benjamin T James
  - Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
- Fengzhu Sun
  - Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
  - Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
- Burkhard Morgenstern
  - Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Wojciech M Karlowski
  - Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland
16
Pei S, Dong R, He RL, Yau SST. Large-Scale Genome Comparison Based on Cumulative Fourier Power and Phase Spectra: Central Moment and Covariance Vector. Comput Struct Biotechnol J 2019; 17:982-994. [PMID: 31384399 PMCID: PMC6661692 DOI: 10.1016/j.csbj.2019.07.003] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 05/01/2019] [Revised: 06/24/2019] [Accepted: 07/10/2019] [Indexed: 01/04/2023]
Abstract
Genome comparison is a vital research area of bioinformatics. For large-scale genome comparisons, Multiple Sequence Alignment (MSA) methods are impractical because of their algorithmic complexity. In this study, we propose a novel alignment-free method based on the one-to-one correspondence between a DNA sequence and its complete central moment vector of the cumulative Fourier power and phase spectra. In addition, the covariance between the four nucleotides in the power and phase spectra is included. We use the cumulative Fourier power and phase spectra to define a 28-dimensional vector for each DNA sequence. Euclidean distances between the vectors measure the dissimilarity between DNA sequences. We test the method on datasets of different sizes and types, including simulated DNA sequences, exon-intron sequences, and complete genomes. The results show that our method is more accurate and efficient for hierarchical clustering than other alignment-free methods and MSA methods.
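As a rough illustration of the building block named in this abstract, the sketch below computes a cumulative Fourier power spectrum for one nucleotide's indicator sequence. It is a minimal sketch under assumptions: the binary indicator encoding and the function names (`indicator`, `dft`, `cumulative_power_spectrum`) are illustrative choices, and the central moment and covariance steps that yield the 28-dimensional vector are not reproduced here.

```python
import cmath


def indicator(seq, base):
    """Binary indicator sequence: 1.0 where `base` occurs, else 0.0 (assumed encoding)."""
    return [1.0 if ch == base else 0.0 for ch in seq.upper()]


def dft(signal):
    """Naive O(N^2) discrete Fourier transform, adequate for short examples."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def cumulative_power_spectrum(seq, base):
    """Running sum of the power spectrum |U(k)|^2 of the indicator sequence."""
    power = [abs(c) ** 2 for c in dft(indicator(seq, base))]
    total, cumulative = 0.0, []
    for p in power:
        total += p
        cumulative.append(total)
    return cumulative


cps_a = cumulative_power_spectrum("ACGTAA", "A")
```

By Parseval's relation, the last cumulative value equals the sequence length times the number of occurrences of the base, which gives a quick sanity check on the output.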
Affiliation(s)
- Shaojun Pei
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
| | - Rui Dong
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
| | - Rong Lucy He
- Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
| | - Stephen S.-T. Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
|
17
|
Randhawa GS, Hill KA, Kari L. ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels. BMC Genomics 2019; 20:267. [PMID: 30943897 PMCID: PMC6448311 DOI: 10.1186/s12864-019-5571-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 09/04/2018] [Accepted: 02/27/2019] [Indexed: 11/11/2022]
Abstract
Background: Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods. Results: We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed on two small benchmark datasets and one large dataset of 4322 vertebrate mtDNA genomes. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy and is faster overall. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the “Purine/Pyrimidine”, “Just-A” and “Real” numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes.
Conclusions: Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
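The signal-processing side of such a pipeline can be sketched as follows: encode each sequence numerically, take the magnitude of its discrete Fourier transform, and compare spectra with a correlation-based distance. Everything specific here is an assumption for illustration and is not taken from the abstract: the purine/pyrimidine values in `PP` (A and G to -1, C and T to +1), the `(1 - r) / 2` normalization of the Pearson distance, and the requirement that both sequences have equal length. The supervised classification step is omitted.

```python
import cmath
import math

# Assumed purine/pyrimidine encoding; the numeric values are illustrative.
PP = {"A": -1.0, "G": -1.0, "C": 1.0, "T": 1.0}


def magnitude_spectrum(seq, mapping=PP):
    """Magnitudes of the naive DFT of the numerically encoded sequence."""
    signal = [mapping[ch] for ch in seq.upper()]
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


def pearson_distance(u, v):
    """(1 - Pearson r) / 2, in [0, 1]; assumes equal lengths and non-constant inputs."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return (1.0 - cov / (su * sv)) / 2.0


m1 = magnitude_spectrum("ACGTACGT")
m2 = magnitude_spectrum("AAGTACGT")
d_same = pearson_distance(m1, m1)
d_diff = pearson_distance(m1, m2)
```

A pairwise matrix of such distances could then feed a classifier or a clustering step; real genomes of unequal length would first need length normalization, which this sketch does not attempt.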
Affiliation(s)
- Gurjit S Randhawa
- Department of Computer Science, University of Western Ontario, London, ON, Canada.
| | - Kathleen A Hill
- Department of Biology, University of Western Ontario, London, ON, Canada
| | - Lila Kari
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
|
18
|
Chen W, Liao B, Li W. Use of image texture analysis to find DNA sequence similarities. J Theor Biol 2018; 455:1-6. [DOI: 10.1016/j.jtbi.2018.07.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 11/14/2017] [Revised: 06/29/2018] [Accepted: 07/03/2018] [Indexed: 11/29/2022]
|
19
|
Zhao J, Wang J, Jiang H. Detecting Periodicities in Eukaryotic Genomes by Ramanujan Fourier Transform. J Comput Biol 2018; 25:963-975. [PMID: 29963923 DOI: 10.1089/cmb.2017.0252] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022]
Abstract
Ramanujan Fourier transform (RFT) is becoming a popular signal processing method. In this article, RFT is used to detect periodicities in exons and introns of eukaryotic genomes. Genomic sequences of nine species were analyzed. The highest peak in the spectrum amplitude corresponding to each exon or intron is regarded as the significant signal, and the periodicity corresponding to that signal is regarded as a valuable periodicity. Exons and introns show different periodic phenomena. The computational results reveal that, based on RFT, the 2-, 3-, 4-, and 6-base periodicities are the four important periodicities of exons and introns. This is the first time that the 2-base periodicity of introns has been discovered through a signal processing method. The occurrence frequencies of the 2-base and 3-base periodicities are polar opposites between exons and introns. Because of the cyclicality of the Ramanujan sums, which are the basis functions of the transformation, RFT is suggested for studying the periodic features of dinucleotides, trinucleotides, and q nucleotides.
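The Ramanujan sums mentioned here have the standard definition c_q(n) = Σ cos(2πan/q), summed over 1 ≤ a ≤ q with gcd(a, q) = 1. The sketch below evaluates them and uses them to score candidate periods in a toy signal. The `period_score` function is a hypothetical, unnormalized projection chosen for illustration; it is not the transform or peak-picking rule used in the article.

```python
import math


def ramanujan_sum(q, n):
    """c_q(n): sum of cos(2*pi*a*n/q) over a coprime to q; always an integer."""
    total = sum(
        math.cos(2 * math.pi * a * n / q)
        for a in range(1, q + 1)
        if math.gcd(a, q) == 1
    )
    return round(total)


def period_score(signal, q):
    """Illustrative score: magnitude of the mean projection of the signal onto c_q."""
    n = len(signal)
    return abs(sum(x * ramanujan_sum(q, t) for t, x in enumerate(signal))) / n


# Toy indicator signal with an exact period of 3.
signal = [1, 0, 0] * 4
scores = {q: period_score(signal, q) for q in range(1, 5)}
best_q = max(scores, key=scores.get)
```

Because c_q(n) is periodic in n with period q, a signal that repeats every q positions projects strongly onto c_q, which is the intuition behind using these sums to look for base periodicities.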
Affiliation(s)
- Jian Zhao
- Department of Mathematics, Nanjing Tech University, Nanjing, China; Department of Statistics, Northwestern University, Evanston, Illinois
| | - Jiasong Wang
- Department of Mathematics, Nanjing University, Nanjing, China
| | - Hongmei Jiang
- Department of Statistics, Northwestern University, Evanston, Illinois
|
20
|
Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F. Alignment-Free Sequence Analysis and Applications. Annu Rev Biomed Data Sci 2018; 1:93-114. [PMID: 31828235 PMCID: PMC6905628 DOI: 10.1146/annurev-biodatasci-080917-013431] [Citation(s) in RCA: 58] [Impact Index Per Article: 8.3] [Indexed: 01/06/2023]
Abstract
Genome and metagenome comparisons based on large amounts of next generation sequencing (NGS) data pose significant challenges for alignment-based approaches due to the huge data size and the relatively short length of the reads. Alignment-free approaches based on the counts of word patterns in NGS data do not depend on the complete genome and are generally computationally efficient. Thus, they contribute significantly to genome and metagenome comparison. Recently, novel statistical approaches have been developed for the comparison of both long and shotgun sequences. These approaches have been applied to many problems including the comparison of gene regulatory regions, genome sequences, metagenomes, binning contigs in metagenomic data, identification of virus-host interactions, and detection of horizontal gene transfers. We provide an updated review of these applications and other related developments of word-count based approaches for alignment-free sequence analysis.
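To make the idea of word-count based comparison concrete, the sketch below counts k-mers in two sequences and computes the D2 statistic, the inner product of the two k-mer count vectors. D2 is chosen here only as a familiar example of a word-count statistic; the abstract does not name a specific measure, and the function names are illustrative.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x, y, k):
    """D2 statistic: sum over shared k-mers of count_x(w) * count_y(w)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(cx[w] * cy[w] for w in cx if w in cy)


shared = d2("ACGT", "ACGA", 2)
```

D2 is a similarity rather than a distance, so larger values indicate more shared words; it also grows with sequence length, which is one reason normalized variants exist.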
Affiliation(s)
- Jie Ren
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Xin Bai
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| | - Yang Young Lu
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Kujin Tang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen, Fujian, China
| | - Gesine Reinert
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Fengzhu Sun
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
|
21
|
Abstract
With the sharp increase in biological sequence data, traditional sequence alignment methods have become unsuitable and infeasible. This has motivated a surge of fast alignment-free techniques for sequence analysis. Among these, many kinds of feature vector methods have been established and applied to the reconstruction of species phylogeny. The vectors consist of typical numerical features for certain biological problems. The features may come from the primary sequences or from the secondary or three-dimensional structures of macromolecules. In this study, we propose a novel numerical vector based only on the primary sequences of organisms to build their phylogeny. Three chemical and physical properties of primary sequences (purine, pyrimidine and keto) are also incorporated into the vector. Using each property, we convert the nucleotide sequence into a new sequence consisting of only two kinds of letters, so three sequences are constructed according to the three properties. For each letter of each sequence, we calculate the number of occurrences of the letter, its average position, and the variation of its positions in the sequence. Tested on several datasets related to mammals, viruses and bacteria, this new tool is fast and accurate for inferring the phylogeny of organisms.
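The construction described above can be sketched as an 18-dimensional vector: three two-letter recodings, two letters each, three statistics per letter. The sketch makes several assumptions that the abstract does not settle: the three recodings are taken to be purine/pyrimidine (AG vs CT), amino/keto (AC vs GT) and weak/strong (AT vs CG), positions are 1-based, and the "variation" is computed as the population variance.

```python
# Assumed two-class groupings; each value lists the bases mapped to the first letter.
PROPERTIES = {
    "purine/pyrimidine": "AG",
    "amino/keto": "AC",
    "weak/strong": "AT",
}


def position_stats(positions):
    """Return (count, mean position, population variance of positions)."""
    n = len(positions)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(positions) / n
    variance = sum((p - mean) ** 2 for p in positions) / n
    return (n, mean, variance)


def feature_vector(seq):
    """18 values: for each property, statistics of the first class then the second."""
    seq = seq.upper()
    vector = []
    for first_class in PROPERTIES.values():
        first = [i for i, ch in enumerate(seq, 1) if ch in first_class]
        second = [i for i, ch in enumerate(seq, 1) if ch in "ACGT" and ch not in first_class]
        vector.extend(position_stats(first))
        vector.extend(position_stats(second))
    return vector


vec = feature_vector("AGCT")
```

Vectors built this way have a fixed length regardless of genome size, so pairwise distances between them (for example Euclidean) can be passed directly to a tree-building method.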
|
22
|
Hou W, Pan Q, Peng Q, He M. A new method to analyze protein sequence similarity using Dynamic Time Warping. Genomics 2016; 109:123-130. [PMID: 27974244 PMCID: PMC7125777 DOI: 10.1016/j.ygeno.2016.12.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 08/12/2016] [Revised: 12/06/2016] [Accepted: 12/10/2016] [Indexed: 12/05/2022]
Abstract
Sequence similarity analysis is one of the major topics in bioinformatics. It helps researchers to reveal the evolutionary relationships of different species. In this paper, we outline a new method to analyze the similarity of proteins by Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW). The original symbol sequences are converted to numerical sequences according to their physicochemical properties. We obtain the power spectra of the sequences from DFT and extend the spectra to the same length to calculate the distance between different sequences by DTW. Our method is tested on different datasets and the results are compared with those of other software algorithms. In the comparison, we find that our scheme can correct some wrong classifications that appear in other software. The comparison shows that our approach is reasonable and effective. We propose a novel method to extract the features of the sequences based on the physicochemical properties of proteins. We apply the Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW) to analyze the similarity of proteins. Different datasets are used to demonstrate our model's effectiveness.
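The DTW step can be sketched with the classic dynamic-programming recurrence. This is a generic textbook version under assumptions: the local cost is the absolute difference, no warping window is applied, and the inputs are plain numeric lists standing in for power spectra. The physicochemical encoding and the spectrum-extension step described in the abstract are not reproduced.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping cost between two numeric sequences (absolute-difference cost)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

The first call returns zero because DTW absorbs the repeated value by warping, whereas the second cannot be reduced by warping since every pairing differs by one.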
Affiliation(s)
- Wenbing Hou
- School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
| | - Qiuhui Pan
- School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian 116024, PR China; School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
| | - Qianying Peng
- Department of Academics, Dalian Naval Academy, Dalian 116001, PR China
| | - Mingfeng He
- School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China.
|
23
|
Hoang T, Yin C, Yau SST. Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison. Genomics 2016; 108:134-142. [PMID: 27538895 DOI: 10.1016/j.ygeno.2016.08.002] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.2] [Received: 03/07/2016] [Revised: 08/04/2016] [Accepted: 08/12/2016] [Indexed: 11/19/2022]
Abstract
Numerical encoding plays an important role in DNA sequence analysis via computational methods, in which numerical values are associated with corresponding symbolic characters. After numerical representation, digital signal processing methods can be exploited to analyze DNA sequences. To reflect the biological properties of the original sequence, it is vital that the representation is one-to-one. Chaos Game Representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a position on the plane, which allows the DNA sequence to be depicted as an image. Using CGR, a biological sequence can be transformed one-to-one to a numerical sequence that preserves the main features of the original sequence. In this research, we propose to encode DNA sequences by considering 2D CGR coordinates as complex numbers, and apply digital signal processing methods to analyze their evolutionary relationship. Computational experiments indicate that this approach gives comparable results to the state-of-the-art multiple sequence alignment method, Clustal Omega, and is significantly faster. The MATLAB code for our method can be accessed from: www.mathworks.com/matlabcentral/fileexchange/57152.
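The CGR iteration described above can be sketched as follows, with each 2D point stored as a complex number. The corner assignment (A at 0, C at i, G at 1+i, T at 1) and the starting point at the centre of the unit square are common conventions assumed here; the authors' MATLAB code may use different ones.

```python
# Assumed corner layout on the unit square, written as complex numbers.
CORNERS = {"A": 0 + 0j, "C": 0 + 1j, "G": 1 + 1j, "T": 1 + 0j}


def cgr_complex(seq):
    """Return the CGR trajectory: each point is the midpoint of the previous point and the base's corner."""
    point = 0.5 + 0.5j  # assumed starting point: centre of the square
    trajectory = []
    for ch in seq.upper():
        point = 0.5 * (point + CORNERS[ch])
        trajectory.append(point)
    return trajectory


points = cgr_complex("AC")
```

The resulting complex sequence has one value per nucleotide, so it can be passed to a discrete Fourier transform in the same way as any other numerical encoding.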
Affiliation(s)
- Tung Hoang
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
| | - Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
| | - Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.
|
24
|
Huang HH, Yu C. Clustering DNA sequences using the out-of-place measure with reduced n-grams. J Theor Biol 2016; 406:61-72. [PMID: 27375217 DOI: 10.1016/j.jtbi.2016.06.029] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 12/10/2015] [Revised: 05/18/2016] [Accepted: 06/21/2016] [Indexed: 11/25/2022]
Abstract
The alignment-free n-gram based method with the out-of-place measure as the distance has been successfully applied to real-time automatic categorization of text and natural languages. However, its performance and the choice of n for comparing genome sequences remain unclear. Here we propose a symmetric version of the out-of-place measure and a new approach for finding the optimal range of n to construct a phylogenetic tree with the symmetric out-of-place measure. Our method is then applied to real genome sequence datasets. The resulting phylogenetic trees match the standard biological classification. This shows that our proposed method is a powerful tool for phylogenetic analysis in terms of both classification accuracy and computational efficiency.
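The out-of-place measure from text categorization ranks the n-grams of each sequence by frequency and sums the rank differences. The sketch below implements that idea and symmetrizes it by adding the two directions. The symmetrization, the alphabetical tie-breaking, and the penalty for n-grams missing from the other profile are all assumptions for illustration; the paper's exact symmetric definition and its n-gram reduction are not reproduced.

```python
from collections import Counter


def ngram_ranks(seq, n):
    """Map each n-gram to its rank (0 = most frequent); ties broken alphabetically."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    return {gram: rank for rank, gram in enumerate(ordered)}


def out_of_place(ranks_a, ranks_b):
    """Sum of rank displacements of a's n-grams in b; absent n-grams get a fixed penalty."""
    penalty = max(len(ranks_a), len(ranks_b))  # assumed penalty for missing n-grams
    return sum(
        abs(rank - ranks_b[gram]) if gram in ranks_b else penalty
        for gram, rank in ranks_a.items()
    )


def symmetric_out_of_place(a, b, n):
    """Assumed symmetrization: add the measure computed in both directions."""
    ra, rb = ngram_ranks(a, n), ngram_ranks(b, n)
    return out_of_place(ra, rb) + out_of_place(rb, ra)


dist = symmetric_out_of_place("ACGTACGT", "AAAACCCC", 2)
```

Summing both directions guarantees that the distance from a to b equals the distance from b to a, which is required before the values can be used in a distance matrix for tree construction.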
Affiliation(s)
- Hsin-Hsiung Huang
- Department of Statistics, University of Central Florida, Orlando, FL 32816, USA.
| | - Chenglong Yu
- Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia; School of Medicine, Flinders University, Adelaide, SA 5001, Australia
|
25
|
An estimator for local analysis of genome based on the minimal absent word. J Theor Biol 2016; 395:23-30. [PMID: 26829314 DOI: 10.1016/j.jtbi.2016.01.023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 10/12/2015] [Revised: 01/17/2016] [Accepted: 01/19/2016] [Indexed: 11/22/2022]
Abstract
This study presents an alternative alignment-free relative feature analysis method based on the minimal absent word, which has potential advantages over the local alignment method in local analysis. A smooth local-analysis curve and a similarity distribution are constructed for fast, efficient, and visual comparison. Moreover, when multi-sequence comparison is needed, the local-analysis curves can highlight some interesting zones.
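For readers unfamiliar with the term, a minimal absent word of a sequence is a word that does not occur in it although its longest proper prefix and longest proper suffix both do. The brute-force sketch below enumerates such words of length at least two; it is meant only to illustrate the definition on short strings and says nothing about the estimator or curves built in the study. The function names and the default DNA alphabet are illustrative.

```python
def factors(s, max_len):
    """All substrings of s with length between 1 and max_len."""
    return {s[i:i + size] for size in range(1, max_len + 1) for i in range(len(s) - size + 1)}


def minimal_absent_words(s, alphabet="ACGT"):
    """Words w (len >= 2) absent from s whose prefix w[:-1] and suffix w[1:] both occur in s."""
    present = factors(s, len(s))
    result = set()
    for prefix in present:
        for letter in alphabet:
            word = prefix + letter
            if word not in present and word[1:] in present:
                result.add(word)
    return result


maws = minimal_absent_words("ACA")
```

This enumeration is quadratic in memory because it stores every substring, so it is suitable for small examples only; genome-scale work relies on suffix-based index structures instead.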
|