1
|
Tao Y, Ge S. A distribution-guided Mapper algorithm. BMC Bioinformatics 2025; 26:73. [PMID: 40045218 PMCID: PMC11881416 DOI: 10.1186/s12859-025-06085-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2024] [Accepted: 02/14/2025] [Indexed: 03/09/2025] Open
Abstract
BACKGROUND The Mapper algorithm is an essential tool for exploring the data shape in topological data analysis. With a dataset as an input, the Mapper algorithm outputs a graph representing the topological features of the whole dataset. This graph is often regarded as an approximation of a Reeb graph of a dataset. The classic Mapper algorithm uses fixed interval lengths and overlapping ratios, which might fail to reveal subtle features of a dataset, especially when the underlying structure is complex. RESULTS In this work, we introduce a distribution-guided Mapper algorithm named D-Mapper, which utilizes the property of the probability model and data intrinsic characteristics to generate density-guided covers and provide enhanced topological features. Moreover, we introduce a metric accounting for both the quality of overlap clustering and extended persistent homology to measure the performance of Mapper-type algorithms. Our numerical experiments indicate that the D-Mapper outperforms the classic Mapper algorithm in various scenarios. We also apply the D-Mapper to a SARS-COV-2 coronavirus RNA sequence dataset to explore the topological structure of different virus variants. The results indicate that the D-Mapper algorithm can reveal both the vertical and horizontal evolutionary processes of the viruses. Our code is available at https://github.com/ShufeiGe/D-Mapper . CONCLUSION The D-Mapper algorithm can generate covers from data based on a probability model. This work demonstrates the power of fusing probabilistic models with Mapper algorithms.
Collapse
Affiliation(s)
- Yuyang Tao
- Institute of Mathematical Sciences, ShanghaiTech University, 393 Middle Huaxia Road, 201210, Shanghai, China
| | - Shufei Ge
- Institute of Mathematical Sciences, ShanghaiTech University, 393 Middle Huaxia Road, 201210, Shanghai, China.
| |
Collapse
|
2
|
Filar JA, Holden MH, Mendiolar M, Streipert SH. Overcoming the impossibility of age-balanced harvest. Math Biosci 2024; 367:109111. [PMID: 37996065 DOI: 10.1016/j.mbs.2023.109111] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2023] [Revised: 10/30/2023] [Accepted: 11/20/2023] [Indexed: 11/25/2023]
Abstract
In many countries, sustainability targets for managed fisheries are often expressed in terms of a fixed percentage of the carrying capacity. Despite the appeal of such a simple quantitative target, an unintended consequence may be a significant tilting of the proportions of biomass across different ages, from what they would have been under harvest-free conditions. Within the framework of a widely used age-structured model, we propose a novel quantitative definition of "age-balanced harvest" that considers the age-class composition relative to that of the unfished population. We show that achieving a perfectly age-balanced policy is impossible if we harvest any fish whatsoever. However, every non-trivial harvest policy has a special structure that favours the young. To quantify the degree of age-imbalance, we propose a cross-entropy function. We formulate an optimisation problem that aims to attain an "age-balanced steady state", subject to adequate yield. We demonstrate that near balanced harvest policies are achievable by sacrificing a small amount of yield. These findings have important implications for sustainable fisheries management by providing insights into trade-offs and harvest policy recommendations.
Collapse
Affiliation(s)
- Jerzy A Filar
- School of Mathematics and Physics, The University of Queensland, Australia.
| | - Matthew H Holden
- School of Mathematics and Physics, The University of Queensland, Australia.
| | - Manuela Mendiolar
- School of Mathematics and Physics, The University of Queensland, Australia.
| | | |
Collapse
|
3
|
Munagala NVTS, Amanchi PK, Balasubramanian K, Panicker A, Nagaraj N. Compression-Complexity Measures for Analysis and Classification of Coronaviruses. ENTROPY (BASEL, SWITZERLAND) 2022; 25:81. [PMID: 36673224 PMCID: PMC9857615 DOI: 10.3390/e25010081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/05/2022] [Revised: 12/10/2022] [Accepted: 12/18/2022] [Indexed: 06/17/2023]
Abstract
Finding a vaccine or specific antiviral treatment for a global pandemic of virus diseases (such as the ongoing COVID-19) requires rapid analysis, annotation and evaluation of metagenomic libraries to enable a quick and efficient screening of nucleotide sequences. Traditional sequence alignment methods are not suitable and there is a need for fast alignment-free techniques for sequence analysis. Information theory and data compression algorithms provide a rich set of mathematical and computational tools to capture essential patterns in biological sequences. In this study, we investigate the use of compression-complexity (Effort-to-Compress or ETC and Lempel-Ziv or LZ complexity) based distance measures for analyzing genomic sequences. The proposed distance measure is used to successfully reproduce the phylogenetic trees for a mammalian dataset consisting of eight species clusters, a set of coronaviruses belonging to group I, group II, group III, and SARS-CoV-1 coronaviruses, and a set of coronaviruses causing COVID-19 (SARS-CoV-2), and those not causing COVID-19. Having demonstrated the usefulness of these compression complexity measures, we employ them for the automatic classification of COVID-19-causing genome sequences using machine learning techniques. Two flavors of SVM (linear and quadratic) along with linear discriminant and fine K Nearest Neighbors classifer are used for classification. Using a data set comprising 1001 coronavirus sequences (causing COVID-19 and those not causing COVID-19), a classification accuracy of 98% is achieved with a sensitivity of 95% and a specificity of 99.8%. This work could be extended further to enable medical practitioners to automatically identify and characterize coronavirus strains and their rapidly growing mutants in a fast and efficient fashion.
Collapse
Affiliation(s)
- Naga Venkata Trinath Sai Munagala
- Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India
| | - Prem Kumar Amanchi
- Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India
| | - Karthi Balasubramanian
- Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India
| | - Athira Panicker
- Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India
| | - Nithin Nagaraj
- Consciousness Studies Programme, National Institute of Advanced Studies, Bengaluru 560012, Karnataka, India
| |
Collapse
|
4
|
Zakarczemny M, Zajęcka M. Note on DNA Analysis and Redesigning Using Markov Chain. Genes (Basel) 2022; 13:genes13030554. [PMID: 35328107 PMCID: PMC8953815 DOI: 10.3390/genes13030554] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2022] [Revised: 03/07/2022] [Accepted: 03/09/2022] [Indexed: 02/04/2023] Open
Abstract
The paper contains a discussion on mathematical modifying and redesigning DNA with the use of Markov chains. We give a simple mathematical technique for overwriting missing parts of DNA. With a certain probability (without even knowing the function of the missing codon) we can find a synonymous codon, so that there is no frequency change in amino acid sequences of proteins. We use Markov Chain to analyze the dependencies in DNA sequence of the human gene Alpha 1,3-Galactosyltransfe rase 2. We include a theoretical introduction which facilitates the understanding of the paper for non-mathematicians, especially for biologists not familiar with the theory of Markov chains.
Collapse
|
5
|
Qian Y, Zhang Y, Zhang J. Alignment-Free Sequence Comparison With Multiple k Values. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1841-1849. [PMID: 31765317 DOI: 10.1109/tcbb.2019.2955081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Alignment-free sequence comparison approaches have become increasingly popular in computational biology, because alignment-based approaches are inefficient to process large-scale datasets. Still, there is no way to determine the optimal value of the critical parameter k for alignment-free approaches in general. In this article, we tried to solve the problem by involving multiple k values simultaneously. The method counts the occurrence of each k-mer with different k values in a sequence. Two weighting schemes, based on maximizing deviation method and genetic algorithm, are then used on these counts. We applied the method to enhance the three common alignment-free approaches D2, D2S, and D2*, and evaluated its performance on similarity search and functionally related regulatory sequences recognition. The enhanced approaches achieve better performance than the original approaches in all cases, and much better performance than some other common measures, such as Pcc, Eu, Ma, Ch, Kld, and Cos.
Collapse
|
6
|
Ramanathan N, Ramamurthy J, Natarajan G. Numerical Characterization of DNA Sequences for Alignment-free Sequence Comparison - A Review. Comb Chem High Throughput Screen 2021; 25:365-380. [PMID: 34382516 DOI: 10.2174/1386207324666210811101437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2020] [Revised: 06/16/2021] [Accepted: 06/24/2021] [Indexed: 11/22/2022]
Abstract
BACKGROUND Biological macromolecules namely, DNA, RNA, and protein have their building blocks organized in a particular sequence and the sequential arrangement encodes evolutionary history of the organism (species). Hence, biological sequences have been used for studying evolutionary relationships among the species. This is usually carried out by multiple sequence algorithms (MSA). Due to certain limitations of MSA, alignment-free sequence comparison methods were developed. The present review is on alignment-free sequence comparison methods carried out using numerical characterization of DNA sequences. <P> Discussion: The graphical representation of DNA sequences by chaos game representation and other 2-dimesnional and 3-dimensional methods are discussed. The evolution of numerical characterization from the various graphical representations and the application of the DNA invariants thus computed in phylogenetic analysis is presented. The extension of computing molecular descriptors in chemometrics to the calculation of new set of DNA invariants and their use in alignment-free sequence comparison in a N-dimensional space and construction of phylogenetic tress is also reviewed. <P> Conclusion: The phylogenetic tress constructed by the alignment-free sequence comparison methods using DNA invariants were found to be better than those constructed using alignment-based tools such as PHLYIP and ClustalW. One of the graphical representation methods is now extended to study viral sequences of infectious diseases for the identification of conserved regions to design peptide-based vaccine by combining numerical characterization and graphical representation.
Collapse
Affiliation(s)
- Natarajan Ramanathan
- Department of Chemistry, Sri Sarada Niketan College for Women, Karur-639005, Tamil Nadu. India
| | - Jayalakshmi Ramamurthy
- Department of Computer Science, Sri Sarada Niketan College for Women, Karur-639005, Tamil Nadu. India
| | - Ganapathy Natarajan
- Department of Mechanical Engineering and Industrial Engineering, University of Wisconsin, Platteville, WI 53818. United States
| |
Collapse
|
7
|
Lee S, Lee T, Noh YK, Kim S. Ranked k-Spectrum Kernel for Comparative and Evolutionary Comparison of Exons, Introns, and CpG Islands. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1174-1183. [PMID: 31494555 DOI: 10.1109/tcbb.2019.2938949] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
MOTIVATION Existing k-mer based string kernel methods have been successfully used for sequence comparison. However, existing kernel methods have limitations for comparative and evolutionary comparisons of genomes due to the sensitiveness to over-represented k-mers and variable sequence lengths. RESULTS In this study, we propose a novel ranked k-spectrum string (RKSS) kernel. 1) RKSS kernel utilizes common k-mer sets across species, named landmarks, that can be used for comparing multiple genomes. 2) Based on the landmarks, we can use ranks of k-mers, rather than frequencies, that can produce more robust distances between genomes. To show the power of RKSS kernel, we conducted two experiments using 10 mammalian species with exon, intron, and CpG island sequences. RKSS kernel reconstructed more consistent evolutionary trees than the k-spectrum string kernel. In the subsequent experiment, for each sequence, kernel distance was calculated from 30 landmarks representing exon, intron, and CpG island sequences of 10 genomes. Based on kernel distances, concordance tests were performed and the result suggested that more information is conserved in CpG islands across species than in introns. In conclusion, our analysis suggests that the relational order, exon CpG island intron, in terms of evolutionary information contents.
Collapse
|
8
|
Luczak BB, James BT, Girgis HZ. A survey and evaluations of histogram-based statistics in alignment-free sequence comparison. Brief Bioinform 2020; 20:1222-1237. [PMID: 29220512 PMCID: PMC6781583 DOI: 10.1093/bib/bbx161] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2017] [Revised: 10/13/2017] [Indexed: 11/29/2022] Open
Abstract
Motivation Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences. Results We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover’s distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours. Availability The source code of the benchmarking tool is available as Supplementary Materials.
Collapse
Affiliation(s)
| | | | - Hani Z Girgis
- Corresponding author. Hani Z. Girgis, Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA. E-mail:
| |
Collapse
|
9
|
An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes. PLoS One 2018; 13:e0206409. [PMID: 30427878 PMCID: PMC6235296 DOI: 10.1371/journal.pone.0206409] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2018] [Accepted: 10/14/2018] [Indexed: 01/11/2023] Open
Abstract
For many disease-causing virus species, global diversity is clustered into a taxonomy of subtypes with clinical significance. In particular, the classification of infections among the subtypes of human immunodeficiency virus type 1 (HIV-1) is a routine component of clinical management, and there are now many classification algorithms available for this purpose. Although several of these algorithms are similar in accuracy and speed, the majority are proprietary and require laboratories to transmit HIV-1 sequence data over the network to remote servers. This potentially exposes sensitive patient data to unauthorized access, and makes it impossible to determine how classifications are made and to maintain the data provenance of clinical bioinformatic workflows. We propose an open-source supervised and alignment-free subtyping method (Kameris) that operates on k-mer frequencies in HIV-1 sequences. We performed a detailed study of the accuracy and performance of subtype classification in comparison to four state-of-the-art programs. Based on our testing data set of manually curated real-world HIV-1 sequences (n = 2, 784), Kameris obtained an overall accuracy of 97%, which matches or exceeds all other tested software, with a processing rate of over 1,500 sequences per second. Furthermore, our fully standalone general-purpose software provides key advantages in terms of data security and privacy, transparency and reproducibility. Finally, we show that our method is readily adaptable to subtype classification of other viruses including dengue, influenza A, and hepatitis B and C virus.
Collapse
|
10
|
Chen W, Liao B, Li W. Use of image texture analysis to find DNA sequence similarities. J Theor Biol 2018; 455:1-6. [DOI: 10.1016/j.jtbi.2018.07.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2017] [Revised: 06/29/2018] [Accepted: 07/03/2018] [Indexed: 11/29/2022]
|
11
|
Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F. Alignment-Free Sequence Analysis and Applications. Annu Rev Biomed Data Sci 2018; 1:93-114. [PMID: 31828235 PMCID: PMC6905628 DOI: 10.1146/annurev-biodatasci-080917-013431] [Citation(s) in RCA: 58] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Genome and metagenome comparisons based on large amounts of next generation sequencing (NGS) data pose significant challenges for alignment-based approaches due to the huge data size and the relatively short length of the reads. Alignment-free approaches based on the counts of word patterns in NGS data do not depend on the complete genome and are generally computationally efficient. Thus, they contribute significantly to genome and metagenome comparison. Recently, novel statistical approaches have been developed for the comparison of both long and shotgun sequences. These approaches have been applied to many problems including the comparison of gene regulatory regions, genome sequences, metagenomes, binning contigs in metagenomic data, identification of virus-host interactions, and detection of horizontal gene transfers. We provide an updated review of these applications and other related developments of word-count based approaches for alignment-free sequence analysis.
Collapse
Affiliation(s)
- Jie Ren
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Xin Bai
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| | - Yang Young Lu
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Kujin Tang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen, Fujian, China
| | - Gesine Reinert
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Fengzhu Sun
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| |
Collapse
|
12
|
Lin J, Wei J, Adjeroh D, Jiang BH, Jiang Y. SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform. BMC Bioinformatics 2018; 19:165. [PMID: 29720081 PMCID: PMC5930706 DOI: 10.1186/s12859-018-2155-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2017] [Accepted: 04/11/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Alignment-free sequence similarity analysis methods often lead to significant savings in computational time over alignment-based counterparts. RESULTS A new alignment-free sequence similarity analysis method, called SSAW is proposed. SSAW stands for Sequence Similarity Analysis using the Stationary Discrete Wavelet Transform (SDWT). It extracts k-mers from a sequence, then maps each k-mer to a complex number field. Then, the series of complex numbers formed are transformed into feature vectors using the stationary discrete wavelet transform. After these steps, the original sequence is turned into a feature vector with numeric values, which can then be used for clustering and/or classification. CONCLUSIONS Using two different types of applications, namely, clustering and classification, we compared SSAW against the the-state-of-the-art alignment free sequence analysis methods. SSAW demonstrates competitive or superior performance in terms of standard indicators, such as accuracy, F-score, precision, and recall. The running time was significantly better in most cases. These make SSAW a suitable method for sequence analysis, especially, given the rapidly increasing volumes of sequence data required by most modern applications.
Collapse
Affiliation(s)
- Jie Lin
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
| | - Jing Wei
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
| | - Donald Adjeroh
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, 26506, WV, USA
| | - Bing-Hua Jiang
- Department of Pathology, University of Iowa, Iowa city, 52242, Iowa, USA
| | - Yue Jiang
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China.
| |
Collapse
|
13
|
Bai X, Tang K, Ren J, Waterman M, Sun F. Optimal choice of word length when comparing two Markov sequences using a χ 2-statistic. BMC Genomics 2017; 18:732. [PMID: 28984181 PMCID: PMC5629589 DOI: 10.1186/s12864-017-4020-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open
Abstract
Background Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains and the likelihood ratio test or the corresponding approximate χ2-statistic has been suggested to compare two sequences. However, it is not known how to best choose the word length k in such studies. Results We develop an optimal strategy to choose k by maximizing the statistical power of detecting differences between two sequences. Let the orders of the Markov chains for the two sequences be r1 and r2, respectively. We show through both simulations and theoretical studies that the optimal k= max(r1,r2)+1 for both long sequences and next generation sequencing (NGS) read data. The orders of the Markov chains may be unknown and several methods have been developed to estimate the orders of Markov chains based on both long sequences and NGS reads. We study the power loss of the statistics when the estimated orders are used. It is shown that the power loss is minimal for some of the estimators of the orders of Markov chains. Conclusion Our studies provide guidelines on choosing the optimal word length for the comparison of Markov sequences. Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-4020-z) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Xin Bai
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| | - Kujin Tang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Jie Ren
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Michael Waterman
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China.,Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Fengzhu Sun
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China. .,Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA.
| |
Collapse
|
14
|
Farrel A, Murphy J, Guo JT. Structure-based prediction of transcription factor binding specificity using an integrative energy function. Bioinformatics 2017; 32:i306-i313. [PMID: 27307632 PMCID: PMC4908348 DOI: 10.1093/bioinformatics/btw264] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022] Open
Abstract
UNLABELLED Transcription factors (TFs) regulate gene expression through binding to specific target DNA sites. Accurate annotation of transcription factor binding sites (TFBSs) at genome scale represents an essential step toward our understanding of gene regulation networks. In this article, we present a structure-based method for computational prediction of TFBSs using a novel, integrative energy (IE) function. The new energy function combines a multibody (MB) knowledge-based potential and two atomic energy terms (hydrogen bond and π interaction) that might not be accurately captured by the knowledge-based potential owing to the mean force nature and low count problem. We applied the new energy function to the TFBS prediction using a non-redundant dataset that consists of TFs from 12 different families. Our results show that the new IE function improves the prediction accuracy over the knowledge-based, statistical potentials, especially for homeodomain TFs, the second largest TF family in mammals. CONTACT jguo4@uncc.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Alvin Farrel
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| | - Jonathan Murphy
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| | - Jun-Tao Guo
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| |
Collapse
|
15
|
Farrel A, Guo JT. An efficient algorithm for improving structure-based prediction of transcription factor binding sites. BMC Bioinformatics 2017; 18:342. [PMID: 28715997 PMCID: PMC5514533 DOI: 10.1186/s12859-017-1755-0] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2017] [Accepted: 07/12/2017] [Indexed: 01/07/2023] Open
Abstract
Background Gene expression is regulated by transcription factors binding to specific target DNA sites. Understanding how and where transcription factors bind at genome scale represents an essential step toward our understanding of gene regulation networks. Previously we developed a structure-based method for prediction of transcription factor binding sites using an integrative energy function that combines a knowledge-based multibody potential and two atomic energy terms. While the method performs well, it is not computationally efficient due to the exponential increase in the number of binding sequences to be evaluated for longer binding sites. In this paper, we present an efficient pentamer algorithm by splitting DNA binding sequences into overlapping fragments along with a simplified integrative energy function for transcription factor binding site prediction. Results A DNA binding sequence is split into overlapping pentamers (5 base pairs) for calculating transcription factor-pentamer interaction energy. To combine the results from overlapping pentamer scores, we developed two methods, Kmer-Sum and PWM (Position Weight Matrix) stacking, for full-length binding motif prediction. Our results show that both Kmer-Sum and PWM stacking in the new pentamer approach along with a simplified integrative energy function improved transcription factor binding site prediction accuracy and dramatically reduced computation time, especially for longer binding sites. Conclusion Our new fragment-based pentamer algorithm and simplified energy function improve both efficiency and accuracy. To our knowledge, this is the first fragment-based method for structure-based transcription factor binding sites prediction. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1755-0) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Alvin Farrel
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC, 28223, USA
| | - Jun-Tao Guo
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC, 28223, USA.
| |
Collapse
|
16
|
Thankachan SV, Chockalingam SP, Liu Y, Krishnan A, Aluru S. A greedy alignment-free distance estimator for phylogenetic inference. BMC Bioinformatics 2017; 18:238. [PMID: 28617225 PMCID: PMC5471951 DOI: 10.1186/s12859-017-1658-0] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the prominent alignment-free approaches. This ACS approach has been further generalized by some recent work, either greedily or exactly, by allowing a bounded number of mismatches in the common substrings. RESULTS We present ALFRED-G, a greedy alignment-free distance estimator for phylogenetic tree reconstruction based on the concept of the generalized ACS approach. In this algorithm, we have investigated a new heuristic to efficiently compute the lengths of common strings with mismatches allowed, and have further applied this heuristic to phylogeny reconstruction. Performance evaluation using real sequence datasets shows that our heuristic is able to reconstruct comparable, or even more accurate, phylogenetic tree topologies than the kmacs heuristic algorithm at highly competitive speed. CONCLUSIONS ALFRED-G is an alignment-free heuristic for evolutionary distance estimation between two biological sequences. This algorithm is implemented in C++ and has been incorporated into our open-source ALFRED software package ( http://alurulab.cc.gatech.edu/phylo ).
Collapse
Affiliation(s)
- Sharma V. Thankachan
- Department of Computer Science, University of Central Florida, Orlando, 32816 FL USA
| | - Sriram P. Chockalingam
- Institute for Data Engineering and Science, Georgia Institute of Technology, Atlanta, 30332 GA USA
| | - Yongchao Liu
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332 GA USA
| | - Ambujam Krishnan
- School of Electrical Engineering and Computer Science, Louisiana State University, Baton Rouge, 70703 LA USA
| | - Srinivas Aluru
- Institute for Data Engineering and Science, Georgia Institute of Technology, Atlanta, 30332 GA USA
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332 GA USA
| |
Collapse
|
17
|
PCV: An Alignment Free Method for Finding Homologous Nucleotide Sequences and its Application in Phylogenetic Study. Interdiscip Sci 2016; 9:173-183. [PMID: 26825665 DOI: 10.1007/s12539-015-0136-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2015] [Revised: 11/03/2015] [Accepted: 12/15/2015] [Indexed: 10/22/2022]
Abstract
Online retrieval of the homologous nucleotide sequences through existing alignment techniques is a common practice against the given database of sequences. The salient point of these techniques is their dependence on local alignment techniques and scoring matrices the reliability of which is limited by computational complexity and accuracy. Toward this direction, this work offers a novel way for numerical representation of genes which can further help in dividing the data space into smaller partitions helping formation of a search tree. In this context, this paper introduces a 36-dimensional Periodicity Count Value (PCV) which is representative of a particular nucleotide sequence and created through adaptation from the concept of stochastic model of Kolekar et al. (American Institute of Physics 1298:307-312, 2010. doi: 10.1063/1.3516320 ). The PCV construct uses information on physicochemical properties of nucleotides and their positional distribution pattern within a gene. It is observed that PCV representation of gene reduces computational cost in the calculation of distances between a pair of genes while being consistent with the existing methods. The validity of PCV-based method was further tested through their use in molecular phylogeny constructs in comparison with that using existing sequence alignment methods.
Collapse
|
18
|
Zhu XS, McGee M. Metagenomic Classification Using an Abstraction Augmented Markov Model. J Comput Biol 2015; 23:111-122. [PMID: 26618474 DOI: 10.1089/cmb.2015.0141] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/01/2023] Open
Abstract
The abstraction augmented Markov model (AAMM) is an extension of a Markov model that can be used for the analysis of genetic sequences. It is developed using the frequencies of all possible consecutive words with same length (p-mers). This article will review the theory behind AAMM and apply the theory behind AAMM in metagenomic classification.
Collapse
Affiliation(s)
| | - Monnie McGee
- 2 Department of Statistical Science, Southern Methodist University , Dallas, Texas
| |
Collapse
|
19
|
Karamichalis R, Kari L, Konstantinidis S, Kopecki S. An investigation into inter- and intragenomic variations of graphic genomic signatures. BMC Bioinformatics 2015; 16:246. [PMID: 26249837 PMCID: PMC4527362 DOI: 10.1186/s12859-015-0655-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2014] [Accepted: 06/30/2015] [Indexed: 11/30/2022] Open
Abstract
Background Motivated by the general need to identify and classify species based on molecular evidence, genome comparisons have been proposed that are based on measuring mostly Euclidean distances between Chaos Game Representation (CGR) patterns of genomic DNA sequences. Results We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli (Bacteria - full genome), and P. furiosus (Archaea - full genome). To maximize the diversity within each species, we also analyze the interrelationships within a set of over five hundred 150,000 bp genomic sequences sampled from the entire aforementioned genomes. Lastly, we provide some preliminary evidence of this method’s ability to classify genomic DNA sequences at lower taxonomic levels by comparing sequences sampled from the entire genome of H. sapiens (class Mammalia, order Primates) and of M. musculus (class Mammalia, order Rodentia), for a total length of approximately 174 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps, which visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships. Conclusion Our analysis confirms, for this dataset, that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our assessment of the performance of the six distances analyzed uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies.
Collapse
Affiliation(s)
- Rallis Karamichalis
- Department of Computer Science, University of Western Ontario, London, ON, Canada.
| | - Lila Kari
- Department of Computer Science, University of Western Ontario, London, ON, Canada.
| | - Stavros Konstantinidis
- Department of Mathematics and Computing Science, Saint Mary's University, Halifax, NS, Canada.
| | - Steffen Kopecki
- Department of Computer Science, University of Western Ontario, London, ON, Canada. .,Department of Mathematics and Computing Science, Saint Mary's University, Halifax, NS, Canada.
| |
Collapse
|
20
|
Prediction of high-risk types of human papillomaviruses using statistical model of protein "sequence space". COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2015; 2015:756345. [PMID: 25972913 PMCID: PMC4418008 DOI: 10.1155/2015/756345] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/19/2014] [Accepted: 03/31/2015] [Indexed: 11/29/2022]
Abstract
Discrimination of high-risk types of human papillomaviruses plays an important role in the diagnosis and remedy of cervical cancer. Recently, several computational methods have been proposed based on protein sequence-based and structure-based information, but the information of their related proteins has not been used until now. In this paper, we proposed using protein “sequence space” to explore this information and used it to predict high-risk types of HPVs. The proposed method was tested on 68 samples with known HPV types and 4 samples without HPV types and further compared with the available approaches. The results show that the proposed method achieved the best performance among all the evaluated methods with accuracy 95.59% and F1-score 90.91%, which indicates that protein “sequence space” could potentially be used to improve prediction of high-risk types of HPVs.
Collapse
|
21
|
Xie X, Guan J, Zhou S. Similarity evaluation of DNA sequences based on frequent patterns and entropy. BMC Genomics 2015; 16 Suppl 3:S5. [PMID: 25707937 PMCID: PMC4331808 DOI: 10.1186/1471-2164-16-s3-s5] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND DNA sequence analysis is an important research topic in bioinformatics. Evaluating the similarity between sequences, which is crucial for sequence analysis, has attracted much research effort in the last two decades, and a dozen of algorithms and tools have been developed. These methods are based on alignment, word frequency and geometric representation respectively, each of which has its advantage and disadvantage. RESULTS In this paper, for effectively computing the similarity between DNA sequences, we introduce a novel method based on frequency patterns and entropy to construct representative vectors of DNA sequences. Experiments are conducted to evaluate the proposed method, which is compared with two recently-developed alignment-free methods and the BLASTN tool. When testing on the β-globin genes of 11 species and using the results from MEGA as the baseline, our method achieves higher correlation coefficients than the two alignment-free methods and the BLASTN tool. CONCLUSIONS Our method is not only able to capture fine-granularity information (location and ordering) of DNA sequences via sequence blocking, but also insensitive to noise and sequence rearrangement due to considering only the maximal frequent patterns. It outperforms major existing methods or tools.
Collapse
|
22
|
Wen J, Zhang Y, Yau SS. k-mer Sparse matrix model for genetic sequence and its applications in sequence comparison. J Theor Biol 2014; 363:145-50. [DOI: 10.1016/j.jtbi.2014.08.028] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/06/2014] [Revised: 07/14/2014] [Accepted: 08/17/2014] [Indexed: 10/24/2022]
|
23
|
A novel alignment-free method for whole genome analysis: Application to HIV-1 subtyping and HEV genotyping. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2014.04.029] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
|
24
|
K-mer natural vector and its application to the phylogenetic analysis of genetic sequences. Gene 2014; 546:25-34. [PMID: 24858075 DOI: 10.1016/j.gene.2014.05.043] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2014] [Revised: 05/04/2014] [Accepted: 05/20/2014] [Indexed: 11/21/2022]
Abstract
Based on the well-known k-mer model, we propose a k-mer natural vector model for representing a genetic sequence based on the numbers and distributions of k-mers in the sequence. We show that there exists a one-to-one correspondence between a genetic sequence and its associated k-mer natural vector. The k-mer natural vector method can be easily and quickly used to perform phylogenetic analysis of genetic sequences without requiring evolutionary models or human intervention. Whole or partial genomes can be handled more effective with our proposed method. It is applied to the phylogenetic analysis of genetic sequences, and the obtaining results fully demonstrate that the k-mer natural vector method is a very powerful tool for analysing and annotating genetic sequences and determining evolutionary relationships both in terms of accuracy and efficiency.
Collapse
|
25
|
Gunasinghe U, Alahakoon D, Bedingfield S. Extraction of high quality k-words for alignment-free sequence comparison. J Theor Biol 2014; 358:31-51. [PMID: 24846728 DOI: 10.1016/j.jtbi.2014.05.016] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2013] [Revised: 05/04/2014] [Accepted: 05/07/2014] [Indexed: 10/25/2022]
Abstract
The weighted Euclidean distance (D(2)) is one of the earliest dissimilarity measures used for alignment free comparison of biological sequences. This distance measure and its variants have been used in numerous applications due to its fast computation, and many variants of it have been subsequently introduced. The D(2) distance measure is based on the count of k-words in the two sequences that are compared. Traditionally, all k-words are compared when computing the distance. In this paper we show that similar accuracy in sequence comparison can be achieved by using a selected subset of k-words. We introduce a term variance based quality measure for identifying the important k-words. We demonstrate the application of the proposed technique in phylogeny reconstruction and show that up to 99% of the k-words can be filtered out for certain datasets, resulting in faster sequence comparison. The paper also presents an exploratory analysis based evaluation of optimal k-word values and discusses the impact of using subsets of k-words in such optimal instances.
Collapse
Affiliation(s)
- Upuli Gunasinghe
- Clayton School of Information Technology, Faculty of Information Technology, Monash University, VIC 3800, Australia.
| | - Damminda Alahakoon
- School of Information and Business Analytics, Deakin University, VIC 3125, Australia.
| | - Susan Bedingfield
- Clayton School of Information Technology, Faculty of Information Technology, Monash University, VIC 3800, Australia.
| |
Collapse
|
26
|
Abstract
Among alignment-free methods, Iterated Maps (IMs) are on a particular extreme: they are also scale free (order free). The use of IMs for sequence analysis is also distinct from other alignment-free methodologies in being rooted in statistical mechanics instead of computational linguistics. Both of these roots go back over two decades to the use of fractal geometry in the characterization of phase-space representations. The time series analysis origin of the field is betrayed by the title of the manuscript that started this alignment-free subdomain in 1990, 'Chaos Game Representation'. The clash between the analysis of sequences as continuous series and the better established use of Markovian approaches to discrete series was almost immediate, with a defining critique published in same journal 2 years later. The rest of that decade would go by before the scale-free nature of the IM space was uncovered. The ensuing decade saw this scalability generalized for non-genomic alphabets as well as an interest in its use for graphic representation of biological sequences. Finally, in the past couple of years, in step with the emergence of BigData and MapReduce as a new computational paradigm, there is a surprising third act in the IM story. Multiple reports have described gains in computational efficiency of multiple orders of magnitude over more conventional sequence analysis methodologies. The stage appears to be now set for a recasting of IMs with a central role in processing nextgen sequencing results.
Collapse
Affiliation(s)
- Jonas S Almeida
- Division of Informatics, Department of Pathology, University of Alabama at Birmingham, Birmingham, AL, USA.
| |
Collapse
|
27
|
Schwende I, Pham TD. Pattern recognition and probabilistic measures in alignment-free sequence analysis. Brief Bioinform 2013; 15:354-68. [PMID: 24096012 DOI: 10.1093/bib/bbt070] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open
Abstract
With the massive production of genomic and proteomic data, the number of available biological sequences in databases has reached a level that is not feasible anymore for exact alignments even when just a fraction of all sequences is used. To overcome this inevitable time complexity, ultrafast alignment-free methods are studied. Within the past two decades, a broad variety of nonalignment methods have been proposed including dissimilarity measures on classical representations of sequences like k-words or Markov models. Furthermore, articles were published that describe distance measures on alternative representations such as compression complexity, spectral time series or chaos game representation. However, alignments are still the standard method for real world applications in biological sequence analysis, and the time efficient alignment-free approaches are usually applied in cases when the accustomed algorithms turn out to fail or be too inconvenient.
Collapse
Affiliation(s)
- Isabel Schwende
- PhD, Aizu Research Cluster for Medical Informatics and Engineering (ARC-Medical), Research Center for Advanced Information Science and Technology (CAIST), The University of Aizu, Aizuwakamatsu, Fukushima 965-8580, Japan.
| | | |
Collapse
|
28
|
Yu C, He RL, Yau SST. Protein sequence comparison based on K-string dictionary. Gene 2013; 529:250-6. [DOI: 10.1016/j.gene.2013.07.092] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2012] [Revised: 06/14/2013] [Accepted: 07/25/2013] [Indexed: 11/30/2022]
|
29
|
Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinform 2013; 15:343-53. [PMID: 24064230 DOI: 10.1093/bib/bbt067] [Citation(s) in RCA: 112] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023] Open
Abstract
With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.
Collapse
Affiliation(s)
- Kai Song
- Molecular and Computational Biology Program, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, USA. or
| | | | | | | | | | | |
Collapse
|
30
|
Yang X, Wang T. Linear regression model of short k-word: a similarity distance suitable for biological sequences with various lengths. J Theor Biol 2013; 337:61-70. [PMID: 23933105 DOI: 10.1016/j.jtbi.2013.07.028] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2013] [Revised: 07/02/2013] [Accepted: 07/30/2013] [Indexed: 01/10/2023]
Abstract
Originating from sequences' length difference, both k-word based methods and graphical representation approaches have uncovered biological information in their distinct ways. However, it is less likely that the mechanisms of information storage vary with sequences' length. A similarity distance suitable for sequences with various lengths will be much near to the mechanisms of information storage. In this paper, new sub-sequences of k-word were extracted from biological sequences under a one-to-one mapping. The new sub-sequences were evaluated by a linear regression model. Moreover, a new distance was defined on the invariants from the linear regression model. With comparison to other alignment-free distances, the results of four experiments demonstrated that our similarity distance was more efficient.
Collapse
Affiliation(s)
- Xiwu Yang
- School of Mathematical Sciences, Dalian University of Technology, Dalian, Liaoning 116024, PR China; School of Mathematics, Liaoning Normal University, Dalian, Liaoning 116029, PR China.
| | - Tianming Wang
- School of Mathematical Sciences, Dalian University of Technology, Dalian, Liaoning 116024, PR China
| |
Collapse
|
31
|
Dai Q, Yan Z, Shi Z, Liu X, Yao Y, He P. Study of LZ-word distribution and its application for sequence comparison. J Theor Biol 2013; 336:52-60. [PMID: 23876763 PMCID: PMC7094135 DOI: 10.1016/j.jtbi.2013.07.008] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2013] [Revised: 07/06/2013] [Accepted: 07/10/2013] [Indexed: 11/29/2022]
Abstract
Lempel-Ziv complexity has been widely used for sequence comparison and achieved promising results, but until now components' distribution in exhaustive history has not been studied. This paper investigated the whole distribution of LZ-words and presented a novel statistical method for sequence comparison. With the components' length in mind, we revised Lempel-Ziv complexity and obtained various sets of LZ-words. Instead of calculating the LZ-words' contents, we defined a series of set operations on LZ-word set to compare biological sequences. In order to assess the effectiveness of the proposed method, we performed two sets of experiments and compared it with alignment-based methods.
Collapse
Affiliation(s)
- Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China.
| | | | | | | | | | | |
Collapse
|
32
|
Abstract
A plethora of biologically useful information lies obscured in the genomes of organisms. Encoded within the genome of an organism is the information about its evolutionary history. Evolutionary signals are scattered throughout the genome. Bioinformatics approaches are frequently invoked to deconstruct the evolutionary patterns underlying genomes, which are difficult to decipher using traditional laboratory experiments. However, interpreting constantly evolving genomes is a non-trivial task for bioinformaticians. Processes such as mutations, recombinations, insertions and deletions make genomes not only heterogeneous and difficult to decipher but also renders direct sequence comparison less effective. Here we present a brief overview of the sequence comparison methods with a focus on recently proposed alignment-free sequence comparison methods based on Shannon information entropy. Many of these sequence comparison methods have been adapted to construct phylogenetic trees to infer relationships among organisms.
Collapse
Affiliation(s)
- Mehul Jani
- University of North Texas, Denton, Texas
| | | |
Collapse
|
33
|
Yu HJ. Segmented K-mer and its application on similarity analysis of mitochondrial genome sequences. Gene 2013; 518:419-24. [DOI: 10.1016/j.gene.2012.12.079] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2012] [Revised: 12/01/2012] [Accepted: 12/19/2012] [Indexed: 11/25/2022]
|
34
|
Narlikar L, Mehta N, Galande S, Arjunwadkar M. One size does not fit all: on how Markov model order dictates performance of genomic sequence analyses. Nucleic Acids Res 2012; 41:1416-24. [PMID: 23267010 PMCID: PMC3562003 DOI: 10.1093/nar/gks1285] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/01/2023] Open
Abstract
The structural simplicity and ability to capture serial correlations make Markov models a popular modeling choice in several genomic analyses, such as identification of motifs, genes and regulatory elements. A critical, yet relatively unexplored, issue is the determination of the order of the Markov model. Most biological applications use a predetermined order for all data sets indiscriminately. Here, we show the vast variation in the performance of such applications with the order. To identify the 'optimal' order, we investigated two model selection criteria: Akaike information criterion and Bayesian information criterion (BIC). The BIC optimal order delivers the best performance for mammalian phylogeny reconstruction and motif discovery. Importantly, this order is different from orders typically used by many tools, suggesting that a simple additional step determining this order can significantly improve results. Further, we describe a novel classification approach based on BIC optimal Markov models to predict functionality of tissue-specific promoters. Our classifier discriminates between promoters active across 12 different tissues with remarkable accuracy, yielding 3 times the precision expected by chance. Application to the metagenomics problem of identifying the taxum from a short DNA fragment yields accuracies at least as high as the more complex mainstream methodologies, while retaining conceptual and computational simplicity.
Collapse
Affiliation(s)
- Leelavati Narlikar
- Centre for Modeling and Simulation, University of Pune, Pune 411 007, India
| | | | | | | |
Collapse
|
35
|
A novel statistical measure for sequence comparison on the basis of k-word counts. J Theor Biol 2012; 318:91-100. [PMID: 23147229 DOI: 10.1016/j.jtbi.2012.10.035] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2011] [Revised: 10/10/2012] [Accepted: 10/31/2012] [Indexed: 11/24/2022]
Abstract
Numerous efficient methods based on word counts for sequence analysis have been proposed to characterize DNA sequences to help in comparison, retrieval from the databases and reconstructing evolutionary relations. However, most of them seem unrelated to any intrinsic characteristics of DNA. In this paper, we proposed a novel statistical measure for sequence comparison on the basis of k-word counts. This new measure removed the influence of sequences' lengths and uncovered bulk property of DNA sequences. The proposed measure was tested by similarity search and phylogenetic analysis. The experimental assessment demonstrated that our similarity measure was efficient.
Collapse
|
36
|
Ng KH, Ho CK, Phon-Amnuaisuk S. A hybrid distance measure for clustering expressed sequence tags originating from the same gene family. PLoS One 2012; 7:e47216. [PMID: 23071763 PMCID: PMC3469558 DOI: 10.1371/journal.pone.0047216] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2012] [Accepted: 09/10/2012] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND Clustering is a key step in the processing of Expressed Sequence Tags (ESTs). The primary goal of clustering is to put ESTs from the same transcript of a single gene into a unique cluster. Recent EST clustering algorithms mostly adopt the alignment-free distance measures, where they tend to yield acceptable clustering accuracies with reasonable computational time. Despite the fact that these clustering methods work satisfactorily on a majority of the EST datasets, they have a common weakness. They are prone to deliver unsatisfactory clustering results when dealing with ESTs from the genes derived from the same family. The root cause is the distance measures applied on them are not sensitive enough to separate these closely related genes. METHODOLOGY/PRINCIPAL FINDINGS We propose a hybrid distance measure that combines the global and local features extracted from ESTs, with the aim to address the clustering problem faced by ESTs derived from the same gene family. The clustering process is implemented using the DBSCAN algorithm. We test the hybrid distance measure on the ten EST datasets, and the clustering results are compared with the two alignment-free EST clustering tools, i.e. wcd and PEACE. The clustering results indicate that the proposed hybrid distance measure performs relatively better (in terms of clustering accuracy) than both EST clustering tools. CONCLUSIONS/SIGNIFICANCE The clustering results provide support for the effectiveness of the proposed hybrid distance measure in solving the clustering problem for ESTs that originate from the same gene family. The improvement of clustering accuracies on the experimental datasets has supported the claim that the sensitivity of the hybrid distance measure is sufficient to solve the clustering problem.
Collapse
Affiliation(s)
- Keng-Hoong Ng
- Faculty of Computing and Informatics, Multimedia University, Cyberjaya, Malaysia.
| | | | | |
Collapse
|
37
|
Abstract
Since the emergence of high-throughput genome sequencing platforms and more recently the next-generation platforms, the genome databases are growing at an astronomical rate. Tremendous efforts have been invested in recent years in understanding intriguing complexities beneath the vast ocean of genomic data. This is apparent in the spurt of computational methods for interpreting these data in the past few years. Genomic data interpretation is notoriously difficult, partly owing to the inherent heterogeneities appearing at different scales. Methods developed to interpret these data often suffer from their inability to adequately measure the underlying heterogeneities and thus lead to confounding results. Here, we present an information entropy-based approach that unravels the distinctive patterns underlying genomic data efficiently and thus is applicable in addressing a variety of biological problems. We show the robustness and consistency of the proposed methodology in addressing three different biological problems of significance—identification of alien DNAs in bacterial genomes, detection of structural variants in cancer cell lines and alignment-free genome comparison.
Collapse
Affiliation(s)
- Rajeev K Azad
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260, USA.
| | | |
Collapse
|
38
|
Wei D, Jiang Q, Wei Y, Wang S. A novel hierarchical clustering algorithm for gene sequences. BMC Bioinformatics 2012; 13:174. [PMID: 22823405 PMCID: PMC3443659 DOI: 10.1186/1471-2105-13-174] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2011] [Accepted: 06/30/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Clustering DNA sequences into functional groups is an important problem in bioinformatics. We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. This method transforms DNA sequences into the feature vectors which contain the occurrence, location and order relation of k-tuples in DNA sequence. Afterwards, a hierarchical procedure is applied to clustering DNA sequences based on the feature vectors. RESULTS The proposed distance measure and clustering method are evaluated by clustering functionally related genes and by phylogenetic analysis. This method is also compared with BlastClust, CD-HIT-EST and some others. The experimental results show our method is effective in classifying DNA sequences with similar biological characteristics and in discovering the underlying relationship among the sequences. CONCLUSIONS We introduced a novel clustering algorithm which is based on a new sequence similarity measure. It is effective in classifying DNA sequences with similar biological characteristics and in discovering the relationship among the sequences.
Collapse
Affiliation(s)
- Dan Wei
- Cognitive Science Department & Fujian Key Laboratory of the Brain-like Intelligent Systems, Xiamen University, Xiamen, China
| | | | | | | |
Collapse
|
39
|
Yang L, Zhang X, Zhu H. Alignment free comparison: similarity distribution between the DNA primary sequences based on the shortest absent word. J Theor Biol 2011; 295:125-31. [PMID: 22138094 DOI: 10.1016/j.jtbi.2011.11.021] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2011] [Revised: 11/18/2011] [Accepted: 11/19/2011] [Indexed: 11/15/2022]
Abstract
This work proposes an alignment free comparison model for the DNA primary sequences. In this paper, we treat the double strands of the DNA rather than single strand. We define the shortest absent word of the double strands between the DNA sequences and some properties are studied to speed up the algorithm for searching the shortest absent word. We present a novel model for comparison, in which the similarity distribution is introduced to describe the similarity between the sequences. A distance measure is deduced based on the Shannon entropy meanwhile is used in phylogenetic analysis. Some experiments show that our model performs well in the field of sequence analysis.
Collapse
Affiliation(s)
- Lianping Yang
- College of Sciences, Northeastern University, Shenyang, China
| | | | | |
Collapse
|
40
|
Integrating overlapping structures and background information of words significantly improves biological sequence comparison. PLoS One 2011; 6:e26779. [PMID: 22102867 PMCID: PMC3213098 DOI: 10.1371/journal.pone.0026779] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2011] [Accepted: 10/04/2011] [Indexed: 12/19/2022] Open
Abstract
Word-based models have achieved promising results in sequence comparison. However, as the important statistical properties of words in biological sequence, how to use the overlapping structures and background information of the words to improve sequence comparison is still a problem. This paper proposed a new statistical method that integrates the overlapping structures and the background information of the words in biological sequences. To assess the effectiveness of this integration for sequence comparison, two sets of evaluation experiments were taken to test the proposed model. The first one, performed via receiver operating curve analysis, is the application of proposed method in discrimination between functionally related regulatory sequences and unrelated sequences, intron and exon. The second experiment is to evaluate the performance of the proposed method with f-measure for clustering Hepatitis E virus genotypes. It was demonstrated that the proposed method integrating the overlapping structures and the background information of words significantly improves biological sequence comparison and outperforms the existing models.
Collapse
|
41
|
Protein map: an alignment-free sequence comparison method based on various properties of amino acids. Gene 2011; 486:110-8. [PMID: 21803133 DOI: 10.1016/j.gene.2011.07.002] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2011] [Accepted: 07/09/2011] [Indexed: 11/23/2022]
Abstract
In this paper, we propose a new protein map which incorporates with various properties of amino acids. As a powerful tool for protein classification, this new protein map both considers phylogenetic factors arising from amino acid mutations and provides computational efficiency for the huge amount of data. The ten amino acid physico-chemical properties (the chemical composition of the side chain, two polarity measures, hydropathy, isoelectric point, volume, aromaticity, aliphaticity, hydrogenation, and hydroxythiolation) are utilized according to their relative importance. Moreover, during the course of calculation of genetic distances between pairs of proteins, this approach does not require any alignment of sequences. Therefore, the proposed model is easier and quicker in handling protein sequences than multiple alignment methods, and gives protein classification greater evolutionary significance at the amino acid sequence level.
Collapse
|
42
|
Sun Y, Gilbert PB. Estimation of Stratified Mark-Specific Proportional Hazards Models with Missing Marks. Scand Stat Theory Appl 2011; 39:34-52. [PMID: 23519918 DOI: 10.1111/j.1467-9469.2011.00746.x] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
An objective of randomized placebo-controlled preventive HIV vaccine efficacy trials is to assess the relationship between the vaccine effect to prevent infection and the genetic distance of the exposing HIV to the HIV strain represented in the vaccine construct. Motivated by this objective, recently a mark-specific proportional hazards model with a continuum of competing risks has been studied, where the genetic distance of the transmitting strain is the continuous `mark' defined and observable only in failures. A high percentage of genetic marks of interest may be missing for a variety of reasons, predominantly due to rapid evolution of HIV sequences after transmission before a blood sample is drawn from which HIV sequences are measured. This research investigates the stratified mark-specific proportional hazards model with missing marks where the baseline functions may vary with strata. We develop two consistent estimation approaches, the first based on the inverse probability weighted complete-case (IPW) technique, and the second based on augmenting the IPW estimator by incorporating auxiliary information predictive of the mark. We investigate the asymptotic properties and finite-sample performance of the two estimators, and show that the augmented IPW estimator, which satisfies a double robustness property, is more efficient.
Collapse
Affiliation(s)
- Yanqing Sun
- Department of Mathematics and Statistics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA.
| | | |
Collapse
|
43
|
Alignment-free comparison of genome sequences by a new numerical characterization. J Theor Biol 2011; 281:107-12. [PMID: 21536050 DOI: 10.1016/j.jtbi.2011.04.003] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2010] [Revised: 04/01/2011] [Accepted: 04/02/2011] [Indexed: 01/29/2023]
Abstract
In order to compare different genome sequences, an alignment-free method has proposed. First, we presented a new graphical representation of DNA sequences without degeneracy, which is conducive to intuitive comparison of sequences. Then, a new numerical characterization based on the representation was introduced to quantitatively depict the intrinsic nature of genome sequences, and considered as a 10-dimensional vector in the mathematical space. Alignment-free comparison of sequences was performed by computing the distances between vectors of the corresponding numerical characterizations, which define the evolutionary relationship. Two data sets of DNA sequences were constructed to assess the performance on sequence comparison. The results illustrate well validity of the method. The new numerical characterization provides a powerful tool for genome comparison.
Collapse
|
44
|
Using Markov model to improve word normalization algorithm for biological sequence comparison. Amino Acids 2011; 42:1867-77. [DOI: 10.1007/s00726-011-0906-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2010] [Accepted: 03/29/2011] [Indexed: 10/18/2022]
|
45
|
Chang G, Wang T. Weighted relative entropy for alignment-free sequence comparison based on Markov model. J Biomol Struct Dyn 2011; 28:545-55. [PMID: 21142223 DOI: 10.1080/07391102.2011.10508594] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Abstract
In this paper, we introduce a probabilistic measure for computing the similarity between two biological sequences without alignment. The computation of the similarity measure is based on the Kullback-Leibler divergence of two constructed Markov models. We firstly validate the method on clustering nine chromosomes from three species. Secondly, we give the result of similarity search based on our new method. We lastly apply the measure to the construction of phylogenetic tree of 48 HEV genome sequences. Our results indicate that the weighted relative entropy is an efficient and powerful alignment-free measure for the analysis of sequences in the genomic scale.
Collapse
Affiliation(s)
- Guisong Chang
- School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China.
| | | |
Collapse
|
46
|
Dai Q, Liu X, Yao Y, Zhao F. Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison. J Theor Biol 2011; 276:174-80. [PMID: 21334347 DOI: 10.1016/j.jtbi.2011.02.005] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2010] [Revised: 02/05/2011] [Accepted: 02/07/2011] [Indexed: 10/18/2022]
Abstract
Sequence comparison is one of the major tasks in bioinformatics, which can be used to study structural and functional conservation, as well as evolutionary relations among the sequences. Numerous dissimilarity measures achieve promising results in sequence comparison, but challenges remain. This paper studied numerical characteristics of word frequencies and proposed a novel dissimilarity measure for sequence comparison. Instead of using the word frequencies directly, the proposed measure considers both the word frequencies and overlapping structures of words. To verify the effectiveness of the proposed measure, we tested it with two experiments and further compared it with alignment-based and alignment-free measures. The results demonstrate that the proposed measure extracting more information on the overlapping structures of the words improves the efficiency of sequence comparison.
Collapse
Affiliation(s)
- Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China.
| | | | | | | |
Collapse
|
47
|
Liu X, Dai Q, Li L, He Z. An efficient binomial model-based measure for sequence comparison and its application. J Biomol Struct Dyn 2011; 28:833-43. [PMID: 21294594 DOI: 10.1080/07391102.2011.10508611] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
Abstract
Sequence comparison is one of the major tasks in bioinformatics, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations. There are several similarity/dissimilarity measures for sequence comparison, but challenges remains. This paper presented a binomial model-based measure to analyze biological sequences. With help of a random indicator, the occurrence of a word at any position of sequence can be regarded as a random Bernoulli variable, and the distribution of a sum of the word occurrence is well known to be a binomial one. By using a recursive formula, we computed the binomial probability of the word count and proposed a binomial model-based measure based on the relative entropy. The proposed measure was tested by extensive experiments including classification of HEV genotypes and phylogenetic analysis, and further compared with alignment-based and alignment-free measures. The results demonstrate that the proposed measure based on binomial model is more efficient.
Collapse
Affiliation(s)
- Xiaoqing Liu
- School of Science, Hangzhou Dianzi Unviersity, Hangzhou 310018, People's Republic of China
| | | | | | | |
Collapse
|
48
|
Zhang Y, Chen W. A Measure of DNA Sequence Dissimilarity Based on Free Energy of Nearest-neighbor Interaction. J Biomol Struct Dyn 2011; 28:557-65. [DOI: 10.1080/07391102.2011.10508595] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
|
49
|
Huang Y, Yang L, Wang T. Phylogenetic analysis of DNA sequences based on the generalized pseudo-amino acid composition. J Theor Biol 2011; 269:217-23. [DOI: 10.1016/j.jtbi.2010.10.027] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2010] [Revised: 10/22/2010] [Accepted: 10/22/2010] [Indexed: 11/15/2022]
|
50
|
Vector representations and related matrices of DNA primary sequence based on L-tuple. Math Biosci 2010; 227:147-52. [DOI: 10.1016/j.mbs.2010.07.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2007] [Revised: 07/24/2010] [Accepted: 07/27/2010] [Indexed: 11/24/2022]
|