Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Lippert RA, Huang H, Waterman MS. Distributional regimes for the number of k-word matches between two random sequences. Proc Natl Acad Sci U S A 2002;99:13980-9. [PMID: 12374863 PMCID: PMC137823 DOI: 10.1073/pnas.202468099] [Citation(s) in RCA: 89] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 08/05/2002] [Indexed: 11/18/2022] Open

For:	Lippert RA, Huang H, Waterman MS. Distributional regimes for the number of k-word matches between two random sequences. Proc Natl Acad Sci U S A 2002;99:13980-9. [PMID: 12374863 PMCID: PMC137823 DOI: 10.1073/pnas.202468099] [Citation(s) in RCA: 89] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 08/05/2002] [Indexed: 11/18/2022] Open

Number

Cited by Other Article(s)

Fan WTL, Legried B, Roch S. An impossibility result for phylogeny reconstruction from k-mer counts. ANN APPL PROBAB 2022. [DOI: 10.1214/22-aap1805] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]

Birth N, Dencker T, Morgenstern B. Insertions and deletions as phylogenetic signal in an alignment-free context. PLoS Comput Biol 2022;18:e1010303. [PMID: 35939516 PMCID: PMC9387925 DOI: 10.1371/journal.pcbi.1010303] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2021] [Revised: 08/18/2022] [Accepted: 06/14/2022] [Indexed: 11/18/2022] Open

Abstract

Most methods for phylogenetic tree reconstruction are based on sequence alignments; they infer phylogenies from substitutions that may have occurred at the aligned sequence positions. Gaps in alignments are usually not employed as phylogenetic signal. In this paper, we explore an alignment-free approach that uses insertions and deletions (indels) as an additional source of information for phylogeny inference. For a set of four or more input sequences, we generate so-called quartet blocks of four putative homologous segments each. For pairs of such quartet blocks involving the same four sequences, we compare the distances between the two blocks in these sequences, to obtain hints about indels that may have happened between the blocks since the respective four sequences have evolved from their last common ancestor. A prototype implementation that we call Gap-SpaM is presented to infer phylogenetic trees from these data, using a quartet-tree approach or, alternatively, under the maximum-parsimony paradigm. This approach should not be regarded as an alternative to established methods, but rather as a complementary source of phylogenetic information. Interestingly, however, our software is able to produce phylogenetic trees from putative indels alone that are comparable to trees obtained with existing alignment-free methods.

Phylogenetic tree inference based on DNA or protein sequence comparison is a fundamental task in computational biology. Given a multiple alignment of a set of input sequences, most approaches compare aligned sequence positions to each other, to find a suitable tree, based on a model of molecular evolution. Insertions and deletions that may have happened since the input sequences evolved from their last common ancestor are ignored by most phylogeny methods. Herein, we show that insertions and deletions can provide an additional source of information for phylogeny inference, and that such information can be obtained with a simple alignment-free approach. We provide an implementation of this idea that we call Gap-SpaM. The proposed approach is complementary to existing phylogeny methods since it is based on a completely different source of information. It is, thus, not meant to be an alternative to those existing methods but rather as a possible additional source of information for tree inference.

Collapse

Swain MT, Vickers M. Interpreting alignment-free sequence comparison: what makes a score a good score? NAR Genom Bioinform 2022;4:lqac062. [PMID: 36071721 PMCID: PMC9442500 DOI: 10.1093/nargab/lqac062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2022] [Revised: 07/01/2022] [Accepted: 08/16/2022] [Indexed: 11/13/2022] Open

Pevzner P, Vingron M, Reidys C, Sun F, Istrail S. Michael Waterman's Contributions to Computational Biology and Bioinformatics. J Comput Biol 2022;29:601-615. [PMID: 35727100 DOI: 10.1089/cmb.2022.29066.pp] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Anastasiou A, Barp A, Briol FX, Ebner B, Gaunt RE, Ghaderinezhad F, Gorham J, Gretton A, Ley C, Liu Q, Mackey L, Oates CJ, Reinert G, Swan Y. Stein’s Method Meets Computational Statistics: A Review of Some Recent Developments. Stat Sci 2022. [DOI: 10.1214/22-sts863] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]

Sarkar BK, Sharma AR, Bhattacharya M, Sharma G, Lee SS, Chakraborty C. Determination of k-mer density in a DNA sequence and subsequent cluster formation algorithm based on the application of electronic filter. Sci Rep 2021;11:13701. [PMID: 34211040 PMCID: PMC8249421 DOI: 10.1038/s41598-021-93154-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2021] [Accepted: 06/07/2021] [Indexed: 02/06/2023] Open

Maurer-Stroh S, Krutz NL, Kern PS, Gunalan V, Nguyen MN, Limviphuvadh V, Eisenhaber F, Gerberick GF. AllerCatPro-prediction of protein allergenicity potential from the protein sequence. Bioinformatics 2020;35:3020-3027. [PMID: 30657872 PMCID: PMC6736023 DOI: 10.1093/bioinformatics/btz029] [Citation(s) in RCA: 94] [Impact Index Per Article: 23.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2018] [Revised: 12/18/2018] [Accepted: 01/14/2019] [Indexed: 12/22/2022] Open

Abstract

Motivation

Due to the risk of inducing an immediate Type I (IgE-mediated) allergic response, proteins intended for use in consumer products must be investigated for their allergenic potential before introduction into the marketplace. The FAO/WHO guidelines for computational assessment of allergenic potential of proteins based on short peptide hits and linear sequence window identity thresholds misclassify many proteins as allergens.

Results

We developed AllerCatPro which predicts the allergenic potential of proteins based on similarity of their 3D protein structure as well as their amino acid sequence compared with a data set of known protein allergens comprising of 4180 unique allergenic protein sequences derived from the union of the major databases Food Allergy Research and Resource Program, Comprehensive Protein Allergen Resource, WHO/International Union of Immunological Societies, UniProtKB and Allergome. We extended the hexamer hit rule by removing peptides with high probability of random occurrence measured by sequence entropy as well as requiring 3 or more hexamer hits consistent with natural linear epitope patterns in known allergens. This is complemented with a Gluten-like repeat pattern detection. We also switched from a linear sequence window similarity to a B-cell epitope-like 3D surface similarity window which became possible through extensive 3D structure modeling covering the majority (74%) of allergens. In case no structure similarity is found, the decision workflow reverts to the old linear sequence window rule. The overall accuracy of AllerCatPro is 84% compared with other current methods which range from 51 to 73%. Both the FAO/WHO rules and AllerCatPro achieve highest sensitivity but AllerCatPro provides a 37-fold increase in specificity.

Availability and implementation

https://allercatpro.bii.a-star.edu.sg/

Supplementary information

Supplementary data are available at Bioinformatics online.

Collapse

Gaunt RE. Stein’s method for functions of multivariate normal random variables. ANNALES DE L'INSTITUT HENRI POINCARÉ, PROBABILITÉS ET STATISTIQUES 2020. [DOI: 10.1214/19-aihp1011] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]

Luczak BB, James BT, Girgis HZ. A survey and evaluations of histogram-based statistics in alignment-free sequence comparison. Brief Bioinform 2020;20:1222-1237. [PMID: 29220512 PMCID: PMC6781583 DOI: 10.1093/bib/bbx161] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2017] [Revised: 10/13/2017] [Indexed: 11/29/2022] Open

Abstract

Motivation

Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences.

Results

We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover’s distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours.

Availability

The source code of the benchmarking tool is available as Supplementary Materials.

Collapse

Qian J, Comin M. MetaCon: unsupervised clustering of metagenomic contigs with probabilistic k-mers statistics and coverage. BMC Bioinformatics 2019;20:367. [PMID: 31757198 PMCID: PMC6873667 DOI: 10.1186/s12859-019-2904-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2019] [Accepted: 05/15/2019] [Indexed: 11/30/2022] Open

Chen S, Chen Y, Sun F, Waterman MS, Zhang X. A new statistic for efficient detection of repetitive sequences. Bioinformatics 2019;35:4596-4606. [PMID: 30993316 DOI: 10.1093/bioinformatics/btz262] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2018] [Revised: 03/01/2019] [Accepted: 04/10/2019] [Indexed: 11/14/2022] Open

Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, Maetschke SR, Ragan MA. Alignment-free inference of hierarchical and reticulate phylogenomic relationships. Brief Bioinform 2019;20:426-435. [PMID: 28673025 PMCID: PMC6433738 DOI: 10.1093/bib/bbx067] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2017] [Revised: 05/04/2017] [Indexed: 11/22/2022] Open

Gaunt RE. Wasserstein and Kolmogorov Error Bounds for Variance-Gamma Approximation via Stein’s Method I. J THEOR PROBAB 2018. [DOI: 10.1007/s10959-018-0867-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]

Alignment-free Transcriptomic and Metatranscriptomic Comparison Using Sequencing Signatures with Variable Length Markov Chains. Sci Rep 2016;6:37243. [PMID: 27876823 PMCID: PMC5120338 DOI: 10.1038/srep37243] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2016] [Accepted: 10/27/2016] [Indexed: 11/08/2022] Open

Girotto S, Pizzi C, Comin M. MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures. Bioinformatics 2016;32:i567-i575. [DOI: 10.1093/bioinformatics/btw466] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open

Comin M, Schimd M. Fast comparison of genomic and meta-genomic reads with alignment-free measures based on quality values. BMC Med Genomics 2016;9 Suppl 1:36. [PMID: 27535823 PMCID: PMC4989896 DOI: 10.1186/s12920-016-0193-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open

Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 2016;17:132. [PMID: 27323842 PMCID: PMC4915045 DOI: 10.1186/s13059-016-0997-x] [Citation(s) in RCA: 1571] [Impact Index Per Article: 196.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2015] [Accepted: 06/03/2016] [Indexed: 02/07/2023] Open

Fan H, Ives AR, Surget-Groba Y, Cannon CH. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 2015;16:522. [PMID: 26169061 PMCID: PMC4501066 DOI: 10.1186/s12864-015-1647-5] [Citation(s) in RCA: 89] [Impact Index Per Article: 9.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2015] [Accepted: 05/20/2015] [Indexed: 11/10/2022] Open

Galzitskaya OV, Lobanov MY. Phyloproteomic Analysis of 11780 Six-Residue-Long Motifs Occurrences. BIOMED RESEARCH INTERNATIONAL 2015;2015:208346. [PMID: 26114101 PMCID: PMC4465679 DOI: 10.1155/2015/208346] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/11/2014] [Accepted: 11/03/2014] [Indexed: 12/31/2022]

Morgenstern B, Zhu B, Horwege S, Leimeister CA. Estimating evolutionary distances between genomic sequences from spaced-word matches. Algorithms Mol Biol 2015;10:5. [PMID: 25685176 PMCID: PMC4327811 DOI: 10.1186/s13015-015-0032-x] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2014] [Accepted: 01/06/2015] [Indexed: 01/06/2023] Open

Comin M, Leoni A, Schimd M. Clustering of reads with alignment-free measures and quality values. Algorithms Mol Biol 2015;10:4. [PMID: 25691913 PMCID: PMC4331138 DOI: 10.1186/s13015-014-0029-x] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2014] [Accepted: 12/17/2014] [Indexed: 12/03/2022] Open

Bonham-Carter O, Steele J, Bastola D. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief Bioinform 2014;15:890-905. [PMID: 23904502 PMCID: PMC4296134 DOI: 10.1093/bib/bbt052] [Citation(s) in RCA: 76] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2013] [Accepted: 05/31/2013] [Indexed: 12/17/2022] Open

Comin M, Schimd M. Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns. BMC Bioinformatics 2014;15 Suppl 9:S1. [PMID: 25252700 PMCID: PMC4168702 DOI: 10.1186/1471-2105-15-s9-s1] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/16/2023] Open

Comin M, Verzotto D. Beyond Fixed-Resolution Alignment-Free Measures for Mammalian Enhancers Sequence Comparison. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014;11:628-637. [PMID: 26356333 DOI: 10.1109/tcbb.2014.2306830] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]

Chen S, Wang A, Li LM. SEME: a fast mapper of Illumina sequencing reads with statistical evaluation. J Comput Biol 2014;20:847-60. [PMID: 24195707 DOI: 10.1089/cmb.2013.0111] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open

Gunasinghe U, Alahakoon D, Bedingfield S. Extraction of high quality k-words for alignment-free sequence comparison. J Theor Biol 2014;358:31-51. [PMID: 24846728 DOI: 10.1016/j.jtbi.2014.05.016] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2013] [Revised: 05/04/2014] [Accepted: 05/07/2014] [Indexed: 10/25/2022]

Gaunt R. Variance-Gamma approximation via Stein's method. ELECTRON J PROBAB 2014. [DOI: 10.1214/ejp.v19-3020] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]

Taillefer E, Miller J. Exhaustive computation of exact duplications via super and non-nested local maximal repeats. J Bioinform Comput Biol 2013;12:1350018. [PMID: 24467757 DOI: 10.1142/s0219720013500182] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]

Maurer-Stroh S, Gunalan V, Wong WC, Eisenhaber F. A simple shortcut to unsupervised alignment-free phylogenetic genome groupings, even from unassembled sequencing reads. J Bioinform Comput Biol 2013;11:1343005. [PMID: 24372034 DOI: 10.1142/s0219720013430051] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023]

Burden CJ, Leopardi P, Forêt S. The distribution of word matches between Markovian sequences with periodic boundary conditions. J Comput Biol 2013;21:41-63. [PMID: 24160839 DOI: 10.1089/cmb.2012.0277] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Schwende I, Pham TD. Pattern recognition and probabilistic measures in alignment-free sequence analysis. Brief Bioinform 2013;15:354-68. [PMID: 24096012 DOI: 10.1093/bib/bbt070] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open

Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinform 2013;15:343-53. [PMID: 24064230 DOI: 10.1093/bib/bbt067] [Citation(s) in RCA: 112] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023] Open

Chen LHY, Röllin A. Approximating dependent rare events. BERNOULLI 2013. [DOI: 10.3150/12-bejsp18] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]

Ren J, Song K, Sun F, Deng M, Reinert G. Multiple alignment-free sequence comparison. Bioinformatics 2013;29:2690-8. [PMID: 23990418 DOI: 10.1093/bioinformatics/btt462] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Song K, Ren J, Zhai Z, Liu X, Deng M, Sun F. Alignment-free sequence comparison based on next-generation sequencing reads. J Comput Biol 2013;20:64-79. [PMID: 23383994 DOI: 10.1089/cmb.2012.0228] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023] Open

Behnam E, Waterman MS, Smith AD. A geometric interpretation for local alignment-free sequence comparison. J Comput Biol 2013;20:471-85. [PMID: 23829649 PMCID: PMC3704055 DOI: 10.1089/cmb.2012.0280] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

A novel statistical measure for sequence comparison on the basis of k-word counts. J Theor Biol 2012;318:91-100. [PMID: 23147229 DOI: 10.1016/j.jtbi.2012.10.035] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2011] [Revised: 10/10/2012] [Accepted: 10/31/2012] [Indexed: 11/24/2022]

Randić M. Very efficient search for nucleotide alignments. J Comput Chem 2012;34:77-82. [PMID: 22949371 DOI: 10.1002/jcc.23105] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2012] [Revised: 08/10/2012] [Accepted: 08/13/2012] [Indexed: 11/12/2022]

Göke J, Schulz MH, Lasserre J, Vingron M. Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts. ACTA ACUST UNITED AC 2012;28:656-63. [PMID: 22247280 PMCID: PMC3289921 DOI: 10.1093/bioinformatics/bts028] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]

Integrating overlapping structures and background information of words significantly improves biological sequence comparison. PLoS One 2011;6:e26779. [PMID: 22102867 PMCID: PMC3213098 DOI: 10.1371/journal.pone.0026779] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2011] [Accepted: 10/04/2011] [Indexed: 12/19/2022] Open

Liu X, Wan L, Li J, Reinert G, Waterman MS, Sun F. New powerful statistics for alignment-free sequence comparison under a pattern transfer model. J Theor Biol 2011;284:106-16. [PMID: 21723298 DOI: 10.1016/j.jtbi.2011.06.020] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2011] [Revised: 05/30/2011] [Accepted: 06/17/2011] [Indexed: 12/15/2022]

Alignment-free comparison of genome sequences by a new numerical characterization. J Theor Biol 2011;281:107-12. [PMID: 21536050 DOI: 10.1016/j.jtbi.2011.04.003] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2010] [Revised: 04/01/2011] [Accepted: 04/02/2011] [Indexed: 01/29/2023]

Dai Q, Liu X, Yao Y, Zhao F. Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison. J Theor Biol 2011;276:174-80. [PMID: 21334347 DOI: 10.1016/j.jtbi.2011.02.005] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2010] [Revised: 02/05/2011] [Accepted: 02/07/2011] [Indexed: 10/18/2022]

Ross N. Fundamentals of Stein’s method. PROBABILITY SURVEYS 2011. [DOI: 10.1214/11-ps182] [Citation(s) in RCA: 162] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]

Wan L, Reinert G, Sun F, Waterman MS. Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput Biol 2010;17:1467-90. [PMID: 20973742 DOI: 10.1089/cmb.2010.0056] [Citation(s) in RCA: 74] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023] Open

Abstract

Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, both from a theoretical viewpoint and numerically, providing an easy to use program. The power is assessed under two alternative hidden Markov models; the first one assumes that the two sequences share a common motif, whereas the second model is a pattern transfer model; the null model is that the two sequences are composed of independent and identically distributed letters and they are independent. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically D2S has the largest power, followed by D2*, whereas the power of D2 can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations D2* generally has the largest power. Under the first alternative model of a shared motif, the power of D2*approaches 100% when sufficiently many motifs are shared, and we recommend the use of D2* for such practical applications. Under the second alternative model of pattern transfer,the power for all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration canbe recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version),verifying that D2* is generally more powerful than D2. The program to calculate the power of D2, D2* and D2S can be downloaded from http://meta.cmb.usc.edu/d2. Supplementary Material is available at www.liebertonline.com/cmb.

Collapse

Jing J, Burden CJ, Forêt S, Wilson SR. Statistical considerations underpinning an alignment-free sequence comparison method. J Korean Stat Soc 2010. [DOI: 10.1016/j.jkss.2010.02.009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]

Koohy H, Dyer NP, Reid JE, Koentges G, Ott S. An alignment-free model for comparison of regulatory sequences. ACTA ACUST UNITED AC 2010;26:2391-7. [PMID: 20696736 DOI: 10.1093/bioinformatics/btq453] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]

Reinert G, Chew D, Sun F, Waterman MS. Alignment-free sequence comparison (I): statistics and power. J Comput Biol 2010;16:1615-34. [PMID: 20001252 DOI: 10.1089/cmb.2009.0198] [Citation(s) in RCA: 126] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Pavlović-Lazetić GM, Mitić NS, Beljanski MV. n-Gram characterization of genomic islands in bacterial genomes. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2009;93:241-56. [PMID: 19101056 PMCID: PMC7185697 DOI: 10.1016/j.cmpb.2008.10.014] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 04/20/2008] [Revised: 09/10/2008] [Accepted: 10/21/2008] [Indexed: 05/27/2023]

Dai Q, Yang Y, Wang T. Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison. ACTA ACUST UNITED AC 2008;24:2296-302. [PMID: 18710871 DOI: 10.1093/bioinformatics/btn436] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]