1
|
Fan WTL, Legried B, Roch S. An impossibility result for phylogeny reconstruction from k-mer counts. ANN APPL PROBAB 2022. [DOI: 10.1214/22-aap1805] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
Affiliation(s)
| | | | - Sebastien Roch
- Department of Mathematics, University of Wisconsin–Madison
| |
Collapse
|
2
|
Birth N, Dencker T, Morgenstern B. Insertions and deletions as phylogenetic signal in an alignment-free context. PLoS Comput Biol 2022; 18:e1010303. [PMID: 35939516 PMCID: PMC9387925 DOI: 10.1371/journal.pcbi.1010303] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2021] [Revised: 08/18/2022] [Accepted: 06/14/2022] [Indexed: 11/18/2022] Open
Abstract
Most methods for phylogenetic tree reconstruction are based on sequence alignments; they infer phylogenies from substitutions that may have occurred at the aligned sequence positions. Gaps in alignments are usually not employed as phylogenetic signal. In this paper, we explore an alignment-free approach that uses insertions and deletions (indels) as an additional source of information for phylogeny inference. For a set of four or more input sequences, we generate so-called quartet blocks of four putative homologous segments each. For pairs of such quartet blocks involving the same four sequences, we compare the distances between the two blocks in these sequences, to obtain hints about indels that may have happened between the blocks since the respective four sequences have evolved from their last common ancestor. A prototype implementation that we call Gap-SpaM is presented to infer phylogenetic trees from these data, using a quartet-tree approach or, alternatively, under the maximum-parsimony paradigm. This approach should not be regarded as an alternative to established methods, but rather as a complementary source of phylogenetic information. Interestingly, however, our software is able to produce phylogenetic trees from putative indels alone that are comparable to trees obtained with existing alignment-free methods. Phylogenetic tree inference based on DNA or protein sequence comparison is a fundamental task in computational biology. Given a multiple alignment of a set of input sequences, most approaches compare aligned sequence positions to each other, to find a suitable tree, based on a model of molecular evolution. Insertions and deletions that may have happened since the input sequences evolved from their last common ancestor are ignored by most phylogeny methods. Herein, we show that insertions and deletions can provide an additional source of information for phylogeny inference, and that such information can be obtained with a simple alignment-free approach. We provide an implementation of this idea that we call Gap-SpaM. The proposed approach is complementary to existing phylogeny methods since it is based on a completely different source of information. It is, thus, not meant to be an alternative to those existing methods but rather as a possible additional source of information for tree inference.
Collapse
Affiliation(s)
- Niklas Birth
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universisät Göttingen, Göttingen, Germany
| | - Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universisät Göttingen, Göttingen, Germany
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universisät Göttingen, Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Göttingen, Germany
- Campus-Institute Data Science (CIDAS), Göttingen, Germany
- * E-mail:
| |
Collapse
|
3
|
Swain MT, Vickers M. Interpreting alignment-free sequence comparison: what makes a score a good score? NAR Genom Bioinform 2022; 4:lqac062. [PMID: 36071721 PMCID: PMC9442500 DOI: 10.1093/nargab/lqac062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2022] [Revised: 07/01/2022] [Accepted: 08/16/2022] [Indexed: 11/13/2022] Open
Abstract
Alignment-free methods are alternatives to alignment-based methods when searching sequence data sets. The output from an alignment-free sequence comparison is a similarity score, the interpretation of which is not straightforward. We propose objective functions to interpret and calibrate outputs from alignment-free searches, noting that different objective functions are necessary for different biological contexts. This leads to advantages: visualising and comparing score distributions, including those from true positives, may be a relatively simple method to gain insight into the performance of different metrics. Using an empirical approach with both DNA and protein sequences, we characterise different similarity score distributions generated under different parameters. In particular, we demonstrate how sequence length can affect the scores. We show that scores of true positive sequence pairs may correlate significantly with their mean length; and even if the correlation is weak, the relative difference in length of the sequence pair may significantly reduce the effectiveness of alignment-free metrics. Importantly, we show how objective functions can be used with test data to accurately estimate the probability of true positives. This can significantly increase the utility of alignment-free approaches. Finally, we have developed a general-purpose software tool called KAST for use in high-throughput workflows on Linux clusters.
Collapse
Affiliation(s)
- Martin T Swain
- Department of Life Sciences, Aberystwyth University , Penglais, Aberystwyth, Ceredigion, SY23 3DA, UK
| | - Martin Vickers
- The John Innes Centre, Norwich Research Park , Norwich NR4 7UH, UK
| |
Collapse
|
4
|
Pevzner P, Vingron M, Reidys C, Sun F, Istrail S. Michael Waterman's Contributions to Computational Biology and Bioinformatics. J Comput Biol 2022; 29:601-615. [PMID: 35727100 DOI: 10.1089/cmb.2022.29066.pp] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
On the occasion of Dr. Michael Waterman's 80th birthday, we review his major contributions to the field of computational biology and bioinformatics including the famous Smith-Waterman algorithm for sequence alignment, the probability and statistics theory related to sequence alignment, algorithms for sequence assembly, the Lander-Waterman model for genome physical mapping, combinatorics and predictions of ribonucleic acid structures, word counting statistics in molecular sequences, alignment-free sequence comparison, and algorithms for haplotype block partition and tagSNP selection related to the International HapMap Project. His books Introduction to Computational Biology: Maps, Sequences and Genomes for graduate students and Computational Genome Analysis: An Introduction geared toward undergraduate students played key roles in computational biology and bioinformatics education. We also highlight his efforts of building the computational biology and bioinformatics community as the founding editor of the Journal of Computational Biology and a founding member of the International Conference on Research in Computational Molecular Biology (RECOMB).
Collapse
Affiliation(s)
- Pavel Pevzner
- Department of Computer Science and Engineering, University of California San Diego, San Diego, California, USA
| | - Martin Vingron
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Christian Reidys
- Department of Mathematics, Biocomplexity Institute & Initiative, University of Virginia, Charlottesville, Virginia, USA
| | - Fengzhu Sun
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, USA
| | - Sorin Istrail
- Department of Computer Science, Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, USA
| |
Collapse
|
5
|
Anastasiou A, Barp A, Briol FX, Ebner B, Gaunt RE, Ghaderinezhad F, Gorham J, Gretton A, Ley C, Liu Q, Mackey L, Oates CJ, Reinert G, Swan Y. Stein’s Method Meets Computational Statistics: A Review of Some Recent Developments. Stat Sci 2022. [DOI: 10.1214/22-sts863] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Affiliation(s)
- Andreas Anastasiou
- Andreas Anastasiou is a Lecturer, Department of Mathematics and Statistics, University of Cyprus, P.O. Box: 20537, 1678 Nicosia, Cyprus
| | - Alessandro Barp
- Alessandro Barp is a Research Associate, University of Cambridge, Engineering Dept, Trumpington St, Cambridge CB2 1PZ, UK
| | - François-Xavier Briol
- François-Xavier Briol is a Lecturer, University College London, 1-19 Torrington Place, London WC1E 7HB, UK
| | - Bruno Ebner
- Bruno Ebner is a Lecturer, Institute of Stochastics, Karlsruhe Institute of Technology (KIT), Englerstr. 2, 76128 Karlsruhe, Germany
| | - Robert E. Gaunt
- Robert E. Gaunt is a Lecturer, The University of Manchester, Oxford Road, Manchester M13 9PL, UK
| | - Fatemeh Ghaderinezhad
- Fatemeh Ghaderinezhad is a Data Analyst, amfori, The Gradient Building, Avenue de Tervueren 270, 1150 Brussels, Belgium
| | - Jackson Gorham
- Jackson Gorham is a Data Scientist, Whisper.ai, Inc., USA
| | - Arthur Gretton
- Arthur Gretton is Professor with the Gatsby Computational Neuroscience Unit, University College London, Sainsbury Wellcome Centre, 25 Howland Street, London W1T 4JG, UK
| | - Christophe Ley
- Christophe Ley is Associate Professor, University of Luxembourg, Maison du Nombre, 6 Avenue de la Fonte, L-4364 Esch-sur-Alzette, Luxembourg
| | - Qiang Liu
- Qiang Liu is Assistant Professor, The University of Texas at Austin, Austin, Texas 78712, USA
| | - Lester Mackey
- Lester Mackey is Principal Researcher, Microsoft Research New England, 1 Memorial Drive, Cambridge, Massachusetts 02142, USA
| | | | - Gesine Reinert
- Gesine Reinert is Professor, University of Oxford, Department of Statistics, 24-29 St Giles’, Oxford OX1 3LB, UK
| | - Yvik Swan
- Yvik Swan is Professor, Université Libre de Bruxelles, Department of Mathematics—CP 210, Boulevard du Triomphe, 1050 Brussels, Belgium
| |
Collapse
|
6
|
Sarkar BK, Sharma AR, Bhattacharya M, Sharma G, Lee SS, Chakraborty C. Determination of k-mer density in a DNA sequence and subsequent cluster formation algorithm based on the application of electronic filter. Sci Rep 2021; 11:13701. [PMID: 34211040 PMCID: PMC8249421 DOI: 10.1038/s41598-021-93154-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2021] [Accepted: 06/07/2021] [Indexed: 02/06/2023] Open
Abstract
We describe a novel algorithm for information recovery from DNA sequences by using a digital filter. This work proposes a three-part algorithm to decide the k-mer or q-gram word density. Employing a finite impulse response digital filter, one can calculate the sequence's k-mer or q-gram word density. Further principal component analysis is used on word density distribution to analyze the dissimilarity between sequences. A dissimilarity matrix is thus formed and shows the appearance of cluster formation. This cluster formation is constructed based on the alignment-free sequence method. Furthermore, the clusters are used to build phylogenetic relations. The cluster algorithm is in good agreement with alignment-based algorithms. The present algorithm is simple and requires less time for computation than other currently available algorithms. We tested the algorithm using beta hemoglobin coding sequences (HBB) of 10 different species and 18 primate mitochondria genome (mtDNA) sequences.
Collapse
Affiliation(s)
| | - Ashish Ranjan Sharma
- Institute for Skeletal Aging and Orthopedic Surgery, Chuncheon Sacred Heart Hospital, Hallym University, College of Medicine, Chuncheon-si, Gangwon-do, 24252, Republic of Korea
| | - Manojit Bhattacharya
- Department of Zoology, Fakir Mohan University, Vyasa Vihar, Balasore, Odisha, 756020, India
| | - Garima Sharma
- Neuropsychopharmacology and Toxicology Program, College of Pharmacy, Kangwon National University, Chuncheon-si, Gangwon-do, Republic of Korea
| | - Sang-Soo Lee
- Institute for Skeletal Aging and Orthopedic Surgery, Chuncheon Sacred Heart Hospital, Hallym University, College of Medicine, Chuncheon-si, Gangwon-do, 24252, Republic of Korea.
| | - Chiranjib Chakraborty
- Department of Biotechnology, School of Life Science and Biotechnology, Adamas University, Barasat-Barrackpore, Rd, Jagannathpur, Kolkata, West Bengal, 700126, India.
| |
Collapse
|
7
|
Maurer-Stroh S, Krutz NL, Kern PS, Gunalan V, Nguyen MN, Limviphuvadh V, Eisenhaber F, Gerberick GF. AllerCatPro-prediction of protein allergenicity potential from the protein sequence. Bioinformatics 2020; 35:3020-3027. [PMID: 30657872 PMCID: PMC6736023 DOI: 10.1093/bioinformatics/btz029] [Citation(s) in RCA: 94] [Impact Index Per Article: 23.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2018] [Revised: 12/18/2018] [Accepted: 01/14/2019] [Indexed: 12/22/2022] Open
Abstract
Motivation Due to the risk of inducing an immediate Type I (IgE-mediated) allergic response, proteins intended for use in consumer products must be investigated for their allergenic potential before introduction into the marketplace. The FAO/WHO guidelines for computational assessment of allergenic potential of proteins based on short peptide hits and linear sequence window identity thresholds misclassify many proteins as allergens. Results We developed AllerCatPro which predicts the allergenic potential of proteins based on similarity of their 3D protein structure as well as their amino acid sequence compared with a data set of known protein allergens comprising of 4180 unique allergenic protein sequences derived from the union of the major databases Food Allergy Research and Resource Program, Comprehensive Protein Allergen Resource, WHO/International Union of Immunological Societies, UniProtKB and Allergome. We extended the hexamer hit rule by removing peptides with high probability of random occurrence measured by sequence entropy as well as requiring 3 or more hexamer hits consistent with natural linear epitope patterns in known allergens. This is complemented with a Gluten-like repeat pattern detection. We also switched from a linear sequence window similarity to a B-cell epitope-like 3D surface similarity window which became possible through extensive 3D structure modeling covering the majority (74%) of allergens. In case no structure similarity is found, the decision workflow reverts to the old linear sequence window rule. The overall accuracy of AllerCatPro is 84% compared with other current methods which range from 51 to 73%. Both the FAO/WHO rules and AllerCatPro achieve highest sensitivity but AllerCatPro provides a 37-fold increase in specificity. Availability and implementation https://allercatpro.bii.a-star.edu.sg/ Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Sebastian Maurer-Stroh
- Biomolecular Function Discovery Division, Bioinformatics Institute, Agency for Science, Technology and Research, Singapore.,Department of Biological Sciences, National University of Singapore, Singapore
| | - Nora L Krutz
- The Procter & Gamble Services Company, Strombeek-Bever, Belgium
| | - Petra S Kern
- The Procter & Gamble Services Company, Strombeek-Bever, Belgium
| | - Vithiagaran Gunalan
- Biomolecular Function Discovery Division, Bioinformatics Institute, Agency for Science, Technology and Research, Singapore
| | - Minh N Nguyen
- Biomolecular Function Discovery Division, Bioinformatics Institute, Agency for Science, Technology and Research, Singapore
| | - Vachiranee Limviphuvadh
- Biomolecular Function Discovery Division, Bioinformatics Institute, Agency for Science, Technology and Research, Singapore
| | - Frank Eisenhaber
- Biomolecular Function Discovery Division, Bioinformatics Institute, Agency for Science, Technology and Research, Singapore.,Department of Biological Sciences, National University of Singapore, Singapore
| | | |
Collapse
|
8
|
Gaunt RE. Stein’s method for functions of multivariate normal random variables. ANNALES DE L'INSTITUT HENRI POINCARÉ, PROBABILITÉS ET STATISTIQUES 2020. [DOI: 10.1214/19-aihp1011] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
9
|
Luczak BB, James BT, Girgis HZ. A survey and evaluations of histogram-based statistics in alignment-free sequence comparison. Brief Bioinform 2020; 20:1222-1237. [PMID: 29220512 PMCID: PMC6781583 DOI: 10.1093/bib/bbx161] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2017] [Revised: 10/13/2017] [Indexed: 11/29/2022] Open
Abstract
Motivation Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences. Results We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover’s distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours. Availability The source code of the benchmarking tool is available as Supplementary Materials.
Collapse
Affiliation(s)
| | | | - Hani Z Girgis
- Corresponding author. Hani Z. Girgis, Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA. E-mail:
| |
Collapse
|
10
|
Qian J, Comin M. MetaCon: unsupervised clustering of metagenomic contigs with probabilistic k-mers statistics and coverage. BMC Bioinformatics 2019; 20:367. [PMID: 31757198 PMCID: PMC6873667 DOI: 10.1186/s12859-019-2904-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2019] [Accepted: 05/15/2019] [Indexed: 11/30/2022] Open
Abstract
Motivation Sequencing technologies allow the sequencing of microbial communities directly from the environment without prior culturing. Because assembly typically produces only genome fragments, also known as contigs, it is crucial to group them into putative species for further taxonomic profiling and down-streaming functional analysis. Taxonomic analysis of microbial communities requires contig clustering, a process referred to as binning, that is still one of the most challenging tasks when analyzing metagenomic data. The major problems are the lack of taxonomically related genomes in existing reference databases, the uneven abundance ratio of species, sequencing errors, and the limitations due to binning contig of different lengths. Results In this context we present MetaCon a novel tool for unsupervised metagenomic contig binning based on probabilistic k-mers statistics and coverage. MetaCon uses a signature based on k-mers statistics that accounts for the different probability of appearance of a k-mer in different species, also contigs of different length are clustered in two separate phases. The effectiveness of MetaCon is demonstrated in both simulated and real datasets in comparison with state-of-art binning approaches such as CONCOCT, MaxBin and MetaBAT. Electronic supplementary material The online version of this article (10.1186/s12859-019-2904-4) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Jia Qian
- Department of Information Engineering, University of Padova, Via Giovanni Gradenigo 6, Padova, Italy
| | - Matteo Comin
- Department of Information Engineering, University of Padova, Via Giovanni Gradenigo 6, Padova, Italy.
| |
Collapse
|
11
|
Chen S, Chen Y, Sun F, Waterman MS, Zhang X. A new statistic for efficient detection of repetitive sequences. Bioinformatics 2019; 35:4596-4606. [PMID: 30993316 DOI: 10.1093/bioinformatics/btz262] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2018] [Revised: 03/01/2019] [Accepted: 04/10/2019] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks. An efficient generic method for detecting most types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic D2R that can efficiently discriminate sequences with or without repetitive regions. RESULTS Using the statistic, we developed an algorithm of linear time and space complexity for detecting most types of repetitive sequences in multiple scenarios, including finding candidate clustered regularly interspaced short palindromic repeats regions from bacterial genomic or metagenomics sequences. Simulation and real data experiments show that the method works well on both assembled sequences and unassembled short reads. AVAILABILITY AND IMPLEMENTATION The codes are available at https://github.com/XuegongLab/D2R_codes under GPL 3.0 license. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Sijie Chen
- Department of Automation, MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist, Tsinghua University, Beijing 100084, China
| | - Yixin Chen
- Department of Automation, MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist, Tsinghua University, Beijing 100084, China
| | - Fengzhu Sun
- Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA.,Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
| | - Michael S Waterman
- Department of Automation, MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist, Tsinghua University, Beijing 100084, China.,Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA.,Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
| | - Xuegong Zhang
- Department of Automation, MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist, Tsinghua University, Beijing 100084, China.,School of Life Sciences, Tsinghua University, Beijing 100084, China
| |
Collapse
|
12
|
Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, Maetschke SR, Ragan MA. Alignment-free inference of hierarchical and reticulate phylogenomic relationships. Brief Bioinform 2019; 20:426-435. [PMID: 28673025 PMCID: PMC6433738 DOI: 10.1093/bib/bbx067] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2017] [Revised: 05/04/2017] [Indexed: 11/22/2022] Open
Abstract
We are amidst an ongoing flood of sequence data arising from the application of high-throughput technologies, and a concomitant fundamental revision in our understanding of how genomes evolve individually and within the biosphere. Workflows for phylogenomic inference must accommodate data that are not only much larger than before, but often more error prone and perhaps misassembled, or not assembled in the first place. Moreover, genomes of microbes, viruses and plasmids evolve not only by tree-like descent with modification but also by incorporating stretches of exogenous DNA. Thus, next-generation phylogenomics must address computational scalability while rethinking the nature of orthogroups, the alignment of multiple sequences and the inference and comparison of trees. New phylogenomic workflows have begun to take shape based on so-called alignment-free (AF) approaches. Here, we review the conceptual foundations of AF phylogenetics for the hierarchical (vertical) and reticulate (lateral) components of genome evolution, focusing on methods based on k-mers. We reflect on what seems to be successful, and on where further development is needed.
Collapse
|
13
|
Gaunt RE. Wasserstein and Kolmogorov Error Bounds for Variance-Gamma Approximation via Stein’s Method I. J THEOR PROBAB 2018. [DOI: 10.1007/s10959-018-0867-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
14
|
Alignment-free Transcriptomic and Metatranscriptomic Comparison Using Sequencing Signatures with Variable Length Markov Chains. Sci Rep 2016; 6:37243. [PMID: 27876823 PMCID: PMC5120338 DOI: 10.1038/srep37243] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2016] [Accepted: 10/27/2016] [Indexed: 11/08/2022] Open
Abstract
The comparison between microbial sequencing data is critical to understand the dynamics of microbial communities. The alignment-based tools analyzing metagenomic datasets require reference sequences and read alignments. The available alignment-free dissimilarity approaches model the background sequences with Fixed Order Markov Chain (FOMC) yielding promising results for the comparison of microbial communities. However, in FOMC, the number of parameters grows exponentially with the increase of the order of Markov Chain (MC). Under a fixed high order of MC, the parameters might not be accurately estimated owing to the limitation of sequencing depth. In our study, we investigate an alternative to FOMC to model background sequences with the data-driven Variable Length Markov Chain (VLMC) in metatranscriptomic data. The VLMC originally designed for long sequences was extended to apply to high-throughput sequencing reads and the strategies to estimate the corresponding parameters were developed. The flexible number of parameters in VLMC avoids estimating the vast number of parameters of high-order MC under limited sequencing depth. Different from the manual selection in FOMC, VLMC determines the MC order adaptively. Several beta diversity measures based on VLMC were applied to compare the bacterial RNA-Seq and metatranscriptomic datasets. Experiments show that VLMC outperforms FOMC to model the background sequences in transcriptomic and metatranscriptomic samples. A software pipeline is available at https://d2vlmc.codeplex.com.
Collapse
|
15
|
Girotto S, Pizzi C, Comin M. MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures. Bioinformatics 2016; 32:i567-i575. [DOI: 10.1093/bioinformatics/btw466] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
|
16
|
Comin M, Schimd M. Fast comparison of genomic and meta-genomic reads with alignment-free measures based on quality values. BMC Med Genomics 2016; 9 Suppl 1:36. [PMID: 27535823 PMCID: PMC4989896 DOI: 10.1186/s12920-016-0193-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open
Abstract
Background Sequencing technologies are generating enormous amounts of read data, however assembly of genomes and metagenomes remain among the most challenging tasks. In this paper we study the comparison of genomes and metagenomes only based on read data, using word counts statistics called alignment-free thus not requiring reference genomes or assemblies. Quality scores produced by sequencing platforms are fundamental for various analyses, moreover future-generation sequencing platforms, will produce longer reads but with error rate around 15 %. In this context it will be fundamental to exploit quality values information within the framework of alignment-free measures. Results In this paper we present a family of alignment-free measures, called dq-type, that are based on k-mer counts and quality values. These statistics can be used to compare genomes and metagenomes based on their read sets. Results show that the evolutionary relationship of genomes can be reconstructed based on the direct comparison of theirs reads sets. Conclusion The use of quality values on average improves the classification accuracy, and its contribution increases when the reads are more noisy. Also the comparison of metagenomic microbial communities can be performed efficiently. Similar metagenomes are quickly detected, just by processing their read data, without the need of costly alignments.
Collapse
Affiliation(s)
- Matteo Comin
- Department of Information Engineering, University of Padova, Via Gradenigo 6/A, Padova, Italy.
| | - Michele Schimd
- Department of Information Engineering, University of Padova, Via Gradenigo 6/A, Padova, Italy
| |
Collapse
|
17
|
Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 2016; 17:132. [PMID: 27323842 PMCID: PMC4915045 DOI: 10.1186/s13059-016-0997-x] [Citation(s) in RCA: 1571] [Impact Index Per Article: 196.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2015] [Accepted: 06/03/2016] [Indexed: 02/07/2023] Open
Abstract
Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license (https://github.com/marbl/mash).
Collapse
Affiliation(s)
- Brian D Ondov
- National Biodefense Analysis and Countermeasures Center, Frederick, MD, USA
| | - Todd J Treangen
- National Biodefense Analysis and Countermeasures Center, Frederick, MD, USA
| | - Páll Melsted
- Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavik, Iceland
| | - Adam B Mallonee
- National Biodefense Analysis and Countermeasures Center, Frederick, MD, USA
| | - Nicholas H Bergman
- National Biodefense Analysis and Countermeasures Center, Frederick, MD, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
| |
Collapse
|
18
|
Fan H, Ives AR, Surget-Groba Y, Cannon CH. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 2015; 16:522. [PMID: 26169061 PMCID: PMC4501066 DOI: 10.1186/s12864-015-1647-5] [Citation(s) in RCA: 89] [Impact Index Per Article: 9.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2015] [Accepted: 05/20/2015] [Indexed: 11/10/2022] Open
Abstract
Background Next-generation sequencing technologies are rapidly generating whole-genome datasets for an increasing number of organisms. However, phylogenetic reconstruction of genomic data remains difficult because de novo assembly for non-model genomes and multi-genome alignment are challenging. Results To greatly simplify the analysis, we present an Assembly and Alignment-Free (AAF) method (https://sourceforge.net/projects/aaf-phylogeny) that constructs phylogenies directly from unassembled genome sequence data, bypassing both genome assembly and alignment. Using mathematical calculations, models of sequence evolution, and simulated sequencing of published genomes, we address both evolutionary and sampling issues caused by direct reconstruction, including homoplasy, sequencing errors, and incomplete sequencing coverage. From these results, we calculate the statistical properties of the pairwise distances between genomes, allowing us to optimize parameter selection and perform bootstrapping. As a test case with real data, we successfully reconstructed the phylogeny of 12 mammals using raw sequencing reads. We also applied AAF to 21 tropical tree genome datasets with low coverage to demonstrate its effectiveness on non-model organisms. Conclusion Our AAF method opens up phylogenomics for species without an appropriate reference genome or high sequence coverage, and rapidly creates a phylogenetic framework for further analysis of genome structure and diversity among non-model organisms. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1647-5) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Huan Fan
- Key Laboratory of Tropical Forest Ecology, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Mengla, Yunnan, 666303, China. .,University of Chinese Academy of Sciences, Beijing, 100049, China. .,Department of Zoology, University of Wisconsin-Madison, Madison, WI, 53706, USA.
| | - Anthony R Ives
- Department of Zoology, University of Wisconsin-Madison, Madison, WI, 53706, USA.
| | - Yann Surget-Groba
- Institut des Sciences de la Forêt Tempérée, Université du Québec en Outaouais, 58 rue Principale, Ripon, QC, J0V 1V0, Canada.
| | - Charles H Cannon
- Key Laboratory of Tropical Forest Ecology, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Mengla, Yunnan, 666303, China. .,Department of Biological Sciences, Texas Tech University, Lubbock, TX, 79410, USA.
| |
Collapse
|
19
|
Galzitskaya OV, Lobanov MY. Phyloproteomic Analysis of 11780 Six-Residue-Long Motifs Occurrences. BIOMED RESEARCH INTERNATIONAL 2015; 2015:208346. [PMID: 26114101 PMCID: PMC4465679 DOI: 10.1155/2015/208346] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/11/2014] [Accepted: 11/03/2014] [Indexed: 12/31/2022]
Abstract
How is it possible to find good traits for phylogenetic reconstructions? Here, we present a new phyloproteomic criterion that is an occurrence of simple motifs which can be imprints of evolution history. We studied the occurrences of 11780 six-residue-long motifs consisting of two randomly located amino acids in 97 eukaryotic and 25 bacterial proteomes. For all eukaryotic proteomes, with the exception of the Amoebozoa, Stramenopiles, and Diplomonadida kingdoms, the number of proteins containing the motifs from the first group (one of the two amino acids occurs once at the terminal position) made about 20%; in the case of motifs from the second (one of two amino acids occurs one time within the pattern) and third (the two amino acids occur randomly) groups, 30% and 50%, respectively. For bacterial proteomes, this relationship was 10%, 27%, and 63%, respectively. The matrices of correlation coefficients between numbers of proteins where a motif from the set of 11780 motifs appears at least once in 9 kingdoms and 5 phyla of bacteria were calculated. Among the correlation coefficients for eukaryotic proteomes, the correlation between the animal and fungi kingdoms (0.62) is higher than between fungi and plants (0.54). Our study provides support that animals and fungi are sibling kingdoms. Comparison of the frequencies of six-residue-long motifs in different proteomes allows obtaining phylogenetic relationships based on similarities between these frequencies: the Diplomonadida kingdoms are more close to Bacteria than to Eukaryota; Stramenopiles and Amoebozoa are more close to each other than to other kingdoms of Eukaryota.
Collapse
Affiliation(s)
- O. V. Galzitskaya
- Institute of Protein Research, Russian Academy of Sciences, 4 Institutskaya Street, Pushchino, Moscow Region 142290, Russia
| | - M. Yu. Lobanov
- Institute of Protein Research, Russian Academy of Sciences, 4 Institutskaya Street, Pushchino, Moscow Region 142290, Russia
| |
Collapse
|
20
|
Morgenstern B, Zhu B, Horwege S, Leimeister CA. Estimating evolutionary distances between genomic sequences from spaced-word matches. Algorithms Mol Biol 2015; 10:5. [PMID: 25685176 PMCID: PMC4327811 DOI: 10.1186/s13015-015-0032-x] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2014] [Accepted: 01/06/2015] [Indexed: 01/06/2023] Open
Abstract
Alignment-free methods are increasingly used to calculate evolutionary distances between DNA and protein sequences as a basis of phylogeny reconstruction. Most of these methods, however, use heuristic distance functions that are not based on any explicit model of molecular evolution. Herein, we propose a simple estimator d N of the evolutionary distance between two DNA sequences that is calculated from the number N of (spaced) word matches between them. We show that this distance function is more accurate than other distance measures that are used by alignment-free methods. In addition, we calculate the variance of the normalized number N of (spaced) word matches. We show that the variance of N is smaller for spaced words than for contiguous words, and that the variance is further reduced if our spaced-words approach is used with multiple patterns of 'match positions' and 'don't care positions'. Our software is available online and as downloadable source code at: http://spaced.gobics.de/.
Collapse
|
21
|
Comin M, Leoni A, Schimd M. Clustering of reads with alignment-free measures and quality values. Algorithms Mol Biol 2015; 10:4. [PMID: 25691913 PMCID: PMC4331138 DOI: 10.1186/s13015-014-0029-x] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2014] [Accepted: 12/17/2014] [Indexed: 12/03/2022] Open
Abstract
Background The data volume generated by Next-Generation Sequencing (NGS) technologies is growing at a pace that is now challenging the storage and data processing capacities of modern computer systems. In this context an important aspect is the reduction of data complexity by collapsing redundant reads in a single cluster to improve the run time, memory requirements, and quality of post-processing steps like assembly and error correction. Several alignment-free measures, based on k-mers counts, have been used to cluster reads. Quality scores produced by NGS platforms are fundamental for various analysis of NGS data like reads mapping and error detection. Moreover future-generation sequencing platforms will produce long reads but with a large number of erroneous bases (up to 15 %). Results In this scenario it will be fundamental to exploit quality value information within the alignment-free framework. To the best of our knowledge this is the first study that incorporates quality value information and k-mers counts, in the context of alignment-free measures, for the comparison of reads data. Based on this principles, in this paper we present a family of alignment-free measures called Dq-type. A set of experiments on simulated and real reads data confirms that the new measures are superior to other classical alignment-free statistics, especially when erroneous reads are considered. Also results on de novo assembly and metagenomic reads classification show that the introduction of quality values improves over standard alignment-free measures. These statistics are implemented in a software called QCluster (http://www.dei.unipd.it/~ciompin/main/qcluster.html).
Collapse
|
22
|
Bonham-Carter O, Steele J, Bastola D. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief Bioinform 2014; 15:890-905. [PMID: 23904502 PMCID: PMC4296134 DOI: 10.1093/bib/bbt052] [Citation(s) in RCA: 76] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2013] [Accepted: 05/31/2013] [Indexed: 12/17/2022] Open
Abstract
Modern sequencing and genome assembly technologies have provided a wealth of data, which will soon require an analysis by comparison for discovery. Sequence alignment, a fundamental task in bioinformatics research, may be used but with some caveats. Seminal techniques and methods from dynamic programming are proving ineffective for this work owing to their inherent computational expense when processing large amounts of sequence data. These methods are prone to giving misleading information because of genetic recombination, genetic shuffling and other inherent biological events. New approaches from information theory, frequency analysis and data compression are available and provide powerful alternatives to dynamic programming. These new methods are often preferred, as their algorithms are simpler and are not affected by synteny-related problems. In this review, we provide a detailed discussion of computational tools, which stem from alignment-free methods based on statistical analysis from word frequencies. We provide several clear examples to demonstrate applications and the interpretations over several different areas of alignment-free analysis such as base-base correlations, feature frequency profiles, compositional vectors, an improved string composition and the D2 statistic metric. Additionally, we provide detailed discussion and an example of analysis by Lempel-Ziv techniques from data compression.
Collapse
|
23
|
Comin M, Schimd M. Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns. BMC Bioinformatics 2014; 15 Suppl 9:S1. [PMID: 25252700 PMCID: PMC4168702 DOI: 10.1186/1471-2105-15-s9-s1] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/16/2023] Open
Abstract
Background With the advent of Next-Generation Sequencing technologies (NGS), a large amount of short read data has been generated. If a reference genome is not available, the assembly of a template sequence is usually challenging because of repeats and the short length of reads. When NGS reads cannot be mapped onto a reference genome alignment-based methods are not applicable. However it is still possible to study the evolutionary relationship of unassembled genomes based on NGS data. Results We present a parameter-free alignment-free method, called Under2¯, based on variable-length patterns, for the direct comparison of sets of NGS reads. We define a similarity measure using variable-length patterns, as well as reverses and reverse-complements, along with their statistical and syntactical properties. We evaluate several alignment-free statistics on the comparison of NGS reads coming from simulated and real genomes. In almost all simulations our method Under2¯ outperforms all other statistics. The performance gain becomes more evident when real genomes are used. Conclusion The new alignment-free statistic is highly successful in discriminating related genomes based on NGS reads data. In almost all experiments, it outperforms traditional alignment-free statistics that are based on fixed length patterns.
Collapse
|
24
|
Comin M, Verzotto D. Beyond Fixed-Resolution Alignment-Free Measures for Mammalian Enhancers Sequence Comparison. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:628-637. [PMID: 26356333 DOI: 10.1109/tcbb.2014.2306830] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
UNLABELLED The cell-type diversity is to a large degree driven by transcription regulation, i.e., enhancers. It has been recently shown that in high-level eukaryotes enhancers rarely work alone, instead they collaborate by forming clusters of cis-regulatory modules (CRMs). Even if the binding of transcription factors is sequence-specific, the identification of functionally similar enhancers is very difficult. A similarity measure to detect related regulatory sequences is crucial to understand functional correlation between two enhancers. This will allow large-scale analyses, clustering and genome-wide classifications. In this paper we present Under2, a parameter-free alignment-free statistic based on variable-length words. As opposed to traditional alignment-free methods, which are based on fixed-length patterns or, in other words, tied to a fixed resolution, our statistic is built upon variable-length words, and thus multiple resolutions are allowed. This will capture the great variability of lengths of CRMs. We evaluate several alignment-free statistics on simulated data and real ChIP-seq sequences. The new statistic is highly successful in discriminating functionally related enhancers and, in almost all experiments, it outperforms fixed-resolution methods. Finally, experiments on mouse enhancers show that Under2 can separate enhancers active in different tissues. AVAILABILITY http://www.dei.unipd.it/~ciompin/main/UnderIICRMS.html.
Collapse
|
25
|
Chen S, Wang A, Li LM. SEME: a fast mapper of Illumina sequencing reads with statistical evaluation. J Comput Biol 2014; 20:847-60. [PMID: 24195707 DOI: 10.1089/cmb.2013.0111] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open
Abstract
Mapping reads to a reference genome is a routine yet computationally intensive task in research based on high-throughput sequencing. In recent years, the sequencing reads of the Illumina platform have become longer and their quality scores higher. According to our calculation, this allows perfect k-mer seed match for almost all reads when a close reference genome is available subject to reasonable specificity. Our other observation is that the majority reads contain at most one short INDEL polymorphism. Based on these observations, we propose a fast-mapping approach, referred to as "SEME," which has two core steps: First it scans a read sequentially in a specific order for a k-mer exact match seed; next it extends the alignment on both sides allowing, at most, one short INDEL each using a novel method called "auto-match function." We decompose the evaluation of the sensitivity and specificity into two parts corresponding to the seed and extension step, and the composite result provides an approximate overall reliability estimate of each mapping. We compare SEME with some existing mapping methods on several datasets, and SEME shows better performance in terms of both running time and mapping rates.
Collapse
Affiliation(s)
- Shijian Chen
- 1 National Center for Mathematics and Interdisciplinary Sciences , Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
| | | | | |
Collapse
|
26
|
Gunasinghe U, Alahakoon D, Bedingfield S. Extraction of high quality k-words for alignment-free sequence comparison. J Theor Biol 2014; 358:31-51. [PMID: 24846728 DOI: 10.1016/j.jtbi.2014.05.016] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2013] [Revised: 05/04/2014] [Accepted: 05/07/2014] [Indexed: 10/25/2022]
Abstract
The weighted Euclidean distance (D(2)) is one of the earliest dissimilarity measures used for alignment free comparison of biological sequences. This distance measure and its variants have been used in numerous applications due to its fast computation, and many variants of it have been subsequently introduced. The D(2) distance measure is based on the count of k-words in the two sequences that are compared. Traditionally, all k-words are compared when computing the distance. In this paper we show that similar accuracy in sequence comparison can be achieved by using a selected subset of k-words. We introduce a term variance based quality measure for identifying the important k-words. We demonstrate the application of the proposed technique in phylogeny reconstruction and show that up to 99% of the k-words can be filtered out for certain datasets, resulting in faster sequence comparison. The paper also presents an exploratory analysis based evaluation of optimal k-word values and discusses the impact of using subsets of k-words in such optimal instances.
Collapse
Affiliation(s)
- Upuli Gunasinghe
- Clayton School of Information Technology, Faculty of Information Technology, Monash University, VIC 3800, Australia.
| | - Damminda Alahakoon
- School of Information and Business Analytics, Deakin University, VIC 3125, Australia.
| | - Susan Bedingfield
- Clayton School of Information Technology, Faculty of Information Technology, Monash University, VIC 3800, Australia.
| |
Collapse
|
27
|
|
28
|
Taillefer E, Miller J. Exhaustive computation of exact duplications via super and non-nested local maximal repeats. J Bioinform Comput Biol 2013; 12:1350018. [PMID: 24467757 DOI: 10.1142/s0219720013500182] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]
Abstract
We propose and implement a method to obtain all duplicated sequences (repeats) from a chromosome or whole genome. Unlike existing approaches our method makes it possible to simultaneously identify and classify repeats into super, local, and non-nested local maximal repeats. Computation verification demonstrates that maximal repeats for a genome of several gigabases can be identified in a reasonable time, enabling us to identified these maximal repeats for any sequenced genome. The algorithm used for the identification relies on enhanced suffix array data structure to achieve practical space and time efficiency, to identify and classify the maximal repeats, and to perform further post-processing on the identified duplicated sequences. The simplicity and effectiveness of the implementation makes the method readily extendible to more sophisticated computations. Maxmers can be exhaustively accounted for in few minutes for genome sequences of dozen megabases in length and in less than a day or two for genome sequences of few gigabases in length. One application of duplicated sequence identification is to the study of duplicated sequence length distributions, which our found to exhibit for large lengths a persistent power-law behavior. Variation of estimated exponents of this power law are studied among different species and successive assembly release versions of the same species. This makes the characterization of the power-law regime of sequenced genomes via maximal repeats identification and classification, an important task for the derivation of models that would help us to elucidate sequence duplication and genome evolution.
Collapse
Affiliation(s)
- Eddy Taillefer
- Physics and Biology Unit, Okinawa Institute of Science and Technology, 1919-1 Tancha, Onna-son, Kunigami-gun 904-0412, Japan
| | | |
Collapse
|
29
|
Maurer-Stroh S, Gunalan V, Wong WC, Eisenhaber F. A simple shortcut to unsupervised alignment-free phylogenetic genome groupings, even from unassembled sequencing reads. J Bioinform Comput Biol 2013; 11:1343005. [PMID: 24372034 DOI: 10.1142/s0219720013430051] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023]
Abstract
We propose an extension to alignment-free approaches that can produce reasonably accurate phylogenetic groupings starting from unaligned genomes, for example, as fast as 1 min on a standard desktop computer for 25 bacterial genomes. A 6-fold speed-up and 11-fold reduction in memory requirements compared to previous alignment-free methods is achieved by reducing the comparison space to a representative sample of k-mers of optimal length and with specific tag motifs. This approach was applied to the test case of fitting the enterohemorrhagic O104:H4 E.coli strain from the 2011 outbreak in Germany into the phylogenetic network of previously known E.coli-related strains and extend the method to allow assigning any new strain to the correct phylogenetic group even directly from unassembled short sequence reads from next generation sequencing data. Hence, this approach is also useful to quickly identify the most suitable reference genome for subsequent assembly steps.
Collapse
Affiliation(s)
- Sebastian Maurer-Stroh
- Bioinformatics Institute (BII), Agency for Science Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore , School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore 637551, Singapore
| | | | | | | |
Collapse
|
30
|
Burden CJ, Leopardi P, Forêt S. The distribution of word matches between Markovian sequences with periodic boundary conditions. J Comput Biol 2013; 21:41-63. [PMID: 24160839 DOI: 10.1089/cmb.2012.0277] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D(2) statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D(2) statistic under the null hypothesis of identically and independently distributed letters have been studied extensively, but no comprehensive study of the D(2) distribution for biologically more realistic higher-order Markovian sequences exists. Here we derive exact formulas for the mean and variance of the D(2) statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate D(2) distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical D(2) distribution from the human genome.
Collapse
Affiliation(s)
- Conrad J Burden
- 1 Mathematical Sciences Institute, Australian National University , Canberra, ACT, Australia
| | | | | |
Collapse
|
31
|
Schwende I, Pham TD. Pattern recognition and probabilistic measures in alignment-free sequence analysis. Brief Bioinform 2013; 15:354-68. [PMID: 24096012 DOI: 10.1093/bib/bbt070] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open
Abstract
With the massive production of genomic and proteomic data, the number of available biological sequences in databases has reached a level that is not feasible anymore for exact alignments even when just a fraction of all sequences is used. To overcome this inevitable time complexity, ultrafast alignment-free methods are studied. Within the past two decades, a broad variety of nonalignment methods have been proposed including dissimilarity measures on classical representations of sequences like k-words or Markov models. Furthermore, articles were published that describe distance measures on alternative representations such as compression complexity, spectral time series or chaos game representation. However, alignments are still the standard method for real world applications in biological sequence analysis, and the time efficient alignment-free approaches are usually applied in cases when the accustomed algorithms turn out to fail or be too inconvenient.
Collapse
Affiliation(s)
- Isabel Schwende
- PhD, Aizu Research Cluster for Medical Informatics and Engineering (ARC-Medical), Research Center for Advanced Information Science and Technology (CAIST), The University of Aizu, Aizuwakamatsu, Fukushima 965-8580, Japan.
| | | |
Collapse
|
32
|
Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinform 2013; 15:343-53. [PMID: 24064230 DOI: 10.1093/bib/bbt067] [Citation(s) in RCA: 112] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023] Open
Abstract
With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.
Collapse
Affiliation(s)
- Kai Song
- Molecular and Computational Biology Program, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, USA. or
| | | | | | | | | | | |
Collapse
|
33
|
|
34
|
Ren J, Song K, Sun F, Deng M, Reinert G. Multiple alignment-free sequence comparison. Bioinformatics 2013; 29:2690-8. [PMID: 23990418 DOI: 10.1093/bioinformatics/btt462] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Recently, a range of new statistics have become available for the alignment-free comparison of two sequences based on k-tuple word content. Here, we extend these statistics to the simultaneous comparison of more than two sequences. Our suite of statistics contains, first, C(*)1 and C(S)1, extensions of statistics for pairwise comparison of the joint k-tuple content of all the sequences, and second, C(*)2, C(S)2 and C(geo)2, averages of sums of pairwise comparison statistics. The two tasks we consider are, first, to identify sequences that are similar to a set of target sequences, and, second, to measure the similarity within a set of sequences. RESULTS Our investigation uses both simulated data as well as cis-regulatory module data where the task is to identify cis-regulatory modules with similar transcription factor binding sites. We find that although for real data, all of our statistics show a similar performance, on simulated data the Shepp-type statistics are in some instances outperformed by star-type statistics. The multiple alignment-free statistics are more sensitive to contamination in the data than the pairwise average statistics. AVAILABILITY Our implementation of the five statistics is available as R package named 'multiAlignFree' at be http://www-rcf.usc.edu/∼fsun/Programs/multiAlignFree/multiAlignFreemain.html. CONTACT reinert@stats.ox.ac.uk. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jie Ren
- School of Mathematics, Peking University, Beijing 100871, PR China, Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089-2910, USA, MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, PR China and Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
| | | | | | | | | |
Collapse
|
35
|
Song K, Ren J, Zhai Z, Liu X, Deng M, Sun F. Alignment-free sequence comparison based on next-generation sequencing reads. J Comput Biol 2013; 20:64-79. [PMID: 23383994 DOI: 10.1089/cmb.2012.0228] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023] Open
Abstract
Next-generation sequencing (NGS) technologies have generated enormous amounts of shotgun read data, and assembly of the reads can be challenging, especially for organisms without template sequences. We study the power of genome comparison based on shotgun read data without assembly using three alignment-free sequence comparison statistics, D(2), D(*)(2) and D(s)(2), both theoretically and by simulations. Theoretical formulas for the power of detecting the relationship between two sequences related through a common motif model are derived. It is shown that both D(*)(2) and D(s)(2), outperform D(2) for detecting the relationship between two sequences based on NGS data. We then study the effects of length of the tuple, read length, coverage, and sequencing error on the power of D(*)(2) and D(s)(2). Finally, variations of these statistics, d(2), d(*)(2) and d(s)(2), respectively, are used to first cluster five mammalian species with known phylogenetic relationships, and then cluster 13 tree species whose complete genome sequences are not available using NGS shotgun reads. The clustering results using d(s)(2) are consistent with biological knowledge for the 5 mammalian and 13 tree species, respectively. Thus, the statistic d(s)(2) provides a powerful alignment-free comparison tool to study the relationships among different organisms based on NGS read data without assembly.
Collapse
Affiliation(s)
- Kai Song
- School of Mathematics, Peking University, Beijing, PR China
| | | | | | | | | | | |
Collapse
|
36
|
Behnam E, Waterman MS, Smith AD. A geometric interpretation for local alignment-free sequence comparison. J Comput Biol 2013; 20:471-85. [PMID: 23829649 PMCID: PMC3704055 DOI: 10.1089/cmb.2012.0280] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Local alignment-free sequence comparison arises in the context of identifying similar segments of sequences that may not be alignable in the traditional sense. We propose a randomized approximation algorithm that is both accurate and efficient. We show that under D2 and its important variant [Formula: see text] as the similarity measure, local alignment-free comparison between a pair of sequences can be formulated as the problem of finding the maximum bichromatic dot product between two sets of points in high dimensions. We introduce a geometric framework that reduces this problem to that of finding the bichromatic closest pair (BCP), allowing the properties of the underlying metric to be leveraged. Local alignment-free sequence comparison can be solved by making a quadratic number of alignment-free substring comparisons. We show both theoretically and through empirical results on simulated data that our approximation algorithm requires a subquadratic number of such comparisons and trades only a small amount of accuracy to achieve this efficiency. Therefore, our algorithm can extend the current usage of alignment-free-based methods and can also be regarded as a substitute for local alignment algorithms in many biological studies.
Collapse
Affiliation(s)
- Ehsan Behnam
- Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089-2910, USA
| | | | | |
Collapse
|
37
|
A novel statistical measure for sequence comparison on the basis of k-word counts. J Theor Biol 2012; 318:91-100. [PMID: 23147229 DOI: 10.1016/j.jtbi.2012.10.035] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2011] [Revised: 10/10/2012] [Accepted: 10/31/2012] [Indexed: 11/24/2022]
Abstract
Numerous efficient methods based on word counts for sequence analysis have been proposed to characterize DNA sequences to help in comparison, retrieval from the databases and reconstructing evolutionary relations. However, most of them seem unrelated to any intrinsic characteristics of DNA. In this paper, we proposed a novel statistical measure for sequence comparison on the basis of k-word counts. This new measure removed the influence of sequences' lengths and uncovered bulk property of DNA sequences. The proposed measure was tested by similarity search and phylogenetic analysis. The experimental assessment demonstrated that our similarity measure was efficient.
Collapse
|
38
|
Randić M. Very efficient search for nucleotide alignments. J Comput Chem 2012; 34:77-82. [PMID: 22949371 DOI: 10.1002/jcc.23105] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2012] [Revised: 08/10/2012] [Accepted: 08/13/2012] [Indexed: 11/12/2022]
Abstract
We describe a very efficient search for nucleotide alignments, which is analogous to the novel very efficient search for protein alignment. Just as it has been the case with the alignment of proteins, based on 20 × 20 adjacency matrices for amino acids, obtained from a superposition of labeled amino acids adjacency matrices for the proteins considered, one can construct labeled matrices of size 4 × 4, listing adjacencies of nucleotides in DNA sequence. The matrix elements correspond to 16 pairs of adjacent nucleotides. To obtain DNA alignments, one combines information in the corresponding matrices for a pair of DNA nucleotides. Matrices are obtained by insertion of the sequential labels for pairs of nucleotides in the corresponding cells of the 4 × 4 tables. When two such matrices are superimposed, one can identify all segments in two DNA sequences, which are shifted relative to one another by the same amount in either direction, without using trial-and-error displacements of the two sequences one relative to the other to find local nucleotide alignments.
Collapse
Affiliation(s)
- Milan Randić
- Department of Mathematics and Computer Science, Drake University, Des Moines, Iowa 50311, USA.
| |
Collapse
|
39
|
Göke J, Schulz MH, Lasserre J, Vingron M. Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts. ACTA ACUST UNITED AC 2012; 28:656-63. [PMID: 22247280 PMCID: PMC3289921 DOI: 10.1093/bioinformatics/bts028] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION The identity of cells and tissues is to a large degree governed by transcriptional regulation. A major part is accomplished by the combinatorial binding of transcription factors at regulatory sequences, such as enhancers. Even though binding of transcription factors is sequence-specific, estimating the sequence similarity of two functionally similar enhancers is very difficult. However, a similarity measure for regulatory sequences is crucial to detect and understand functional similarities between two enhancers and will facilitate large-scale analyses like clustering, prediction and classification of genome-wide datasets. RESULTS We present the standardized alignment-free sequence similarity measure N2, a flexible framework that is defined for word neighbourhoods. We explore the usefulness of adding reverse complement words as well as words including mismatches into the neighbourhood. On simulated enhancer sequences as well as functional enhancers in mouse development, N2 is shown to outperform previous alignment-free measures. N2 is flexible, faster than competing methods and less susceptible to single sequence noise and the occurrence of repetitive sequences. Experiments on the mouse enhancers reveal that enhancers active in different tissues can be separated by pairwise comparison using N2. CONCLUSION N2 represents an improvement over previous alignment-free similarity measures without compromising speed, which makes it a good candidate for large-scale sequence comparison of regulatory sequences. AVAILABILITY The software is part of the open-source C++ library SeqAn (www.seqan.de) and a compiled version can be downloaded at http://www.seqan.de/projects/alf.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jonathan Göke
- Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany.
| | | | | | | |
Collapse
|
40
|
Integrating overlapping structures and background information of words significantly improves biological sequence comparison. PLoS One 2011; 6:e26779. [PMID: 22102867 PMCID: PMC3213098 DOI: 10.1371/journal.pone.0026779] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2011] [Accepted: 10/04/2011] [Indexed: 12/19/2022] Open
Abstract
Word-based models have achieved promising results in sequence comparison. However, as the important statistical properties of words in biological sequence, how to use the overlapping structures and background information of the words to improve sequence comparison is still a problem. This paper proposed a new statistical method that integrates the overlapping structures and the background information of the words in biological sequences. To assess the effectiveness of this integration for sequence comparison, two sets of evaluation experiments were taken to test the proposed model. The first one, performed via receiver operating curve analysis, is the application of proposed method in discrimination between functionally related regulatory sequences and unrelated sequences, intron and exon. The second experiment is to evaluate the performance of the proposed method with f-measure for clustering Hepatitis E virus genotypes. It was demonstrated that the proposed method integrating the overlapping structures and the background information of words significantly improves biological sequence comparison and outperforms the existing models.
Collapse
|
41
|
Liu X, Wan L, Li J, Reinert G, Waterman MS, Sun F. New powerful statistics for alignment-free sequence comparison under a pattern transfer model. J Theor Biol 2011; 284:106-16. [PMID: 21723298 DOI: 10.1016/j.jtbi.2011.06.020] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2011] [Revised: 05/30/2011] [Accepted: 06/17/2011] [Indexed: 12/15/2022]
Abstract
Alignment-free sequence comparison is widely used for comparing gene regulatory regions and for identifying horizontally transferred genes. Recent studies on the power of a widely used alignment-free comparison statistic D2 and its variants D*2 and D(s)2 showed that their power approximates a limit smaller than 1 as the sequence length tends to infinity under a pattern transfer model. We develop new alignment-free statistics based on D2, D*2 and D(s)2 by comparing local sequence pairs and then summing over all the local sequence pairs of certain length. We show that the new statistics are much more powerful than the corresponding statistics and the power tends to 1 as the sequence length tends to infinity under the pattern transfer model.
Collapse
Affiliation(s)
- Xuemei Liu
- School of Physics, South China University of Technology, Guangzhou, PR China
| | | | | | | | | | | |
Collapse
|
42
|
Alignment-free comparison of genome sequences by a new numerical characterization. J Theor Biol 2011; 281:107-12. [PMID: 21536050 DOI: 10.1016/j.jtbi.2011.04.003] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2010] [Revised: 04/01/2011] [Accepted: 04/02/2011] [Indexed: 01/29/2023]
Abstract
In order to compare different genome sequences, an alignment-free method has proposed. First, we presented a new graphical representation of DNA sequences without degeneracy, which is conducive to intuitive comparison of sequences. Then, a new numerical characterization based on the representation was introduced to quantitatively depict the intrinsic nature of genome sequences, and considered as a 10-dimensional vector in the mathematical space. Alignment-free comparison of sequences was performed by computing the distances between vectors of the corresponding numerical characterizations, which define the evolutionary relationship. Two data sets of DNA sequences were constructed to assess the performance on sequence comparison. The results illustrate well validity of the method. The new numerical characterization provides a powerful tool for genome comparison.
Collapse
|
43
|
Dai Q, Liu X, Yao Y, Zhao F. Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison. J Theor Biol 2011; 276:174-80. [PMID: 21334347 DOI: 10.1016/j.jtbi.2011.02.005] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2010] [Revised: 02/05/2011] [Accepted: 02/07/2011] [Indexed: 10/18/2022]
Abstract
Sequence comparison is one of the major tasks in bioinformatics, which can be used to study structural and functional conservation, as well as evolutionary relations among the sequences. Numerous dissimilarity measures achieve promising results in sequence comparison, but challenges remain. This paper studied numerical characteristics of word frequencies and proposed a novel dissimilarity measure for sequence comparison. Instead of using the word frequencies directly, the proposed measure considers both the word frequencies and overlapping structures of words. To verify the effectiveness of the proposed measure, we tested it with two experiments and further compared it with alignment-based and alignment-free measures. The results demonstrate that the proposed measure extracting more information on the overlapping structures of the words improves the efficiency of sequence comparison.
Collapse
Affiliation(s)
- Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China.
| | | | | | | |
Collapse
|
44
|
|
45
|
Wan L, Reinert G, Sun F, Waterman MS. Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput Biol 2010; 17:1467-90. [PMID: 20973742 DOI: 10.1089/cmb.2010.0056] [Citation(s) in RCA: 74] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023] Open
Abstract
Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, both from a theoretical viewpoint and numerically, providing an easy to use program. The power is assessed under two alternative hidden Markov models; the first one assumes that the two sequences share a common motif, whereas the second model is a pattern transfer model; the null model is that the two sequences are composed of independent and identically distributed letters and they are independent. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically D2S has the largest power, followed by D2*, whereas the power of D2 can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations D2* generally has the largest power. Under the first alternative model of a shared motif, the power of D2*approaches 100% when sufficiently many motifs are shared, and we recommend the use of D2* for such practical applications. Under the second alternative model of pattern transfer,the power for all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration canbe recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version),verifying that D2* is generally more powerful than D2. The program to calculate the power of D2, D2* and D2S can be downloaded from http://meta.cmb.usc.edu/d2. Supplementary Material is available at www.liebertonline.com/cmb.
Collapse
Affiliation(s)
- Lin Wan
- Molecular and Computational Biology, University of Southern California , Los Angeles, California 90089-2910, USA
| | | | | | | |
Collapse
|
46
|
Jing J, Burden CJ, Forêt S, Wilson SR. Statistical considerations underpinning an alignment-free sequence comparison method. J Korean Stat Soc 2010. [DOI: 10.1016/j.jkss.2010.02.009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]
|
47
|
Koohy H, Dyer NP, Reid JE, Koentges G, Ott S. An alignment-free model for comparison of regulatory sequences. ACTA ACUST UNITED AC 2010; 26:2391-7. [PMID: 20696736 DOI: 10.1093/bioinformatics/btq453] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]
Abstract
MOTIVATION Some recent comparative studies have revealed that regulatory regions can retain function over large evolutionary distances, even though the DNA sequences are divergent and difficult to align. It is also known that such enhancers can drive very similar expression patterns. This poses a challenge for the in silico detection of biologically related sequences, as they can only be discovered using alignment-free methods. RESULTS Here, we present a new computational framework called Regulatory Region Scoring (RRS) model for the detection of functional conservation of regulatory sequences using predicted occupancy levels of transcription factors of interest. We demonstrate that our model can detect the functional and/or evolutionary links between some non-alignable enhancers with a strong statistical significance. We also identify groups of enhancers that are likely to be similarly regulated. Our model is motivated by previous work on prediction of expression patterns and it can capture similarity by strong binding sites, weak binding sites and even the statistically significant absence of sites. Our results support the hypothesis that weak binding sites contribute to the functional similarity of sequences. Our model fills a gap between two families of models: detailed, data-intensive models for the prediction of precise spatio-temporal expression patterns on the one side, and crude, generally applicable models on the other side. Our model borrows some of the strengths of each group and addresses their drawbacks. AVAILABILITY The RRS source code is freely available upon publication of this manuscript: http://www2.warwick.ac.uk/fac/sci/systemsbiology/staff/ott/tools_and_software/rrs.
Collapse
Affiliation(s)
- Hashem Koohy
- MOAC Doctoral Training Centre, Coventry House, University of Warwick, Coventry, CV4 7AL, UK.
| | | | | | | | | |
Collapse
|
48
|
Reinert G, Chew D, Sun F, Waterman MS. Alignment-free sequence comparison (I): statistics and power. J Comput Biol 2010; 16:1615-34. [PMID: 20001252 DOI: 10.1089/cmb.2009.0198] [Citation(s) in RCA: 126] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D(2) statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D(2) statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D(2) word count statistic, which we call D(2)(S) and D(2)(*). For D(2)(S), which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed, when sequence lengths tend to infinity, and not dominated by the noise in the individual sequences. The second statistic, D(2)(*), outperforms D(2)(S) in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of D(2)(*), we cannot provide a closed form for power calculations.
Collapse
Affiliation(s)
- Gesine Reinert
- Department of Statistics, University of Oxford, Oxford OX1 3TG, UK
| | | | | | | |
Collapse
|
49
|
Pavlović-Lazetić GM, Mitić NS, Beljanski MV. n-Gram characterization of genomic islands in bacterial genomes. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2009; 93:241-56. [PMID: 19101056 PMCID: PMC7185697 DOI: 10.1016/j.cmpb.2008.10.014] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 04/20/2008] [Revised: 09/10/2008] [Accepted: 10/21/2008] [Indexed: 05/27/2023]
Abstract
The paper presents a novel, n-gram-based method for analysis of bacterial genome segments known as genomic islands (GIs). Identification of GIs in bacterial genomes is an important task since many of them represent inserts that may contribute to bacterial evolution and pathogenesis. In order to characterize and distinguish GIs from rest of the genome, binary classification of islands based on n-gram frequency distribution have been performed. It consists of testing the agreement of islands n-gram frequency distributions with the complete genome and backbone sequence. In addition, a statistic based on the maximal order Markov model is used to identify significantly overrepresented and underrepresented n-grams in islands. The results may be used as a basis for Zipf-like analysis suggesting that some of the n-grams are overrepresented in a subset of islands and underrepresented in the backbone, or vice versa, thus complementing the binary classification. The method is applied to strain-specific regions in the Escherichia coli O157:H7 EDL933 genome (O-islands), resulting in two groups of O-islands with different n-gram characteristics. It refines a characterization based on other compositional features such as G+C content and codon usage, and may help in identification of GIs, and also in research and development of adequate drugs targeting virulence genes in them.
Collapse
|
50
|
Dai Q, Yang Y, Wang T. Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison. ACTA ACUST UNITED AC 2008; 24:2296-302. [PMID: 18710871 DOI: 10.1093/bioinformatics/btn436] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
MOTIVATION Many proposed statistical measures can efficiently compare biological sequences to further infer their structures, functions and evolutionary information. They are related in spirit because all the ideas for sequence comparison try to use the information on the k-word distributions, Markov model or both. Motivated by adding k-word distributions to Markov model directly, we investigated two novel statistical measures for sequence comparison, called wre.k.r and S2.k.r. RESULTS The proposed measures were tested by similarity search, evaluation on functionally related regulatory sequences and phylogenetic analysis. This offers the systematic and quantitative experimental assessment of our measures. Moreover, we compared our achievements with these based on alignment or alignment-free. We grouped our experiments into two sets. The first one, performed via ROC (receiver operating curve) analysis, aims at assessing the intrinsic ability of our statistical measures to search for similar sequences from a database and discriminate functionally related regulatory sequences from unrelated sequences. The second one aims at assessing how well our statistical measure is used for phylogenetic analysis. The experimental assessment demonstrates that our similarity measures intending to incorporate k-word distributions into Markov model are more efficient.
Collapse
Affiliation(s)
- Qi Dai
- Department of Applied Mathematics, Dalian University of Technology, Dalian 116024, China.
| | | | | |
Collapse
|