1
|
Bai X, Ren J, Sun F. MLR-OOD: A Markov Chain Based Likelihood Ratio Method for Out-Of-Distribution Detection of Genomic Sequences. J Mol Biol 2022; 434:167586. [PMID: 35427634 PMCID: PMC10433695 DOI: 10.1016/j.jmb.2022.167586] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2022] [Revised: 04/05/2022] [Accepted: 04/05/2022] [Indexed: 12/23/2022]
Abstract
Machine learning or deep learning models have been widely used for taxonomic classification of metagenomic sequences and many studies reported high classification accuracy. Such models are usually trained based on sequences in several training classes in hope of accurately classifying unknown sequences into these classes. However, when deploying the classification models on real testing data sets, sequences that do not belong to any of the training classes may be present and are falsely assigned to one of the training classes with high confidence. Such sequences are referred to as out-of-distribution (OOD) sequences and are ubiquitous in metagenomic studies. To address this problem, we develop a deep generative model-based method, MLR-OOD, that measures the probability of a testing sequencing belonging to OOD by the likelihood ratio of the maximum of the in-distribution (ID) class conditional likelihoods and the Markov chain likelihood of the testing sequence measuring the sequence complexity. We compose three different microbial data sets consisting of bacterial, viral, and plasmid sequences for comprehensively benchmarking OOD detection methods. We show that MLR-OOD achieves the state-of-the-art performance demonstrating the generality of MLR-OOD to various types of microbial data sets. It is also shown that MLR-OOD is robust to the GC content, which is a major confounding effect for OOD detection of genomic sequences. In conclusion, MLR-OOD will greatly reduce false positives caused by OOD sequences in metagenomic sequence classification.
Collapse
Affiliation(s)
- Xin Bai
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
| | - Jie Ren
- Google Research, Brain Team, USA
| | - Fengzhu Sun
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA.
| |
Collapse
|
2
|
An S, Ren J, Sun F, Wan L. A New Context Tree Inference Algorithm for Variable Length Markov Chain Model with Applications to Biological Sequence Analyses. J Comput Biol 2022; 29:839-856. [PMID: 35451885 PMCID: PMC9419963 DOI: 10.1089/cmb.2021.0604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
The statistical inference of high-order Markov chains (MCs) for biological sequences is vital for molecular sequence analyses but can be hindered by the high dimensionality of free parameters. In the seminal article by Bühlmann and Wyner, variable length Markov chain (VLMC) model was proposed to embed the full-order MC in a sparse structured context tree. In the key procedure of tree pruning of their proposed context algorithm, the word count-based statistic for each branch was defined and compared with a fixed cutoff threshold calculated from a common chi-square distribution to prune the branch of the context tree. In this study, we find that the word counts for each branch are highly intercorrelated, resulting in non-negligible effects on the distribution of the statistic of interest. We demonstrate that the inferred context tree based on the original context algorithm by Bühlmann and Wyner, which uses a fixed cutoff threshold based on a common chi-square distribution, can be systematically biased and error prone. We denote the original context algorithm as VLMC-Biased (VLMC-B). To solve this problem, we propose a new context tree inference algorithm using an adaptive tree-pruning scheme, termed VLMC-Consistent (VLMC-C). The VLMC-C is founded on the consistent branch-specific mixed chi-square distributions calculated based on asymptotic normal distribution of multiple word patterns. We validate our theoretical branch-specific asymptotic distribution using simulated data. We compare VLMC-C with VLMC-B on context tree inference using both simulated and real genome sequence data and demonstrate that VLMC-C outperforms VLMC-B for both context tree reconstruction accuracy and model compression capacity.
Collapse
Affiliation(s)
- Shaokun An
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
| | - Jie Ren
- Quantitative and Computational Biology Department, University of Southern California, Los Angeles, California, USA
| | - Fengzhu Sun
- Quantitative and Computational Biology Department, University of Southern California, Los Angeles, California, USA
| | - Lin Wan
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
| |
Collapse
|
3
|
Gustafsson J, Norberg P, Qvick-Wester JR, Schliep A. Fast parallel construction of variable-length Markov chains. BMC Bioinformatics 2021; 22:487. [PMID: 34627154 PMCID: PMC8501649 DOI: 10.1186/s12859-021-04387-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2021] [Accepted: 09/20/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Alignment-free methods are a popular approach for comparing biological sequences, including complete genomes. The methods range from probability distributions of sequence composition to first and higher-order Markov chains, where a k-th order Markov chain over DNA has [Formula: see text] formal parameters. To circumvent this exponential growth in parameters, variable-length Markov chains (VLMCs) have gained popularity for applications in molecular biology and other areas. VLMCs adapt the depth depending on sequence context and thus curtail excesses in the number of parameters. The scarcity of available fast, or even parallel software tools, prompted the development of a parallel implementation using lazy suffix trees and a hash-based alternative. RESULTS An extensive evaluation was performed on genomes ranging from 12Mbp to 22Gbp. Relevant learning parameters were chosen guided by the Bayesian Information Criterion (BIC) to avoid over-fitting. Our implementation greatly improves upon the state-of-the-art even in serial execution. It exhibits very good parallel scaling with speed-ups for long sequences close to the optimum indicated by Amdahl's law of 3 for 4 threads and about 6 for 16 threads, respectively. CONCLUSIONS Our parallel implementation released as open-source under the GPLv3 license provides a practically useful alternative to the state-of-the-art which allows the construction of VLMCs even for very large genomes significantly faster than previously possible. Additionally, our parameter selection based on BIC gives guidance to end-users comparing genomes.
Collapse
Affiliation(s)
- Joel Gustafsson
- Institute of Biomedicine, Department of Infectious Diseases, University of Gothenburg, Gothenburg, Sweden.
| | - Peter Norberg
- Institute of Biomedicine, Department of Infectious Diseases, University of Gothenburg, Gothenburg, Sweden
| | - Jan R Qvick-Wester
- Department of Computer Science and Engineering, University of Gothenburg - Chalmers University of Technology, Gothenburg, Sweden
| | - Alexander Schliep
- Department of Computer Science and Engineering, University of Gothenburg - Chalmers University of Technology, Gothenburg, Sweden
| |
Collapse
|
4
|
Song K. Reads Binning Improves the Assembly of Viral Genome Sequences From Metagenomic Samples. Front Microbiol 2021; 12:664560. [PMID: 34093479 PMCID: PMC8175635 DOI: 10.3389/fmicb.2021.664560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2021] [Accepted: 04/27/2021] [Indexed: 11/13/2022] Open
Abstract
Metagenomes can be considered as mixtures of viral, bacterial, and other eukaryotic DNA sequences. Mining viral sequences from metagenomes could shed insight into virus-host relationships and expand viral databases. Current alignment-based methods are unsuitable for identifying viral sequences from metagenome sequences because most assembled metagenomic contigs are short and possess few or no predicted genes, and most metagenomic viral genes are dissimilar to known viral genes. In this study, I developed a Markov model-based method, VirMC, to identify viral sequences from metagenomic data. VirMC uses Markov chains to model sequence signatures and construct a scoring model using a likelihood test to distinguish viral and bacterial sequences. Compared with the other two state-of-the-art viral sequence-prediction methods, VirFinder and PPR-Meta, my proposed method outperformed VirFinder and had similar performance with PPR-Meta for short contigs with length less than 400 bp. VirMC outperformed VirFinder and PPR-Meta for identifying viral sequences in contaminated metagenomic samples with eukaryotic sequences. VirMC showed better performance in assembling viral-genome sequences from metagenomic data (based on filtering potential bacterial reads). Applying VirMC to human gut metagenomes from healthy subjects and patients with type-2 diabetes (T2D) revealed that viral contigs could help classify healthy and diseased statuses. This alignment-free method complements gene-based alignment approaches and will significantly improve the precision of viral sequence identification.
Collapse
Affiliation(s)
- Kai Song
- School of Mathematics and Statistics, Qingdao University, Qingdao China
| |
Collapse
|
5
|
Lu YY, Bai J, Wang Y, Wang Y, Sun F. CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase. Bioinformatics 2021; 37:155-161. [PMID: 32766810 DOI: 10.1093/bioinformatics/btaa699] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2019] [Revised: 03/11/2020] [Accepted: 07/28/2020] [Indexed: 01/02/2023] Open
Abstract
MOTIVATION Rapid developments in sequencing technologies have boosted generating high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. RESULTS We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With 102-104-fold reduction of storage space, CRAFT performs fast query for gigabytes of data within seconds or minutes, achieving comparable performance as six state-of-the-art alignment-free measures. AVAILABILITY AND IMPLEMENTATION CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yang Young Lu
- Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Jiaxing Bai
- Department of Automation, Xiamen University, Xiamen 361000, China
| | - Yiwen Wang
- Department of Automation, Xiamen University, Xiamen 361000, China
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen 361000, China.,Xiamen Key Lab. of Big Data Intelligent Analysis and Decision, Xiamen 361000, China
| | - Fengzhu Sun
- Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| |
Collapse
|
6
|
Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data. QUANTITATIVE BIOLOGY 2020; 8:143-154. [PMID: 34262790 DOI: 10.1007/s40484-020-0200-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
Abstract
Background Markov chains (MC) have been widely used to model molecular sequences. The estimations of MC transition matrix and confidence intervals of the transition probabilities from long sequence data have been intensively studied in the past decades. In next generation sequencing (NGS), a large amount of short reads are generated. These short reads can overlap and some regions of the genome may not be sequenced resulting in a new type of data. Based on NGS data, the transition probabilities of MC can be estimated by moment estimators. However, the classical asymptotic distribution theory for MC transition probability estimators based on long sequences is no longer valid. Methods In this study, we present the asymptotic distributions of several statistics related to MC based on NGS data. We show that, after scaling by the effective coverage d defined in a previous study by the authors, these statistics based on NGS data approximate to the same distributions as the corresponding statistics for long sequences. Results We apply the asymptotic properties of these statistics for finding the theoretical confidence regions for MC transition probabilities based on NGS short reads data. We validate our theoretical confidence intervals using both simulated data and real data sets, and compare the results with those by the parametric bootstrap method. Conclusions We find that the asymptotic distributions of these statistics and the theoretical confidence intervals of transition probabilities based on NGS data given in this study are highly accurate, providing a powerful tool for NGS data analysis.
Collapse
|
7
|
Song K, Ren J, Sun F. Reads Binning Improves Alignment-Free Metagenome Comparison. Front Genet 2019; 10:1156. [PMID: 31824565 PMCID: PMC6881972 DOI: 10.3389/fgene.2019.01156] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2019] [Accepted: 10/22/2019] [Indexed: 12/26/2022] Open
Abstract
Comparing metagenomic samples is a critical step in understanding the relationships among microbial communities. Recently, next-generation sequencing (NGS) technologies have produced a massive amount of short reads data for microbial communities from different environments. The assembly of these short reads can, however, be time-consuming and challenging. In addition, alignment-based methods for metagenome comparison are limited by incomplete genome and/or pathway databases. In contrast, alignment-free methods for metagenome comparison do not depend on the completeness of genome or pathway databases. Still, the existing alignment-free methods,d 2 S andd 2 * , which model k-tuple patterns using only one Markov chain for each sample, neglect the heterogeneity within metagenomic data wherein potentially thousands of types of microorganisms are sequenced. To address this imperfection ind 2 S andd 2 * , we organized NGS sequences into different reads bins and constructed several corresponding Markov models. Next, we modified the definition of our previous alignment-free methods,d 2 S andd 2 * , to make them more compatible with a scheme of analysis which uses the proposed reads bins. We then used two simulated and three real metagenomic datasets to test the effect of the k-tuple size and Markov orders of background sequences on the performance of these de novo alignment-free methods. For dependable comparison of metagenomic samples, our newly developed alignment-free methods with reads binning outperformed alignment-free methods without reads binning in detecting the relationship among microbial communities, including whether they form groups or change according to some environmental gradients.
Collapse
Affiliation(s)
- Kai Song
- School of Mathematics and Statistics, Qingdao University, Qingdao, China
| | - Jie Ren
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, United States
| | - Fengzhu Sun
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, United States
| |
Collapse
|
8
|
Chen S, Chen Y, Sun F, Waterman MS, Zhang X. A new statistic for efficient detection of repetitive sequences. Bioinformatics 2019; 35:4596-4606. [PMID: 30993316 DOI: 10.1093/bioinformatics/btz262] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2018] [Revised: 03/01/2019] [Accepted: 04/10/2019] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks. An efficient generic method for detecting most types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic D2R that can efficiently discriminate sequences with or without repetitive regions. RESULTS Using the statistic, we developed an algorithm of linear time and space complexity for detecting most types of repetitive sequences in multiple scenarios, including finding candidate clustered regularly interspaced short palindromic repeats regions from bacterial genomic or metagenomics sequences. Simulation and real data experiments show that the method works well on both assembled sequences and unassembled short reads. AVAILABILITY AND IMPLEMENTATION The codes are available at https://github.com/XuegongLab/D2R_codes under GPL 3.0 license. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Sijie Chen
- Department of Automation, MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist, Tsinghua University, Beijing 100084, China
| | - Yixin Chen
- Department of Automation, MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist, Tsinghua University, Beijing 100084, China
| | - Fengzhu Sun
- Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA.,Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
| | - Michael S Waterman
- Department of Automation, MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist, Tsinghua University, Beijing 100084, China.,Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA.,Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
| | - Xuegong Zhang
- Department of Automation, MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist, Tsinghua University, Beijing 100084, China.,School of Life Sciences, Tsinghua University, Beijing 100084, China
| |
Collapse
|
9
|
Lu YY, Tang K, Ren J, Fuhrman JA, Waterman MS, Sun F. CAFE: aCcelerated Alignment-FrEe sequence analysis. Nucleic Acids Res 2019; 45:W554-W559. [PMID: 28472388 PMCID: PMC5793812 DOI: 10.1093/nar/gkx351] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2017] [Accepted: 04/20/2017] [Indexed: 12/13/2022] Open
Abstract
Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{upgreek}
\usepackage{mathrsfs}
\setlength{\oddsidemargin}{-69pt}
\begin{document}
}{}$d_2^*$\end{document} and \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{upgreek}
\usepackage{mathrsfs}
\setlength{\oddsidemargin}{-69pt}
\begin{document}
}{}$d_2^S$\end{document} are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report a standalone software, aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28 alignment-free dissimilarity measures. CAFE allows for both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in a standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures, including dendrograms, heatmap, principal coordinate analysis and network display. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes, and is freely available at https://github.com/younglululu/CAFE.
Collapse
Affiliation(s)
- Yang Young Lu
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA
| | - Kujin Tang
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA
| | - Jie Ren
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA
| | - Jed A Fuhrman
- Department of Biological Sciences and Wrigley Institute for Environmental Studies, University of Southern California, Los Angeles, CA 90089, USA
| | - Michael S Waterman
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA.,Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, 200433 Shanghai, China
| | - Fengzhu Sun
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA.,Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, 200433 Shanghai, China
| |
Collapse
|
10
|
Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F. Alignment-Free Sequence Analysis and Applications. Annu Rev Biomed Data Sci 2018; 1:93-114. [PMID: 31828235 PMCID: PMC6905628 DOI: 10.1146/annurev-biodatasci-080917-013431] [Citation(s) in RCA: 58] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Genome and metagenome comparisons based on large amounts of next generation sequencing (NGS) data pose significant challenges for alignment-based approaches due to the huge data size and the relatively short length of the reads. Alignment-free approaches based on the counts of word patterns in NGS data do not depend on the complete genome and are generally computationally efficient. Thus, they contribute significantly to genome and metagenome comparison. Recently, novel statistical approaches have been developed for the comparison of both long and shotgun sequences. These approaches have been applied to many problems including the comparison of gene regulatory regions, genome sequences, metagenomes, binning contigs in metagenomic data, identification of virus-host interactions, and detection of horizontal gene transfers. We provide an updated review of these applications and other related developments of word-count based approaches for alignment-free sequence analysis.
Collapse
Affiliation(s)
- Jie Ren
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Xin Bai
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| | - Yang Young Lu
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Kujin Tang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen, Fujian, China
| | - Gesine Reinert
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Fengzhu Sun
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| |
Collapse
|
11
|
Bai X, Tang K, Ren J, Waterman M, Sun F. Optimal choice of word length when comparing two Markov sequences using a χ 2-statistic. BMC Genomics 2017; 18:732. [PMID: 28984181 PMCID: PMC5629589 DOI: 10.1186/s12864-017-4020-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open
Abstract
Background Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains and the likelihood ratio test or the corresponding approximate χ2-statistic has been suggested to compare two sequences. However, it is not known how to best choose the word length k in such studies. Results We develop an optimal strategy to choose k by maximizing the statistical power of detecting differences between two sequences. Let the orders of the Markov chains for the two sequences be r1 and r2, respectively. We show through both simulations and theoretical studies that the optimal k= max(r1,r2)+1 for both long sequences and next generation sequencing (NGS) read data. The orders of the Markov chains may be unknown and several methods have been developed to estimate the orders of Markov chains based on both long sequences and NGS reads. We study the power loss of the statistics when the estimated orders are used. It is shown that the power loss is minimal for some of the estimators of the orders of Markov chains. Conclusion Our studies provide guidelines on choosing the optimal word length for the comparison of Markov sequences. Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-4020-z) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Xin Bai
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| | - Kujin Tang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Jie Ren
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Michael Waterman
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China.,Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Fengzhu Sun
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China. .,Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA.
| |
Collapse
|
12
|
Antanaviciute A, Baquero-Perez B, Watson CM, Harrison SM, Lascelles C, Crinnion L, Markham AF, Bonthron DT, Whitehouse A, Carr IM. m6aViewer: software for the detection, analysis, and visualization of N6-methyladenosine peaks from m 6A-seq/ME-RIP sequencing data. RNA (NEW YORK, N.Y.) 2017; 23:1493-1501. [PMID: 28724534 PMCID: PMC5602108 DOI: 10.1261/rna.058206.116] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/06/2016] [Accepted: 07/07/2017] [Indexed: 05/29/2023]
Abstract
Recent methods for transcriptome-wide N6-methyladenosine (m6A) profiling have facilitated investigations into the RNA methylome and established m6A as a dynamic modification that has critical regulatory roles in gene expression and may play a role in human disease. However, bioinformatics resources available for the analysis of m6A sequencing data are still limited. Here, we describe m6aViewer-a cross-platform application for analysis and visualization of m6A peaks from sequencing data. m6aViewer implements a novel m6A peak-calling algorithm that identifies high-confidence methylated residues with more precision than previously described approaches. The application enables data analysis through a graphical user interface, and thus, in contrast to other currently available tools, does not require the user to be skilled in computer programming. m6aViewer and test data can be downloaded here: http://dna2.leeds.ac.uk/m6a.
Collapse
Affiliation(s)
- Agne Antanaviciute
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital, Leeds LS9 7TF, United Kingdom
| | - Belinda Baquero-Perez
- School of Molecular and Cellular Biology, Faculty of Biological Sciences and Astbury Centre of Structural Molecular Biology, University of Leeds, Leeds LS2 9JT, United Kingdom
| | - Christopher M Watson
- Yorkshire Regional Genetics Service, St James's University Hospital, Leeds LS9 7TF, United Kingdom
| | - Sally M Harrison
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital, Leeds LS9 7TF, United Kingdom
| | - Carolina Lascelles
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital, Leeds LS9 7TF, United Kingdom
| | - Laura Crinnion
- Yorkshire Regional Genetics Service, St James's University Hospital, Leeds LS9 7TF, United Kingdom
| | - Alexander F Markham
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital, Leeds LS9 7TF, United Kingdom
| | - David T Bonthron
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital, Leeds LS9 7TF, United Kingdom
| | - Adrian Whitehouse
- School of Molecular and Cellular Biology, Faculty of Biological Sciences and Astbury Centre of Structural Molecular Biology, University of Leeds, Leeds LS2 9JT, United Kingdom
| | - Ian M Carr
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital, Leeds LS9 7TF, United Kingdom
| |
Collapse
|
13
|
Ahlgren NA, Ren J, Lu YY, Fuhrman JA, Sun F. Alignment-free $d_2^*$ oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res 2016; 45:39-53. [PMID: 27899557 PMCID: PMC5224470 DOI: 10.1093/nar/gkw1002] [Citation(s) in RCA: 167] [Impact Index Per Article: 20.9] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2016] [Accepted: 10/31/2016] [Indexed: 01/17/2023] Open
Abstract
Viruses and their host genomes often share similar oligonucleotide frequency (ONF) patterns, which can be used to predict the host of a given virus by finding the host with the greatest ONF similarity. We comprehensively compared 11 ONF metrics using several k-mer lengths for predicting host taxonomy from among ∼32 000 prokaryotic genomes for 1427 virus isolate genomes whose true hosts are known. The background-subtracting measure [Formula: see text] at k = 6 gave the highest host prediction accuracy (33%, genus level) with reasonable computational times. Requiring a maximum dissimilarity score for making predictions (thresholding) and taking the consensus of the 30 most similar hosts further improved accuracy. Using a previous dataset of 820 bacteriophage and 2699 bacterial genomes, [Formula: see text] host prediction accuracies with thresholding and consensus methods (genus-level: 64%) exceeded previous Euclidian distance ONF (32%) or homology-based (22-62%) methods. When applied to metagenomically-assembled marine SUP05 viruses and the human gut virus crAssphage, [Formula: see text]-based predictions overlapped (i.e. some same, some different) with the previously inferred hosts of these viruses. The extent of overlap improved when only using host genomes or metagenomic contigs from the same habitat or samples as the query viruses. The [Formula: see text] ONF method will greatly improve the characterization of novel, metagenomic viruses.
Collapse
Affiliation(s)
- Nathan A Ahlgren
- Department of Biological Sciences, University of Southern California, 3616 Trousdale Pkwy Los, Angeles, CA 90089, USA
| | - Jie Ren
- Molecular and Computational Biology Program, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, USA
| | - Yang Young Lu
- Molecular and Computational Biology Program, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, USA
| | - Jed A Fuhrman
- Department of Biological Sciences, University of Southern California, 3616 Trousdale Pkwy Los, Angeles, CA 90089, USA
| | - Fengzhu Sun
- Department of Biological Sciences, University of Southern California, 3616 Trousdale Pkwy Los, Angeles, CA 90089, USA.,Molecular and Computational Biology Program, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, USA.,Center for Computational Systems Biology, Fudan University, Shanghai 200433, China
| |
Collapse
|
14
|
Ren J, Song K, Deng M, Reinert G, Cannon CH, Sun F. Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics. Bioinformatics 2016; 32:993-1000. [PMID: 26130573 PMCID: PMC6169497 DOI: 10.1093/bioinformatics/btv395] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2015] [Revised: 03/11/2015] [Accepted: 06/25/2015] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Next-generation sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach, the underlying expected word counts are essential.A plausible model for this underlying distribution of word counts is given through modeling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS short read data. RESULTS Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution ,: using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate those using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results ,: and that the clustering results that use a N: MC of the estimated order give a plausible clustering of the species. AVAILABILITY AND IMPLEMENTATION Our implementation of the statistics developed here is available as R package 'NGS.MC' at http://www-rcf.usc.edu/∼fsun/Programs/NGS-MC/NGS-MC.html CONTACT fsun@usc.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jie Ren
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA, USA
| | - Kai Song
- School of Mathematical Sciences, Peking University, Beijing, China
| | - Minghua Deng
- School of Mathematical Sciences, Peking University, Beijing, China
| | - Gesine Reinert
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
| | - Charles H Cannon
- Department of Biological Sciences, Texas Tech University, TX 79409-3131, USA, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Yunnan, China and
| | - Fengzhu Sun
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA, USA, Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| |
Collapse
|
15
|
Yang WF, Yu ZG, Anh V. Whole genome/proteome based phylogeny reconstruction for prokaryotes using higher order Markov model and chaos game representation. Mol Phylogenet Evol 2015; 96:102-111. [PMID: 26724405 DOI: 10.1016/j.ympev.2015.12.011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2015] [Revised: 12/17/2015] [Accepted: 12/18/2015] [Indexed: 01/18/2023]
Abstract
UNLABELLED Traditional methods for sequence comparison and phylogeny reconstruction rely on pair wise and multiple sequence alignments. But alignment could not be directly applied to whole genome/proteome comparison and phylogenomic studies due to their high computational complexity. Hence alignment-free methods became popular in recent years. Here we propose a fast alignment-free method for whole genome/proteome comparison and phylogeny reconstruction using higher order Markov model and chaos game representation. In the present method, we use the transition matrices of higher order Markov models to characterize amino acid or DNA sequences for their comparison. The order of the Markov model is uniquely identified by maximizing the average Shannon entropy of conditional probability distributions. Using one-dimensional chaos game representation and linked list, this method can reduce large memory and time consumption which is due to the large-scale conditional probability distributions. To illustrate the effectiveness of our method, we employ it for fast phylogeny reconstruction based on genome/proteome sequences of two species data sets used in previous published papers. Our results demonstrate that the present method is useful and efficient. AVAILABILITY AND IMPLEMENTATION The source codes for our algorithm to get the distance matrix and genome/proteome sequences can be downloaded from ftp://121.199.20.25/. The software Phylip and EvolView we used to construct phylogenetic trees can be referred from their websites.
Collapse
Affiliation(s)
- Wei-Feng Yang
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; Department of Mathematics and Physics, Hunan Institute of Engineering, Hunan 411104, PR China.
| | - Zu-Guo Yu
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
| | - Vo Anh
- School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
| |
Collapse
|
16
|
Struck D, Lawyer G, Ternes AM, Schmit JC, Bercoff DP. COMET: adaptive context-based modeling for ultrafast HIV-1 subtype identification. Nucleic Acids Res 2014; 42:e144. [PMID: 25120265 PMCID: PMC4191385 DOI: 10.1093/nar/gku739] [Citation(s) in RCA: 255] [Impact Index Per Article: 25.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Viral sequence classification has wide applications in clinical, epidemiological, structural and functional categorization studies. Most existing approaches rely on an initial alignment step followed by classification based on phylogenetic or statistical algorithms. Here we present an ultrafast alignment-free subtyping tool for human immunodeficiency virus type one (HIV-1) adapted from Prediction by Partial Matching compression. This tool, named COMET, was compared to the widely used phylogeny-based REGA and SCUEAL tools using synthetic and clinical HIV data sets (1,090,698 and 10,625 sequences, respectively). COMET's sensitivity and specificity were comparable to or higher than the two other subtyping tools on both data sets for known subtypes. COMET also excelled in detecting and identifying new recombinant forms, a frequent feature of the HIV epidemic. Runtime comparisons showed that COMET was almost as fast as USEARCH. This study demonstrates the advantages of alignment-free classification of viral sequences, which feature high rates of variation, recombination and insertions/deletions. COMET is free to use via an online interface.
Collapse
Affiliation(s)
- Daniel Struck
- Laboratory of Retrovirology, CRP-Santé, 84, Val Fleuri, L-1526, Luxembourg
| | - Glenn Lawyer
- Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Campus E1 4, 66123 Saarbrücken, Germany
| | - Anne-Marie Ternes
- Laboratory of Retrovirology, CRP-Santé, 84, Val Fleuri, L-1526, Luxembourg
| | - Jean-Claude Schmit
- Laboratory of Retrovirology, CRP-Santé, 84, Val Fleuri, L-1526, Luxembourg
| | | |
Collapse
|
17
|
Abstract
Transcription factor binding sites (TFBSs) on the DNA are generally accepted as the key nodes of gene control. However, the multitudes of TFBSs identified in genome-wide studies, some of them seemingly unconstrained in evolution, have prompted the view that in many cases TF binding may serve no biological function. Yet, insights from transcriptional biochemistry, population genetics and functional genomics suggest that rather than segregating into 'functional' or 'non-functional', TFBS inputs to their target genes may be generally cumulative, with varying degrees of potency and redundancy. As TFBS redundancy can be diminished by mutations and environmental stress, some of the apparently 'spurious' sites may turn out to be important for maintaining adequate transcriptional regulation under these conditions. This has significant implications for interpreting the phenotypic effects of TFBS mutations, particularly in the context of genome-wide association studies for complex traits.
Collapse
|
18
|
Strelioff CC, Crutchfield JP. Bayesian structural inference for hidden processes. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2014; 89:042119. [PMID: 24827205 DOI: 10.1103/physreve.89.042119] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/10/2013] [Indexed: 06/03/2023]
Abstract
We introduce a Bayesian approach to discovering patterns in structurally complex processes. The proposed method of Bayesian structural inference (BSI) relies on a set of candidate unifilar hidden Markov model (uHMM) topologies for inference of process structure from a data series. We employ a recently developed exact enumeration of topological ε-machines. (A sequel then removes the topological restriction.) This subset of the uHMM topologies has the added benefit that inferred models are guaranteed to be ε-machines, irrespective of estimated transition probabilities. Properties of ε-machines and uHMMs allow for the derivation of analytic expressions for estimating transition probabilities, inferring start states, and comparing the posterior probability of candidate model topologies, despite process internal structure being only indirectly present in data. We demonstrate BSI's effectiveness in estimating a process's randomness, as reflected by the Shannon entropy rate, and its structure, as quantified by the statistical complexity. We also compare using the posterior distribution over candidate models and the single, maximum a posteriori model for point estimation and show that the former more accurately reflects uncertainty in estimated values. We apply BSI to in-class examples of finite- and infinite-order Markov processes, as well to an out-of-class, infinite-state hidden process.
Collapse
Affiliation(s)
- Christopher C Strelioff
- Complexity Sciences Center and Physics Department, University of California at Davis, One Shields Avenue, Davis, California 95616, USA
| | - James P Crutchfield
- Complexity Sciences Center and Physics Department, University of California at Davis, One Shields Avenue, Davis, California 95616, USA and Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501, USA
| |
Collapse
|