1
Ball RL, Bogue MA, Liang H, Srivastava A, Ashbrook DG, Lamoureux A, Gerring MW, Hatoum AS, Kim MJ, He H, Emerson J, Berger AK, Walton DO, Sheppard K, El Kassaby B, Castellanos F, Kunde-Ramamoorthy G, Lu L, Bluis J, Desai S, Sundberg BA, Peltz G, Fang Z, Churchill GA, Williams RW, Agrawal A, Bult CJ, Philip VM, Chesler EJ. GenomeMUSter mouse genetic variation service enables multitrait, multipopulation data integration and analysis. Genome Res 2024; 34:145-159. [PMID: 38290977 PMCID: PMC10903950 DOI: 10.1101/gr.278157.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Accepted: 01/10/2024] [Indexed: 02/01/2024]
Abstract
Hundreds of inbred mouse strains and intercross populations have been used to characterize the function of genetic variants that contribute to disease. Thousands of disease-relevant traits have been characterized in mice and made publicly available. New strains and populations including consomics, the collaborative cross, expanded BXD, and inbred wild-derived strains add to existing complex disease mouse models, mapping populations, and sensitized backgrounds for engineered mutations. The genome sequences of inbred strains, along with dense genotypes from others, enable integrated analysis of trait-variant associations across populations, but these analyses are hampered by the sparsity of genotypes available. Moreover, the data are not readily interoperable with other resources. To address these limitations, we created a uniformly dense variant resource by harmonizing multiple data sets. Missing genotypes were imputed using the Viterbi algorithm with a data-driven technique that incorporates local phylogenetic information, an approach that is extendable to other model organisms. The result is a web- and programmatically accessible data service called GenomeMUSter, comprising single-nucleotide variants covering 657 strains at 106.8 million segregating sites. Interoperation with phenotype databases, analytic tools, and other resources enables a wealth of applications, including multitrait, multipopulation meta-analysis. We show this in cross-species comparisons of type 2 diabetes and substance use disorder meta-analyses, leveraging mouse data to characterize the likely role of human variant effects in disease. Other applications include refinement of mapped loci and prioritization of strain backgrounds for disease modeling to further unlock extant mouse diversity for genetic and genomic studies in health and disease.
Affiliation(s)
- Robyn L Ball
  - The Jackson Laboratory, Bar Harbor, Maine 04609, USA
- Molly A Bogue
  - The Jackson Laboratory, Bar Harbor, Maine 04609, USA
- Anuj Srivastava
  - The Jackson Laboratory for Genomic Medicine, Farmington, Connecticut 06032, USA
- David G Ashbrook
  - University of Tennessee Health Science Center, Memphis, Tennessee 38163, USA
- Alexander S Hatoum
  - Psychological and Brain Sciences, Washington University in St. Louis, St. Louis, Missouri 63130, USA
  - Artificial Intelligence and the Internet of Things Institute, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Matthew J Kim
  - University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
- Hao He
  - The Jackson Laboratory, Bar Harbor, Maine 04609, USA
- Jake Emerson
  - The Jackson Laboratory, Bar Harbor, Maine 04609, USA
- Lu Lu
  - University of Tennessee Health Science Center, Memphis, Tennessee 38163, USA
- John Bluis
  - The Jackson Laboratory, Bar Harbor, Maine 04609, USA
- Sejal Desai
  - The Jackson Laboratory, Bar Harbor, Maine 04609, USA
- Gary Peltz
  - Department of Anesthesia, Pain and Perioperative Medicine, Stanford University School of Medicine, Stanford, California 94305, USA
- Zhuoqing Fang
  - Department of Anesthesia, Pain and Perioperative Medicine, Stanford University School of Medicine, Stanford, California 94305, USA
- Robert W Williams
  - University of Tennessee Health Science Center, Memphis, Tennessee 38163, USA
- Arpana Agrawal
  - Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Carol J Bult
  - The Jackson Laboratory, Bar Harbor, Maine 04609, USA
2
Ball RL, Bogue MA, Liang H, Srivastava A, Ashbrook DG, Lamoureux A, Gerring MW, Hatoum AS, Kim M, He H, Emerson J, Berger AK, Walton DO, Sheppard K, Kassaby BE, Castellanos F, Kunde-Ramamoorthy G, Lu L, Bluis J, Desai S, Sundberg BA, Peltz G, Fang Z, Churchill GA, Williams RW, Agrawal A, Bult CJ, Philip VM, Chesler EJ. GenomeMUSter mouse genetic variation service enables multi-trait, multi-population data integration and analyses. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.08.552506. [PMID: 37609331 PMCID: PMC10441370 DOI: 10.1101/2023.08.08.552506] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/24/2023]
Abstract
Hundreds of inbred laboratory mouse strains and intercross populations have been used to functionalize genetic variants that contribute to disease. Thousands of disease-relevant traits have been characterized in mice and made publicly available. New strains and populations including the Collaborative Cross, expanded BXD and inbred wild-derived strains add to the set of complex disease mouse models, genetic mapping resources and sensitized backgrounds against which to evaluate engineered mutations. The genome sequences of many inbred strains, along with dense genotypes from others, could allow integrated analysis of trait-variant associations across populations, but these analyses are not feasible due to the sparsity of genotypes available. Moreover, the data are not readily interoperable with other resources. To address these limitations, we created a uniformly dense data resource by harmonizing multiple variant datasets. Missing genotypes were imputed using the Viterbi algorithm with a data-driven technique that incorporates local phylogenetic information, an approach that is extensible to other model organism species. The result is a web- and programmatically accessible data service called GenomeMUSter (https://muster.jax.org), comprising allelic data covering 657 strains at 106.8M segregating sites. Interoperation with phenotype databases, analytic tools and other resources enables a wealth of applications including multi-trait, multi-population meta-analysis. We demonstrate this in a cross-species comparison of the meta-analysis of Type 2 Diabetes and of substance use disorders, resulting in the more specific characterization of the role of human variant effects in light of mouse phenotype data. Other applications include refinement of mapped loci and prioritization of strain backgrounds for disease modeling to further unlock extant mouse diversity for genetic and genomic studies in health and disease.
3
Genome Mining of Pseudomonas Species: Diversity and Evolution of Metabolic and Biosynthetic Potential. Molecules 2021; 26:molecules26247524. [PMID: 34946606 PMCID: PMC8704066 DOI: 10.3390/molecules26247524] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/10/2021] [Revised: 12/07/2021] [Accepted: 12/08/2021] [Indexed: 11/17/2022]
Abstract
Microbial genome sequencing has uncovered a myriad of natural products (NPs) that have yet to be explored. Bacteria in the genus Pseudomonas serve as pathogens, plant growth promoters, and therapeutically, industrially, and environmentally important microorganisms. Though most species of Pseudomonas have a large number of NP biosynthetic gene clusters (BGCs) in their genomes, it is difficult to link many of these BGCs with products under current laboratory conditions. To gain new insights into the diversity, distribution, and evolution of these BGCs in Pseudomonas for the discovery of unexplored NPs, we applied several bioinformatic programming approaches to characterize BGCs from Pseudomonas reference genome sequences available in public databases, along with phylogenetic and genomic comparison. Our research revealed that most BGCs in the genomes of Pseudomonas species have a high diversity for NPs at the species and subspecies levels, and it established the correlation of species with BGC taxonomic ranges. These data will pave the way for the algorithmic detection of species- and subspecies-specific pathways for NP development.
4
AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models. ENTROPY 2021; 23:e23050530. [PMID: 33925812 PMCID: PMC8146440 DOI: 10.3390/e23050530] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/01/2021] [Revised: 04/19/2021] [Accepted: 04/22/2021] [Indexed: 12/28/2022]
Abstract
Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, in the literature, the number of specific protein sequence compressors is relatively low. Moreover, these specialized compressors only marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and individual cache-hash memory models to the highest-context orders. Compared to the previous compressor (AC), we show gains of 2–9% and 6–7% in reference-free and reference-based modes, respectively. These gains come at the cost of three times slower computations. AC2 also improves memory usage against AC, with requirements about seven times lower, without being affected by the sequences’ input size. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under the GPLv3 license.
5
Abstract
In 1981, the Journal of Molecular Evolution (JME) published an article entitled "Evolutionary trees from DNA sequences: A maximum likelihood approach" by Joseph (Joe) Felsenstein (J Mol Evol 17:368-376, 1981). This groundbreaking work laid the foundation for the emerging field of statistical phylogenetics, providing a tractable way of finding maximum likelihood (ML) estimates of evolutionary trees from DNA sequence data. This paper is the second most cited (more than 9000 citations) in JME after Kimura's (J Mol Evol 16:111-120, 1980) seminal paper on a model of nucleotide substitution (with nearly 20,000 citations). On the occasion of the 50th anniversary of JME, we elaborate on the significance of Felsenstein's ML approach to estimating phylogenetic trees.
Affiliation(s)
- David Posada
  - CINBIO, Universidade de Vigo, 36310, Vigo, Spain.
  - Department of Biochemistry, Genetics, and Immunology, Universidade de Vigo, 36310, Vigo, Spain.
  - Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Spain.
- Keith A Crandall
  - Computational Biology Institute and Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA.
  - Department of Biostatistics & Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA.
6
Weiß CH. Regime-Switching Discrete ARMA Models for Categorical Time Series. ENTROPY 2020; 22:e22040458. [PMID: 33286232 PMCID: PMC7516940 DOI: 10.3390/e22040458] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/18/2020] [Revised: 04/14/2020] [Accepted: 04/16/2020] [Indexed: 11/27/2022]
Abstract
For the modeling of categorical time series, both nominal and ordinal, an extension of the basic discrete autoregressive moving-average (ARMA) models is proposed. It uses an observation-driven regime-switching mechanism, leading to the family of RS-DARMA models. After having discussed the stochastic properties of RS-DARMA models in general, we focus on the particular case of the first-order RS-DAR model. This RS-DAR(1) model constitutes a parsimoniously parameterized type of Markov chain, which has an easy-to-interpret data-generating mechanism and may also handle negative forms of serial dependence. Approaches for model fitting are elaborated on, and they are illustrated by two real-data examples: the modeling of a nominal sequence from biology, and of an ordinal time series regarding cloudiness. For future research, one might use the RS-DAR(1) model for constructing parsimonious advanced models, and one might adapt techniques for smoother regime transitions.
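The data-generating mechanism of a first-order discrete AR process can be sketched as follows, for illustration only. In the basic DAR(1) model, each observation either repeats the previous category (with probability phi) or is drawn afresh from the marginal distribution. The `phi_of` callable, which lets the dependence parameter vary with the previous observation, is a hypothetical stand-in for an observation-driven regime switch. The abstract does not specify the RS-DAR(1) mechanism, so this is an assumption and not the author's model; all names are illustrative.

```python
import random


def simulate_dar1(n, categories, marginal, phi_of, seed=0):
    """Simulate n observations of a DAR(1)-type categorical series.

    phi_of(prev) returns the probability of repeating the previous
    category; a constant function gives the basic DAR(1) model.
    Letting it depend on prev is an ASSUMED form of regime switching.
    """
    rng = random.Random(seed)
    x = rng.choices(categories, weights=marginal)[0]
    series = [x]
    for _ in range(n - 1):
        # With probability 1 - phi, replace the value by a fresh draw
        # from the marginal distribution; otherwise keep the old value.
        if rng.random() >= phi_of(x):
            x = rng.choices(categories, weights=marginal)[0]
        series.append(x)
    return series


# Basic DAR(1): constant dependence parameter.
basic = simulate_dar1(20, ["a", "c", "g", "t"], [0.3, 0.2, 0.2, 0.3],
                      lambda prev: 0.6)

# Hypothetical two-regime variant: strong persistence after "a", weak otherwise.
switching = simulate_dar1(20, ["a", "c", "g", "t"], [0.3, 0.2, 0.2, 0.3],
                          lambda prev: 0.9 if prev == "a" else 0.1)
```

With a constant `phi_of` the series is an ordinary Markov chain whose marginal distribution is `marginal`; the regime-dependent variant only illustrates how the previous observation could select the dependence strength.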
Affiliation(s)
- Christian H Weiß
  - Department of Mathematics and Statistics, Helmut Schmidt University, 22043 Hamburg, Germany
7
Tang M, Hasan MS, Zhu H, Zhang L, Wu X. vi-HMM: a novel HMM-based method for sequence variant identification in short-read data. Hum Genomics 2019; 13:9. [PMID: 30795817 PMCID: PMC6387560 DOI: 10.1186/s40246-019-0194-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/02/2018] [Accepted: 01/29/2019] [Indexed: 12/30/2022]
Abstract
Background: Accurate and reliable identification of sequence variants, including single nucleotide polymorphisms (SNPs) and insertion-deletion polymorphisms (INDELs), plays a fundamental role in next-generation sequencing (NGS) applications. Existing methods for calling these variants often make simplified assumptions of positional independence and fail to leverage the dependence between genotypes at nearby loci that is caused by linkage disequilibrium (LD). Results and conclusion: We propose vi-HMM, a hidden Markov model (HMM)-based method for calling SNPs and INDELs in mapped short-read data. This method allows transitions between hidden states (defined as “SNP,” “Ins,” “Del,” and “Match”) of adjacent genomic bases and determines an optimal hidden state path by using the Viterbi algorithm. The inferred hidden state path provides a direct solution to the identification of SNPs and INDELs. Simulation studies show that, under various sequencing depths, vi-HMM outperforms commonly used variant calling methods in terms of sensitivity and F1 score. When applied to the real data, vi-HMM demonstrates higher accuracy in calling SNPs and INDELs.
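Viterbi decoding over the four hidden states named in the abstract can be sketched as below. This is a toy illustration under stated assumptions, not the vi-HMM model: the observation symbols (`ref`, `alt`, `extra`, `gap`) and every probability are invented for the example, whereas the published method derives its emissions from mapped reads.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path (log-space Viterbi)."""
    # best[t][s]: log-probability of the best path that ends in s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy parameters (all values are assumptions for illustration).
states = ["Match", "SNP", "Ins", "Del"]
start_p = {"Match": 0.97, "SNP": 0.01, "Ins": 0.01, "Del": 0.01}
trans_p = {
    "Match": {"Match": 0.97, "SNP": 0.01, "Ins": 0.01, "Del": 0.01},
    "SNP":   {"Match": 0.94, "SNP": 0.02, "Ins": 0.02, "Del": 0.02},
    "Ins":   {"Match": 0.70, "SNP": 0.02, "Ins": 0.26, "Del": 0.02},
    "Del":   {"Match": 0.70, "SNP": 0.02, "Ins": 0.02, "Del": 0.26},
}
emit_p = {
    "Match": {"ref": 0.997, "alt": 0.001, "extra": 0.001, "gap": 0.001},
    "SNP":   {"ref": 0.05, "alt": 0.85, "extra": 0.05, "gap": 0.05},
    "Ins":   {"ref": 0.05, "alt": 0.05, "extra": 0.85, "gap": 0.05},
    "Del":   {"ref": 0.05, "alt": 0.05, "extra": 0.05, "gap": 0.85},
}

obs = ["ref", "ref", "alt", "ref", "gap", "gap", "ref"]
path = viterbi(obs, states, start_p, trans_p, emit_p)
# path == ["Match", "Match", "SNP", "Match", "Del", "Del", "Match"]
```

Because the two consecutive `gap` symbols are decoded jointly, the path stays in `Del` for both positions, which is the kind of dependence between adjacent bases that a position-independent caller would not use.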
Affiliation(s)
- Man Tang
  - Department of Statistics, Virginia Tech, 250 Drillfield Drive, Blacksburg, 24061, VA, USA
- Mohammad Shabbir Hasan
  - Department of Computer Science, Virginia Tech, 225 Stanger Street, Blacksburg, 24060, VA, USA
- Hongxiao Zhu
  - Department of Statistics, Virginia Tech, 250 Drillfield Drive, Blacksburg, 24061, VA, USA
- Liqing Zhang
  - Department of Computer Science, Virginia Tech, 225 Stanger Street, Blacksburg, 24060, VA, USA
- Xiaowei Wu
  - Department of Statistics, Virginia Tech, 250 Drillfield Drive, Blacksburg, 24061, VA, USA
8
A method of data mining using Hidden Markov Models (HMMs) for protein secondary structure prediction. Procedia Computer Science 2018. [DOI: 10.1016/j.procs.2018.01.096] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 11/29/2022]
9
Totterdell JA, Nur D, Mengersen KL. Bayesian hidden Markov models in DNA sequence segmentation using R: the case of Simian Vacuolating virus (SV40). J STAT COMPUT SIM 2017. [DOI: 10.1080/00949655.2017.1344666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Darfiana Nur
  - School of Computer Science, Engineering and Mathematics, Flinders University, Tonsley, SA, Australia
- Kerrie L. Mengersen
  - School of Mathematical Sciences, Queensland University of Technology and The Australian Research Council (ARC) Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS), Brisbane, QLD, Australia
10
Kim J, Lee C. Stochastic service life cycle analysis using customer reviews. SERVICE INDUSTRIES JOURNAL 2017. [DOI: 10.1080/02642069.2017.1316379] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/19/2022]
Affiliation(s)
- Juram Kim
  - School of Management Engineering, Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea
- Changyong Lee
  - School of Management Engineering, Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea
11
Lee C, Kim J, Noh M, Woo HG, Gang K. Patterns of technology life cycles: stochastic analysis based on patent citations. TECHNOLOGY ANALYSIS & STRATEGIC MANAGEMENT 2016. [DOI: 10.1080/09537325.2016.1194974] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 10/21/2022]
12
Yang WF, Yu ZG, Anh V. Whole genome/proteome based phylogeny reconstruction for prokaryotes using higher order Markov model and chaos game representation. Mol Phylogenet Evol 2015; 96:102-111. [PMID: 26724405 DOI: 10.1016/j.ympev.2015.12.011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 05/02/2015] [Revised: 12/17/2015] [Accepted: 12/18/2015] [Indexed: 01/18/2023]
Abstract
Traditional methods for sequence comparison and phylogeny reconstruction rely on pairwise and multiple sequence alignments. However, alignment cannot be directly applied to whole genome/proteome comparison and phylogenomic studies because of its high computational complexity. Hence alignment-free methods have become popular in recent years. Here we propose a fast alignment-free method for whole genome/proteome comparison and phylogeny reconstruction using a higher order Markov model and chaos game representation. In the present method, we use the transition matrices of higher order Markov models to characterize amino acid or DNA sequences for their comparison. The order of the Markov model is uniquely identified by maximizing the average Shannon entropy of conditional probability distributions. Using one-dimensional chaos game representation and a linked list, this method can reduce the large memory and time consumption caused by the large-scale conditional probability distributions. To illustrate the effectiveness of our method, we employ it for fast phylogeny reconstruction based on genome/proteome sequences of two species data sets used in previously published papers. Our results demonstrate that the present method is useful and efficient. Availability and implementation: The source codes for our algorithm to get the distance matrix and genome/proteome sequences can be downloaded from ftp://121.199.20.25/. The software Phylip and EvolView that we used to construct phylogenetic trees can be obtained from their websites.
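The two ingredients named in the abstract, a k-th order transition matrix and the average Shannon entropy of its conditional distributions, can be sketched as below. This is a minimal sketch under stated assumptions: the entropy is averaged with equal weight over the contexts observed in the sequence, which may differ from the paper's definition, and the chaos game representation and linked-list storage are not reproduced. Function names are illustrative.

```python
from collections import Counter, defaultdict
import math


def transition_probs(seq, k):
    """Estimate P(next symbol | previous k symbols) by counting k-mers."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        counts[seq[i:i + k]][seq[i + k]] += 1
    return {
        ctx: {sym: n / sum(nxt.values()) for sym, n in nxt.items()}
        for ctx, nxt in counts.items()
    }


def average_entropy(probs):
    """Mean Shannon entropy (bits) over observed contexts (ASSUMED weighting)."""
    entropies = [
        -sum(p * math.log2(p) for p in dist.values())
        for dist in probs.values()
    ]
    return sum(entropies) / len(entropies)


def choose_order(seq, max_k):
    """Pick the order in 1..max_k with the largest average entropy.

    Requires len(seq) > max_k so that every order has at least one context.
    """
    return max(range(1, max_k + 1),
               key=lambda k: average_entropy(transition_probs(seq, k)))


probs = transition_probs("AAB", 1)
# probs == {"A": {"A": 0.5, "B": 0.5}}, so average_entropy(probs) == 1.0
```

Two sequences could then be compared through a distance between their transition matrices at the chosen order; the particular distance used in the paper is not stated in the abstract.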
Affiliation(s)
- Wei-Feng Yang
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China
  - Department of Mathematics and Physics, Hunan Institute of Engineering, Hunan 411104, PR China.
- Zu-Guo Yu
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China
  - School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
- Vo Anh
  - School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
13
Dell'Acqua M, Gatti DM, Pea G, Cattonaro F, Coppens F, Magris G, Hlaing AL, Aung HH, Nelissen H, Baute J, Frascaroli E, Churchill GA, Inzé D, Morgante M, Pè ME. Genetic properties of the MAGIC maize population: a new platform for high definition QTL mapping in Zea mays. Genome Biol 2015; 16:167. [PMID: 26357913 PMCID: PMC4566846 DOI: 10.1186/s13059-015-0716-z] [Citation(s) in RCA: 138] [Impact Index Per Article: 15.3] [Received: 02/17/2015] [Accepted: 07/03/2015] [Indexed: 02/07/2023]
Abstract
Background: Maize (Zea mays) is a globally produced crop with broad genetic and phenotypic variation. New tools that improve our understanding of the genetic basis of quantitative traits are needed to guide predictive crop breeding. We have produced the first balanced multi-parental population in maize, a tool that provides high diversity and dense recombination events to allow routine quantitative trait loci (QTL) mapping in maize. Results: We produced 1,636 MAGIC maize recombinant inbred lines derived from eight genetically diverse founder lines. The characterization of 529 MAGIC maize lines shows that the population is a balanced, evenly differentiated mosaic of the eight founders, with mapping power and resolution strengthened by high minor allele frequencies and a fast decay of linkage disequilibrium. We show how MAGIC maize may find strong candidate genes by incorporating genome sequencing and transcriptomics data. We discuss three QTL for grain yield and three for flowering time, reporting candidate genes. Power simulations show that subsets of MAGIC maize might achieve high-power and high-definition QTL mapping. Conclusions: We demonstrate MAGIC maize's value in identifying the genetic bases of complex traits of agronomic relevance. The design of MAGIC maize allows the accumulation of sequencing and transcriptomics layers to guide the identification of candidate genes for a number of maize traits at different developmental stages. The characterization of the full MAGIC maize population will lead to higher power and definition in QTL mapping, and lay the basis for improved understanding of maize phenotypes, heterosis included. MAGIC maize is available to researchers.
Affiliation(s)
- Matteo Dell'Acqua
  - Institute of Life Sciences, Scuola Superiore Sant'Anna, Pisa, Italy.
- Giorgio Pea
  - Institute of Life Sciences, Scuola Superiore Sant'Anna, Pisa, Italy.
  - Current address: Thermo Fisher Scientific, Via G.B Tiepolo 18, 20900, Monza, MB, Italy.
- Frederik Coppens
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Gent, Belgium.
- Gabriele Magris
  - Institute of Applied Genomics, Udine, Italy.
  - Department of Agricultural and Environmental Sciences, University of Udine, Udine, Italy.
- Aye L Hlaing
  - Institute of Life Sciences, Scuola Superiore Sant'Anna, Pisa, Italy.
  - Current address: Department of Agricultural Research, Nay Pyi Taw, Myanmar.
- Htay H Aung
  - Institute of Life Sciences, Scuola Superiore Sant'Anna, Pisa, Italy.
  - Current address: Plant Biotechnology Center, Yangon, Myanmar.
- Hilde Nelissen
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Gent, Belgium.
  - Department of Plant Systems Biology, VIB, Gent, Belgium.
- Joke Baute
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Gent, Belgium.
  - Department of Plant Systems Biology, VIB, Gent, Belgium.
- Dirk Inzé
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Gent, Belgium.
  - Department of Plant Systems Biology, VIB, Gent, Belgium.
- Michele Morgante
  - Institute of Applied Genomics, Udine, Italy.
  - Department of Agricultural and Environmental Sciences, University of Udine, Udine, Italy.
- Mario Enrico Pè
  - Institute of Life Sciences, Scuola Superiore Sant'Anna, Pisa, Italy.
14
15
González-Tortuero E, Rusek J, Petrusek A, Gießler S, Lyras D, Grath S, Castro-Monzón F, Wolinska J. The Quantification of Representative Sequences pipeline for amplicon sequencing: case study on within-population ITS1 sequence variation in a microparasite infecting Daphnia. Mol Ecol Resour 2015; 15:1385-95. [PMID: 25728529 DOI: 10.1111/1755-0998.12396] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 06/16/2014] [Revised: 02/16/2015] [Accepted: 02/19/2015] [Indexed: 11/28/2022]
Abstract
Next generation sequencing (NGS) platforms are replacing traditional molecular biology protocols like cloning and Sanger sequencing. However, the accuracy of NGS platforms has rarely been measured when quantifying relative frequencies of genotypes or taxa within populations. Here we developed a new bioinformatic pipeline (QRS) that pools similar sequence variants and estimates their frequencies in NGS data sets from populations or communities. We tested whether the estimated frequency of representative sequences, generated by 454 amplicon sequencing, differs significantly from that obtained by Sanger sequencing of cloned PCR products. This was performed by analysing sequence variation of the highly variable first internal transcribed spacer (ITS1) of the ichthyosporean Caullerya mesnili, a microparasite of cladocerans of the genus Daphnia. This analysis also serves as a case example of the usage of this pipeline to study within-population variation. Additionally, a public Illumina data set was used to validate the pipeline on community-level data. Overall, there was a good correspondence in absolute frequencies of C. mesnili ITS1 sequences obtained from Sanger and 454 platforms. Furthermore, analyses of molecular variance (AMOVA) revealed that population structure of C. mesnili differs across lakes and years independently of the sequencing platform. Our results support not only the usefulness of amplicon sequencing data for studies of within-population structure but also the successful application of the QRS pipeline on Illumina-generated data. The QRS pipeline is freely available together with its documentation under GNU Public Licence version 3 at http://code.google.com/p/quantification-representative-sequences.
Affiliation(s)
- E González-Tortuero
  - Department of Ecosystem Research, Leibniz-Institute of Freshwater Ecology and Inland Fisheries (IGB), Müggelseedamm 301, 12587, Berlin, Germany
  - Berlin Centre for Genomics in Biodiversity Research (BeGenDiv), Königin-Luise-Straße 6-8, 14195, Berlin, Germany
  - Department of Biology II, Ludwig Maximilians University, Großhaderner Straße 2, 82512, Planegg-Martinsried, Germany
- J Rusek
  - Department of Biology II, Ludwig Maximilians University, Großhaderner Straße 2, 82512, Planegg-Martinsried, Germany
- A Petrusek
  - Department of Ecology, Faculty of Science, Charles University in Prague, Viničná 7, 128 44, Prague, Czech Republic
- S Gießler
  - Department of Biology II, Ludwig Maximilians University, Großhaderner Straße 2, 82512, Planegg-Martinsried, Germany
- D Lyras
  - Department of Biology II, Ludwig Maximilians University, Großhaderner Straße 2, 82512, Planegg-Martinsried, Germany
- S Grath
  - Department of Biology II, Ludwig Maximilians University, Großhaderner Straße 2, 82512, Planegg-Martinsried, Germany
- F Castro-Monzón
  - Department of Ecosystem Research, Leibniz-Institute of Freshwater Ecology and Inland Fisheries (IGB), Müggelseedamm 301, 12587, Berlin, Germany
- J Wolinska
  - Department of Ecosystem Research, Leibniz-Institute of Freshwater Ecology and Inland Fisheries (IGB), Müggelseedamm 301, 12587, Berlin, Germany
16
Quantitative trait locus mapping methods for diversity outbred mice. G3-GENES GENOMES GENETICS 2014; 4:1623-33. [PMID: 25237114 PMCID: PMC4169154 DOI: 10.1534/g3.114.013748] [Citation(s) in RCA: 150] [Impact Index Per Article: 15.0] [Indexed: 11/26/2022]
Abstract
Genetic mapping studies in the mouse and other model organisms are used to search for genes underlying complex phenotypes. Traditional genetic mapping studies that employ single-generation crosses have poor mapping resolution and limit discovery to loci that are polymorphic between the two parental strains. Multiparent outbreeding populations address these shortcomings by increasing the density of recombination events and introducing allelic variants from multiple founder strains. However, multiparent crosses present new analytical challenges and require specialized software to take full advantage of these benefits. Each animal in an outbreeding population is genetically unique and must be genotyped using a high-density marker set; regression models for mapping must accommodate multiple founder alleles, and complex breeding designs give rise to polygenic covariance among related animals that must be accounted for in mapping analysis. The Diversity Outbred (DO) mice combine the genetic diversity of eight founder strains in a multigenerational breeding design that has been maintained for >16 generations. The large population size and randomized mating ensure the long-term genetic stability of this population. We present a complete analytical pipeline for genetic mapping in DO mice, including algorithms for probabilistic reconstruction of founder haplotypes from genotyping array intensity data, and mapping methods that accommodate multiple founder haplotypes and account for relatedness among animals. Power analysis suggests that studies with as few as 200 DO mice can detect loci with large effects, but loci that account for <5% of trait variance may require a sample size of up to 1000 animals. The methods described here are implemented in the freely available R package DOQTL.
17
Suvorova YM, Korotkova MA, Korotkov EV. Study of the Paired Change Points in Bacterial Genes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:955-964. [PMID: 26356866 DOI: 10.1109/tcbb.2014.2321154] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
It is known that nucleotide sequences are not totally homogeneous and this heterogeneity could not be due to random fluctuations only. Such heterogeneity poses a problem of making sequence segmentation into a set of homogeneous parts divided by the points called "change points". In this work we investigated a special case of change points-paired change points (PCP). We used a well-known property of coding sequences-triplet periodicity (TP). The sequences that we are especially interested in consist of three successive parts: the first and the last parts have similar TP while the middle part has different TP type. We aimed to find the genes with PCP and provide explanation for this phenomenon. We developed a mathematical method for the PCP detection based on the new measure of similarity between TP matrices. We investigated 66,936 bacterial genes from 17 bacterial genomes and revealed 2,700 genes with PCP and 6,459 genes with single change point (SCP). We developed a mathematical approach to visualize the PCP cases. We suppose that PCP could be associated with double fusion or insertion events. The results of investigating the sequences with artificial insertions/fusions and distribution of TP inside the genome support the idea that the real number of genes formed by insertion/ fusion events could be 5-7 times greater than the number of genes revealed in the present work.
18
Algama M, Keith JM. Investigating genomic structure using changept: A Bayesian segmentation model. Comput Struct Biotechnol J 2014; 10:107-15. [PMID: 25349679 PMCID: PMC4204429 DOI: 10.1016/j.csbj.2014.08.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 01/06/2023] Open
Abstract
Genomes are composed of a wide variety of elements with distinct roles and characteristics. Some of these elements are well-characterised functional components such as protein-coding exons. Other elements play regulatory or structural roles, encode functional non-protein-coding RNAs, or perform some other function yet to be characterised. Still others may have no functional importance, though they may nevertheless be of interest to biologists. One technique for investigating the composition of genomes is to segment sequences into compositionally homogenous blocks. This technique, known as 'sequence segmentation' or 'change-point analysis', is used to identify patterns of variation across genomes such as GC-rich and GC-poor regions, coding and non-coding regions, slowly evolving and rapidly evolving regions and many other types of variation. In this mini-review we outline many of the genome segmentation methods currently available and then focus on a Bayesian DNA segmentation algorithm, with examples of its various applications.
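To make "change-point analysis" concrete, here is a minimal sketch that finds a single change point in GC content by maximum likelihood. It is not the Bayesian changept algorithm the review focuses on; the two-segment Bernoulli model, the function names, and the toy sequence are all assumptions chosen for brevity.

```python
import math

def bernoulli_loglik(k, n):
    """Maximised log-likelihood of k successes in n Bernoulli trials."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def best_change_point(seq):
    """Return the split index that maximises the two-segment GC log-likelihood."""
    gc = [1 if base in "GC" else 0 for base in seq.upper()]
    n, total = len(gc), sum(gc)
    best_idx, best_ll = None, -math.inf
    left = 0
    for i in range(1, n):  # candidate split between positions i-1 and i
        left += gc[i - 1]
        ll = bernoulli_loglik(left, i) + bernoulli_loglik(total - left, n - i)
        if ll > best_ll:
            best_idx, best_ll = i, ll
    return best_idx

# Toy sequence (hypothetical): an AT-only block followed by a GC-only block.
toy = "ATATATATTA" + "GCGCGGCCGC"
print(best_change_point(toy))  # 10
```

Real segmentation methods must also decide how many change points to place, which is where model selection, error control, or a Bayesian prior comes in.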
Affiliation(s)
- Manjula Algama
- School of Mathematical Sciences, Monash University, Clayton, VIC 3800, Australia
- Jonathan M Keith
- School of Mathematical Sciences, Monash University, Clayton, VIC 3800, Australia
19
Futschik A, Hotz T, Munk A, Sieling H. Multiscale DNA partitioning: statistical evidence for segments. Bioinformatics 2014; 30:2255-62. [PMID: 24753487 DOI: 10.1093/bioinformatics/btu180] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: DNA segmentation, i.e. the partitioning of DNA in compositionally homogeneous segments, is a basic task in bioinformatics. Different algorithms have been proposed for various partitioning criteria such as Guanine/Cytosine (GC) content, local ancestry in population genetics or copy number variation. A critical component of any such method is the choice of an appropriate number of segments. Some methods use model selection criteria and do not provide a suitable error control. Other methods that are based on simulating a statistic under a null model provide suitable error control only if the correct null model is chosen. RESULTS: Here, we focus on partitioning with respect to GC content and propose a new approach that provides statistical error control: as in statistical hypothesis testing, it guarantees with a user-specified probability [Formula: see text] that the number of identified segments does not exceed the number of actually present segments. The method is based on a statistical multiscale criterion, rendering this as a segmentation method that searches segments of any length (on all scales) simultaneously. It is also accurate in localizing segments: under benchmark scenarios, our approach leads to a segmentation that is more accurate than the approaches discussed in the comparative review of Elhaik et al. In our real data examples, we find segments that often correspond well to features taken from standard University of California at Santa Cruz (UCSC) genome annotation tracks. AVAILABILITY AND IMPLEMENTATION: Our method is implemented in function smuceR of the R-package stepR available at http://www.stochastik.math.uni-goettingen.de/smuce.
Affiliation(s)
- Andreas Futschik
- Department of Applied Statistics, JK University Linz, A-4040 Linz, Austria, Institute of Mathematics, Technische Universität Ilmenau, D-98693 Ilmenau, Germany, Institute for Mathematical Stochastics and Felix Bernstein Institute for Mathematical Statistics in Biosciences, Georgia Augusta University of Goettingen and Max Planck Institute for Biophysical Chemistry, D-37077 Goettingen, Germany
- Thomas Hotz
- Department of Applied Statistics, JK University Linz, A-4040 Linz, Austria, Institute of Mathematics, Technische Universität Ilmenau, D-98693 Ilmenau, Germany, Institute for Mathematical Stochastics and Felix Bernstein Institute for Mathematical Statistics in Biosciences, Georgia Augusta University of Goettingen and Max Planck Institute for Biophysical Chemistry, D-37077 Goettingen, Germany
- Axel Munk
- Department of Applied Statistics, JK University Linz, A-4040 Linz, Austria, Institute of Mathematics, Technische Universität Ilmenau, D-98693 Ilmenau, Germany, Institute for Mathematical Stochastics and Felix Bernstein Institute for Mathematical Statistics in Biosciences, Georgia Augusta University of Goettingen and Max Planck Institute for Biophysical Chemistry, D-37077 Goettingen, Germany
- Hannes Sieling
- Department of Applied Statistics, JK University Linz, A-4040 Linz, Austria, Institute of Mathematics, Technische Universität Ilmenau, D-98693 Ilmenau, Germany, Institute for Mathematical Stochastics and Felix Bernstein Institute for Mathematical Statistics in Biosciences, Georgia Augusta University of Goettingen and Max Planck Institute for Biophysical Chemistry, D-37077 Goettingen, Germany
20
Minoarivelo HO, Hui C, Terblanche JS, Pond SLK, Scheffler K. Detecting phylogenetic signal in mutualistic interaction networks using a Markov process model. Oikos 2014; 123:1250-1260. [PMID: 25294947 DOI: 10.1111/oik.00857] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Indexed: 11/29/2022]
Abstract
Ecological interaction networks, such as those describing the mutualistic interactions between plants and their pollinators or between plants and their frugivores, exhibit non-random structural properties that cannot be explained by simple models of network formation. One factor affecting the formation and eventual structure of such a network is its evolutionary history. We argue that this, in many cases, is closely linked to the evolutionary histories of the species involved in the interactions. Indeed, empirical studies of interaction networks along with the phylogenies of the interacting species have demonstrated significant associations between phylogeny and network structure. To date, however, no generative model explaining the way in which the evolution of individual species affects the evolution of interaction networks has been proposed. We present a model describing the evolution of pairwise interactions as a branching Markov process, drawing on phylogenetic models of molecular evolution. Using knowledge of the phylogenies of the interacting species, our model yielded a significantly better fit to 21% of a set of plant-pollinator and plant-frugivore mutualistic networks. This highlights the importance, in a substantial minority of cases, of inheritance of interaction patterns without excluding the potential role of ecological novelties in forming the current network architecture. We suggest that our model can be used as a null model for controlling evolutionary signals when evaluating the role of other factors in shaping the emergence of ecological networks.
Affiliation(s)
- H O Minoarivelo
- Computer Science Division, Dept of Mathematical Sciences, Stellenbosch Univ., Matieland 7602, South Africa; Centre for Invasion Biology, Dept of Mathematical Sciences, Stellenbosch Univ., Matieland 7602, South Africa
- C Hui
- Centre for Invasion Biology, Dept of Mathematical Sciences, Stellenbosch Univ., Matieland 7602, South Africa
- J S Terblanche
- Centre for Invasion Biology, Dept of Conservation Ecology and Entomology, Stellenbosch Univ., Matieland 7602, South Africa
- S L Kosakovsky Pond
- Dept of Medicine, Univ. of California, San Diego, USA
- K Scheffler
- Computer Science Division, Dept of Mathematical Sciences, Stellenbosch Univ., Matieland 7602, South Africa; Dept of Medicine, Univ. of California, San Diego, USA
21
Sand A, Kristiansen M, Pedersen CNS, Mailund T. zipHMMlib: a highly optimised HMM library exploiting repetitions in the input to speed up the forward algorithm. BMC Bioinformatics 2013; 14:339. [PMID: 24266924 PMCID: PMC4222747 DOI: 10.1186/1471-2105-14-339] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 06/18/2013] [Accepted: 11/14/2013] [Indexed: 11/10/2022] Open
Abstract
Background: Hidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms. Calculating the likelihood of a model using the forward algorithm has worst case time complexity linear in the length of the sequence and quadratic in the number of states in the model. For genome analysis, however, the length runs to millions or billions of observations, and when maximising the likelihood hundreds of evaluations are often needed. A time efficient forward algorithm is therefore a key ingredient in an efficient hidden Markov model library. Results: We have built a software library for efficiently computing the likelihood of a hidden Markov model. The library exploits commonly occurring substrings in the input to reuse computations in the forward algorithm. In a pre-processing step our library identifies common substrings and builds a structure over the computations in the forward algorithm which can be reused. This analysis can be saved between uses of the library and is independent of concrete hidden Markov models so one preprocessing can be used to run a number of different models. Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model. In one particular case the analysis was performed in less than 8 minutes compared to 9.6 hours for the previously fastest library. Conclusions: We have implemented the preprocessing procedure and forward algorithm as a C++ library, zipHMM, with Python bindings for use in scripts. The library is available at http://birc.au.dk/software/ziphmm/.
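For orientation, the plain forward algorithm whose cost the abstract describes (linear in sequence length, quadratic in the number of states) can be sketched as follows. This is a textbook scaled forward pass in pure Python, not zipHMM's API and not its substring-reuse optimisation; the function name, argument layout, and toy model are assumptions for illustration, and the sketch assumes the observed sequence has nonzero probability under the model.

```python
import math

def forward_loglik(obs, init, trans, emit):
    """Log-likelihood of an observation sequence under an HMM (scaled forward pass).

    obs: list of symbol indices; init[i]: start probability of state i;
    trans[i][j]: probability of moving from state i to j;
    emit[i][o]: probability that state i emits symbol o.
    """
    n_states = len(init)
    alpha = [init[i] * emit[i][obs[0]] for i in range(n_states)]
    scale = sum(alpha)
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]
    for o in obs[1:]:
        # Quadratic work per observation: each state sums over every predecessor.
        alpha = [
            emit[j][o] * sum(alpha[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
        scale = sum(alpha)  # rescaling avoids underflow on long sequences
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik

# Toy two-state model over a binary alphabet (all numbers hypothetical).
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.7, 0.3], [0.1, 0.9]]
print(forward_loglik([0, 1, 1, 0], init, trans, emit))
```

Because the loop visits every observation once, repeated substrings in a genome-scale input trigger the same arithmetic again and again, which is the redundancy the library is described as exploiting.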
Affiliation(s)
- Andreas Sand
- Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark.
22
Gompert Z, Buerkle CA. Analyses of genetic ancestry enable key insights for molecular ecology. Mol Ecol 2013; 22:5278-94. [DOI: 10.1111/mec.12488] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 05/11/2013] [Revised: 08/05/2013] [Accepted: 08/08/2013] [Indexed: 12/15/2022]
Affiliation(s)
- C. Alex Buerkle
- Department of Botany; University of Wyoming; Laramie WY 82071 USA
23
Gao R, Hu W, Tarn TJ. The application of finite state machine in modeling and control of gene mutation process. IEEE Trans Nanobioscience 2013; 12:265-74. [PMID: 23771396 DOI: 10.1109/tnb.2013.2260866] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/06/2022]
Abstract
This paper extends our previous study on discrete event system formulations of DNA hybridization, and focuses on metabolism and gene mutation in molecular biology. Finite state machine (FSM) theory is extensively applied to represent key concepts and to analyze the processes related to the biological phenomena mentioned above. The goal is to mathematically represent and interpret the process of gene mutation and the effects of gene mutation on the structures of protein macromolecules. We hope the proposed model will provide a foothold for introducing information science and control theory tools into molecular biology.
24
Gonzalez MW, Spouge JL. Domain analysis of symbionts and hosts (DASH) in a genome-wide survey of pathogenic human viruses. BMC Res Notes 2013; 6:209. [PMID: 23706066 PMCID: PMC3672079 DOI: 10.1186/1756-0500-6-209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2012] [Accepted: 05/17/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: In the coevolution of viruses and their hosts, viruses often capture host genes, gaining advantageous functions (e.g. immune system control). Identifying functional similarities shared by viruses and their hosts can help decipher mechanisms of pathogenesis and accelerate virus-targeted drug and vaccine development. Cellular homologs in viruses are usually documented using pairwise-sequence comparison methods. Yet, pairwise-sequence searches have limited sensitivity resulting in poor identification of divergent homologies. RESULTS: Methods based on profiles from multiple sequences provide a more sensitive alternative to identify similarities in host-pathogen systems. The present work describes a profile-based bioinformatics pipeline that we call the Domain Analysis of Symbionts and Hosts (DASH). DASH provides a web platform for the functional analysis of viral and host genomes. This study uses Human Herpesvirus 8 (HHV-8) as a model to validate the methodology. Our results indicate that HHV-8 shares at least 29% of its genes with humans (fourteen immunomodulatory and ten metabolic genes). DASH also suggests functions for fifty-one additional HHV-8 structural and metabolic proteins. We also perform two other comparative genomics studies of human viruses: (1) a broad survey of eleven viruses of disparate sizes and transcription strategies; and (2) a closer examination of forty-one viruses of the order Mononegavirales. In the survey, DASH detects human homologs in 4/5 DNA viruses. None of the non-retro-transcribing RNA viruses in the survey showed evidence of homology to humans. The order Mononegavirales are also non-retro-transcribing RNA viruses, however, and DASH found homology in 39/41 of them. Mononegaviruses display larger fractions of human similarities (up to 75%) than any of the other RNA or DNA viruses (up to 55% and 29% respectively).
CONCLUSIONS: We conclude that gene sharing probably occurs between humans and both DNA and RNA viruses, in viral genomes of differing sizes, regardless of transcription strategies. Our method (DASH) simultaneously analyzes the genomes of two interacting species thereby mining functional information to identify shared as well as exclusive domains to each organism. Our results validate our approach, showing that DASH has potential as a pipeline for making therapeutic discoveries in other host-symbiont systems. DASH results are available at http://tinyurl.com/spouge-dash.
Affiliation(s)
- Mileidy W Gonzalez
- National Institutes of Health, National Library of Medicine, National Center for Biotechnology Information, 8600 Rockville Pike, Building 38A, Room 6N611-M, Bethesda, MD 20894, USA.
25
Choquet R, Guédon Y, Besnard A, Guillemain M, Pradel R. Estimating stop over duration in the presence of trap-effects. Ecol Modell 2013. [DOI: 10.1016/j.ecolmodel.2012.11.002] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 11/24/2022]
26
Choi H, Fermin D, Nesvizhskii AI, Ghosh D, Qin ZS. Sparsely correlated hidden Markov models with application to genome-wide location studies. Bioinformatics 2013; 29:533-41. [PMID: 23325620 DOI: 10.1093/bioinformatics/btt012] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 01/28/2023]
Abstract
MOTIVATION: Multiply correlated datasets have become increasingly common in genome-wide location analysis of regulatory proteins and epigenetic modifications. Their correlation can be directly incorporated into a statistical model to capture underlying biological interactions, but such modeling quickly becomes computationally intractable. RESULTS: We present sparsely correlated hidden Markov models (scHMM), a novel method for performing simultaneous hidden Markov model (HMM) inference for multiple genomic datasets. In scHMM, a single HMM is assumed for each series, but the transition probability in each series depends on not only its own hidden states but also the hidden states of other related series. For each series, scHMM uses penalized regression to select a subset of the other data series and estimate their effects on the odds of each transition in the given series. Following this, hidden states are inferred using a standard forward-backward algorithm, with the transition probabilities adjusted by the model at each position, which helps retain the order of computation close to fitting independent HMMs (iHMM). Hence, scHMM is a collection of inter-dependent non-homogeneous HMMs, capable of giving a close approximation to a fully multivariate HMM fit. A simulation study shows that scHMM achieves comparable sensitivity to the multivariate HMM fit at a much lower computational cost. The method was demonstrated in the joint analysis of 39 histone modifications, CTCF and RNA polymerase II in human CD4+ T cells. scHMM reported fewer high-confidence regions than iHMM in this dataset, but scHMM could recover previously characterized histone modifications in relevant genomic regions better than iHMM. In addition, the resulting combinatorial patterns from scHMM could be better mapped to the 51 states reported by the multivariate HMM method of Ernst and Kellis.
AVAILABILITY: The scHMM package can be freely downloaded from http://sourceforge.net/p/schmm/ and is recommended for use in a Linux environment.
Affiliation(s)
- Hyungwon Choi
- National University of Singapore and National University Health System, Singapore 117597, Singapore
27
Abstract
Since the emergence of high-throughput genome sequencing platforms and more recently the next-generation platforms, the genome databases are growing at an astronomical rate. Tremendous efforts have been invested in recent years in understanding intriguing complexities beneath the vast ocean of genomic data. This is apparent in the spurt of computational methods for interpreting these data in the past few years. Genomic data interpretation is notoriously difficult, partly owing to the inherent heterogeneities appearing at different scales. Methods developed to interpret these data often suffer from their inability to adequately measure the underlying heterogeneities and thus lead to confounding results. Here, we present an information entropy-based approach that unravels the distinctive patterns underlying genomic data efficiently and thus is applicable in addressing a variety of biological problems. We show the robustness and consistency of the proposed methodology in addressing three different biological problems of significance—identification of alien DNAs in bacterial genomes, detection of structural variants in cancer cell lines and alignment-free genome comparison.
Affiliation(s)
- Rajeev K Azad
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260, USA.
28
Rios JJ, Shastry S, Jasso J, Hauser N, Garg A, Bensadoun A, Cohen JC, Hobbs HH. Deletion of GPIHBP1 causing severe chylomicronemia. J Inherit Metab Dis 2012; 35:531-40. [PMID: 22008945 PMCID: PMC3319888 DOI: 10.1007/s10545-011-9406-5] [Citation(s) in RCA: 78] [Impact Index Per Article: 6.5] [Received: 07/18/2011] [Revised: 09/20/2011] [Accepted: 09/22/2011] [Indexed: 12/19/2022]
Abstract
Lipoprotein lipase (LPL) is a hydrolase that cleaves circulating triglycerides to release fatty acids to the surrounding tissues. The enzyme is synthesized in parenchymal cells and is transported to its site of action on the capillary endothelium by glycophosphatidylinositol (GPI)-anchored high-density lipoprotein-binding protein 1 (GPIHBP1). Inactivating mutations in LPL; in its cofactor, apolipoprotein (Apo) C2; or in GPIHBP1 cause severe hypertriglyceridemia. Here we describe an individual with complete deficiency of GPIHBP1. The proband was an Asian Indian boy who had severe chylomicronemia at 2 months of age. Array-based copy-number analysis of his genomic DNA revealed homozygosity for a 17.5-kb deletion that included GPIHBP1. A 44-year-old aunt with a history of hypertriglyceridemia and pancreatitis was also homozygous for the deletion. A bolus of intravenously administered heparin caused a rapid increase in circulating LPL and decreased plasma triglyceride levels in control individuals but not in two GPIHBP1-deficient patients. Thus, short-term treatment with heparin failed to attenuate the hypertriglyceridemia in patients with GPIHBP1 deficiency. The increasing resolution of copy number microarrays and their widespread adoption for routine cytogenetic analysis is likely to reveal a greater role for submicroscopic deletions in Mendelian conditions. We describe the first neonate with complete GPIHBP1 deficiency due to homozygosity for a deletion of GPIHBP1.
Affiliation(s)
- Jonathan J. Rios
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390 USA
- Savitha Shastry
- Division of Nutrition and Metabolic Diseases, Center for Human Nutrition, University of Texas Southwestern Medical Center, Dallas, TX USA
- Juan Jasso
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390 USA
- Natalie Hauser
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390 USA
- Abhimanyu Garg
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390 USA
- Division of Nutrition and Metabolic Diseases, Center for Human Nutrition, University of Texas Southwestern Medical Center, Dallas, TX USA
- André Bensadoun
- Division of Nutritional Sciences, Cornell University, Ithaca, NY USA
- Jonathan C. Cohen
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390 USA
- Division of Nutrition and Metabolic Diseases, Center for Human Nutrition, University of Texas Southwestern Medical Center, Dallas, TX USA
- Helen H. Hobbs
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390 USA
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX USA
29
30
Mukherjee S, Mitra S. Hidden Markov models, grammars, and biology: a tutorial. J Bioinform Comput Biol 2011; 3:491-526. [PMID: 15852517 DOI: 10.1142/s0219720005001077] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Received: 04/23/2004] [Revised: 01/05/2004] [Accepted: 01/06/2005] [Indexed: 11/18/2022]
Abstract
Biological sequences and structures have been modelled using various machine learning techniques and abstract mathematical concepts. This article surveys methods using Hidden Markov Models and functional grammars for this purpose. We provide a formal introduction to Hidden Markov Models and grammars, stressing a comprehensive mathematical description of the methods and their natural continuity. The basic algorithms and their application to analyzing biological sequences and modelling structures of bio-molecules like proteins and nucleic acids are discussed. The different approaches are compared, and possible areas of work and problems are highlighted. Related databases and software available on the internet are also mentioned.
Affiliation(s)
- Shibaji Mukherjee
- Association for Studies in Computational Biology, Kolkata 700 018, India.
31
32
Ekisheva S, Borodovsky M. Uniform Accuracy of the Maximum Likelihood Estimates for Probabilistic Models of Biological Sequences. Methodol Comput Appl Probab 2011; 13:105-120. [PMID: 21318122 PMCID: PMC3035201 DOI: 10.1007/s11009-009-9125-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
Probabilistic models for biological sequences (DNA and proteins) have many useful applications in bioinformatics. Normally, the values of parameters of these models have to be estimated from empirical data. However, even for the most common estimates, the maximum likelihood (ML) estimates, properties have not been completely explored. Here we assess the uniform accuracy of the ML estimates for models of several types: the independence model, the Markov chain and the hidden Markov model (HMM). Particularly, we derive rates of decay of the maximum estimation error by employing the measure concentration as well as the Gaussian approximation, and compare these rates.
Affiliation(s)
- Svetlana Ekisheva
- Department of Mathematics, Syktyvkar State University, Oktjabrskii pr., 55, Syktyvkar, 167000, Russia
- Mark Borodovsky
- Wallace H. Coulter Department of Biomedical Engineering and Computational Science and Engineering Division, Georgia Institute of Technology, Atlanta, GA 30332-0535, USA
33
Margam VM, Coates BS, Hellmich RL, Agunbiade T, Seufferheld MJ, Sun W, Ba MN, Sanon A, Binso-Dabire CL, Baoua I, Ishiyaku MF, Covas FG, Srinivasan R, Armstrong J, Murdock LL, Pittendrigh BR. Mitochondrial genome sequence and expression profiling for the legume pod borer Maruca vitrata (Lepidoptera: Crambidae). PLoS One 2011; 6:e16444. [PMID: 21311752 PMCID: PMC3032770 DOI: 10.1371/journal.pone.0016444] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Received: 10/18/2010] [Accepted: 12/20/2010] [Indexed: 11/18/2022] Open
Abstract
We report the assembly of the 14,054 bp near complete sequencing of the mitochondrial genome of the legume pod borer (LPB), Maruca vitrata (Lepidoptera: Crambidae), which we subsequently used to estimate divergence and relationships within the lepidopteran lineage. The arrangement and orientation of the 13 protein-coding, 2 rRNA, and 19 tRNA genes sequenced was typical of insect mitochondrial DNA sequences described to date. The sequence contained a high A+T content of 80.1% and a bias for the use of codons with A or T nucleotides in the 3rd position. Transcript mapping with midgut and salivary gland ESTs for mitochondrial genome annotation showed that translation from protein-coding genes initiates and terminates at standard mitochondrial codons, except for the coxI gene, which may start from an arginine CGA codon. The genomic copy of coxII terminates at a T nucleotide, and a proposed polyadenylation mechanism for completion of the TAA stop codon was confirmed by comparisons to EST data. EST contig data further showed that mature M. vitrata mitochondrial transcripts are monocistronic, except for bicistronic transcripts for overlapping genes nd4/nd4L and nd6/cytb, and a tricistronic transcript for atp8/atp6/coxIII. This processing of polycistronic mitochondrial transcripts adheres to the tRNA punctuated cleavage mechanism, whereby mature transcripts are cleaved only at intervening tRNA gene sequences. In contrast, the tricistronic atp8/atp6/coxIII in Drosophila is present as separate atp8/atp6 and coxIII transcripts despite the lack of an intervening tRNA. Our results indicate that mitochondrial processing mechanisms vary between arthropod species, and that it is crucial to use transcriptional information to obtain full annotation of mitochondrial genomes.
Affiliation(s)
- Venu M. Margam
- Department of Entomology, Purdue University, West Lafayette, Indiana, United States of America
- Brad S. Coates
- United States Department of Agriculture – Agricultural Research Service, Corn Insect and Crop Genetics Research Unit, Genetics Laboratory, Iowa State University, Ames, Iowa, United States of America
- Richard L. Hellmich
- United States Department of Agriculture – Agricultural Research Service, Corn Insect and Crop Genetics Research Unit, Genetics Laboratory, Iowa State University, Ames, Iowa, United States of America
- Tolulope Agunbiade
- Department of Entomology, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
- Manfredo J. Seufferheld
- Department of Crop Sciences, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
- Weilin Sun
- Department of Entomology, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
- Malick N. Ba
- Station de Kamboinsé, Institut de l'Environnement et de Recherches Agricoles (INERA), Ouagadougou, Burkina Faso
- Antoine Sanon
- Station de Kamboinsé, Institut de l'Environnement et de Recherches Agricoles (INERA), Ouagadougou, Burkina Faso
- Clementine L. Binso-Dabire
- Station de Kamboinsé, Institut de l'Environnement et de Recherches Agricoles (INERA), Ouagadougou, Burkina Faso
- Ibrahim Baoua
- Institut National de Recherche Agronomique du Niger, Maradi, Niger
- Mohammad F. Ishiyaku
- Department of Plant Science, Institute for Agricultural Research, Ahmadu Bello University, Zaria, Nigeria
- Fernando G. Covas
- University of Puerto Rico, Mayaguez, Puerto Rico, United States of America
- Joel Armstrong
- Entomology, The Commonwealth of Scientific and Industrial Research Organization, Black Mountain, Australian Capital Territory, Australia
- Larry L. Murdock
- Department of Entomology, Purdue University, West Lafayette, Indiana, United States of America
- Barry R. Pittendrigh
- Department of Entomology, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
34
Chen G, Zhou Q. Heterogeneity in DNA multiple alignments: modeling, inference, and applications in motif finding. Biometrics 2011; 66:694-704. [PMID: 19995355 DOI: 10.1111/j.1541-0420.2009.01362.x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 12/31/2022]
Abstract
Transcription factors bind sequence-specific sites in DNA to regulate gene transcription. Identifying transcription factor binding sites (TFBSs) is an important step for understanding gene regulation. Although sophisticated in modeling TFBSs and their combinatorial patterns, computational methods for TFBS detection and motif finding often make oversimplified homogeneous model assumptions for background sequences. Since nucleotide base composition varies across genomic regions, it is expected to be helpful for motif finding to incorporate the heterogeneity into background modeling. When sequences from multiple species are utilized, variation in evolutionary conservation violates the common assumption of an identical conservation level in multiple alignments. To handle both types of heterogeneity, we propose a generative model in which a segmented Markov chain is used to partition a multiple alignment into regions of homogeneous nucleotide base composition and a hidden Markov model (HMM) is employed to account for different conservation levels. Bayesian inference on the model is developed via Gibbs sampling with dynamic programming recursions. Simulation studies and empirical evidence from biological data sets reveal the dramatic effect of background modeling on motif finding, and demonstrate that the proposed approach is able to achieve substantial improvements over commonly used background models.
Affiliation(s)
- Gong Chen
  - Department of Statistics, University of California, Los Angeles, Los Angeles, California 90095, USA
|
35
|
Bickel PJ, Boley N, Brown JB, Huang H, Zhang NR. Subsampling methods for genomic inference. Ann Appl Stat 2010. [DOI: 10.1214/10-aoas363] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.8] [Indexed: 11/19/2022]
|
36
|
Regional context in the alignment of biological sequence pairs. J Mol Evol 2010; 72:147-59. [PMID: 21107551 PMCID: PMC3064887 DOI: 10.1007/s00239-010-9409-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 07/30/2010] [Accepted: 11/08/2010] [Indexed: 11/24/2022]
Abstract
Sequence divergence derives from either point substitution or indel (insertion or deletion) processes. We investigated the rates of these two processes both in protein and non-protein coding DNA. We aligned sequence pairs using two pair-hidden Markov models (PHMMs) conjoined by one silent state. The two PHMMs had their own set of parameters to model rates in their respective regions. The aim was to test the hypothesis that the indel mutation rate mimics the point mutation rate. That is, indels are found less often in conserved regions (slow point substitution rate) and more often in non-conserved regions (fast point substitution rate). Both polypeptides and rRNA molecules in our data exhibited a clear distinction between slow and fast rates of the two processes. These two rates served as surrogates for conserved and non-conserved secondary structure components, respectively. With polypeptides we found both the fast indel rate and the fast replacement rate were colocated with hydrophilic residues. We also found that the average concordance of our alignments with corresponding curated alignments improves markedly when the model allows either of the two fast rates to colocate with hydrophilic residues. With rRNA molecules, our model did not detect colocation between the fast indel rate and the fast substitution rate. Nevertheless, coupling the indel rates with the point substitution rates across the two regions markedly increased model fit. This result suggests that rRNA pairwise alignments should be modeled after allowing for the two processes to vary simultaneously and independently in the two regions.
|
37
|
Statistical Issues in the Analysis of ChIP-Seq and RNA-Seq Data. Genes (Basel) 2010; 1:317-34. [PMID: 24710049 PMCID: PMC3954086 DOI: 10.3390/genes1020317] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 08/17/2010] [Accepted: 09/20/2010] [Indexed: 11/29/2022] Open
Abstract
The recent arrival of ultra-high throughput, next generation sequencing (NGS) technologies has revolutionized the genetics and genomics fields by allowing rapid and inexpensive sequencing of billions of bases. The rapid deployment of NGS in a variety of sequencing-based experiments has resulted in fast accumulation of massive amounts of sequencing data. To process this new type of data, a torrent of increasingly sophisticated algorithms and software tools are emerging to help the analysis stage of the NGS applications. In this article, we strive to comprehensively identify the critical challenges that arise from all stages of NGS data analysis and provide an objective overview of what has been achieved in existing works. At the same time, we highlight selected areas that need much further research to improve our current capabilities to delineate the most information possible from NGS data. The article focuses on applications dealing with ChIP-Seq and RNA-Seq.
|
38
|
Mount DW. Using hidden Markov models to align multiple sequences. Cold Spring Harb Protoc 2010; 2009:pdb.top41. [PMID: 20147223 DOI: 10.1101/pdb.top41] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Indexed: 11/25/2022]
Abstract
A hidden Markov model (HMM) is a probabilistic model of a multiple sequence alignment (msa) of proteins. In the model, each column of symbols in the alignment is represented by a frequency distribution of the symbols (called a "state"), and insertions and deletions are represented by other states. One moves through the model along a particular path from state to state in a Markov chain (i.e., random choice of next move), trying to match a given sequence. The next matching symbol is chosen from each state, recording its probability (frequency) and also the probability of going to that state from a previous one (the transition probability). State and transition probabilities are multiplied to obtain a probability of the given sequence. The hidden nature of the HMM is due to the lack of information about the value of a specific state, which is instead represented by a probability distribution over all possible values. This article discusses the advantages and disadvantages of HMMs in msa and presents algorithms for calculating an HMM and the conditions for producing the best HMM.
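The multiplication of state and transition probabilities described in this abstract can be sketched in Python as follows. The toy model, its state names (`M1`, `I1`, `M2`), and every probability value are hypothetical and chosen only for illustration; silent delete states are left out for brevity, so this is a simplified sketch rather than the article's algorithm.

```python
# Hypothetical toy model: two match states and one insert state.
START = {"M1": 1.0}
TRANSITIONS = {("M1", "M2"): 0.9, ("M1", "I1"): 0.1, ("I1", "M2"): 1.0}
EMISSIONS = {
    "M1": {"A": 0.8, "C": 0.2},
    "I1": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "M2": {"G": 0.7, "T": 0.3},
}


def path_probability(path, sequence):
    """Probability of emitting `sequence` along one fixed state `path`."""
    if len(path) != len(sequence):
        raise ValueError("path and sequence must have the same length")
    # Start probability times the first emission.
    prob = START.get(path[0], 0.0) * EMISSIONS[path[0]].get(sequence[0], 0.0)
    # Each later step multiplies in one transition and one emission.
    for prev, state, symbol in zip(path, path[1:], sequence[1:]):
        prob *= TRANSITIONS.get((prev, state), 0.0)
        prob *= EMISSIONS[state].get(symbol, 0.0)
    return prob
```

For example, `path_probability(["M1", "M2"], "AG")` multiplies 1.0 × 0.8 × 0.9 × 0.7. Summing such products over all paths, or taking the maximum, is what the forward and Viterbi recursions do efficiently.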
|
39
|
Wu H, Caffo B, Jaffee HA, Irizarry RA, Feinberg AP. Redefining CpG islands using hidden Markov models. Biostatistics 2010; 11:499-514. [PMID: 20212320 DOI: 10.1093/biostatistics/kxq005] [Citation(s) in RCA: 119] [Impact Index Per Article: 8.5] [Indexed: 11/13/2022] Open
Abstract
The DNA of most vertebrates is depleted in CpG dinucleotides: a C followed by a G in the 5' to 3' direction. CpGs are the target for DNA methylation, a chemical modification of cytosine (C) heritable during cell division and the most well-characterized epigenetic mechanism. The remaining CpGs tend to cluster in regions referred to as CpG islands (CGI). Knowing CGI locations is important because they mark functionally relevant epigenetic loci in development and disease. For various mammals, including human, a widely used list of CGI is readily available from the UCSC Genome Browser. This list was derived using algorithms that search for regions satisfying a definition of CGI proposed by Gardiner-Garden and Frommer more than 20 years ago. Recent findings, enabled by advances in technology that permit direct measurement of epigenetic endpoints at a whole-genome scale, motivate the need to adapt the current CGI definition. In this paper, we propose a procedure, guided by hidden Markov models, that permits an extensible approach to detecting CGI. The main advantage of our approach over others is that it summarizes the evidence for CGI status as probability scores. This provides flexibility in the definition of a CGI and facilitates the creation of CGI lists for other species. The utility of this approach is demonstrated by generating the first CGI lists for invertebrates, and by the fact that we can create CGI lists that substantially increase overlap with recently discovered epigenetic marks. A CGI list and the probability scores, as a function of genome location, for each species are available at http://www.rafalab.org.
Affiliation(s)
- Hao Wu
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA
|
40
|
Gao R, Yu J, Zhang M, Tarn TJ, Li JS. Systems theoretic analysis of the central dogma of molecular biology: some recent results. IEEE Trans Nanobioscience 2010; 9:59-70. [PMID: 20123579 DOI: 10.1109/tnb.2010.2041065] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/08/2022]
Abstract
This paper extends our earlier study on a mathematical formulation of the central dogma of molecular biology and focuses on recent insights obtained by employing advanced systems theoretic analysis. The goal of this paper is to mathematically represent and interpret the genetic information flow at the molecular level, and to explore the fundamental principle of molecular biology at the system level. Specifically, group theory was employed to interpret concepts and properties of gene mutation, and to predict the backbone torsion angle along the peptide chain. Finite state machine theory was extensively applied to interpret key concepts and analyze the processes related to DNA hybridization. Using the proposed model, we have transferred the character-based model in molecular biology to a sophisticated mathematical model for calculation and interpretation.
Affiliation(s)
- Rui Gao
  - School of Control Science and Engineering, Shandong University, Jinan 250061, China
|
41
|
Kern AD, Haussler D. A population genetic hidden Markov model for detecting genomic regions under selection. Mol Biol Evol 2010; 27:1673-85. [PMID: 20185453 DOI: 10.1093/molbev/msq053] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Indexed: 11/14/2022] Open
Abstract
Recently, hidden Markov models have been applied to numerous problems in genomics. Here, we introduce an explicit population genetics hidden Markov model (popGenHMM) that uses single nucleotide polymorphism (SNP) frequency data to identify genomic regions that have experienced recent selection. Our popGenHMM assumes that SNP frequencies are emitted independently following diffusion approximation expectations but that neighboring SNP frequencies are partially correlated by selective state. We give results from the training and application of our popGenHMM to a set of early release data from the Drosophila Population Genomics Project (dpgp.org) that consists of approximately 7.8 Mb of resequencing from 32 North American Drosophila melanogaster lines. These results demonstrate the potential utility of our model, making predictions based on the site frequency spectrum (SFS) for regions of the genome that represent selected elements.
Affiliation(s)
- Andrew D Kern
  - Department of Biological Sciences, Dartmouth College, Hanover, NH, USA
|
42
|
Abstract
BACKGROUND Transposons are "jumping genes" that account for large quantities of repetitive content in genomes. They are known to affect transcriptional regulation in several different ways, and are implicated in many human diseases. Transposons are related to microRNAs and viruses, and many genes, pseudogenes, and gene promoters are derived from transposons or have origins in transposon-induced duplication. Modeling transposon-derived genomic content is difficult because it is poorly conserved. Profile hidden Markov models (profile HMMs), widely used for protein sequence family modeling, are rarely used for modeling DNA sequence families. The algorithm commonly used to estimate the parameters of profile HMMs, Baum-Welch, is prone to prematurely converge to local optima. The DNA domain is especially problematic for the Baum-Welch algorithm, since it has only four letters as opposed to the twenty residues of the amino acid alphabet. RESULTS We demonstrate with a simulation study and with an application to modeling the MIR family of transposons that two recently introduced methods, Conditional Baum-Welch and Dynamic Model Surgery, achieve better estimates of the parameters of profile HMMs across a range of conditions. CONCLUSIONS We argue that these new algorithms expand the range of potential applications of profile HMMs to many important DNA sequence family modeling problems, including that of searching for and modeling the virus-like transposons that are found in all known genomes.
Affiliation(s)
- Paul T Edlefsen
  - Department of Statistics, Harvard University, One Oxford Street, Cambridge, MA, USA
- Jun S Liu
  - Department of Statistics, Harvard University, One Oxford Street, Cambridge, MA, USA
|
43
|
Nuel G, Regad L, Martin J, Camproux AC. Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data. Algorithms Mol Biol 2010; 5:15. [PMID: 20205909 PMCID: PMC2828453 DOI: 10.1186/1748-7188-5-15] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 09/22/2009] [Accepted: 01/26/2010] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models. RESULTS The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also avoids any convolution product of the pattern distributions in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in the complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists of concatenating the sequences in the data set into a single sequence.
CONCLUSIONS Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We conclude with a discussion of our method and its potential improvements.
Affiliation(s)
- Gregory Nuel
  - LSG, Laboratoire Statistique et Génome, CNRS UMR-8071, INRA UMR-1152, University of Evry, Evry, France
  - CNRS, Paris, France
  - MAP5, Department of Applied Mathematics, CNRS UMR-8145, University Paris Descartes, Paris, France
- Leslie Regad
  - EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
  - MTi, Molecules Thérapeutique in silico, INSERM UMRS-973, University Paris Diderot, Paris, France
- Juliette Martin
  - EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
  - MIG, Mathématique Informatique et Genome, INRA UR-1077, Jouy-en-Josas, France
  - IBCP, Institut de Biologie et Chimie des Protéines, IFR 128, CNRS UMR 5086, University of Lyon 1, Lyon, France
- Anne-Claude Camproux
  - EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
  - MTi, Molecules Thérapeutique in silico, INSERM UMRS-973, University Paris Diderot, Paris, France
|
44
|
Elhaik E, Graur D, Josic K. Comparative testing of DNA segmentation algorithms using benchmark simulations. Mol Biol Evol 2009; 27:1015-24. [PMID: 20018981 DOI: 10.1093/molbev/msp307] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022] Open
Abstract
Numerous segmentation methods for the detection of compositionally homogeneous domains within genomic sequences have been proposed. Unfortunately, these methods yield inconsistent results. Here, we present a benchmark consisting of two sets of simulated genomic sequences for testing the performances of segmentation algorithms. Sequences in the first set are composed of fixed-sized homogeneous domains, distinct in their between-domain guanine and cytosine (GC) content variability. The sequences in the second set are composed of a mosaic of many short domains and a few long ones, distinguished by sharp GC content boundaries between neighboring domains. We use these sets to test the performance of seven segmentation algorithms in the literature. Our results show that recursive segmentation algorithms based on the Jensen-Shannon divergence outperform all other algorithms. However, even these algorithms perform poorly in certain instances because of the arbitrary choice of a segmentation-stopping criterion.
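One step of a recursive segmentation based on the Jensen-Shannon divergence can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, the divergence is computed on full nucleotide composition rather than on GC content alone, and no stopping criterion is implemented (the abstract notes that this choice is the weak point of such algorithms).

```python
from collections import Counter
from math import log2


def entropy(seq):
    """Shannon entropy (bits) of the symbol composition of `seq`."""
    n = len(seq)
    return -sum((k / n) * log2(k / n) for k in Counter(seq).values())


def js_divergence(seq, i):
    """Length-weighted Jensen-Shannon divergence between seq[:i] and seq[i:]."""
    left, right = seq[:i], seq[i:]
    n = len(seq)
    return (entropy(seq)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))


def best_split(seq):
    """Cut point that maximizes the divergence between the two sides."""
    return max(range(1, len(seq)), key=lambda i: js_divergence(seq, i))
```

A recursive algorithm would apply `best_split` to each resulting segment in turn and stop once the maximal divergence falls below some chosen threshold.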
Affiliation(s)
- Eran Elhaik
  - Department of Biology & Biochemistry, University of Houston, TX, USA
|
45
|
Irizarry RA, Wu H, Feinberg AP. A species-generalized probabilistic model-based definition of CpG islands. Mamm Genome 2009; 20:674-80. [PMID: 19777308 PMCID: PMC2962567 DOI: 10.1007/s00335-009-9222-5] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.5] [Received: 06/17/2009] [Accepted: 08/17/2009] [Indexed: 10/20/2022]
Abstract
The DNA of most vertebrates is depleted in CpG dinucleotides, the target for DNA methylation. The remaining CpGs tend to cluster in regions referred to as CpG islands (CGI). CGI have been useful for marking functionally relevant epigenetic loci in genome studies. For example, CGI are enriched in the promoters of vertebrate genes and thought to play an important role in regulation. Currently, CGI are defined algorithmically as an observed-to-expected ratio (O/E) of CpG greater than 0.6, G+C content greater than 0.5, and usually but not necessarily greater than a certain length. Here we find that the current definition leaves out important CpG clusters associated with epigenetic marks, relevant to development and disease, and does not apply at all to nonvertebrate genomes. We propose an alternative Hidden Markov model-based approach that solves these problems. We fit our model to genomes from 30 species, and the results support a new epigenomic view toward the development of DNA methylation in species diversity and evolution. The O/E of CpG in islands and nonislands segregated closely phylogenetically and showed substantial loss in both groups in animals of greater complexity, while maintaining a nearly constant difference in CpG O/E between islands and nonisland compartments. Lists of CGI for some species are available at http://www.rafalab.org.
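The classic threshold-based definition mentioned in this abstract (CpG O/E greater than 0.6 and G+C content greater than 0.5) can be written as a short Python check. This is a sketch of that older rule, not of the HMM-based approach the article proposes; the function names are hypothetical, the O/E formula used is the common (CpG count × length) / (C count × G count), and the 200 bp default for `min_len` is an assumption, since the abstract only says "a certain length".

```python
def cpg_stats(seq):
    """Return (G+C fraction, CpG observed-to-expected ratio) for `seq`."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def passes_classic_cgi(seq, min_oe=0.6, min_gc=0.5, min_len=200):
    """Threshold rule; min_len=200 is an assumed, adjustable cutoff."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and obs_exp > min_oe and gc_content > min_gc
```

In practice such a rule is applied to sliding windows along a genome; the hard thresholds are what the article's probabilistic scores are meant to replace.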
Affiliation(s)
- Rafael A Irizarry
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, E3620, Baltimore, MD 21205, USA
|
46
|
Bayesian hidden Markov model for DNA sequence segmentation: A prior sensitivity analysis. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2008.07.007] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/17/2022]
|
47
|
Bidargaddi NP, Chetty M, Kamruzzaman J. Hidden Markov models incorporating fuzzy measures and integrals for protein sequence identification and alignment. Genomics Proteomics & Bioinformatics 2009; 6:98-110. [PMID: 18973866 PMCID: PMC5054101 DOI: 10.1016/s1672-0229(08)60025-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 12/02/2022]
Abstract
Profile hidden Markov models (HMMs) based on classical HMMs have been widely applied for protein sequence identification. The formulation of the forward and backward variables in profile HMMs is made under the statistical independence assumption of probability theory. We propose a fuzzy profile HMM to overcome the limitations of that assumption and to achieve an improved alignment for protein sequences belonging to a given family. The proposed model fuzzifies the forward and backward variables by incorporating Sugeno fuzzy measures and Choquet integrals, thus further extending the generalized HMM. Based on the fuzzified forward and backward variables, we propose a fuzzy Baum-Welch parameter estimation algorithm for profiles. The strong correlations and the sequence preference involved in protein structures make this fuzzy-architecture-based model a suitable candidate for building profiles of a given family, since fuzzy sets can handle uncertainties better than classical methods.
|
48
|
Reynolds SM, Käll L, Riffle ME, Bilmes JA, Noble WS. Transmembrane topology and signal peptide prediction using dynamic bayesian networks. PLoS Comput Biol 2008; 4:e1000213. [PMID: 18989393 PMCID: PMC2570248 DOI: 10.1371/journal.pcbi.1000213] [Citation(s) in RCA: 184] [Impact Index Per Article: 11.5] [Received: 06/02/2008] [Accepted: 09/23/2008] [Indexed: 11/19/2022] Open
Abstract
Hidden Markov models (HMMs) have been successfully applied to the tasks of transmembrane protein topology prediction and signal peptide prediction. In this paper we expand upon this work by making use of the more powerful class of dynamic Bayesian networks (DBNs). Our model, Philius, is inspired by a previously published HMM, Phobius, and combines a signal peptide submodel with a transmembrane submodel. We introduce a two-stage DBN decoder that combines the power of posterior decoding with the grammar constraints of Viterbi-style decoding. Philius also provides protein type, segment, and topology confidence metrics to aid in the interpretation of the predictions. We report a relative improvement of 13% over Phobius in full-topology prediction accuracy on transmembrane proteins, and a sensitivity and specificity of 0.96 in detecting signal peptides. We also show that our confidence metrics correlate well with the observed precision. In addition, we have made predictions on all 6.3 million proteins in the Yeast Resource Center (YRC) database. This large-scale study provides an overall picture of the relative numbers of proteins that include a signal-peptide and/or one or more transmembrane segments as well as a valuable resource for the scientific community. All DBNs are implemented using the Graphical Models Toolkit. Source code for the models described here is available at http://noble.gs.washington.edu/proj/philius. A Philius Web server is available at http://www.yeastrc.org/philius, and the predictions on the YRC database are available at http://www.yeastrc.org/pdr.
Affiliation(s)
- Sheila M. Reynolds
  - Department of Electrical Engineering, University of Washington, Seattle, Washington, United States of America
- Lukas Käll
  - Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Michael E. Riffle
  - Department of Biochemistry, University of Washington, Seattle, Washington, United States of America
- Jeff A. Bilmes
  - Department of Electrical Engineering, University of Washington, Seattle, Washington, United States of America
  - Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
  - Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America
|
49
|
Probabilistic phylogenetic inference with insertions and deletions. PLoS Comput Biol 2008; 4:e1000172. [PMID: 18787703 PMCID: PMC2527138 DOI: 10.1371/journal.pcbi.1000172] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.9] [Received: 10/24/2007] [Accepted: 07/31/2008] [Indexed: 11/19/2022] Open
Abstract
A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth-death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program dnaml in phylip. Using standard benchmarking methods on simulated data and a new "concordance test" benchmark on real ribosomal RNA alignments, we show that the extended program dnamlepsilon improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm.
|
50
|
Abstract
Background Profile Hidden Markov Model (HMM) is a powerful statistical model to represent a family of DNA, RNA, and protein sequences. Profile HMM has been widely used in bioinformatics research such as sequence alignment, gene structure prediction, motif identification, protein structure prediction, and biological database search. However, few comprehensive, visual editing tools for profile HMM are publicly available. Results We develop a visual editor for profile Hidden Markov Models (HMMEditor). HMMEditor can visualize the profile HMM architecture, transition probabilities, and emission probabilities. Moreover, it provides functions to edit and save HMM and parameters. Furthermore, HMMEditor allows users to align a sequence against the profile HMM and to visualize the corresponding Viterbi path. Conclusion HMMEditor provides a set of unique functions to visualize and edit a profile HMM. It is a useful tool for biological sequence analysis and modeling. Both HMMEditor software and web service are freely available.
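As a rough illustration of what a Viterbi path is, the sketch below decodes the most probable state sequence for a small generic HMM in log space. It is not HMMEditor code and does not model the match/insert/delete architecture of a profile HMM; the function name, the two-state model (`H` for GC-rich, `L` for AT-rich), and all probabilities are hypothetical.

```python
from math import inf, log


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for `obs` under a simple HMM."""
    def lg(p):
        return log(p) if p > 0 else -inf

    # Best log-score of any path ending in each state after the first symbol.
    score = {s: lg(start_p[s]) + lg(emit_p[s][obs[0]]) for s in states}
    back = []  # one back-pointer table per later position
    for symbol in obs[1:]:
        prev_score, score, pointers = score, {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev_score[r] + lg(trans_p[r][s]))
            score[s] = (prev_score[best_prev] + lg(trans_p[best_prev][s])
                        + lg(emit_p[s][symbol]))
            pointers[s] = best_prev
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical two-state model: H favors G/C, L favors A/T.
STATES = ("H", "L")
START_P = {"H": 0.5, "L": 0.5}
TRANS_P = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
EMIT_P = {
    "H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "L": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
print("".join(viterbi("GGCGAATTA", STATES, START_P, TRANS_P, EMIT_P)))  # HHHHLLLLL
```

Working in log space avoids numerical underflow on long sequences, which is why the products of probabilities are replaced by sums of logarithms.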
Affiliation(s)
- Jianyong Dai
  - School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, USA
|