1
|
De Gregorio J, Sánchez D, Toral R. Entropy Estimators for Markovian Sequences: A Comparative Analysis. Entropy (Basel, Switzerland) 2024; 26:79. [PMID: 38248204 PMCID: PMC11154276 DOI: 10.3390/e26010079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 12/21/2023] [Accepted: 01/16/2024] [Indexed: 01/23/2024]
Abstract
Entropy estimation is a fundamental problem in information theory that has applications in various fields, including physics, biology, and computer science. Estimating the entropy of discrete sequences can be challenging due to limited data and the lack of unbiased estimators. Most existing entropy estimators are designed for sequences of independent events and their performances vary depending on the system being studied and the available data size. In this work, we compare different entropy estimators and their performance when applied to Markovian sequences. Specifically, we analyze both binary Markovian sequences and Markovian systems in the undersampled regime. We calculate the bias, standard deviation, and mean squared error for some of the most widely employed estimators. We discuss the limitations of entropy estimation as a function of the transition probabilities of the Markov processes and the sample size. Overall, this paper provides a comprehensive comparison of entropy estimators and their performance in estimating entropy for systems with memory, which can be useful for researchers and practitioners in various fields.
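To make the estimation problem concrete, the following is a minimal sketch of a plug-in (empirical-frequency) estimate of the entropy rate of a binary first-order Markov sequence, compared against the exact value. It is an illustration only and is not taken from the paper; the function names and the parameters `p01` and `p10` are hypothetical.

```python
import math
import random
from collections import Counter


def markov_entropy_rate(p01, p10):
    """Exact entropy rate (bits/symbol) of a binary first-order Markov chain.

    p01 = P(next=1 | current=0), p10 = P(next=0 | current=1); both names are
    illustrative assumptions.
    """
    def h(p):
        # Binary entropy, with the convention 0 * log(0) = 0.
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    pi0 = p10 / (p01 + p10)  # stationary probability of state 0
    return pi0 * h(p01) + (1 - pi0) * h(p10)


def simulate(p01, p10, n, seed=0):
    """Draw a length-n sample path starting from state 0."""
    rng = random.Random(seed)
    x = [0]
    for _ in range(n - 1):
        if x[-1] == 0:
            x.append(1 if rng.random() < p01 else 0)
        else:
            x.append(0 if rng.random() < p10 else 1)
    return x


def plugin_entropy_rate(seq):
    """Plug-in estimate of H(X_t | X_{t-1}) from empirical transition counts."""
    pairs = Counter(zip(seq, seq[1:]))
    firsts = Counter(seq[:-1])
    n = len(seq) - 1
    est = 0.0
    for (a, _b), c in pairs.items():
        est -= (c / n) * math.log2(c / firsts[a])
    return est
```

For long sequences the plug-in value approaches the exact rate; for short sequences it tends to underestimate it, which is the regime where the choice of estimator matters.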
Affiliation(s)
- David Sánchez
- Institute for Cross-Disciplinary Physics and Complex Systems IFISC (UIB-CSIC), Campus Universitat de les Illes Balears, E-07122 Palma de Mallorca, Spain; (J.D.G.); (R.T.)
|
2
|
Usatenko OV, Melnyk SS, Pritula GM, Yampol'skii VA. Information entropy and temperature of binary Markov chains. Phys Rev E 2022; 106:034127. [PMID: 36266815 DOI: 10.1103/physreve.106.034127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2022] [Accepted: 08/17/2022] [Indexed: 06/16/2023]
Abstract
We propose two different approaches for introducing the information temperature of binary Nth-order Markov chains. The first approach is based on a comparison of Markov sequences with equilibrium Ising chains at given temperatures. The second approach uses probabilities of finite-length subsequences of symbols occurring, which determine their entropies. The derivative of the entropy with respect to the energy gives the information temperature measured on the scale of introduced energy. For the case of a nearest-neighbor spin-symbol interaction, both approaches give similar results. However, the method based on the correspondence of the N-step Markov and Ising chains appears to be very cumbersome for N>3. We also introduce the information temperature for the weakly correlated one-parametric Markov chains and present results for the stepwise and power memory functions. An application of the developed method to obtain the information temperature of some literary texts is given.
Affiliation(s)
- O V Usatenko
- A. Ya. Usikov Institute for Radiophysics and Electronics NASU, 61085 Kharkov, Ukraine
- S S Melnyk
- A. Ya. Usikov Institute for Radiophysics and Electronics NASU, 61085 Kharkov, Ukraine
- G M Pritula
- A. Ya. Usikov Institute for Radiophysics and Electronics NASU, 61085 Kharkov, Ukraine
- V A Yampol'skii
- A. Ya. Usikov Institute for Radiophysics and Electronics NASU, 61085 Kharkov, Ukraine
- V. N. Karazin Kharkov National University, 61077 Kharkov, Ukraine
|
3
|
Usatenko OV, Melnyk SS, Pritula GM. Correlation function inadequacy in random-sequence entropy measures. Phys Rev E 2020; 102:022119. [PMID: 32942436 DOI: 10.1103/physreve.102.022119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2020] [Accepted: 07/20/2020] [Indexed: 06/11/2023]
Abstract
Considering symbolic and numerical random sequences in the framework of the additive Markov chain approach, we establish a relation between their correlation functions and conditional entropies. We express the entropy by means of the two-point probability distribution functions and then evaluate the entropy for the numerical random chain in terms of the correlation function. We show that such an approximation gives a satisfactory result only for special types of random sequences. In the general case, the conditional entropy of numerical sequences obtained in the two-point distribution function approach is lower. We derive the conditional entropy of the additive Markov chain as a sum of the Kullback-Leibler mutual information and give an example of a random sequence with an exactly zero correlation function and nonzero correlations.
Affiliation(s)
- O V Usatenko
- O. Ya. Usikov Institute for Radiophysics and Electronics of the National Academy of Sciences of Ukraine, 12 Proskura Street, 61805 Kharkiv, Ukraine
- S S Melnyk
- O. Ya. Usikov Institute for Radiophysics and Electronics of the National Academy of Sciences of Ukraine, 12 Proskura Street, 61805 Kharkiv, Ukraine
- G M Pritula
- O. Ya. Usikov Institute for Radiophysics and Electronics of the National Academy of Sciences of Ukraine, 12 Proskura Street, 61805 Kharkiv, Ukraine
|
4
|
Zakrzewski F, Gieldon L, Rump A, Seifert M, Grützmann K, Krüger A, Loos S, Zeugner S, Hackmann K, Porrmann J, Wagner J, Kast K, Wimberger P, Baretton G, Schröck E, Aust D, Klink B. Targeted capture-based NGS is superior to multiplex PCR-based NGS for hereditary BRCA1 and BRCA2 gene analysis in FFPE tumor samples. BMC Cancer 2019; 19:396. [PMID: 31029168 PMCID: PMC6487025 DOI: 10.1186/s12885-019-5584-6] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 04/05/2018] [Accepted: 04/05/2019] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND With the introduction of Olaparib treatment for BRCA-deficient recurrent ovarian cancer, testing for somatic and/or germline mutations in BRCA1/2 genes in tumor tissues became essential for treatment decisions. In most cases only formalin-fixed paraffin-embedded (FFPE) samples, containing fragmented and chemically modified DNA of minor quality, are available. Thus, multiplex PCR-based sequencing is most commonly applied in routine molecular testing, which is predominantly focused on the identification of known hot spot mutations in oncogenes. METHODS We compared the overall performance of an adjusted targeted capture-based enrichment protocol and a multiplex PCR-based approach for calling of pathogenic SNVs and InDels using DNA extracted from 13 FFPE tissue samples. We further applied both strategies to seven blood samples and five matched FFPE tumor tissues of patients with known germline exon-spanning deletions and gene-wide duplications in BRCA1/2 to evaluate CNV detection based solely on panel NGS data. Finally, we analyzed DNA from FFPE tissues of 11 index patients from families suspected of having hereditary breast and ovarian cancer, of whom no blood samples were available for testing, in order to identify underlying pathogenic germline BRCA1/2 mutations. RESULTS The multiplex PCR-based protocol produced inhomogeneous coverage among targets of each sample and between samples as well as sporadic amplicon drop out, leading to insufficiently or non-covered nucleotides, which subsequently hindered variant detection. This protocol further led to detection of PCR-artifacts that could easily have been misinterpreted as pathogenic mutations. No such limitations were observed by application of an adjusted targeted capture-based protocol, which allowed for CNV calling with 86% sensitivity and 100% specificity. 
All pathogenic CNVs were confirmed in the five matched FFPE tumor samples from patients carrying known pathogenic germline mutations, and we additionally identified somatic loss of the second allele in BRCA1/2. Furthermore, we detected pathogenic BRCA1/2 variants in four of the eleven FFPE samples from patients for whom no blood was available for analysis. CONCLUSIONS We demonstrate that an adjusted targeted capture-based enrichment protocol is superior to commonly applied multiplex PCR-based protocols for reliable BRCA1/2 variant detection, including CNV detection, using FFPE tumor samples.
Affiliation(s)
- Falk Zakrzewski
- Core Unit for Molecular Tumor Diagnostics (CMTD), National Center for Tumor Diseases (NCT), Schubertstraße 15, 01307 Dresden, Germany
- Laura Gieldon
- Core Unit for Molecular Tumor Diagnostics (CMTD), National Center for Tumor Diseases (NCT), Schubertstraße 15, 01307 Dresden, Germany
- Institute for Clinical Genetics, Medical Faculty Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Andreas Rump
- Institute for Clinical Genetics, Medical Faculty Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Michael Seifert
- Institute for Medical Informatics and Biometry (IMB), Medical Faculty Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- National Center for Tumor Diseases (NCT), Dresden, Germany
- Konrad Grützmann
- Core Unit for Molecular Tumor Diagnostics (CMTD), National Center for Tumor Diseases (NCT), Schubertstraße 15, 01307 Dresden, Germany
- Alexander Krüger
- Core Unit for Molecular Tumor Diagnostics (CMTD), National Center for Tumor Diseases (NCT), Schubertstraße 15, 01307 Dresden, Germany
- Sina Loos
- Institute of Pathology, University Hospital Carl Gustav Carus Dresden, Dresden, Germany
- Silke Zeugner
- Institute of Pathology, University Hospital Carl Gustav Carus Dresden, Dresden, Germany
- Karl Hackmann
- Institute for Clinical Genetics, Medical Faculty Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Joseph Porrmann
- Institute for Clinical Genetics, Medical Faculty Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Johannes Wagner
- Institute for Clinical Genetics, Medical Faculty Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Karin Kast
- National Center for Tumor Diseases (NCT), Dresden, Germany
- German Cancer Consortium (DKTK), Dresden, Germany
- Department of Gynecology and Obstetrics, University Hospital Carl Gustav Carus Dresden, TU Dresden, Dresden, Germany
- Pauline Wimberger
- National Center for Tumor Diseases (NCT), Dresden, Germany
- German Cancer Consortium (DKTK), Dresden, Germany
- Department of Gynecology and Obstetrics, University Hospital Carl Gustav Carus Dresden, TU Dresden, Dresden, Germany
- Gustavo Baretton
- Core Unit for Molecular Tumor Diagnostics (CMTD), National Center for Tumor Diseases (NCT), Schubertstraße 15, 01307 Dresden, Germany
- National Center for Tumor Diseases (NCT), Dresden, Germany
- Institute of Pathology, University Hospital Carl Gustav Carus Dresden, Dresden, Germany
- German Cancer Consortium (DKTK), Dresden, Germany
- Tumor- and Normal Tissue Bank of the University Cancer Center (UCC), University Hospital Carl Gustav Carus Dresden, Technische Universität Dresden, National Center for Tumor Diseases (NCT) Dresden, Dresden, Germany
- Evelin Schröck
- Core Unit for Molecular Tumor Diagnostics (CMTD), National Center for Tumor Diseases (NCT), Schubertstraße 15, 01307 Dresden, Germany
- Institute for Clinical Genetics, Medical Faculty Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- National Center for Tumor Diseases (NCT), Dresden, Germany
- German Cancer Consortium (DKTK), Dresden, Germany
- Daniela Aust
- Core Unit for Molecular Tumor Diagnostics (CMTD), National Center for Tumor Diseases (NCT), Schubertstraße 15, 01307 Dresden, Germany
- National Center for Tumor Diseases (NCT), Dresden, Germany
- Institute of Pathology, University Hospital Carl Gustav Carus Dresden, Dresden, Germany
- German Cancer Consortium (DKTK), Dresden, Germany
- Tumor- and Normal Tissue Bank of the University Cancer Center (UCC), University Hospital Carl Gustav Carus Dresden, Technische Universität Dresden, National Center for Tumor Diseases (NCT) Dresden, Dresden, Germany
- Barbara Klink
- Core Unit for Molecular Tumor Diagnostics (CMTD), National Center for Tumor Diseases (NCT), Schubertstraße 15, 01307 Dresden, Germany
- Institute for Clinical Genetics, Medical Faculty Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- National Center for Tumor Diseases (NCT), Dresden, Germany
- German Cancer Consortium (DKTK), Dresden, Germany
|
5
|
Eggeling R, Grosse I, Koivisto M. Algorithms for learning parsimonious context trees. Mach Learn 2018. [DOI: 10.1007/s10994-018-5770-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
|
6
|
Tamposis IA, Theodoropoulou MC, Tsirigos KD, Bagos PG. Extending hidden Markov models to allow conditioning on previous observations. J Bioinform Comput Biol 2018; 16:1850019. [DOI: 10.1142/s0219720018500191] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/18/2022]
Abstract
Hidden Markov Models (HMMs) are probabilistic models widely used in computational molecular biology. However, the Markovian assumption regarding transition probabilities, which dictates that the observed symbol depends only on the current state, may not be sufficient for some biological problems. In order to overcome the limitations of the first-order HMM, a number of extensions have been proposed in the literature to incorporate past information in HMMs, conditioning either on the hidden states, or on the observations, or both. Here, we implement a simple extension of the standard HMM in which the current observed symbol (amino acid residue) depends both on the current state and on a series of observed previous symbols. The major advantage of the method is the simplicity of the implementation, which is achieved by properly transforming the observation sequence, using an extended alphabet. Thus, it can utilize all the available algorithms for the training and decoding of HMMs. We investigated the use of several encoding schemes and performed tests on a number of important biological problems previously studied by our team (prediction of transmembrane proteins and prediction of signal peptides). The evaluation shows that, when enough data are available, the performance increased by 1.8%–8.2% and the existing prediction methods may improve using this approach. The methods for which the improvement was significant (PRED-TMBB2, PRED-TAT and HMM-TM) are available as web-servers freely accessible to academic users at www.compgen.org/tools/.
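The idea of transforming the observation sequence with an extended alphabet can be sketched as follows. This is only one possible encoding, assumed here for illustration (the abstract does not specify the encoding schemes the authors tested); the function names, the context length `k`, and the padding symbol `^` are hypothetical.

```python
def extend_alphabet(seq, k=1, pad="^"):
    """Replace each symbol by a composite symbol (previous k symbols + current).

    The composite symbols form an extended alphabet, so a standard HMM trained
    on the transformed sequence conditions each emission on k previous
    observations. `pad` fills the missing context at the start (an assumption).
    """
    padded = [pad] * k + list(seq)
    return [tuple(padded[i:i + k + 1]) for i in range(len(seq))]


def index_symbols(extended):
    """Map composite symbols to integer ids, in order of first appearance."""
    ids = {}
    encoded = []
    for sym in extended:
        if sym not in ids:
            ids[sym] = len(ids)
        encoded.append(ids[sym])
    return encoded, ids


# Usage: a short amino-acid string with one symbol of context.
extended = extend_alphabet("ACDA", k=1)
encoded, ids = index_symbols(extended)
```

The transformed sequence has the same length as the original, which is why existing training and decoding algorithms can be reused unchanged; the cost is an alphabet that grows exponentially with `k`.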
Affiliation(s)
- Ioannis A. Tamposis
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Papasiopoulou 2-4, 35100 Lamia, Greece
- Margarita C. Theodoropoulou
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Papasiopoulou 2-4, 35100 Lamia, Greece
- Konstantinos D. Tsirigos
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Papasiopoulou 2-4, 35100 Lamia, Greece
- Pantelis G. Bagos
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Papasiopoulou 2-4, 35100 Lamia, Greece
|
7
|
Melnik SS, Usatenko OV. Decomposition of conditional probability for high-order symbolic Markov chains. Phys Rev E 2018; 96:012158. [PMID: 29347267 DOI: 10.1103/physreve.96.012158] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 03/30/2017] [Indexed: 11/07/2022]
Abstract
The main goal of this paper is to develop an estimate for the conditional probability function of random stationary ergodic symbolic sequences with elements belonging to a finite alphabet. We elaborate on a decomposition procedure for the conditional probability function of sequences considered to be high-order Markov chains. We represent the conditional probability function as the sum of multilinear memory function monomials of different orders (from zero up to the chain order). This allows us to introduce a family of Markov chain models and to construct artificial sequences via a method of successive iterations, taking into account at each step increasingly high correlations among random elements. At weak correlations, the memory functions are uniquely expressed in terms of the high-order symbolic correlation functions. The proposed method fills the gap between two approaches, namely the likelihood estimation and the additive Markov chains. The obtained results may have applications for sequential approximation of artificial neural network training.
Affiliation(s)
- S S Melnik
- A. Ya. Usikov Institute for Radiophysics and Electronics Ukrainian Academy of Science, 12 Proskura Street, 61805 Kharkov, Ukraine
- O V Usatenko
- A. Ya. Usikov Institute for Radiophysics and Electronics Ukrainian Academy of Science, 12 Proskura Street, 61805 Kharkov, Ukraine
|
8
|
Zuanetti DA, Milan LA. Second-order autoregressive Hidden Markov Model. Braz J Probab Stat 2017. [DOI: 10.1214/16-bjps328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
9
|
Melnik SS, Usatenko OV. Entropy and long-range memory in random symbolic additive Markov chains. Phys Rev E 2016; 93:062144. [PMID: 27415245 DOI: 10.1103/physreve.93.062144] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 12/14/2015] [Indexed: 11/07/2022]
Abstract
The goal of this paper is to develop an estimate for the entropy of random symbolic sequences with elements belonging to a finite alphabet. As a plausible model, we use the high-order additive stationary ergodic Markov chain with long-range memory. Supposing that the correlations between random elements of the chain are weak, we express the conditional entropy of the sequence by means of the symbolic pair correlation function. We also examine an algorithm for estimating the conditional entropy of finite symbolic sequences. We show that the entropy contains two contributions, i.e., the correlation and the fluctuation. The obtained analytical results are used for numerical evaluation of the entropy of written English texts and DNA nucleotide sequences. The developed theory opens the way for constructing a more consistent and sophisticated approach to describe the systems with strong short-range and weak long-range memory.
Affiliation(s)
- S S Melnik
- A. Ya. Usikov Institute for Radiophysics and Electronics Ukrainian Academy of Science, 12 Proskura Street, 61805 Kharkov, Ukraine
- O V Usatenko
- A. Ya. Usikov Institute for Radiophysics and Electronics Ukrainian Academy of Science, 12 Proskura Street, 61805 Kharkov, Ukraine
|
10
|
Abstract
We analyze the structure of DNA molecules of different organisms by using the additive Markov chain approach. Transforming nucleotide sequences into binary strings, we perform statistical analysis of the corresponding "texts". We develop the theory of N-step additive binary stationary ergodic Markov chains and analyze their differential entropy. Supposing that the correlations are weak, we express the conditional probability function of the chain by means of the pair correlation function and represent the entropy as a functional of the pair correlator. Since the model uses two-point correlators instead of the probabilities of block occurrence, it makes it possible to calculate the entropy of subsequences at much longer distances than with the standard methods. We use the obtained analytical result for numerical evaluation of the entropy of coarse-grained DNA texts. We believe that the entropy study can be used for biological classification of living species.
Affiliation(s)
- S S Melnik
- A. Ya. Usikov Institute for Radiophysics and Electronics, Ukrainian Academy of Science, 12 Proskura Street, 61805 Kharkov, Ukraine.
- O V Usatenko
- A. Ya. Usikov Institute for Radiophysics and Electronics, Ukrainian Academy of Science, 12 Proskura Street, 61805 Kharkov, Ukraine.
|
11
|
Seifert M, Abou-El-Ardat K, Friedrich B, Klink B, Deutsch A. Autoregressive higher-order hidden Markov models: exploiting local chromosomal dependencies in the analysis of tumor expression profiles. PLoS One 2014; 9:e100295. [PMID: 24955771 PMCID: PMC4067306 DOI: 10.1371/journal.pone.0100295] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 04/17/2014] [Accepted: 05/22/2014] [Indexed: 12/21/2022] Open
Abstract
Changes in gene expression programs play a central role in cancer. Chromosomal aberrations such as deletions, duplications and translocations of DNA segments can lead to highly significant positive correlations of gene expression levels of neighboring genes. This should be utilized to improve the analysis of tumor expression profiles. Here, we develop a novel model class of autoregressive higher-order Hidden Markov Models (HMMs) that carefully exploit local data-dependent chromosomal dependencies to improve the identification of differentially expressed genes in tumor. Autoregressive higher-order HMMs overcome generally existing limitations of standard first-order HMMs in the modeling of dependencies between genes in close chromosomal proximity by the simultaneous usage of higher-order state-transitions and autoregressive emissions as novel model features. We apply autoregressive higher-order HMMs to the analysis of breast cancer and glioma gene expression data and perform in-depth model evaluation studies. We find that autoregressive higher-order HMMs clearly improve the identification of overexpressed genes with underlying gene copy number duplications in breast cancer in comparison to mixture models, standard first- and higher-order HMMs, and other related methods. The performance benefit is attributed to the simultaneous usage of higher-order state-transitions in combination with autoregressive emissions. This benefit could not be reached by using each of these two features independently. We also find that autoregressive higher-order HMMs are better able to identify differentially expressed genes in tumors independent of the underlying gene copy number status in comparison to the majority of related methods. 
This is further supported by the identification of well-known and of previously unreported hotspots of differential expression in glioblastomas demonstrating the efficacy of autoregressive higher-order HMMs for the analysis of individual tumor expression profiles. Moreover, we reveal interesting novel details of systematic alterations of gene expression levels in known cancer signaling pathways distinguishing oligodendrogliomas, astrocytomas and glioblastomas. An implementation is available under www.jstacs.de/index.php/ARHMM.
Affiliation(s)
- Michael Seifert
- Center for Information Services and High Performance Computing, Dresden University of Technology, Dresden, Germany
- Khalil Abou-El-Ardat
- Institute for Clinical Genetics, Faculty of Medicine Carl Gustav Carus, Dresden University of Technology, Dresden, Germany
- Betty Friedrich
- Center for Information Services and High Performance Computing, Dresden University of Technology, Dresden, Germany
- Barbara Klink
- Institute for Clinical Genetics, Faculty of Medicine Carl Gustav Carus, Dresden University of Technology, Dresden, Germany
- Andreas Deutsch
- Center for Information Services and High Performance Computing, Dresden University of Technology, Dresden, Germany
|
12
|
Bulla I, Schultz AK, Chesneau C, Mark T, Serea F. A model-based information sharing protocol for profile Hidden Markov Models used for HIV-1 recombination detection. BMC Bioinformatics 2014; 15:205. [PMID: 24946781 PMCID: PMC4230192 DOI: 10.1186/1471-2105-15-205] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 11/19/2013] [Accepted: 06/04/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In many applications, a family of nucleotide or protein sequences classified into several subfamilies has to be modeled. Profile Hidden Markov Models (pHMMs) are widely used for this task, modeling each subfamily separately by one pHMM. However, a major drawback of this approach is the difficulty of dealing with subfamilies composed of very few sequences. One of the most crucial bioinformatical tasks affected by the problem of small-size subfamilies is the subtyping of human immunodeficiency virus type 1 (HIV-1) sequences, i.e., HIV-1 subtypes for which only a small number of sequences is known. RESULTS To deal with small samples for particular subfamilies of HIV-1, we introduce a novel model-based information sharing protocol. It estimates the emission probabilities of the pHMM modeling a particular subfamily based not only on the nucleotide frequencies of the respective subfamily but also on the nucleotide frequencies of all available subfamilies. To this end, the underlying probabilistic model mimics the pattern of commonality and variation between the subtypes with regard to the biological characteristics of HI viruses. In order to implement the proposed protocol, we make use of an existing HMM architecture and its associated inference engine. CONCLUSIONS We apply the modified algorithm to classify HIV-1 sequence data in the form of partial HIV-1 sequences and semi-artificial recombinants. Thereby, we demonstrate that the performance of pHMMs can be significantly improved by the proposed technique. Moreover, we show that our algorithm performs significantly better than Simplot and Bootscanning.
Affiliation(s)
- Ingo Bulla
- Institut für Mathematik und Informatik, Universität Greifswald, Walther-Rathenau-Straße 47, 17487 Greifswald, Germany.
|
13
|
Aliyu OM, Seifert M, Corral JM, Fuchs J, Sharbel TF. Copy number variation in transcriptionally active regions of sexual and apomictic Boechera demonstrates independently derived apomictic lineages. The Plant Cell 2013; 25:3808-23. [PMID: 24170129 PMCID: PMC3877827 DOI: 10.1105/tpc.113.113860] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 05/21/2013] [Revised: 09/11/2013] [Accepted: 10/15/2013] [Indexed: 05/19/2023]
Abstract
In asexual (apomictic) plants, the absence of meiosis and sex is expected to lead to mutation accumulation. To compare mutation accumulation in the transcribed genomic regions of sexual and apomictic plants, we performed a double-validated analysis of copy number variation (CNV) on 10 biological replicates each of diploid sexual and diploid apomictic Boechera, using a high-density (>700 K) custom microarray. The Boechera genome demonstrated higher levels of depleted CNV, compared with enriched CNV, irrespective of reproductive mode. Genome-wide patterns of CNV revealed four divergent lineages, three of which contain both sexual and apomictic genotypes. Hence, genome-wide CNV reflects at least three independent origins (i.e., expression) of apomixis from different sexual genetic backgrounds. CNV distributions for different families of transposable elements were lineage specific, and the enrichment of LINE/L1 and long terminal repeat/Copia elements in lineage 3 apomicts is consistent with sex and meiosis being mechanisms for purging genomic parasites. We hypothesize that significant overrepresentation of specific gene ontology classes (e.g., pollen-pistil interaction) in apomicts implies that gene enrichment could be an adaptive mechanism for genome stability in diploid apomicts by providing a polyploid-like system for buffering the effects of deleterious mutations.
Affiliation(s)
- Olawale M. Aliyu
- Apomixis Research Group, Leibniz Institute of Plant Genetics and Crop Plant Research, D-06466 Gatersleben, Germany
- Michael Seifert
- Data Inspection Research Group, Leibniz Institute of Plant Genetics and Crop Plant Research, D-06466 Gatersleben, Germany
- Cellular Networks and Systems Biology, Biotechnology Center of the Technical University Dresden, D-01307 Dresden, Germany
- Innovative Methods of Computing, Center for Information Services and High Performance Computing, Technical University Dresden, D-01187 Dresden, Germany
- José M. Corral
- Apomixis Research Group, Leibniz Institute of Plant Genetics and Crop Plant Research, D-06466 Gatersleben, Germany
- Joerg Fuchs
- Karyotype Evolution Research Group, Leibniz Institute of Plant Genetics and Crop Plant Research, D-06466 Gatersleben, Germany
- Timothy F. Sharbel
- Apomixis Research Group, Leibniz Institute of Plant Genetics and Crop Plant Research, D-06466 Gatersleben, Germany
|
14
|
Nuclear retention of the transcription factor NLP7 orchestrates the early response to nitrate in plants. Nat Commun 2013; 4:1713. [DOI: 10.1038/ncomms2650] [Citation(s) in RCA: 309] [Impact Index Per Article: 28.1] [Received: 10/08/2012] [Accepted: 02/26/2013] [Indexed: 11/08/2022] Open
|
15
|
Hidden Markov Models for Real-Time Estimation of Corn Progress Stages Using MODIS and Meteorological Data. Remote Sensing 2013. [DOI: 10.3390/rs5041734] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 11/16/2022]
|
16
|
Scharpf RB, Beaty TH, Schwender H, Younkin SG, Scott AF, Ruczinski I. Fast detection of de novo copy number variants from SNP arrays for case-parent trios. BMC Bioinformatics 2012; 13:330. [PMID: 23234608 PMCID: PMC3576329 DOI: 10.1186/1471-2105-13-330] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 12/15/2011] [Accepted: 12/07/2012] [Indexed: 11/10/2022] Open
Abstract
Background In studies of case-parent trios, we define copy number variants (CNVs) in the offspring that differ from the parental copy numbers as de novo and of interest for their potential functional role in disease. Among the leading array-based methods for discovery of de novo CNVs in case-parent trios is the joint hidden Markov model (HMM) implemented in the PennCNV software. However, the computational demands of the joint HMM are substantial and the extent to which false positive identifications occur in case-parent trios has not been well described. We evaluate these issues in a study of oral cleft case-parent trios. Results Our analysis of the oral cleft trios reveals that genomic waves represent a substantial source of false positive identifications in the joint HMM, despite a wave-correction implementation in PennCNV. In addition, the noise of low-level summaries of relative copy number (log R ratios) is strongly associated with batch and correlated with the frequency of de novo CNV calls. Exploiting the trio design, we propose a univariate statistic for relative copy number referred to as the minimum distance that can reduce technical variation from probe effects and genomic waves. We use circular binary segmentation to segment the minimum distance and maximum a posteriori estimation to infer de novo CNVs from the segmented genome. Compared to PennCNV on simulated data, MinimumDistance identifies fewer false positives on average and is comparable to PennCNV with respect to false negatives. Genomic waves contribute to discordance of PennCNV and MinimumDistance for high coverage de novo calls, while highly concordant calls on chromosome 22 were validated by quantitative PCR. Computationally, MinimumDistance provides a nearly 8-fold increase in speed relative to the joint HMM in a study of oral cleft trios. 
Conclusions Our results indicate that batch effects and genomic waves are important considerations for case-parent studies of de novo CNV, and that the minimum distance is an effective statistic for reducing technical variation contributing to false de novo discoveries. Coupled with segmentation and maximum a posteriori estimation, our algorithm compares favorably to the joint HMM with MinimumDistance being much faster.
Affiliation(s)
- Robert B Scharpf
- Department of Oncology, Johns Hopkins University, Baltimore, MD, USA.
|
17
|
Seifert M, Cortijo S, Colomé-Tatché M, Johannes F, Roudier F, Colot V. MeDIP-HMM: genome-wide identification of distinct DNA methylation states from high-density tiling arrays. Bioinformatics 2012; 28:2930-9. [DOI: 10.1093/bioinformatics/bts562] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Indexed: 12/16/2022] Open
|