1
|
Brejová B, Gagie T, Herencsárová E, Vinař T. Maximum-scoring path sets on pangenome graphs of constant treewidth. FRONTIERS IN BIOINFORMATICS 2024; 4:1391086. [PMID: 39011297 PMCID: PMC11246863 DOI: 10.3389/fbinf.2024.1391086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2024] [Accepted: 06/03/2024] [Indexed: 07/17/2024] Open
Abstract
We generalize a problem of finding maximum-scoring segment sets, previously studied by Csűrös (IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004, 1, 139-150), from sequences to graphs. Namely, given a vertex-weighted graph G and a non-negative startup penalty c, we can find a set of vertex-disjoint paths in G with maximum total score when each path's score is its vertices' total weight minus c. We call this new problem maximum-scoring path sets (MSPS). We present an algorithm that has a linear-time complexity for graphs with a constant treewidth. Generalization from sequences to graphs allows the algorithm to be used on pangenome graphs representing several related genomes and can be seen as a common abstraction for several biological problems on pangenomes, including searching for CpG islands, ChIP-seq data analysis, analysis of region enrichment for functional elements, or simple chaining problems.
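The sequence version of the problem (the setting studied by Csűrös, before the generalization to graphs) can be illustrated with a short dynamic program. The sketch below is not the treewidth-based algorithm of the paper; it only handles a linear sequence of weights, and the function name `max_scoring_segments` is a hypothetical choice made for this illustration.

```python
def max_scoring_segments(weights, c):
    """Best total score of disjoint segments of `weights`, where each
    segment scores the sum of its weights minus the startup penalty c >= 0.

    Illustrative sketch for sequences only (a path graph), not the
    constant-treewidth graph algorithm described in the abstract.
    """
    closed = 0                  # best score when the last position is outside any segment
    open_ = float("-inf")       # best score when the last position ends a segment
    for w in weights:
        # Either extend the current segment or pay c to start a new one here.
        new_open = max(open_, closed - c) + w
        # Either stay outside a segment or close the one that ended just before.
        new_closed = max(closed, open_)
        closed, open_ = new_closed, new_open
    return max(closed, open_)


print(max_scoring_segments([2, -1, 3, -5, 4], 1))  # 6
```

With penalty 1, the optimum takes the segments `[2, -1, 3]` and `[4]`, each scoring 3. A larger penalty favors fewer, longer segments, and a penalty of 0 simply sums the positive weights.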
Affiliation(s)
- Broňa Brejová
  - Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
- Travis Gagie
  - Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
- Eva Herencsárová
  - Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
- Tomáš Vinař
  - Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
|
2
|
Peters TJ, Buckley MJ, Chen Y, Smyth GK, Goodnow CC, Clark SJ. Calling differentially methylated regions from whole genome bisulphite sequencing with DMRcate. Nucleic Acids Res 2021; 49:e109. [PMID: 34320181 PMCID: PMC8565305 DOI: 10.1093/nar/gkab637] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 10/07/2020] [Revised: 05/31/2021] [Accepted: 07/19/2021] [Indexed: 11/12/2022] Open
Abstract
Whole genome bisulphite sequencing (WGBS) permits the genome-wide study of single molecule methylation patterns. One of the key goals of mammalian cell-type identity studies, in both normal differentiation and disease, is to locate differential methylation patterns across the genome. We discuss the most desirable characteristics for DML (differentially methylated locus) and DMR (differentially methylated region) detection tools in a genome-wide context and choose a set of statistical methods that fully or partially satisfy these considerations to compare for benchmarking. Our data simulation strategy is both biologically informed, employing distribution parameters derived from large-scale consortium datasets, and thorough. We report DML detection ability with respect to coverage, group methylation difference, sample size, variability and covariate size, both marginally and jointly, and exhaustively with respect to parameter combination. We also benchmark these methods on FDR control and computational time. We use this result to backend and introduce an expanded version of DMRcate: an existing DMR detection tool for microarray data that we have extended to now call DMRs from WGBS data. We compare DMRcate to a set of alternative DMR callers using a similarly realistic simulation strategy. We find DMRcate and RADmeth are the best predictors of DMRs, and conclusively find DMRcate the fastest.
Affiliation(s)
- Timothy J Peters
  - The Garvan Institute of Medical Research, 384 Victoria St, Darlinghurst, NSW 2010, Australia; UNSW Sydney, Sydney 2052, Australia
- Michael J Buckley
  - The Garvan Institute of Medical Research, 384 Victoria St, Darlinghurst, NSW 2010, Australia; UNSW Sydney, Sydney 2052, Australia
- Yunshun Chen
  - The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC 3052, Australia; Department of Medical Biology, The University of Melbourne, Melbourne, VIC 3010, Australia
- Gordon K Smyth
  - The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC 3052, Australia; School of Mathematics and Statistics, The University of Melbourne, Melbourne, VIC 3010, Australia
- Christopher C Goodnow
  - The Garvan Institute of Medical Research, 384 Victoria St, Darlinghurst, NSW 2010, Australia; School of Medical Sciences and Cellular Genomics Futures Institute, UNSW Sydney, NSW 2052, Australia
- Susan J Clark
  - The Garvan Institute of Medical Research, 384 Victoria St, Darlinghurst, NSW 2010, Australia; St. Vincent's Clinical School, Faculty of Medicine, UNSW Sydney, NSW 2010, Australia
|
3
|
Li W, Freudenberg J, Freudenberg J. Alignment-free approaches for predicting novel Nuclear Mitochondrial Segments (NUMTs) in the human genome. Gene 2019; 691:141-152. [PMID: 30630097 DOI: 10.1016/j.gene.2018.12.040] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 09/05/2018] [Revised: 12/07/2018] [Accepted: 12/14/2018] [Indexed: 10/27/2022]
Abstract
The nuclear human genome harbors sequences of mitochondrial origin, indicating an ancestral transfer of DNA from the mitogenome. Several Nuclear Mitochondrial Segments (NUMTs) have been detected by alignment-based sequence similarity search, as implemented in the Basic Local Alignment Search Tool (BLAST). Identifying NUMTs is important for the comprehensive annotation and understanding of the human genome. Here we explore the possibility of detecting NUMTs in the human genome by alignment-free sequence similarity search, such as k-mers (k-tuples, k-grams, oligos of length k) distributions. We find that when k=6 or larger, the k-mer approach and BLAST search produce almost identical results, e.g., detect the same set of NUMTs longer than 3 kb. However, when k=5 or k=4, certain signals are only detected by the alignment-free approach, and these may indicate yet unrecognized, and potentially more ancestral NUMTs. We introduce a "Manhattan plot" style representation of NUMT predictions across the genome, which are calculated based on the reciprocal of the Jensen-Shannon divergence between the nuclear and mitochondrial k-mer frequencies. The further inspection of the k-mer-based NUMT predictions however shows that most of them contain long-terminal-repeat (LTR) annotations, whereas BLAST-based NUMT predictions do not. Thus, similarity of the mitogenome to LTR sequences is recognized, which we validate by finding the mitochondrial k-mer distribution closer to those for transposable sequences and specifically, close to some types of LTR.
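The similarity measure named in this abstract, the Jensen-Shannon divergence between two k-mer frequency distributions, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and windowing of the nuclear genome, strand handling, and the exact scoring (the abstract uses the reciprocal of the divergence) are left out.

```python
from collections import Counter
from math import log2


def kmer_freqs(seq, k):
    """Relative frequencies of overlapping k-mers in `seq`."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) of two
    frequency dictionaries; missing keys count as frequency 0."""
    d = 0.0
    for key in set(p) | set(q):
        pi, qi = p.get(key, 0.0), q.get(key, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            d += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            d += 0.5 * qi * log2(qi / mi)
    return d


# Hypothetical toy sequences, used only to show the call pattern.
window = kmer_freqs("ACGTACGTAC", 2)
reference = kmer_freqs("ACGTTTGCAC", 2)
print(js_divergence(window, reference))
```

A divergence of 0 means identical k-mer distributions and 1 means the two sequences share no k-mers, so a reciprocal-based score grows as a window becomes more similar to the reference.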
Affiliation(s)
- Wentian Li
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA
- Jerome Freudenberg
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA
- Jan Freudenberg
  - Regeneron Genetics Center, Regeneron Pharmaceuticals, Inc., Tarrytown, NY, USA
|
4
|
A model selection approach for multiple sequence segmentation and dimensionality reduction. J MULTIVARIATE ANAL 2018. [DOI: 10.1016/j.jmva.2018.05.006] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/22/2022]
|
5
|
Singh VK, Krishnamachari A. Context based computational analysis and characterization of ARS consensus sequences (ACS) of Saccharomyces cerevisiae genome. GENOMICS DATA 2016; 9:130-6. [PMID: 27508123 PMCID: PMC4971157 DOI: 10.1016/j.gdata.2016.07.005] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 03/26/2016] [Revised: 06/27/2016] [Accepted: 07/06/2016] [Indexed: 01/08/2023]
Abstract
Genome-wide experimental studies in Saccharomyces cerevisiae reveal that an autonomous replicating sequence (ARS) requires an essential consensus sequence (ACS) for replication activity. Computational studies identified thousands of ACS-like patterns in the genome. However, only a few hundred of these sites act as replicating sites and the rest are considered dormant or evolving sites. In a bid to understand the sequence makeup of replication sites, a content and context-based analysis was performed on a set of replicating ACS sequences that bind to the origin-recognition complex (ORC), denoted as ORC-ACS, and non-replicating ACS sequences (nrACS) that are not bound by ORC. In this study, DNA properties such as base composition, correlation, sequence dependent thermodynamic and DNA structural profiles, and their positions have been considered for characterizing ORC-ACS and nrACS. Analysis reveals that ORC-ACS show marked differences in nucleotide composition and context features in their vicinity compared to nrACS. Interestingly, an A-rich motif was also discovered in ORC-ACS sequences within their nucleosome-free region. Profound changes in the conformational features, such as DNA helical twist, inclination angle and stacking energy between ORC-ACS and nrACS were observed. Distribution of ACS motifs in the non-coding segments points to the locations of ORC-ACS which are found far away from the adjacent gene start position compared to nrACS thereby enabling an accessible environment for ORC-proteins. Our attempt is novel in considering the contextual view of ACS and its flanking region along with nucleosome positioning in the S. cerevisiae genome and may be useful for any computational prediction scheme.
|
6
|
Suvorova YM, Korotkova MA, Korotkov EV. Study of the Paired Change Points in Bacterial Genes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:955-964. [PMID: 26356866 DOI: 10.1109/tcbb.2014.2321154] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
It is known that nucleotide sequences are not totally homogeneous and that this heterogeneity cannot be due to random fluctuations alone. Such heterogeneity poses the problem of segmenting a sequence into a set of homogeneous parts divided by points called "change points". In this work we investigated a special case of change points: paired change points (PCP). We used a well-known property of coding sequences, triplet periodicity (TP). The sequences that we are especially interested in consist of three successive parts: the first and the last parts have similar TP while the middle part has a different TP type. We aimed to find the genes with PCP and provide an explanation for this phenomenon. We developed a mathematical method for PCP detection based on a new measure of similarity between TP matrices. We investigated 66,936 bacterial genes from 17 bacterial genomes and revealed 2,700 genes with PCP and 6,459 genes with a single change point (SCP). We developed a mathematical approach to visualize the PCP cases. We suppose that PCP could be associated with double fusion or insertion events. The results of investigating the sequences with artificial insertions/fusions and the distribution of TP inside the genome support the idea that the real number of genes formed by insertion/fusion events could be 5-7 times greater than the number of genes revealed in the present work.
|
7
|
Algama M, Keith JM. Investigating genomic structure using changept: A Bayesian segmentation model. Comput Struct Biotechnol J 2014; 10:107-15. [PMID: 25349679 PMCID: PMC4204429 DOI: 10.1016/j.csbj.2014.08.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 01/06/2023] Open
Abstract
Genomes are composed of a wide variety of elements with distinct roles and characteristics. Some of these elements are well-characterised functional components such as protein-coding exons. Other elements play regulatory or structural roles, encode functional non-protein-coding RNAs, or perform some other function yet to be characterised. Still others may have no functional importance, though they may nevertheless be of interest to biologists. One technique for investigating the composition of genomes is to segment sequences into compositionally homogenous blocks. This technique, known as 'sequence segmentation' or 'change-point analysis', is used to identify patterns of variation across genomes such as GC-rich and GC-poor regions, coding and non-coding regions, slowly evolving and rapidly evolving regions and many other types of variation. In this mini-review we outline many of the genome segmentation methods currently available and then focus on a Bayesian DNA segmentation algorithm, with examples of its various applications.
Affiliation(s)
- Manjula Algama
  - School of Mathematical Sciences, Monash University, Clayton, VIC 3800, Australia
- Jonathan M Keith
  - School of Mathematical Sciences, Monash University, Clayton, VIC 3800, Australia
|
8
|
Detecting the borders between coding and non-coding DNA regions in prokaryotes based on recursive segmentation and nucleotide doublets statistics. BMC Genomics 2012; 13 Suppl 8:S19. [PMID: 23282225 PMCID: PMC3535712 DOI: 10.1186/1471-2164-13-s8-s19] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Detecting the borders between coding and non-coding regions is an essential step in genome annotation, and information entropy measures are useful for describing the signals in a genome sequence. However, the accuracy of previous border-finding methods based on entropic segmentation still needs to be improved. METHODS: In this study, we first applied a new recursive entropic segmentation method on DNA sequences to get preliminary significant cuts. A 22-symbol alphabet is used to capture the differential composition of nucleotide doublets and stop codon patterns along three phases in both DNA strands. This process requires no prior training datasets. RESULTS: Compared with the previous segmentation methods, the experimental results on three bacterial genomes, Rickettsia prowazekii, Borrelia burgdorferi and E. coli, show that our approach improves the accuracy of finding the borders between coding and non-coding regions in DNA sequences. CONCLUSIONS: This paper presents a new segmentation method in prokaryotes based on Jensen-Rényi divergence with a 22-symbol alphabet. For three bacterial genomes, compared to the A12_JR method, our method raised the accuracy of finding the borders between protein coding and non-coding regions in DNA sequences.
|
9
|
Abstract
Since the emergence of high-throughput genome sequencing platforms and more recently the next-generation platforms, the genome databases are growing at an astronomical rate. Tremendous efforts have been invested in recent years in understanding intriguing complexities beneath the vast ocean of genomic data. This is apparent in the spurt of computational methods for interpreting these data in the past few years. Genomic data interpretation is notoriously difficult, partly owing to the inherent heterogeneities appearing at different scales. Methods developed to interpret these data often suffer from their inability to adequately measure the underlying heterogeneities and thus lead to confounding results. Here, we present an information entropy-based approach that unravels the distinctive patterns underlying genomic data efficiently and thus is applicable in addressing a variety of biological problems. We show the robustness and consistency of the proposed methodology in addressing three different biological problems of significance—identification of alien DNAs in bacterial genomes, detection of structural variants in cancer cell lines and alignment-free genome comparison.
Affiliation(s)
- Rajeev K Azad
  - Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260, USA
|
10
|
Bickel PJ, Boley N, Brown JB, Huang H, Zhang NR. Subsampling methods for genomic inference. Ann Appl Stat 2010. [DOI: 10.1214/10-aoas363] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.8] [Indexed: 11/19/2022]
|
11
|
Zhang W, Wu W, Lin W, Zhou P, Dai L, Zhang Y, Huang J, Zhang D. Deciphering heterogeneity in pig genome assembly Sscrofa9 by isochore and isochore-like region analyses. PLoS One 2010; 5:e13303. [PMID: 20948965 PMCID: PMC2952626 DOI: 10.1371/journal.pone.0013303] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 04/27/2010] [Accepted: 09/15/2010] [Indexed: 11/18/2022] Open
Abstract
Background: The isochore, a large DNA sequence with relatively small GC variance, is one of the most important structures in eukaryotic genomes. Although the isochore has been widely studied in humans and other species, little is known about its distribution in pigs. Principal Findings: In this paper, we construct a map of long homogeneous genome regions (LHGRs), i.e., isochores and isochore-like regions, in pigs to provide an intuitive version of GC heterogeneity in each chromosome. The LHGR pattern study not only quantifies heterogeneities, but also reveals some primary characteristics of the chromatin organization, including the following: (1) the majority of LHGRs belong to GC-poor families and are long; (2) a high gene density tends to occur with the appearance of GC-rich LHGRs; and (3) the density of LINE repeats decreases with an increase in the GC content of LHGRs. Furthermore, a portion of LHGRs with particular GC ranges (50%–51% and 54%–55%) tend to have abnormally high gene densities, suggesting that biased gene conversion (BGC), as well as time- and energy-saving principles, could be of importance to the formation of genome organization. Conclusion: This study significantly improves our knowledge of chromatin organization in the pig genome. Correlations between the different biological features (e.g., gene density and repeat density) and GC content of LHGRs provide a unique glimpse of in silico gene and repeats prediction.
Affiliation(s)
- Wenqian Zhang
  - Bioinformatics Center, College of Life Science, Northwest A&F University, Xianyang, Shaanxi, China
- Wenwu Wu
  - Bioinformatics Center, College of Life Science, Northwest A&F University, Xianyang, Shaanxi, China
- Wenchao Lin
  - Bioinformatics Center, College of Life Science, Northwest A&F University, Xianyang, Shaanxi, China
- Pengfang Zhou
  - Bioinformatics Center, College of Life Science, Northwest A&F University, Xianyang, Shaanxi, China
- Li Dai
  - Bioinformatics Center, College of Life Science, Northwest A&F University, Xianyang, Shaanxi, China
- Yang Zhang
  - Investigation Group of Molecular Virology, Immunology, Oncology and Systems Biology, and Bioinformatics Center, College of Veterinary Medicine, Northwest A&F University, Xianyang, Shaanxi, China
- Jingfei Huang
  - State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan, China
- Deli Zhang
  - Investigation Group of Molecular Virology, Immunology, Oncology and Systems Biology, and Bioinformatics Center, College of Veterinary Medicine, Northwest A&F University, Xianyang, Shaanxi, China
|
12
|
Abstract
A sequence analysis-oriented binary search-like algorithm was transformed to a sensitive and accurate analysis tool for processing whole-genome data. The advantage of the algorithm over previous methods is its ability to detect the margins of both short and long genome fragments, enriched by up-regulated signals, at equal accuracy. The score of an enriched genome fragment reflects the difference between the actual concentration of up-regulated signals in the fragment and the chromosome signal baseline. The "divide-and-conquer"-type algorithm detects a series of nonintersecting fragments of various lengths with locally optimal scores. The procedure is applied to detected fragments in a nested manner by recalculating the lower-than-baseline signals in the chromosome. The algorithm was applied to simulated whole-genome data, and its sensitivity/specificity were compared with those of several alternative algorithms. The algorithm was also tested with four biological tiling array datasets comprising Arabidopsis (i) expression and (ii) histone 3 lysine 27 trimethylation ChIP-on-chip datasets; Saccharomyces cerevisiae (iii) spliced intron data and (iv) chromatin remodeling factor binding sites. The analyses' results demonstrate the power of the algorithm in identifying both the short up-regulated fragments (such as exons and transcription factor binding sites) and the long (even moderately up-regulated) zones at their precise genome margins. The algorithm generates an accurate whole-genome landscape that could be used for cross-comparison of signals across the same genome in evolutionary and general genomic studies.
|
13
|
Hutter B, Paulsen M, Helms V. Identifying CpG islands by different computational techniques. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2010; 13:153-64. [PMID: 19196100 DOI: 10.1089/omi.2008.0046] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 12/31/2022]
Abstract
CpG islands (CGIs) are generally regarded as important epigenetic regulatory elements due to their association with promoter regions. However, identification of functional CGIs is hampered by repetitive elements and species-specific particularities. Here, we compared the performance of different CGI detection programs on genomic sequences of human and mouse genes. Although mouse CGIs are shorter and G+C poorer than their human counterparts, the different tools tested in our study reliably identify CGIs in promoter regions in both species. Our study confirms that substantially fewer murine than human CGIs coincide with repetitive elements and indicates that such CGIs are subject to accelerated cytosine deamination. In addition, CpG depletion appears to anticorrelate with the epigenetic features of functional regulatory CGIs. Taking into account different deamination rates in unmethylated CGIs versus those in methylated CGIs might support the detection of functional CGIs in other species for which there is little epigenetic information available.
Affiliation(s)
- Barbara Hutter
  - Lehrstuhl für Computational Biology, Universität des Saarlandes, Saarbrücken, Germany
|
14
|
Stuart PE, Nair RP, Hiremagalore R, Kullavanijaya P, Kullavanijaya P, Tejasvi T, Lim HW, Voorhees JJ, Elder JT. Comparison of MHC class I risk haplotypes in Thai and Caucasian psoriatics shows locus heterogeneity at PSORS1. Tissue Antigens 2010; 76:387-97. [PMID: 20604894 DOI: 10.1111/j.1399-0039.2010.01526.x] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Indexed: 11/28/2022]
Abstract
Earlier studies have shown that psoriasis in Japan and Thailand is associated with two different major histocompatibility complex (MHC) haplotypes - those bearing HLA-Cw6 and those bearing HLA-Cw1 and HLA-B46. In an independent case-control sample from Thailand, we confirmed the association of psoriasis with both haplotypes. No association was seen in Thai HLA-Cw1 haplotypes lacking HLA-B46, nor was HLA-Cw1 associated with psoriasis in a large Caucasian sample. To assess whether these risk haplotypes share a common origin, we sequenced genomic DNA from a Thai HLA-Cw1-B46 homozygote across the ∼300 kb MHC risk interval, and compared it with sequence of a HLA-Cw6-B57 risk haplotype. Three small regions of homology were found, but these regions share equivalent sequence similarity with one or more clearly non-risk haplotypes, and they contain no polymorphism alleles unique to all risk haplotypes. Differences in psoriasis phenotype were also observed, including lower risk of disease, greater nail involvement, and later age at onset in HLA-Cw1-B46 carriers compared with HLA-Cw6 carriers. These findings suggest locus heterogeneity at PSORS1 (psoriasis susceptibility 1), the major psoriasis susceptibility locus in the MHC, with HLA-Cw6 imparting risk in both Caucasians and Asians, and an allele other than HLA-Cw1 on the HLA-Cw1-B46 haplotype acting as an additional risk variant in East Asians.
Affiliation(s)
- P E Stuart
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI 48109-5675, USA
|
15
|
Elhaik E, Graur D, Josić K, Landan G. Identifying compositionally homogeneous and nonhomogeneous domains within the human genome using a novel segmentation algorithm. Nucleic Acids Res 2010; 38:e158. [PMID: 20571085 PMCID: PMC2926622 DOI: 10.1093/nar/gkq532] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 11/27/2022] Open
Abstract
It has been suggested that the mammalian genome is composed mainly of long compositionally homogeneous domains. Such domains are frequently identified using recursive segmentation algorithms based on the Jensen–Shannon divergence. However, a common difficulty with such methods is deciding when to halt the recursive partitioning and what criteria to use in deciding whether a detected boundary between two segments is real or not. We demonstrate that commonly used halting criteria are intrinsically biased, and propose IsoPlotter, a parameter-free segmentation algorithm that overcomes such biases by using a simple dynamic halting criterion and tests the homogeneity of the inferred domains. IsoPlotter was compared with an alternative segmentation algorithm, DJS, using two sets of simulated genomic sequences. Our results show that IsoPlotter was able to infer both long and short compositionally homogeneous domains with low GC content dispersion, whereas DJS failed to identify short compositionally homogeneous domains and sequences with low compositional dispersion. By segmenting the human genome with IsoPlotter, we found that one-third of the genome is composed of compositionally nonhomogeneous domains and the remaining is a mixture of many short compositionally homogeneous domains and relatively few long ones.
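The recursive segmentation scheme that this abstract critiques can be sketched in a few lines. The version below works on GC versus AT composition and halts on a fixed divergence threshold. That fixed `threshold` is a hypothetical placeholder standing in for exactly the kind of halting rule the abstract calls biased; it is not IsoPlotter's dynamic criterion, and all function names are illustrative.

```python
from math import log2


def entropy(gc, n):
    """Shannon entropy (bits) of a GC/AT composition with `gc` of `n` bases GC."""
    h = 0.0
    for count in (gc, n - gc):
        if count:
            p = count / n
            h -= p * log2(p)
    return h


def best_split(seq):
    """Position maximizing the length-weighted Jensen-Shannon divergence
    between the left and right parts; (None, 0.0) if nothing beats zero."""
    n = len(seq)
    is_gc = [1 if base in "GCgc" else 0 for base in seq]
    total_gc = sum(is_gc)
    h_all = entropy(total_gc, n) if n else 0.0
    best_pos, best_d = None, 0.0
    left_gc = 0
    for i in range(1, n):
        left_gc += is_gc[i - 1]
        d = (h_all
             - (i / n) * entropy(left_gc, i)
             - ((n - i) / n) * entropy(total_gc - left_gc, n - i))
        if d > best_d:
            best_pos, best_d = i, d
    return best_pos, best_d


def segment(seq, threshold, offset=0):
    """Recursively split `seq`; returns sorted boundary positions.
    `threshold` is an assumed fixed halting rule, for illustration only."""
    pos, d = best_split(seq)
    if pos is None or d < threshold:
        return []
    return (segment(seq[:pos], threshold, offset)
            + [offset + pos]
            + segment(seq[pos:], threshold, offset + pos))


print(segment("A" * 20 + "G" * 20, threshold=0.5))  # [20]
```

Changing `threshold` changes how many boundaries are reported, which is why the choice of halting criterion matters so much for the inferred domain lengths.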
Affiliation(s)
- Eran Elhaik
  - McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
|
16
|
Hackenberg M, Barturen G, Carpena P, Luque-Escamilla PL, Previti C, Oliver JL. Prediction of CpG-island function: CpG clustering vs. sliding-window methods. BMC Genomics 2010; 11:327. [PMID: 20500903 PMCID: PMC2887419 DOI: 10.1186/1471-2164-11-327] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 01/26/2010] [Accepted: 05/26/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Unmethylated stretches of CpG dinucleotides (CpG islands) are an outstanding property of mammalian genomes. Conventionally, these regions are detected by sliding window approaches using %G + C, CpG observed/expected ratio and length thresholds as main parameters. Recently, clustering methods directly detect clusters of CpG dinucleotides as a statistical property of the genome sequence. RESULTS: We compare sliding-window to clustering (i.e. CpGcluster) predictions by applying new ways to detect putative functionality of CpG islands. Analyzing the co-localization with several genomic regions as a function of window size vs. statistical significance (p-value), CpGcluster shows a higher overlap with promoter regions and highly conserved elements, at the same time showing less overlap with Alu retrotransposons. The major difference in the prediction was found for short islands (CpG islets), often exclusively predicted by CpGcluster. Many of these islets seem to be functional, as they are unmethylated, highly conserved and/or located within the promoter region. Finally, we show that window-based islands can spuriously overlap several, differentially regulated promoters as well as different methylation domains, which might indicate a wrong merge of several CpG islands into a single, very long island. The shorter CpGcluster islands seem to be much more specific when concerning the overlap with alternative transcription start sites or the detection of homogenous methylation domains. CONCLUSIONS: The main difference between sliding-window approaches and clustering methods is the length of the predicted islands. Short islands, often differentially methylated, are almost exclusively predicted by CpGcluster. This suggests that CpGcluster may be the algorithm of choice to explore the function of these short, but putatively functional CpG islands.
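The conventional sliding-window criteria mentioned in the background (G+C content, CpG observed/expected ratio, and a length threshold) can be sketched as below. The default thresholds (window of 200 bp, G+C of at least 50%, observed/expected of at least 0.6) are the commonly cited conventional values and are an assumption here, since the abstract does not state which values were used. The function names are hypothetical, and real tools additionally merge overlapping windows into islands.

```python
def cpg_window_stats(window):
    """Return (G+C fraction, CpG observed/expected ratio) for one window."""
    w = window.upper()
    n = len(w)
    c, g = w.count("C"), w.count("G")
    cpg = w.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def candidate_windows(seq, size=200, step=1, min_gc=0.5, min_oe=0.6):
    """Start positions of windows passing both thresholds (assumed defaults)."""
    hits = []
    for start in range(0, len(seq) - size + 1, step):
        gc, oe = cpg_window_stats(seq[start:start + size])
        if gc >= min_gc and oe >= min_oe:
            hits.append(start)
    return hits


print(cpg_window_stats("CGCGCGCG"))  # (1.0, 2.0)
```

Because every passing window has the same fixed size, neighboring hits tend to be merged into long islands, which is the behavior the abstract contrasts with the shorter predictions of the clustering approach.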
Affiliation(s)
- Michael Hackenberg
  - Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Campus de Fuentenueva s/n, 18071, Granada, Spain
|
17
|
Pehkonen P, Wong G, Törönen P. Heuristic Bayesian segmentation for discovery of coexpressed genes within genomic regions. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2010; 7:37-49. [PMID: 20150667 DOI: 10.1109/tcbb.2008.56] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/28/2023]
Abstract
Segmentation aims to separate homogeneous areas from the sequential data, and plays a central role in data mining. It has applications ranging from finance to molecular biology, where bioinformatics tasks such as genome data analysis are active application fields. In this paper, we present a novel application of segmentation in locating genomic regions with coexpressed genes. We aim at automated discovery of such regions without requirement for user-given parameters. In order to perform the segmentation within a reasonable time, we use heuristics. Most of the heuristic segmentation algorithms require some decision on the number of segments. This is usually accomplished by using asymptotic model selection methods like the Bayesian information criterion. Such methods are based on some simplification, which can limit their usage. In this paper, we propose a Bayesian model selection to choose the most proper result from heuristic segmentation. Our Bayesian model presents a simple prior for the segmentation solutions with various segment numbers and a modified Dirichlet prior for modeling multinomial data. We show with various artificial data sets in our benchmark system that our model selection criterion has the best overall performance. The application of our method in yeast cell-cycle gene expression data reveals potential active and passive regions of the genome.
Affiliation(s)
- Petri Pehkonen
- Department of Neurobiology, A.I. Virtanen Institute, University of Kuopio, Kuopio, Finland.
|
18
|
Elhaik E, Graur D, Josic K. Comparative testing of DNA segmentation algorithms using benchmark simulations. Mol Biol Evol 2009; 27:1015-24. [PMID: 20018981 DOI: 10.1093/molbev/msp307] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022] Open
Abstract
Numerous segmentation methods for the detection of compositionally homogeneous domains within genomic sequences have been proposed. Unfortunately, these methods yield inconsistent results. Here, we present a benchmark consisting of two sets of simulated genomic sequences for testing the performances of segmentation algorithms. Sequences in the first set are composed of fixed-sized homogeneous domains, distinct in their between-domain guanine and cytosine (GC) content variability. The sequences in the second set are composed of a mosaic of many short domains and a few long ones, distinguished by sharp GC content boundaries between neighboring domains. We use these sets to test the performance of seven segmentation algorithms in the literature. Our results show that recursive segmentation algorithms based on the Jensen-Shannon divergence outperform all other algorithms. However, even these algorithms perform poorly in certain instances because of the arbitrary choice of a segmentation-stopping criterion.
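The recursive Jensen-Shannon approach mentioned in this abstract can be sketched as follows. This is a minimal illustration, not any of the benchmarked programs: the function names and the fixed `min_divergence` threshold are assumptions made here, and published methods typically use a statistical significance test as the stopping criterion rather than a raw cutoff.

```python
import math
from collections import Counter


def shannon_entropy(counts, total):
    """Shannon entropy (bits) of a symbol-count table."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def best_js_split(seq):
    """Return (position, divergence) of the cut maximising the weighted
    Jensen-Shannon divergence between the left and right parts."""
    n = len(seq)
    total = Counter(seq)
    h_all = shannon_entropy(total, n)
    left = Counter()
    best_pos, best_d = None, -1.0
    for i in range(1, n):
        left[seq[i - 1]] += 1
        right = total - left  # Counter subtraction keeps positive counts only
        d = (h_all
             - (i / n) * shannon_entropy(left, i)
             - ((n - i) / n) * shannon_entropy(right, n - i))
        if d > best_d:
            best_pos, best_d = i, d
    return best_pos, best_d


def segment(seq, min_divergence=0.1, offset=0):
    """Recursively split seq; min_divergence is a hypothetical stopping
    threshold chosen for illustration only."""
    if len(seq) < 2:
        return []
    pos, d = best_js_split(seq)
    if d < min_divergence:
        return []
    return (segment(seq[:pos], min_divergence, offset)
            + [offset + pos]
            + segment(seq[pos:], min_divergence, offset + pos))
```

The choice of `min_divergence` directly controls how many domains are reported, which is one way to see why the abstract singles out the stopping criterion as a source of poor performance.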
Affiliation(s)
- Eran Elhaik
- Department of Biology & Biochemistry, University of Houston, TX, USA.
|
19
|
Characterisation of inactivation domains and evolutionary strata in human X chromosome through Markov segmentation. PLoS One 2009; 4:e7885. [PMID: 19946363 PMCID: PMC2776969 DOI: 10.1371/journal.pone.0007885] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 07/11/2009] [Accepted: 10/09/2009] [Indexed: 11/19/2022] Open
Abstract
Markov segmentation is a method of identifying compositionally different subsequences in a given symbolic sequence. We have applied this technique to the DNA sequence of the human X chromosome to analyze its compositional structure. The human X chromosome is known to have acquired DNA through distinct evolutionary events and is believed to be composed of five evolutionary strata. In addition, in female mammals all copies of X chromosome in excess of one are transcriptionally inactivated. The location of a gene is correlated with its ability to undergo inactivation, but correlations between evolutionary strata and inactivation domains are less clear. Our analysis provides an accurate estimate of the location of stratum boundaries and gives a high-resolution map of compositionally different regions on the X chromosome. This leads to the identification of a novel stratum, as well as segments wherein a group of genes either undergo inactivation or escape inactivation in toto. We identify oligomers that appear to be unique to inactivation domains alone.
|
20
|
Arvey AJ, Azad RK, Raval A, Lawrence JG. Detection of genomic islands via segmental genome heterogeneity. Nucleic Acids Res 2009; 37:5255-66. [PMID: 19589805 PMCID: PMC2760805 DOI: 10.1093/nar/gkp576] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Indexed: 12/03/2022] Open
Abstract
While the recognition of genomic islands can be a powerful mechanism for identifying genes that distinguish related bacteria, few methods have been developed to identify them specifically. Rather, identification of islands often begins with cataloging individual genes likely to have been recently introduced into the genome; regions with many putative alien genes are then examined for other features suggestive of recent acquisition of a large genomic region. When few phylogenetic relatives are available, the identification of alien genes relies on their atypical features relative to the bulk of the genes in the genome. The weakness of these ‘bottom–up’ approaches lies in the difficulty in identifying robustly those genes which are atypical, or phylogenetically restricted, due to recent foreign ancestry. Herein, we apply an alternative ‘top–down’ approach where bacterial genomes are recursively divided into progressively smaller regions, each with uniform composition. In this way, large chromosomal regions with atypical features are identified with high confidence due to the simultaneous analysis of multiple genes. This approach is based on a generalized divergence measure to quantify the compositional difference between segments in a hypothesis-testing framework. We tested the proposed genome island prediction algorithm on both artificial chimeric genomes and genuine bacterial genomes.
Affiliation(s)
- Aaron J Arvey
- Department of Computer Science, University of California San Diego, La Jolla, CA 92093, USA
|
21
|
Zhang Y. Relations between Shannon entropy and genome order index in segmenting DNA sequences. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2009; 79:041918. [PMID: 19518267 DOI: 10.1103/physreve.79.041918] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 12/13/2008] [Revised: 03/14/2009] [Indexed: 05/27/2023]
Abstract
Shannon entropy H and genome order index S are used in segmenting DNA sequences. Zhang [Phys. Rev. E 72, 041917 (2005)] found that the two schemes are equivalent when a DNA sequence is converted to a binary sequence of S (strong H bond) and W (weak H bond). They left the mathematical proof to mathematicians who are interested in this issue. In this paper, a possible mathematical explanation is given. Moreover, we find that Chargaff parity rule 2 is the necessary condition of the equivalence, and the equivalence disappears when a DNA sequence is regarded as a four-symbol sequence. At last, we propose that S − 2^(−H) may be related to species evolution.
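The two quantities can be written out for the binary case as a worked illustration. The abstract does not spell out the definitions, so the formulas below assume the usual ones: Shannon entropy in bits and the genome order index as the sum of squared symbol frequencies, with p denoting the frequency of strong-bond (S) bases. This is not the paper's proof, only a sketch of why the two measures move together.

```latex
\begin{align}
  H(p) &= -p\log_2 p - (1-p)\log_2(1-p), \\
  S(p) &= p^2 + (1-p)^2 = \tfrac12 + 2\left(p - \tfrac12\right)^2 .
\end{align}
% Both depend on p only through |p - 1/2|: H decreases and S increases
% as |p - 1/2| grows, so in the binary case each is a monotone function
% of the other.
%
% By Jensen's inequality, \log_2 \sum_i p_i^2 \ge \sum_i p_i \log_2 p_i = -H, hence
\begin{equation}
  S - 2^{-H} \ge 0,
  \qquad \text{with equality in the binary case at } p \in \{0, \tfrac12, 1\}.
\end{equation}
```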
Affiliation(s)
- Yi Zhang
- Department of Mathematics, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, People's Republic of China.
|
22
|
Keith JM, Adams P, Stephen S, Mattick JS. Delineating slowly and rapidly evolving fractions of the Drosophila genome. J Comput Biol 2008; 15:407-30. [PMID: 18435570 DOI: 10.1089/cmb.2007.0173] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Indexed: 01/23/2023] Open
Abstract
Evolutionary conservation is an important indicator of function and a major component of bioinformatic methods to identify non-protein-coding genes. We present a new Bayesian method for segmenting pairwise alignments of eukaryotic genomes while simultaneously classifying segments into slowly and rapidly evolving fractions. We also describe an information criterion similar to the Akaike Information Criterion (AIC) for determining the number of classes. Working with pairwise alignments enables detection of differences in conservation patterns among closely related species. We analyzed three whole-genome and three partial-genome pairwise alignments among eight Drosophila species. Three distinct classes of conservation level were detected. Sequences comprising the most slowly evolving component were consistent across a range of species pairs, and constituted approximately 62-66% of the D. melanogaster genome. Almost all (>90%) of the aligned protein-coding sequence is in this fraction, suggesting much of it (comprising the majority of the Drosophila genome, including approximately 56% of non-protein-coding sequences) is functional. The size and content of the most rapidly evolving component was species dependent, and varied from 1.6% to 4.8%. This fraction is also enriched for protein-coding sequence (while containing significant amounts of non-protein-coding sequence), suggesting it is under positive selection. We also classified segments according to conservation and GC content simultaneously. This analysis identified numerous sub-classes of those identified on the basis of conservation alone, but was nevertheless consistent with that classification. Software, data, and results available at www.maths.qut.edu.au/-keithj/. Genomic segments comprising the conservation classes available in BED format.
Affiliation(s)
- Jonathan M Keith
- School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia.
|
23
|
Gao F, Zhang CT. Prediction of replication time zones at single nucleotide resolution in the human genome. FEBS Lett 2008; 582:2441-4. [PMID: 18555015 DOI: 10.1016/j.febslet.2008.06.008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/06/2008] [Revised: 06/03/2008] [Accepted: 06/04/2008] [Indexed: 10/22/2022]
Abstract
The human genome is structured at multiple levels: it is organized into a series of replication time zones, and meanwhile it is composed of isochores. Accumulating evidence suggests a match between these two genome features. Based on newly developed software GC-Profile, we obtained a complete coverage of the human genome by 3198 isochores with boundaries at single nucleotide resolution. Interestingly, the experimentally confirmed replication timing sites in the regions of 1p36.1, 6p21.32, 17q11.2 and 22q12.1 nearly all coincide with the determined isochore boundaries. The precise boundaries of the 3198 isochores are available via the website: http://tubic.tju.edu.cn/isomap/.
Affiliation(s)
- Feng Gao
- Department of Physics, Tianjin University, Tianjin 300072, China
|
24
|
Multipattern consensus regions in multiple aligned protein sequences and their segmentation. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2008:35809. [PMID: 18427583 DOI: 10.1155/bsb/2006/35809] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 11/23/2005] [Revised: 05/22/2006] [Accepted: 06/07/2006] [Indexed: 01/10/2023]
Abstract
Decomposing a biological sequence into its functional regions is an important prerequisite to understand the molecule. Using the multiple alignments of the sequences, we evaluate a segmentation based on the type of statistical variation pattern from each of the aligned sites. To describe such a more general pattern, we introduce multipattern consensus regions as segmented regions based on conserved as well as interdependent patterns. Thus the proposed consensus region considers patterns that are statistically significant and extends a local neighborhood. To show its relevance in protein sequence analysis, a cancer suppressor gene called p53 is examined. The results show significant associations between the detected regions and tendency of mutations, location on the 3D structure, and cancer hereditable factors that can be inferred from human twin studies.
|
25
|
Zheng WX, Zhang CT. Biological Implications of Isochore Boundaries in the Human Genome. J Biomol Struct Dyn 2008; 25:327-36. [DOI: 10.1080/07391102.2008.10507181] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 10/28/2022]
|
26
|
Abstract
Whole-genome comparisons among mammalian and other eukaryotic organisms have revealed that they contain large quantities of conserved non-protein-coding sequence. Although some of the functions of this non-coding DNA have been identified, there remains a large quantity of conserved genomic sequence that is of no known function. Moreover, the task of delineating the conserved sequences is non-trivial, particularly when some sequences are conserved in only a small number of lineages. Sequence segmentation is a statistical technique for identifying putative functional elements in genomes based on atypical sequence characteristics, such as conservation levels relative to other genomes, GC content, SNP frequency, and potentially many others. The publicly available program changept and associated programs use Bayesian multiple change-point analysis to delineate classes of genomic segments with similar characteristics, potentially representing new classes of non-coding RNAs (contact web site: http://silmaril.math.sci.qut.edu.au/~keith/).
|
27
|
Haiminen N, Mannila H. Discovering isochores by least-squares optimal segmentation. Gene 2007; 394:53-60. [PMID: 17389148 DOI: 10.1016/j.gene.2007.01.028] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 11/21/2006] [Revised: 01/16/2007] [Accepted: 01/22/2007] [Indexed: 10/23/2022]
Abstract
The isochore structure of a genome is observable by variation in the G+C (guanine and cytosine) content within and between the chromosomes. Describing the isochore structure of vertebrate genomes is a challenging task, and many computational methods have been developed and applied to it. Here we apply a well-known least-squares optimal segmentation algorithm to isochore discovery. The algorithm finds the best division of the sequence into k pieces, such that the segments are internally as homogeneous as possible. We show how this simple segmentation method can be applied to isochore discovery using as input the G+C content of sliding windows on the sequence. To evaluate the performance of this segmentation technique on isochore detection, we present results from segmenting previously studied isochore regions of the human genome. Detailed results on the MHC locus, on parts of chromosomes 21 and 22, and on a 100 Mb region from chromosome 1 are similar to previously suggested isochore structures. We also give results on segmenting all 22 autosomal human chromosomes. An advantage of this technique is that oversegmentation of G+C rich regions can generally be avoided. This is because the technique concentrates on greater global, instead of smaller local, differences in the sequence composition. The effect is further emphasized by a log-transformation of the data that lowers the high variance that is observed in G+C rich regions. We conclude that the least-squares optimal segmentation method is computationally efficient and yields results close to previous biologically motivated isochore structures.
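The "best division of the sequence into k pieces" described above is commonly computed with a dynamic program over prefix sums. The sketch below is a generic textbook version of that idea, not the authors' code; the function name is hypothetical, and the input is assumed to be a list of per-window G+C values (optionally log-transformed beforehand, as the abstract suggests).

```python
def least_squares_segmentation(values, k):
    """Split values into k contiguous segments minimising the total squared
    deviation from each segment's mean. Returns (boundaries, error), where
    boundaries are the start indices of segments 2..k. Runs in O(k * n^2)."""
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums of values and squared values give O(1) segment costs.
    s = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(values):
        s[i + 1] = s[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(i, j):
        """Squared error of values[i:j] around its own mean."""
        seg_sum = s[j] - s[i]
        return (s2[j] - s2[i]) - seg_sum * seg_sum / (j - i)

    inf = float("inf")
    # best[p][j]: minimal error when values[:j] is split into p segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for p in range(1, k + 1):
        for j in range(p, n + 1):
            for i in range(p - 1, j):
                c = best[p - 1][i] + cost(i, j)
                if c < best[p][j]:
                    best[p][j] = c
                    back[p][j] = i

    # Trace the chosen segment starts back from the end.
    starts = []
    j = n
    for p in range(k, 0, -1):
        j = back[p][j]
        starts.append(j)
    starts.reverse()  # starts[0] is always 0
    return starts[1:], best[k][n]
```

Because the objective is a global sum of squared errors, small local fluctuations inside a region cost little compared with a large shift in mean, which matches the abstract's remark that the method favours greater global differences over smaller local ones.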
MESH Headings
- Algorithms
- Chromosomes, Human/genetics
- Chromosomes, Human, Pair 1/genetics
- Chromosomes, Human, Pair 21/genetics
- Chromosomes, Human, Pair 22/genetics
- Chromosomes, Human, Pair 6/genetics
- GC Rich Sequence
- Genome, Human
- Genomics/statistics & numerical data
- Humans
- Isochores/chemistry
- Isochores/genetics
- Least-Squares Analysis
- Major Histocompatibility Complex
Affiliation(s)
- Niina Haiminen
- HIIT Basic Research Unit, Department of Computer Science, University of Helsinki, Finland.
|
28
|
Haiminen N, Mannila H, Terzi E. Comparing segmentations by applying randomization techniques. BMC Bioinformatics 2007; 8:171. [PMID: 17521423 PMCID: PMC1904250 DOI: 10.1186/1471-2105-8-171] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 01/17/2007] [Accepted: 05/23/2007] [Indexed: 11/25/2022] Open
Abstract
Background: There exist many segmentation techniques for genomic sequences, and the segmentations can also be based on many different biological features. We show how to evaluate and compare the quality of segmentations obtained by different techniques and alternative biological features. Results: We apply randomization techniques for evaluating the quality of a given segmentation. Our example applications include isochore detection and the discovery of coding-noncoding structure. We obtain segmentations of relevant sequences by applying different techniques, and use alternative features to segment on. We show that some of the obtained segmentations are very similar to the underlying true segmentations, and this similarity is statistically significant. For some other segmentations, we show that equally good results are likely to appear by chance. Conclusion: We introduce a framework for evaluating segmentation quality, and demonstrate its use on two examples of segmental genomic structures. We transform the process of quality evaluation from simply viewing the segmentations, to obtaining p-values denoting significance of segmentation similarity.
Affiliation(s)
- Niina Haiminen
- HIIT Basic Research Unit, Department of Computer Science, P.O.Box 68, FI-00014 University of Helsinki, Finland
- Heikki Mannila
- HIIT Basic Research Unit, Department of Computer Science, P.O.Box 68, FI-00014 University of Helsinki, Finland
- Laboratory of Computer and Information Science, Helsinki University of Technology, FI-02015 TKK, Finland
- Evimaria Terzi
- HIIT Basic Research Unit, Department of Computer Science, P.O.Box 68, FI-00014 University of Helsinki, Finland
|
29
|
Bock C, Walter J, Paulsen M, Lengauer T. CpG island mapping by epigenome prediction. PLoS Comput Biol 2007; 3:e110. [PMID: 17559301 PMCID: PMC1892605 DOI: 10.1371/journal.pcbi.0030110] [Citation(s) in RCA: 129] [Impact Index Per Article: 7.6] [Received: 12/22/2006] [Accepted: 05/01/2007] [Indexed: 12/04/2022] Open
Abstract
CpG islands were originally identified by epigenetic and functional properties, namely, absence of DNA methylation and frequent promoter association. However, this concept was quickly replaced by simple DNA sequence criteria, which allowed for genome-wide annotation of CpG islands in the absence of large-scale epigenetic datasets. Although widely used, the current CpG island criteria incur significant disadvantages: (1) reliance on arbitrary threshold parameters that bear little biological justification, (2) failure to account for widespread heterogeneity among CpG islands, and (3) apparent lack of specificity when applied to the human genome. This study is driven by the idea that a quantitative score of “CpG island strength” that incorporates epigenetic and functional aspects can help resolve these issues. We construct an epigenome prediction pipeline that links the DNA sequence of CpG islands to their epigenetic states, including DNA methylation, histone modifications, and chromatin accessibility. By training support vector machines on epigenetic data for CpG islands on human Chromosomes 21 and 22, we identify informative DNA attributes that correlate with open versus compact chromatin structures. These DNA attributes are used to predict the epigenetic states of all CpG islands genome-wide. Combining predictions for multiple epigenetic features, we estimate the inherent CpG island strength for each CpG island in the human genome, i.e., its inherent tendency to exhibit an open and transcriptionally competent chromatin structure. We extensively validate our results on independent datasets, showing that the CpG island strength predictions are applicable and informative across different tissues and cell types, and we derive improved maps of predicted “bona fide” CpG islands. 
The mapping of CpG islands by epigenome prediction is conceptually superior to identifying CpG islands by widely used sequence criteria since it links CpG island detection to their characteristic epigenetic and functional states. And it is superior to purely experimental epigenome mapping for CpG island detection since it abstracts from specific properties that are limited to a single cell type or tissue. In addition, using computational epigenetics methods we could identify high correlation between the epigenome and characteristics of the DNA sequence, a finding which emphasizes the need for a better understanding of the mechanistic links between genome and epigenome. A key challenge for bioinformatic research is the identification of regulatory regions in the human genome. Regulatory regions are DNA elements that control gene expression and thereby contribute to the organism's phenotype. An important class of regulatory regions consists of so-called CpG islands, which are characterized by frequent occurrence of the CG sequence pattern. CpG islands are strongly associated with open and transcriptionally competent chromatin structure, they play a critical role in gene regulation, and they are involved in the epigenetic causes of cancer. In this article we make several conceptual improvements to the definition and mapping of CpG islands. First, we show that the traditional distinction between CpG islands and non-CpG islands is too harsh, and instead we propose a quantitative measure of CpG island strength to gradually distinguish between stronger and weaker regulatory regions. Second, by genome-wide comparison of multiple epigenome datasets we identify high correlation between features of the genome's DNA sequence and the epigenome, indicating strong functional interdependence. Third, we develop and apply a novel method for predicting the strength of all CpG islands in the human genome, giving rise to an improved and more accurate CpG island mapping.
Affiliation(s)
- Christoph Bock
- Max-Planck-Institut für Informatik, Saarbrücken, Germany.
|
30
|
Thakur V, Azad RK, Ramaswamy R. Markov models of genome segmentation. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2007; 75:011915. [PMID: 17358192 DOI: 10.1103/physreve.75.011915] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 03/02/2006] [Revised: 06/19/2006] [Indexed: 05/14/2023]
Abstract
We introduce Markov models for segmentation of symbolic sequences, extending a segmentation procedure based on the Jensen-Shannon divergence that has been introduced earlier. Higher-order Markov models are more sensitive to the details of local patterns and in application to genome analysis, this makes it possible to segment a sequence at positions that are biologically meaningful. We show the advantage of higher-order Markov-model-based segmentation procedures in detecting compositional inhomogeneity in chimeric DNA sequences constructed from genomes of diverse species, and in application to the E. coli K12 genome, boundaries of genomic islands, cryptic prophages, and horizontally acquired regions are accurately identified.
Affiliation(s)
- Vivek Thakur
- Center for Computational Biology and Bioinformatics, School of Information Technology, Jawaharlal Nehru University, New Delhi 110 067, India
|
31
|
Fearnhead P, Sherlock C. An exact Gibbs sampler for the Markov-modulated Poisson process. J R Stat Soc Series B Stat Methodol 2006. [DOI: 10.1111/j.1467-9868.2006.00566.x] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.8] [Indexed: 11/30/2022]
|
32
|
Hackenberg M, Previti C, Luque-Escamilla PL, Carpena P, Martínez-Aroza J, Oliver JL. CpGcluster: a distance-based algorithm for CpG-island detection. BMC Bioinformatics 2006; 7:446. [PMID: 17038168 PMCID: PMC1617122 DOI: 10.1186/1471-2105-7-446] [Citation(s) in RCA: 110] [Impact Index Per Article: 6.1] [Received: 06/22/2006] [Accepted: 10/12/2006] [Indexed: 01/09/2023] Open
Abstract
Background: Despite their involvement in the regulation of gene expression and their importance as genomic markers for promoter prediction, no objective standard exists for defining CpG islands (CGIs), since all current approaches rely on a large parameter space formed by the thresholds of length, CpG fraction and G+C content. Results: Given the higher frequency of CpG dinucleotides at CGIs, as compared to bulk DNA, the distance distributions between neighboring CpGs should differ for bulk and island CpGs. A new algorithm (CpGcluster) is presented, based on the physical distance between neighboring CpGs on the chromosome and able to predict directly clusters of CpGs, while not depending on the subjective criteria mentioned above. By assigning a p-value to each of these clusters, the most statistically significant ones can be predicted as CGIs. CpGcluster was benchmarked against five other CGI finders by using a test sequence set assembled from an experimental CGI library. CpGcluster reached the highest overall accuracy values, while showing the lowest rate of false-positive predictions. Since a minimum-length threshold is not required, CpGcluster can find short but fully functional CGIs usually missed by other algorithms. The CGIs predicted by CpGcluster present the lowest degree of overlap with Alu retrotransposons and, simultaneously, the highest overlap with vertebrate Phylogenetic Conserved Elements (PhastCons). CpGcluster's CGIs overlapping with the Transcription Start Site (TSS) show the highest statistical significance, as compared to the islands in other genome locations, thus qualifying CpGcluster as a valuable tool in discriminating functional CGIs from the remaining islands in the bulk genome. Conclusion: CpGcluster uses only integer arithmetic, thus being a fast and computationally efficient algorithm able to predict statistically significant clusters of CpG dinucleotides.
Another outstanding feature is that all predicted CGIs start and end with a CpG dinucleotide, which should be appropriate for a genomic feature whose functionality is based precisely on CpG dinucleotides. The only search parameter in CpGcluster is the distance between two consecutive CpGs, in contrast to previous algorithms. Therefore, none of the main statistical properties of CpG islands (neither G+C content, CpG fraction nor length threshold) are needed as search parameters, which may lead to the high specificity and low overlap with spurious Alu elements observed for CpGcluster predictions.
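The distance-based idea in this abstract can be illustrated with a short sketch that chains neighboring CpGs whose spacing stays within a single distance threshold. This is not the published CpGcluster implementation: the function names, the `max_gap` and `min_cpgs` parameters, and the use of start-to-start distance are assumptions made here, and the p-value assignment step is omitted because the abstract does not describe how it is computed.

```python
def cpg_positions(seq):
    """0-based start positions of every CG dinucleotide in seq."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def cluster_cpgs(seq, max_gap, min_cpgs=2):
    """Group consecutive CpGs whose start-to-start distance is <= max_gap.

    Returns half-open (start, end) intervals. Each interval begins and ends
    with a CG dinucleotide, and integer arithmetic is all that is needed.
    """
    clusters = []
    current = []
    for pos in cpg_positions(seq):
        if current and pos - current[-1] > max_gap:
            # Gap too large: close the running cluster and start a new one.
            if len(current) >= min_cpgs:
                clusters.append((current[0], current[-1] + 2))
            current = []
        current.append(pos)
    if len(current) >= min_cpgs:
        clusters.append((current[0], current[-1] + 2))
    return clusters
```

For example, `cluster_cpgs("CGACG" + "T" * 10 + "CGCG", max_gap=5)` yields two clusters, one on each side of the T-rich stretch, without any length, G+C, or CpG-fraction threshold.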
Affiliation(s)
- Christopher Previti
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Spain
- Dept. of Molecular Biophysics, German Cancer Research Center, Heidelberg, Germany
- Pedro Carpena
- Dpto de Física Aplicada II, Universidad de Málaga, Spain
- José Martínez-Aroza
- Dpto. de Matemática Aplicada, Facultad de Ciencias, Universidad de Granada, Spain
- José L Oliver
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Spain
|
33
|
Gao F, Zhang CT. GC-Profile: a web-based tool for visualizing and analyzing the variation of GC content in genomic sequences. Nucleic Acids Res 2006; 34:W686-91. [PMID: 16845098 PMCID: PMC1538862 DOI: 10.1093/nar/gkl040] [Citation(s) in RCA: 114] [Impact Index Per Article: 6.3] [Indexed: 11/24/2022] Open
Abstract
In order to understand the evolution, structure and function of genomes, it is important to know the general compositional features of DNA sequences. Based on the quadratic divergence, a new segmentation algorithm to partition a given genome or DNA sequence into compositionally distinct domains has been put forward. With the aid of the technique of cumulative GC profile, the distribution of segmentation points can be displayed intuitively. We have therefore developed them into GC-Profile, an interactive web-based software system, which can be used to segment prokaryotic and eukaryotic genomes. GC-Profile provides a quantitative and qualitative view of genome organization. Based on the obtained results, the relationships between the G+C content and other genomic features, such as distributions of genes and CpG islands, can be analyzed in a perceivable manner. It shows that GC-Profile would be an appropriate starting point for analyzing the isochore structure of higher eukaryotic genomes, and an intuitive tool for identifying genomic islands in prokaryotic genomes. GC-Profile is freely available at the website . In addition, precompiled binaries, together with examples and documentation, can also be freely downloaded for a local execution.
Affiliation(s)
- Chun-Ting Zhang
- To whom correspondence should be addressed. Tel: +86 22 2740 2987; Fax: +86 22 2740 2697;
|
34
|
Tempel S, Giraud M, Lavenier D, Lerman IC, Valin AS, Couée I, Amrani AE, Nicolas J. Domain organization within repeated DNA sequences: application to the study of a family of transposable elements. Bioinformatics 2006; 22:1948-54. [PMID: 16809391 DOI: 10.1093/bioinformatics/btl337] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022] Open
Abstract
Motivation: The analysis of repeated elements in genomes is a fascinating domain of research that is lacking relevant tools for transposable elements (TEs), the most complex ones. The dynamics of TEs, which provides the main mechanism of mutation in some genomes, is an essential component of genome evolution. In this study we introduce a new concept of domain, a segmentation unit useful for describing the architecture of different copies of TEs. Our method extracts occurrences of a terminus-defined family of TEs, aligns the sequences, finds the domains in the alignment and searches the distribution of each domain in sequences. After a classification step relative to the presence or the absence of domains, the method results in a graphical view of sequences segmented into domains. Results: Analysis of the new non-autonomous TE AtREP21 in the model plant Arabidopsis thaliana reveals copies of very different sizes and various combinations of domains which show the potential of our method. Availability: DomainOrganizer web page is available at www.irisa.fr/symbiose/DomainOrganizer/.
Affiliation(s)
- Sébastien Tempel
- IRISA-INRIA, Campus de Beaulieu Bât 12 35042 Rennes cedex, France
|
35
|
Abstract
The availability of the complete chicken genome sequence provides an unprecedented opportunity to study the global genome organization at the sequence level. Delineating compositionally homogeneous G + C domains in DNA sequences can provide much insight into the understanding of the organization and biological functions of the chicken genome. A new segmentation algorithm, which is simple and fast, has been proposed to partition a given genome or DNA sequence into compositionally distinct domains. By applying the new segmentation algorithm to the draft chicken genome sequence, the mosaic organization of the chicken genome can be confirmed at the sequence level. It is shown herein that the chicken genome is also characterized by a mosaic structure of isochores, long DNA segments that are fairly homogeneous in the G + C content. Consequently, 25 isochores longer than 2 Mb (megabases) have been identified in the chicken genome. These isochores have a fairly homogeneous G + C content and often correspond to meaningful biological units. With the aid of the technique of cumulative GC profile, we proposed an intuitive picture to display the distribution of segmentation points. The relationships between G + C content and the distributions of genes (CpG islands, and other genomic elements) were analyzed in a perceivable manner. The cumulative GC profile, equipped with the new segmentation algorithm, would be an appropriate starting point for analyzing the isochore structures of higher eukaryotic genomes.
Affiliation(s)
- Feng Gao
- Department of Physics, Tianjin University, China
|
36
|
Nicorici D, Yli-Harja O, Astola J. Finding large domains of similarly expressed genes. A novel method using the MDL principle and the recursive segmentation procedure. IEEE ENGINEERING IN MEDICINE AND BIOLOGY MAGAZINE : THE QUARTERLY MAGAZINE OF THE ENGINEERING IN MEDICINE & BIOLOGY SOCIETY 2006; 25:82-9. [PMID: 16485395 DOI: 10.1109/memb.2006.1578667] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/06/2023]
Affiliation(s)
- Daniel Nicorici
- Institute of Signal Processing, Tampere University of Technology, Finland.
|
37
|
Zhang CT, Gao F, Zhang R. Segmentation algorithm for DNA sequences. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2005; 72:041917. [PMID: 16383430 DOI: 10.1103/physreve.72.041917] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Received: 03/07/2005] [Indexed: 05/05/2023]
Abstract
A new measure for quantifying the difference between two probability distributions, called the quadratic divergence, has been proposed. Based on the quadratic divergence, a new segmentation algorithm to partition a given genome or DNA sequence into compositionally distinct domains is put forward. The new algorithm has been applied to segment the 24 human chromosome sequences, and the boundaries of isochores for each chromosome were obtained. Compared with the results obtained by using the entropic segmentation algorithm based on the Jensen-Shannon divergence, both algorithms yielded identical coordinates for all segmentation points. An explanation of the equivalence of the two segmentation algorithms is presented. The new algorithm has a number of advantages. In particular, it is much simpler and faster than the entropy-based method. Therefore, the new algorithm is more suitable for analyzing long genome sequences, such as human and other newly sequenced eukaryotic genome sequences.
Affiliation(s)
- Chun-Ting Zhang
- Department of Physics, Tianjin University, Tianjin 300072, China.
|
38
|
Barral P J, Cantini L, Hasmy A, Jiménez J, Marcano A. Correlation between strand asymmetry and phylogeny in mitochondrial DNA. J Theor Biol 2005; 236:422-6. [PMID: 15927203 DOI: 10.1016/j.jtbi.2005.03.022] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 05/24/2004] [Revised: 03/17/2005] [Accepted: 03/17/2005] [Indexed: 11/25/2022]
Abstract
An evolutionary distance is introduced in order to propose an efficient and feasible procedure for phylogeny studies. Our analysis is based on the strand asymmetry property of mitochondrial DNA, but can be applied to other genomes. Comparison of our results with those reported in conventional phylogenetic trees gives confidence in our approximation. Our findings support the hypotheses about the origin of the skew and its dependence upon evolutionary pressures, and improve on previous efforts to use the strand asymmetry property of genomes for phylogeny inference. For the evolutionary distance introduced here, we observe that the more adequate technique for tree reconstruction corresponds to an average link method which employs a sequential clustering algorithm.
Affiliation(s)
- J Barral P
- Centro Nacional de Secuenciación y Análisis de Acidos Nucleicos CeSAAN, IVIC, Apartado Postal 21827, Caracas 1020A, Venezuela
|
39
|
Abstract
Sarment is a package of Python modules for easy building and manipulation of sequence segmentations. It provides efficient implementation of usual algorithms for hidden Markov Model computation, as well as for maximal predictive partitioning. Owing to its very large variety of criteria for computing segmentations, Sarment can handle many kinds of models. Because of object-oriented programming, the results of the segmentation are very easy to manipulate.
Affiliation(s)
- Laurent Guéguen
- Laboratoire Biométrie et Biologie Evolutive, (UMR 5558); (NRS); Univ Lyon 1, 43 bd 11 Nov, 69622 Villeurbanne cedex, France.
|
40
|
Luque-Escamilla PL, Martínez-Aroza J, Oliver JL, Gómez-Lopera JF, Román-Roldán R. Compositional searching of CpG islands in the human genome. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2005; 71:061925. [PMID: 16089783 DOI: 10.1103/physreve.71.061925] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 10/21/2004] [Revised: 01/31/2005] [Indexed: 05/03/2023]
Abstract
We report on an entropic edge detector based on the local calculation of the Jensen-Shannon divergence with application to the search for CpG islands. CpG islands are pieces of the genome related to gene expression and cell differentiation, and thus to cancer formation. Searching for these CpG islands is a major task in genetics and bioinformatics. Some algorithms have been proposed in the literature, based on moving statistics in a sliding window, but its size may greatly influence the results. The local use of Jensen-Shannon divergence is a completely different strategy: the nucleotide composition inside the islands is different from that in their environment, so a statistical distance--the Jensen-Shannon divergence--between the composition of two adjacent windows may be used as a measure of their dissimilarity. Sliding this double window over the entire sequence allows us to segment it compositionally. The fusion of those segments into greater ones that satisfy certain identification criteria must be achieved in order to obtain the definitive results. We find that the local use of Jensen-Shannon divergence is very suitable in processing DNA sequences for searching for compositionally different structures such as CpG islands, as compared to other algorithms in literature.
Affiliation(s)
- Pedro Luis Luque-Escamilla
- Department of Engineering and Mining Mechanics, University of Jaén, Escuela Politécnica Superior, Campus Las Lagunillas s/n, 23071 Jaén, Spain
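The double-window scan described in the Luque-Escamilla et al. abstract above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the window size, the equal weighting of the two windows, the restriction to the alphabet `ACGT`, and the function names (`composition`, `js_divergence`, `js_profile`) are all choices made here, and the later step of fusing segments into islands is not shown.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(freqs: Sequence[float]) -> float:
    """Shannon entropy in bits; zero frequencies contribute nothing."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)


def composition(window: str) -> List[float]:
    """Relative frequencies of A, C, G, T in a non-empty window.

    Assumption: symbols other than ACGT (e.g. N) are simply not counted.
    """
    counts = Counter(window)
    return [counts[base] / len(window) for base in "ACGT"]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence of two distributions with equal weights."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(mixture) - (entropy(p) + entropy(q)) / 2


def js_profile(seq: str, window: int) -> List[float]:
    """Divergence between the windows left and right of each position.

    Entry k of the result belongs to the boundary at position k + window.
    """
    return [
        js_divergence(composition(seq[i - window:i]),
                      composition(seq[i:i + window]))
        for i in range(window, len(seq) - window + 1)
    ]


if __name__ == "__main__":
    toy = "AAAAAAAACGCGCGCG"  # composition changes at position 8
    profile = js_profile(toy, window=4)
    boundary = profile.index(max(profile)) + 4
    print(boundary, round(max(profile), 3))
```

Peaks in the profile mark candidate compositional boundaries; as the abstract notes for sliding-window statistics in general, the chosen window size may strongly affect the result.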
|
41
|
Cohen N, Dagan T, Stone L, Graur D. GC composition of the human genome: in search of isochores. Mol Biol Evol 2005; 22:1260-72. [PMID: 15728737 DOI: 10.1093/molbev/msi115] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.2] [Indexed: 11/14/2022] Open
Abstract
The isochore theory, proposed nearly three decades ago, depicts the mammalian genome as a mosaic of long, fairly homogeneous genomic regions that are characterized by their guanine and cytosine (GC) content. The human genome, for instance, was claimed to consist of five distinct isochore families: L1, L2, H1, H2, and H3, with GC contents of <37%, 37%-42%, 42%-47%, 47%-52%, and >52%, respectively. In this paper, we address the question of the validity of the isochore theory through a rigorous sequence-based analysis of the human genome. Toward this end, we adopt a set of six attributes that are generally claimed to characterize isochores and statistically test their veracity against the available draft sequence of the complete human genome. By the selection criteria used in this study: distinctiveness, homogeneity, and minimal length of 300 kb, we identify 1,857 genomic segments that warrant the label "isochore." These putative isochores are nonuniformly scattered throughout the genome and cover about 41% of the human genome. We found that a four-family model of putative isochores is the most parsimonious multi-Gaussian model that can be fitted to the empirical data. These families, however, are GC poor, with mean GC contents of 35%, 38%, 41%, and 48% and do not resemble the five isochore families in the literature. Moreover, due to large overlaps among the families, it is impossible to classify genomic segments into isochore families reliably, according to compositional properties alone. These findings undermine the utility of the isochore theory and seem to indicate that the theory may have reached the limits of its usefulness as a description of genomic compositional structures.
Affiliation(s)
- Netta Cohen
- School of Computing, University of Leeds, Leeds, United Kingdom
|
42
|
Zhang CT, Zhang R. Isochore structures in the mouse genome. Genomics 2004; 83:384-94. [PMID: 14962664 DOI: 10.1016/j.ygeno.2003.09.011] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Received: 05/28/2003] [Accepted: 09/04/2003] [Indexed: 10/26/2022]
Abstract
The distribution of the G+C content in the mouse genome has been studied using a windowless technique. We have found that: (i). Abrupt variations of the G+C content from a GC-rich region to a GC-poor region, and vice versa, occur frequently at some sites along the sequence of the mouse genome. (ii). Long domains with relatively homogeneous G+C content (isochores) exist, which usually have sharp boundaries. Consequently, 28 isochores longer than 1 Mb have been identified in the mouse genome. A homogeneity index was used to quantify the variations of the G+C content within isochores. The precise boundaries, sizes, and G+C contents of these isochores have been determined. The windowless technique for the G+C content computation was also used to analyze the DNA sequence containing the mouse MHC region, which has a GC-poor isochore. This isochore is located at the central part of the sequence with boundaries at 468459 and 812716 bp, where the sequence is extended from the centromeric end to the telomeric end. In addition, the analysis of a segment of the rat genome shows that the rat genome also has clear isochore structures.
Affiliation(s)
- Chun-Ting Zhang
- Department of Physics, Tianjin University, Tianjin 300072, China.
|
43
|
Li W, Holste D. An unusual 500,000 bases long oscillation of guanine and cytosine content in human chromosome 21. Comput Biol Chem 2004; 28:393-9. [PMID: 15556480 DOI: 10.1016/j.compbiolchem.2004.09.011] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 09/17/2004] [Revised: 09/30/2004] [Accepted: 09/30/2004] [Indexed: 01/09/2023]
Abstract
An oscillation with a period of around 500 kb in guanine and cytosine content (GC%) is observed in the DNA sequence of human chromosome 21. This oscillation is localized in the rightmost one-eighth region of the chromosome, from 43.5 Mb to 46.5 Mb. Five cycles of oscillation are observed in this region with six GC-rich peaks and five GC-poor valleys. The GC-poor valleys comprise regions with low density of CpG islands and, alternating between the two DNA strands, low gene density regions. Consequently, the long-range oscillation of GC% results in spacing patterns of CpG island density and, to a lesser extent, of gene density.
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, North Shore LIJ Institute for Medical Research, 350 Community Drive, Manhasset, NY 11030, USA.
|
44
|
Csurös M. Maximum-scoring segment sets. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2004; 1:139-50. [PMID: 17051696 DOI: 10.1109/tcbb.2004.43] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 05/12/2023]
Abstract
We examine the problem of finding maximum-scoring sets of disjoint segments in a sequence of scores. The problem arises in DNA and protein segmentation and in postprocessing of sequence alignments. Our key result states a simple recursive relationship between maximum-scoring segment sets. The statement leads to fast algorithms for finding such segment sets. We apply our methods to the identification of noncoding RNA genes in thermophiles.
Affiliation(s)
- Miklós Csurös
- Département d'informatique et de recherche opérationnelle, Université de Montréal, C.P. 6128, succ. Centre-Ville, Montréal, Qué. H3C 3J7, Canada.
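To make the problem in the Csurös abstract above concrete, here is a minimal sketch that selects disjoint segments of a score sequence. It is a plain two-state dynamic program, not the recursive relationship the paper reports. It assumes the startup-penalty formulation given in the Brejová et al. abstract at the top of this listing (each chosen segment scores its sum minus a non-negative penalty c); the function name `max_scoring_segments` and the half-open `(start, end)` convention are illustrative.

```python
from typing import List, Sequence, Tuple


def max_scoring_segments(
    scores: Sequence[float], penalty: float
) -> Tuple[float, List[Tuple[int, int]]]:
    """Best total score and the chosen segments as half-open (start, end) pairs.

    Each chosen segment contributes the sum of its scores minus `penalty`
    (assumed non-negative).
    """
    n = len(scores)
    neg_inf = float("-inf")
    # outside[i]: best total over scores[:i] with position i-1 in no segment
    # inside[i]:  best total over scores[:i] with position i-1 in a segment
    outside = [0.0] + [neg_inf] * n
    inside = [neg_inf] * (n + 1)
    opened = [False] * (n + 1)  # True if the segment covering i-1 starts there

    for i in range(1, n + 1):
        outside[i] = max(outside[i - 1], inside[i - 1])
        extend = inside[i - 1]            # continue the current segment
        start = outside[i - 1] - penalty  # open a new segment at i-1
        opened[i] = start >= extend
        inside[i] = max(extend, start) + scores[i - 1]

    best = max(outside[n], inside[n])

    # Trace the decisions back from the end to recover the segments.
    segments: List[Tuple[int, int]] = []
    in_segment = inside[n] > outside[n]
    end = n
    for i in range(n, 0, -1):
        if in_segment:
            if opened[i]:
                segments.append((i - 1, end))
                in_segment = False
        elif inside[i - 1] > outside[i - 1]:
            in_segment = True
            end = i - 1
    segments.reverse()
    return best, segments


if __name__ == "__main__":
    print(max_scoring_segments([2, -1, 3, -5, 4], penalty=2))
```

The sketch runs in time linear in the sequence length. With a larger penalty, nearby positive runs tend to be merged across small negative gaps, because paying the gap costs less than opening a second segment.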
|
45
|
Bernaola-Galván P, Oliver JL, Carpena P, Clay O, Bernardi G. Quantifying intrachromosomal GC heterogeneity in prokaryotic genomes. Gene 2004; 333:121-33. [PMID: 15177687 DOI: 10.1016/j.gene.2004.02.042] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.3] [Received: 08/29/2003] [Revised: 11/14/2003] [Accepted: 02/10/2004] [Indexed: 11/15/2022]
Abstract
The sequencing of prokaryotic genomes covering a wide taxonomic range has sparked renewed interest in intrachromosomal compositional (GC) heterogeneity, largely in view of lateral transfers. We present here a brief overview of some methods for visualizing and quantifying GC variation in prokaryotes. We used these methods to examine heterogeneity levels in sequenced prokaryotes, for a range of scales or stringencies. Some species are consistently homogeneous, whereas others are markedly heterogeneous in comparison, in particular Aeropyrum pernix, Xylella fastidiosa, Mycoplasma genitalium, Enterococcus faecalis, Bacillus subtilis, Pyrobaculum aerophilum, Vibrio vulnificus chromosome I, Deinococcus radiodurans chromosome II and Halobacterium. As we discuss here, the wide range of heterogeneities calls for reexamination of an accepted belief, namely that the endogenous DNA of bacteria and archaea should typically exhibit low intrachromosomal GC contrasts. Supplementary results for all species analyzed are available at our website: http://bioinfo2.ugr.es/prok.
|
46
|
Zhang R, Zhang CT. Isochore Structures in the Genome of the Plant Arabidopsis thaliana. J Mol Evol 2004; 59:227-38. [PMID: 15486696 DOI: 10.1007/s00239-004-2617-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Received: 12/08/2003] [Accepted: 02/10/2004] [Indexed: 10/26/2022]
Abstract
Arabidopsis thaliana is an important model system for the study of plant biology. We have analyzed the complete genome sequences of Arabidopsis by using a newly developed windowless method for the GC content computation, the cumulative GC profile. It is shown that the Arabidopsis genome is organized into a mosaic structure of isochores. All the centromeric regions are located in GC-rich isochores, called centromere-isochores, which are characterized by a high GC content but low gene and T-DNA insertion densities. This characteristic distinguishes centromere-isochores from the other class of GC-rich isochores, called GC-isochores, which have high gene and T-DNA insertion densities. Consequently, 15 isochores have been identified, i.e., 7 AT-isochores, 3 GC-isochores, and 5 centromere-isochores. The genes in centromere-isochores, which have the highest GC content, have much shorter intron lengths and lower intron numbers, compared to those of the other two types. There is also considerable difference in the numbers and lengths of transposable elements (TEs) between AT and GC-isochores, i.e., the TE number (length) of AT-isochores is 6.3 (7.3) times that of GC-isochores. It is generally believed that TEs are accumulated in the regions surrounding the centromeres. However, within these TE-rich regions, there are regions of extremely low TE numbers (TE deserts), which correspond to the positions of centromere-isochores. In addition, a heterochromatic knob is located at the boundary of an AT-isochore. Furthermore, we show that the differences in GC content among isochores are mainly due to the GC content variation of introns, the third codon positions and intergenic regions.
Affiliation(s)
- Ren Zhang
- Department of Epidemiology and Biostatistics, Tianjin Cancer Institute and Hospital, 300060 Tianjin, China
|
47
|
Krishnamachari A, moy Mandal V. Study of DNA binding sites using the Rényi parametric entropy measure. J Theor Biol 2004; 227:429-36. [PMID: 15019509 DOI: 10.1016/j.jtbi.2003.11.026] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 06/27/2003] [Revised: 11/06/2003] [Accepted: 11/17/2003] [Indexed: 10/26/2022]
Abstract
Shannon's definition of uncertainty or surprisal has been applied extensively to measure the information content of aligned DNA sequences and to characterize DNA binding sites. In contrast to Shannon's uncertainty, this study investigates the applicability and suitability of a parametric uncertainty measure due to Rényi. It is observed that this measure also provides results in agreement with Shannon's measure, pointing to its utility in analysing DNA binding site regions. To facilitate the comparison between these uncertainty measures, a dimensionless quantity called "redundancy" has been employed. It is found that Rényi's measure at low parameter values possesses a better delineating feature of binding sites (of binding regions) than Shannon's measure. The critical value of the parameter is chosen with an outlier criterion.
Affiliation(s)
- A Krishnamachari
- Bioinformatics Centre, Jawaharlal Nehru University, New Delhi 110 067, India
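For readers unfamiliar with the measure named in the Krishnamachari and Mandal abstract above, the following sketch computes the standard Rényi entropy of order alpha, H_alpha = log2(sum of p_i^alpha) / (1 - alpha), for each column of a set of aligned sites, falling back to Shannon entropy at alpha = 1. It is not the authors' code: the function names, the `ACGT` alphabet, and the toy alignment are assumptions, and the paper's "redundancy" quantity and outlier criterion are not reproduced because the abstract does not define them.

```python
import math
from collections import Counter
from typing import List, Sequence


def renyi_entropy(probs: Sequence[float], alpha: float) -> float:
    """Rényi entropy of order alpha in bits (alpha >= 0)."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    if math.isclose(alpha, 1.0):
        # The limit alpha -> 1 is the Shannon entropy.
        return -sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)


def column_entropies(sites: Sequence[str], alpha: float) -> List[float]:
    """Rényi entropy of the base frequencies in each alignment column.

    Assumption: all sites have equal length and use only A, C, G, T.
    """
    result = []
    for column in zip(*sites):
        counts = Counter(column)
        probs = [counts[base] / len(column) for base in "ACGT"]
        result.append(renyi_entropy(probs, alpha))
    return result


if __name__ == "__main__":
    toy_sites = ["AC", "AG", "AT", "AA"]  # column 0 conserved, column 1 uniform
    print(column_entropies(toy_sites, alpha=0.5))
```

A fully conserved column has entropy 0 and a uniform column has entropy 2 bits for every alpha; the orders differ only on intermediate columns, which is where the choice of parameter matters.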
|
48
|
Wen SY, Zhang CT. Identification of isochore boundaries in the human genome using the technique of wavelet multiresolution analysis. Biochem Biophys Res Commun 2004; 311:215-22. [PMID: 14575716 DOI: 10.1016/j.bbrc.2003.09.198] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Indexed: 11/30/2022]
Abstract
Combined with the Z curve method, the technique of wavelet multiresolution (also known as multiscale) analysis has been proposed to identify the boundaries of isochores in the human genome. The human MHC sequence and the longest contigs of human chromosomes 21 and 22 are used as examples. The boundary between the isochores of Class III and Class II in the MHC sequence has been detected and found to be situated at position 2,490,368 bp. This result is in good agreement with the experimental evidence. An isochore with a length of about 7 Mb in chromosome 21 has been identified and found to be gene- and Alu-poor. We have also found that the G+C content of chromosome 21 is more homogeneous than that of chromosome 22. Compared with the window-based methods, the present method has the highest resolution for identifying the boundaries of isochores, even at the scale of a single base. Compared with the entropic segmentation method, the present method is more intuitive and requires fewer calculations. The important conclusion drawn in this study is that the segmentation points, at which the G+C content undergoes relatively dramatic changes, do exist in the human genome. These 'singularity' points may be considered candidate isochore boundaries in the human genome. The method presented is a general one and can be used to analyze any other genome.
|
49
|
Abstract
The distribution of the G+C content in the human genome has been studied by using a windowless technique derived from the Z curve method. The most important findings presented in this paper are twofold. First, abrupt variations of the G+C content along human chromosome sequences are the main variation patterns of G+C content. It is found that at some sites, the G+C content changes abruptly from a G+C-rich region to a G+C-poor region, and vice versa. Second, it is shown that long domains with relatively homogeneous G+C content along each chromosome do exist. These domains are thought to be isochores, which usually have sharp boundaries. Consequently, 56 isochores longer than 3 Mb have been identified in chromosomes 1-22, X and Y. Boundaries, size and G+C content of each isochore identified are listed in detail. As an example to demonstrate the power of the method, the boundary between the Class III and Class II isochores of the MHC sequence has been determined and found to be at 2,477,936, which is in good agreement with the experimental evidence. A homogeneity index is introduced to measure the homogeneity of G+C content in isochores. We emphasize that the homogeneity of G+C content is relative. Isochores in which the G+C content stays absolutely constant do not exist. Isochore structures appear to be a basic organization of the human genome. Due to the relevance to many important biological functions, the clarification of isochore structures will provide much insight into the understanding of the human genome.
Affiliation(s)
- Chun-Ting Zhang
- Department of Physics, Tianjin University, Nankai District, Tianjin 300072, China.
|
50
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2003. [PMCID: PMC2447381 DOI: 10.1002/cfg.226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022] Open
|