1
Sheinman M, Arndt PF, Massip F. Modeling the mosaic structure of bacterial genomes to infer their evolutionary history. Proc Natl Acad Sci U S A 2024; 121:e2313367121. [PMID: 38517978 PMCID: PMC10990148 DOI: 10.1073/pnas.2313367121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 01/30/2024] [Indexed: 03/24/2024]
Abstract
The chronology and phylogeny of bacterial evolution are difficult to reconstruct due to a scarce fossil record. The analysis of bacterial genomes remains challenging because of large sequence divergence, the plasticity of bacterial genomes due to frequent gene loss, horizontal gene transfer, and differences in selective pressure from one locus to another. Therefore, taking advantage of the rich and rapidly accumulating genomic data requires accurate modeling of genome evolution. An important technical consideration is that loci with high effective mutation rates may diverge beyond the detection limit of the alignment algorithms used, biasing the genome-wide divergence estimates toward smaller divergences. In this article, we propose a novel method to gain insight into bacterial evolution based on statistical properties of genome comparisons. We find that the length distribution of sequence matches is shaped by the effective mutation rates of different loci, by the horizontal transfers, and by the aligner sensitivity. Based on these inputs, we build a model and show that it accounts for the empirically observed distributions, taking the Enterobacteriaceae family as an example. Our method makes it possible to distinguish segments of vertical and horizontal origins and to estimate the time divergence and exchange rate between any pair of taxa from genome-wide alignments. Based on the estimated time divergences, we construct a time-calibrated phylogenetic tree to demonstrate the accuracy of the method.
Affiliation(s)
- Michael Sheinman
  - Institute for Advanced Studies, Sevastopol State University, Sevastopol 299053, Crimea
- Peter F. Arndt
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin 12163, Germany
- Florian Massip
  - Department U900, Centre for Computational Biology, Mines Paris, PSL University, Paris 75006, France
  - Department U900, Institut Curie, Université Paris Sciences et Lettres, Paris 75005, France
  - INSERM, U900, Paris 75005, France
2
Costa MO, Silva R, Anselmo DHAL. Superstatistical and DNA sequence coding of the human genome. Phys Rev E 2022; 106:064407. [PMID: 36671113 DOI: 10.1103/physreve.106.064407] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/06/2022] [Accepted: 11/16/2022] [Indexed: 12/14/2022]
Abstract
In this work, by considering superstatistics we investigate the short-range correlations (SRCs) and the fluctuations in the distribution of lengths of strings of nucleotides. To this end, a stochastic model provides the distributions of the size of the exons based on the q-Gamma and inverse q-Gamma distributions. Specifically, we define a time series for exon sizes to investigate the SRC and the fluctuations through the superstatistics distributions. To test the model's viability, we use the Project Ensembl database of genes to extract the time evolution of exon sizes, calculated in terms of the number of base pairs (bp) in these biological databases. Our findings show that, depending on the chromosome, both distributions are suitable for describing the length distribution of human DNA for lengths greater than 10 bp. In addition, we used Bayesian statistics to perform a selection model approach, which revealed weak evidence for the inverse q-Gamma distribution for a considerable number of chromosomes.
Affiliation(s)
- M O Costa
  - Departamento de Física, Universidade Federal do Rio Grande do Norte, Natal - RN, 59072-970, Brasil
- R Silva
  - Departamento de Física, Universidade Federal do Rio Grande do Norte, Natal - RN, 59072-970, Brasil
  - Programa de Pós-Graduação em Física, Universidade do Estado do Rio Grande do Norte, Mossoró - Rio Grande do Norte, 59610-210, Brasil
- D H A L Anselmo
  - Departamento de Física, Universidade Federal do Rio Grande do Norte, Natal - RN, 59072-970, Brasil
  - Programa de Pós-Graduação em Física, Universidade do Estado do Rio Grande do Norte, Mossoró - Rio Grande do Norte, 59610-210, Brasil
3
Abstract
Data from a long-term evolution experiment with Escherichia coli and from a large study on copy number variations in subjects with European ancestry are analyzed in order to argue that mutations can be described as Lévy flights in the mutation space. These Lévy flights have at least two components: random single-base substitutions and large DNA rearrangements. From the data, we obtain estimates for the rates of both events and for the size distribution function of large rearrangements.
Affiliation(s)
- Dario A Leon
  - University of Modena & Reggio Emilia, 41125, Modena, Italy
  - Institute of Cybernetics, Mathematics and Physics, 10400, Havana, Cuba
- Augusto Gonzalez
  - Institute of Cybernetics, Mathematics and Physics, 10400, Havana, Cuba
  - University of Electronic Science and Technology, Chengdu, 610051, People's Republic of China
4
McCole RB, Erceg J, Saylor W, Wu CT. Ultraconserved Elements Occupy Specific Arenas of Three-Dimensional Mammalian Genome Organization. Cell Rep 2019; 24:479-488. [PMID: 29996107 DOI: 10.1016/j.celrep.2018.06.031] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 12/11/2017] [Revised: 05/09/2018] [Accepted: 06/07/2018] [Indexed: 12/23/2022]
Abstract
This study explores the relationship between three-dimensional genome organization and ultraconserved elements (UCEs), an enigmatic set of DNA elements that are perfectly conserved between the reference genomes of distantly related species. Examining both human and mouse genomes, we interrogate the relationship of UCEs to three features of chromosome organization derived from Hi-C studies. We find that UCEs are enriched within contact domains and, further, that the subset of UCEs within domains shared across diverse cell types are linked to kidney-related and neuronal processes. In boundaries, UCEs are generally depleted, with those that do overlap boundaries being overrepresented in exonic UCEs. Regarding loop anchors, UCEs are neither overrepresented nor underrepresented, but those present in loop anchors are enriched for splice sites. Finally, as the relationships between UCEs and human Hi-C features are conserved in mouse, our findings suggest that UCEs contribute to interspecies conservation of genome organization and, thus, genome stability.
Affiliation(s)
- Ruth B McCole
  - Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
- Jelena Erceg
  - Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
- Wren Saylor
  - Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
- Chao-Ting Wu
  - Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
5
Ayad LAK, Pissis SP, Polychronopoulos D. CNEFinder: finding conserved non-coding elements in genomes. Bioinformatics 2019; 34:i743-i747. [PMID: 30423090 PMCID: PMC6129273 DOI: 10.1093/bioinformatics/bty601] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 12/20/2022]
Abstract
Motivation: Conserved non-coding elements (CNEs) represent an enigmatic class of genomic elements which, despite being extremely conserved across evolution, do not encode proteins. Their functions are still largely unknown. Thus, there exists a need to systematically investigate their roles in genomes. Towards this direction, identifying sets of CNEs in a wide range of organisms is an important first step. Currently, there are no tools published in the literature for systematically identifying CNEs in genomes. Results: We fill this gap by presenting CNEFinder, a tool for identifying CNEs between two given DNA sequences with user-defined criteria. The results presented here show the tool’s ability to identify CNEs accurately and efficiently. CNEFinder is based on a k-mer technique for computing maximal exact matches. The tool thus does not require or compute whole-genome alignments or indexes, such as the suffix array or the Burrows Wheeler Transform (BWT), which makes it flexible to use on a wide scale. Availability and implementation: Free software under the terms of the GNU GPL (https://github.com/lorrainea/CNEFinder).
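The k-mer idea mentioned in this abstract can be illustrated with a minimal Python sketch. It is not CNEFinder's implementation: the function name `maximal_exact_matches` and its parameters are hypothetical, and the approach shown (index the k-mers of one sequence, then extend each left-maximal seed to the right) is only one simple way to obtain exact matches of length at least k without a suffix array or BWT.

```python
from collections import defaultdict


def maximal_exact_matches(a: str, b: str, k: int):
    """Return sorted (start_a, start_b, length) triples for every maximal
    exact match of length >= k between sequences a and b.

    Hypothetical helper for illustration only; not the CNEFinder algorithm.
    """
    if k < 1:
        raise ValueError("k must be positive")

    # Index every k-mer of `a` by the positions where it starts.
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)

    matches = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            # Keep only left-maximal seeds: if the preceding characters
            # agree, this seed lies inside a match reported from an
            # earlier start position.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend the seed to the right while characters agree.
            length = k
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            matches.append((i, j, length))
    return sorted(matches)


if __name__ == "__main__":
    for start_a, start_b, length in maximal_exact_matches("ACGTACGT", "TTACGTAA", 3):
        print(start_a, start_b, "ACGTACGT"[start_a:start_a + length])
```

The dictionary index uses memory proportional to the number of k-mers in the first sequence, which is the trade-off this kind of seeding makes in exchange for not building a full-text index.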
Affiliation(s)
- Solon P Pissis
  - Department of Informatics, King's College London, London, UK
6
Abstract
Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology, in terms of both efficiency and affordability. These developments have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic data sets are being generated. This poses a significant challenge for the storage and transmission of these data. Already, it is more expensive to store genomic data for a decade than it is to obtain the data in the first place. This situation calls for efficient representations of genomic information. In this review, we emphasize the need for designing specialized compressors tailored to genomic data and describe the main solutions already proposed. We also give general guidelines for storing these data and conclude with our thoughts on the future of genomic formats and compressors.
Affiliation(s)
- Mikel Hernaez
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
- Dmitri Pavlichin
  - Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Tsachy Weissman
  - Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Idoia Ochoa
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
7
Metze K, Adam R, Florindo JB. The fractal dimension of chromatin - a potential molecular marker for carcinogenesis, tumor progression and prognosis. Expert Rev Mol Diagn 2019; 19:299-312. [DOI: 10.1080/14737159.2019.1597707] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 12/13/2022]
Affiliation(s)
- Konradin Metze
  - Department of Pathology, Faculty of Medical Sciences, State University of Campinas (UNICAMP), Campinas, Brazil
- Randall Adam
  - Department of Pathology, Faculty of Medical Sciences, State University of Campinas (UNICAMP), Campinas, Brazil
- João Batista Florindo
  - Department of Applied Mathematics, Institute of Mathematics, Statistics and Scientific Computing, State University of Campinas, Campinas, Brazil
8
Oswald JA, Harvey MG, Remsen RC, Foxworth DU, Dittmann DL, Cardiff SW, Brumfield RT. Evolutionary dynamics of hybridization and introgression following the recent colonization of Glossy Ibis (Aves: Plegadis falcinellus) into the New World. Mol Ecol 2019; 28:1675-1691. [DOI: 10.1111/mec.15008] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 03/30/2018] [Revised: 12/07/2018] [Accepted: 12/19/2018] [Indexed: 01/03/2023]
Affiliation(s)
- Jessica A. Oswald
  - Museum of Natural Science, Louisiana State University, Baton Rouge, Louisiana
  - Florida Museum of Natural History, University of Florida, Gainesville, Florida
- Michael G. Harvey
  - Museum of Natural Science, Louisiana State University, Baton Rouge, Louisiana
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, Louisiana
- Rosalind C. Remsen
  - Museum of Natural Science, Louisiana State University, Baton Rouge, Louisiana
- DePaul U. Foxworth
  - Museum of Natural Science, Louisiana State University, Baton Rouge, Louisiana
- Donna L. Dittmann
  - Museum of Natural Science, Louisiana State University, Baton Rouge, Louisiana
- Steven W. Cardiff
  - Museum of Natural Science, Louisiana State University, Baton Rouge, Louisiana
- Robb T. Brumfield
  - Museum of Natural Science, Louisiana State University, Baton Rouge, Louisiana
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, Louisiana
9
Apostolou-Karampelis K, Polychronopoulos D, Almirantis Y. Introduction of 'Generalized Genomic Signatures' for the quantification of neighbour preferences leads to taxonomy- and functionality-based distinction among sequences. Sci Rep 2019; 9:1700. [PMID: 30737442 PMCID: PMC6368578 DOI: 10.1038/s41598-018-38157-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2018] [Accepted: 12/06/2018] [Indexed: 11/16/2022]
Abstract
Analysis of DNA composition at several length scales constitutes the bulk of many early studies aimed at unravelling the complexity of the organization and functionality of genomes. Dinucleotide relative abundances are considered an idiosyncratic feature of genomes, regarded as a ‘genomic signature’. Motivated by this finding, we introduce the ‘Generalized Genomic Signatures’ (GGSs), composed of over- and under-abundances of all oligonucleotides of a given length, thus filtering out compositional trends and neighbour preferences at any shorter range. Previous works on alignment-free genomic comparisons mostly rely on k-mer frequencies and not on distance-dependent neighbour preferences. Therein, nucleotide composition and proximity preferences are combined, while in the present work they are strictly separated, focusing uniquely on neighbour relationships. GGSs retain the potential or even outperform genomic signatures defined at the dinucleotide level in distinguishing between taxonomic subdivisions of bacteria, and can be more effectively implemented in microbial phylogenetic reconstruction. Moreover, we compare DNA sequences from the human genome corresponding to protein coding segments, conserved non-coding elements and non-functional DNA stretches. These classes of sequences have distinctive GGSs according to their genomic role and degree of conservation. Overall, GGSs constitute a trait characteristic of the evolutionary origin and functionality of different genomic segments.
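As a point of reference for the dinucleotide-level "genomic signature" mentioned in this abstract, the sketch below computes dinucleotide relative abundances in Python. It assumes the commonly used odds-ratio definition, ρ_XY = f_XY / (f_X · f_Y), evaluated on a single strand; the abstract does not state a formula, and the generalized signatures (GGSs) introduced in the article are not reproduced here. The function name `dinucleotide_signature` is hypothetical.

```python
from collections import Counter
from itertools import product


def dinucleotide_signature(seq: str) -> dict:
    """Map each of the 16 dinucleotides XY to rho_XY = f_XY / (f_X * f_Y).

    Assumed definition (single strand, overlapping dinucleotides); values
    above 1 indicate over-abundance and values below 1 under-abundance.
    """
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two nucleotides")

    mono = Counter(seq)                                       # single-base counts
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))   # overlapping pairs
    n_mono, n_di = len(seq), len(seq) - 1

    signature = {}
    for x, y in product("ACGT", repeat=2):
        f_x, f_y = mono[x] / n_mono, mono[y] / n_mono
        f_xy = di[x + y] / n_di
        # The ratio is undefined when either base is absent from the sequence.
        signature[x + y] = f_xy / (f_x * f_y) if f_x > 0 and f_y > 0 else float("nan")
    return signature


if __name__ == "__main__":
    sig = dinucleotide_signature("ACACACAC")
    print(round(sig["AC"], 4), sig["AA"])
```

Two sequences can then be compared by, for example, the mean absolute difference between their 16 values, although the choice of distance is a further assumption.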
Affiliation(s)
- Yannis Almirantis
  - Institute of Biosciences and Applications, National Center for Scientific Research "Demokritos", 15310, Athens, Greece
10
Polychronopoulos D, King JWD, Nash AJ, Tan G, Lenhard B. Conserved non-coding elements: developmental gene regulation meets genome organization. Nucleic Acids Res 2018; 45:12611-12624. [PMID: 29121339 PMCID: PMC5728398 DOI: 10.1093/nar/gkx1074] [Citation(s) in RCA: 57] [Impact Index Per Article: 9.5] [Received: 07/26/2017] [Accepted: 10/24/2017] [Indexed: 12/20/2022]
Abstract
Comparative genomics has revealed a class of non-protein-coding genomic sequences that display an extraordinary degree of conservation between two or more organisms, regularly exceeding that found within protein-coding exons. These elements, collectively referred to as conserved non-coding elements (CNEs), are non-randomly distributed across chromosomes and tend to cluster in the vicinity of genes with regulatory roles in multicellular development and differentiation. CNEs are organized into functional ensembles called genomic regulatory blocks–dense clusters of elements that collectively coordinate the expression of shared target genes, and whose span in many cases coincides with topologically associated domains. CNEs display sequence properties that set them apart from other sequences under constraint, and have recently been proposed as useful markers for the reconstruction of the evolutionary history of organisms. Disruption of several of these elements is known to contribute to diseases linked with development, and cancer. The emergence, evolutionary dynamics and functions of CNEs still remain poorly understood, and new approaches are required to enable comprehensive CNE identification and characterization. Here, we review current knowledge and identify challenges that need to be tackled to resolve the impasse in understanding extreme non-coding conservation.
Affiliation(s)
- Dimitris Polychronopoulos
  - Computational Regulatory Genomics Group, MRC London Institute of Medical Sciences, Du Cane Road, London W12 0NN, UK
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Campus, Du Cane Road, London W12 0NN, UK
- James W D King
  - Computational Regulatory Genomics Group, MRC London Institute of Medical Sciences, Du Cane Road, London W12 0NN, UK
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Campus, Du Cane Road, London W12 0NN, UK
- Alexander J Nash
  - Computational Regulatory Genomics Group, MRC London Institute of Medical Sciences, Du Cane Road, London W12 0NN, UK
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Campus, Du Cane Road, London W12 0NN, UK
- Ge Tan
  - Computational Regulatory Genomics Group, MRC London Institute of Medical Sciences, Du Cane Road, London W12 0NN, UK
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Campus, Du Cane Road, London W12 0NN, UK
- Boris Lenhard
  - Computational Regulatory Genomics Group, MRC London Institute of Medical Sciences, Du Cane Road, London W12 0NN, UK
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Campus, Du Cane Road, London W12 0NN, UK
  - Sars International Centre for Marine Molecular Biology, University of Bergen, Thormøhlensgate 55, N-5008 Bergen, Norway
11
Almirantis Y, Charalampopoulos P, Gao J, Iliopoulos CS, Mohamed M, Pissis SP, Polychronopoulos D. On avoided words, absent words, and their application to biological sequence analysis. Algorithms Mol Biol 2017; 12:5. [PMID: 28293277 PMCID: PMC5348888 DOI: 10.1186/s13015-017-0094-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 11/14/2016] [Accepted: 03/02/2017] [Indexed: 11/10/2022]
Abstract
Background: The deviation of the observed frequency of a word w from its expected frequency in a given sequence x is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the deviation of w, denoted by $\textit{dev}(w)$, effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word w of length $k>2$ is a $\rho$-avoided word in x if $\textit{dev}(w) \le \rho$, for a given threshold $\rho < 0$. Notice that such a word may be completely absent from x. Hence, computing all such words naïvely can be a very time-consuming procedure, in particular for large k. Results: In this article, we propose an $\mathcal{O}(n)$-time and $\mathcal{O}(n)$-space algorithm to compute all $\rho$-avoided words of length k in a given sequence of length n over a fixed-sized alphabet. We also present a time-optimal $\mathcal{O}(\sigma n)$-time algorithm to compute all $\rho$-avoided words (of any length) in a sequence of length n over an integer alphabet of size $\sigma$. In addition, we provide a tight asymptotic upper bound for the number of $\rho$-avoided words over an integer alphabet and the expected length of the longest one. We make available an implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency and applicability of our implementation in biological sequence analysis. Conclusions: The systematic search for avoided words is particularly useful for biological sequence analysis. We present a linear-time and linear-space algorithm for the computation of avoided words of length k in a given sequence x. We suggest a modification to this algorithm so that it computes all avoided words of x, irrespective of their length, within the same time complexity. We also present combinatorial results with regards to avoided words and absent words.
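To make the definition in this abstract concrete, here is a naive Python sketch that enumerates ρ-avoided words of a fixed length k by direct counting. It is not the linear-time algorithm the article proposes. The abstract does not give the formula for the deviation, so the code assumes a commonly used form: the expected frequency is E(w) = f(w_p) · f(w_s) / f(w_i), where w_p, w_s and w_i are the longest proper prefix, longest proper suffix and infix of w, and dev(w) = (f(w) − E(w)) / max(√E(w), 1). The function names and the `alphabet` parameter are hypothetical.

```python
from collections import Counter
from math import sqrt


def count_kmers(x: str, k: int) -> Counter:
    """Count overlapping occurrences of every length-k substring of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))


def avoided_words(x: str, k: int, rho: float, alphabet: str = "ACGT") -> dict:
    """Return {word: dev(word)} for all rho-avoided words of length k in x.

    Naive counting sketch under the assumed deviation formula; absent
    words (observed frequency 0) can be reported as well.
    """
    if k <= 2 or rho >= 0:
        raise ValueError("requires k > 2 and rho < 0")

    f_k = count_kmers(x, k)
    f_prefix_suffix = count_kmers(x, k - 1)
    f_infix = count_kmers(x, k - 2)

    result = {}
    # A word can only have E(w) > 0 if its (k-1)-prefix and (k-1)-suffix
    # both occur in x, so only one-letter extensions of occurring
    # (k-1)-mers are examined.
    for prefix in list(f_prefix_suffix):
        for c in alphabet:
            w = prefix + c
            suffix, infix = w[1:], w[1:-1]
            if f_prefix_suffix[suffix] == 0:
                continue  # E(w) = 0, so dev(w) cannot be negative
            expected = f_prefix_suffix[prefix] * f_prefix_suffix[suffix] / f_infix[infix]
            dev = (f_k[w] - expected) / max(sqrt(expected), 1.0)
            if dev <= rho:
                result[w] = dev
    return result


if __name__ == "__main__":
    print(avoided_words("AABBAABBAABB", 3, -1.1, alphabet="AB"))
```

In the small example, the words AAA and BBB never occur even though their prefixes and suffixes are frequent, so they are the ones reported for ρ = −1.1.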
12
Hettiarachchi N, Saitou N. GC Content Heterogeneity Transition of Conserved Noncoding Sequences Occurred at the Emergence of Vertebrates. Genome Biol Evol 2016; 8:3377-3392. [PMID: 28040773 PMCID: PMC5203776 DOI: 10.1093/gbe/evw231] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 12/13/2022]
Abstract
Conserved non-coding sequences (CNSs) of Eukaryotes are known to be significantly enriched in regulatory sequences. CNSs of diverse lineages follow different patterns in abundance, sequence composition, and location. Here, we report a thorough analysis of CNSs in diverse groups of Eukaryotes with respect to GC content heterogeneity. We examined 24 fungi, 19 invertebrates, and 12 non-mammalian vertebrates so as to find lineage specific features of CNSs. We found that fungi and invertebrate CNSs are predominantly GC rich as in plants we previously observed, whereas vertebrate CNSs are GC poor. This result suggests that the CNS GC content transition occurred from the ancestral GC rich state of Eukaryotes to GC poor in the vertebrate lineage due to the enrollment of GC poor transcription factor binding sites that are lineage specific. CNS GC content is closely linked with the nucleosome occupancy that determines the location and structural architecture of DNAs.
Affiliation(s)
- Nilmini Hettiarachchi
  - Department of Genetics, School of Life Science, Graduate University for Advanced Studies (SOKENDAI), Mishima, Japan
  - Division of Population Genetics, National Institute of Genetics, Mishima, Japan
- Naruya Saitou
  - Department of Genetics, School of Life Science, Graduate University for Advanced Studies (SOKENDAI), Mishima, Japan
  - Division of Population Genetics, National Institute of Genetics, Mishima, Japan
13
Polychronopoulos D, Athanasopoulou L, Almirantis Y. Fractality and entropic scaling in the chromosomal distribution of conserved noncoding elements in the human genome. Gene 2016; 584:148-60. [DOI: 10.1016/j.gene.2016.02.022] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 12/07/2015] [Revised: 01/22/2016] [Accepted: 02/14/2016] [Indexed: 11/15/2022]
14
15
Classification of selectively constrained DNA elements using feature vectors and rule-based classifiers. Genomics 2014; 104:79-86. [PMID: 25058025 DOI: 10.1016/j.ygeno.2014.07.004] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 06/09/2014] [Accepted: 07/15/2014] [Indexed: 12/29/2022]
Abstract
Little work has been done on the composition of conserved non-coding elements (CNEs) that are identified by comparisons of two or more genomes and are found to exist in all metazoan genomes. Here we present the analysis of CNEs with a methodology that takes into account word occurrence at various length scales in the form of feature vector representation and rule-based classifiers. We implement our approach on both protein-coding exons and CNEs, originating from human, insect (Drosophila melanogaster) and worm (Caenorhabditis elegans) genomes, that are either identified in the present study or obtained from the literature. Alignment-free feature vector representation of sequences combined with rule-based classification methods leads to successful classification of the different CNE classes. Biologically meaningful results are derived by comparison with the genomic signatures approach, and classification rates for a variety of functional elements of the genomes along with surrogates are presented.