1
Gondhalekar R, Kempes CP, McGlynn SE. Scaling of Protein Function across the Tree of Life. Genome Biol Evol 2023; 15:evad214. [PMID: 38007693 PMCID: PMC10715193 DOI: 10.1093/gbe/evad214] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2023] [Revised: 11/07/2023] [Accepted: 11/12/2023] [Indexed: 11/28/2023] Open
Abstract
Scaling laws are a powerful way to compare genomes because they put all organisms onto a single curve and reveal nontrivial generalities as genomes change in size. The abundance of functional categories across genomes has previously been found to show power law scaling with respect to the total number of functional categories, suggesting that universal constraints shape genomic category abundance. Here, we look across the tree of life to understand how genome evolution may be related to functional scaling. We revisit previous observations of functional genome scaling with an expanded taxonomy by analyzing 3,726 bacterial, 220 archaeal, and 79 unicellular eukaryotic genomes. We find that for some functional classes, scaling is best described by multiple exponents, revealing previously unobserved shifts in scaling as genome-encoded protein annotations increase or decrease. Furthermore, we find that scaling varies between phyletic groups at both the domain and phyla levels and is less universal than previously thought. This variability in functional scaling is not related to taxonomic phylogeny resolved at the phyla level, suggesting that differences in cell plan or physiology outweigh broad patterns of taxonomic evolution. Since genomes are maintained and replicated by the functional proteins encoded by them, these results point to functional degeneracy between taxonomic groups and unique evolutionary trajectories toward these. We also find that individual phyla frequently span scaling exponents of functional classes, revealing that individual clades can move across scaling exponents. Together, our results reveal unique shifts in functions across the tree of life and highlight that as genomes grow or shrink, proteins of various functions may be added or lost.
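The power-law scaling described in this abstract can be illustrated with a minimal sketch, assuming a single exponent and ordinary least squares on log-transformed counts. The function name `fit_power_law` and the synthetic numbers are hypothetical and are not taken from the paper, which reports that some functional classes need more than one exponent.

```python
import math

def fit_power_law(totals, counts):
    """Fit counts = prefactor * totals**exponent by least squares in log-log space."""
    xs = [math.log(t) for t in totals]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx  # slope of the log-log regression line
    prefactor = math.exp(mean_y - exponent * mean_x)  # intercept, back-transformed
    return prefactor, exponent

# Synthetic, illustrative data only: total annotations per genome and the
# number of annotations in one hypothetical functional category.
totals = [500, 1000, 2000, 4000, 8000]
counts = [0.001 * t ** 2 for t in totals]

prefactor, exponent = fit_power_law(totals, counts)
print(f"exponent = {exponent:.3f}, prefactor = {prefactor:.6f}")
```

A shift in scaling of the kind the abstract mentions would show up as different fitted exponents when the same function is applied separately to small and large genomes.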
Affiliation(s)
- Riddhi Gondhalekar
  - Earth-Life Science Institute, Tokyo Institute of Technology, Tokyo, Japan
  - School of Life Sciences and Technology, Tokyo Institute of Technology, Tokyo, Japan
- Shawn Erin McGlynn
  - Earth-Life Science Institute, Tokyo Institute of Technology, Tokyo, Japan
  - School of Life Sciences and Technology, Tokyo Institute of Technology, Tokyo, Japan
  - Blue Marble Space Institute of Science, Seattle, Washington, USA
  - Center for Sustainable Resource Science, RIKEN, Saitama, Japan
2
Baker L, David C, Jacobs DJ. Ab initio gene prediction for protein-coding regions. Bioinformatics Advances 2023; 3:vbad105. [PMID: 37638212 PMCID: PMC10448985 DOI: 10.1093/bioadv/vbad105] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/15/2023] [Revised: 07/04/2023] [Accepted: 08/08/2023] [Indexed: 08/29/2023]
Abstract
Motivation: Ab initio gene prediction in nonmodel organisms is a difficult task. While many ab initio methods have been developed, their average accuracy over long segments of a genome, and especially when assessed over a wide range of species, generally yields results with sensitivity and specificity levels in the low 60% range. A common weakness of most methods is the tendency to learn patterns that are species-specific to varying degrees. The need exists for methods to extract genetic features that can distinguish coding and noncoding regions that are not sensitive to specific organism characteristics. Results: A new method based on a neural network (NN) that uses a collection of sensors to create input features is presented. It is shown that accurate predictions are achieved even when trained on organisms that are significantly different phylogenetically than test organisms. A consensus prediction algorithm for a CoDing Sequence (CDS) is subsequently applied to the first nucleotide level of NN predictions that boosts accuracy through a data-driven procedure that optimizes a CDS/non-CDS threshold. An aggregate accuracy benchmark at the nucleotide level shows that this new approach performs better than existing ab initio methods, while requiring significantly less training data. Availability and implementation: https://github.com/BioMolecularPhysicsGroup-UNCC/MachineLearning.
Affiliation(s)
- Lonnie Baker
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, NC 28223, United States
- Charles David
  - Department of Bioinformatics, The New Zealand Institute for Plant and Food Research, Lincoln 7608, New Zealand
- Donald J Jacobs
  - Department of Physics and Optical Science, University of North Carolina at Charlotte, NC 28223, United States
  - UNC Charlotte School of Data Science, University of North Carolina at Charlotte, NC 28223, United States
3
Loewenthal G, Wygoda E, Nagar N, Glick L, Mayrose I, Pupko T. The evolutionary dynamics that retain long neutral genomic sequences in face of indel deletion bias: a model and its application to human introns. Open Biol 2022; 12:220223. [PMID: 36514983 PMCID: PMC9748784 DOI: 10.1098/rsob.220223] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/15/2022] Open
Abstract
Insertions and deletions (indels) of short DNA segments are common evolutionary events. Numerous studies have shown that deletions occur more often than insertions in both prokaryotes and eukaryotes. This raises the question of why neutral sequences are not eradicated from the genome. We suggest that this is due to a phenomenon we term border-induced selection. Accordingly, a neutral sequence is bordered between conserved regions. Deletions occurring near the borders occasionally protrude to the conserved region and are thereby subject to strong purifying selection. Thus, for short neutral sequences, an insertion bias is expected. Here, we develop a set of increasingly complex models of indel dynamics that incorporate border-induced selection. Furthermore, we show that short conserved sequences within the neutrally evolving sequence help explain: (i) the presence of very long sequences; (ii) the high variance of sequence lengths; and (iii) the possible emergence of multimodality in sequence length distributions. Finally, we fitted our models to the human intron length distribution, as introns are thought to be mostly neutral and bordered by conserved exons. We show that when accounting for the occurrence of short conserved sequences within introns, we reproduce the main features, including the presence of long introns and the multimodality of intron distribution.
Affiliation(s)
- Gil Loewenthal
  - The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Tel Aviv 69978, Israel
- Elya Wygoda
  - The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Tel Aviv 69978, Israel
- Natan Nagar
  - The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Tel Aviv 69978, Israel
- Lior Glick
  - School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Itay Mayrose
  - School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Tal Pupko
  - The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Tel Aviv 69978, Israel
4
B Chromosomes’ Sequences in Yellow-Necked Mice Apodemus flavicollis—Exploring the Transcription. Life (Basel) 2021; 12:life12010050. [PMID: 35054443 PMCID: PMC8781039 DOI: 10.3390/life12010050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2021] [Revised: 12/27/2021] [Accepted: 12/28/2021] [Indexed: 11/17/2022] Open
Abstract
B chromosomes (Bs) are highly polymorphic additional chromosomes in the genomes of many species. Due to the dispensability of Bs and the lack of noticeable phenotypic effects in their carriers, they were considered genetically inert for a long time. Recent studies on Bs in Apodemus flavicollis revealed their genetic composition, potential origin, and spatial organization in the interphase nucleus. Surprisingly, the genetic content of Bs in this species is preserved in all studied samples, even in geographically distinct populations, indicating its biological importance. Using RT-PCR, we studied the transcription activity of three genes (Rraga, Haus6, and Cenpe) previously identified on Bs in A. flavicollis. We analysed mRNA isolated from spleen tissues of 34 animals harboring different numbers of Bs (0–3). The products of transcriptional activity of the analysed sequences differ in individuals with and without Bs. We recorded B-genes and/or genes from the standard genome in the presence of Bs, showing sex-dependent higher levels of transcriptional activity. Furthermore, the transcriptional activity of Cenpe varied with the age of the animals differently in the groups with and without Bs. With aging, the amount of product was found to decrease significantly only in B carriers. The potential biological significance of all these differences is discussed in the paper.
5
Grammatikakis I, Lal A. Significance of lncRNA abundance to function. Mamm Genome 2021; 33:271-280. [PMID: 34406447 DOI: 10.1007/s00335-021-09901-4] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 04/30/2021] [Accepted: 08/03/2021] [Indexed: 12/12/2022]
Abstract
Long noncoding RNAs (lncRNAs) have emerged as regulators of diverse cellular processes. Although the vast majority of lncRNAs are expressed at lower levels compared to messenger RNAs (mRNAs), many lncRNAs play a central role in the regulation of cellular homeostasis and gene expression. With the advancement of next generation sequencing technologies, recent studies illustrate the diversity of lncRNA function. This diversity can be due to differences in their mechanisms of action, spatio-temporal expression, and/or abundance, all of which can vary depending on the particular cell type or tissue. Here, we discuss how the abundance of lncRNAs is an important feature that is often linked to their functions, and why it is crucial to quantitate lncRNA abundance, its local concentration within a cell or a tissue or the dynamic changes in expression levels during cell cycle progression or upon environmental stimuli, to shed light on their physiological roles.
Affiliation(s)
- Ioannis Grammatikakis
  - Regulatory RNAs and Cancer Section, Genetics Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Ashish Lal
  - Regulatory RNAs and Cancer Section, Genetics Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
6
Callens M, Pradier L, Finnegan M, Rose C, Bedhomme S. Read between the lines: Diversity of non-translational selection pressures on local codon usage. Genome Biol Evol 2021; 13:6263832. [PMID: 33944930 PMCID: PMC8410138 DOI: 10.1093/gbe/evab097] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Accepted: 04/28/2021] [Indexed: 12/14/2022] Open
Abstract
Protein coding genes can contain specific motifs within their nucleotide sequence that function as a signal for various biological pathways. The presence of such sequence motifs within a gene can have beneficial or detrimental effects on the phenotype and fitness of an organism, and this can lead to the enrichment or avoidance of this sequence motif. The degeneracy of the genetic code allows for the existence of alternative synonymous sequences that exclude or include these motifs, while keeping the encoded amino acid sequence intact. This implies that locally, there can be a selective pressure for preferentially using a codon over its synonymous alternative in order to avoid or enrich a specific sequence motif. This selective pressure could, in addition to mutation, drift, and selection for translation efficiency and accuracy, contribute to shaping the codon usage bias. In this review, we discuss patterns of avoidance of (or enrichment for) the various biological signals contained in specific nucleotide sequence motifs: transcription and translation initiation and termination signals, mRNA maturation signals, and antiviral immune system targets. Experimental data on the phenotypic or fitness effects of synonymous mutations in these sequence motifs confirm that they can be targets of local selection pressures on codon usage. We also formulate the hypothesis that transposable elements could have a similar impact on codon usage through their preferred integration sequences. Overall, selection on codon usage appears to be a combination of a global selection pressure imposed by the translation machinery, and a patchwork of local selection pressures related to biological signals contained in specific sequence motifs.
Affiliation(s)
- Martijn Callens
  - Centre d'Ecologie Fonctionnelle et Evolutive, CNRS, Université de Montpellier, Université Paul Valéry Montpellier 3, Ecole Pratique des Hautes Etudes, Institut de Recherche pour le Développement, 34000 Montpellier, France
- Léa Pradier
  - Centre d'Ecologie Fonctionnelle et Evolutive, CNRS, Université de Montpellier, Université Paul Valéry Montpellier 3, Ecole Pratique des Hautes Etudes, Institut de Recherche pour le Développement, 34000 Montpellier, France
- Michael Finnegan
  - Centre d'Ecologie Fonctionnelle et Evolutive, CNRS, Université de Montpellier, Université Paul Valéry Montpellier 3, Ecole Pratique des Hautes Etudes, Institut de Recherche pour le Développement, 34000 Montpellier, France
- Caroline Rose
  - Centre d'Ecologie Fonctionnelle et Evolutive, CNRS, Université de Montpellier, Université Paul Valéry Montpellier 3, Ecole Pratique des Hautes Etudes, Institut de Recherche pour le Développement, 34000 Montpellier, France
- Stéphanie Bedhomme
  - Centre d'Ecologie Fonctionnelle et Evolutive, CNRS, Université de Montpellier, Université Paul Valéry Montpellier 3, Ecole Pratique des Hautes Etudes, Institut de Recherche pour le Développement, 34000 Montpellier, France
7
A Single Cell but Many Different Transcripts: A Journey into the World of Long Non-Coding RNAs. Int J Mol Sci 2020; 21:ijms21010302. [PMID: 31906285 PMCID: PMC6982300 DOI: 10.3390/ijms21010302] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 11/22/2019] [Revised: 12/17/2019] [Accepted: 12/23/2019] [Indexed: 02/07/2023] Open
Abstract
In late 2012, evidence emerged that most of the human genome is transcribed but only a small percentage of the transcripts are translated. This observation supported the importance of non-coding RNAs, and it was confirmed in several organisms. The most abundant non-translated transcripts are long non-coding RNAs (lncRNAs). In contrast to protein-coding RNAs, they show more cell-specific expression. To understand the function of lncRNAs, it is fundamental to investigate in which cells they are preferentially expressed and to detect their subcellular localization. Recent improvements in techniques that localize single RNA molecules in tissues, such as single-cell RNA sequencing and fluorescence amplification methods, have given a considerable boost to the knowledge of lncRNA functions. In recent years, single-cell transcription variability has been associated with non-coding RNA expression, revealing this class of RNAs as important transcripts in cell lineage specification. The purpose of this review is to collect updated information about lncRNA classification and new findings on their function derived from single-cell analysis. We also considered it useful for all researchers to describe the methods available for single-cell analysis and the databases collecting single-cell and lncRNA data. Tables are included to schematize, describe, and compare the concepts presented.
8
Vandevenne M, Delmarcelle M, Galleni M. RNA Regulatory Networks as a Control of Stochasticity in Biological Systems. Front Genet 2019; 10:403. [PMID: 31134128 PMCID: PMC6514243 DOI: 10.3389/fgene.2019.00403] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 02/08/2019] [Accepted: 04/12/2019] [Indexed: 01/24/2023] Open
Abstract
The discovery that the non-protein coding part of the human genome, dismissed as "junk DNA," is actively transcribed and carries out crucial functions is probably one of the most important discoveries of the past decades. These transcripts are becoming the rising stars of modern biology. In this review, we cast a new light on RNAs. We place these molecules in the context of the origins of life and evolution, with a strong emphasis on the "RNA networks" concept. We discuss how this view can help us to understand the global role of RNA networks in modern cells, and can change our perception of cell biology and therapy. Finally, although high-throughput methods as well as traditional case-by-case studies have laid the groundwork for our current knowledge of transcriptomes, we discuss new strategies that are better suited to uncover and tackle these integrated and complex RNA networks.
Affiliation(s)
- Marylène Vandevenne
  - InBioS - Center for Protein Engineering, University of Liège, Liège, Belgium
- Michael Delmarcelle
  - InBioS - Center for Protein Engineering, University of Liège, Liège, Belgium
- Moreno Galleni
  - InBioS - Center for Protein Engineering, University of Liège, Liège, Belgium
9
Wu F, Zhang Q, Wang X. Design of Adjacent Transcriptional Regions to Tune Gene Expression and Facilitate Circuit Construction. Cell Syst 2018; 6:206-215.e6. [PMID: 29428414 PMCID: PMC5832616 DOI: 10.1016/j.cels.2018.01.010] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 06/28/2017] [Revised: 11/05/2017] [Accepted: 01/08/2018] [Indexed: 01/23/2023]
Abstract
Polycistronic architecture is common for synthetic gene circuits; however, it remains unknown how expression of one gene is affected by the presence of other genes/noncoding regions in the operon, termed adjacent transcriptional regions (ATRs). Here, we constructed synthetic operons with a reporter gene flanked by different ATRs, and we found that ATRs with high GC content, small size, and low folding energy lead to high gene expression. Based on these results, we built a model of gene expression and generated a metric that takes into account ATRs. We used the metric to design and construct logic gates with low basal expression and high sensitivity and nonlinearity. Furthermore, we rationally designed synthetic 5' ATRs with different GC content and sizes to tune protein expression levels over a 300-fold range and used these to build synthetic toggle switches with varying basal expression and degrees of bistability. Our comprehensive model and gene expression metric could facilitate the future engineering of more complex synthetic gene circuits.
Affiliation(s)
- Fuqing Wu
  - School of Biological and Health Systems Engineering, Arizona State University, Tempe, AZ 85287, USA
- Qi Zhang
  - School of Biological and Health Systems Engineering, Arizona State University, Tempe, AZ 85287, USA
- Xiao Wang
  - School of Biological and Health Systems Engineering, Arizona State University, Tempe, AZ 85287, USA
10
Kempes CP, Wolpert D, Cohen Z, Pérez-Mercader J. The thermodynamic efficiency of computations made in cells across the range of life. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 2017; 375:20160343. [PMID: 29133443 PMCID: PMC5686401 DOI: 10.1098/rsta.2016.0343] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Accepted: 07/31/2017] [Indexed: 06/01/2023]
Abstract
Biological organisms must perform computation as they grow, reproduce and evolve. Moreover, ever since Landauer's bound was proposed, it has been known that all computation has some thermodynamic cost, and that the same computation can be achieved with greater or smaller thermodynamic cost depending on how it is implemented. Accordingly, an important issue concerning the evolution of life is assessing the thermodynamic efficiency of the computations performed by organisms. This issue is interesting both from the perspective of how close life has come to maximally efficient computation (presumably under the pressure of natural selection), and from the practical perspective of what efficiencies we might hope that engineered biological computers might achieve, especially in comparison with current computational systems. Here we show that the computational efficiency of translation, defined as free energy expended per amino acid operation, outperforms the best supercomputers by several orders of magnitude, and is only about an order of magnitude worse than the Landauer bound. However, this efficiency depends strongly on the size and architecture of the cell in question. In particular, we show that the useful efficiency of an amino acid operation, defined as the bulk energy per amino acid polymerization, decreases for increasing bacterial size and converges to the polymerization cost of the ribosome. This cost of the largest bacteria does not change in cells as we progress through the major evolutionary shifts to both single- and multicellular eukaryotes. However, the rates of total computation per unit mass are non-monotonic in bacteria with increasing cell size, and also change across different biological architectures, including the shift from unicellular to multicellular eukaryotes. This article is part of the themed issue 'Reconceptualizing the origins of life'.
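For orientation, the Landauer bound mentioned in this abstract is the minimum energy k_B T ln 2 needed to erase one bit at temperature T. The sketch below computes it and expresses a measured energy as orders of magnitude above the bound. The temperature of 300 K and the function names are assumptions chosen for illustration, and the example energy is hypothetical, not a value from the paper.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact SI value of the Boltzmann constant

def landauer_bound_joules(temperature_k):
    """Minimum energy to erase one bit at the given temperature: k_B * T * ln 2."""
    return BOLTZMANN_J_PER_K * temperature_k * math.log(2)

def orders_above_bound(energy_joules, temperature_k):
    """log10 of the ratio between an energy per operation and the Landauer bound."""
    return math.log10(energy_joules / landauer_bound_joules(temperature_k))

# Assumed temperature for illustration only.
T = 300.0
bound = landauer_bound_joules(T)
print(f"Landauer bound at {T:.0f} K: {bound:.3e} J per bit")

# Hypothetical energy per operation, set to ten times the bound to show
# what "about an order of magnitude worse" means numerically.
hypothetical_energy = 10 * bound
print(f"orders of magnitude above bound: {orders_above_bound(hypothetical_energy, T):.2f}")
```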
Affiliation(s)
- David Wolpert
  - The Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA
  - Department of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  - Beyond Center, Arizona State University, Tempe, AZ 85287, USA
- Zachary Cohen
  - Department of Biology, University of Illinois, Urbana-Champaign, Urbana, IL 61801, USA
- Juan Pérez-Mercader
  - Department of Earth and Planetary Sciences, Harvard University, Cambridge, MA 02138, USA
11
Ramos É, Cardoso AL, Brown J, Marques DF, Fantinatti BEA, Cabral-de-Mello DC, Oliveira RA, O'Neill RJ, Martins C. The repetitive DNA element BncDNA, enriched in the B chromosome of the cichlid fish Astatotilapia latifasciata, transcribes a potentially noncoding RNA. Chromosoma 2016; 126:313-323. [PMID: 27169573 DOI: 10.1007/s00412-016-0601-x] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 02/08/2016] [Revised: 04/03/2016] [Accepted: 05/03/2016] [Indexed: 12/27/2022]
Abstract
Supernumerary chromosomes have been studied in many species of eukaryotes, including the cichlid fish, Astatotilapia latifasciata. However, there are many unanswered questions about the maintenance, inheritance, and functional aspects of supernumerary chromosomes. The cichlid family has been highlighted as a model for evolutionary studies, including those that focus on mechanisms of chromosome evolution. Individuals of A. latifasciata are known to carry up to two B heterochromatic isochromosomes that are enriched in repetitive DNA and contain few intact gene sequences. We isolated and characterized a transcriptionally active repeated DNA, called B chromosome noncoding DNA (BncDNA), highly represented across all B chromosomes of A. latifasciata. BncDNA transcripts are differentially processed among six different tissues, including the production of smaller transcripts, indicating transcriptional variation may be linked to B chromosome presence and sexual phenotype. The transcript lengths and lack of similarity with known protein/gene sequences indicate BncRNA might represent a novel long noncoding RNA family (lncRNA). The potential for interaction between BncRNA and known miRNAs was computationally predicted, resulting in the identification of possible binding of this sequence in upregulated miRNAs related to the presence of B chromosomes. In conclusion, Bnc is a transcriptionally active repetitive DNA enriched in B chromosomes with potential action over B chromosome maintenance in somatic cells and meiotic drive in gametic cells.
Affiliation(s)
- Érica Ramos
  - Department of Morphology, Institute of Biosciences, Sao Paulo State University, 18618-689, Botucatu, SP, Brazil
- Adauto L Cardoso
  - Department of Morphology, Institute of Biosciences, Sao Paulo State University, 18618-689, Botucatu, SP, Brazil
- Judith Brown
  - Allied Health Sciences Department and Institute for Systems Genomics, University of Connecticut, 06269, Storrs, CT, USA
- Diego F Marques
  - Department of Morphology, Institute of Biosciences, Sao Paulo State University, 18618-689, Botucatu, SP, Brazil
- Bruno E A Fantinatti
  - Department of Morphology, Institute of Biosciences, Sao Paulo State University, 18618-689, Botucatu, SP, Brazil
- Diogo C Cabral-de-Mello
  - Department of Biology, Institute of Biosciences, Sao Paulo State University, 13506-900, Rio Claro, SP, Brazil
- Rogério A Oliveira
  - Department of Biostatistics, Institute of Biosciences, Sao Paulo State University, 18618-689, Botucatu, SP, Brazil
- Rachel J O'Neill
  - Department of Molecular and Cell Biology and Institute for Systems Genomics, University of Connecticut, 06269, Storrs, CT, USA
- Cesar Martins
  - Department of Morphology, Institute of Biosciences, Sao Paulo State University, 18618-689, Botucatu, SP, Brazil
12
Sanli K, Bengtsson-Palme J, Nilsson RH, Kristiansson E, Alm Rosenblad M, Blanck H, Eriksson KM. Metagenomic sequencing of marine periphyton: taxonomic and functional insights into biofilm communities. Front Microbiol 2015; 6:1192. [PMID: 26579098 PMCID: PMC4626570 DOI: 10.3389/fmicb.2015.01192] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 05/07/2015] [Accepted: 10/13/2015] [Indexed: 11/13/2022] Open
Abstract
Periphyton communities are complex phototrophic, multispecies biofilms that develop on surfaces in aquatic environments. These communities harbor a large diversity of organisms comprising viruses, bacteria, algae, fungi, protozoans, and metazoans. However, thus far the total biodiversity of periphyton has not been described. In this study, we use metagenomics to characterize periphyton communities from the marine environment of the Swedish west coast. Although we found approximately ten times more eukaryotic rRNA marker gene sequences compared to prokaryotic, the whole metagenome-based similarity searches showed that bacteria constitute the most abundant phyla in these biofilms. We show that marine periphyton encompass a range of heterotrophic and phototrophic organisms. Heterotrophic bacteria, including the majority of proteobacterial clades and Bacteroidetes, and eukaryotic macro-invertebrates were found to dominate periphyton. The phototrophic groups comprise Cyanobacteria and the alpha-proteobacterial genus Roseobacter, followed by different micro- and macro-algae. We also assess the metabolic pathways that predispose these communities to an attached lifestyle. Functional indicators of the biofilm form of life in periphyton involve genes coding for enzymes that catalyze the production and degradation of extracellular polymeric substances, mainly in the form of complex sugars such as starch and glycogen-like meshes together with chitin. Genes for 278 different transporter proteins were detected in the metagenome, constituting the most abundant protein complexes. Finally, genes encoding enzymes that participate in anaerobic pathways, such as denitrification and methanogenesis, were detected suggesting the presence of anaerobic or low-oxygen micro-zones within the biofilms.
Affiliation(s)
- Kemal Sanli
  - Department of Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden
- Johan Bengtsson-Palme
  - Department of Infectious Diseases, Institute of Biomedicine, The Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- R Henrik Nilsson
  - Department of Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden
- Erik Kristiansson
  - Department of Mathematical Sciences, Chalmers University of Technology, Gothenburg, Sweden
- Magnus Alm Rosenblad
  - Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg, Sweden
- Hans Blanck
  - Department of Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden
- Karl M Eriksson
  - Department of Shipping and Marine Technology, Chalmers University of Technology, Gothenburg, Sweden
13
Kucharova V, Wiker HG. Proteogenomics in microbiology: taking the right turn at the junction of genomics and proteomics. Proteomics 2014; 14:2360-675. [PMID: 25263021 DOI: 10.1002/pmic.201400168] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 04/28/2014] [Revised: 08/18/2014] [Accepted: 09/23/2014] [Indexed: 12/14/2022]
Abstract
High-accuracy and high-throughput proteomic methods have completely changed the way we can identify and characterize proteins. MS-based proteomics can now provide a unique supplement to genomic data and add a new level of information to the interpretation of genomic sequences. Proteomics-driven genome annotation has become especially relevant in microbiology where genomes are sequenced on a daily basis and limitations of an in silico driven annotation process are well recognized. In this review paper, we outline different strategies on how one can design a proteogenomic experiment, for example on genome-sequenced (synonymous proteogenomics) versus unsequenced organisms (ortho-proteogenomics) or with the aid of other "omic" data such as RNA-seq. We touch upon many challenges that are encountered during a typical proteogenomic study, mostly concerning bioinformatics methods and downstream data analysis, but also related to creation and use of sequence databases. A large list of proteogenomic case studies of different microorganisms is provided to illustrate the mapping of MS/MS-derived peptide spectra to genomic DNA sequences. These investigations have led to accurate determination of translational initiation sites, pointed out eventual read-throughs or programmed frameshifts, detected signal peptide processing or other protein maturation events, removed questionable annotation assignments, and provided evidence for predicted hypothetical proteins.
Affiliation(s)
- Veronika Kucharova
- Department of Clinical Science, The Gade Research Group for Infection and Immunity, University of Bergen, Norway
|
14
|
Combining in silico prediction and ribosome profiling in a genome-wide search for novel putatively coding sORFs. BMC Genomics 2013; 14:648. [PMID: 24059539 PMCID: PMC3852105 DOI: 10.1186/1471-2164-14-648] [Citation(s) in RCA: 72] [Impact Index Per Article: 6.5] [Received: 01/23/2013] [Accepted: 09/13/2013] [Indexed: 11/23/2022] Open
Abstract
Background: It was long assumed that proteins are at least 100 amino acids (AAs) long. Moreover, the detection of short translation products (e.g. coded from small Open Reading Frames, sORFs) is very difficult as the short length makes it hard to distinguish true coding ORFs from ORFs occurring by chance. Nevertheless, over the past few years many such non-canonical genes (with ORFs < 100 AAs) have been discovered in different organisms like Arabidopsis thaliana, Saccharomyces cerevisiae, and Drosophila melanogaster. Thanks to advances in sequencing, bioinformatics and computing power, it is now possible to scan the genome in unprecedented scrutiny, for example in a search of this type of small ORFs. Results: Using bioinformatics methods, we performed a systematic search for putatively functional sORFs in the Mus musculus genome. A genome-wide scan detected all sORFs which were subsequently analyzed for their coding potential, based on evolutionary conservation at the AA level, and ranked using a Support Vector Machine (SVM) learning model. The ranked sORFs are finally overlapped with ribosome profiling data, hinting to sORF translation. All candidates are visually inspected using an in-house developed genome browser. In this way dozens of highly conserved sORFs, targeted by ribosomes were identified in the mouse genome, putatively encoding micropeptides. Conclusion: Our combined genome-wide approach leads to the prediction of a comprehensive but manageable set of putatively coding sORFs, a very important first step towards the identification of a new class of bioactive peptides, called micropeptides.
|
15
|
Liu W, Sun L, Zhong M, Zhou Q, Gong Z, Li P, Tai P, Li X. Cadmium-induced DNA damage and mutations in Arabidopsis plantlet shoots identified by DNA fingerprinting. Chemosphere 2012; 89:1048-55. [PMID: 22717160 DOI: 10.1016/j.chemosphere.2012.05.068] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 01/05/2012] [Revised: 05/11/2012] [Accepted: 05/16/2012] [Indexed: 05/03/2023]
Abstract
Random amplified polymorphic DNA (RAPD) test is a feasible method to evaluate the toxicity of environmental pollutants on vegetal organisms. Herein, Arabidopsis thaliana (Arabidopsis) plantlets following Cadmium (Cd) treatment for 26 d were screened for DNA genetic alterations by DNA fingerprinting. Four primers amplified 20-23 mutated RAPD fragments in 0.125-3.0 mg L⁻¹ Cd-treated Arabidopsis plantlets, respectively. Cloning and sequencing analysis of eight randomly selected mutated fragments revealed 99-100% homology with the genes of VARICOSE-Related, SLEEPY1 F-box, 40S ribosomal protein S3, phosphoglucomutase, and noncoding regions in Arabidopsis genome correspondingly. The results show the ability of RAPD analysis to detect significant genetic alterations in Cd-exposed seedlings. Although the exact functional importance of the other mutated bands is unknown, the presence of mutated loci in Cd-treated seedlings, prior to the onset of significant physiological effects, suggests that these altered loci are the early events in Cd-treated Arabidopsis seedlings and would greatly improve environmental risk assessment.
Affiliation(s)
- Wan Liu
- Key Laboratory of Pollution Ecology and Environmental Engineering, Institute of Applied Ecology, Chinese Academy of Sciences, Shenyang 110016, China.
|
16
|
Friar JL, Goldman T, Pérez–Mercader J. Genome sizes and the Benford distribution. PLoS One 2012; 7:e36624. [PMID: 22629319 PMCID: PMC3356352 DOI: 10.1371/journal.pone.0036624] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 04/07/2011] [Accepted: 04/11/2012] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND: Data on the number of Open Reading Frames (ORFs) coded by genomes from the 3 domains of Life show the presence of some notable general features. These include essential differences between the Prokaryotes and Eukaryotes, with the number of ORFs growing linearly with total genome size for the former, but only logarithmically for the latter. RESULTS: Simply by assuming that the (protein) coding and non-coding fractions of the genome must have different dynamics and that the non-coding fraction must be particularly versatile and therefore be controlled by a variety of (unspecified) probability distribution functions (pdf's), we are able to predict that the number of ORFs for Eukaryotes follows a Benford distribution and must therefore have a specific logarithmic form. Using the data for the 1000+ genomes available to us in early 2010, we find that the Benford distribution provides excellent fits to the data over several orders of magnitude. CONCLUSIONS: In its linear regime the Benford distribution produces excellent fits to the Prokaryote data, while the full non-linear form of the distribution similarly provides an excellent fit to the Eukaryote data. Furthermore, in their region of overlap the salient features are statistically congruent. This allows us to interpret the difference between Prokaryotes and Eukaryotes as the manifestation of the increased demand in the biological functions required for the larger Eukaryotes, to estimate some minimal genome sizes, and to predict a maximal Prokaryote genome size on the order of 8-12 megabasepairs. These results naturally allow a mathematical interpretation in terms of maximal entropy and, therefore, most efficient information transmission.
Affiliation(s)
- James L. Friar
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Terrance Goldman
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Juan Pérez–Mercader
- Department of Earth and Planetary Sciences, Harvard University, Cambridge, Massachusetts, United States of America and Santa Fe Institute, Santa Fe, New Mexico, United States of America
|
17
|
Mehmood T, Bohlin J, Kristoffersen AB, Sæbø S, Warringer J, Snipen L. Exploration of multivariate analysis in microbial coding sequence modeling. BMC Bioinformatics 2012; 13:97. [PMID: 22583558 PMCID: PMC3473301 DOI: 10.1186/1471-2105-13-97] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 02/27/2012] [Accepted: 04/24/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Gene finding is a complicated procedure that encapsulates algorithms for coding sequence modeling, identification of promoter regions, issues concerning overlapping genes and more. In the present study we focus on coding sequence modeling algorithms; that is, algorithms for identification and prediction of the actual coding sequences from genomic DNA. In this respect, we promote a novel multivariate method known as Canonical Powered Partial Least Squares (CPPLS) as an alternative to the commonly used Interpolated Markov model (IMM). Comparisons between the methods were performed on DNA, codon and protein sequences with highly conserved genes taken from several species with different genomic properties. RESULTS: The multivariate CPPLS approach classified coding sequence substantially better than the commonly used IMM on the same set of sequences. We also found that the use of CPPLS with codon representation gave significantly better classification results than both IMM with protein (p < 0.001) and with DNA (p < 0.001). Further, although the mean performance was similar, the variation of CPPLS performance on codon representation was significantly smaller than for IMM (p < 0.001). CONCLUSIONS: The performance of coding sequence modeling can be substantially improved by using an algorithm based on the multivariate CPPLS method applied to codon or DNA frequencies.
Affiliation(s)
- Tahir Mehmood
- Biostatistics, Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Aas, Norway.
|
18
|
Gilmore SJ. High throughput investigative Dermatology in 2012 and beyond: A new era beckons. Australas J Dermatol 2012; 54:1-8. [PMID: 22506776 DOI: 10.1111/j.1440-0960.2012.00883.x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/31/2022]
Abstract
High throughput molecular biology began around the mid-1990s with the introduction of microarrays - a technology that enabled investigators to quantify the cellular expression levels of tens of thousands of mRNA transcripts simultaneously. To date, a large number of microarray experiments have been performed in the investigation of RNA expression signatures in normal and pathological tissues. This review focuses on a next generation tool in high throughput investigation: RNA sequencing or RNA-Seq, highlighting its advantages over traditional microarray investigation and discussing its utility in investigative dermatology. In contrast with the results obtained from microarray experiments, RNA-Seq generates mRNA abundance counts, can identify novel transcripts and splice variants, and provides sequence resolution at the level of single base-pairs. Implementing RNA-Seq in the investigation of skin disease will yield novel insights into the pathogenesis of disease, will facilitate the discovery of new diseases and new mechanisms of disease, and will allow researchers to probe genetic disease in high resolution and with unprecedented efficiency.
Affiliation(s)
- Stephen J Gilmore
- Dermatology Research Centre, University of Queensland, School of Medicine, Princess Alexandra Hospital, Brisbane, Australia.
|
19
|
Yoon SL, Jung SI, Kim WJ, Kim SI, Park IH, Leem SH. Variants of BORIS minisatellites and relation to prognosis of prostate cancer. Genes Genomics 2011. [DOI: 10.1007/s13258-010-0111-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
|
20
|
Yok NG, Rosen GL. Combining gene prediction methods to improve metagenomic gene annotation. BMC Bioinformatics 2011; 12:20. [PMID: 21232129 PMCID: PMC3042383 DOI: 10.1186/1471-2105-12-20] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Received: 07/20/2010] [Accepted: 01/13/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Traditional gene annotation methods rely on characteristics that may not be available in short reads generated from next generation technology, resulting in suboptimal performance for metagenomic (environmental) samples. Therefore, in recent years, new programs have been developed that optimize performance on short reads. In this work, we benchmark three metagenomic gene prediction programs and combine their predictions to improve metagenomic read gene annotation. RESULTS: We not only analyze the programs' performance at different read-lengths like similar studies, but also separate different types of reads, including intra- and intergenic regions, for analysis. The main deficiencies are in the algorithms' ability to predict non-coding regions and gene edges, resulting in more false-positives and false-negatives than desired. In fact, the specificities of the algorithms are notably worse than the sensitivities. By combining the programs' predictions, we show significant improvement in specificity at minimal cost to sensitivity, resulting in 4% improvement in accuracy for 100 bp reads with ~1% improvement in accuracy for 200 bp reads and above. To correctly annotate the start and stop of the genes, we find that a consensus of all the predictors performs best for shorter read lengths while a unanimous agreement is better for longer read lengths, boosting annotation accuracy by 1-8%. We also demonstrate use of the classifier combinations on a real dataset. CONCLUSIONS: To optimize the performance for both prediction and annotation accuracies, we conclude that the consensus of all methods (or a majority vote) is the best for reads 400 bp and shorter, while using the intersection of GeneMark and Orphelia predictions is the best for reads 500 bp and longer. We demonstrate that most methods predict over 80% coding (including partially coding) reads on a real human gut sample sequenced by Illumina technology.
Affiliation(s)
- Non G Yok
- Genomic Signal Processing Laboratory, Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104, USA.
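The two combination rules named in the abstract above (a majority vote across predictors, and the intersection of two predictors) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout, and the toy calls are hypothetical, and each predictor is assumed to emit a simple coding/non-coding call per read.

```python
def majority_vote(calls):
    """Return True when more than half of the predictors call the read coding.

    `calls` maps a predictor name to a boolean call (hypothetical layout).
    """
    coding_votes = sum(1 for is_coding in calls.values() if is_coding)
    return coding_votes > len(calls) / 2


def intersection(calls, predictors):
    """Return True only when every named predictor calls the read coding."""
    return all(calls[name] for name in predictors)


# Toy calls for a single read from three predictors (illustrative values only).
read_calls = {"GeneMark": True, "Orphelia": False, "third_predictor": True}

majority_result = majority_vote(read_calls)
intersection_result = intersection(read_calls, ["GeneMark", "Orphelia"])
```

With these toy calls the majority rule accepts the read while the stricter intersection rule rejects it, which is the specificity versus sensitivity trade-off the abstract describes.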
|
21
|
Abstract
Understanding the evolutionary origin of the nucleus and its compartmentalized architecture provides a huge but, as expected, greatly rewarding challenge in the post-genomic era. We start this chapter with a survey of current hypotheses on the evolutionary origin of the cell nucleus. Thereafter, we provide an overview of evolutionarily conserved features of chromatin organization and arrangements, as well as topographical aspects of DNA replication and transcription, followed by a brief introduction of current models of nuclear architecture. In addition to features which may possibly apply to all eukaryotes, the evolutionary plasticity of higher-order nuclear organization is reflected by cell-type- and species-specific features, by the ability of nuclear architecture to adapt to specific environmental demands, as well as by the impact of aberrant nuclear organization on senescence and human disease. We conclude this chapter with a reflection on the necessity of interdisciplinary research strategies to map epigenomes in space and time.
|
22
|
Charoensawan V, Wilson D, Teichmann SA. Genomic repertoires of DNA-binding transcription factors across the tree of life. Nucleic Acids Res 2010; 38:7364-77. [PMID: 20675356 PMCID: PMC2995046 DOI: 10.1093/nar/gkq617] [Citation(s) in RCA: 106] [Impact Index Per Article: 7.6] [Indexed: 11/14/2022] Open
Abstract
Sequence-specific transcription factors (TFs) are important to genetic regulation in all organisms because they recognize and directly bind to regulatory regions on DNA. Here, we survey and summarize the TF resources available. We outline the organisms for which TF annotation is provided, and discuss the criteria and methods used to annotate TFs by different databases. By using genomic TF repertoires from ∼700 genomes across the tree of life, covering Bacteria, Archaea and Eukaryota, we review TF abundance with respect to the number of genes, as well as their structural complexity in diverse lineages. While typical eukaryotic TFs are longer than the average eukaryotic proteins, the inverse is true for prokaryotes. Only in eukaryotes does the same family of DNA-binding domain (DBD) occur multiple times within one polypeptide chain. This potentially increases the length and diversity of DNA-recognition sequence by reusing DBDs from the same family. We examined the increase in TF abundance with the number of genes in genomes, using the largest set of prokaryotic and eukaryotic genomes to date. As pointed out before, prokaryotic TFs increase faster than linearly. We further observe a similar relationship in eukaryotic genomes with a slower increase in TFs.
|
23
|
Høvik H, Chen T. Dynamic probe selection for studying microbial transcriptome with high-density genomic tiling microarrays. BMC Bioinformatics 2010; 11:82. [PMID: 20144223 PMCID: PMC2836303 DOI: 10.1186/1471-2105-11-82] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 09/15/2009] [Accepted: 02/09/2010] [Indexed: 12/27/2022] Open
Abstract
Background: Current commercial high-density oligonucleotide microarrays can hold millions of probe spots on a single microscopic glass slide and are ideal for studying the transcriptome of microbial genomes using a tiling probe design. This paper describes a comprehensive computational pipeline implemented specifically for designing tiling probe sets to study microbial transcriptome profiles. Results: The pipeline identifies every possible probe sequence from both forward and reverse-complement strands of all DNA sequences in the target genome including circular or linear chromosomes and plasmids. Final probe sequence lengths are adjusted based on the maximal oligonucleotide synthesis cycles and best isothermality allowed. Optimal probes are then selected in two stages - sequential and gap-filling. In the sequential stage, probes are selected from sequence windows tiled alongside the genome. In the gap-filling stage, additional probes are selected from the largest gaps between adjacent probes that have already been selected, until a predefined number of probes is reached. Selection of the highest quality probe within each window and gap is based on five criteria: sequence uniqueness, probe self-annealing, melting temperature, oligonucleotide length, and probe position. Conclusions: The probe selection pipeline evaluates global and local probe sequence properties and selects a set of probes dynamically and evenly distributed along the target genome. Unique to other similar methods, an exact number of non-redundant probes can be designed to utilize all the available probe spots on any chosen microarray platform. The pipeline can be applied to microbial genomes when designing high-density tiling arrays for comparative genomics, ChIP chip, gene expression and comprehensive transcriptome studies.
Affiliation(s)
- Hedda Høvik
- Department of Oral Biology, Faculty of Dentistry, University of Oslo, Oslo, Norway
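The gap-filling stage described in the abstract above (repeatedly adding a probe inside the largest gap between already selected probes until a target count is reached) might look roughly like the sketch below. It is not the published pipeline: the function name `fill_gaps` and the integer genome positions are hypothetical, and the five quality criteria from the abstract are replaced here by a single stand-in rule that picks the candidate closest to the middle of the gap.

```python
def fill_gaps(selected, candidates, target):
    """Add candidate probe positions to `selected` until `target` probes exist.

    Each round looks at the gaps between adjacent chosen probes, largest
    first, and takes the candidate nearest the midpoint of the first gap
    that contains any candidate (a simplifying assumption).
    """
    chosen = sorted(selected)
    pool = set(candidates) - set(chosen)
    while len(chosen) < target and pool:
        # Order gaps between adjacent chosen probes from largest to smallest.
        gaps = sorted(zip(chosen, chosen[1:]),
                      key=lambda gap: gap[1] - gap[0], reverse=True)
        for left, right in gaps:
            inside = [pos for pos in pool if left < pos < right]
            if inside:
                midpoint = (left + right) / 2
                best = min(inside, key=lambda pos: (abs(pos - midpoint), pos))
                chosen.append(best)
                chosen.sort()
                pool.remove(best)
                break
        else:
            break  # no remaining candidate falls inside any gap
    return chosen


# Toy positions (illustrative only): three probes from a sequential stage,
# four unused candidates, and a target of five probes in total.
probes = fill_gaps([0, 100, 120], [10, 50, 60, 110], target=5)
```

Stopping at an exact target count mirrors the abstract's point that the number of probes can be matched to the spots available on a chosen array.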
|
24
|
Li DJ, Zhang S. The Cambrian explosion triggered by critical turning point in genome size evolution. Biochem Biophys Res Commun 2010; 392:240-5. [DOI: 10.1016/j.bbrc.2010.01.032] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 12/23/2009] [Accepted: 01/08/2010] [Indexed: 11/26/2022]
|
25
|
Smith DMD, Onnela JP, Jones NS. Master-equation analysis of accelerating networks. Phys Rev E Stat Nonlin Soft Matter Phys 2009; 79:056101. [PMID: 19518515 DOI: 10.1103/physreve.79.056101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2008] [Indexed: 05/27/2023]
Abstract
In many real-world networks, the rates of node and link addition are time dependent. This observation motivates the definition of accelerating networks. There has been relatively little investigation of accelerating networks and previous efforts at analyzing their degree distributions have employed mean-field techniques. By contrast, we show that it is possible to apply a master-equation approach to such network development. We provide full time-dependent expressions for the evolution of the degree distributions for the canonical situations of random and preferential attachment in networks undergoing constant acceleration. These results are in excellent agreement with results obtained from simulations. We note that a growing nonequilibrium network undergoing constant acceleration with random attachment is equivalent to a classical random graph, bridging the gap between nonequilibrium and classical equilibrium networks.
Affiliation(s)
- David M D Smith
- Centre for Mathematical Biology, Department of Physics, Clarendon Laboratory, Oxford University, Oxford OX1 3PU, United Kingdom.
|