1
Paugh SW, Coss DR, Bao J, Laudermilk LT, Grace CR, Ferreira AM, Waddell MB, Ridout G, Naeve D, Leuze M, LoCascio PF, Panetta JC, Wilkinson MR, Pui CH, Naeve CW, Uberbacher EC, Bonten EJ, Evans WE. MicroRNAs Form Triplexes with Double Stranded DNA at Sequence-Specific Binding Sites; a Eukaryotic Mechanism via which microRNAs Could Directly Alter Gene Expression. PLoS Comput Biol 2016; 12:e1004744. [PMID: 26844769 PMCID: PMC4742280 DOI: 10.1371/journal.pcbi.1004744] [Citation(s) in RCA: 49] [Impact Index Per Article: 6.1] [Received: 03/04/2015] [Accepted: 01/07/2016] [Indexed: 11/18/2022] Open
Abstract
MicroRNAs are important regulators of gene expression, acting primarily by binding to sequence-specific locations on already transcribed messenger RNAs (mRNA) and typically down-regulating their stability or translation. Recent studies indicate that microRNAs may also play a role in up-regulating mRNA transcription levels, although a definitive mechanism has not been established. Double-helical DNA is capable of forming triple-helical structures through Hoogsteen and reverse Hoogsteen interactions in the major groove of the duplex, and we show physical evidence (i.e., NMR, FRET, SPR) that purine or pyrimidine-rich microRNAs of appropriate length and sequence form triple-helical structures with purine-rich sequences of duplex DNA, and identify microRNA sequences that favor triplex formation. We developed an algorithm (Trident) to search genome-wide for potential triplex-forming sites and show that several mammalian and non-mammalian genomes are enriched for strong microRNA triplex binding sites. We show that those genes containing sequences favoring microRNA triplex formation are markedly enriched (3.3 fold, p < 2.2 × 10⁻¹⁶) for genes whose expression is positively correlated with expression of microRNAs targeting triplex binding sequences. This work has thus revealed a new mechanism by which microRNAs could interact with gene promoter regions to modify gene transcription.
We provide physical evidence, using NMR, FRET and SPR, that purine or pyrimidine-rich microRNAs can form triplexes with complementary purine-rich sequences of duplex DNA and provide an algorithm (Trident) to search genome-wide for potential microRNA double-stranded DNA triplex-forming sites. Using this algorithm we document enrichment of microRNA triplex binding sites in mammalian and non-mammalian genomes. We found in primary leukemia cells from patients a significant over-representation of positively correlated microRNA and mRNA expression for genes containing sequences favoring microRNA-duplex DNA triplex formation, suggesting this as a mechanism by which microRNA may enhance gene transcription.
Affiliation(s)
- Steven W. Paugh
- Hematological Malignancies Program, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Department of Pharmaceutical Sciences, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- David R. Coss
- High Performance Computing Facility, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Ju Bao
- Department of Pharmaceutical Sciences, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Lucas T. Laudermilk
- Hematological Malignancies Program, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Department of Pharmaceutical Sciences, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Christy R. Grace
- Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Antonio M. Ferreira
- High Performance Computing Facility, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- M. Brett Waddell
- Molecular Interaction Analysis Laboratory, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Granger Ridout
- Functional Genomics Laboratory, Hartwell Center for Bioinformatics & Biotechnology, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Deanna Naeve
- Functional Genomics Laboratory, Hartwell Center for Bioinformatics & Biotechnology, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Michael Leuze
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, United States of America
- John C. Panetta
- Department of Pharmaceutical Sciences, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Mark R. Wilkinson
- Hematological Malignancies Program, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Department of Pharmaceutical Sciences, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Ching-Hon Pui
- Hematological Malignancies Program, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Department of Oncology, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Clayton W. Naeve
- Department of Information Sciences, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Edward C. Uberbacher
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, United States of America
- Erik J. Bonten
- Hematological Malignancies Program, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Department of Pharmaceutical Sciences, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- William E. Evans
- Hematological Malignancies Program, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Department of Pharmaceutical Sciences, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
2
Jun SR, Leuze MR, Nookaew I, Uberbacher EC, Land M, Zhang Q, Wanchai V, Chai J, Nielsen M, Trolle T, Lund O, Buzard GS, Pedersen TD, Wassenaar TM, Ussery DW. Ebolavirus comparative genomics. FEMS Microbiol Rev 2015; 39:764-78. [PMID: 26175035 PMCID: PMC4551310 DOI: 10.1093/femsre/fuv031] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Accepted: 06/08/2015] [Indexed: 12/17/2022] Open
Abstract
The 2014 Ebola outbreak in West Africa is the largest documented for this virus. To examine the dynamics of this genome, we compare more than 100 currently available ebolavirus genomes to each other and to other viral genomes. Based on oligomer frequency analysis, the family Filoviridae forms a distinct group from all other sequenced viral genomes. All filovirus genomes sequenced to date encode proteins with similar functions and gene order, although there is considerable divergence in sequences between the three genera Ebolavirus, Cuevavirus and Marburgvirus within the family Filoviridae. Whereas all ebolavirus genomes are quite similar (multiple sequences of the same strain are often identical), variation is most common in the intergenic regions and within specific areas of the genes encoding the glycoprotein (GP), nucleoprotein (NP) and polymerase (L). We predict regions that could contain epitope-binding sites, which might be good vaccine targets. This information, combined with glycosylation sites and experimentally determined epitopes, can identify the most promising regions for the development of therapeutic strategies.
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
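The oligomer frequency analysis mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the oligomer length or the distance measure used, so the choice of `k=3`, the Euclidean distance, and the function names below are assumptions for illustration only, not the authors' procedure.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq, k=3):
    """Return the relative frequency of every DNA k-mer, in a fixed order.

    Assumption: genomes are given as A/C/G/T strings; windows containing
    other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


def euclidean(a, b):
    """Euclidean distance between two frequency profiles (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Profiles computed this way for a set of genomes could then be compared pairwise, with small distances indicating similar oligomer usage.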
Affiliation(s)
- Se-Ran Jun
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Joint Institute for Computational Sciences, University of Tennessee, Knoxville, TN 37996, USA
- Michael R Leuze
- Computer Science and Mathematics Division, Computer Science Research Group, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Intawat Nookaew
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Edward C Uberbacher
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Miriam Land
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Qian Zhang
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- UT-ORNL Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, TN 37996, USA
- Visanu Wanchai
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Juanjuan Chai
- Computer Science and Mathematics Division, Computer Science Research Group, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Morten Nielsen
- Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
- Instituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, San Martín, B 1650 HMP, Buenos Aires, Argentina
- Thomas Trolle
- Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
- Ole Lund
- Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
- Thomas D Pedersen
- Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
- Assays, Cultures and Enzymes Division, Chr. Hansen A/S, Hørsholm, Denmark
- Trudy M Wassenaar
- Molecular Microbiology and Genomics Consultants, Tannenstr 7, D-55576 Zotzenheim, Germany
- David W Ussery
- Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- UT-ORNL Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, TN 37996, USA
- Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
3
Karpinets TV, Park BH, Syed MH, Klotz MG, Uberbacher EC. Metabolic environments and genomic features associated with pathogenic and mutualistic interactions between bacteria and plants. Mol Plant Microbe Interact 2014; 27:664-677. [PMID: 24580106 DOI: 10.1094/mpmi-12-13-0368-r] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 06/03/2023]
Abstract
Genomic characteristics discriminating parasitic and mutualistic relationships of bacterial symbionts with plants are poorly understood. This study comparatively analyzed the genomes of 54 mutualists and pathogens to discover genomic markers associated with the different phenotypes. Using metabolic network models, we predict external environments associated with free-living and symbiotic lifestyles and quantify dependences of symbionts on the host in terms of the consumed metabolites. We show that specific differences between the phenotypes are pronounced at the levels of metabolic enzymes, especially carbohydrate active, and protein functions. Overall, biosynthetic functions are enriched and more diverse in plant mutualists whereas processes and functions involved in degradation and host invasion are enriched and more diverse in pathogens. A distinctive characteristic of plant pathogens is a putative novel secretion system with a circadian rhythm regulator. A specific marker of plant mutualists is the co-residence of genes encoding nitrogenase and ribulose bisphosphate carboxylase/oxygenase (RuBisCO). We predict that RuBisCO is likely used in a putative metabolic pathway to supplement carbon obtained heterotrophically with low-cost assimilation of carbon from CO2. We validate results of the comparative analysis by predicting the correct phenotype, pathogenic or mutualistic, for 20 symbionts in an independent set of 30 pathogens, mutualists, and commensals.
4
Ghattyvenkatakrishna PK, Alekozai EM, Beckham GT, Schulz R, Crowley MF, Uberbacher EC, Cheng X. Initial recognition of a cellodextrin chain in the cellulose-binding tunnel may affect cellobiohydrolase directional specificity. Biophys J 2013; 104:904-12. [PMID: 23442969 DOI: 10.1016/j.bpj.2012.12.052] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 09/11/2012] [Revised: 12/14/2012] [Accepted: 12/27/2012] [Indexed: 10/27/2022] Open
Abstract
Cellobiohydrolases processively hydrolyze glycosidic linkages in individual polymer chains of cellulose microfibrils, and typically exhibit specificity for either the reducing or nonreducing end of cellulose. Here, we conduct molecular dynamics simulations and free energy calculations to examine the initial binding of a cellulose chain into the catalytic tunnel of the reducing-end-specific Family 7 cellobiohydrolase (Cel7A) from Hypocrea jecorina. In unrestrained simulations, the cellulose diffuses into the tunnel from the -7 to the -5 positions, and the associated free energy profiles exhibit no barriers for initial processivity. The comparison of the free energy profiles for different cellulose chain orientations shows a thermodynamic preference for the reducing end, suggesting that the preferential initial binding may affect the directional specificity of the enzyme by impeding nonproductive (nonreducing end) binding. Finally, the Trp-40 at the tunnel entrance is shown with free energy calculations to have a significant effect on initial chain complexation in Cel7A.
5
GhattyVenkataKrishna PK, Uberbacher EC, Cheng X. Effect of the amyloid β hairpin's structure on the handedness of helices formed by its aggregates. FEBS Lett 2013; 587:2649-55. [PMID: 23845280 DOI: 10.1016/j.febslet.2013.06.050] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 02/19/2013] [Revised: 05/16/2013] [Accepted: 06/21/2013] [Indexed: 11/16/2022]
Abstract
Various structural models for amyloid β fibrils have been derived from a variety of experimental techniques. However, these models cannot differentiate between the relative position of the two arms of the β hairpin called the stagger. Amyloid fibrils of various hierarchical levels form left-handed helices composed of β sheets. However, it is unclear if positive, negative and zero staggers all form the macroscopic left-handed helices. To address this issue we have conducted extensive molecular dynamics simulations of amyloid β sheets of various staggers and shown that only negative staggers lead to the experimentally observed left-handed helices while positive staggers generate the incorrect right-handed helices. This result suggests that negative staggers are the physiologically relevant structure of the amyloid β fibrils.
6
Ghattyvenkatakrishna PK, Uberbacher EC. Effect of temperature and glycerol on the hydrogen-bond dynamics of water. Cryo Letters 2013; 34:166-173. [PMID: 23625085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
The effect of glycerol, water and glycerol-water binary mixtures on the structure and dynamics of biomolecules has been well studied. However, a lot remains to be learned about the effect of varying glycerol concentration and temperature on the dynamics of water. We have studied the effect of concentration and temperature on the hydrogen bonded network formed by water molecules. A strong correlation between the relaxation time of the network and the average number of hydrogen bonds per water molecule was found. The radial distribution function of water oxygen and hydrogen atoms clarifies the effect of concentration on the structure and clustering of water.
7
Abstract
MOTIVATION: Gene prediction in metagenomic sequences remains a difficult problem. Current sequencing technologies do not achieve sufficient coverage to assemble the individual genomes in a typical sample; consequently, sequencing runs produce a large number of short sequences whose exact origin is unknown. Since these sequences are usually smaller than the average length of a gene, algorithms must make predictions based on very little data. RESULTS: We present MetaProdigal, a metagenomic version of the gene prediction program Prodigal, that can identify genes in short, anonymous coding sequences with a high degree of accuracy. The novel value of the method consists of enhanced translation initiation site identification, ability to identify sequences that use alternate genetic codes and confidence values for each gene call. We compare the results of MetaProdigal with other methods and conclude with a discussion of future improvements. AVAILABILITY: The Prodigal software is freely available under the General Public License from http://code.google.com/p/prodigal/.
Affiliation(s)
- Doug Hyatt
- Computational Biology and Bioinformatics Group, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA.
8
Leuze MR, Karpinets TV, Syed MH, Beliaev AS, Uberbacher EC. Binding Motifs in Bacterial Gene Promoters Modulate Transcriptional Effects of Global Regulators CRP and ArcA. Gene Regul Syst Bio 2012; 6:93-107. [PMID: 22701314 PMCID: PMC3370831 DOI: 10.4137/grsb.s9357] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/09/2023]
Abstract
Bacterial gene regulation involves transcription factors (TF) that bind to DNA recognition sequences in operon promoters. These recognition sequences, many of which are palindromic, are known as regulatory elements or transcription factor binding sites (TFBS). Some TFs are global regulators that can modulate the expression of hundreds of genes. In this study we examine global regulator half-sites, where a half-site, which we shall call a binding motif (BM), is one half of a palindromic TFBS. We explore the hypothesis that the number of BMs plays an important role in transcriptional regulation, examining empirical data from transcriptional profiling of the CRP and ArcA regulons. We compare the power of BM counts and of full TFBS characteristics to predict induced transcriptional activity. We find that CRP BM counts have a nonlinear effect on CRP-dependent transcriptional activity and predict this activity better than full TFBS quality or location.
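Counting binding motifs (half-sites) in a promoter, as described in the abstract, can be sketched as follows. The abstract does not give the motif sequences or the counting rules, so the half-site `TGTGA` used in the usage note, the exact-match criterion, and the function names are illustrative assumptions rather than the authors' method.

```python
def reverse_complement(seq):
    """Reverse complement of an A/C/G/T string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def count_binding_motifs(promoter, half_site):
    """Count exact occurrences of a half-site on either strand.

    Assumption: a binding motif (BM) is an exact match to the half-site or
    to its reverse complement; overlapping matches are each counted.
    """
    promoter = promoter.upper()
    half_site = half_site.upper()
    targets = {half_site, reverse_complement(half_site)}
    k = len(half_site)
    return sum(
        1 for i in range(len(promoter) - k + 1) if promoter[i:i + k] in targets
    )
```

For example, `count_binding_motifs("AAATGTGATCTAGATCACATTT", "TGTGA")` finds one match on each strand of this hypothetical palindromic site, so a full palindromic TFBS contributes two BMs under these assumptions.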
Affiliation(s)
- Michael R. Leuze
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
- Tatiana V. Karpinets
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
- Department of Plant Sciences, University of Tennessee, Knoxville, TN, USA
- Mustafa H. Syed
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
- Alexander S. Beliaev
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA
9
Abstract
Due to advances in high-throughput biotechnologies, biological information is being collected in databases at an amazing rate, requiring novel computational approaches that process collected data into new knowledge in a timely manner. In this study, we propose a computational framework for discovering modular structure, relationships and regularities in complex data. The framework utilizes a semantic-preserving vocabulary to convert records of biological annotations of an object, such as an organism, gene, chemical or sequence, into networks (Anets) of the associated annotations. An association between a pair of annotations in an Anet is determined by the similarity of their co-occurrence pattern with all other annotations in the data. This feature captures associations between annotations that do not necessarily co-occur with each other and facilitates discovery of the most significant relationships in the collected data through clustering and visualization of the Anet. To demonstrate this approach, we applied the framework to the analysis of metadata from the Genomes OnLine Database and produced a biological map of sequenced prokaryotic organisms with three major clusters of metadata that represent pathogens, environmental isolates and plant symbionts.
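The idea of linking two annotations by the similarity of their co-occurrence patterns with all other annotations can be sketched as follows. The abstract does not name the similarity measure, so cosine similarity, the record format (one set of annotation terms per object), and the function names here are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations
import math


def cooccurrence(records):
    """Count how often each ordered pair of annotations shares a record."""
    counts = defaultdict(int)
    for record in records:
        for a, b in combinations(sorted(set(record)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def association(a, b, vocab, counts):
    """Cosine similarity of the co-occurrence patterns of a and b.

    The patterns are taken over all annotations other than a and b, so two
    annotations can be associated without ever co-occurring themselves.
    """
    others = [c for c in sorted(vocab) if c not in (a, b)]
    va = [counts.get((a, c), 0) for c in others]
    vb = [counts.get((b, c), 0) for c in others]
    dot = sum(x * y for x, y in zip(va, vb))
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(y * y for y in vb))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

With the hypothetical records `[{"pathogen", "host"}, {"virulent", "host"}]`, the terms `pathogen` and `virulent` never appear together, yet they receive a high association because both co-occur with `host`.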
Affiliation(s)
- Tatiana V Karpinets
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA.
10
Abstract
The BioEnergy Science Center (BESC) is undertaking large experimental campaigns to understand the biosynthesis and biodegradation of biomass and to develop biofuel solutions. BESC is generating large volumes of diverse data, including genome sequences, omics data and assay results. The purpose of the BESC Knowledgebase is to serve as a centralized repository for experimentally generated data and to provide an integrated, interactive and user-friendly analysis framework. The Portal makes available tools for visualization, integration and analysis of data either produced by BESC or obtained from external resources. AVAILABILITY: http://besckb.ornl.gov.
Affiliation(s)
- Mustafa H Syed
- BioEnergy Science Center, Oak Ridge National Laboratory, Oak Ridge, TN, USA.
11
Park BH, Karpinets TV, Syed MH, Leuze MR, Uberbacher EC. CAZymes Analysis Toolkit (CAT): web service for searching and analyzing carbohydrate-active enzymes in a newly sequenced organism using CAZy database. Glycobiology 2010; 20:1574-84. [PMID: 20696711 DOI: 10.1093/glycob/cwq106] [Citation(s) in RCA: 231] [Impact Index Per Article: 16.5] [Indexed: 01/29/2023] Open
Abstract
The Carbohydrate-Active Enzyme (CAZy) database provides a rich set of manually annotated enzymes that degrade, modify, or create glycosidic bonds. Despite the rich and invaluable information stored in the database, software tools utilizing this information for annotation of newly sequenced genomes by CAZy families are limited. We have employed two annotation approaches to fill the gap between manually curated high-quality protein sequences collected in the CAZy database and the growing number of other protein sequences produced by genome or metagenome sequencing projects. The first approach is based on a similarity search against the entire set of nonredundant sequences of the CAZy database. The second approach performs annotation using links or correspondences between the CAZy families and protein family domains. The links were discovered using the association rule learning algorithm applied to sequences from the CAZy database. The approaches complement each other and in combination achieved high specificity and sensitivity when cross-evaluated with the manually curated genomes of Clostridium thermocellum ATCC 27405 and Saccharophagus degradans 2-40. The capability of the proposed framework to predict the function of unknown protein domains and of hypothetical proteins in the genome of Neurospora crassa is demonstrated. The framework is implemented as a Web service, the CAZymes Analysis Toolkit, and is available at http://cricket.ornl.gov/cgi-bin/cat.cgi.
12
Karpinets TV, Romine MF, Schmoyer DD, Kora GH, Syed MH, Leuze MR, Serres MH, Park BH, Samatova NF, Uberbacher EC. Shewanella knowledgebase: integration of the experimental data and computational predictions suggests a biological role for transcription of intergenic regions. Database (Oxford) 2010; 2010:baq012. [PMID: 20627862 PMCID: PMC2911847 DOI: 10.1093/database/baq012] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 02/07/2023]
Abstract
Shewanellae are facultative γ-proteobacteria whose remarkable respiratory versatility has resulted in interest in their utility for bioremediation of heavy metals and radionuclides and for energy generation in microbial fuel cells. Extensive experimental efforts over the last several years and the availability of 21 sequenced Shewanella genomes made it possible to collect and integrate a wealth of information on the genus into one public resource providing new avenues for making biological discoveries and for developing a system level understanding of the cellular processes. The Shewanella knowledgebase was established in 2005 to provide a framework for integrated genome-based studies on Shewanella ecophysiology. The present version of the knowledgebase provides access to a diverse set of experimental and genomic data along with tools for curation of genome annotations and visualization and integration of genomic data with experimental data. As a demonstration of the utility of this resource, we examined a single microarray data set from Shewanella oneidensis MR-1 for new insights into regulatory processes. The integrated analysis of the data predicted a new type of bacterial transcriptional regulation involving co-transcription of the intergenic region with the downstream gene and suggested a biological role for co-transcription that likely prevents the binding of a regulator of the upstream gene to the regulator binding site located in the intergenic region. Database URL: http://shewanella-knowledgebase.org:8080/Shewanella/ or http://spruce.ornl.gov:8080/Shewanella/
13
Syed MH, Karpinets TV, Leuze MR, Kora GH, Romine MR, Uberbacher EC. Shewregdb: Database and visualization environment for experimental and predicted regulatory information in Shewanella oneidensis mr-1. Bioinformation 2009; 4:169-72. [PMID: 20198195 PMCID: PMC2825598 DOI: 10.6026/97320630004169] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 05/27/2009] [Revised: 07/20/2009] [Accepted: 09/11/2009] [Indexed: 12/05/2022] Open
Abstract
Shewanella oneidensis MR-1 is an important model organism for environmental research as it has an exceptional metabolic and respiratory versatility regulated by a complex regulatory network. We have developed a database to collect experimental and computational data relating to regulation of gene and protein expression, and a visualization environment that enables integration of these data types. The regulatory information in the database includes predictions of DNA regulator binding sites, sigma factor binding sites, transcription units, operons, promoters, and RNA regulators including non-coding RNAs, riboswitches, and different types of terminators.
Affiliation(s)
- Mustafa H Syed
- Oak Ridge National Laboratory, Oak Ridge, Tennessee, 37831, USA.
14
Karpinets TV, Obraztsova AY, Wang Y, Schmoyer DD, Kora GH, Park BH, Serres MH, Romine MF, Land ML, Kothe TB, Fredrickson JK, Nealson KH, Uberbacher EC. Conserved synteny at the protein family level reveals genes underlying Shewanella species' cold tolerance and predicts their novel phenotypes. Funct Integr Genomics 2009; 10:97-110. [PMID: 19802638 PMCID: PMC2834769 DOI: 10.1007/s10142-009-0142-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 05/28/2009] [Revised: 08/24/2009] [Accepted: 09/10/2009] [Indexed: 01/26/2023]
Abstract
Bacteria of the genus Shewanella can thrive in different environments and demonstrate significant variability in their metabolic and ecophysiological capabilities including cold and salt tolerance. Genomic characteristics underlying this variability across species are largely unknown. In this study, we address the problem by a comparison of the physiological, metabolic, and genomic characteristics of 19 sequenced Shewanella species. We have employed two novel approaches based on association of a phenotypic trait with the number of the trait-specific protein families (Pfam domains) and on the conservation of synteny (order in the genome) of the trait-related genes. Our first approach is top-down and involves experimental evaluation and quantification of the species’ cold tolerance followed by identification of the correlated Pfam domains and genes with a conserved synteny. The second, a bottom-up approach, predicts novel phenotypes of the species by calculating profiles of each Pfam domain among their genomes and following pair-wise correlation of the profiles and their network clustering. Using the first approach, we find a link between cold and salt tolerance of the species and the presence in the genome of a Na+/H+ antiporter gene cluster. Other cold-tolerance-related genes include peptidases, chemotaxis sensory transducer proteins, a cysteine exporter, and helicases. Using the bottom-up approach, we found several novel phenotypes in the newly sequenced Shewanella species, including degradation of aromatic compounds by an aerobic hybrid pathway in Shewanella woodyi, degradation of ethanolamine by Shewanella benthica, and propanediol degradation by Shewanella putrefaciens CN32 and Shewanella sp. W3-18-1.
15
Karpinets TV, Pelletier DA, Pan C, Uberbacher EC, Melnichenko GV, Hettich RL, Samatova NF. Phenotype fingerprinting suggests the involvement of single-genotype consortia in degradation of aromatic compounds by Rhodopseudomonas palustris. PLoS One 2009; 4:e4615. [PMID: 19242537 PMCID: PMC2643473 DOI: 10.1371/journal.pone.0004615] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 12/11/2008] [Accepted: 01/07/2009] [Indexed: 11/18/2022] Open
Abstract
Anaerobic degradation of complex organic compounds by microorganisms is crucial for the development of innovative biotechnologies for bioethanol production and for efficient degradation of environmental pollutants. In natural environments, the degradation is usually accomplished by syntrophic consortia composed of different bacterial species. This strategy allows consortium organisms to reduce the effort required to maintain redox homeostasis at each syntrophic level. The cellular mechanisms that maintain redox homeostasis during the degradation of aromatic compounds by a single organism are not fully understood. Here we present the hypothesis that the metabolically versatile phototrophic bacterium Rhodopseudomonas palustris forms its own syntrophic consortia when it grows anaerobically on p-coumarate or benzoate as the sole carbon source. We have revealed the consortia from large-scale measurements of mRNA and protein expression under p-coumarate, benzoate and succinate degrading conditions using a novel computational approach referred to as phenotype fingerprinting. In this approach, marker genes for known R. palustris phenotypes are employed to determine the relative expression levels of genes and proteins in aromatics versus non-aromatics degrading conditions. Subpopulations of the consortia are inferred from the expression of phenotypes and the known metabolic modes of R. palustris growth. We find that p-coumarate degrading conditions may lead to at least three R. palustris subpopulations utilizing p-coumarate, benzoate, and CO2 and H2. Benzoate degrading conditions may also produce at least three subpopulations utilizing benzoate, CO2 and H2, and N2 and formate. Communication among syntrophs and inter-syntrophic dynamics in each consortium are indicated by up-regulation of transporters and of genes involved in curli formation and chemotaxis.
The N2-fixing subpopulation in the benzoate degrading consortium shows preferential activation of the vanadium nitrogenase over the molybdenum nitrogenase. The presence of this subpopulation in the consortium was confirmed in an independent experiment by the consumption of dissolved nitrogen gas under benzoate degrading conditions.
Affiliation(s)
- Tatiana V Karpinets
- Computational Biology Institute, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA.
|
16
|
Crowley MF, Uberbacher EC, Brooks CL, Walker RC, Nimlos MR, Himmel ME. Developing improved MD codes for understanding processive cellulases. J Phys Conf Ser 2008; 125:012049. [DOI: 10.1088/1742-6596/125/1/012049] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
|
17
|
Abstract
The Gene Recognition and Analysis Internet Link (GRAIL) is one of the most widely used systems for evaluating the protein-coding potential of anonymous DNA sequences. This unit describes the use of the XGRAIL and genQuest client-server applications to locate exons in DNA sequences, to develop gene models, and to search databases for homologs. A support protocol describes how to obtain the GRAIL and genQuest client software by anonymous FTP.
|
18
|
Narasimhan C, Tabb DL, Verberkmoes NC, Thompson MR, Hettich RL, Uberbacher EC. MASPIC: intensity-based tandem mass spectrometry scoring scheme that improves peptide identification at high confidence. Anal Chem 2005; 77:7581-93. [PMID: 16316165 DOI: 10.1021/ac0501745] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.5] [Indexed: 11/28/2022]
Abstract
Algorithmic search engines bridge the gap between large tandem mass spectrometry data sets and the identification of proteins associated with biological samples. Improvements in these tools can greatly enhance biological discovery. We present a new scoring scheme for comparing tandem mass spectra with a protein sequence database. The MASPIC (Multinomial Algorithm for Spectral Profile-based Intensity Comparison) scorer converts an experimental tandem mass spectrum into an m/z profile of probability and then scores peak lists from potential candidate peptides using a multinomial distribution model. The MASPIC scoring scheme incorporates intensity, spectral peak density variations, and the m/z error distribution associated with peak matches into a multinomial distribution. The scoring scheme was validated on two standard protein mixtures and an additional set of spectra collected on a complex ribosomal protein mixture from Rhodopseudomonas palustris. The results indicate a 5-15% improvement over Sequest for high-confidence identifications. The performance gap grows as sequence database size increases. Additional tests on spectra from proteinase-K digest data showed similar performance improvements, demonstrating the advantages of using MASPIC for studying proteins digested with less specific proteases. Together, these investigations show MASPIC to be a versatile and reliable system for peptide tandem mass spectral identification.
Affiliation(s)
- Chandrasegaran Narasimhan
- Graduate School of Genome Science and Technology, University of Tennessee--Oak Ridge National Laboratory, 37830-8026, USA.
|
19
|
Razumovskaya J, Olman V, Xu D, Uberbacher EC, VerBerkmoes NC, Hettich RL, Xu Y. A computational method for assessing peptide-identification reliability in tandem mass spectrometry analysis with SEQUEST. Proteomics 2004; 4:961-9. [PMID: 15048978 DOI: 10.1002/pmic.200300656] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.2] [Indexed: 11/06/2022]
Abstract
High-throughput protein identification in mass spectrometry is predominantly achieved by first identifying tryptic peptides by a database search and then by combining the peptide hits for protein identification. One of the popular tools used for the database search is SEQUEST. Peptide identification is carried out by selecting SEQUEST hits above a specified threshold, the value of which is typically chosen empirically in an attempt to separate true identifications from false ones. These SEQUEST scores are not normalized with respect to the composition, length and other parameters of the peptides. Furthermore, there is no rigorous reliability estimate assigned to the protein identifications derived from these scores. Hence, the interpretation of SEQUEST hits generally requires human involvement, making it difficult to scale up the identification process for genome-scale applications. To overcome these limitations, we have developed a method, which combines a neural network and a statistical model, for normalizing SEQUEST scores, and also for providing a reliability estimate for each SEQUEST hit. This method improves the sensitivity and specificity of peptide identification compared to the standard filtering procedure used in the SEQUEST package, and provides a basis for estimating the reliability of protein identifications.
|
20
|
Uberbacher EC, Hyatt D, Shah M. GrailEXP and Genome Analysis Pipeline for genome annotation. Curr Protoc Bioinformatics 2004; Chapter 4:Unit4.9. [PMID: 18428726 DOI: 10.1002/0471250953.bi0409s04] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/26/2023]
Abstract
The Basic Protocol describes the use of GrailEXP, the latest version of the gene finding system from Oak Ridge National Laboratory. GrailEXP provides gene models, by making use of sequence similarity with Expressed Sequence Tags (ESTs) and known genes. GrailEXP also provides alternatively spliced constructs for each gene based on the available EST evidence. The Support Protocol describes the use of the Genome Analysis Pipeline, a web application which allows users to perform comprehensive sequence analysis by offering a selection from a wide choice of supported gene finders, other biological feature finders, and database searches.
|
21
|
Xu D, Kim D, Dam P, Shah M, Uberbacher EC, Xu Y. Characterization of protein structure and function at genome scale with a computational prediction pipeline. Genet Eng (N Y) 2003; 25:269-93. [PMID: 15260242 DOI: 10.1007/978-1-4615-0073-5_12] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 04/30/2023]
Affiliation(s)
- Dong Xu
- Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
|
22
|
Abstract
Protein modeling is playing an increasingly important role in protein and peptide sciences due to improvements in modeling methods, advances in computer technology, and the huge amount of biological data becoming available. Modeling tools can often predict the structure and shed some light on the function and its underlying mechanism. They can also provide insight for designing experiments and suggest possible leads for drug design. This review attempts to provide a comprehensive introduction to major computer programs, especially on-line servers, for protein modeling. The review covers the following aspects: (1) protein sequence comparison, including sequence alignment/search, sequence-based protein family classification, domain parsing, and phylogenetic classification; (2) sequence annotation, including annotation/prediction of hydrophobic profiles, transmembrane regions, active sites, signaling sites, and secondary structures; (3) protein structure analysis, including visualization, geometry analysis, structure comparison/classification, dynamics, and electrostatics; (4) three-dimensional structure prediction, including homology modeling, fold recognition using threading, ab initio prediction, and docking. We address what a user can expect from the computer tools in terms of their strengths and limitations. We also discuss the major challenges and future trends in the field. A collection of links to the tools can be found at http://compbio.ornl.gov/structure/resource/.
Affiliation(s)
- D Xu
- Computational Biosciences Section, Life Sciences Division, Oak Ridge National Laboratory, TN 37831-6480, USA.
|
23
|
Abstract
MOTIVATION: This paper investigates the sequence-structure specificity of a representative knowledge-based energy function by applying it to threading at the level of secondary structures of proteins. Assessing the strengths and weaknesses of an energy function at this fundamental level provides more detailed and insightful information than at the tertiary structure level, and the results obtained can be useful in tertiary level threading.
RESULTS: We threaded each of the 293 non-redundant proteins onto the secondary structures contained in its respective native protein (host template). We also used 68 pairs of proteins with similar folds and low sequence identity. For each pair, we threaded the sequence of one protein onto the secondary structures of the other protein. The discerning power of the total energy function and its one-body, pairwise, and mutation components is studied. We then applied our energy function to a recent study which demonstrated how a designed 11-amino acid sequence can replace distinct segments (one segment is an alpha-helix, the other is a beta-sheet) of a protein without changing its fold. We conducted random mutations of the designed sequence to determine the patterns for favorable mutations. We also studied the sequence-structure specificity at the boundaries of a secondary structure. Finally, we demonstrated how to speed up tertiary level threading by filtering out alignments found to be energetically unfavorable during the secondary structure threading.
AVAILABILITY: The program is available on request from the authors.
CONTACT: xud@ornl.gov
Affiliation(s)
- D Xu
- Computational Biosciences Section, Life Sciences Division Center for Engineering Science Advanced Research, Computer Science and Mathematics Division, Oak Ridge National Laboratory, PO Box 2008, Oak Ridge, TN 37831-6480, USA.
|
24
|
Affiliation(s)
- R J Mural
- Computational Biology Section, Oak Ridge National Laboratory, Oak Ridge, TN, USA.
|
25
|
Abstract
Computational recognition of native-like folds of an anonymous amino acid sequence from a protein fold database is considered to be a promising approach to the three-dimensional (3D) fold prediction of the amino acid sequence. We present a new method for protein fold recognition through optimally aligning an amino acid sequence and a protein fold template (protein threading). The fitness of aligning an amino acid sequence with a fold template is measured by (1) the singleton fitness, representing the compatibility of substituting one amino acid by another and the combined preference of secondary structure and solvent accessibility for a particular amino acid, (2) the pairwise interaction, representing the contact preference between a pair of amino acids, and (3) alignment gap penalties. Though a protein threading problem so defined is known to be NP-hard in the most general sense, our algorithm runs efficiently if we place a cutoff distance on the pairwise interactions, as many of the existing threading programs do. For an amino acid sequence of size n and a fold template of size m with M core secondary structures, the algorithm finds an optimal alignment in $O(Mn^{1.5C+1} + mn^{C+1})$ time and $O(Mn^{C+1})$ space, where C is a (small) nonnegative integer, determined by a particular mathematical property of the pairwise interactions. As a case study, we have demonstrated that C is less than or equal to 4 for about 75% of the 293 unique folds in our protein database, when pairwise interactions are restricted to amino acids ≤ 7 Å apart (measured between their beta carbon atoms). An approximation scheme is developed for fold templates with C > 4, when threading requires too much memory and time to be practical on a typical workstation.
Affiliation(s)
- Y Xu
- Computational Biosciences Section, Oak Ridge National Laboratory, Tennessee 37831-6480, USA
|
26
|
Uberbacher EC, Xu Y, Shah MB, Olman V, Parang M, Mural RJ. An editing environment for DNA sequence analysis and annotation (extended abstract). Pac Symp Biocomput 1998:217-27. [PMID: 9697184 DOI: 10.2172/563243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
Abstract
This paper presents a computer system for analyzing and annotating large-scale genomic sequences. The core of the system is a multiple-gene structure identification program, which predicts the most "probable" gene structures based on the given evidence, including pattern recognition, EST and protein homology information. A graphics-based user interface provides an environment which allows the user to interactively control the evidence to be used in the gene identification process. To overcome the computational bottleneck in the database similarity search used in the gene identification process, we have developed an effective way to partition a database into a set of sub-databases of "related" sequences, and reduced the search problem on a large database to a signature identification problem and a search problem on a much smaller sub-database. This reduces the number of sequences to be searched from N to $O(\sqrt{N})$ on average, and hence greatly reduces the search time, where N is the number of sequences in the original database. The system provides the user with the ability to facilitate and modify the analysis and modeling in real time.
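The reduction from N to about the square root of N comparisons can be illustrated with a toy two-level search. This is only a sketch under stated assumptions: sorted integers stand in for sequences, and a (min, max) range stands in for the sub-database "signature"; the abstract does not describe how the actual signatures are computed.

```python
import math

def build_partition(items):
    """Split a sorted list into about sqrt(N) buckets of about sqrt(N) items each."""
    n = len(items)
    size = max(1, math.isqrt(n))
    return [items[i:i + size] for i in range(0, n, size)]

def two_level_search(buckets, query):
    """Scan bucket signatures first, then scan only the matching bucket.

    Returns (found, comparisons) so the cost can be compared with a linear scan.
    """
    comparisons = 0
    for bucket in buckets:
        comparisons += 1                      # one signature test per bucket
        if bucket[0] <= query <= bucket[-1]:  # hypothetical signature: (min, max)
            for item in bucket:
                comparisons += 1              # one test per member of that bucket
                if item == query:
                    return True, comparisons
            return False, comparisons
    return False, comparisons

items = list(range(10_000))                   # N = 10,000 stand-in "sequences"
buckets = build_partition(items)
found, comparisons = two_level_search(buckets, 9_999)  # worst case for this layout
```

With N = 10,000 the worst case costs 100 signature tests plus 100 member tests, against up to 10,000 comparisons for a linear scan, which is the square-root behaviour the abstract describes.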
Affiliation(s)
- E C Uberbacher
- Computer Science and Mathematics Division, Oak Ridge National Laboratory TN 37831-6364, USA
|
27
|
Xu Y, Mural RJ, Uberbacher EC. Inferring gene structures in genomic sequences using pattern recognition and expressed sequence tags. Proc Int Conf Intell Syst Mol Biol 1997; 5:344-53. [PMID: 9322060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Abstract
Computational methods for gene identification in genomic sequences typically have two phases: coding region prediction and gene parsing. While there are many effective methods for predicting coding regions (exons), parsing the predicted exons into proper gene structures, to a large extent, remains an unsolved problem. This paper presents an algorithm for inferring gene structures from predicted exon candidates, based on Expressed Sequence Tags (ESTs) and biological intuition/rules. The algorithm first finds all the related ESTs in the EST database (dbEST) for each predicted exon, and infers the boundaries of one or a series of genes based on the available EST information and biological rules. Then it constructs gene models within each pair of gene boundaries, that are most consistent with the EST information. By exploiting EST information and biological rules, the algorithm can (1) model complicated multiple gene structures, including embedded genes, (2) identify falsely-predicted exons and locate missed exons, and (3) make more accurate exon boundary predictions. The algorithm has been implemented and tested on long genomic sequences with a number of genes. Test results show that very accurate (predicted) gene models can be expected when related ESTs exist for the predicted exons.
Affiliation(s)
- Y Xu
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, TN 37831-6364, USA.
|
28
|
Abstract
Computational methods for gene identification in genomic sequences typically have two phases: coding region recognition and gene parsing. While there are a number of effective methods for recognizing coding regions (exons), parsing the recognized exons into proper gene structures, to a large extent, remains an unsolved problem. We have developed a computer program which can automatically parse the recognized exons into gene models that are most consistent with the available Expressed Sequence Tags (ESTs) and a set of biological heuristics, derived empirically. The gene modeling algorithm used in this program provides a general framework for applying EST information so the modeling accuracy improves as the amount of available EST information increases. Based on preliminary tests on a number of large DNA sequences, using the dbEST database, we have observed that the algorithm can (1) accurately model complicated multiple gene structures, including embedded genes, (2) identify falsely-recognized exons and locate missed exons by the initial exon recognition phase, and (3) make more accurate exon boundary predictions, if the necessary EST information is available. We have extended this EST-based gene modeling algorithm to model genes on unfinished DNA contigs at the end of the shotgun sequencing. This extended version can automatically determine the orientations and the relative order of the DNA contigs (with gaps between them) using the available ESTs as reference models, before the gene modeling phase.
Affiliation(s)
- Y Xu
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, Tennessee 37831-6364, USA.
|
29
|
Xu Y, Uberbacher EC. A polynomial-time algorithm for a class of protein threading problems. Comput Appl Biosci 1996; 12:511-7. [PMID: 9021270 DOI: 10.1093/bioinformatics/12.6.511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2023]
Abstract
This paper presents an algorithm for constructing an optimal alignment between a three-dimensional protein structure template and an amino acid sequence. A protein structure template is given as a sequence of amino acid residue positions in three-dimensional space, along with an array of physical properties attached to each position; these residue positions are sequentially grouped into a series of core secondary structures (central helices and beta sheets). In addition to match scores and gap penalties, as in a traditional sequence-sequence alignment problem, the quality of a structure-sequence alignment is also determined by interaction preferences among amino acids aligned with structure positions that are spatially close (we call these 'long-range interactions'). Although it is known that constructing such a structure-sequence alignment in the most general form is NP-hard, our algorithm runs in polynomial time when restricted to structures with a 'modest' number of long-range amino acid interactions. In the current work, long-range interactions are limited to interactions between amino acids from different core secondary structures. Dividing the series of core secondary structures into two subseries creates a cut set of long-range interactions. If we use N, M and C to represent the size of an amino acid sequence, the size of a structure template, and the maximum cut size of long-range interactions, respectively, the algorithm finds an optimal structure-sequence alignment in $O(21^{C}NM)$ time, a polynomial function of N and M when C = O(log(N + M)). When running on structure-sequence alignment problems without long-range interactions, i.e. C = 0, the algorithm achieves the same asymptotic computational complexity as the Smith-Waterman sequence-sequence alignment algorithm.
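The polynomial-time claim follows from a short calculation, sketched here on the reading that the stated bound is 21 raised to the power C, times NM. The constant c and the choice of base-2 logarithm are assumptions introduced only for the illustration.

```latex
% Assume C \le c \log_2 (N + M) for some constant c > 0.
\begin{align*}
21^{C} N M
  &\le 21^{\,c \log_2 (N+M)} \, N M \\
  &= (N+M)^{\,c \log_2 21} \, N M
     && \text{since } a^{\log_b x} = x^{\log_b a} \\
  &\approx (N+M)^{4.39\,c} \, N M ,
\end{align*}
% which is polynomial in N and M for fixed c.
% For C = 0 the bound reduces to 21^{0} N M = N M,
% the O(NM) cost of Smith-Waterman alignment.
```

The exponent grows with c, so the bound is practical only when the maximum cut size is a small multiple of log(N + M).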
Affiliation(s)
- Y Xu
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, TN 37831-6364, USA.
|
30
|
Harp JM, Uberbacher EC, Roberson AE, Palmer EL, Gewiess A, Bunick GJ. X-ray Diffraction Analysis of Crystals Containing Twofold Symmetric Nucleosome Core Particles. Acta Crystallogr D Biol Crystallogr 1996; 52:283-8. [PMID: 15299701 DOI: 10.1107/s0907444995009139] [Citation(s) in RCA: 21] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022]
Abstract
Nucleosome core particles containing a DNA palindrome and purified chicken erythrocyte histone octamer have been reconstituted and crystallized. The dyad symmetry of the palindrome extends the dyad symmetry of the histone octamer to result in a twofold symmetric particle. This ensures that the structure determined by X-ray diffraction will yield a true representation of the DNA strand rather than the twofold averaged structure which would result from using a non-palindromic DNA sequence. The crystals provide isotropic diffraction to 3.2 Å with observed reflections extending to d spacings of about 2.8 Å using a rotating-anode Cu Kα X-ray source. Although the DNA palindrome is a factor contributing to the quality of the diffraction data, another significant factor is an improved preparative technique which enriches for correctly phased nucleosome core particles.
Affiliation(s)
- J M Harp
- The University of Tennessee/Oak Ridge, Graduate School of Biomedical Sciences, Oak Ridge National Laboratory, 37831-8077, USA
|
31
|
Abstract
Molecular sequences, like all experimental data, are subject to error. Many current DNA sequencing protocols have very significant error rates and often generate artefactual insertions and deletions of bases (indels) which corrupt the translation of sequences and compromise the detection of protein homologies. The impact of these errors on the utility of molecular sequence data is dependent on the analytic technique used to interpret the data. In the presence of frameshift errors, standard algorithms using six-frame translation can miss important homologies because only subfragments of the correct translation are available in any given frame. We present a new algorithm which can detect and correct frameshift errors in DNA sequences during comparison of translated sequences with protein sequences in the databases. This algorithm can recognize homologous proteins sharing 30% identity even in the presence of a 7% frameshift error rate. Our algorithm uses dynamic programming, producing a guaranteed optimal alignment in the presence of frameshifts, and has a sensitivity equivalent to Smith-Waterman. The computational efficiency of the algorithm is O(nm) where n and m are the sizes of two sequences being compared. The algorithm does not rely on prior knowledge or heuristic rules and performs significantly better than any previously reported method.
Affiliation(s)
- X Guan
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, TN 37831-6364, USA.
|
32
|
Abstract
Insertion and deletion (indel) sequencing errors in DNA coding regions disrupt DNA-to-protein translation frames, and hence make most frame-sensitive coding recognition approaches fail. This paper extends the authors' previous work on indel detection and "correction" algorithms, and presents a more effective algorithm for localizing indels that appear in DNA coding regions and "correcting" the located indels by inserting or deleting DNA bases. The algorithm localizes indels by discovering changes of the preferred translation frames within presumed coding regions, and then "corrects" them to restore a consistent translation frame within each coding region. An iterative strategy is exploited to repeatedly localize and "correct" indels until no more indels can be found. Test results have shown that this improved algorithm can detect and "correct" more indels while not worsening the rate of introduction of false indels when compared to the authors' previous work.
Affiliation(s)
- Y Xu
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, Tennessee 37831-6364, USA.
|
33
|
Affiliation(s)
- E C Uberbacher
- Computer Sciences and Mathematics Division, Oak Ridge National Laboratory, Tennessee 37831, USA
|
34
|
Abstract
This paper presents an algorithm for detecting and 'correcting' sequencing errors that occur in DNA coding regions. The types of sequencing errors addressed are insertions and deletions (indels) of DNA bases. The goal is to provide a capability which makes single-pass or low-redundancy sequence data more informative, reducing the need for high-redundancy sequencing for gene identification and characterization purposes. This would permit improved sequencing efficiency and reduce genome sequencing costs. The algorithm detects sequencing errors by discovering changes in the statistically preferred reading frame within a putative coding region and then inserts a number of 'neutral' bases at a perceived reading frame transition point to make the putative exon candidate frame consistent. We have implemented the algorithm as a front-end subsystem of the GRAIL DNA sequence analysis system to construct a version which is very error tolerant, and we also intend to use this as a testbed for further development of sequencing error-correction technology. Preliminary test results have shown the usefulness of this algorithm and also exhibited some of its weaknesses, providing possible directions for further improvement. On a test set consisting of 68 human DNA sequences with 1% randomly generated indels in coding regions, the algorithm detected and corrected 76% of the indels. The average distance between the position of an indel and the predicted one was 9.4 bases. With this subsystem in place, GRAIL correctly predicted 89% of the coding messages with 10% false messages on the 'corrected' sequences, compared to 69% correctly predicted coding messages and 11% falsely predicted messages on the 'corrupted' sequences using the standard GRAIL II method (version 1.2). (ABSTRACT TRUNCATED AT 250 WORDS)
Affiliation(s)
- Y Xu
- Engineering Physics and Mathematics Division, Oak Ridge National Laboratory, TN 37831-6364, USA
|
35
|
Craven MW, Mural RJ, Hauser LJ, Uberbacher EC. Predicting protein folding classes without overly relying on homology. Proc Int Conf Intell Syst Mol Biol 1995; 3:98-106. [PMID: 7584472] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2023]
Abstract
An important open problem in molecular biology is how to use computational methods to understand the structure and function of proteins given only their primary sequences. We describe and evaluate an original machine-learning approach to classifying protein sequences according to their structural folding class. Our work is novel in several respects: we use a set of protein classes that previously have not been used for classifying primary sequences, and we use a unique set of attributes to represent protein sequences to the learners. We evaluate our approach by measuring its ability to correctly classify proteins that were not in its training set. We compare our input representation to a commonly used input representation--amino acid composition--and show that our approach more accurately classifies proteins that have very limited homology to the sequences on which the systems are trained.
Affiliation(s)
- M W Craven
- Computer Sciences Department, University of Wisconsin-Madison 53706, USA
|
36
|
Abstract
This paper presents a computationally efficient algorithm, the Gene Assembly Program III (GAP III), for constructing gene models from a set of accurately-predicted 'exons'. The input to the algorithm is a set of clusters of exon candidates, generated by a new version of the GRAIL coding region recognition system. The exon candidates of a cluster differ in their presumed edges and occasionally in their reading frames. Each exon candidate has a numerical score representing its 'probability' of being an actual exon. GAP III uses a dynamic programming algorithm to construct a gene model, complete or partial, by optimizing a predefined objective function. The optimal gene models constructed by GAP III correspond very well with the structures of genes which have been determined experimentally and reported in the Genome Sequence Database (GSDB). On a test set of 137 human and mouse DNA sequences consisting of 954 true exons, GAP III constructed 137 gene models using 892 exons, among which 859 (859/954 = 90%) are true exons and 33 (33/892 = 3%) are false positive. Among the 859 true positives, 635 (74%) match the actual exons exactly, and 838 (98%) have at least one edge correct. GAP III is computationally efficient. If we use E and C to represent the total number of exon candidates in all clusters and the number of clusters, respectively, the running time of GAP III is proportional to (E x C).
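The flavour of assembling a gene model by dynamic programming can be shown with a deliberately simplified sketch. It is not GAP III: the objective here is simply the summed score of non-overlapping exon candidates (weighted interval scheduling), and reading frames, cluster membership, and splice constraints are ignored. The coordinates and scores are hypothetical.

```python
from bisect import bisect_right

def best_gene_model(candidates):
    """Pick non-overlapping exon candidates with the maximum total score.

    candidates: list of (start, end, score) with inclusive integer coordinates.
    Returns (total_score, chosen_exons_in_sequence_order).
    """
    exons = sorted(candidates, key=lambda e: e[1])   # order by end coordinate
    ends = [e[1] for e in exons]
    # best[i] holds the optimal (score, exon list) using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end strictly before this one starts.
        j = bisect_right(ends, start - 1, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])                     # skipping this exon is better
    return best[-1]

# Hypothetical exon candidates: two overlapping pairs and one isolated exon.
candidates = [
    (100, 200, 0.90),
    (150, 260, 0.60),   # overlaps the first candidate
    (300, 400, 0.80),
    (380, 450, 0.95),   # overlaps the previous candidate
    (500, 600, 0.70),
]
total, model = best_gene_model(candidates)
```

In this example the higher-scoring member of each overlapping pair is kept. The sketch runs in O(E log E) for E candidates, whereas the abstract reports a running time proportional to E x C for the actual cluster-based algorithm.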
Affiliation(s)
- Y Xu
- Engineering Physics and Mathematics Division, Oak Ridge National Laboratory, TN 37831-6364, USA
|
37
|
Abstract
The ultimate goal of the Human Genome Project is to extract the biologically relevant information recorded in the estimated 100,000 genes encoded by the 3 × 10^9 bases of the human genome. This necessitates development of reliable computer-based methods capable of analysing and correctly identifying genes in the vast amounts of DNA-sequence data generated. Such tools may save time and labour by simplifying, for example, screening of cDNA libraries. They may also facilitate the localization of human disease genes by identifying candidate genes in promising regions of anonymous DNA sequence.
Affiliation(s)
- R J Mural
- Biology Division, Oak Ridge National Laboratory, TN 37831-8077
|
38
|
Abstract
Genes in higher eukaryotes may span tens or hundreds of kilobases with the protein-coding regions accounting for only a few percent of the total sequence. Identifying genes within large regions of uncharacterized DNA is a difficult undertaking and is currently the focus of many research efforts. We describe a reliable computational approach for locating protein-coding portions of genes in anonymous DNA sequence. Using a concept suggested by robotic environmental sensing, our method combines a set of sensor algorithms and a neural network to localize the coding regions. Several algorithms that report local characteristics of the DNA sequence, and therefore act as sensors, are also described. In its current configuration the "coding recognition module" identifies 90% of coding exons of length 100 bases or greater with less than one false positive coding exon indicated per five coding exons indicated. This is a significantly lower false positive rate than any method of which we are aware. This module demonstrates a method with general applicability to sequence-pattern recognition problems and is available for current research efforts.
|
39
|
Abstract
The x-ray crystallographic structure of the nucleosome core particle has been determined using 8 Å resolution diffraction data. The particle has a mean diameter of 106 Å and a maximum thickness of 65 Å in the superhelical axis direction. The longest chord through the histone core measures 85 Å and is in a non-axial direction. The 1.87 turn superhelix consists of B-DNA with about 78 base pairs or 7.6 helical repeats per superhelical turn. The mean DNA helical repeat contains 10.2 ± 0.05 base pairs and spans 35 Å, slightly more than standard B-DNA. The superhelix varies several angstroms in radius and pitch, and has three distinct domains of curvature (with radii of curvature of 60, 45 and 51 Å). These regions are separated by localized sharper bends ±10 and ±40 base pairs from the center of the particle, resulting in an overall radius of curvature of about 43 Å. Compression of superhelical DNA grooves on the inner surface and expansion on the outer surface can be seen throughout the DNA electron density. This density has been fit with a double helical ribbon model providing groove width estimates of 12 ± 1 Å inside vs. 19 ± 1 Å outside for the major groove, and 8 ± 1 Å inside vs. 13 ± 1 Å outside for the minor groove. The histone core is primarily contained within the bounds defined by the superhelical DNA, contacting the DNA where the phosphate backbone faces in toward the core. Possible extensions of density between the gyres have been located, but these are below the significance level of the electron density map. In cross-section, a tripartite organization of the histone octamer is apparent, with the tetramer occupying the central region and the dimers at the extremes. Several extensions of histone density are present which form contacts between nucleosomes in the crystal, perhaps representing flexible or "tail" histone regions. The radius of gyration of the histone portion of the electron density is calculated to be 30.4 Å (in reasonable agreement with solution scattering values), and the histone core volume in the map is 93% of its theoretical volume.
Affiliation(s)
- E C Uberbacher
- University of Tennessee-Oak Ridge Graduate School of Biomedical Sciences, Biology Division 37831-8077
|
40
|
Abstract
Several investigators have recognized the importance of non-periodic DNA sequence information in determining the translational position of precisely positioned nucleosomes. The purpose of this study is to determine the extent of such information, in addition to the character of periodic information present. This is accomplished by examining the half-nucleosome DNA sequences of a considerable number of precisely positioned nucleosomes, and determining the probability of occurrence of each dinucleotide type as a function of position from the nucleosome center to the terminus (positions 0 to 72). By the nature of this procedure, no assumptions of periodicity are made. The results show the importance of several DNA sequence periodicities including 6-7, 10, and 21 base pairs, in addition to significant nonperiodic information. The results demonstrate that each dinucleotide type is unique in terms of its positional preference in precisely positioned nucleosomes (for example AA not equal to TT). The probabilities of occurrence for the dinucleotide types can be used to predict the translational positions of a number of observed nucleosomes.
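The position-by-position tabulation described above can be sketched as follows. This is a minimal sketch under stated assumptions: the input is a list of half-nucleosome sequences already aligned so that index 0 is the nucleosome center, the function name is hypothetical, and no smoothing or strand handling is applied. How the authors oriented or normalized the sequences is not stated in the abstract.

```python
from collections import Counter

def dinucleotide_position_probabilities(half_sequences):
    """Estimate P(dinucleotide | position) from aligned half-nucleosome sequences.

    Returns a list with one dict per position, mapping each observed
    dinucleotide to its relative frequency at that position.
    """
    if not half_sequences:
        raise ValueError("at least one sequence is required")
    length = min(len(s) for s in half_sequences)
    table = []
    for pos in range(length - 1):
        # Count the dinucleotide starting at `pos` in every sequence.
        counts = Counter(s[pos:pos + 2].upper() for s in half_sequences)
        total = sum(counts.values())
        table.append({dn: n / total for dn, n in counts.items()})
    return table
```

No periodicity is assumed anywhere in this tabulation, which mirrors the point made in the abstract: periodic and non-periodic signals both appear only if they are present in the counts.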
Affiliation(s)
- E C Uberbacher
- University of Tennessee-Oak Ridge Graduate School of Biomedical Sciences, Biology
|
41
|
Bhattacharyya D, Tano K, Bunick GJ, Uberbacher EC, Behnke WD, Mitra S. Rapid, large-scale purification and characterization of 'Ada protein' (O6 methylguanine-DNA methyltransferase) of E. coli. Nucleic Acids Res 1988; 16:6397-410. [PMID: 3041376 PMCID: PMC338304 DOI: 10.1093/nar/16.14.6397]
Abstract
The E. coli Ada protein (O6-methylguanine-DNA methyltransferase) has been purified using a high-level expression vector with a yield of about 3 mg per liter of E. coli culture. The 39-kDa protein has an extinction coefficient (E280 nm (1%)) of 5.3. Its isoelectric point of 7.1 is lower than that predicted from the amino acid content. The homogeneous Ada protein is fully active as a methyl acceptor from O6-methylguanine in DNA. Its reaction with O6-methylguanine in a synthetic DNA has a second-order rate constant of 1.1 × 10^9 M^-1 min^-1 at 0 °C. Both the native form and the protein methylated at Cys-69 are monomeric. The CD spectrum suggests a low alpha-helical content and the radius of gyration of 23 Å indicates a compact, globular shape. The middle region of the protein is sensitive to a variety of proteases, including an endogenous activity in E. coli, suggesting that the protein is composed of N-terminal and C-terminal domains connected by a hinge region. E. coli B has a higher level of this protease than does K12.
Affiliation(s)
- D Bhattacharyya
- University of Tennessee Graduate School of Biomedical Sciences, Oak Ridge 37831
|
42
|
Consler TG, Uberbacher EC, Bunick GJ, Liebman MN, Lee JC. Domain interaction in rabbit muscle pyruvate kinase. II. Small angle neutron scattering and computer simulation. J Biol Chem 1988; 263:2794-801. [PMID: 3343233]
Abstract
The effects of ligands on the structure of rabbit muscle pyruvate kinase were studied by small angle neutron scattering. The radius of gyration, RG, decreases by about 1 Å in the presence of the substrate phosphoenolpyruvate, but increases by about the same magnitude in the presence of the allosteric inhibitor phenylalanine. With increasing pH or in the absence of Mg2+ and K+, the RG of pyruvate kinase increases. Hence, there is a 2-Å difference in RG between two alternative conformations. Length distribution analysis indicates that, under all experimental conditions which increase the radius of gyration, there is a pronounced increase observed in the probability for interatomic distance between 80 and 110 Å. These small angle neutron scattering results indicate a "contraction" and "expansion" of the enzyme when it transforms between its active and inactive forms. Using the alpha-carbon coordinates of crystalline cat muscle pyruvate kinase, a length distribution profile was calculated, and it matches the scattering profile of the inactive form. These observations are expected since the crystals were grown in the absence of divalent cations (Stuart, D. I., Levine, M., Muirhead, H., and Stammers, D. K. (1979) J. Mol. Biol. 134, 109-142). Hence, results from neutron scattering, x-ray crystallographic, and sedimentation studies (Oberfelder, R. W., Lee, L. L.-Y., and Lee, J.C. (1984) Biochemistry 23, 3813-3821) are totally consistent with each other. With the aid of computer modeling, the crystal structure has been manipulated in order to effect changes that are consistent with the conformational change described by the solution scattering data. The structural manipulation involves the rotation of the B domain relative to the A domain, leading to the closure of the cleft between these domains. These manipulations resulted in the generation of new sets of atomic (C-alpha) coordinates, which were utilized in calculations, the result of which compared favorably with the solution data.
Affiliation(s)
- T G Consler
- E. A. Doisy Department of Biochemistry, St. Louis University School of Medicine, Missouri 63104
|
43
|
Consler TG, Uberbacher EC, Bunick GJ, Liebman MN, Lee JC. Domain interaction in rabbit muscle pyruvate kinase. II. Small angle neutron scattering and computer simulation. J Biol Chem 1988. [DOI: 10.1016/s0021-9258(18)69139-2]
|
44
|
Stoops JK, Wakil SJ, Uberbacher EC, Bunick GJ. Small-angle neutron-scattering and electron microscope studies of the chicken liver fatty acid synthase. J Biol Chem 1987; 262:10246-51. [PMID: 3611059]
Abstract
A structural model for the chicken liver fatty acid synthase is proposed based on electron microscope and small-angle neutron-scattering studies of the enzyme. The model has the overall appearance of two side-by-side cylinders with dimensions of 160 × 146 × 73 Å, with each subunit 160 Å in length and 73 Å in diameter. The model was constructed by dividing each cylinder into three domains having lengths of 32, 82, and 46 Å, with the domain structures in the two subunits being related to each other by a dyad axis. The model is consistent with chemical cross-linking studies which indicated that the subunits are arranged in a head-to-tail fashion. The cross-linking studies further showed that the beta-ketoacyl synthase active site contains a cysteine and a pantetheine residue from adjacent subunits. It is proposed that the domains which catalyze the addition of C2 units from malonate to the growing fatty acid chain lie in the crevice between the two subunits and that the two independent sets of fatty acid-synthesizing centers lie on the major axis of the model on opposite ends of the molecular dyad.
|
45
|
Abstract
The conformation of the histone octamer is shown to depend upon the specific salt used to solubilize it. In 2M sodium chloride the octamer is similar in size and shape to the histone component of crystallized core nucleosomes. In contrast, in 3.5M ammonium sulfate the octamer is elongated, resembling an ellipsoid with approximate dimensions of 114 by 62 by 62 angstroms. These results indicate that the elongated conformation seen in the 3.3 angstroms electron density map of the histone octamer crystallized in ammonium sulfate is due to the particular salt conditions used.
|
46
|
Goddette DW, Uberbacher EC, Bunick GJ, Frieden C. Formation of actin dimers as studied by small angle neutron scattering. J Biol Chem 1986; 261:2605-9. [PMID: 3949737]
Abstract
Small angle neutron scattering has been used to study the dimensions of G-actin and the formation of low molecular weight actin oligomers under conditions where rapid polymerization does not take place. In the presence of 200 µM Ca2+, actin in solution consists of a single component with a radius of gyration (Rg) of 19.9 ± 0.4 Å, consistent with the known molecular dimensions of the G-actin molecule. In the presence of 50 µM Mg2+, however, formation of an actin species with a larger Rg occurs over a 4-h period. Multicomponent fits were tried and the data were best fit assuming two components, the monomer and a species with an Rg of 29 ± 1 Å. This latter value is consistent with the dimensions expected for certain actin dimers. The apparent dissociation constant for dimer formation is approximately 150 µM with forward and reverse rate constants of 6.0 × 10^-7 µM^-1 s^-1 and 8.8 × 10^-5 s^-1, respectively. Kinetic fluorescence experiments show that the dimer formed in the presence of low levels of Mg2+ is a nonproductive complex which does not participate in the polymerization process. However, the addition of cytochalasin D to actin in the presence of 50 µM Mg2+ rapidly induces the formation of dimers, presumably related to cytochalasin's ability to nucleate actin polymerization.
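As a quick consistency check on the numbers quoted above, the dissociation constant of a simple reversible association equals the reverse rate constant divided by the forward rate constant. The sketch below assumes that simple two-state monomer-dimer scheme, which the abstract does not spell out; the function name is hypothetical.

```python
def dissociation_constant(k_reverse, k_forward):
    """Kd = k_reverse / k_forward for a simple reversible association.

    k_reverse is in s^-1 and k_forward in µM^-1 s^-1, so Kd is in µM.
    """
    if k_forward <= 0:
        raise ValueError("k_forward must be positive")
    return k_reverse / k_forward

# Rate constants as quoted in the abstract.
kd_micromolar = dissociation_constant(k_reverse=8.8e-5, k_forward=6.0e-7)
```

The ratio comes to roughly 147 µM, in line with the "approximately 150 µM" stated in the abstract.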
|
47
|
|
48
|
|
49
|
Abstract
Two monoclinic crystal forms (P2₁, C2) of chicken erythrocyte nucleosomes have been under study in this laboratory. The x-ray structure of the P2₁ crystal form has been solved to 15 Å resolution. The B-DNA superhelix has a relatively uniform curvature, with only several local distortions observed in the superhelix. The individual histone domains have been localized and specific contacts between each histone and the DNA can be observed. Histone contacts to the inner surface of the DNA superhelix occur predominantly at the minor groove sites. Most of the histone core is contained within the inner surface of the superhelical DNA, except for part of H2A which extends between the DNA gyres near the terminus of the DNA. No part of H2A blocks the DNA terminus or would prevent a smooth exit of the DNA into the linker region. A similar extension of a portion of histone H4 between the DNA gyres occurs close to the dyad axis. Both unique nucleosomes in the P2₁ asymmetric unit demonstrate good dyad symmetry and are similar to each other throughout the histone core and DNA regions.
Affiliation(s)
- E C Uberbacher
- University of Tennessee-Oak Ridge Graduate School of Biomedical Sciences and Biology 37831
|
50
|
Abstract
Ionic strength studies using homogeneous preparations of chicken erythrocyte nucleosomes containing either 146 or 175 base pairs of DNA show a single unfolding transition at about 1.5 mM ionic strength as determined by small-angle neutron scattering. The transition seen by some investigators at between 2.9 and 7.5 mM ionic strength is not observed by small-angle neutron scattering in either type of nucleosome particle. The two contrasts measured (H2O and D2O) indicate that only small conformational changes occur in the protein core, but the DNA is partially unfolded below the transition point. Patterson inversion of the data and analysis of models indicate that the DNA in both types of particle is unwinding from the ends, leaving about one turn of supercoiled DNA bound to the histone core in approximately its normal (compact) conformation. The mechanism of unfolding appears to be similar for both types of particles and in both cases occurs at the same ionic strength. The unfolding observed for nucleosomes in this study is in definite disagreement with extended superhelical models for the DNA and also disagrees with models incorporating an unfolded histone core.
|