1
Torres-Tiji Y, Sethuram H, Gupta A, McCauley J, Dutra-Molino JV, Pathania R, Saxton L, Kang K, Hillson NJ, Mayfield SP. Bioinformatic Prediction and High Throughput In Vivo Screening to Identify Cis-Regulatory Elements for the Development of Algal Synthetic Promoters. ACS Synth Biol 2024; 13:2150-2165. [PMID: 38986010 DOI: 10.1021/acssynbio.4c00199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/12/2024]
Abstract
Algae biotechnology holds immense promise for revolutionizing the bioeconomy through the sustainable and scalable production of various bioproducts. However, their development has been hindered by the lack of advanced genetic tools. This study introduces a synthetic biology approach to develop such tools, focusing on the construction and testing of synthetic promoters. By analyzing conserved DNA motifs within the promoter regions of highly expressed genes across six different algal species, we identified cis-regulatory elements (CREs) associated with high transcriptional activity. Combining the algorithms POWRS, STREME, and PhyloGibbs, we predicted 1511 CREs and inserted them into a minimal synthetic promoter sequence in 1, 2, or 3 copies, resulting in 4533 distinct synthetic promoters. These promoters were evaluated in vivo for their capacity to drive the expression of a transgene in a high-throughput manner through next-generation sequencing post antibiotic selection and fluorescence-activated cell sorting. To validate our approach, we sequenced hundreds of transgenic lines showing high levels of GFP expression. Further, we individually tested 14 identified promoters, revealing substantial increases in GFP expression, up to nine times higher than the baseline synthetic promoter, with five matching or even surpassing the performance of the native AR1 promoter. As a result of this study, we identified a catalog of CREs that can now be used to build superior synthetic algal promoters. More importantly, here we present a validated pipeline to generate building blocks for innovative synthetic genetic tools applicable to any algal species with a sequenced genome and transcriptome data set.
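The library size quoted in the abstract follows from simple enumeration: 1511 predicted CREs, each inserted in 1, 2, or 3 copies, give 1511 × 3 = 4533 promoters. The Python sketch below illustrates that enumeration only. The `build_library` helper, the dummy sequences, and the tandem placement upstream of the minimal promoter are all hypothetical assumptions, not the constructs or layout used in the study.

```python
from itertools import product


def build_library(cres, minimal_promoter, copy_numbers=(1, 2, 3)):
    """Return {(cre_id, copies): sequence} for every CRE/copy-number pair.

    Assumption: copies are placed in tandem directly upstream of the
    minimal promoter; the study's actual insertion site is not modeled.
    """
    library = {}
    for (cre_id, cre_seq), n in product(cres.items(), copy_numbers):
        library[(cre_id, n)] = cre_seq * n + minimal_promoter
    return library


def dummy_cre(i, length=8):
    """Encode i in base 4 so that every placeholder CRE is distinct."""
    return "".join("ACGT"[(i >> (2 * k)) & 3] for k in range(length))


# Hypothetical placeholder inputs: 1511 dummy CREs and a made-up core.
cres = {f"CRE{i:04d}": dummy_cre(i) for i in range(1511)}
minimal_promoter = "TATAAAAGGC"  # placeholder, not a real algal promoter

library = build_library(cres, minimal_promoter)
print(len(library))  # 1511 CREs x 3 copy numbers
```

Keying the dictionary by `(cre_id, copies)` keeps each variant traceable back to its CRE when sequencing reads are later counted per promoter.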
Affiliation(s)
- Y Torres-Tiji
  - Division of Biological Sciences, University of California San Diego, La Jolla, California 92093, United States
- H Sethuram
  - Division of Biological Sciences, University of California San Diego, La Jolla, California 92093, United States
- A Gupta
  - Division of Biological Sciences, University of California San Diego, La Jolla, California 92093, United States
- J McCauley
  - Biological Systems & Engineering Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
  - DOE Agile BioFoundry, Emeryville, California 94608, United States
- J-V Dutra-Molino
  - Division of Biological Sciences, University of California San Diego, La Jolla, California 92093, United States
- R Pathania
  - Division of Biological Sciences, University of California San Diego, La Jolla, California 92093, United States
- L Saxton
  - Division of Biological Sciences, University of California San Diego, La Jolla, California 92093, United States
- K Kang
  - Division of Biological Sciences, University of California San Diego, La Jolla, California 92093, United States
- N J Hillson
  - Biological Systems & Engineering Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
  - DOE Agile BioFoundry, Emeryville, California 94608, United States
- S P Mayfield
  - Division of Biological Sciences, University of California San Diego, La Jolla, California 92093, United States
2
Selvakumar P, Siddharthan R. Position-specific evolution in transcription factor binding sites, and a fast likelihood calculation for the F81 model. Royal Society Open Science 2024; 11:231088. [PMID: 38269075 PMCID: PMC10805598 DOI: 10.1098/rsos.231088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 12/20/2023] [Indexed: 01/26/2024]
Abstract
Transcription factor binding sites (TFBS), like other DNA sequence, evolve via mutation and selection relating to their function. Models of nucleotide evolution describe DNA evolution via single-nucleotide mutation. A stationary vector of such a model is the long-term distribution of nucleotides, unchanging under the model. Neutrally evolving sites may have uniform stationary vectors, but one expects that sites within a TFBS instead have stationary vectors reflective of the fitness of various nucleotides at those positions. We introduce 'position-specific stationary vectors' (PSSVs), the collection of stationary vectors at each site in a TFBS locus, analogous to the position weight matrix (PWM) commonly used to describe TFBS. We infer PSSVs for human TFs using two evolutionary models (Felsenstein 1981 and Hasegawa-Kishino-Yano 1985). We find that PSSVs reflect the nucleotide distribution from PWMs, but with reduced specificity. We infer ancestral nucleotide distributions at individual positions and calculate 'conditional PSSVs' conditioned on specific choices of majority ancestral nucleotide. We find that certain ancestral nucleotides exert a strong evolutionary pressure on neighbouring sequence while others have a negligible effect. Finally, we present a fast likelihood calculation for the F81 model on moderate-sized trees that makes this approach feasible for large-scale studies along these lines.
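For orientation, the F81 model has a closed-form transition probability, P_ab(t) = e^(-βt)·δ_ab + (1 − e^(-βt))·π_b, where π is the stationary vector. The Python sketch below evaluates the likelihood of one alignment column with the standard Felsenstein pruning recursion. It is not the fast calculation the paper presents, and the tree encoding, function names, and example values are illustrative assumptions.

```python
import math

NUCS = "ACGT"


def f81_transition(pi, t):
    """F81 transition matrix P[a][b] for branch length t.

    beta = 1 / (1 - sum(pi_i^2)) scales t to expected substitutions
    per site. Assumes pi is not concentrated on a single nucleotide.
    """
    beta = 1.0 / (1.0 - sum(p * p for p in pi))
    stay = math.exp(-beta * t)
    return [[stay * (a == b) + (1.0 - stay) * pi[b] for b in range(4)]
            for a in range(4)]


def conditional(node, pi):
    """Pruning step: likelihood of the leaves below node, per state."""
    if isinstance(node, str):  # leaf: an observed nucleotide
        return [1.0 if n == node else 0.0 for n in NUCS]
    result = [1.0] * 4
    for child, t in node:  # internal node: list of (child, branch length)
        P = f81_transition(pi, t)
        below = conditional(child, pi)
        for a in range(4):
            result[a] *= sum(P[a][b] * below[b] for b in range(4))
    return result


def site_likelihood(tree, pi):
    """Likelihood of one alignment column, root state drawn from pi."""
    cond = conditional(tree, pi)
    return sum(pi[a] * cond[a] for a in range(4))


# Hypothetical stationary vector for one binding-site position (favours A).
pi = [0.7, 0.1, 0.1, 0.1]
tree = [("A", 0.1), ([("A", 0.05), ("G", 0.05)], 0.1)]
print(site_likelihood(tree, pi))
```

Using a different `pi` for each position of a binding site is the idea behind a position-specific stationary vector; here a single position is shown.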
Affiliation(s)
- Pavitra Selvakumar
  - The Institute of Mathematical Sciences, Chennai, India
  - Homi Bhabha National Institute, Mumbai, India
- Rahul Siddharthan
  - The Institute of Mathematical Sciences, Chennai, India
  - Homi Bhabha National Institute, Mumbai, India
3
Katsantoni M, van Nimwegen E, Zavolan M. Improved analysis of (e)CLIP data with RCRUNCH yields a compendium of RNA-binding protein binding sites and motifs. Genome Biol 2023; 24:77. [PMID: 37069586 PMCID: PMC10108518 DOI: 10.1186/s13059-023-02913-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2022] [Accepted: 03/29/2023] [Indexed: 04/19/2023]
Abstract
We present RCRUNCH, an end-to-end solution to CLIP data analysis for identification of binding sites and sequence specificity of RNA-binding proteins. RCRUNCH can analyze not only reads that map uniquely to the genome but also those that map to multiple genome locations or across splice boundaries and can consider various types of background in the estimation of read enrichment. By applying RCRUNCH to the eCLIP data from the ENCODE project, we have constructed a comprehensive and homogeneous resource of in-vivo-bound RBP sequence motifs. RCRUNCH automates the reproducible analysis of CLIP data, enabling studies of post-transcriptional control of gene expression.
Affiliation(s)
- Maria Katsantoni
  - Biozentrum, University of Basel, 4056, Basel, Switzerland
  - Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Erik van Nimwegen
  - Biozentrum, University of Basel, 4056, Basel, Switzerland
  - Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Mihaela Zavolan
  - Biozentrum, University of Basel, 4056, Basel, Switzerland
  - Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
4
Danan C, Manickavel S, Hafner M. PAR-CLIP: A Method for Transcriptome-Wide Identification of RNA Binding Protein Interaction Sites. Methods in Molecular Biology (Clifton, N.J.) 2021; 2404:167-188. [PMID: 34694609 DOI: 10.1007/978-1-0716-1851-6_9] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/26/2022]
Abstract
During post-transcriptional gene regulation (PTGR), RNA binding proteins (RBPs) interact with all classes of RNA to control RNA maturation, stability, transport, and translation. Here, we describe Photoactivatable-Ribonucleoside-Enhanced Crosslinking and Immunoprecipitation (PAR-CLIP), a transcriptome-scale method for identifying RBP binding sites on target RNAs with nucleotide-level resolution. This method is readily applicable to any protein directly contacting RNA, including RBPs that are predicted to bind in a sequence- or structure-dependent manner at discrete RNA recognition elements (RREs), and those that are thought to bind transiently, such as RNA polymerases or helicases.
Affiliation(s)
- Charles Danan
  - RNA Molecular Biology Group, NIAMS, Bethesda, MD, USA
- Markus Hafner
  - RNA Molecular Biology Group, NIAMS, Bethesda, MD, USA
5
Hafner M, Katsantoni M, Köster T, Marks J, Mukherjee J, Staiger D, Ule J, Zavolan M. CLIP and complementary methods. Nat Rev Methods Primers 2021. [DOI: 10.1038/s43586-021-00018-1] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Indexed: 02/07/2023]
6
Carazo F, Romero JP, Rubio A. Upstream analysis of alternative splicing: a review of computational approaches to predict context-dependent splicing factors. Brief Bioinform 2020; 20:1358-1375. [PMID: 29390045 DOI: 10.1093/bib/bby005] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 10/24/2017] [Revised: 12/14/2017] [Indexed: 12/13/2022]
Abstract
Alternative splicing (AS) has been shown to play a pivotal role in the development of diseases, including cancer. Specifically, all the hallmarks of cancer (angiogenesis, cell immortality, avoiding immune system response, etc.) are found to have a counterpart in aberrant splicing of key genes. Identifying the context-specific regulators of splicing provides valuable information to find new biomarkers, as well as to define alternative therapeutic strategies. The computational models to identify these regulators are not trivial and require three conceptual steps: the detection of AS events, the identification of splicing factors that potentially regulate these events, and the contextualization of these pieces of information for a specific experiment. In this work, we review the different algorithmic methodologies developed for each of these tasks. The main weaknesses and strengths of the different steps of the pipeline are discussed. Finally, a case study is detailed to help the reader be aware of the potential and limitations of this computational approach.
7
Sankaranarayanan SR, Ianiri G, Coelho MA, Reza MH, Thimmappa BC, Ganguly P, Vadnala RN, Sun S, Siddharthan R, Tellgren-Roth C, Dawson TL, Heitman J, Sanyal K. Loss of centromere function drives karyotype evolution in closely related Malassezia species. eLife 2020; 9:e53944. [PMID: 31958060 PMCID: PMC7025860 DOI: 10.7554/elife.53944] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 11/26/2019] [Accepted: 01/20/2020] [Indexed: 12/14/2022]
Abstract
Genomic rearrangements associated with speciation often result in variation in chromosome number among closely related species. Malassezia species show variable karyotypes ranging between six and nine chromosomes. Here, we experimentally identified all eight centromeres in M. sympodialis as 3-5-kb long kinetochore-bound regions that span an AT-rich core and are depleted of the canonical histone H3. Centromeres of similar sequence features were identified as CENP-A-rich regions in Malassezia furfur, which has seven chromosomes, and histone H3 depleted regions in Malassezia slooffiae and Malassezia globosa with nine chromosomes each. Analysis of synteny conservation across centromeres with newly generated chromosome-level genome assemblies suggests two distinct mechanisms of chromosome number reduction from an inferred nine-chromosome ancestral state: (a) chromosome breakage followed by loss of centromere DNA and (b) centromere inactivation accompanied by changes in DNA sequence following chromosome-chromosome fusion. We propose that AT-rich centromeres drive karyotype diversity in the Malassezia species complex through breakage and inactivation.
Affiliation(s)
- Sundar Ram Sankaranarayanan
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bengaluru, India
- Giuseppe Ianiri
  - Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, United States
- Marco A Coelho
  - Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, United States
- Md Hashim Reza
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bengaluru, India
- Bhagya C Thimmappa
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bengaluru, India
- Promit Ganguly
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bengaluru, India
- Sheng Sun
  - Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, United States
- Christian Tellgren-Roth
  - National Genomics Infrastructure, Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
- Thomas L Dawson
  - Skin Research Institute Singapore, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
  - Department of Drug Discovery, Medical University of South Carolina, School of Pharmacy, Charleston, United States
- Joseph Heitman
  - Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, United States
- Kaustuv Sanyal
  - Molecular Mycology Laboratory, Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bengaluru, India
8
Agrawal A, Sambare SV, Narlikar L, Siddharthan R. THiCweed: fast, sensitive detection of sequence features by clustering big datasets. Nucleic Acids Res 2019; 46:e29. [PMID: 29267972 PMCID: PMC5861420 DOI: 10.1093/nar/gkx1251] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 06/09/2017] [Accepted: 12/01/2017] [Indexed: 11/19/2022]
Abstract
We present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. THiCweed clusters bound regions using a divisive hierarchical clustering approach based on sequence similarity within sliding windows, while exploring both strands. THiCweed is specially geared toward data containing mixtures of motifs, which present a challenge to traditional motif-finders. Our implementation is significantly faster than standard motif-finding programs, able to process 30 000 peaks in 1–2 h on a single CPU core of a desktop computer. On synthetic data containing mixtures of motifs it is as accurate as or more accurate than all other tested programs. THiCweed performs best with large ‘window’ sizes (≥50 bp), much longer than typical binding sites (7–15 bp). On real data it successfully recovers literature motifs, but also uncovers complex sequence characteristics in flanking DNA, variant motifs and secondary motifs even when they occur in <5% of the input, all of which appear biologically relevant. We also find recurring sequence patterns across diverse ChIP-seq datasets, possibly related to chromatin architecture and looping. THiCweed thus goes beyond traditional motif finding to give new insights into genomic transcription factor-binding complexity.
Affiliation(s)
- Ankit Agrawal
  - Computational Biology Group, The Institute of Mathematical Sciences (HBNI), Chennai 600113, Tamil Nadu, India
- Snehal V Sambare
  - Computational Biology Group, The Institute of Mathematical Sciences (HBNI), Chennai 600113, Tamil Nadu, India
- Leelavati Narlikar
  - Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune 411008, Maharashtra, India
- Rahul Siddharthan
  - Computational Biology Group, The Institute of Mathematical Sciences (HBNI), Chennai 600113, Tamil Nadu, India
9
Berger S, Pachkov M, Arnold P, Omidi S, Kelley N, Salatino S, van Nimwegen E. Crunch: integrated processing and modeling of ChIP-seq data in terms of regulatory motifs. Genome Res 2019; 29:1164-1177. [PMID: 31138617 PMCID: PMC6633267 DOI: 10.1101/gr.239319.118] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 05/07/2018] [Accepted: 05/14/2019] [Indexed: 01/10/2023]
Abstract
Although ChIP-seq has become a routine experimental approach for quantitatively characterizing the genome-wide binding of transcription factors (TFs), computational analysis procedures remain far from standardized, making it difficult to compare ChIP-seq results across experiments. In addition, although genome-wide binding patterns must ultimately be determined by local constellations of DNA-binding sites, current analysis is typically limited to identifying enriched motifs in ChIP-seq peaks. Here we present Crunch, a completely automated computational method that performs all ChIP-seq analysis from quality control through read mapping and peak detecting and that integrates comprehensive modeling of the ChIP signal in terms of known and novel binding motifs, quantifying the contribution of each motif and annotating which combinations of motifs explain each binding peak. By applying Crunch to 128 data sets from the ENCODE Project, we show that Crunch outperforms current peak finders and find that TFs naturally separate into "solitary TFs," for which a single motif explains the ChIP-peaks, and "cobinding TFs," for which multiple motifs co-occur within peaks. Moreover, for most data sets, the motifs that Crunch identified de novo outperform known motifs, and both the set of cobinding motifs and the top motif of solitary TFs are consistent across experiments and cell lines. Crunch is implemented as a web server, enabling standardized analysis of any collection of ChIP-seq data sets by simply uploading raw sequencing data. Results are provided both in a graphical web interface and as downloadable files.
Affiliation(s)
- Severin Berger
  - Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
- Mikhail Pachkov
  - Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
- Phil Arnold
  - Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
- Saeed Omidi
  - Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
- Nicholas Kelley
  - Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
- Silvia Salatino
  - Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
- Erik van Nimwegen
  - Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
10
Lu J, Cao X, Zhong S. A likelihood approach to testing hypotheses on the co-evolution of epigenome and genome. PLoS Comput Biol 2018; 14:e1006673. [PMID: 30586383 PMCID: PMC6324829 DOI: 10.1371/journal.pcbi.1006673] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 05/26/2018] [Revised: 01/08/2019] [Accepted: 11/26/2018] [Indexed: 01/03/2023]
Abstract
Central questions to epigenome evolution include whether interspecies changes of histone modifications are independent of evolutionary changes of DNA, and if there is dependence whether they depend on any specific types of DNA sequence changes. Here, we present a likelihood approach for testing hypotheses on the co-evolution of genome and histone modifications. The gist of this approach is to convert evolutionary biology hypotheses into probabilistic forms, by explicitly expressing the joint probability of multispecies DNA sequences and histone modifications, which we refer to as a class of Joint Evolutionary Model for the Genome and the Epigenome (JEMGE). JEMGE can be summarized as a mixture model of four components representing four evolutionary hypotheses, namely dependence and independence of interspecies epigenomic variations to underlying sequence substitutions and to underlying sequence insertions and deletions (indels). We implemented a maximum likelihood method to fit the models to the data. Based on comparison of likelihoods, we inferred whether interspecies epigenomic variations depended on substitution or indels in local genomic sequences based on DNase hypersensitivity and spermatid H3K4me3 ChIP-seq data from human and rhesus macaque. Approximately 5.5% of homologous regions in the genomes exhibited H3K4me3 modification in either species, among which approximately 67% homologous regions exhibited local-sequence-dependent interspecies H3K4me3 variations. Substitutions accounted for fewer local-sequence-dependent H3K4me3 variations than indels. Among transposon-mediated indels, ERV1 insertions and L1 insertions were most strongly associated with H3K4me3 gains and losses, respectively. By initiating probabilistic formulation on the co-evolution of genomes and epigenomes, JEMGE helps to bring evolutionary biology principles to comparative epigenomic studies.
Epigenetic modifications play a significant role in gene regulation and thus heavily influence phenotypic outcomes. Whereas cross-species epigenomic comparisons have been fruitful in revealing the function of epigenetic modifications, it still remains unclear how the epigenome changes across species. A central question in epigenome evolution studies is whether interspecies epigenomic variations rely on genomic changes in cis and, if partially yes, whether different genomic changes have distinct impacts. To tackle this question, we initiated a likelihood-based approach, in which different hypotheses related to the co-evolution of the genome and the epigenome could be converted into probabilistic models. By fitting the models to actual data, each model yielded a likelihood, and the hypothesis corresponding to the largest likelihood was selected as most supported by observed data. In this work, we focused on the influence of two types of underlying sequence changes: substitutions, and insertions and deletions (indels). We quantitatively assessed the dependence of H3K4me3 variations on substitutions and indels between human and rhesus, and separated their relative impacts within each genomic region with H3K4me3. The methodology presented here provides a framework for modeling the epigenome together with the genome and a quantitative approach to test different evolutionary hypotheses.
Affiliation(s)
- Jia Lu
  - Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Xiaoyi Cao
  - Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
- Sheng Zhong
  - Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
11
Dempster-Shafer Theory for the Prediction of Auxin-Response Elements (AuxREs) in Plant Genomes. BioMed Research International 2018; 2018:3837060. [PMID: 30515394 PMCID: PMC6236769 DOI: 10.1155/2018/3837060] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 05/17/2018] [Accepted: 10/15/2018] [Indexed: 11/17/2022]
Abstract
Auxin is a major regulator of plant growth and development; its action involves transcriptional activation. Identifying auxin-response elements (AuxREs) is one of the most important steps toward understanding auxin regulation of gene expression. Over the past few years, a large number of motif identification tools have been developed. Despite these considerable efforts by computational biologists, building reliable models to predict regulatory elements remains a difficult challenge. In this context, we propose a data fusion approach for the prediction of AuxREs. Our method is based on the combined use of Dempster-Shafer evidence theory and fuzzy theory. To evaluate our model, we scanned the DORNRÖSCHEN promoter with it. All proven AuxREs present in the promoter were detected, and at the 0.9 threshold there were no false positives. Comparison with previous motif-finding tools shows that our model predicts AuxREs more successfully than the other tools and produces fewer false positives. Comparing the results before and after combination shows the importance of the Dempster-Shafer combination in reducing false positives and improving the reliability of prediction. In an overall evaluation against other methods, the data fusion method had the highest sensitivity (Sn) and positive predictive value (PPV).
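Dempster's rule of combination is the step that fuses evidence from several sources in Dempster-Shafer theory. The Python sketch below shows the textbook rule on a two-hypothesis frame. It does not reproduce the paper's model, which also uses fuzzy theory, and the scanner names and mass values are hypothetical.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c  # product mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise by the non-conflicting mass.
    return {a: v / (1.0 - conflict) for a, v in combined.items()}


SITE = frozenset({"AuxRE"})
BG = frozenset({"background"})
THETA = SITE | BG  # the whole frame, i.e. "undecided"

# Hypothetical evidence from two motif scanners for one candidate site.
scanner_1 = {SITE: 0.6, THETA: 0.4}
scanner_2 = {SITE: 0.7, BG: 0.1, THETA: 0.2}

fused = dempster_combine(scanner_1, scanner_2)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 4))
```

When both sources lean toward `SITE`, the fused mass on `SITE` exceeds either input, which is how combination can raise confidence while a decision threshold such as 0.9 filters weakly supported calls.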
12
Dotu I, Adamson SI, Coleman B, Fournier C, Ricart-Altimiras E, Eyras E, Chuang JH. SARNAclust: Semi-automatic detection of RNA protein binding motifs from immunoprecipitation data. PLoS Comput Biol 2018; 14:e1006078. [PMID: 29596423 PMCID: PMC5892938 DOI: 10.1371/journal.pcbi.1006078] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/28/2017] [Revised: 04/10/2018] [Accepted: 03/05/2018] [Indexed: 12/02/2022]
Abstract
RNA-protein binding is critical to gene regulation, controlling fundamental processes including splicing, translation, localization and stability, and aberrant RNA-protein interactions are known to play a role in a wide variety of diseases. However, molecular understanding of RNA-protein interactions remains limited; in particular, identification of RNA motifs that bind proteins has long been challenging, especially when such motifs depend on both sequence and structure. Moreover, although RNA binding proteins (RBPs) often contain more than one binding domain, algorithms capable of identifying more than one binding motif simultaneously have not been developed. In this paper we present a novel pipeline to determine binding peaks in crosslinking immunoprecipitation (CLIP) data, to discover multiple possible RNA sequence/structure motifs among them, and to experimentally validate such motifs. At the core is a new semi-automatic algorithm SARNAclust, the first unsupervised method to identify and deconvolve multiple sequence/structure motifs simultaneously. SARNAclust computes similarity between sequence/structure objects using a graph kernel, providing the ability to isolate the impact of specific features through the bulge graph formalism. Application of SARNAclust to synthetic data shows its capability of clustering 5 motifs at once with a V-measure value of over 0.95, while GraphClust achieves only a V-measure of 0.083 and RNAcontext cannot detect any of the motifs. When applied to existing eCLIP sets, SARNAclust finds known motifs for SLBP and HNRNPC and novel motifs for several other RBPs such as AGGF1, AKAP8L and ILF3. We demonstrate an experimental validation protocol, a targeted Bind-n-Seq-like high-throughput sequencing approach that relies on RNA inverse folding for oligo pool design, that can validate the components within the SLBP motif. Finally, we use this protocol to experimentally interrogate the SARNAclust motif predictions for protein ILF3. 
Our results support a newly identified partially double-stranded UUUUUGAGA motif similar to that known for the splicing factor HNRNPC. RNA-protein binding is critical to gene regulation, and aberrant RNA-protein interactions play a role in a wide variety of diseases. However, molecular understanding of these interactions remains limited because of the difficulty of ascertaining the motifs that bind each protein. To address this challenge, we have developed a novel algorithm, SARNAclust, to computationally identify combined structure/sequence motifs from immunoprecipitation data. SARNAclust can deconvolve multiple motifs simultaneously and determine the importance of specific features through a graph kernel and bulge graph formalism. We have verified SARNAclust to be effective on synthetic motif data and also tested it on ENCODE eCLIP datasets, identifying known motifs and novel predictions. We have experimentally validated SARNAclust for two proteins, SLBP and ILF3, using RNA Bind-n-Seq measurements. Applying SARNAclust to ENCODE data provides new evidence for previously unknown regulatory interactions, notably splicing co-regulation by ILF3 and the splicing factor hnRNPC.
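The V-measure values quoted above (over 0.95 for SARNAclust versus 0.083 for GraphClust) score a clustering against known labels as the harmonic mean of homogeneity and completeness. The Python sketch below implements that standard definition from scratch for illustration. It is not the authors' evaluation code, and the toy labels are made up.

```python
import math
from collections import Counter


def _entropy(counts, n):
    """Shannon entropy (natural log) of a partition given its group sizes."""
    return -sum(c / n * math.log(c / n) for c in counts if c)


def v_measure(truth, predicted):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    n = len(truth)
    joint = Counter(zip(truth, predicted))
    classes = Counter(truth)
    clusters = Counter(predicted)
    h_c = _entropy(classes.values(), n)
    h_k = _entropy(clusters.values(), n)
    # Conditional entropies H(C|K) and H(K|C) from the contingency counts.
    h_c_given_k = -sum(v / n * math.log(v / clusters[k])
                       for (c, k), v in joint.items())
    h_k_given_c = -sum(v / n * math.log(v / classes[c])
                       for (c, k), v in joint.items())
    homogeneity = 1.0 if h_c == 0 else 1.0 - h_c_given_k / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - h_k_given_c / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Toy example: four sequences from two true motifs, split into three clusters.
true_motif = [0, 0, 1, 1]
cluster_id = [0, 0, 1, 2]
print(v_measure(true_motif, cluster_id))
```

A score near 1 means clusters are both pure (homogeneity) and not fragmented (completeness); splitting one motif across two clusters, as in the toy example, lowers completeness only.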
Affiliation(s)
- Ivan Dotu
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
  - Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM)–Pompeu Fabra University (UPF), Barcelona, Spain
- Scott I. Adamson
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
  - UCONN Health, Department of Genetics and Genome Sciences, Farmington, CT, United States of America
- Benjamin Coleman
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
- Cyril Fournier
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
- Emma Ricart-Altimiras
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
  - Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM)–Pompeu Fabra University (UPF), Barcelona, Spain
- Eduardo Eyras
  - Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM)–Pompeu Fabra University (UPF), Barcelona, Spain
  - Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, Spain
- Jeffrey H. Chuang
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
  - UCONN Health, Department of Genetics and Genome Sciences, Farmington, CT, United States of America
13
Nettling M, Treutler H, Cerquides J, Grosse I. Unrealistic phylogenetic trees may improve phylogenetic footprinting. Bioinformatics 2018; 33:1639-1646. [PMID: 28130227 PMCID: PMC5447242 DOI: 10.1093/bioinformatics/btx033] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 03/02/2016] [Accepted: 01/19/2017] [Indexed: 01/10/2023]
Abstract
Motivation: The computational investigation of DNA binding motifs from binding sites is one of the classic tasks in bioinformatics and a prerequisite for understanding gene regulation as a whole. Due to the development of sequencing technologies and the increasing number of available genomes, approaches based on phylogenetic footprinting become increasingly attractive. Phylogenetic footprinting requires phylogenetic trees with attached substitution probabilities for quantifying the evolution of binding sites, but these trees and substitution probabilities are typically not known and cannot be estimated easily. Results: Here, we investigate the influence of phylogenetic trees with different substitution probabilities on the classification performance of phylogenetic footprinting using synthetic and real data. For synthetic data we find that the classification performance is highest when the substitution probability used for phylogenetic footprinting is similar to that used for data generation. For real data, however, we typically find that the classification performance of phylogenetic footprinting surprisingly increases with increasing substitution probabilities and is often highest for unrealistically high substitution probabilities close to one. This finding suggests that choosing realistic model assumptions might not always yield optimal predictions in general and that choosing unrealistically high substitution probabilities close to one might actually improve the classification performance of phylogenetic footprinting. Availability and Implementation: The proposed PF is implemented in JAVA and can be downloaded from https://github.com/mgledi/PhyFoo. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Martin Nettling
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany
- Hendrik Treutler
  - Department of Stress and Developmental Biology, Leibniz Institute of Plant Biochemistry, Halle, Germany
- Jesus Cerquides
  - Institut d'Investigació en Intel·ligència Artificial, IIIA-CSIC, Campus UAB, Cerdanyola, Spain
- Ivo Grosse
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
14
Caldonazzo Garbelini JM, Kashiwabara AY, Sanches DS. Sequence motif finder using memetic algorithm. BMC Bioinformatics 2018; 19:4. [PMID: 29298679 PMCID: PMC5751424 DOI: 10.1186/s12859-017-2005-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 08/08/2017] [Accepted: 12/18/2017] [Indexed: 11/10/2022]
Abstract
Background: De novo prediction of Transcription Factor Binding Sites (TFBS) using computational methods is a difficult task and an important problem in bioinformatics. The correct recognition of TFBS plays an important role in understanding the mechanisms of gene regulation and helps to develop new drugs. Results: We here present Memetic Framework for Motif Discovery (MFMD), an algorithm that uses semi-greedy constructive heuristics as a local optimizer. In addition, we used a hybridization of the classic genetic algorithm as a global optimizer to refine the solutions initially found. MFMD can find and classify overrepresented patterns in DNA sequences and predict their respective initial positions. MFMD performance was assessed using ChIP-seq data retrieved from the JASPAR site, promoter sequences extracted from the ABS site, and artificially generated synthetic data. MFMD was evaluated and compared with two well-known approaches from the literature, MEME and Gibbs Motif Sampler, achieving a higher F-score on most of the datasets used in this work. Conclusions: We have developed an approach for detecting motifs in biopolymer sequences. MFMD is freely available software that can be a promising alternative for the development of new tools for de novo motif discovery. Its open-source software can be downloaded at https://github.com/jadermcg/mfmd. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-2005-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jader M Caldonazzo Garbelini
  - Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil
- André Y Kashiwabara
  - Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil
- Danilo S Sanches
  - Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil
15
Fu H, Zhang X. Noncoding Variants Functional Prioritization Methods Based on Predicted Regulatory Factor Binding Sites. Curr Genomics 2017; 18:322-331. [PMID: 29081688 PMCID: PMC5635616 DOI: 10.2174/1389202918666170228143619] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/01/2016] [Revised: 10/16/2016] [Accepted: 11/02/2016] [Indexed: 12/31/2022]
Abstract
Background: With the advent of the post-genomic era, research into the genetic mechanisms of disease has come to depend increasingly on studies of genes, gene networks and gene-protein interaction networks. To explore gene expression and regulation, researchers have carried out many studies on transcription factors and their binding sites (TFBSs). Based on the large number of TFBS values predicted by deep learning models, further computation and analysis have been done to reveal the relationship between gene mutations and the occurrence of disease. It has been demonstrated that deep learning methods outperform conventional methods in predicting the functions of noncoding variants. Research on predicting the functions of Single Nucleotide Polymorphisms (SNPs) is expected to uncover the mechanisms by which gene mutations affect human traits and diseases. Results: We reviewed the conventional TFBS identification methods from different perspectives. For the deep learning methods that predict TFBSs, we discussed related problems such as raw data preprocessing, the structural design of the deep convolutional neural network (CNN) and the measurement of model performance. We then summarized the techniques usually used to find functional noncoding variants from de novo sequence. Conclusion: With the rapid development of high-throughput assays, more sample data and chromatin features will help to improve the prediction accuracy of deep convolutional neural networks for TFBS identification. Meanwhile, gaining more insight into the deep CNN framework itself has proved useful both for improving model performance and for developing designs better suited to the sample data. Based on the feature values predicted by the deep CNN model, a prioritization model for functional noncoding variants would help to reveal the effects of gene mutations on disease.
Affiliation(s)
- Haoyue Fu
  - College of Sciences, Northeastern University, Shenyang, China
- Lianping Yang
  - College of Sciences, Northeastern University, Shenyang, China
  - University of Southern California, Dept. Biol. Sci., Program Mol & Computat Biol, USA
- Xiangde Zhang
  - College of Sciences, Northeastern University, Shenyang, China
16
Omidi S, Zavolan M, Pachkov M, Breda J, Berger S, van Nimwegen E. Automated incorporation of pairwise dependency in transcription factor binding site prediction using dinucleotide weight tensors. PLoS Comput Biol 2017; 13:e1005176. [PMID: 28753602 PMCID: PMC5550003 DOI: 10.1371/journal.pcbi.1005176] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 10/04/2016] [Revised: 08/09/2017] [Accepted: 06/02/2017] [Indexed: 11/17/2022]
Abstract
Gene regulatory networks are ultimately encoded by the sequence-specific binding of transcription factors (TFs) to short DNA segments. Although it is customary to represent the binding specificity of a TF by a position-specific weight matrix (PSWM), which assumes each position within a site contributes independently to the overall binding affinity, evidence has been accumulating that there can be significant dependencies between positions. Unfortunately, methodological challenges have so far hindered the development of a practical and generally-accepted extension of the PSWM model. On the one hand, simple models that only consider dependencies between nearest-neighbor positions are easy to use in practice, but fail to account for the distal dependencies that are observed in the data. On the other hand, models that allow for arbitrary dependencies are prone to overfitting, requiring regularization schemes that are difficult to use in practice for non-experts. Here we present a new regulatory motif model, called dinucleotide weight tensor (DWT), that incorporates arbitrary pairwise dependencies between positions in binding sites, rigorously from first principles, and free from tunable parameters. We demonstrate the power of the method on a large set of ChIP-seq data-sets, showing that DWTs outperform both PSWMs and motif models that only incorporate nearest-neighbor dependencies. We also demonstrate that DWTs outperform two previously proposed methods. Finally, we show that DWTs inferred from ChIP-seq data also outperform PSWMs on HT-SELEX data for the same TF, suggesting that DWTs capture inherent biophysical properties of the interactions between the DNA binding domains of TFs and their binding sites. We make a suite of DWT tools available at dwt.unibas.ch, that allow users to automatically perform ‘motif finding’, i.e. the inference of DWT motifs from a set of sequences, binding site prediction with DWTs, and visualization of DWT ‘dilogo’ motifs.
Gene regulatory networks are ultimately encoded in constellations of short binding sites in the DNA and RNA that are recognized by regulatory factors such as transcription factors (TFs). For several decades, computational analysis of regulatory networks has relied on a model of TF sequence-specificity, the position-specific weight-matrix (PSWM), that assumes different positions in a binding site contribute independently to the total binding energy of the TF. However, in recent years evidence has been accumulating that, at least for some TFs, this assumption does not hold. Here we present a new model for the sequence-specificity of TFs, the dinucleotide weight tensor (DWT), that takes arbitrary dependencies between positions in binding sites into account and show that it consistently outperforms PSWMs on high-throughput datasets on TF binding. Moreover, in contrast to previous approaches, DWTs are directly derived from first principles within a Bayesian framework, and contain no tunable parameters. This allows them to be easily applied in practice and we make a suite of tools available for computational analysis with DWTs.
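For orientation, the independence assumption of the PSWM described above can be sketched in a few lines of Python. This is only an illustration of the baseline PSWM model, not of the DWT or of the authors' tools: the matrix values, the uniform background, and the function names `pswm_log_odds` and `best_site` are all hypothetical.

```python
import math

# Hypothetical 4-position motif; each column holds illustrative base probabilities.
PSWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Assumed uniform background model.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def pswm_log_odds(site, pswm=PSWM, background=BACKGROUND):
    """Score a site as a sum of per-position log-odds terms.

    The plain sum is the independence assumption: no term depends on
    the base found at any other position.
    """
    if len(site) != len(pswm):
        raise ValueError("site length must equal motif width")
    return sum(
        math.log2(column[base] / background[base])
        for column, base in zip(pswm, site)
    )


def best_site(sequence, pswm=PSWM):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pswm)
    windows = [
        (pswm_log_odds(sequence[i:i + width], pswm), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(windows)


score, start = best_site("TTAGCTTT")
print(f"best window starts at {start} with log-odds {score:.2f}")
```

A model with pairwise dependencies would add terms that depend on pairs of positions jointly, which this sketch deliberately omits.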
Affiliation(s)
- Saeed Omidi
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Basel, Switzerland
- Mihaela Zavolan
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Basel, Switzerland
- Mikhail Pachkov
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Basel, Switzerland
- Jeremie Breda
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Basel, Switzerland
- Severin Berger
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Basel, Switzerland
- Erik van Nimwegen
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Basel, Switzerland
17
Abstract
Alterations in regulatory networks contribute to evolutionary change. Transcriptional networks are reconfigured by changes in the binding specificity of transcription factors and their cognate sites. The evolution of RNA-protein regulatory networks is far less understood. The PUF (Pumilio and FBF) family of RNA regulatory proteins controls the translation, stability, and movements of hundreds of mRNAs in a single species. We probe the evolution of PUF-RNA networks by direct identification of the mRNAs bound to PUF proteins in budding and filamentous fungi and by computational analyses of orthologous RNAs from 62 fungal species. Our findings reveal that PUF proteins gain and lose mRNAs with related and emergent biological functions during evolution. We demonstrate at least two independent rewiring events for PUF3 orthologs, independent but convergent evolution of PUF4/5 binding specificity and the rewiring of the PUF4/5 regulons in different fungal lineages. These findings demonstrate plasticity in RNA regulatory networks and suggest ways in which their rewiring occurs.
18
Liu B, Yang J, Li Y, McDermaid A, Ma Q. An algorithmic perspective of de novo cis-regulatory motif finding based on ChIP-seq data. Brief Bioinform 2017; 19:1069-1081. [DOI: 10.1093/bib/bbx026] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 12/20/2016] [Indexed: 01/06/2023]
Affiliation(s)
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan Shandong, P. R. China
- Jinyu Yang
  - Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
- Yang Li
  - School of Mathematics, Shandong University, Jinan Shandong, P. R. China
- Adam McDermaid
  - Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
- Qin Ma
  - Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA
19
Nettling M, Treutler H, Cerquides J, Grosse I. Combining phylogenetic footprinting with motif models incorporating intra-motif dependencies. BMC Bioinformatics 2017; 18:141. [PMID: 28249564 PMCID: PMC5333389 DOI: 10.1186/s12859-017-1495-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 06/29/2016] [Accepted: 01/24/2017] [Indexed: 11/23/2022]
Abstract
Background: Transcriptional gene regulation is a fundamental process in nature, and the experimental and computational investigation of DNA binding motifs and their binding sites is a prerequisite for elucidating this process. Approaches for de-novo motif discovery can be subdivided into phylogenetic footprinting, which takes into account phylogenetic dependencies in aligned sequences of more than one species, and non-phylogenetic approaches based on sequences from only one species, which typically take into account intra-motif dependencies. It has been shown that modeling (i) phylogenetic dependencies as well as (ii) intra-motif dependencies separately improves de-novo motif discovery, but there is no approach capable of modeling both (i) and (ii) simultaneously. Results: Here, we present an approach for de-novo motif discovery that combines phylogenetic footprinting with motif models capable of taking into account intra-motif dependencies. We study the degree of intra-motif dependencies inferred by this approach from ChIP-seq data of 35 transcription factors. We find that significant intra-motif dependencies of orders 1 and 2 are present in all 35 datasets and that intra-motif dependencies of order 2 are typically stronger than those of order 1. We also find that the presented approach improves the classification performance of phylogenetic footprinting in all 35 datasets and that incorporating intra-motif dependencies of order 2 yields a higher classification performance than incorporating such dependencies of only order 1. Conclusion: Combining phylogenetic footprinting with motif models incorporating intra-motif dependencies leads to an improved performance in the classification of transcription factor binding sites. This may advance our understanding of transcriptional gene regulation and its evolution. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1495-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Martin Nettling
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany
- Jesus Cerquides
  - Institut d'Investigació en Intel·ligència Artificial, IIIA-CSIC, Campus UAB, Cerdanyola, Spain
- Ivo Grosse
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
20
Furnholm T, Rehan M, Wishart J, Tisa LS. Pb2+ tolerance by Frankia sp. strain EAN1pec involves surface-binding. MICROBIOLOGY-SGM 2017; 163:472-487. [PMID: 28141503 DOI: 10.1099/mic.0.000439] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 11/18/2022]
Abstract
Several Frankia strains have been shown to be lead-resistant. The mechanism of lead resistance was investigated for Frankia sp. strain EAN1pec. Analysis of the cultures by scanning electron microscopy (SEM), energy dispersive X-ray spectroscopy (EDAX) and Fourier transform infrared spectroscopy (FTIR) demonstrated that Frankia sp. strain EAN1pec undergoes surface modifications and binds high quantities of Pb2+. Both labelled and unlabelled shotgun proteomics approaches were used to determine changes in Frankia sp. strain EAN1pec protein expression in response to lead and zinc. Pb2+ specifically induced changes in exopolysaccharides, the stringent response, and the phosphate (pho) regulon. Two metal transporters (a Cu2+-ATPase and cation diffusion facilitator), as well as several hypothetical transporters, were also upregulated and may be involved in metal export. The exported Pb2+ may be precipitated at the cell surface by an upregulated polyphosphate kinase, undecaprenyl diphosphate synthase and inorganic diphosphatase. A variety of metal chaperones for ensuring correct cofactor placement were also upregulated with both Pb2+ and Zn2+ stress. Thus, this Pb2+ resistance mechanism is similar to other characterized systems. The cumulative interplay of these many mechanisms may explain the extraordinary resilience of Frankia sp. strain EAN1pec to Pb2+. A potential transcription factor (DUF156) binding site was identified in association with several proteins identified as upregulated with heavy metals. This site was also discovered, for the first time, in thousands of other organisms across two kingdoms.
Affiliation(s)
- Teal Furnholm
  - Department of Cellular, Molecular, and Biomedical Sciences, University of New Hampshire, Durham, NH, USA
- Medhat Rehan
  - Department of Cellular, Molecular, and Biomedical Sciences, University of New Hampshire, Durham, NH, USA
  - Department of Genetics, College of Agriculture, Kafrelsheikh University, Egypt
  - Department of Plant Production and Protection, College of Agriculture and Veterinary Medicine, Qassim University, Saudi Arabia
- Jessica Wishart
  - Department of Cellular, Molecular, and Biomedical Sciences, University of New Hampshire, Durham, NH, USA
  - Department of Microbiology, Oregon State University, Corvallis, OR, USA
- Louis S Tisa
  - Department of Cellular, Molecular, and Biomedical Sciences, University of New Hampshire, Durham, NH, USA
21
Liu B, Zhang H, Zhou C, Li G, Fennell A, Wang G, Kang Y, Liu Q, Ma Q. An integrative and applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomes. BMC Genomics 2016; 17:578. [PMID: 27507169 PMCID: PMC4977642 DOI: 10.1186/s12864-016-2982-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 04/05/2016] [Accepted: 07/29/2016] [Indexed: 11/10/2022]
Abstract
Background: Phylogenetic footprinting is an important computational technique for identifying cis-regulatory motifs in orthologous regulatory regions from multiple genomes, as motifs tend to evolve slower than their surrounding non-functional sequences. Its application, however, faces several difficulties in optimizing the selection of orthologous data and reducing the false positives in motif prediction. Results: Here we present an integrative phylogenetic footprinting framework for accurate motif predictions in prokaryotic genomes (MP3). The framework includes a new orthologous data preparation procedure, an additional promoter scoring and pruning method and an integration of six existing motif finding algorithms as basic motif search engines. Specifically, we collected orthologous genes from available prokaryotic genomes and built the orthologous regulatory regions based on sequence similarity of promoter regions. This procedure made full use of the large-scale genomic data and taxonomy information and filtered out the promoters with limited contribution to produce a high quality orthologous promoter set. The promoter scoring and pruning is implemented through motif voting by a set of complementary predicting tools that mine as many motif candidates as possible and simultaneously eliminate the effect of random noise. We have applied the framework to the Escherichia coli K12 genome and evaluated the prediction performance through comparison with seven existing programs. This evaluation was systematically carried out at the nucleotide and binding site level, and the results showed that MP3 consistently outperformed other popular motif finding tools. We have integrated MP3 into our motif identification and analysis server DMINDA, allowing users to efficiently identify and analyze motifs in 2,072 completely sequenced prokaryotic genomes. Conclusion: The performance evaluation indicated that MP3 is effective for predicting regulatory motifs in prokaryotic genomes. Its application may enhance progress in elucidating transcription regulation mechanisms, and thus benefit the genomic research community and prokaryotic genome researchers in particular. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2982-x) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan, 250100, China
- Hanyuan Zhang
  - Systems Biology and Biomedical Informatics (SBBI) Laboratory University of Nebraska-Lincoln, Lincoln, NE, 68588-0115, USA
- Chuan Zhou
  - School of Mathematics, Shandong University, Jinan, 250100, China
- Guojun Li
  - School of Mathematics, Shandong University, Jinan, 250100, China
- Anne Fennell
  - Department of Agronomy, Horticulture, and Plant Science, South Dakota State University, Brookings, SD, 57007, USA
  - BioSNTR, Brookings, SD, USA
- Guanghui Wang
  - School of Mathematics, Shandong University, Jinan, 250100, China
- Yu Kang
  - CAS Key Laboratory of Genome Sciences and information, Beijing Institute of Genomics of CAS, Beijing, 100101, People's Republic of China
- Qi Liu
  - Department of Bioinformatics, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Qin Ma
  - Department of Agronomy, Horticulture, and Plant Science, South Dakota State University, Brookings, SD, 57007, USA
  - BioSNTR, Brookings, SD, USA
22
Danan C, Manickavel S, Hafner M. PAR-CLIP: A Method for Transcriptome-Wide Identification of RNA Binding Protein Interaction Sites. Methods Mol Biol 2016; 1358:153-73. [PMID: 26463383 PMCID: PMC5142217 DOI: 10.1007/978-1-4939-3067-8_10] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Indexed: 05/17/2023]
Abstract
During post-transcriptional gene regulation (PTGR), RNA binding proteins (RBPs) interact with all classes of RNA to control RNA maturation, stability, transport, and translation. Here, we describe Photoactivatable-Ribonucleoside-Enhanced Crosslinking and Immunoprecipitation (PAR-CLIP), a transcriptome-scale method for identifying RBP binding sites on target RNAs with nucleotide-level resolution. This method is readily applicable to any protein directly contacting RNA, including RBPs that are predicted to bind in a sequence- or structure-dependent manner at discrete RNA recognition elements (RREs), and those that are thought to bind transiently, such as RNA polymerases or helicases.
Affiliation(s)
- Charles Danan
  - Laboratory of Muscle Stem Cells and Gene Regulation, NIAMS / NIH, 50 South Drive, 20892, Bethesda, MD, USA
- Sudhir Manickavel
  - Laboratory of Muscle Stem Cells and Gene Regulation, NIAMS / NIH, 50 South Drive, 20892, Bethesda, MD, USA
- Markus Hafner
  - Laboratory of Muscle Stem Cells and Gene Regulation, NIAMS / NIH, 50 South Drive, 20892, Bethesda, MD, USA
23
Yen IEH, Lin X, Zhang J, Ravikumar P, Dhillon IS. A Convex Atomic-Norm Approach to Multiple Sequence Alignment and Motif Discovery. JMLR WORKSHOP AND CONFERENCE PROCEEDINGS 2016; 48:2272-2280. [PMID: 27559428 PMCID: PMC4993214] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
Multiple Sequence Alignment and Motif Discovery, both known to be NP-hard problems, are two fundamental tasks in bioinformatics. Existing approaches to these two problems are based either on local search methods, such as Expectation Maximization (EM) and Gibbs Sampling, or on greedy heuristic methods. In this work, we develop a convex relaxation approach to both problems based on the recent concept of atomic norm and develop a new algorithm, termed Greedy Direction Method of Multiplier, for solving the convex relaxation with two convex atomic constraints. Experiments show that our convex relaxation approach produces solutions of higher quality than the standard tools widely used in the bioinformatics community on the Multiple Sequence Alignment and Motif Discovery problems.
Affiliation(s)
- Ian E. H. Yen
  - Department of Computer Science, University of Texas at Austin, TX 78712, USA
- Xin Lin
  - Department of Computer Science, University of Texas at Austin, TX 78712, USA
- Jiong Zhang
  - Institute for Computational Engineering and Sciences, University of Texas at Austin, TX 78712, USA
- Pradeep Ravikumar
  - Department of Computer Science, University of Texas at Austin, TX 78712, USA
  - Institute for Computational Engineering and Sciences, University of Texas at Austin, TX 78712, USA
- Inderjit S. Dhillon
  - Department of Computer Science, University of Texas at Austin, TX 78712, USA
  - Institute for Computational Engineering and Sciences, University of Texas at Austin, TX 78712, USA
24
Davies NJ, Krusche P, Tauber E, Ott S. Analysis of 5' gene regions reveals extraordinary conservation of novel non-coding sequences in a wide range of animals. BMC Evol Biol 2015; 15:227. [PMID: 26482678 PMCID: PMC4613772 DOI: 10.1186/s12862-015-0499-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 06/09/2015] [Accepted: 09/28/2015] [Indexed: 01/20/2023]
Abstract
Background: Phylogenetic footprinting is a comparative method based on the principle that functional sequence elements will acquire fewer mutations over time than non-functional sequences. Successful comparisons of distantly related species will thus yield highly important sequence elements likely to serve fundamental biological roles. RNA regulatory elements are less well understood than those in DNA. In this study we use the emerging model organism Nasonia vitripennis, a parasitic wasp, in a comparative analysis against 12 insect genomes to identify deeply conserved non-coding elements (CNEs) conserved in large groups of insects, with a focus on 5’ UTRs and promoter sequences. Results: We report the identification of 322 CNEs conserved across a broad range of insect orders. The identified regions are associated with regulatory and developmental genes, and contain short footprints revealing aspects of their likely function in translational regulation. The most ancient regions identified in our analysis were all found to overlap transcribed regions of genes, reflecting stronger conservation of translational regulatory elements than transcriptional elements. Further expanding sequence analyses to non-insect species we also report the discovery of, to our knowledge, the two oldest and most ubiquitous CNEs yet described in the animal kingdom (700 MYA). These ancient conserved non-coding elements are associated with the two ribosomal stalk genes, RPLP1 and RPLP2, and were very likely functional in some of the earliest animals. Conclusions: We report the identification of the most deeply conserved CNEs found to date, and several other deeply conserved elements which are, without exception, part of 5’ untranslated regions of transcripts and occur in a number of key translational regulatory genes, highlighting translational regulation of translational regulators as a conserved feature of insect genomes. Electronic supplementary material: The online version of this article (doi:10.1186/s12862-015-0499-6) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Peter Krusche
  - Warwick Systems Biology Centre, University of Warwick, Coventry, UK
- Eran Tauber
  - Department of Genetics, University of Leicester, Leicester, UK
- Sascha Ott
  - Warwick Systems Biology Centre, University of Warwick, Coventry, UK
25
Abstract
Pumilio is an RNA-binding protein originally identified in Drosophila, with a Puf domain made up of eight Puf repeats, three helix bundles arranged in a rainbow architecture, where each repeat recognizes a single base of the RNA-binding sequence. The eight-base recognition sequence can therefore be modified simply via mutation of the repeat that recognizes the base to be changed and this is understood in detail via high-resolution crystal structures. The binding mechanism is also altered in a variety of homologues from different species, with bases flipped out from the binding site to regenerate a consensus sequence. Thus Pumilios can be designed with bespoke RNA recognition sequences and can be fused to nucleases, split GFP, etc. as tools in vitro and in cells.
26
Thompson D, Regev A, Roy S. Comparative analysis of gene regulatory networks: from network reconstruction to evolution. Annu Rev Cell Dev Biol 2015; 31:399-428. [PMID: 26355593 DOI: 10.1146/annurev-cellbio-100913-012908] [Citation(s) in RCA: 95] [Impact Index Per Article: 10.6] [Indexed: 11/09/2022]
Abstract
Regulation of gene expression is central to many biological processes. Although reconstruction of regulatory circuits from genomic data alone is therefore desirable, this remains a major computational challenge. Comparative approaches that examine the conservation and divergence of circuits and their components across strains and species can help reconstruct circuits as well as provide insights into the evolution of gene regulatory processes and their adaptive contribution. In recent years, advances in genomic and computational tools have led to a wealth of methods for such analysis at the sequence, expression, pathway, module, and entire network level. Here, we review computational methods developed to study transcriptional regulatory networks using comparative genomics, from sequence to functional data. We highlight how these methods use evolutionary conservation and divergence to reliably detect regulatory components as well as estimate the extent and rate of divergence. Finally, we discuss the promise and open challenges in linking regulatory divergence to phenotypic divergence and adaptation.
Affiliation(s)
- Dawn Thompson
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142
27
Abstract
Motivation: The construction of statistics for summarizing posterior samples returned by a Bayesian phylogenetic study has so far been hindered by the poor geometric insights available into the space of phylogenetic trees. Ad hoc methods such as the derivation of a consensus tree make up for the ill-definition of the usual concept of posterior mean, while bootstrap methods mitigate the absence of a sound concept of variance. Although they yield satisfactory results with sufficiently concentrated posterior distributions, such methods fall short of providing a faithful summary of posterior distributions if the data do not offer compelling evidence for a single topology. Results: Building upon previous work of Billera et al., summary statistics such as sample mean, median and variance are defined as the Fréchet mean, geometric median and variance, respectively. Their computation is enabled by recently published works, and embeds an algorithm for computing shortest paths in the space of trees. Studying the phylogeny of a set of plants, where several tree topologies occur in the posterior sample, the posterior mean correctly balances the contributions from the different topologies, where a consensus tree would be biased. Comparisons of the posterior mean, median and consensus trees with the ground truth using simulated data also reveal the benefits of a sound averaging method when reconstructing phylogenetic trees. Availability and implementation: We provide two independent implementations of the algorithm for computing Fréchet means, geometric medians and variances in the space of phylogenetic trees. TFBayes: https://github.com/pbenner/tfbayes, TrAP: https://github.com/bacak/TrAP. Contact: philipp.benner@mis.mpg.de
Affiliation(s)
- Philipp Benner
- Max-Planck Institute for Mathematics in the Sciences, 04103 Leipzig, Germany and Isthmus SARL, 75002 Paris, France
- Miroslav Bačák
- Max-Planck Institute for Mathematics in the Sciences, 04103 Leipzig, Germany and Isthmus SARL, 75002 Paris, France
- Pierre-Yves Bourguignon
- Max-Planck Institute for Mathematics in the Sciences, 04103 Leipzig, Germany and Isthmus SARL, 75002 Paris, France
28
Taher L, Narlikar L, Ovcharenko I. Identification and computational analysis of gene regulatory elements. Cold Spring Harb Protoc 2015; 2015:pdb.top083642. [PMID: 25561628 PMCID: PMC5885252 DOI: 10.1101/pdb.top083642] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 06/04/2023]
Abstract
Over the last two decades, advances in experimental and computational technologies have greatly facilitated genomic research. Next-generation sequencing technologies have made de novo sequencing of large genomes affordable, and powerful computational approaches have enabled accurate annotations of genomic DNA sequences. Charting functional regions in genomes must account for not only the coding sequences, but also noncoding RNAs, repetitive elements, chromatin states, epigenetic modifications, and gene regulatory elements. A mix of comparative genomics, high-throughput biological experiments, and machine learning approaches has played a major role in this truly global effort. Here we describe some of these approaches and provide an account of our current understanding of the complex landscape of the human genome. We also present overviews of different publicly available, large-scale experimental data sets and computational tools, which we hope will prove beneficial for researchers working with large and complex genomes.
Affiliation(s)
- Leila Taher
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
- Institute for Biostatistics and Informatics in Medicine and Ageing Research, University of Rostock, 18051 Rostock, Germany
- Leelavati Narlikar
- Chemical Engineering and Process Development Division, National Chemical Laboratory, CSIR, Pune 411008, India
- Ivan Ovcharenko
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
29
Abstract
RNA-binding proteins (RBPs) are important regulators of eukaryotic gene expression. Genomes typically encode dozens to hundreds of proteins containing RNA-binding domains, which collectively recognize diverse RNA sequences and structures. Recent advances in high-throughput methods for assaying the targets of RBPs in vitro and in vivo allow large-scale derivation of RNA-binding motifs as well as determination of RNA–protein interactions in living cells. In parallel, many computational methods have been developed to analyze and interpret these data. The interplay between RNA secondary structure and RBP binding has also been a growing theme. Integrating RNA–protein interaction data with observations of post-transcriptional regulation will enhance our understanding of the roles of these important proteins.
30
Reyes-Herrera PH, Ficarra E. Computational Methods for CLIP-seq Data Processing. Bioinform Biol Insights 2014; 8:199-207. [PMID: 25336930 PMCID: PMC4196881 DOI: 10.4137/bbi.s16803] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 05/11/2014] [Revised: 07/29/2014] [Accepted: 08/01/2014] [Indexed: 12/25/2022]
Abstract
RNA-binding proteins (RBPs) are at the core of post-transcriptional regulation and thus of gene expression control at the RNA level. One of the principal challenges in the field of gene expression regulation is to understand the mechanisms of action of RBPs. Owing to recent advances in experimental techniques, it is now possible to obtain the RNA regions recognized by RBPs on a transcriptome-wide scale. In particular, CLIP-seq protocols combine crosslinking immunoprecipitation (CLIP) with high-throughput sequencing to recover the transcriptome-wide set of interaction regions for a particular protein. Nevertheless, computational methods are necessary to process CLIP-seq experimental data and are key to advancing the understanding of gene regulatory mechanisms. Considering the importance of computational methods in this area, we present a review of the current status of computational approaches used and proposed for CLIP-seq data.
Affiliation(s)
- Paula H Reyes-Herrera
- Facultad de Ingeniería Electrónica y Biomédica, Universidad Antonio Nariño, Bogotá, Colombia
- Elisa Ficarra
- Department of Control and Computer Engineering, Politecnico di Torino, TO, Italy
31
Baresic M, Salatino S, Kupr B, van Nimwegen E, Handschin C. Transcriptional network analysis in muscle reveals AP-1 as a partner of PGC-1α in the regulation of the hypoxic gene program. Mol Cell Biol 2014; 34:2996-3012. [PMID: 24912679 PMCID: PMC4135604 DOI: 10.1128/mcb.01710-13] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 12/26/2013] [Revised: 01/26/2014] [Accepted: 06/03/2014] [Indexed: 12/16/2022]
Abstract
Skeletal muscle tissue shows an extraordinary cellular plasticity, but the underlying molecular mechanisms are still poorly understood. Here, we use a combination of experimental and computational approaches to unravel the complex transcriptional network of muscle cell plasticity centered on the peroxisome proliferator-activated receptor γ coactivator 1α (PGC-1α), a regulatory nexus in endurance training adaptation. By integrating data on genome-wide binding of PGC-1α and gene expression upon PGC-1α overexpression with comprehensive computational prediction of transcription factor binding sites (TFBSs), we uncover a hitherto-underestimated number of transcription factor partners involved in mediating PGC-1α action. In particular, principal component analysis of TFBSs at PGC-1α binding regions predicts that, besides the well-known role of the estrogen-related receptor α (ERRα), the activator protein 1 complex (AP-1) plays a major role in regulating the PGC-1α-controlled gene program of the hypoxia response. Our findings thus reveal the complex transcriptional network of muscle cell plasticity controlled by PGC-1α.
Affiliation(s)
- Mario Baresic
- Focal Area Growth and Development, Biozentrum, University of Basel, Basel, Switzerland
- Silvia Salatino
- Focal Area Growth and Development, Biozentrum, University of Basel, Basel, Switzerland; Focal Area Computational and Systems Biology, Biozentrum, University of Basel, Basel, Switzerland; Swiss Institute of Bioinformatics, Basel, Switzerland
- Barbara Kupr
- Focal Area Growth and Development, Biozentrum, University of Basel, Basel, Switzerland
- Erik van Nimwegen
- Focal Area Computational and Systems Biology, Biozentrum, University of Basel, Basel, Switzerland; Swiss Institute of Bioinformatics, Basel, Switzerland
- Christoph Handschin
- Focal Area Growth and Development, Biozentrum, University of Basel, Basel, Switzerland
32
iRegulon: from a gene list to a gene regulatory network using large motif and track collections. PLoS Comput Biol 2014; 10:e1003731. [PMID: 25058159 PMCID: PMC4109854 DOI: 10.1371/journal.pcbi.1003731] [Citation(s) in RCA: 606] [Impact Index Per Article: 60.6] [Received: 02/13/2014] [Accepted: 05/27/2014] [Indexed: 01/17/2023]
Abstract
Identifying master regulators of biological processes and mapping their downstream gene networks are key challenges in systems biology. We developed a computational method, called iRegulon, to reverse-engineer the transcriptional regulatory network underlying a co-expressed gene set using cis-regulatory sequence analysis. iRegulon implements a genome-wide ranking-and-recovery approach to detect enriched transcription factor motifs and their optimal sets of direct targets. We increase the accuracy of network inference by using very large motif collections of up to ten thousand position weight matrices collected from various species, and linking these to candidate human TFs via a motif2TF procedure. We validate iRegulon on gene sets derived from ENCODE ChIP-seq data with increasing levels of noise, and we compare iRegulon with existing motif discovery methods. Next, we use iRegulon on more challenging types of gene lists, including microRNA target sets, protein-protein interaction networks, and genetic perturbation data. In particular, we over-activate p53 in breast cancer cells, followed by RNA-seq and ChIP-seq, and could identify an extensive up-regulated network controlled directly by p53. Similarly we map a repressive network with no indication of direct p53 regulation but rather an indirect effect via E2F and NFY. Finally, we generalize our computational framework to include regulatory tracks such as ChIP-seq data and show how motif and track discovery can be combined to map functional regulatory interactions among co-expressed genes. iRegulon is available as a Cytoscape plugin from http://iregulon.aertslab.org. Gene regulatory networks control developmental, homeostatic, and disease processes by governing precise levels and spatio-temporal patterns of gene expression. Determining their topology can provide mechanistic insight into these processes. Gene regulatory networks consist of interactions between transcription factors and their direct target genes. 
Each regulatory interaction represents the binding of the transcription factor to a specific DNA binding site near its target gene. Here we present a computational method, called iRegulon, to identify master regulators and direct target genes in a human gene signature, i.e. a set of co-expressed genes. iRegulon relies on the analysis of the regulatory sequences around each gene in the gene set to detect enriched TF motifs or ChIP-seq peaks, using databases of nearly 10,000 TF motifs and 1000 ChIP-seq data sets or “tracks”. Next, it associates enriched motifs and tracks with candidate transcription factors and determines the optimal subset of direct target genes. We validate iRegulon on ENCODE data, and use it in combination with RNA-seq and ChIP-seq data to map a p53 downstream network with new predicted co-factors and targets. iRegulon is available as a Cytoscape plugin, supporting human, mouse, and Drosophila genes, and provides access to hundreds of cancer-related TF-target subnetworks or “regulons”.
33
Kloetgen A, Münch PC, Borkhardt A, Hoell JI, McHardy AC. Biochemical and bioinformatic methods for elucidating the role of RNA-protein interactions in posttranscriptional regulation. Brief Funct Genomics 2014; 14:102-14. [PMID: 24951655 PMCID: PMC4471435 DOI: 10.1093/bfgp/elu020] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 01/03/2023]
Abstract
Our understanding of transcriptional gene regulation has dramatically increased over the past decades, and many regulators of gene expression, such as transcription factors, have been analyzed extensively. Additionally, in recent years, deeper insights into the physiological roles of RNA have been obtained. More precisely, splicing, polyadenylation, various modifications, localization and the translation of messenger RNAs (mRNAs) are regulated by their interaction with RNA-binding proteins (RBPs). New technologies now enable the analysis of this regulation at different levels. A technique known as ultraviolet (UV) cross-linking and immunoprecipitation (CLIP) allows us to determine physical protein–RNA interactions on a genome-wide scale. UV cross-linking introduces covalent bonds between interacting RBPs and RNAs. In combination with immunoprecipitation and deep sequencing techniques, tens of millions of short reads (representing bound RNAs by an RBP of interest) are generated and are used to characterize the regulatory network mediated by an RBP. Other methods, such as mass spectrometry, can also be used for characterization of cross-linked RBPs and RNAs instead of CLIP methods. In this review, we discuss experimental and computational methods for the generation and analysis of CLIP data. The computational methods include short-read alignment, annotation and RNA-binding motif discovery. We describe the challenges of analyzing CLIP data and indicate areas where improvements are needed.
Affiliation(s)
- Alice C McHardy
- Corresponding author. Alice C. McHardy, Heinrich-Heine University, Department of Algorithmic Bioinformatics, Universitaetsstrasse 1, 40225 Duesseldorf, Germany. Tel.: +49-211-8110427; Fax: +49-211-8113464
34
Santolini M, Mora T, Hakim V. A general pairwise interaction model provides an accurate description of in vivo transcription factor binding sites. PLoS One 2014; 9:e99015. [PMID: 24926895 PMCID: PMC4057186 DOI: 10.1371/journal.pone.0099015] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 12/10/2013] [Accepted: 05/09/2014] [Indexed: 11/19/2022]
Abstract
The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial importance for understanding and predicting regulatory elements in gene networks. TFBS motifs are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair contributes independently to the transcription factor (TF) binding. However, this description ignores correlations between nucleotides at different positions, and is generally inaccurate: analysing fly and mouse in vivo ChIPseq data, we show that in most cases the PWM model fails to reproduce the observed statistics of TFBSs. To overcome this issue, we introduce the pairwise interaction model (PIM), a generalization of the PWM model. The model is based on the principle of maximum entropy and explicitly describes pairwise correlations between nucleotides at different positions, while being otherwise as unconstrained as possible. It is mathematically equivalent to considering a TF-DNA binding energy that depends additively on each nucleotide identity at all positions in the TFBS, like the PWM model, but also additively on pairs of nucleotides. We find that the PIM significantly improves over the PWM model, and even provides an optimal description of TFBS statistics within statistical noise. The PIM generalizes previous approaches to interdependent positions: it accounts for co-variation of two or more base pairs, and predicts secondary motifs, while outperforming multiple-motif models consisting of mixtures of PWMs. We analyse the structure of pairwise interactions between nucleotides, and find that they are sparse and dominantly located between consecutive base pairs in the flanking region of TFBS. Nonetheless, interactions between pairs of non-consecutive nucleotides are found to play a significant role in the obtained accurate description of TFBS statistics. 
The PIM is computationally tractable, and provides a general framework that should be useful for describing and predicting TFBSs beyond PWMs.
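The difference between the two binding-energy models described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy parameter values, and the convention that lower energy means stronger binding are all assumptions made for illustration, and the maximum-entropy fitting of the parameters is not shown.

```python
from itertools import combinations


def pwm_energy(seq, h):
    """PWM-style energy: each position contributes independently.

    h is a list with one dict per position, mapping a base to its
    (hypothetical) single-site energy; missing bases contribute 0.
    """
    return sum(h[i].get(base, 0.0) for i, base in enumerate(seq))


def pim_energy(seq, h, J):
    """Pairwise-interaction energy: the PWM terms plus one term per pair.

    J maps (i, j, base_i, base_j) with i < j to a (hypothetical)
    coupling energy; missing pairs contribute 0.
    """
    pair_terms = sum(
        J.get((i, j, seq[i], seq[j]), 0.0)
        for i, j in combinations(range(len(seq)), 2)
    )
    return pwm_energy(seq, h) + pair_terms


# Toy parameters for a 3-bp site (illustrative values only).
h = [{"A": -1.0}, {}, {"G": -0.5}]
J = {(0, 1, "A", "C"): -0.75}  # coupling between consecutive positions 0 and 1

print(pwm_energy("ACG", h))      # single-site terms only
print(pim_energy("ACG", h, J))   # adds the A-C coupling
print(pim_energy("AGG", h, J))   # no coupling applies, equals the PWM energy
```

With an empty coupling table the second function reduces to the first, which mirrors the abstract's statement that the PIM is a generalization of the PWM model.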
Affiliation(s)
- Marc Santolini
- Laboratoire de Physique Statistique, CNRS, Université P. et M. Curie, Université D. Diderot, École Normale Supérieure, Paris, France
- Thierry Mora
- Laboratoire de Physique Statistique, CNRS, Université P. et M. Curie, Université D. Diderot, École Normale Supérieure, Paris, France
- Vincent Hakim
- Laboratoire de Physique Statistique, CNRS, Université P. et M. Curie, Université D. Diderot, École Normale Supérieure, Paris, France
35
Jablonska A, Polouliakh N. In silico discovery of novel transcription factors regulated by mTOR-pathway activities. Front Cell Dev Biol 2014; 2:23. [PMID: 25364730 PMCID: PMC4206986 DOI: 10.3389/fcell.2014.00023] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 04/01/2014] [Accepted: 05/09/2014] [Indexed: 12/21/2022]
Abstract
The mammalian target of rapamycin (mTOR) pathway is a key regulator of cellular growth, development, and ageing, and unraveling its control is essential for understanding the life and death of biological organisms. A motif-discovery workbench including nine tools was used to identify transcription factors involved in five basic (Insulin, MAPK, VEGF, Hypoxia, and mTOR core) activities of the mTOR pathway. Discovered transcription factors are classified as “process-specific” or “pathway-ubiquitous”, with emphasis on their regulating/regulated activities within the mTOR pathway. Our transcription regulation results will facilitate further research into the control mechanism of the mTOR pathway.
Affiliation(s)
- Agnieszka Jablonska
- Faculty of Biotechnology and Food Sciences, Lodz University of Technology, Lodz, Poland
- Natalia Polouliakh
- Fundamental Research Laboratories, Sony Computer Science Laboratories Inc., Tokyo, Japan; Systems Biology Institute, Tokyo, Japan; Graduate School of Medicine, Yokohama City University, Yokohama, Japan
36
Azmi AM, Al-Ssulami A. Encoded expansion: an efficient algorithm to discover identical string motifs. PLoS One 2014; 9:e95148. [PMID: 24871320 PMCID: PMC4037181 DOI: 10.1371/journal.pone.0095148] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/14/2013] [Accepted: 03/24/2014] [Indexed: 11/19/2022]
Abstract
A major task in computational biology is the discovery of short recurring string patterns known as motifs. Most of the schemes to discover motifs are either stochastic or combinatorial in nature. Stochastic approaches do not guarantee finding the correct motifs, while the combinatorial schemes tend to have an exponential time complexity with respect to motif length. To alleviate the cost, the combinatorial approach exploits dynamic data structures such as trees or graphs. Recently, Karci (2009; "Efficient automatic exact motif discovery algorithms for biological sequences", Expert Systems with Applications 36:7952-7963) devised a deterministic algorithm that finds all the identical copies of string motifs of all sizes [Formula: see text] in theoretical time complexity of [Formula: see text] and a space complexity of [Formula: see text] where [Formula: see text] is the length of the input sequence and [Formula: see text] is the length of the longest possible string motif. In this paper, we present a significant improvement on Karci's original algorithm. The algorithm that we propose reports all identical string motifs of sizes [Formula: see text] that occur at least [Formula: see text] times. Our algorithm starts with string motifs of size 2, and at each iteration it expands the candidate string motifs by one symbol, discarding those that occur fewer than [Formula: see text] times in the entire input sequence. We use a simple array and data encoding to achieve theoretical worst-case time complexity of [Formula: see text] and a space complexity of [Formula: see text]. Encoding of the substrings can speed up the process of comparison between string motifs. Experimental results on random and real biological sequences confirm that our algorithm indeed has linear time complexity and is more scalable in terms of sequence length than the existing algorithms.
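The expand-and-prune loop described in this abstract (seed with motifs of size 2, extend survivors by one symbol, discard those below the occurrence threshold) can be sketched in Python as follows. This is a simplified illustration only: it uses plain dictionaries of start positions rather than the array and data encoding the paper describes, so it makes no claim about the paper's complexity bounds, and the function and parameter names (`frequent_exact_motifs`, `min_count`, `max_len`) are hypothetical.

```python
from collections import defaultdict


def frequent_exact_motifs(seq, min_count, max_len):
    """Return {motif: start positions} for exact substrings of length
    2..max_len that occur at least min_count times (overlaps counted)."""
    # Seed: collect the start positions of every length-2 substring.
    seeds = defaultdict(list)
    for i in range(len(seq) - 1):
        seeds[seq[i:i + 2]].append(i)
    current = {m: p for m, p in seeds.items() if len(p) >= min_count}

    result = dict(current)
    length = 2
    while current and length < max_len:
        # Expand each surviving motif by the symbol that follows each occurrence.
        extended = defaultdict(list)
        for motif, positions in current.items():
            for start in positions:
                end = start + length
                if end < len(seq):
                    extended[motif + seq[end]].append(start)
        length += 1
        # Prune: a motif can only be frequent if its prefix was frequent,
        # so discarding infrequent candidates here loses nothing.
        current = {m: p for m, p in extended.items() if len(p) >= min_count}
        result.update(current)
    return result


motifs = frequent_exact_motifs("ACGTACGTAC", min_count=2, max_len=4)
for motif in sorted(motifs, key=lambda m: (len(m), m)):
    print(motif, motifs[motif])
```

Pruning before each expansion keeps the candidate set small, which is the property the abstract relies on when it expands only motifs that meet the threshold.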
Affiliation(s)
- Aqil M. Azmi
- Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh, Saudi Arabia
- Abdulrakeeb Al-Ssulami
- Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh, Saudi Arabia
37
Glenwinkel L, Wu D, Minevich G, Hobert O. TargetOrtho: a phylogenetic footprinting tool to identify transcription factor targets. Genetics 2014; 197:61-76. [PMID: 24558259 PMCID: PMC4012501 DOI: 10.1534/genetics.113.160721] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 01/23/2014] [Accepted: 02/09/2014] [Indexed: 11/18/2022]
Abstract
The identification of the regulatory targets of transcription factors is central to our understanding of how transcription factors fulfill their many key roles in development and homeostasis. DNA-binding sites have been uncovered for many transcription factors through a number of experimental approaches, but it has proven difficult to use this binding site information to reliably predict transcription factor target genes in genomic sequence space. Using the nematode Caenorhabditis elegans and other related nematode species as a starting point, we describe here a bioinformatic pipeline that identifies potential transcription factor target genes from genomic sequences. Among the key features of this pipeline is the use of sequence conservation of transcription-factor-binding sites in related species. Rather than using aligned genomic DNA sequences from the genomes of multiple species as a starting point, TargetOrtho scans related genome sequences independently for matches to user-provided transcription-factor-binding motifs, assigns motif matches to adjacent genes, and then determines whether orthologous genes in different species also contain motif matches. We validate TargetOrtho by identifying previously characterized targets of three different types of transcription factors in C. elegans, and we use TargetOrtho to identify novel target genes of the Collier/Olf/EBF transcription factor UNC-3 in C. elegans ventral nerve cord motor neurons. We have also implemented the use of TargetOrtho in Drosophila melanogaster using conservation among five species in the D. melanogaster species subgroup for target gene discovery.
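The alignment-free strategy outlined in this abstract (scan each genome independently, assign motif matches to genes, then check whether orthologs also carry a match) can be sketched with toy data in Python. This is not TargetOrtho's code or interface: the function names, the regular-expression motif, the pre-extracted upstream sequences standing in for "adjacent gene" assignment, and the orthology table are all hypothetical simplifications, and only the forward strand is scanned.

```python
import re


def genes_with_motif(upstream_by_gene, motif_regex):
    """Scan one genome: return the genes whose upstream sequence has a match."""
    pattern = re.compile(motif_regex)
    return {gene for gene, seq in upstream_by_gene.items() if pattern.search(seq)}


def conserved_targets(reference, others, orthologs, motif_regex, min_species=1):
    """Return {reference gene: number of other species whose ortholog also matches}.

    reference: {gene: upstream sequence} for the reference species
    others:    {species: {gene: upstream sequence}}, each scanned independently
    orthologs: {reference gene: {species: ortholog gene}}
    """
    ref_hits = genes_with_motif(reference, motif_regex)
    hits_by_species = {
        species: genes_with_motif(seqs, motif_regex)
        for species, seqs in others.items()
    }
    support = {}
    for gene in ref_hits:
        n = sum(
            1
            for species, hits in hits_by_species.items()
            if orthologs.get(gene, {}).get(species) in hits
        )
        if n >= min_species:
            support[gene] = n
    return support


# Toy data (invented sequences and gene names).
motif = "TGAC[AG]T"
reference = {"geneA": "AATGACATCC", "geneB": "GGGGGGGG", "geneC": "TTTGACGTAA"}
others = {
    "sp2": {"a2": "CCTGACGTGG", "c2": "AAAAAAA"},
    "sp3": {"a3": "TGACATTT", "c3": "GGTGACATGG"},
}
orthologs = {
    "geneA": {"sp2": "a2", "sp3": "a3"},
    "geneC": {"sp2": "c2", "sp3": "c3"},
}
print(conserved_targets(reference, others, orthologs, motif))
```

Because each genome is scanned on its own, a match need not sit at the same aligned position in every species; it only has to be associated with an orthologous gene.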
Affiliation(s)
- Lori Glenwinkel
- Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University Medical Center, New York, New York 10032
- Gregory Minevich
- Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University Medical Center, New York, New York 10032
- Oliver Hobert
- Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University Medical Center, New York, New York 10032
38
Rouault H, Santolini M, Schweisguth F, Hakim V. Imogene: identification of motifs and cis-regulatory modules underlying gene co-regulation. Nucleic Acids Res 2014; 42:6128-45. [PMID: 24682824 PMCID: PMC4041412 DOI: 10.1093/nar/gku209] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 11/23/2022]
Abstract
Cis-regulatory modules (CRMs) and motifs play a central role in tissue and condition-specific gene expression. Here we present Imogene, an ensemble of statistical tools that we have developed to facilitate their identification and implemented in publicly available software. Starting from a small training set of mammalian or fly CRMs that drive similar gene expression profiles, Imogene determines de novo cis-regulatory motifs that underlie this co-expression. It can then predict on a genome-wide scale other CRMs with a regulatory potential similar to the training set. Imogene bypasses the need for large datasets for statistical analyses by making central use of the information provided by the sequenced genomes of multiple species, based on the developed statistical tools and explicit models for transcription factor binding site evolution. We test Imogene on characterized tissue-specific mouse developmental CRMs. Its ability to identify CRMs with the same specificity based on its de novo created motifs is comparable to that of previously evaluated ‘motif-blind’ methods. We further show, both in flies and in mammals, that Imogene's de novo generated motifs are sufficient to discriminate CRMs related to different developmental programs. Notably, purely relying on sequence data, Imogene performs as well in this discrimination task as a previously reported learning algorithm based on Chromatin Immunoprecipitation (ChIP) data for multiple transcription factors at multiple developmental stages.
Affiliation(s)
- Hervé Rouault
- Developmental and Stem Cell Biology Department, Institut Pasteur, F-75015 Paris, France; CNRS, URA2578, F-75015 Paris, France
- Marc Santolini
- Laboratoire de Physique Statistique, CNRS, École Normale Supérieure, Université P. et M. Curie, Université Paris-Diderot
- François Schweisguth
- Developmental and Stem Cell Biology Department, Institut Pasteur, F-75015 Paris, France; CNRS, URA2578, F-75015 Paris, France
- Vincent Hakim
- Laboratoire de Physique Statistique, CNRS, École Normale Supérieure, Université P. et M. Curie, Université Paris-Diderot
39
Diepenbruck M, Waldmeier L, Ivanek R, Berninger P, Arnold P, van Nimwegen E, Christofori G. Tead2 expression levels control the subcellular distribution of Yap and Taz, zyxin expression and epithelial-mesenchymal transition. J Cell Sci 2014; 127:1523-36. [PMID: 24554433 DOI: 10.1242/jcs.139865] [Citation(s) in RCA: 101] [Impact Index Per Article: 10.1] [Indexed: 12/17/2022]
Abstract
The cellular changes during an epithelial-mesenchymal transition (EMT) largely rely on global changes in gene expression orchestrated by transcription factors. Tead transcription factors and their transcriptional co-activators Yap and Taz have been previously implicated in promoting an EMT; however, their direct transcriptional target genes and their functional role during EMT have remained elusive. We have uncovered a previously unanticipated role of the transcription factor Tead2 during EMT. During EMT in mammary gland epithelial cells and breast cancer cells, levels of Tead2 increase in the nucleus of cells, thereby directing a predominant nuclear localization of its co-factors Yap and Taz via the formation of Tead2-Yap-Taz complexes. Genome-wide chromatin immunoprecipitation and next generation sequencing in combination with gene expression profiling revealed the transcriptional targets of Tead2 during EMT. Among these, zyxin contributes to the migratory and invasive phenotype evoked by Tead2. The results demonstrate that Tead transcription factors are crucial regulators of the cellular distribution of Yap and Taz, and together they control the expression of genes critical for EMT and metastasis.
Affiliation(s)
- Maren Diepenbruck
- Department of Biomedicine, University of Basel, 4058 Basel, Switzerland
40
Eggeling R, Gohr A, Keilwagen J, Mohr M, Posch S, Smith AD, Grosse I. On the value of intra-motif dependencies of human insulator protein CTCF. PLoS One 2014; 9:e85629. [PMID: 24465627 PMCID: PMC3899044 DOI: 10.1371/journal.pone.0085629] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 08/09/2013] [Accepted: 12/05/2013] [Indexed: 01/08/2023]
Abstract
The binding affinity of DNA-binding proteins such as transcription factors is mainly determined by the base composition of the corresponding binding site on the DNA strand. Most proteins do not bind only a single sequence, but rather a set of sequences, which may be modeled by a sequence motif. Algorithms for de novo motif discovery differ in their promoter models, learning approaches, and other aspects, but typically use the statistically simple position weight matrix model for the motif, which assumes statistical independence among all nucleotides. However, there is no clear justification for that assumption, leading to an ongoing debate about the importance of modeling dependencies between nucleotides within binding sites. In the past, modeling statistical dependencies within binding sites has been hampered by the problem of limited data. With the rise of high-throughput technologies such as ChIP-seq, this situation has now changed, making it possible to make use of statistical dependencies effectively. In this work, we investigate the presence of statistical dependencies in binding sites of the human enhancer-blocking insulator protein CTCF by using the recently developed model class of inhomogeneous parsimonious Markov models, which is capable of modeling complex dependencies while avoiding overfitting. These findings lead to a more detailed characterization of the CTCF binding motif, which is only poorly represented by independent nucleotide frequencies at several positions, predominantly at the 3' end.
Affiliation(s)
- Ralf Eggeling
  - Institute of Computer Science, Martin Luther University Halle–Wittenberg, Halle/Saale, Germany
- André Gohr
  - Institute of Computer Science, Martin Luther University Halle–Wittenberg, Halle/Saale, Germany
- Jens Keilwagen
  - Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants, Quedlinburg, Germany
  - Department of Genebank, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Seeland OT Gatersleben, Germany
- Michaela Mohr
  - Department of Genebank, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Seeland OT Gatersleben, Germany
- Stefan Posch
  - Institute of Computer Science, Martin Luther University Halle–Wittenberg, Halle/Saale, Germany
- Andrew D. Smith
  - Molecular and Computational Biology, University of Southern California, Los Angeles, United States of America
- Ivo Grosse
  - Institute of Computer Science, Martin Luther University Halle–Wittenberg, Halle/Saale, Germany
  - Department of Genebank, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Seeland OT Gatersleben, Germany
  - German Center of Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
|
41
|
Re A, Joshi T, Kulberkyte E, Morris Q, Workman CT. RNA-protein interactions: an overview. Methods Mol Biol 2014; 1097:491-521. [PMID: 24639174 DOI: 10.1007/978-1-62703-709-9_23] [Citation(s) in RCA: 76] [Impact Index Per Article: 7.6] [Indexed: 02/05/2023]
Abstract
RNA binding proteins (RBPs) are key players in the regulation of gene expression. In this chapter we discuss the main protein-RNA recognition modes used by RBPs in order to regulate multiple steps of RNA processing. We discuss traditional and state-of-the-art technologies that can be used to study RNAs bound by individual RBPs, or vice versa, for both in vitro and in vivo methodologies. To help highlight the biological significance of RBP mediated regulation, online resources on experimentally verified protein-RNA interactions are briefly presented. Finally, we present the major tools to computationally infer RNA binding sites according to the modeling features and to the unsupervised or supervised frameworks that are adopted. Since some RNA binding site search algorithms are derived from DNA binding site search algorithms, we discuss the commonalities and novelties introduced to handle both sequence and structural features uniquely characterizing protein-RNA interactions.
Affiliation(s)
- Angela Re
  - University of Trento, Mattarello, Italy
|
42
|
Duque T, Samee MAH, Kazemian M, Pham HN, Brodsky MH, Sinha S. Simulations of enhancer evolution provide mechanistic insights into gene regulation. Mol Biol Evol 2013; 31:184-200. [PMID: 24097306 PMCID: PMC3879441 DOI: 10.1093/molbev/mst170] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 01/14/2023]
Abstract
There is growing interest in models of regulatory sequence evolution. However, existing models specifically designed for regulatory sequences consider the independent evolution of individual transcription factor (TF)-binding sites, ignoring that the function and evolution of a binding site depends on its context, typically the cis-regulatory module (CRM) in which the site is located. Moreover, existing models do not account for the gene-specific roles of TF-binding sites, primarily because their roles often are not well understood. We introduce two models of regulatory sequence evolution that address some of the shortcomings of existing models and implement simulation frameworks based on them. One model simulates the evolution of an individual binding site in the context of a CRM, while the other evolves an entire CRM. Both models use a state-of-the-art sequence-to-expression model to predict the effects of mutations on the regulatory output of the CRM and determine the strength of selection. We use the new framework to simulate the evolution of TF-binding sites in 37 well-studied CRMs belonging to the anterior-posterior patterning system in Drosophila embryos. We show that these simulations provide accurate fits to evolutionary data from 12 Drosophila genomes, which include statistics of binding site conservation on relatively short evolutionary scales and site loss across larger divergence times. The new framework allows us, for the first time, to test hypotheses regarding the underlying cis-regulatory code by directly comparing the evolutionary implications of the hypothesis with the observed evolutionary dynamics of binding sites. Using this capability, we find that explicitly modeling self-cooperative DNA binding by the TF Caudal (CAD) provides significantly better fits than an otherwise identical evolutionary simulation that lacks this mechanistic aspect. This hypothesis is further supported by a statistical analysis of the distribution of intersite spacing between adjacent CAD sites. Experimental tests confirm direct homodimeric interaction between CAD molecules as well as self-cooperative DNA binding by CAD. We note that computational modeling of the D. melanogaster CRMs alone did not yield significant evidence to support CAD self-cooperativity. We thus demonstrate how specific mechanistic details encoded in CRMs can be revealed by modeling their evolution and fitting such models to multispecies data.
Affiliation(s)
- Thyago Duque
  - Department of Computer Science, University of Illinois at Urbana-Champaign
|
43
|
Brümmer A, Kishore S, Subasic D, Hengartner M, Zavolan M. Modeling the binding specificity of the RNA-binding protein GLD-1 suggests a function of coding region-located sites in translational repression. RNA 2013; 19:1317-1326. [PMID: 23974436 PMCID: PMC3854522 DOI: 10.1261/rna.037531.112] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 12/03/2012] [Accepted: 06/25/2013] [Indexed: 06/02/2023]
Abstract
To understand the function of the hundreds of RNA-binding proteins (RBPs) that are encoded in animal genomes it is important to identify their target RNAs. Although it is generally accepted that the binding specificity of an RBP is well described in terms of the nucleotide sequence of its binding sites, other factors such as the structural accessibility of binding sites or their clustering, to enable binding of RBP multimers, are also believed to play a role. Here we focus on GLD-1, a translational regulator of Caenorhabditis elegans, whose binding specificity and targets have been studied with a variety of methods such as CLIP (cross-linking and immunoprecipitation), RIP-Chip (microarray measurement of RNAs associated with an immunoprecipitated protein), profiling of polysome-associated mRNAs and biophysical determination of binding affinities of GLD-1 for short nucleotide sequences. We show that a simple biophysical model explains the binding of GLD-1 to mRNA targets to a large extent, and that taking into account the accessibility of putative target sites significantly improves the prediction of GLD-1 binding, particularly due to a more accurate prediction of binding in transcript coding regions. Relating GLD-1 binding to translational repression and stabilization of its target transcripts we find that binding sites along the entire transcripts contribute to functional responses, and that CDS-located sites contribute most to translational repression. Finally, biophysical measurements of GLD-1 affinity for a small number of oligonucleotides appear to allow an accurate reconstruction of the sequence specificity of the protein. This approach can be applied to uncover the specificity and function of other RBPs.
Affiliation(s)
- Anneke Brümmer
  - Biozentrum, University of Basel, 4056 Basel, Switzerland
- Deni Subasic
  - Institute of Molecular Life Sciences, University of Zurich, 8057 Zurich, Switzerland
- Michael Hengartner
  - Institute of Molecular Life Sciences, University of Zurich, 8057 Zurich, Switzerland
|
44
|
Saha S, Lindeberg M. Bound to Succeed: transcription factor binding-site prediction and its contribution to understanding virulence and environmental adaptation in bacterial plant pathogens. Mol Plant Microbe Interact 2013; 26:1123-1130. [PMID: 23802990 DOI: 10.1094/mpmi-04-13-0090-cr] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/02/2023]
Abstract
Bacterial plant pathogens rely on a battalion of transcription factors to fine-tune their response to changing environmental conditions and to marshal the genetic resources required for successful pathogenesis. Prediction of transcription factor binding sites (TFBS) represents an important tool for elucidating regulatory networks and has been conducted in multiple genera of plant-pathogenic bacteria for the purpose of better understanding mechanisms of survival and pathogenesis. The major categories of TFBS that have been characterized are reviewed here, with emphasis on in silico methods used for site identification and challenges therein, their applicability to different types of sequence datasets, and insights into mechanisms of virulence and survival that have been gained through binding-site mapping. An improved strategy for establishing E-value cutoffs when using existing models to screen uncharacterized genomes is also discussed.
|
45
|
Nucleosome free regions in yeast promoters result from competitive binding of transcription factors that interact with chromatin modifiers. PLoS Comput Biol 2013; 9:e1003181. [PMID: 23990766 PMCID: PMC3749953 DOI: 10.1371/journal.pcbi.1003181] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 12/21/2012] [Accepted: 07/04/2013] [Indexed: 11/19/2022]
Abstract
Because DNA packaging in nucleosomes modulates its accessibility to transcription factors (TFs), unraveling the causal determinants of nucleosome positioning is of great importance to understanding gene regulation. Although there is evidence that intrinsic sequence specificity contributes to nucleosome positioning, the extent to which other factors contribute to nucleosome positioning is currently highly debated. Here we obtained both in vivo and in vitro reference maps of positions that are either consistently covered or free of nucleosomes across multiple experimental data-sets in Saccharomyces cerevisiae. We then systematically quantified the contribution of TF binding to nucleosome positioning using a rigorous statistical mechanics model in which TFs compete with nucleosomes for binding DNA. Our results reconcile previous seemingly conflicting results on the determinants of nucleosome positioning and provide a quantitative explanation for the difference between in vivo and in vitro positioning. On a genome-wide scale, nucleosome positioning is dominated by the phasing of nucleosome arrays over gene bodies, and their positioning is mainly determined by the intrinsic sequence preferences of nucleosomes. In contrast, larger nucleosome free regions in promoters, which likely have a much more significant impact on gene expression, are determined mainly by TF binding. Interestingly, of the 158 yeast TFs included in our modeling, we find that only 10–20 significantly contribute to inducing nucleosome-free regions, and these TFs are highly enriched for having direct interactions with chromatin remodelers. Together our results imply that nucleosome free regions in yeast promoters result from the binding of a specific class of TFs that recruit chromatin remodelers.
The DNA of all eukaryotic organisms is packaged into nucleosomes, which cover roughly of the genome. As nucleosome positioning profoundly affects DNA accessibility to other DNA binding proteins such as transcription factors (TFs), it plays an important role in transcription regulation. However, to what extent nucleosome positioning is guided by intrinsic DNA sequence preferences of nucleosomes, and to what extent other DNA binding factors play a role, is currently highly debated. Here we use a rigorous biophysical model to systematically study the relative contributions of intrinsic sequence preferences and competitive binding of TFs to nucleosome positioning in yeast. We find that, on the one hand, the phasing of the many small spacers within dense nucleosome arrays that cover gene bodies is mainly determined by intrinsic sequence preferences. On the other hand, larger nucleosome free regions (NFRs) in promoters are explained predominantly by TF binding. Strikingly, we find that only 10–20 TFs make a significant contribution to explaining NFRs, and these TFs are highly enriched for directly interacting with chromatin modifiers. Thus, the picture that emerges is that binding by a specific class of TFs recruits chromatin modifiers which mediate local nucleosome expulsion.
|
46
|
Leibovich L, Paz I, Yakhini Z, Mandel-Gutfreund Y. DRIMust: a web server for discovering rank imbalanced motifs using suffix trees. Nucleic Acids Res 2013; 41:W174-9. [PMID: 23685432 PMCID: PMC3692051 DOI: 10.1093/nar/gkt407] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Indexed: 11/12/2022]
Abstract
Cellular regulation mechanisms that involve proteins and other active molecules interacting with specific targets often involve the recognition of sequence patterns. Short sequence elements on DNA, RNA and proteins play a central role in mediating such molecular recognition events. Studies that focus on measuring and investigating sequence-based recognition processes make use of statistical and computational tools that support the identification and understanding of sequence motifs. We present a new web application, named DRIMust, freely accessible through the website http://drimust.technion.ac.il for de novo motif discovery services. The DRIMust algorithm is based on the minimum hypergeometric statistical framework and uses suffix trees for an efficient enumeration of motif candidates. DRIMust takes as input ranked lists of sequences in FASTA format and returns motifs that are over-represented at the top of the list, where the determination of the threshold that defines top is data driven. The resulting motifs are presented individually with an accurate P-value indication and as a Position Specific Scoring Matrix. Comparing DRIMust with other state-of-the-art tools demonstrated significant advantage to DRIMust, both in result accuracy and in short running times. Overall, DRIMust is unique in combining efficient search on large ranked lists with rigorous P-value assessment for the detected motifs.
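The idea of a data-driven "top of the list" threshold in the minimum hypergeometric framework named in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the input is a plain 0/1 occurrence vector for one candidate motif in rank order, the suffix-tree enumeration of candidates is not shown, and the returned minimum is a raw score that would still need a multiple-testing correction before being read as a P-value.

```python
from math import comb

def hg_tail(total, successes, draws, observed):
    """P(X >= observed) for a hypergeometric variable X.

    total: list length, successes: sequences containing the motif,
    draws: size of the top-of-list prefix, observed: hits in that prefix.
    """
    upper = min(successes, draws)
    numerator = sum(
        comb(successes, k) * comb(total - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return numerator / comb(total, draws)

def min_hypergeometric(labels):
    """Scan every prefix of a ranked 0/1 list and keep the smallest tail.

    Returns (score, cutoff), where cutoff is the prefix length at which
    the motif looks most enriched. The cutoff is chosen by the data.
    """
    total = len(labels)
    successes = sum(labels)
    best_score, best_cutoff = 1.0, 0
    hits = 0
    for cutoff in range(1, total + 1):
        hits += labels[cutoff - 1]
        tail = hg_tail(total, successes, cutoff, hits)
        if tail < best_score:
            best_score, best_cutoff = tail, cutoff
    return best_score, best_cutoff
```

For a ranked list such as `[1, 1, 1, 0, 0, 0, 0, 0]`, `min_hypergeometric` scans all eight prefixes and selects the one where the three occurrences are most concentrated, rather than requiring the caller to fix a cutoff in advance.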
Affiliation(s)
- Limor Leibovich
  - Department of Computer Science, Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel
|
47
|
Signal correlations in ecological niches can shape the organization and evolution of bacterial gene regulatory networks. Adv Microb Physiol 2013; 61:1-36. [PMID: 23046950 DOI: 10.1016/b978-0-12-394423-8.00001-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 02/06/2023]
Abstract
Transcriptional regulation plays a significant role in the biological response of bacteria to changing environmental conditions. Therefore, mapping transcriptional regulatory networks is an important step not only in understanding how bacteria sense and interpret their environment but also in identifying the functions involved in biological responses to specific conditions. Recent experimental and computational developments have facilitated the characterization of regulatory networks on a genome-wide scale in model organisms. In addition, the multiplication of complete genome sequences has encouraged comparative analyses to detect conserved regulatory elements and infer regulatory networks in other less well-studied organisms. However, transcription regulation appears to evolve rapidly, thus creating challenges for the transfer of knowledge to nonmodel organisms. Nevertheless, the mechanisms and constraints driving the evolution of regulatory networks have been the subjects of numerous analyses, and several models have been proposed. Overall, the contributions of mutations, recombination, and horizontal gene transfer are complex. Finally, the rapid evolution of regulatory networks plays a significant role in the remarkable capacity of bacteria to adapt to new or changing environments. Conversely, the characteristics of environmental niches determine the selective pressures and can shape the structure of regulatory networks accordingly.
|
48
|
Shao J, Zhang J, Zhang Z, Jiang H, Lou X, Huang B, Foltz G, Lan Q, Huang Q, Lin B. Alternative polyadenylation in glioblastoma multiforme and changes in predicted RNA binding protein profiles. OMICS 2013; 17:136-49. [PMID: 23421905 DOI: 10.1089/omi.2012.0098] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 11/12/2022]
Abstract
Alternative polyadenylation (APA) is widely present in the human genome and plays a key role in carcinogenesis. We conducted a comprehensive analysis of the APA products in glioblastoma multiforme (GBM, one of the most lethal brain tumors) and normal brain tissues and further developed a computational pipeline, RNAelements (http://sysbio.zju.edu.cn/RNAelements/), using a covariance model built from known RNA binding protein (RBP) targets acquired by RNA Immunoprecipitation (RIP) analysis. We identified 4530 APA isoforms for 2733 genes in GBM, and found that 182 APA isoforms from 148 genes showed significant differential expression between normal and GBM brain tissues. We then focused on three genes with long and short APA isoforms that show inconsistent expression changes between normal and GBM brain tissues. These were myocyte enhancer factor 2D, heat shock factor binding protein 1, and polyhomeotic homolog 1 (Drosophila). Using the RNAelements program, we found that RBP binding sites were enriched in the alternative regions between the first and the last polyadenylation sites, which would result in the short APA forms escaping regulation by those RNA binding proteins. To the best of our knowledge, this report is the first comprehensive APA isoform dataset for GBM and normal brain tissues. Additionally, we demonstrated a putative novel APA-mediated mechanism for controlling RNA stability and translation for APA isoforms. These observations collectively lay a foundation for novel diagnostics and molecular mechanisms that can inform future therapeutic interventions for GBM.
Affiliation(s)
- Jiaofang Shao
  - Systems Biology Division, Zhejiang-California International NanoSystems Institute, Zhejiang University, Hangzhou, China
|
49
|
Sahu TK, Rao AR, Vasisht S, Singh N, Singh UP. Computational approaches, databases and tools for in silico motif discovery. Interdiscip Sci 2012; 4:239-255. [PMID: 23354813 DOI: 10.1007/s12539-012-0141-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/17/2011] [Revised: 04/12/2012] [Accepted: 06/13/2012] [Indexed: 06/01/2023]
Abstract
Motifs are biologically significant fragments of nucleotide or peptide sequences that follow a specific pattern. Motifs are categorized as structural motifs and sequence motifs. These are discovered by phylogenetic studies of similar genes across species. Structural motifs are formed by three-dimensional arrangements of amino acids consisting of two or more α helices or β strands, whereas sequence motifs are formed by the nucleotide fragments appearing in the exons of a gene. The arrangement of residues in structural motifs may not be continuous, while it is continuous in sequence motifs. Sequence motifs may encode structural motifs. The algorithms used for motif discovery are an important part of bio-computational studies. The purpose of motif discovery is to identify patterns in biopolymer (nucleotide or protein) sequences to understand the structure and function of the molecules and their evolutionary aspects. The main aim of this paper is to provide a systematic review of the different approaches, databases and tools used in motif discovery.
Affiliation(s)
- Tanmaya Kumar Sahu
  - Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi, India
|
50
|
Müller-Molina AJ, Schöler HR, Araúzo-Bravo MJ. Comprehensive human transcription factor binding site map for combinatory binding motifs discovery. PLoS One 2012; 7:e49086. [PMID: 23209563 PMCID: PMC3509107 DOI: 10.1371/journal.pone.0049086] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 07/03/2012] [Accepted: 10/08/2012] [Indexed: 11/18/2022]
Abstract
Knowing the map between transcription factors (TFs) and their binding sites is essential to reverse engineer the regulation process. Only about 10%-20% of the transcription factor binding motifs (TFBMs) have been reported. This lack of data hinders understanding gene regulation. To address this drawback, we propose a computational method that exploits previously unused TF properties to discover the missing TFBMs and their sites in all human gene promoters. The method starts by predicting a dictionary of regulatory "DNA words." From this dictionary, it distills 4098 novel predictions. To disclose the crosstalk between motifs, an additional algorithm extracts TF combinatorial binding patterns, creating a collection of TF regulatory syntactic rules. Using these rules, we narrowed down a list of 504 novel motifs that appear frequently in syntax patterns. We tested the predictions against 509 known motifs, confirming that our system can reliably predict ab initio motifs with an accuracy of 81%, far higher than previous approaches. We found that on average, 90% of the discovered combinatorial binding patterns target at least 10 genes, suggesting that supplementary regulatory mechanisms are required to control smaller gene sets independently. Additionally, we discovered that the new TFBMs and their combinatorial patterns convey biological meaning, targeting TFs and genes related to developmental functions. Thus, among all the possible available targets in the genome, the TFs tend to regulate other TFs and genes involved in developmental functions. We provide a comprehensive resource for regulation analysis that includes a dictionary of "DNA words," newly predicted motifs and their corresponding combinatorial patterns. Combinatorial patterns are a useful filter to discover TFBMs that play a major role in orchestrating other factors and thus are likely to lock/unlock cellular functional clusters.
Affiliation(s)
- Arnoldo J. Müller-Molina
  - Computational Biology and Bioinformatics Group, Max Planck Institute for Molecular Biomedicine, Münster, Germany
- Hans R. Schöler
  - Department of Cell and Developmental Biology, Max Planck Institute for Molecular Biomedicine, Münster, Germany
  - Medical Faculty, University of Münster, Münster, Germany
- Marcos J. Araúzo-Bravo
  - Computational Biology and Bioinformatics Group, Max Planck Institute for Molecular Biomedicine, Münster, Germany
|