1
Aditama R, Tanjung ZA, Aprilyanto V, Sudania WM, Utomo C, Liwang T. Identification of oil palm cis-regulatory elements based on DNA free energy and single nucleotide polymorphism density. Comput Biol Chem 2023; 106:107931. [PMID: 37481844 DOI: 10.1016/j.compbiolchem.2023.107931] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2022] [Revised: 06/29/2023] [Accepted: 07/17/2023] [Indexed: 07/25/2023]
Abstract
Transcription control through cis-regulatory elements (CREs) is one of the important regulators of gene expression. This study aimed to identify the location of CREs in oil palm (Elaeis guineensis Jacq.) using a combination of DNA free energy and single nucleotide polymorphism (SNP) density approaches. Promoter region sequences were extracted from the oil palm genome, spanning from 1500 nucleotides (nt) upstream to 1000 nt downstream of every annotated transcription start site (TSS). Free energy profiles of each promoter region were calculated using PromPredict software. Raw reads from the deep sequencing of 59 oil palm origins were used to calculate the SNP density of each promoter region. The results showed that the average free energy (AFE) of the region upstream of the TSS is about 1.5 kcal/mol higher than that of the downstream region. Using the DNA free energy method, 16,281 CRE regions were predicted. Most of the predicted CREs were located between 1 and 500 nt upstream of the TSS. An anti-correlation pattern between free energy and SNP density was observed in the predicted CRE regions. This anti-correlated pattern was also observed in an experimentally determined promoter of the oil palm metallothionein gene, EgMSP1. Considering the increasing use of promoter information in plant biotechnology, easy and accurate promoter prediction using the combined free energy and SNP density method can be recommended.
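As a rough illustration of the window definition and SNP density measure described in this abstract, the following Python sketch derives a strand-aware promoter window around a TSS and counts SNPs per nucleotide inside it. The function names, the 1-based inclusive coordinate convention, and the choice to count the TSS position itself are assumptions made for illustration; the abstract does not describe the authors' actual implementation.

```python
from bisect import bisect_left, bisect_right


def promoter_window(tss, strand, upstream=1500, downstream=1000):
    """Return the 1-based inclusive (start, end) of the window around a TSS.

    Assumption: the window includes the TSS itself, so the default window
    covers 1500 + 1 + 1000 = 2501 positions.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, "upstream" lies at higher coordinates.
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clip at the start of the sequence.
    return max(1, start), end


def snp_density(sorted_snp_positions, start, end):
    """SNPs per nucleotide within [start, end]; positions must be sorted."""
    count = bisect_right(sorted_snp_positions, end) - bisect_left(
        sorted_snp_positions, start
    )
    return count / (end - start + 1)


# Hypothetical SNP positions on one chromosome (not real data).
snps = [3400, 3500, 4000, 6000, 6001]
window = promoter_window(5000, "+")
density = snp_density(snps, *window)
```

In practice the SNP positions would come from variant calls on the resequencing data, and the density could be computed in sliding sub-windows so that it can be compared position by position with a free energy profile.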
Affiliation(s)
- Redi Aditama
- Biotechnology Department, Plant Production and Biotechnology Division, PT SMART Tbk., Bogor 16810, Indonesia
- Zulfikar Achmad Tanjung
- Biotechnology Department, Plant Production and Biotechnology Division, PT SMART Tbk., Bogor 16810, Indonesia
- Victor Aprilyanto
- Biotechnology Department, Plant Production and Biotechnology Division, PT SMART Tbk., Bogor 16810, Indonesia
- Widyartini Made Sudania
- Biotechnology Department, Plant Production and Biotechnology Division, PT SMART Tbk., Bogor 16810, Indonesia
- Condro Utomo
- Biotechnology Department, Plant Production and Biotechnology Division, PT SMART Tbk., Bogor 16810, Indonesia
- Tony Liwang
- Biotechnology Department, Plant Production and Biotechnology Division, PT SMART Tbk., Bogor 16810, Indonesia
2
Yang TH, Yang YC, Tu KC. regCNN: identifying Drosophila genome-wide cis-regulatory modules via integrating the local patterns in epigenetic marks and transcription factor binding motifs. Comput Struct Biotechnol J 2022; 20:296-308. [PMID: 35035784 PMCID: PMC8724954 DOI: 10.1016/j.csbj.2021.12.015] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 09/15/2021] [Revised: 12/10/2021] [Accepted: 12/10/2021] [Indexed: 11/20/2022]
Abstract
Transcription regulation in metazoa is controlled by the binding events of transcription factors (TFs) or regulatory proteins on specific modular DNA regulatory sequences called cis-regulatory modules (CRMs). Understanding the distributions of CRMs on a genomic scale is essential for constructing the metazoan transcriptional regulatory networks that help diagnose genetic disorders. While traditional reporter-assay CRM identification approaches can provide an in-depth understanding of the functions of some CRMs, these methods are usually cost-inefficient and low-throughput. It is generally believed that by integrating diverse genomic data, reliable CRM predictions can be made. Hence, researchers often first resort to computational algorithms for genome-wide CRM screening before specific experiments. However, existing in silico methods for searching potential CRMs are restricted by low sensitivity, poor prediction accuracy, or high computation time arising from the combinatorial complexity of TFBS composition. To overcome these obstacles, we designed a novel CRM identification pipeline called regCNN by considering the base-by-base local patterns in TF binding motifs and epigenetic profiles. On the test set, regCNN shows an accuracy/auROC of 84.5%/92.5% in CRM identification. By further considering local patterns in epigenetic profiles and TF binding motifs, it achieves a 4.7% (92.5%–87.8%) improvement in the auROC value over the average value-based pure multi-layer perceptron model. We also demonstrated that regCNN outperforms all currently available tools by at least 11.3% in auROC values. Finally, regCNN is verified to be robust against its resizing window hyperparameter in dealing with the variable lengths of CRMs. The model of regCNN can be downloaded at http://cobisHSS0.im.nuk.edu.tw/regCNN/.
Affiliation(s)
- Tzu-Hsien Yang
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan
- Ya-Chiao Yang
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan
- Kai-Chi Tu
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan
3
Ni P, Su Z. Accurate prediction of cis-regulatory modules reveals a prevalent regulatory genome of humans. NAR Genom Bioinform 2021; 3:lqab052. [PMID: 34159315 PMCID: PMC8210889 DOI: 10.1093/nargab/lqab052] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 02/26/2021] [Revised: 05/01/2021] [Accepted: 06/14/2021] [Indexed: 02/07/2023]
Abstract
Cis-regulatory modules (CRMs) formed by clusters of transcription factor (TF) binding sites (TFBSs) are as important as coding sequences in specifying phenotypes of humans. It is essential to categorize all CRMs and constituent TFBSs in the genome. In contrast to most existing methods that predict CRMs in specific cell types using epigenetic marks, we predict a largely cell type agnostic but more comprehensive map of CRMs and constituent TFBSs in the genome by integrating all available TF ChIP-seq datasets. Our method is able to partition 77.47% of genome regions covered by the 6092 available datasets into a CRM candidate (CRMC) set (56.84%) and a non-CRMC set (43.16%). Intriguingly, the predicted CRMCs are under strong evolutionary constraints, while the non-CRMCs are largely selectively neutral, strongly suggesting that the CRMCs are likely cis-regulatory, while the non-CRMCs are not. Our predicted CRMs are under stronger evolutionary constraints than three state-of-the-art predictions (GeneHancer, EnhancerAtlas and ENCODE phase 3) and substantially outperform them in recalling VISTA enhancers and non-coding ClinVar variants. We estimated that the human genome might encode about 1.47M CRMs and 68M TFBSs, comprising about 55% and 22% of the genome, respectively; we predicted 80% of each. Therefore, the cis-regulatory genome appears to be more prevalent than originally thought.
Affiliation(s)
- Pengyu Ni
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
- Zhengchang Su
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
4
Rivera J, Keränen SVE, Gallo SM, Halfon MS. REDfly: the transcriptional regulatory element database for Drosophila. Nucleic Acids Res 2019; 47:D828-D834. [PMID: 30329093 PMCID: PMC6323911 DOI: 10.1093/nar/gky957] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 09/06/2018] [Accepted: 10/04/2018] [Indexed: 12/21/2022]
Abstract
The REDfly database provides a comprehensive curation of experimentally-validated Drosophila transcriptional cis-regulatory elements and includes information on DNA sequence, experimental evidence, patterns of regulated gene expression, and more. Now in its thirteenth year, REDfly has grown to over 23 000 records of tested reporter gene constructs and 2200 tested transcription factor binding sites. Recent developments include the start of curation of predicted cis-regulatory modules in addition to experimentally-verified ones, improved search and filtering, and increased interaction with the authors of curated papers. An expanded data model that will capture information on temporal aspects of gene regulation, regulation in response to environmental and other non-developmental cues, sexually dimorphic gene regulation, and non-endogenous (ectopic) aspects of reporter gene expression is under development and expected to be in place within the coming year. REDfly is freely accessible at http://redfly.ccr.buffalo.edu, and news about database updates and new features can be followed on Twitter at @REDfly_database.
Affiliation(s)
- John Rivera
- Center for Computational Research, State University of New York at Buffalo, Buffalo, NY 14203, USA; New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA
- Steven M Gallo
- Center for Computational Research, State University of New York at Buffalo, Buffalo, NY 14203, USA; New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA
- Marc S Halfon
- New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA; Department of Biochemistry, State University of New York at Buffalo, Buffalo, NY 14203, USA; Department of Biomedical Informatics, State University of New York at Buffalo, Buffalo, NY 14203, USA; Department of Biological Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA; Department of Molecular and Cellular Biology and Program in Cancer Genetics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
5
Zhang S, Liang Y, Wang X, Su Z, Chen Y. FisherMP: fully parallel algorithm for detecting combinatorial motifs from large ChIP-seq datasets. DNA Res 2019; 26:231-242. [PMID: 30957858 PMCID: PMC6589551 DOI: 10.1093/dnares/dsz004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/11/2018] [Accepted: 03/05/2019] [Indexed: 11/14/2022]
Abstract
Detecting binding motifs of combinatorial transcription factors (TFs) from chromatin immunoprecipitation sequencing (ChIP-seq) experiments is an important and challenging computational problem for understanding gene regulation. Although a number of motif-finding algorithms have been presented, most are either time consuming or have sub-optimal accuracy when processing large-scale datasets. In this article, we present a fully parallelized algorithm for detecting combinatorial motifs from ChIP-seq datasets using Fisher's combined method and an OpenMP parallel design. Large-scale validations on both synthetic data and 350 ChIP-seq datasets from the ENCODE database showed that FisherMP is not only very fast on large datasets, but also highly accurate when compared with multiple popular methods. Using FisherMP, we successfully detected combinatorial motifs of CTCF, YY1, MAZ, STAT3 and USF2 in chromosome X, suggesting that they are functional co-players in gene regulation and chromosomal organization. Integrative and statistical analysis of these TF-binding peaks clearly demonstrates that they are not only highly coordinated with each other, but also correlated with histone modifications. FisherMP can be applied for integrative analysis of binding motifs and for predicting cis-regulatory modules from a large number of ChIP-seq datasets.
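For readers unfamiliar with the statistic named in this abstract, the sketch below shows the general form of Fisher's combined probability test: k independent p-values are combined into X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The function name is hypothetical, and the abstract does not say how FisherMP applies the statistic to motif scoring, so this is only the textbook calculation.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) is chi-square distributed with 2k
    degrees of freedom. For an even number of degrees of freedom the upper
    tail has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    if not pvalues:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2

    # Accumulate the series sum_{i=0}^{k-1} (X/2)^i / i! term by term.
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Two hypothetical tests that are each borderline significant.
combined = fisher_combined_pvalue([0.05, 0.05])
```

A single p-value is returned unchanged, which is a convenient sanity check. Where SciPy is available, `scipy.stats.combine_pvalues` with `method="fisher"` computes the same quantity.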
Affiliation(s)
- Shaoqiang Zhang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
- Ying Liang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
- Xiangyun Wang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
- Zhengchang Su
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, NC, USA
- Yong Chen
- Department of Biological Sciences, Center for Systems Biology, the University of Texas at Dallas, Richardson, TX, USA
6
Niu M, Tabari E, Ni P, Su Z. Towards a map of cis-regulatory sequences in the human genome. Nucleic Acids Res 2018; 46:5395-5409. [PMID: 29733395 PMCID: PMC6009671 DOI: 10.1093/nar/gky338] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 01/17/2018] [Revised: 04/14/2018] [Accepted: 04/19/2018] [Indexed: 01/10/2023]
Abstract
Accumulating evidence indicates that transcription factor (TF) binding sites, or cis-regulatory elements (CREs), and their clusters termed cis-regulatory modules (CRMs) play a more important role than do gene-coding sequences in specifying complex traits in humans, including the susceptibility to common complex diseases. To fully characterize their roles in deriving the complex traits/diseases, it is necessary to annotate all CREs and CRMs encoded in the human genome. However, the current annotations of CREs and CRMs in the human genome are still very limited and mostly coarse-grained, as they often lack detailed information on the CREs within CRMs. Here, we integrated 620 TF ChIP-seq datasets produced by the ENCODE project for 168 TFs in 79 different cell/tissue types and predicted an unprecedentedly complete map of CREs in CRMs in the human genome at single-nucleotide resolution. The map includes 305 912 CRMs containing a total of 1 178 913 CREs belonging to 736 unique TF binding motifs. The predicted CREs and CRMs tend to be subject to either purifying selection or positive selection, and thus are likely to be functional. Based on the results, we also examined the status of available ChIP-seq datasets for predicting the entire regulatory genome of humans.
Affiliation(s)
- Meng Niu
- Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223, USA
- Ehsan Tabari
- Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223, USA
- Pengyu Ni
- Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223, USA
- Zhengchang Su
- Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223, USA
7
López Y, Vandenbon A, Nose A, Nakai K. Modeling the cis-regulatory modules of genes expressed in developmental stages of Drosophila melanogaster. PeerJ 2017; 5:e3389. [PMID: 28584716 PMCID: PMC5452948 DOI: 10.7717/peerj.3389] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/20/2016] [Accepted: 05/08/2017] [Indexed: 12/30/2022]
Abstract
Because transcription is the first step in the regulation of gene expression, understanding how transcription factors bind to their DNA binding motifs has become absolutely necessary. It has been shown that the promoters of genes with similar expression profiles share common structural patterns. This paper presents an extensive study of the regulatory regions of genes expressed in 24 developmental stages of Drosophila melanogaster. It proposes the use of a combination of structural features, such as positioning of individual motifs relative to the transcription start site, orientation, pairwise distance between motifs, and presence of motifs anywhere in the promoter for predicting gene expression from structural features of promoter sequences. RNA-sequencing data was utilized to create and validate the 24 models. When genes with high-scoring promoters were compared to those identified by RNA-seq samples, 19 (79.2%) statistically significant models, a number that exceeds previous studies, were obtained. Each model yielded a set of highly informative features, which were used to search for genes with similar biological functions.
Affiliation(s)
- Yosvany López
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan; Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
- Alexis Vandenbon
- Immunology Frontier Research Center, Osaka University, Osaka, Japan
- Akinao Nose
- Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
- Kenta Nakai
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
8
Diamanti K, Umer HM, Kruczyk M, Dąbrowski MJ, Cavalli M, Wadelius C, Komorowski J. Maps of context-dependent putative regulatory regions and genomic signal interactions. Nucleic Acids Res 2016; 44:9110-9120. [PMID: 27625394 PMCID: PMC5100580 DOI: 10.1093/nar/gkw800] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 11/17/2015] [Accepted: 08/31/2016] [Indexed: 12/24/2022]
Abstract
Gene transcription is regulated mainly by transcription factors (TFs). ENCODE and Roadmap Epigenomics provide global binding profiles of TFs, which can be used to identify regulatory regions. To this end we implemented a method to systematically construct cell-type and species-specific maps of regulatory regions and TF-TF interactions. We illustrated the approach by developing maps for five human cell-lines and two other species. We detected ∼144k putative regulatory regions among the human cell-lines, with the majority of them being ∼300 bp. We found ∼20k putative regulatory elements in the ENCODE heterochromatic domains suggesting a large regulatory potential in the regions presumed transcriptionally silent. Among the most significant TF interactions identified in the heterochromatic regions were CTCF and the cohesin complex, which is in agreement with previous reports. Finally, we investigated the enrichment of the obtained putative regulatory regions in the 3D chromatin domains. More than 90% of the regions were discovered in the 3D contacting domains. We found a significant enrichment of GWAS SNPs in the putative regulatory regions. These significant enrichments provide evidence that the regulatory regions play a crucial role in the genomic structural stability. Additionally, we generated maps of putative regulatory regions for prostate and colorectal cancer human cell-lines.
Affiliation(s)
- Klev Diamanti
- Department of Cell and Molecular Biology, Uppsala University, Uppsala SE-751-24, Sweden
- Husen M Umer
- Department of Cell and Molecular Biology, Uppsala University, Uppsala SE-751-24, Sweden
- Marcin Kruczyk
- Department of Cell and Molecular Biology, Uppsala University, Uppsala SE-751-24, Sweden
- Michał J Dąbrowski
- Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala SE-751-08, Sweden
- Marco Cavalli
- Institute of Computer Science, Polish Academy of Sciences, Warsaw 012-48, Poland
- Claes Wadelius
- Institute of Computer Science, Polish Academy of Sciences, Warsaw 012-48, Poland
- Jan Komorowski
- Department of Cell and Molecular Biology, Uppsala University, Uppsala SE-751-24, Sweden; Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala SE-751-08, Sweden
9
Zhang S, Chen Y. CLIMP: Clustering Motifs via Maximal Cliques with Parallel Computing Design. PLoS One 2016; 11:e0160435. [PMID: 27487245 PMCID: PMC4972426 DOI: 10.1371/journal.pone.0160435] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 02/21/2016] [Accepted: 07/19/2016] [Indexed: 11/19/2022]
Abstract
A set of conserved binding sites recognized by a transcription factor is called a motif, which can be found by many applications of comparative genomics for identifying over-represented segments. Moreover, when numerous putative motifs are predicted from a collection of genome-wide data, their similarity data can be represented as a large graph in which these motifs are connected to one another. However, an efficient clustering algorithm is needed to group the motifs that belong together, separate the motifs that belong to different groups, and even delete a number of spurious ones. In this work, a new motif clustering algorithm, CLIMP, is proposed that uses maximal cliques and is sped up by parallelizing its program. When a synthetic motif dataset from the database JASPAR, a set of putative motifs from a phylogenetic foot-printing dataset, and a set of putative motifs from a ChIP dataset are used to compare the performance of CLIMP and two other high-performance algorithms, the results demonstrate that CLIMP mostly outperforms the two algorithms on the three datasets for motif clustering, so that it can be a useful complement to the clustering procedures in some genome-wide motif prediction pipelines. CLIMP is available at http://sqzhang.cn/climp.html.
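To make the graph formulation in this abstract concrete, the sketch below enumerates the maximal cliques of a small motif similarity graph with the Bron-Kerbosch algorithm with pivoting. This is a standard textbook procedure swapped in for illustration; the abstract does not specify how CLIMP finds or post-processes its cliques, and the motif names and edges are hypothetical.

```python
def maximal_cliques(adjacency):
    """Yield every maximal clique of an undirected graph.

    `adjacency` maps each node to the set of its neighbours (no self-loops).
    Implements Bron-Kerbosch with pivoting.
    """

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            # Nothing can extend the clique, so it is maximal.
            yield frozenset(clique)
            return
        # Pivot on the node with the most neighbours among the candidates.
        pivot = max(
            candidates | excluded,
            key=lambda n: len(adjacency[n] & candidates),
        )
        for node in list(candidates - adjacency[pivot]):
            yield from expand(
                clique | {node},
                candidates & adjacency[node],
                excluded & adjacency[node],
            )
            candidates = candidates - {node}
            excluded = excluded | {node}

    yield from expand(set(), set(adjacency), set())


# Hypothetical graph: an edge means two motifs were scored as similar.
toy_graph = {
    "m1": {"m2", "m3"},
    "m2": {"m1", "m3"},
    "m3": {"m1", "m2", "m4"},
    "m4": {"m3", "m5"},
    "m5": {"m4"},
}
cliques = set(maximal_cliques(toy_graph))
```

Here the motifs m1, m2 and m3 are mutually similar and form one clique, while m3-m4 and m4-m5 form two smaller ones. A clustering method still has to decide how to treat motifs such as m3 and m4 that fall into more than one clique.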
Affiliation(s)
- Shaoqiang Zhang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
- * E-mail: (SZ); (YC)
- Yong Chen
- National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Department of Biological Sciences, Center for Systems Biology, The University of Texas at Dallas, Richardson, Texas, United States of America
- * E-mail: (SZ); (YC)