1
Patel N, Bush WS. Modeling transcriptional regulation using gene regulatory networks based on multi-omics data sources. BMC Bioinformatics 2021; 22:200. [PMID: 33874910 PMCID: PMC8056605 DOI: 10.1186/s12859-021-04126-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/20/2020] [Accepted: 04/09/2021] [Indexed: 11/17/2022] Open
Abstract
Background Transcriptional regulation is complex, requiring multiple cis (local) and trans acting mechanisms working in concert to drive gene expression, with disruption of these processes linked to multiple diseases. Previous computational attempts to understand the influence of regulatory mechanisms on gene expression have used prediction models containing input features derived from cis regulatory factors. However, local chromatin looping and trans-acting mechanisms are known to also influence transcriptional regulation, and their inclusion may improve model accuracy and interpretation. In this study, we create a general model of transcription factor influence on gene expression by incorporating both cis and trans gene regulatory features. Results We describe a computational framework to model gene expression for GM12878 and K562 cell lines. This framework weights the impact of transcription factor-based regulatory data using multi-omics gene regulatory networks to account for both cis and trans acting mechanisms, and measures of the local chromatin context. These prediction models perform significantly better compared to models containing cis-regulatory features alone. Models that additionally integrate long distance chromatin interactions (or chromatin looping) between distal transcription factor binding regions and gene promoters also show improved accuracy. As a demonstration of their utility, effect estimates from these models were used to weight cis-regulatory rare variants for sequence kernel association test analyses of gene expression. Conclusions Our models generate refined effect estimates for the influence of individual transcription factors on gene expression, allowing characterization of their roles across the genome. This work also provides a framework for integrating multiple data types into a single model of transcriptional regulation. 
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04126-3.
Affiliation(s)
- Neel Patel
- Department of Nutrition, Case Western Reserve University, Cleveland, OH, USA; Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA
- William S Bush
- Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA.
2
Zhang LQ, Fan GL, Liu JJ, Liu L, Li QZ, Lin H. Identification of Key Histone Modifications and Their Regulatory Regions on Gene Expression Level Changes in Chronic Myelogenous Leukemia. Front Cell Dev Biol 2021; 8:621578. [PMID: 33511133 PMCID: PMC7835480 DOI: 10.3389/fcell.2020.621578] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/26/2020] [Accepted: 12/09/2020] [Indexed: 12/12/2022] Open
Abstract
Chronic myelogenous leukemia (CML) is a type of cancer with a series of characteristics that make it particularly suitable for observations on leukemogenesis. Research has shown that the occurrence and progression of CML are associated with dynamic alterations of histone modification (HM) patterns. In this study, we analyze the distribution patterns of 11 HM signals and calculate the signal changes of these HMs in CML cell lines as compared with those in normal cell lines. We also investigate the impact of HM signal changes on expression level changes of CML-related genes. Based on the alterations of HM signals between CML and normal cell lines, the up- and down-regulated genes are predicted by the random forest algorithm to identify the key HMs and their regulatory regions. Our results show that H3K79me2, H3K36me3, and H3K27ac are key HMs for expression level changes of CML-related genes in leukemogenesis. In particular, H3K79me2 and H3K36me3 perform important functions in all 100 bins studied. Our research suggests that H3K79me2 and H3K36me3 may be the core HMs for the clinical treatment of CML.
Affiliation(s)
- Lu-Qiang Zhang
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, China
- Guo-Liang Fan
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, China
- Jun-Jie Liu
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, China
- Li Liu
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, China
- Qian-Zhong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, China; The Research Center for Laboratory Animal Science, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China
3
Schmidt F, Kern F, Schulz MH. Integrative prediction of gene expression with chromatin accessibility and conformation data. Epigenetics Chromatin 2020; 13:4. [PMID: 32029002 PMCID: PMC7003490 DOI: 10.1186/s13072-020-0327-0] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 07/13/2019] [Accepted: 01/06/2020] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Enhancers play a fundamental role in orchestrating cell state and development. Although several methods have been developed to identify enhancers, linking them to their target genes is still an open problem. Several theories have been proposed on the functional mechanisms of enhancers, which triggered the development of various methods to infer promoter-enhancer interactions (PEIs). The advancement of high-throughput techniques describing the three-dimensional organization of the chromatin paved the way to pinpoint long-range PEIs. Here we investigated whether including PEIs in computational models for the prediction of gene expression improves performance and interpretability. RESULTS We have extended our [Formula: see text] framework to include DNA contacts deduced from chromatin conformation capture experiments and compared various methods to determine PEIs using predictive modelling of gene expression from chromatin accessibility data and predicted transcription factor (TF) motif data. We designed a novel machine learning approach that allows the prioritization of TFs binding to distal loop and promoter regions with respect to their importance for gene expression regulation. Our analysis revealed a set of core TFs that are part of enhancer-promoter loops involving YY1 in different cell lines. CONCLUSION We present a novel approach that can be used to prioritize TFs involved in distal and promoter-proximal regulatory events by integrating chromatin accessibility, conformation, and gene expression data. We show that the integration of chromatin conformation data can improve gene expression prediction and aid model interpretability.
Affiliation(s)
- Florian Schmidt
- High-throughput Genomics & Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Computational Biology & Applied Algorithmics, Max-Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Center for Bioinformatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672 Singapore
- Fabian Kern
- High-throughput Genomics & Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Center for Bioinformatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Chair for Clinical Bioinformatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Marcel H. Schulz
- High-throughput Genomics & Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Computational Biology & Applied Algorithmics, Max-Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Center for Bioinformatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Institute of Cardiovascular Regeneration, Goethe-University, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany
4
Ren J, Lee J, Na D. Recent advances in genetic engineering tools based on synthetic biology. J Microbiol 2020; 58:1-10. [PMID: 31898252 DOI: 10.1007/s12275-020-9334-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 06/03/2019] [Revised: 08/19/2019] [Accepted: 11/05/2019] [Indexed: 12/26/2022]
Abstract
Genome-scale engineering is a crucial methodology to rationally regulate microbiological system operations, leading to expected biological behaviors or enhanced bioproduct yields. Over the past decade, innovative genome modification technologies have been developed for effectively regulating and manipulating genes at the genome level. Here, we discuss the current genome-scale engineering technologies used for microbial engineering. Recently developed strategies, such as clustered regularly interspaced short palindromic repeats (CRISPR)-Cas9, multiplex automated genome engineering (MAGE), promoter engineering, CRISPR-based regulations, and synthetic small regulatory RNA (sRNA)-based knockdown, are considered as powerful tools for genome-scale engineering in microbiological systems. MAGE, which modifies specific nucleotides of the genome sequence, is utilized as a genome-editing tool. Contrastingly, synthetic sRNA, CRISPRi, and CRISPRa are mainly used to regulate gene expression without modifying the genome sequence. This review introduces the recent genome-scale editing and regulating technologies and their applications in metabolic engineering.
Affiliation(s)
- Jun Ren
- School of Integrative Engineering, Chung-Ang University, Seoul, 06974, Republic of Korea
- Jingyu Lee
- School of Integrative Engineering, Chung-Ang University, Seoul, 06974, Republic of Korea
- Dokyun Na
- School of Integrative Engineering, Chung-Ang University, Seoul, 06974, Republic of Korea.
5
Schmidt F, Schulz MH. On the problem of confounders in modeling gene expression. Bioinformatics 2019; 35:711-719. [PMID: 30084962 PMCID: PMC6530814 DOI: 10.1093/bioinformatics/bty674] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 04/18/2018] [Revised: 06/21/2018] [Accepted: 08/02/2018] [Indexed: 01/01/2023] Open
Abstract
Motivation Modeling of Transcription Factor (TF) binding from both ChIP-seq and chromatin accessibility data has become prevalent in computational biology. Several models have been proposed to generate new hypotheses on transcriptional regulation. However, there is no distinct approach to derive TF binding scores from ChIP-seq and open chromatin experiments. Here, we review biases of various scoring approaches and their effects on the interpretation and reliability of predictive gene expression models. Results We generated predictive models for gene expression using ChIP-seq and DNase1-seq data from DEEP and ENCODE. Via randomization experiments, we identified confounders in TF gene scores derived from both ChIP-seq and DNase1-seq data. We reviewed correction approaches for both data types, which reduced the influence of identified confounders without harm to model performance. Also, our analyses highlighted further quality control measures, in addition to model performance, that may help to assure model reliability and to avoid misinterpretation in future studies. Availability and implementation The software used in this study is available online at https://github.com/SchulzLab/TEPIC. Supplementary information Supplementary data are available at Bioinformatics online.
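One generic form of randomization check is a permutation baseline: shuffle the gene labels of the expression values and ask how often the shuffled data reproduce the observed association with a TF gene score. The sketch below is only an illustration of that general idea, not necessarily the procedure used in the study; the function names, the toy `tf_scores` and `expression` values, and the choice of Pearson correlation as the statistic are all assumptions made here.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(scores, expression, n_permutations=1000, seed=0):
    """Fraction of label shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(pearson(scores, expression))
    shuffled = list(expression)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(scores, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical toy data: one TF gene score and one expression value per gene.
tf_scores = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.9, 2.3]
expression = [1.0, 1.6, 2.1, 2.7, 3.5, 4.1, 4.8, 5.9]

p_value = permutation_p_value(tf_scores, expression)
print(f"permutation p-value: {p_value:.4f}")
```

If a model still performs well after the scores themselves are randomized, that is a hint that something other than TF binding (a confounder) is carrying the signal, which is the kind of situation the abstract describes.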
Affiliation(s)
- Florian Schmidt
- High-throughput Genomics and Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland Informatics Campus, Saarbrücken, Germany; Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany; Graduate School for Computer Science, Saarland Informatics Campus, Saarbrücken, Germany
- Marcel H Schulz
- High-throughput Genomics and Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland Informatics Campus, Saarbrücken, Germany; Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
6
The spatial binding model of the pioneer factor Oct4 with its target genes during cell reprogramming. Comput Struct Biotechnol J 2019; 17:1226-1233. [PMID: 31921389 PMCID: PMC6944736 DOI: 10.1016/j.csbj.2019.09.002] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 02/19/2019] [Revised: 09/05/2019] [Accepted: 09/07/2019] [Indexed: 12/18/2022] Open
Abstract
Understanding how a pioneer factor regulates the genes it binds is crucial for improving the efficiency of TF-mediated reprogramming. Because Oct4 is the only factor that cannot be substituted by other POU members, a quantitative model describing its spatial binding pattern with its target genes is urgently needed. The dynamic profiles of Oct4 binding showed that the major wave occurs at the intermediate stage of cell reprogramming (from day 7 to day 15) and that promoters are the preferred target regions. The Oct4-binding distributions show a significant chromosome bias: overall enrichment on chromosomes 1–11 is higher than on the others, and the dramatic events of TF-mediated reprogramming are mainly concentrated on autosomes. We also found that the spatial binding ability of Oct4 can be represented quantitatively using three peak parameters (height, width and distance). The dynamic changes in Oct4 binding demonstrated that width plays the more important role in regulating the expression of target genes. Finally, a multivariate linear regression was introduced to establish the spatial binding model of Oct4 binding. The evaluation results confirmed that height and width are positively correlated with gene expression, and that additive interaction terms of height and width improve model performance more than multiplicative terms. The best average coefficient of determination of the improved model reached 81.38%. Our study provides new insights into the cooperative regulation of the spatial binding pattern of pioneer factors in cell reprogramming.
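The regression step described above can be illustrated with a minimal sketch: an ordinary least-squares fit of expression on the three peak parameters, followed by the coefficient of determination. Everything here is an assumption for illustration (the toy `peaks` values, the noise-free synthetic `expression`, and the helper names); it does not reproduce the study's data, its interaction terms, or its reported 81.38%.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ols(features, targets):
    """Return [intercept, coef_1, ...] from the normal equations."""
    design = [[1.0] + list(row) for row in features]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def r_squared(features, targets, coefs):
    """Coefficient of determination of the fitted linear model."""
    preds = [coefs[0] + sum(c * x for c, x in zip(coefs[1:], row)) for row in features]
    mean_y = sum(targets) / len(targets)
    ss_res = sum((y - p) ** 2 for y, p in zip(targets, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in targets)
    return 1.0 - ss_res / ss_tot


# Hypothetical peaks: (height, width, distance to the gene) in arbitrary units.
peaks = [(1, 2, 5), (2, 1, 3), (3, 4, 8), (4, 3, 1), (5, 6, 2), (6, 5, 7)]
# Synthetic, noise-free expression so the fit should recover the coefficients.
expression = [0.5 + 2.0 * h + 1.0 * w - 0.1 * d for h, w, d in peaks]

coefs = fit_ols(peaks, expression)
r2 = r_squared(peaks, expression, coefs)
print("intercept and coefficients:", [round(c, 3) for c in coefs])
print("R^2:", round(r2, 4))
```

To compare alternative model forms, one could append an extra feature column (for example a product of height and width) to each row and compare the resulting R² values, ideally on held-out genes.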
7
Feng ZX, Li QZ, Meng JJ. Modeling the relationship of diverse genomic signatures to gene expression levels with the regulation of long-range enhancer-promoter interactions. BIOPHYSICS REPORTS 2019. [DOI: 10.1007/s41048-019-0089-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/29/2022] Open
8
Pliner HA, Packer JS, McFaline-Figueroa JL, Cusanovich DA, Daza RM, Aghamirzaie D, Srivatsan S, Qiu X, Jackson D, Minkina A, Adey AC, Steemers FJ, Shendure J, Trapnell C. Cicero Predicts cis-Regulatory DNA Interactions from Single-Cell Chromatin Accessibility Data. Mol Cell 2018; 71:858-871.e8. [PMID: 30078726 PMCID: PMC6582963 DOI: 10.1016/j.molcel.2018.06.044] [Citation(s) in RCA: 409] [Impact Index Per Article: 68.2] [Received: 02/07/2018] [Revised: 05/08/2018] [Accepted: 06/29/2018] [Indexed: 12/13/2022]
Abstract
Linking regulatory DNA elements to their target genes, which may be located hundreds of kilobases away, remains challenging. Here, we introduce Cicero, an algorithm that identifies co-accessible pairs of DNA elements using single-cell chromatin accessibility data and so connects regulatory elements to their putative target genes. We apply Cicero to investigate how dynamically accessible elements orchestrate gene regulation in differentiating myoblasts. Groups of Cicero-linked regulatory elements meet criteria of "chromatin hubs": they are enriched for physical proximity, interact with a common set of transcription factors, and undergo coordinated changes in histone marks that are predictive of changes in gene expression. Pseudotemporal analysis revealed that most DNA elements remain in chromatin hubs throughout differentiation. A subset of elements bound by MYOD1 in myoblasts exhibit early opening in a PBX1- and MEIS1-dependent manner. Our strategy can be applied to dissect the architecture, sequence determinants, and mechanisms of cis-regulation on a genome-wide scale.
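The notion of co-accessibility can be illustrated with a deliberately simplified sketch: two elements are called co-accessible when their accessibility profiles across groups of cells are strongly correlated and they lie within a genomic distance window. This is not Cicero's actual algorithm, which the abstract does not detail; the plain Pearson correlation, the thresholds, the function names and the toy `peaks` below are all assumptions chosen for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coaccessible_pairs(peaks, max_distance, min_corr):
    """Return (name_a, name_b, r) for nearby, strongly correlated elements.

    peaks maps an element name to (position_bp, accessibility per cell group).
    """
    names = sorted(peaks)
    pairs = []
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            pos_a, acc_a = peaks[name_a]
            pos_b, acc_b = peaks[name_b]
            if abs(pos_a - pos_b) > max_distance:
                continue  # too far apart to be considered a candidate link
            r = pearson(acc_a, acc_b)
            if r >= min_corr:
                pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Hypothetical elements with accessibility measured in five cell groups.
peaks = {
    "promoter": (100_000, [1, 2, 3, 4, 5]),
    "enhancer": (250_000, [2, 4, 6, 8, 10]),
    "unrelated": (300_000, [5, 1, 4, 2, 3]),
    "far": (2_000_000, [1, 2, 3, 4, 5]),
}

pairs = coaccessible_pairs(peaks, max_distance=500_000, min_corr=0.9)
print(pairs)
```

In this toy input, the element named `far` has the same profile as `promoter` but is excluded by the distance window, which shows why a distance constraint matters when linking elements to putative target genes.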
Affiliation(s)
- Hannah A Pliner
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Jonathan S Packer
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Riza M Daza
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Delasa Aghamirzaie
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Sanjay Srivatsan
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Xiaojie Qiu
- Department of Genome Sciences, University of Washington, Seattle, WA, USA; Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA
- Dana Jackson
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Anna Minkina
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Andrew C Adey
- Department of Molecular and Medical Genetics, Oregon Health and Science University, Portland, OR, USA
- Jay Shendure
- Department of Genome Sciences, University of Washington, Seattle, WA, USA; Howard Hughes Medical Institute, Seattle, WA, USA; Brotman Baty Institute for Precision Medicine, Seattle, WA, USA.
- Cole Trapnell
- Department of Genome Sciences, University of Washington, Seattle, WA, USA; Brotman Baty Institute for Precision Medicine, Seattle, WA, USA.
9
Genome-wide analysis of H3K36me3 and its regulations to cancer-related genes expression in human cell lines. Biosystems 2018; 171:59-65. [DOI: 10.1016/j.biosystems.2018.07.004] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 03/06/2018] [Revised: 07/01/2018] [Accepted: 07/09/2018] [Indexed: 01/11/2023]
10
Zhang LQ, Li QZ. Estimating the effects of transcription factors binding and histone modifications on gene expression levels in human cells. Oncotarget 2018; 8:40090-40103. [PMID: 28454114 PMCID: PMC5522221 DOI: 10.18632/oncotarget.16988] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 12/15/2016] [Accepted: 03/11/2017] [Indexed: 12/22/2022] Open
Abstract
Transcription factors and histone modifications are vital for the regulation of gene expression. To estimate the effects of transcription factor binding and histone modifications on gene expression, we construct a statistical model from genome-wide binding data for 15 transcription factors, profiles of 10 histone modifications, and DNase-I hypersensitivity data in three mammalian cell lines. Remarkably, our results show that POLR2A and H3K36me3 can highly and consistently predict gene expression in the three cell lines. In addition, H3K4me3, H3K27me3 and H3K9ac are more reliable predictors than other histone modifications in human embryonic stem cells. Moreover, genome-wide statistical redundancies exist within and between transcription factors and histone modifications, and these phenomena may be caused by the regulation mechanism. On further study, we find that even though transcription factors and histone modifications have similar effects on the expression levels of genes genome-wide, their effects on predictive ability differ for genes in independent biological processes.
Affiliation(s)
- Lu-Qiang Zhang
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, China
- Qian-Zhong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, China
11
Dang LT, Tondl M, Chiu MHH, Revote J, Paten B, Tano V, Tokolyi A, Besse F, Quaife-Ryan G, Cumming H, Drvodelic MJ, Eichenlaub MP, Hallab JC, Stolper JS, Rossello FJ, Bogoyevitch MA, Jans DA, Nim HT, Porrello ER, Hudson JE, Ramialison M. TrawlerWeb: an online de novo motif discovery tool for next-generation sequencing datasets. BMC Genomics 2018; 19:238. [PMID: 29621972 PMCID: PMC5887194 DOI: 10.1186/s12864-018-4630-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 01/30/2018] [Accepted: 03/27/2018] [Indexed: 12/14/2022] Open
Abstract
Background A strong focus of the post-genomic era is mining of the non-coding regulatory genome in order to unravel the function of regulatory elements that coordinate gene expression (Nat 489:57–74, 2012; Nat 507:462–70, 2014; Nat 507:455–61, 2014; Nat 518:317–30, 2015). Whole-genome approaches based on next-generation sequencing (NGS) have provided insight into the genomic location of regulatory elements throughout different cell types, organs and organisms. These technologies are now widespread and commonly used in laboratories from various fields of research. This highlights the need for fast and user-friendly software tools dedicated to extracting cis-regulatory information contained in these regulatory regions; for instance transcription factor binding site (TFBS) composition. Ideally, such tools should not require prior programming knowledge to ensure they are accessible for all users. Results We present TrawlerWeb, a web-based version of the Trawler_standalone tool (Nat Methods 4:563–5, 2007; Nat Protoc 5:323–34, 2010), to allow for the identification of enriched motifs in DNA sequences obtained from next-generation sequencing experiments in order to predict their TFBS composition. TrawlerWeb is designed for online queries with standard options common to web-based motif discovery tools. In addition, TrawlerWeb provides three unique new features: 1) TrawlerWeb allows the input of BED files directly generated from NGS experiments, 2) it automatically generates an input-matched biologically relevant background, and 3) it displays resulting conservation scores for each instance of the motif found in the input sequences, which assists the researcher in prioritising the motifs to validate experimentally. Finally, to date, this web-based version of Trawler_standalone remains the fastest online de novo motif discovery tool compared to other popular web-based software, while generating predictions with high accuracy. 
Conclusions TrawlerWeb provides users with a fast, simple and easy-to-use web interface for de novo motif discovery. This will assist in rapidly analysing NGS datasets that are now being routinely generated. TrawlerWeb is freely available and accessible at: http://trawler.erc.monash.edu.au. Electronic supplementary material The online version of this article (10.1186/s12864-018-4630-0) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Louis T Dang
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Markus Tondl
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Man Ho H Chiu
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Jerico Revote
- eResearch, Monash University, Clayton, VIC, Australia
- Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Vincent Tano
- Department of Biochemistry and Molecular Biology, Bio21 Institute and Cell Signalling Research Laboratories, The University of Melbourne, Melbourne, VIC, Australia
- Alex Tokolyi
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Florence Besse
- CNRS, Inserm, Institute of Biology Valrose, Université Côte d'Azur, Parc Valrose, Nice, France
- Greg Quaife-Ryan
- School of Biomedical Sciences, The University of Queensland, QLD, Brisbane, Australia
- Helen Cumming
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Monash University, Clayton, VIC, Australia
- Mark J Drvodelic
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Michael P Eichenlaub
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Jeannette C Hallab
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Julian S Stolper
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Fernando J Rossello
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia
- Marie A Bogoyevitch
- Department of Biochemistry and Molecular Biology, Bio21 Institute and Cell Signalling Research Laboratories, The University of Melbourne, Melbourne, VIC, Australia
- David A Jans
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC, Australia
- Hieu T Nim
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia; Faculty of Information Technology, Monash University, Clayton, VIC, Australia
- Enzo R Porrello
- Murdoch Children's Research Institute, The Royal Children's Hospital, Parkville, VIC, Australia; Department of Physiology, School of Biomedical Sciences, The University of Melbourne, Parkville, VIC, Australia
- James E Hudson
- School of Biomedical Sciences, The University of Queensland, QLD, Brisbane, Australia
- Mirana Ramialison
- Australian Regenerative Medicine Institute, Systems Biology Institute Australia, Monash University, Clayton, VIC, Australia.
12
Li Y, Zhang J, Huo C, Ding N, Li J, Xiao J, Lin X, Cai B, Zhang Y, Xu J. Dynamic Organization of lncRNA and Circular RNA Regulators Collectively Controlled Cardiac Differentiation in Humans. EBioMedicine 2017; 24:137-146. [PMID: 29037607 PMCID: PMC5652025 DOI: 10.1016/j.ebiom.2017.09.015] [Citation(s) in RCA: 67] [Impact Index Per Article: 9.6] [Received: 08/14/2017] [Revised: 09/13/2017] [Accepted: 09/13/2017] [Indexed: 02/08/2023] Open
Abstract
Advances in developmental cardiology have increased our understanding of the early aspects of heart differentiation. However, understanding noncoding RNA (ncRNA) transcription and regulation during this process remains elusive. Here, we constructed transcriptomes for both long noncoding RNAs (lncRNAs) and circular RNAs (circRNAs) in four important developmental stages ranging from early embryonic to cardiomyocyte based on high-throughput sequencing datasets, which indicate the high stage-specific expression patterns of two ncRNA types. Additionally, higher similarities of samples within each stage were found, highlighting the divergence of samples collected from distinct cardiac developmental stages. Next, we developed a method to identify numerous lncRNA and circRNA regulators whose expression was significantly stage-specific and shifted gradually and continuously during heart differentiation. We inferred that these ncRNAs are important for the stages of cardiac differentiation. Moreover, transcriptional regulation analysis revealed that the expression of stage-specific lncRNAs is controlled by known key stage-specific transcription factors (TFs). In addition, circRNAs exhibited dynamic expression patterns independent from their host genes. Functional enrichment analysis revealed that lncRNAs and circRNAs play critical roles in pathways that are activated specifically during heart differentiation. We further identified candidate TF-ncRNA-gene network modules for each differentiation stage, suggesting the dynamic organization of lncRNAs and circRNAs collectively controlled cardiac differentiation, which may cause heart-related diseases when defective. Our study provides a foundation for understanding the dynamic regulation of ncRNA transcriptomes during heart differentiation and identifies the dynamic organization of novel key lncRNAs and circRNAs to collectively control cardiac differentiation.
Affiliation(s)
- Yongsheng Li
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang 150086, China
- Jinwen Zhang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang 150086, China
- Caiqin Huo
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang 150086, China
- Na Ding
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang 150086, China
- Junyi Li
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang 150086, China
- Jun Xiao
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang 150086, China
- Xiaoyu Lin
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang 150086, China
- Benzhi Cai
- Department of Clinical Pharmacy, The Second Affiliated Hospital, Department of Pharmacology, College of Pharmacy, Harbin Medical University, Harbin, Heilongjiang 150086, China.
- Yunpeng Zhang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang 150086, China.
- Juan Xu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang 150086, China.
13
|
Integrated analysis and transcript abundance modelling of H3K4me3 and H3K27me3 in developing secondary xylem. Sci Rep 2017; 7:3370. [PMID: 28611454 PMCID: PMC5469831 DOI: 10.1038/s41598-017-03665-1] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 02/03/2017] [Accepted: 05/02/2017] [Indexed: 01/10/2023] Open
Abstract
Despite the considerable contribution of xylem development (xylogenesis) to plant biomass accumulation, its epigenetic regulation is poorly understood. Furthermore, the relative contributions of histone modifications to transcriptional regulation are not well studied in plants. We investigated the biological relevance of H3K4me3 and H3K27me3 in secondary xylem development using ChIP-seq and their association with transcript levels among other histone modifications in woody and herbaceous models. In developing secondary xylem of the woody model Eucalyptus grandis, H3K4me3 and H3K27me3 genomic spans were distinctly associated with xylogenesis-related processes, with (late) lignification pathways enriched for putative bivalent domains, but not early secondary cell wall polysaccharide deposition. H3K27me3-occupied genes, of which 753 (~31%) are novel targets, were enriched for transcriptional regulation and flower development and had significant preferential expression in roots. Linear regression models of the ChIP-seq profiles predicted ~50% of transcript abundance measured with strand-specific RNA-seq, confirmed in a parallel analysis in Arabidopsis where integration of seven additional histone modifications each contributed smaller proportions of unique information to the predictive models. This study uncovers the biological importance of histone modification antagonism and genomic span in xylogenesis and quantifies for the first time the relative correlations of histone modifications with transcript abundance in plants.
|
14
|
Schmidt F, Gasparoni N, Gasparoni G, Gianmoena K, Cadenas C, Polansky JK, Ebert P, Nordström K, Barann M, Sinha A, Fröhler S, Xiong J, Dehghani Amirabad A, Behjati Ardakani F, Hutter B, Zipprich G, Felder B, Eils J, Brors B, Chen W, Hengstler JG, Hamann A, Lengauer T, Rosenstiel P, Walter J, Schulz MH. Combining transcription factor binding affinities with open-chromatin data for accurate gene expression prediction. Nucleic Acids Res 2017; 45:54-66. [PMID: 27899623 PMCID: PMC5224477 DOI: 10.1093/nar/gkw1061] [Citation(s) in RCA: 73] [Impact Index Per Article: 10.4] [Received: 05/27/2016] [Revised: 10/18/2016] [Accepted: 10/24/2016] [Indexed: 12/21/2022] Open
Abstract
The binding and contribution of transcription factors (TF) to cell specific gene expression is often deduced from open-chromatin measurements to avoid costly TF ChIP-seq assays. Thus, it is important to develop computational methods for accurate TF binding prediction in open-chromatin regions (OCRs). Here, we report a novel segmentation-based method, TEPIC, to predict TF binding by combining sets of OCRs with position weight matrices. TEPIC can be applied to various open-chromatin data, e.g. DNaseI-seq and NOMe-seq. Additionally, Histone-Marks (HMs) can be used to identify candidate TF binding sites. TEPIC computes TF affinities and uses open-chromatin/HM signal intensity as quantitative measures of TF binding strength. Using machine learning, we find low affinity binding sites to improve our ability to explain gene expression variability compared to the standard presence/absence classification of binding sites. Further, we show that both footprints and peaks capture essential TF binding events and lead to a good prediction performance. In our application, gene-based scores computed by TEPIC with one open-chromatin assay nearly reach the quality of several TF ChIP-seq data sets. Finally, these scores correctly predict known transcriptional regulators as illustrated by the application to novel DNaseI-seq and NOMe-seq data for primary human hepatocytes and CD4+ T-cells, respectively.
Affiliation(s)
- Florian Schmidt
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Nina Gasparoni
  - Department of Genetics, University of Saarland, Saarbrücken, 66123, Germany
- Gilles Gasparoni
  - Department of Genetics, University of Saarland, Saarbrücken, 66123, Germany
- Kathrin Gianmoena
  - Leibniz Research Centre for Working Environment and Human Factors IfADo, Dortmund, 44139, Germany
- Cristina Cadenas
  - Leibniz Research Centre for Working Environment and Human Factors IfADo, Dortmund, 44139, Germany
- Julia K Polansky
  - Experimental Rheumatology, German Rheumatism Research Centre, Berlin, 10117, Germany
- Peter Ebert
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
  - International Max Planck Research School for Computer Science, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Karl Nordström
  - Department of Genetics, University of Saarland, Saarbrücken, 66123, Germany
- Matthias Barann
  - Institute of Clinical Molecular Biology, Christian-Albrechts-University, Kiel, 24105, Germany
- Anupam Sinha
  - Institute of Clinical Molecular Biology, Christian-Albrechts-University, Kiel, 24105, Germany
- Sebastian Fröhler
  - Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Berlin, 13092, Germany
- Jieyi Xiong
  - Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Berlin, 13092, Germany
- Azim Dehghani Amirabad
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
  - International Max Planck Research School for Computer Science, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Fatemeh Behjati Ardakani
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Barbara Hutter
  - Applied Bioinformatics, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Gideon Zipprich
  - Data Management and Genomics IT, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Bärbel Felder
  - Data Management and Genomics IT, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Jürgen Eils
  - Data Management and Genomics IT, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Benedikt Brors
  - Applied Bioinformatics, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Wei Chen
  - Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Berlin, 13092, Germany
- Jan G Hengstler
  - Leibniz Research Centre for Working Environment and Human Factors IfADo, Dortmund, 44139, Germany
- Alf Hamann
  - International Max Planck Research School for Computer Science, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Thomas Lengauer
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Philip Rosenstiel
  - Institute of Clinical Molecular Biology, Christian-Albrechts-University, Kiel, 24105, Germany
- Jörn Walter
  - Department of Genetics, University of Saarland, Saarbrücken, 66123, Germany
- Marcel H Schulz
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
|
15
|
Budden DM, Crampin EJ. Distributed gene expression modelling for exploring variability in epigenetic function. BMC Bioinformatics 2016; 17:446. [PMID: 27816056 PMCID: PMC5097851 DOI: 10.1186/s12859-016-1313-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2016] [Accepted: 10/25/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Predictive gene expression modelling is an important tool in computational biology due to the volume of high-throughput sequencing data generated by recent consortia. However, the scope of previous studies has been restricted to a small set of cell-lines or experimental conditions due to an inability to leverage distributed processing architectures for large, sharded data-sets. RESULTS We present a distributed implementation of gene expression modelling using the MapReduce paradigm and prove that performance improves as a linear function of available processor cores. We then leverage the computational efficiency of this framework to explore the variability of epigenetic function across fifty histone modification data-sets from a variety of cancerous and non-cancerous cell-lines. CONCLUSIONS We demonstrate that the genome-wide relationships between histone modifications and mRNA transcription are lineage, tissue and karyotype-invariant, and that models trained on matched -omics data from non-cancerous cell-lines are able to predict cancerous expression with equivalent genome-wide fidelity.
Affiliation(s)
- David M Budden
  - Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, Cambridge, 02139, USA
  - Systems Biology Laboratory, Melbourne School of Engineering, the University of Melbourne, Parkville, 3010, Australia
- Edmund J Crampin
  - Systems Biology Laboratory, Melbourne School of Engineering, the University of Melbourne, Parkville, 3010, Australia
  - ARC Centre of Excellence in Convergent Bio-Nano Science and Technology, Parkville, 3010, Australia
  - Department of Mathematics and Statistics, the University of Melbourne, Parkville, 3010, Australia
  - School of Medicine, the University of Melbourne, Parkville, 3010, Australia
|
16
|
Abstract
Modeling biology as classical problems in computer science allows researchers to leverage the wealth of theoretical advancements in this field. Despite countless studies presenting heuristics that report improvement on specific benchmarking data, there has been comparatively little focus on exploring the theoretical bounds on the performance of practical (polynomial-time) algorithms. Conversely, theoretical studies tend to overstate the generalizability of their conclusions to physical biological processes. In this article we provide a fresh perspective on the concepts of NP-hardness and inapproximability in the computational biology domain, using popular sequence assembly and alignment (mapping) algorithms as illustrative examples. These algorithms exemplify how computer science theory can both (a) lead to substantial improvement in practical performance and (b) highlight areas ripe for future innovation. Importantly, we discuss caveats that seemingly allow the performance of heuristics to exceed their provable bounds.
Affiliation(s)
- David Budden
  - Google, Inc., Pyrmont, Australia
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts
- Mitchell Jones
  - Google, Inc., Pyrmont, Australia
  - Department of Computer Science, University of Illinois at Urbana-Champaign
|
17
|
Information theoretic approaches for inference of biological networks from continuous-valued data. BMC Systems Biology 2016; 10:89. [PMID: 27599566 PMCID: PMC5013667 DOI: 10.1186/s12918-016-0331-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 02/25/2016] [Accepted: 08/23/2016] [Indexed: 01/30/2023]
Abstract
Background Characterising programs of gene regulation by studying individual protein-DNA and protein-protein interactions would require a large volume of high-resolution proteomics data, and such data are not yet available. Instead, many gene regulatory network (GRN) techniques have been developed, which leverage the wealth of transcriptomic data generated by recent consortia to study indirect, gene-level relationships between transcriptional regulators. Despite the popularity of such methods, previous methods of GRN inference exhibit limitations that we highlight and address through the lens of information theory. Results We introduce new model-free and non-linear information theoretic measures for the inference of GRNs and other biological networks from continuous-valued data. Although previous tools have implemented mutual information as a means of inferring pairwise associations, they either introduce statistical bias through discretisation or are limited to modelling undirected relationships. Our approach overcomes both of these limitations, as demonstrated by a substantial improvement in empirical performance for a set of 160 GRNs of varying size and topology. Conclusions The information theoretic measures described in this study yield substantial improvements over previous approaches (e.g. ARACNE) and have been implemented in the latest release of NAIL (Network Analysis and Inference Library). However, despite the theoretical and empirical advantages of these new measures, they do not circumvent the fundamental limitation of indeterminacy exhibited across this class of biological networks. These methods have presently found value in computational neurobiology, and will likely gain traction for GRN analysis as the volume and quality of temporal transcriptomics data continues to improve.
|
18
|
Chen X, Jung JG, Shajahan-Haq AN, Clarke R, Shih IM, Wang Y, Magnani L, Wang TL, Xuan J. ChIP-BIT: Bayesian inference of target genes using a novel joint probabilistic model of ChIP-seq profiles. Nucleic Acids Res 2016; 44:e65. [PMID: 26704972 PMCID: PMC4838354 DOI: 10.1093/nar/gkv1491] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 04/21/2015] [Revised: 11/16/2015] [Accepted: 12/09/2015] [Indexed: 11/16/2022] Open
Abstract
Chromatin immunoprecipitation with massively parallel DNA sequencing (ChIP-seq) has greatly improved the reliability with which transcription factor binding sites (TFBSs) can be identified from genome-wide profiling studies. Many computational tools have been developed to detect binding events or peaks; however, the robust detection of weak binding events remains a challenge for current peak calling tools. We have developed a novel Bayesian approach (ChIP-BIT) to reliably detect TFBSs and their target genes by jointly modeling binding signal intensities and binding locations of TFBSs. Specifically, a Gaussian mixture model is used to capture both binding and background signals in sample data. As a unique feature of ChIP-BIT, background signals are modeled by a local Gaussian distribution that is accurately estimated from the input data. Extensive simulation studies showed a significantly improved performance of ChIP-BIT in target gene prediction, particularly for detecting weak binding signals at gene promoter regions. We applied ChIP-BIT to find target genes from NOTCH3 and PBX1 ChIP-seq data acquired from MCF-7 breast cancer cells. TF knockdown experiments have initially validated about 30% of co-regulated target genes identified by ChIP-BIT as being differentially expressed in MCF-7 cells. Functional analysis on these genes further revealed the existence of crosstalk between Notch and Wnt signaling pathways.
Affiliation(s)
- Xi Chen
  - Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, 900 North Glebe Road, Arlington, VA 22203, USA
- Jin-Gyoung Jung
  - Department of Pathology, Johns Hopkins Medical Institutions, 1550 Orleans Street, CRB-II, Baltimore, MD 21231, USA
- Ayesha N Shajahan-Haq
  - Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, 3970 Reservoir Road NW, Washington, DC 20057, USA
- Robert Clarke
  - Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, 3970 Reservoir Road NW, Washington, DC 20057, USA
- Ie-Ming Shih
  - Department of Pathology, Johns Hopkins Medical Institutions, 1550 Orleans Street, CRB-II, Baltimore, MD 21231, USA
- Yue Wang
  - Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, 900 North Glebe Road, Arlington, VA 22203, USA
- Luca Magnani
  - Department of Surgery and Cancer, Imperial College London, ICTEM building, Hammersmith Hospital, DuCane Road, London W120NN, UK
- Tian-Li Wang
  - Department of Pathology, Johns Hopkins Medical Institutions, 1550 Orleans Street, CRB-II, Baltimore, MD 21231, USA
- Jianhua Xuan
  - Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, 900 North Glebe Road, Arlington, VA 22203, USA
|
19
|
Kleftogiannis D, Kalnis P, Bajic VB. Progress and challenges in bioinformatics approaches for enhancer identification. Brief Bioinform 2015; 17:967-979. [PMID: 26634919 PMCID: PMC5142011 DOI: 10.1093/bib/bbv101] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Received: 08/27/2015] [Revised: 10/22/2015] [Indexed: 12/20/2022] Open
Abstract
Enhancers are cis-acting DNA elements that play critical roles in distal regulation of gene expression. Identifying enhancers is an important step for understanding distinct gene expression programs that may reflect normal and pathogenic cellular conditions. Experimental identification of enhancers is constrained by the set of conditions used in the experiment. This requires multiple experiments to identify enhancers, as they can be active under specific cellular conditions but not in different cell types/tissues or cellular states. This has opened prospects for computational prediction methods that can be used for high-throughput identification of putative enhancers to complement experimental approaches. Potential functions and properties of predicted enhancers have been catalogued and summarized in several enhancer-oriented databases. Because the current methods for the computational prediction of enhancers produce significantly different enhancer predictions, it will be beneficial for the research community to have an overview of the strategies and solutions developed in this field. In this review, we focus on the identification and analysis of enhancers by bioinformatics approaches. First, we describe a general framework for computational identification of enhancers, present relevant data types and discuss possible computational solutions. Next, we cover over 30 existing computational enhancer identification methods that were developed since 2000. Our review highlights advantages, limitations and potentials, while suggesting pragmatic guidelines for development of more efficient computational enhancer prediction methods. Finally, we discuss challenges and open problems of this topic, which require further consideration.
|
20
|
Grassi E, Zapparoli E, Molineris I, Provero P. Total Binding Affinity Profiles of Regulatory Regions Predict Transcription Factor Binding and Gene Expression in Human Cells. PLoS One 2015; 10:e0143627. [PMID: 26599758 PMCID: PMC4658012 DOI: 10.1371/journal.pone.0143627] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 06/12/2015] [Accepted: 11/07/2015] [Indexed: 11/29/2022] Open
Abstract
Transcription factors regulate gene expression by binding regulatory DNA. Understanding the rules governing such binding is an essential step in describing the network of regulatory interactions, and its pathological alterations. We show that describing regulatory regions in terms of their profile of total binding affinities for transcription factors leads to increased predictive power compared to methods based on the identification of discrete binding sites. This applies both to the prediction of transcription factor binding as revealed by ChIP-seq experiments and to the prediction of gene expression through RNA-seq. Further significant improvements in predictive power are obtained when regulatory regions are defined based on chromatin states inferred from histone modification data.
Affiliation(s)
- Elena Grassi
  - Dept. of Molecular Biotechnology and Health Sciences, University of Turin, Turin, Italy
- Ettore Zapparoli
  - Dept. of Molecular Biotechnology and Health Sciences, University of Turin, Turin, Italy
- Ivan Molineris
  - Dept. of Molecular Biotechnology and Health Sciences, University of Turin, Turin, Italy
- Paolo Provero
  - Dept. of Molecular Biotechnology and Health Sciences, University of Turin, Turin, Italy
  - Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute, Milan, Italy
|
21
|
FlexDM: Simple, parallel and fault-tolerant data mining using WEKA. Source Code for Biology and Medicine 2015; 10:13. [PMID: 26579209 PMCID: PMC4647584 DOI: 10.1186/s13029-015-0045-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 01/20/2015] [Accepted: 11/09/2015] [Indexed: 12/03/2022]
Abstract
Background With the continued exponential growth in data volume, large-scale data mining and machine learning experiments have become a necessity for many researchers without programming or statistics backgrounds. WEKA (Waikato Environment for Knowledge Analysis) is a gold standard framework that facilitates and simplifies this task by allowing specification of algorithms, hyper-parameters and test strategies from a streamlined Experimenter GUI. Despite its popularity, the WEKA Experimenter exhibits several limitations that we address in our new FlexDM software. Results FlexDM addresses four fundamental limitations with the WEKA Experimenter: reliance on a verbose and difficult-to-modify XML schema; inability to meta-optimise experiments over a large number of algorithm hyper-parameters; inability to recover from software or hardware failure during a large experiment; and failing to leverage modern multicore processor architectures. Direct comparisons between the FlexDM and default WEKA XML schemas demonstrate a 10-fold improvement in brevity for a specification that allows finer control of experimental procedures. The stability of FlexDM has been tested on a large biological dataset (approximately 450 k attributes by 150 samples), and automatic parallelisation of tasks yields a quasi-linear reduction in execution time when distributed across multiple processor cores. Conclusion FlexDM is a powerful and easy-to-use extension to the WEKA package, which better handles the increased volume and complexity of data that has emerged during the 20 years since WEKA’s original development. FlexDM has been tested on Windows, OSX and Linux operating systems and is provided as a pre-configured virtual reference environment for trivial usage and extensibility. 
This software can substantially improve the productivity of any research group conducting large-scale data mining or machine learning tasks, in addition to providing non-programmers with improved control over specific aspects of their data analysis pipeline via a succinct and simplified XML schema. Electronic supplementary material The online version of this article (doi:10.1186/s13029-015-0045-3) contains supplementary material, which is available to authorized users.
|
22
|
Narang V, Ramli MA, Singhal A, Kumar P, de Libero G, Poidinger M, Monterola C. Automated Identification of Core Regulatory Genes in Human Gene Regulatory Networks. PLoS Comput Biol 2015; 11:e1004504. [PMID: 26393364 PMCID: PMC4578944 DOI: 10.1371/journal.pcbi.1004504] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 04/10/2015] [Accepted: 08/11/2015] [Indexed: 12/20/2022] Open
Abstract
Human gene regulatory networks (GRN) can be difficult to interpret due to a tangle of edges interconnecting thousands of genes. We constructed a general human GRN from extensive transcription factor and microRNA target data obtained from public databases. In a subnetwork of this GRN that is active during estrogen stimulation of MCF-7 breast cancer cells, we benchmarked automated algorithms for identifying core regulatory genes (transcription factors and microRNAs). Among these algorithms, we identified K-core decomposition, pagerank and betweenness centrality algorithms as the most effective for discovering core regulatory genes in the network, evaluated based on previously known roles of these genes in MCF-7 biology as well as on their ability to explain the up or down expression status of up to 70% of the remaining genes. Finally, we validated the use of the K-core algorithm for organizing the GRN in an easier to interpret layered hierarchy where more influential regulatory genes percolate towards the inner layers. The integrated human gene and miRNA network and software used in this study are provided as supplementary materials (S1 Data) accompanying this manuscript. A gene regulatory network (GRN) represents how some genes encoding regulatory molecules such as transcription factors or microRNAs regulate the expression of other genes. Researchers commonly study GRNs involved in a specific biological process with the aim of identifying a few important regulatory genes. In higher organisms such as humans, a regulatory gene regulates multiple target genes and correspondingly any gene is regulated by multiple regulatory genes. Due to such multiplicity of interactions, a GRN usually resembles a tangled hairball wherein it is difficult to identify the few most influential regulatory genes. 
In this study, we show that network analysis algorithms such as K-core, pagerank and betweenness centrality are useful for identifying a few important or core regulatory genes in a GRN, and the K-core algorithm is also useful for organizing regulatory genes in a hierarchical layered structure where the most influential genes in a GRN are found within the innermost layer or core. These few core regulatory genes determine to a large extent the expression status of the remaining genes in the network. We illustrate a pragmatic application of this technique to GRNs reconstructed from genome-wide gene expression measurements in the MCF-7 human breast cancer cell line.
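As a rough illustration of the K-core idea described in this abstract, the sketch below computes core numbers by repeatedly peeling the lowest-degree node. It is not the authors' software (which is in their S1 Data); the function name `core_numbers`, the toy gene labels, and the choice to treat regulatory edges as undirected are all assumptions made for the example.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return the core number of every node in an undirected graph.

    A node's core number is the largest k such that the node belongs to the
    k-core, i.e. the maximal subgraph in which every node has degree >= k.
    Assumption: regulatory edges are treated as undirected here.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbours[u].add(v)
            neighbours[v].add(u)

    degree = {node: len(adj) for node, adj in neighbours.items()}
    remaining = set(degree)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        node = min(remaining, key=degree.get)
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for other in neighbours[node]:
            if other in remaining:
                degree[other] -= 1
    return core


if __name__ == "__main__":
    # Hypothetical toy network: A-D are densely interconnected regulators,
    # E and F hang off the periphery.
    toy_edges = [
        ("A", "B"), ("A", "C"), ("A", "D"),
        ("B", "C"), ("B", "D"), ("C", "D"),
        ("A", "E"), ("E", "F"),
    ]
    for gene, k in sorted(core_numbers(toy_edges).items()):
        print(gene, k)
```

Nodes with the highest core number form the innermost layer; in the toy network these are the four mutually connected nodes, while the two peripheral nodes fall in the outer layer. This simple version rescans all remaining nodes at each step, which is adequate for small graphs but not for genome-scale networks.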
|
23
|
Stegmayer G, Pividori M, Milone DH. A very simple and fast way to access and validate algorithms in reproducible research. Brief Bioinform 2015. [DOI: 10.1093/bib/bbv054] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 01/28/2023] Open
|
24
|
Budden DM, Hurley DG, Crampin EJ. Modelling the conditional regulatory activity of methylated and bivalent promoters. Epigenetics Chromatin 2015; 8:21. [PMID: 26097508 PMCID: PMC4474576 DOI: 10.1186/s13072-015-0013-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 04/10/2015] [Accepted: 06/10/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Predictive modelling of gene expression is a powerful framework for the in silico exploration of transcriptional regulatory interactions through the integration of high-throughput -omics data. A major limitation of previous approaches is their inability to handle conditional interactions that emerge when genes are subject to different regulatory mechanisms. Although chromatin immunoprecipitation-based histone modification data are often used as proxies for chromatin accessibility, the association between these variables and expression often depends upon the presence of other epigenetic markers (e.g. DNA methylation or histone variants). These conditional interactions are poorly handled by previous predictive models and reduce the reliability of downstream biological inference. RESULTS We have previously demonstrated that integrating both transcription factor and histone modification data within a single predictive model is rendered ineffective by their statistical redundancy. In this study, we evaluate four proposed methods for quantifying gene-level DNA methylation levels and demonstrate that inclusion of these data in predictive modelling frameworks is also subject to this critical limitation in data integration. Based on the hypothesis that statistical redundancy in epigenetic data is caused by conditional regulatory interactions within a dynamic chromatin context, we construct a new gene expression model which is the first to improve prediction accuracy by unsupervised identification of latent regulatory classes. We show that DNA methylation and H2A.Z histone variant data can be interpreted in this way to identify and explore the signatures of silenced and bivalent promoters, substantially improving genome-wide predictions of mRNA transcript abundance and downstream biological inference across multiple cell lines. 
CONCLUSIONS Previous models of gene expression have been applied successfully to several important problems in molecular biology, including the discovery of transcription factor roles, identification of regulatory elements responsible for differential expression patterns and comparative analysis of the transcriptome across distant species. Our analysis supports our hypothesis that statistical redundancy in epigenetic data is partially due to conditional relationships between these regulators and gene expression levels. This analysis provides insight into the heterogeneous roles of H3K4me3 and H3K27me3 in the presence of the H2A.Z histone variant (implicated in cancer progression) and how these signatures change during lineage commitment and carcinogenesis.
Collapse
Affiliation(s)
- David M Budden
  - Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia; NICTA Victoria Research Laboratory, The University of Melbourne, 3010 Parkville, Australia
- Daniel G Hurley
  - Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia
- Edmund J Crampin
  - Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia; NICTA Victoria Research Laboratory, The University of Melbourne, 3010 Parkville, Australia; ARC Centre of Excellence in Convergent Bio-Nano Science and Technology, 3010 Parkville, Australia; Department of Mathematics and Statistics, The University of Melbourne, 3010 Parkville, Australia; School of Medicine, The University of Melbourne, 3010 Parkville, Australia
25
Hurley DG, Budden DM, Crampin EJ. Virtual Reference Environments: a simple way to make research reproducible. Brief Bioinform 2014; 16:901-3. [PMID: 25433467 PMCID: PMC4570198 DOI: 10.1093/bib/bbu043] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 07/02/2014] [Indexed: 11/18/2022] Open
Abstract
‘Reproducible research’ has received increasing attention over the past few years as bioinformatics and computational biology methodologies become more complex. Although reproducible research is progressing in several valuable ways, we suggest that recent increases in internet bandwidth and disk space, along with the availability of open-source and free-software licences for tools, enable another simple step to make research reproducible. In this article, we urge the creation of minimal virtual reference environments implementing all the tools necessary to reproduce a result, as a standard part of publication. We address potential problems with this approach, and show an example environment from our own work.
26
Budden DM, Hurley DG, Cursons J, Markham JF, Davis MJ, Crampin EJ. Predicting expression: the complementary power of histone modification and transcription factor binding data. Epigenetics Chromatin 2014; 7:36. [PMID: 25489339 PMCID: PMC4258808 DOI: 10.1186/1756-8935-7-36] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Received: 07/25/2014] [Accepted: 11/05/2014] [Indexed: 01/01/2023] Open
Abstract
Background Transcription factors (TFs) and histone modifications (HMs) play critical roles in gene expression by regulating mRNA transcription. Modelling frameworks have been developed to integrate high-throughput omics data, with the aim of elucidating the regulatory logic that results from the interactions of DNA, TFs and HMs. These models have yielded an unexpected and poorly understood result: that TFs and HMs are statistically redundant in explaining mRNA transcript abundance at a genome-wide level. Results We constructed predictive models of gene expression by integrating RNA-sequencing, TF and HM chromatin immunoprecipitation sequencing and DNase I hypersensitivity data for two mammalian cell types. All models identified genome-wide statistical redundancy both within and between TFs and HMs, as previously reported. To investigate potential explanations, groups of genes were constructed for ontology-classified biological processes. Predictive models were constructed for each process to explore the distribution of statistical redundancy. We found significant variation in the predictive capacity of TFs and HMs across these processes and demonstrated the predictive power of HMs to be inversely proportional to process enrichment for housekeeping genes. Conclusions It is well established that the roles played by TFs and HMs are not functionally redundant. Instead, we attribute the statistical redundancy reported in this and previous genome-wide modelling studies to the heterogeneous distribution of HMs across chromatin domains. Furthermore, we conclude that statistical redundancy between individual TFs can be readily explained by nucleosome-mediated cooperative binding. This could possibly help the cell confer regulatory robustness by rejecting signalling noise and allowing control via multiple pathways. 
Electronic supplementary material: The online version of this article (doi:10.1186/1756-8935-7-36) contains supplementary material, which is available to authorized users.
Affiliation(s)
- David M Budden
  - Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia; NICTA Victoria Research Laboratory, The University of Melbourne, 3010 Parkville, Australia
- Daniel G Hurley
  - Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia
- Joseph Cursons
  - Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia
- John F Markham
  - Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia; The Walter and Eliza Hall Institute of Medical Research, Department of Medical Biology, The University of Melbourne, 3010 Parkville, Australia
- Melissa J Davis
  - Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia
- Edmund J Crampin
  - Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia; NICTA Victoria Research Laboratory, The University of Melbourne, 3010 Parkville, Australia; The Walter and Eliza Hall Institute of Medical Research, Department of Medical Biology, The University of Melbourne, 3010 Parkville, Australia; ARC Centre of Excellence in Convergent Bio-Nano Science and Technology, 3010 Parkville, Australia; Department of Mathematics and Statistics, The University of Melbourne, 3010 Parkville, Australia; School of Medicine, The University of Melbourne, 3010 Parkville, Australia