51
|
Lin X, Qi Y, Latham AP, Zhang B. Multiscale modeling of genome organization with maximum entropy optimization. J Chem Phys 2021; 155:010901. [PMID: 34241389 PMCID: PMC8253599 DOI: 10.1063/5.0044150] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2021] [Accepted: 04/28/2021] [Indexed: 12/15/2022] Open
Abstract
Three-dimensional (3D) organization of the human genome plays an essential role in all DNA-templated processes, including gene transcription, gene regulation, and DNA replication. Computational modeling can be an effective way of building high-resolution genome structures and improving our understanding of these molecular processes. However, it faces significant challenges as the human genome consists of over 6 × 109 base pairs, a system size that exceeds the capacity of traditional modeling approaches. In this perspective, we review the progress that has been made in modeling the human genome. Coarse-grained models parameterized to reproduce experimental data via the maximum entropy optimization algorithm serve as effective means to study genome organization at various length scales. They have provided insight into the principles of whole-genome organization and enabled de novo predictions of chromosome structures from epigenetic modifications. Applications of these models at a near-atomistic resolution further revealed physicochemical interactions that drive the phase separation of disordered proteins and dictate chromatin stability in situ. We conclude with an outlook on the opportunities and challenges in studying chromosome dynamics.
Collapse
Affiliation(s)
- Xingcheng Lin
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Yifeng Qi
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Andrew P Latham
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Bin Zhang
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| |
Collapse
|
52
|
Talukder A, Barham C, Li X, Hu H. Interpretation of deep learning in genomics and epigenomics. Brief Bioinform 2021; 22:bbaa177. [PMID: 34020542 PMCID: PMC8138893 DOI: 10.1093/bib/bbaa177] [Citation(s) in RCA: 42] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2020] [Revised: 06/26/2020] [Accepted: 07/10/2020] [Indexed: 12/17/2022] Open
Abstract
Machine learning methods have been widely applied to big data analysis in genomics and epigenomics research. Although accuracy and efficiency are common goals in many modeling tasks, model interpretability is especially important to these studies towards understanding the underlying molecular and cellular mechanisms. Deep neural networks (DNNs) have recently gained popularity in various types of genomic and epigenomic studies due to their capabilities in utilizing large-scale high-throughput bioinformatics data and achieving high accuracy in predictions and classifications. However, DNNs are often challenged by their potential to explain the predictions due to their black-box nature. In this review, we present current development in the model interpretation of DNNs, focusing on their applications in genomics and epigenomics. We first describe state-of-the-art DNN interpretation methods in representative machine learning fields. We then summarize the DNN interpretation methods in recent studies on genomics and epigenomics, focusing on current data- and computing-intensive topics such as sequence motif identification, genetic variations, gene expression, chromatin interactions and non-coding RNAs. We also present the biological discoveries that resulted from these interpretation methods. We finally discuss the advantages and limitations of current interpretation approaches in the context of genomic and epigenomic studies. Contact:xiaoman@mail.ucf.edu, haihu@cs.ucf.edu.
Collapse
Affiliation(s)
- Amlan Talukder
- Computer Science, University of Central Florida, Orlando, FL 32816, USA
| | - Clayton Barham
- Computer Science, University of Central Florida, Orlando, FL 32816, USA
| | - Xiaoman Li
- Burnett School of Biomedical Science, University of Central Florida, Orlando, FL 32816, USA
| | - Haiyan Hu
- Computer Science, University of Central Florida, Orlando, FL 32816, USA
| |
Collapse
|
53
|
Lv H, Dao FY, Zulfiqar H, Su W, Ding H, Liu L, Lin H. A sequence-based deep learning approach to predict CTCF-mediated chromatin loop. Brief Bioinform 2021; 22:6149346. [PMID: 33634313 DOI: 10.1093/bib/bbab031] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2020] [Revised: 12/01/2020] [Accepted: 01/21/2021] [Indexed: 12/13/2022] Open
Abstract
Three-dimensional (3D) architecture of the chromosomes is of crucial importance for transcription regulation and DNA replication. Various high-throughput chromosome conformation capture-based methods have revealed that CTCF-mediated chromatin loops are a major component of 3D architecture. However, CTCF-mediated chromatin loops are cell type specific, and most chromatin interaction capture techniques are time-consuming and labor-intensive, which restricts their usage on a very large number of cell types. Genomic sequence-based computational models are sophisticated enough to capture important features of chromatin architecture and help to identify chromatin loops. In this work, we develop Deep-loop, a convolutional neural network model, to integrate k-tuple nucleotide frequency component, nucleotide pair spectrum encoding, position conservation, position scoring function and natural vector features for the prediction of chromatin loops. By a series of examination based on cross-validation, Deep-loop shows excellent performance in the identification of the chromatin loops from different cell types. The source code of Deep-loop is freely available at the repository https://github.com/linDing-group/Deep-loop.
Collapse
Affiliation(s)
- Hao Lv
- Informational Biology at University of Electronic Science and Technology of China
| | - Fu-Ying Dao
- Informational Biology at University of Electronic Science and Technology of China
| | - Hasan Zulfiqar
- Informational Biology at University of Electronic Science and Technology of China
| | - Wei Su
- Informational Biology at University of Electronic Science and Technology of China
| | - Hui Ding
- Informational Biology at University of Electronic Science and Technology of China
| | - Li Liu
- Laboratory of Theoretical Biophysics at Inner Mongolia University
| | - Hao Lin
- Informational Biology at University of Electronic Science and Technology of China
| |
Collapse
|
54
|
Wang Y, Chen S, Li W, Jiang R, Wang Y. Associating divergent lncRNAs with target genes by integrating genome sequence, gene expression and chromatin accessibility data. NAR Genom Bioinform 2021; 2:lqaa019. [PMID: 33575579 PMCID: PMC7671357 DOI: 10.1093/nargab/lqaa019] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2019] [Revised: 02/18/2020] [Accepted: 03/05/2020] [Indexed: 01/22/2023] Open
Abstract
Recent RNA knockdown experiments revealed that a dozen divergent long noncoding RNAs (lncRNAs) positively regulate the transcription of genes in cis. Here, to understand the regulatory mechanism of divergent lncRNAs, we proposed a computational model IRDL (Identify the Regulatory Divergent LncRNAs) to associate divergent lncRNAs with target genes. IRDL took advantage of the cross-tissue paired expression and chromatin accessibility data in ENCODE and a dozen experimentally validated divergent lncRNA target genes. IRDL integrated sequence similarity, co-expression and co-accessibility features, battled the scarcity of gold standard datasets with an increasingly learning framework and identified 446 and 977 divergent lncRNA-gene regulatory associations for mouse and human, respectively. We found that the identified divergent lncRNAs and target genes correlated well in expression and chromatin accessibility. The functional and pathway enrichment analysis suggests that divergent lncRNAs are strongly associated with developmental regulatory transcription factors. The predicted loop structure validation and canonical database search indicate a scaffold regulatory model for divergent lncRNAs. Furthermore, we computationally revealed the tissue/cell-specific regulatory associations considering the specificity of lncRNA. In conclusion, IRDL provides a way to understand the regulatory mechanism of divergent lncRNAs and hints at hundreds of tissue/cell-specific regulatory associations worthy for further biological validation.
Collapse
Affiliation(s)
- Yongcui Wang
- Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology, Chinese Academy of Sciences, Xining, Qinghai 810008, China.,Qinghai Provincial Key Laboratory of Crop Molecular Breeding, Northwest Institute of Plateau Biology, Chinese Academy of Sciences, Xining 810008, China
| | - Shilong Chen
- Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology, Chinese Academy of Sciences, Xining, Qinghai 810008, China
| | - Wenran Li
- MOE Key Laboratory of Bioinformatics, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Rui Jiang
- MOE Key Laboratory of Bioinformatics, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Yong Wang
- CEMS, NCMIS, MDIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China.,Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China
| |
Collapse
|
55
|
Kuang S, Wang L. Identification and analysis of consensus RNA motifs binding to the genome regulator CTCF. NAR Genom Bioinform 2021; 2:lqaa031. [PMID: 33575587 PMCID: PMC7671415 DOI: 10.1093/nargab/lqaa031] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2019] [Revised: 03/21/2020] [Accepted: 04/28/2020] [Indexed: 12/14/2022] Open
Abstract
CCCTC-binding factor (CTCF) is a key regulator of 3D genome organization and gene expression. Recent studies suggest that RNA transcripts, mostly long non-coding RNAs (lncRNAs), can serve as locus-specific factors to bind and recruit CTCF to the chromatin. However, it remains unclear whether specific sequence patterns are shared by the CTCF-binding RNA sites, and no RNA motif has been reported so far for CTCF binding. In this study, we have developed DeepLncCTCF, a new deep learning model based on a convolutional neural network and a bidirectional long short-term memory network, to discover the RNA recognition patterns of CTCF and identify candidate lncRNAs binding to CTCF. When evaluated on two different datasets, human U2OS dataset and mouse ESC dataset, DeepLncCTCF was shown to be able to accurately predict CTCF-binding RNA sites from nucleotide sequence. By examining the sequence features learned by DeepLncCTCF, we discovered a novel RNA motif with the consensus sequence, AGAUNGGA, for potential CTCF binding in humans. Furthermore, the applicability of DeepLncCTCF was demonstrated by identifying nearly 5000 candidate lncRNAs that might bind to CTCF in the nucleus. Our results provide useful information for understanding the molecular mechanisms of CTCF function in 3D genome organization.
Collapse
Affiliation(s)
- Shuzhen Kuang
- Department of Genetics and Biochemistry, Clemson University, Clemson, SC 29634, USA.,Department of Biological Sciences, Clemson University, Clemson, SC 29634, USA
| | - Liangjiang Wang
- Department of Genetics and Biochemistry, Clemson University, Clemson, SC 29634, USA
| |
Collapse
|
56
|
Chen S, Gan M, Lv H, Jiang R. DeepCAPE: A Deep Convolutional Neural Network for the Accurate Prediction of Enhancers. GENOMICS PROTEOMICS & BIOINFORMATICS 2021; 19:565-577. [PMID: 33581335 PMCID: PMC9040020 DOI: 10.1016/j.gpb.2019.04.006] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/24/2018] [Revised: 03/15/2019] [Accepted: 04/29/2019] [Indexed: 12/12/2022]
Abstract
The establishment of a landscape of enhancers across human cells is crucial to deciphering the mechanism of gene regulation, cell differentiation, and disease development. High-throughput experimental approaches, which contain successfully reported enhancers in typical cell lines, are still too costly and time-consuming to perform systematic identification of enhancers specific to different cell lines. Existing computational methods, capable of predicting regulatory elements purely relying on DNA sequences, lack the power of cell line-specific screening. Recent studies have suggested that chromatin accessibility of a DNA segment is closely related to its potential function in regulation, and thus may provide useful information in identifying regulatory elements. Motivated by the aforementioned understanding, we integrate DNA sequences and chromatin accessibility data to accurately predict enhancers in a cell line-specific manner. We proposed DeepCAPE, a deep convolutional neural network to predict enhancers via the integration of DNA sequences and DNase-seq data. Benefitting from the well-designed feature extraction mechanism and skip connection strategy, our model not only consistently outperforms existing methods in the imbalanced classification of cell line-specific enhancers against background sequences, but also has the ability to self-adapt to different sizes of datasets. Besides, with the adoption of auto-encoder, our model is capable of making cross-cell line predictions. We further visualize kernels of the first convolutional layer and show the match of identified sequence signatures and known motifs. We finally demonstrate the potential ability of our model to explain functional implications of putative disease-associated genetic variants and discriminate disease-related enhancers. The source code and detailed tutorial of DeepCAPE are freely available at https://github.com/ShengquanChen/DeepCAPE.
Collapse
Affiliation(s)
- Shengquan Chen
- MOE Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Mingxin Gan
- Department of Management Science and Engineering, School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
| | - Hairong Lv
- MOE Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Rui Jiang
- MOE Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China.
| |
Collapse
|
57
|
Belokopytova P, Fishman V. Predicting Genome Architecture: Challenges and Solutions. Front Genet 2021; 11:617202. [PMID: 33552135 PMCID: PMC7862721 DOI: 10.3389/fgene.2020.617202] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2020] [Accepted: 12/15/2020] [Indexed: 12/22/2022] Open
Abstract
Genome architecture plays a pivotal role in gene regulation. The use of high-throughput methods for chromatin profiling and 3-D interaction mapping provide rich experimental data sets describing genome organization and dynamics. These data challenge development of new models and algorithms connecting genome architecture with epigenetic marks. In this review, we describe how chromatin architecture could be reconstructed from epigenetic data using biophysical or statistical approaches. We discuss the applicability and limitations of these methods for understanding the mechanisms of chromatin organization. We also highlight the emergence of new predictive approaches for scoring effects of structural variations in human cells.
Collapse
Affiliation(s)
- Polina Belokopytova
- Natural Sciences Department, Novosibirsk State University, Novosibirsk, Russia
- Institute of Cytology and Genetics Siberian Branch of Russian Academy of Sciences (SB RAS), Novosibirsk, Russia
| | - Veniamin Fishman
- Natural Sciences Department, Novosibirsk State University, Novosibirsk, Russia
- Institute of Cytology and Genetics Siberian Branch of Russian Academy of Sciences (SB RAS), Novosibirsk, Russia
| |
Collapse
|
58
|
Tao H, Li H, Xu K, Hong H, Jiang S, Du G, Wang J, Sun Y, Huang X, Ding Y, Li F, Zheng X, Chen H, Bo X. Computational methods for the prediction of chromatin interaction and organization using sequence and epigenomic profiles. Brief Bioinform 2021; 22:6102668. [PMID: 33454752 PMCID: PMC8424394 DOI: 10.1093/bib/bbaa405] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2020] [Revised: 11/26/2020] [Accepted: 12/10/2020] [Indexed: 12/14/2022] Open
Abstract
The exploration of three-dimensional chromatin interaction and organization provides insight into mechanisms underlying gene regulation, cell differentiation and disease development. Advances in chromosome conformation capture technologies, such as high-throughput chromosome conformation capture (Hi-C) and chromatin interaction analysis by paired-end tag (ChIA-PET), have enabled the exploration of chromatin interaction and organization. However, high-resolution Hi-C and ChIA-PET data are only available for a limited number of cell lines, and their acquisition is costly, time consuming, laborious and affected by theoretical limitations. Increasing evidence shows that DNA sequence and epigenomic features are informative predictors of regulatory interaction and chromatin architecture. Based on these features, numerous computational methods have been developed for the prediction of chromatin interaction and organization, whereas they are not extensively applied in biomedical study. A systematical study to summarize and evaluate such methods is still needed to facilitate their application. Here, we summarize 48 computational methods for the prediction of chromatin interaction and organization using sequence and epigenomic profiles, categorize them and compare their performance. Besides, we provide a comprehensive guideline for the selection of suitable methods to predict chromatin interaction and organization based on available data and biological question of interest.
Collapse
Affiliation(s)
- Huan Tao
- Beijing Institute of Radiation Medicine
| | - Hao Li
- Beijing Institute of Radiation Medicine
| | - Kang Xu
- Beijing Institute of Radiation Medicine
| | - Hao Hong
- Beijing Institute of Radiation Medicine, Department of Biotechnology
| | - Shuai Jiang
- Beijing Institute of Radiation Medicine, Department of Biotechnology
| | - Guifang Du
- Beijing Institute of Radiation Medicine, Department of Biotechnology
| | | | - Yu Sun
- Beijing Institute of Radiation Medicine, Department of Biotechnology
| | - Xin Huang
- Beijing Institute of Radiation Medicine, Department of Biotechnology
| | - Yang Ding
- Beijing Institute of Radiation Medicine
| | - Fei Li
- Chinese Academy of Sciences, Department of Computer Network Information Center
| | | | | | | |
Collapse
|
59
|
Contessoto VG, Cheng RR, Hajitaheri A, Dodero-Rojas E, Mello MF, Lieberman-Aiden E, Wolynes P, Di Pierro M, Onuchic JN. The Nucleome Data Bank: web-based resources to simulate and analyze the three-dimensional genome. Nucleic Acids Res 2021; 49:D172-D182. [PMID: 33021634 PMCID: PMC7778995 DOI: 10.1093/nar/gkaa818] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2020] [Revised: 09/05/2020] [Accepted: 10/02/2020] [Indexed: 11/30/2022] Open
Abstract
We introduce the Nucleome Data Bank (NDB), a web-based platform to simulate and analyze the three-dimensional (3D) organization of genomes. The NDB enables physics-based simulation of chromosomal structural dynamics through the MEGABASE + MiChroM computational pipeline. The input of the pipeline consists of epigenetic information sourced from the Encode database; the output consists of the trajectories of chromosomal motions that accurately predict Hi-C and fluorescence insitu hybridization data, as well as multiple observations of chromosomal dynamics in vivo. As an intermediate step, users can also generate chromosomal sub-compartment annotations directly from the same epigenetic input, without the use of any DNA-DNA proximity ligation data. Additionally, the NDB freely hosts both experimental and computational structural genomics data. Besides being able to perform their own genome simulations and download the hosted data, users can also analyze and visualize the same data through custom-designed web-based tools. In particular, the one-dimensional genetic and epigenetic data can be overlaid onto accurate 3D structures of chromosomes, to study the spatial distribution of genetic and epigenetic features. The NDB aims to be a shared resource to biologists, biophysicists and all genome scientists. The NDB is available at https://ndb.rice.edu.
Collapse
Affiliation(s)
- Vinícius G Contessoto
- Center for Theoretical Biological Physics, Rice University, Houston, TX 77005, USA
- Brazilian Biorenewables National Laboratory-LNBR, Brazilian Center for Research in Energy and Materials-CNPEM, Campinas, SP 13083-100, Brazil
- Department of Physics, São Paulo State University (UNESP), Institute of Biosciences, Humanities and Exact Sciences, São José do Rio Preto, SP 15054-000, Brazil
| | - Ryan R Cheng
- Center for Theoretical Biological Physics, Rice University, Houston, TX 77005, USA
| | - Arya Hajitaheri
- Center for Theoretical Biological Physics, Rice University, Houston, TX 77005, USA
- Department of Computer Science, University of Houston, Houston, TX 77204, USA
| | - Esteban Dodero-Rojas
- Center for Theoretical Biological Physics, Rice University, Houston, TX 77005, USA
- Theoretical and Computational Physics Laboratory, University of Costa Rica, San José 5 11501, Costa Rica
| | - Matheus F Mello
- Center for Theoretical Biological Physics, Rice University, Houston, TX 77005, USA
- Chemical Engineering Department, Military Institute of Engineering, Rio de Janeiro, RJ 22290-270, Brazil
| | - Erez Lieberman-Aiden
- Center for Theoretical Biological Physics, Rice University, Houston, TX 77005, USA
- The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Peter G Wolynes
- Center for Theoretical Biological Physics, Rice University, Houston, TX 77005, USA
- Department of Physics & Astronomy, Rice University, Houston, TX 77005, USA
- Department of Chemistry, Rice University, Houston, TX 77005, USA
- Department of Biosciences, Rice University, Houston, TX 77005, USA
| | - Michele Di Pierro
- Center for Theoretical Biological Physics, Rice University, Houston, TX 77005, USA
- Department of Physics, Northeastern University, Boston, MA 02115, USA
| | - José N Onuchic
- Center for Theoretical Biological Physics, Rice University, Houston, TX 77005, USA
- Department of Physics & Astronomy, Rice University, Houston, TX 77005, USA
- Department of Chemistry, Rice University, Houston, TX 77005, USA
- Department of Biosciences, Rice University, Houston, TX 77005, USA
| |
Collapse
|
60
|
Rozenwald MB, Galitsyna AA, Sapunov GV, Khrameeva EE, Gelfand MS. A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features. PeerJ Comput Sci 2020; 6:e307. [PMID: 33816958 PMCID: PMC7924456 DOI: 10.7717/peerj-cs.307] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2020] [Accepted: 09/30/2020] [Indexed: 05/03/2023]
Abstract
Technological advances have lead to the creation of large epigenetic datasets, including information about DNA binding proteins and DNA spatial structure. Hi-C experiments have revealed that chromosomes are subdivided into sets of self-interacting domains called Topologically Associating Domains (TADs). TADs are involved in the regulation of gene expression activity, but the mechanisms of their formation are not yet fully understood. Here, we focus on machine learning methods to characterize DNA folding patterns in Drosophila based on chromatin marks across three cell lines. We present linear regression models with four types of regularization, gradient boosting, and recurrent neural networks (RNN) as tools to study chromatin folding characteristics associated with TADs given epigenetic chromatin immunoprecipitation data. The bidirectional long short-term memory RNN architecture produced the best prediction scores and identified biologically relevant features. Distribution of protein Chriz (Chromator) and histone modification H3K4me3 were selected as the most informative features for the prediction of TADs characteristics. This approach may be adapted to any similar biological dataset of chromatin features across various cell lines and species. The code for the implemented pipeline, Hi-ChiP-ML, is publicly available: https://github.com/MichalRozenwald/Hi-ChIP-ML.
Collapse
Affiliation(s)
- Michal B. Rozenwald
- Faculty of Computer Science, National Research University Higher School of Economics, Moscow, Russia
| | | | - Grigory V. Sapunov
- Faculty of Computer Science, National Research University Higher School of Economics, Moscow, Russia
- Intento, Inc., Berkeley, CA, USA
| | | | - Mikhail S. Gelfand
- Skolkovo Institute of Science and Technology, Moscow, Russia
- A.A. Kharkevich Institute for Information Transmission Problems, RAS, Moscow, Russia
| |
Collapse
|
61
|
Kuang S, Wang L. Deep Learning of Sequence Patterns for CCCTC-Binding Factor-Mediated Chromatin Loop Formation. J Comput Biol 2020; 28:133-145. [PMID: 33232622 DOI: 10.1089/cmb.2020.0225] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022] Open
Abstract
The three-dimensional (3D) organization of the human genome is of crucial importance for gene regulation, and the CCCTC-binding factor (CTCF) plays an important role in chromatin interactions. However, it is still unclear what sequence patterns in addition to CTCF motif pairs determine chromatin loop formation. To discover the underlying sequence patterns, we have developed a deep learning model, called DeepCTCFLoop, to predict whether a chromatin loop can be formed between a pair of convergent or tandem CTCF motifs using only the DNA sequences of the motifs and their flanking regions. Our results suggest that DeepCTCFLoop can accurately distinguish the CTCF motif pairs forming chromatin loops from the ones not forming loops. It significantly outperforms CTCF-MP, a machine learning model based on word2vec and boosted trees, when using DNA sequences only. Furthermore, we show that DNA motifs binding to several transcription factors, including ZNF384, ZNF263, ASCL1, SP1, and ZEB1, may constitute the complex sequence patterns for CTCF-mediated chromatin loop formation. DeepCTCFLoop has also been applied to disease-associated sequence variants to identify candidates that may disrupt chromatin loop formation. Therefore, our results provide useful information for understanding the mechanism of 3D genome organization and may also help annotate and prioritize the noncoding sequence variants associated with human diseases.
Collapse
Affiliation(s)
- Shuzhen Kuang
- Department of Genetics and Biochemistry, Clemson University, Clemson, South Carolina, USA.,Department of Biological Sciences, Clemson University, Clemson, South Carolina, USA
| | - Liangjiang Wang
- Department of Genetics and Biochemistry, Clemson University, Clemson, South Carolina, USA
| |
Collapse
|
62
|
Fudenberg G, Kelley DR, Pollard KS. Predicting 3D genome folding from DNA sequence with Akita. Nat Methods 2020; 17:1111-1117. [PMID: 33046897 PMCID: PMC8211359 DOI: 10.1038/s41592-020-0958-x] [Citation(s) in RCA: 100] [Impact Index Per Article: 25.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2019] [Accepted: 08/20/2020] [Indexed: 02/07/2023]
Abstract
In interphase, the human genome sequence folds in three dimensions into a rich variety of locus-specific contact patterns. Cohesin and CTCF (CCCTC-binding factor) are key regulators; perturbing the levels of either greatly disrupts genome-wide folding as assayed by chromosome conformation capture methods. Still, how a given DNA sequence encodes a particular locus-specific folding pattern remains unknown. Here we present a convolutional neural network, Akita, that accurately predicts genome folding from DNA sequence alone. Representations learned by Akita underscore the importance of an orientation-specific grammar for CTCF binding sites. Akita learns predictive nucleotide-level features of genome folding, revealing effects of nucleotides beyond the core CTCF motif. Once trained, Akita enables rapid in silico predictions. Accounting for this, we demonstrate how Akita can be used to perform in silico saturation mutagenesis, interpret eQTLs, make predictions for structural variants and probe species-specific genome folding. Collectively, these results enable decoding genome function from sequence through structure.
Collapse
Affiliation(s)
- Geoff Fudenberg
- Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA.
| | | | - Katherine S Pollard
- Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA.
- Department of Epidemiology and Biostatistics, Institute for Human Genetics, Quantitative Biology Institute, and Institute for Computational Health Sciences, University of California, San Francisco, CA, USA.
- Chan Zuckerberg Biohub, San Francisco, CA, USA.
| |
Collapse
|
63
|
Fudenberg G, Kelley DR, Pollard KS. Predicting 3D genome folding from DNA sequence with Akita. Nat Methods 2020; 17:1111-1117. [PMID: 33046897 DOI: 10.1101/800060] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2019] [Accepted: 08/20/2020] [Indexed: 05/20/2023]
Abstract
In interphase, the human genome sequence folds in three dimensions into a rich variety of locus-specific contact patterns. Cohesin and CTCF (CCCTC-binding factor) are key regulators; perturbing the levels of either greatly disrupts genome-wide folding as assayed by chromosome conformation capture methods. Still, how a given DNA sequence encodes a particular locus-specific folding pattern remains unknown. Here we present a convolutional neural network, Akita, that accurately predicts genome folding from DNA sequence alone. Representations learned by Akita underscore the importance of an orientation-specific grammar for CTCF binding sites. Akita learns predictive nucleotide-level features of genome folding, revealing effects of nucleotides beyond the core CTCF motif. Once trained, Akita enables rapid in silico predictions. Accounting for this, we demonstrate how Akita can be used to perform in silico saturation mutagenesis, interpret eQTLs, make predictions for structural variants and probe species-specific genome folding. Collectively, these results enable decoding genome function from sequence through structure.
Collapse
Affiliation(s)
- Geoff Fudenberg
- Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA.
| | | | - Katherine S Pollard
- Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA.
- Department of Epidemiology and Biostatistics, Institute for Human Genetics, Quantitative Biology Institute, and Institute for Computational Health Sciences, University of California, San Francisco, CA, USA.
- Chan Zuckerberg Biohub, San Francisco, CA, USA.
| |
Collapse
|
64
|
Schwessinger R, Gosden M, Downes D, Brown RC, Oudelaar AM, Telenius J, Teh YW, Lunter G, Hughes JR. DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat Methods 2020; 17:1118-1124. [PMID: 33046896 PMCID: PMC7610627 DOI: 10.1038/s41592-020-0960-3] [Citation(s) in RCA: 73] [Impact Index Per Article: 18.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2020] [Accepted: 08/20/2020] [Indexed: 01/29/2023]
Abstract
Predicting the impact of noncoding genetic variation requires interpreting it in the context of three-dimensional genome architecture. We have developed deepC, a transfer-learning-based deep neural network that accurately predicts genome folding from megabase-scale DNA sequence. DeepC predicts domain boundaries at high resolution, learns the sequence determinants of genome folding and predicts the impact of both large-scale structural and single base-pair variations.
Collapse
Affiliation(s)
- Ron Schwessinger
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK
- MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Matthew Gosden
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK
| | - Damien Downes
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK
| | - Richard C Brown
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
| | - A Marieke Oudelaar
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK
- MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK
| | - Jelena Telenius
- MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK
| | - Yee Whye Teh
- Department of Statistics, University of Oxford, Oxford, UK
| | - Gerton Lunter
- MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK.
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK.
| | - Jim R Hughes
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK.
- MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK.
| |
Collapse
|
65
|
Baur B, Shin J, Zhang S, Roy S. Data integration for inferring context-specific gene regulatory networks. CURRENT OPINION IN SYSTEMS BIOLOGY 2020; 23:38-46. [PMID: 33225112 PMCID: PMC7676633 DOI: 10.1016/j.coisb.2020.09.005] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Transcriptional regulatory networks control context-specific gene expression patterns and play important roles in normal and disease processes. Advances in genomics are rapidly increasing our ability to measure different components of the regulation machinery at the single-cell and bulk population level. An important challenge is to combine different types of regulatory genomic measurements to construct a more complete picture of gene regulatory networks across different disease, environmental, and developmental contexts. In this review, we focus on recent computational methods that integrate regulatory genomic data sets to infer context specificity and dynamics in regulatory networks.
Collapse
Affiliation(s)
- Brittany Baur
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, 53715, USA
| | - Junha Shin
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, 53715, USA
| | - Shilu Zhang
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, 53715, USA
| | - Sushmita Roy
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, 53715, USA
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, 53715, USA
| |
Collapse
|
66
|
Li Y, Ma A, Mathé EA, Li L, Liu B, Ma Q. Elucidation of Biological Networks across Complex Diseases Using Single-Cell Omics. Trends Genet 2020; 36:951-966. [PMID: 32868128 DOI: 10.1016/j.tig.2020.08.004] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2020] [Revised: 07/29/2020] [Accepted: 08/04/2020] [Indexed: 12/14/2022]
Abstract
Single-cell multimodal omics (scMulti-omics) technologies have made it possible to trace cellular lineages during differentiation and to identify new cell types in heterogeneous cell populations. The derived information is especially promising for computing cell-type-specific biological networks encoded in complex diseases and improving our understanding of the underlying gene regulatory mechanisms. The integration of these networks could, therefore, give rise to a heterogeneous regulatory landscape (HRL) in support of disease diagnosis and drug therapeutics. In this review, we provide an overview of this field and pay particular attention to how diverse biological networks can be inferred in a specific cell type based on integrative methods. Then, we discuss how HRL can advance our understanding of regulatory mechanisms underlying complex diseases and aid in the prediction of prognosis and therapeutic responses. Finally, we outline challenges and future trends that will be central to bringing the field of HRL in complex diseases forward.
Collapse
Affiliation(s)
- Yang Li
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA
| | - Anjun Ma
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA
| | - Ewy A Mathé
- Division of Preclinical Innovation, National Center for Advancing Translational Sciences, National Institutes of Health (NIH), Rockville, MD, 20892, USA
| | - Lang Li
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA
| | - Bingqiang Liu
- School of Mathematics, Shandong University, Jinan, Shandong, 250100, China.
| | - Qin Ma
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA.
| |
Collapse
|
67
|
A method for scoring the cell type-specific impacts of noncoding variants in personal genomes. Proc Natl Acad Sci U S A 2020; 117:21364-21372. [PMID: 32817564 PMCID: PMC7474608 DOI: 10.1073/pnas.1922703117] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/01/2023] Open
Abstract
Here we use the expression and accessibility data from a diverse set of cell types to learn a model for the dependence of the accessibility of a regulatory element on its DNA sequence and TF expression. Using GTEx samples with WGS data, we show that the noncoding variants predicted to affect accessibility are more strongly associated with the expression of nearby genes. To interpret a personal genome, we combine the sequence information with context-specific TF expression to prioritize variants and regulatory elements in any genomic region of interest. This approach should be helpful in the study of risk loci previously identified by GWAS. Results from analysis of height and WGS data from the GTEx project support this hypothesis. A person’s genome typically contains millions of variants which represent the differences between this personal genome and the reference human genome. The interpretation of these variants, i.e., the assessment of their potential impact on a person’s phenotype, is currently of great interest in human genetics and medicine. We have developed a prioritization tool called OpenCausal which takes as inputs 1) a personal genome and 2) a reference context-specific TF expression profile and returns a list of noncoding variants prioritized according to their impact on chromatin accessibility for any given genomic region of interest. We applied OpenCausal to 6,430 samples across 18 tissues derived from the GTEx project and found that the variants prioritized by OpenCausal are highly enriched for eQTLs and caQTLs. We further propose a strategy to integrate the predicted open scores with genome-wide association studies (GWAS) data to prioritize putative causal variants and regulatory elements for a given risk locus (i.e., fine-mapping analysis). As an initial example, we applied this method to a GWAS dataset of human height and found that the prioritized putative variants and elements are correlated with the phenotype (i.e., heights of individuals) better than others.
Collapse
|
68
|
Liu Q, Lv H, Jiang R. hicGAN infers super resolution Hi-C data with generative adversarial networks. Bioinformatics 2020; 35:i99-i107. [PMID: 31510693 PMCID: PMC6612845 DOI: 10.1093/bioinformatics/btz317] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open
Abstract
Motivation Hi-C is a genome-wide technology for investigating 3D chromatin conformation by measuring physical contacts between pairs of genomic regions. The resolution of Hi-C data directly impacts the effectiveness and accuracy of downstream analysis such as identifying topologically associating domains (TADs) and meaningful chromatin loops. High resolution Hi-C data are valuable resources which implicate the relationship between 3D genome conformation and function, especially linking distal regulatory elements to their target genes. However, high resolution Hi-C data across various tissues and cell types are not always available due to the high sequencing cost. It is therefore indispensable to develop computational approaches for enhancing the resolution of Hi-C data. Results We proposed hicGAN, an open-sourced framework, for inferring high resolution Hi-C data from low resolution Hi-C data with generative adversarial networks (GANs). To the best of our knowledge, this is the first study to apply GANs to 3D genome analysis. We demonstrate that hicGAN effectively enhances the resolution of low resolution Hi-C data by generating matrices that are highly consistent with the original high resolution Hi-C matrices. A typical scenario of usage for our approach is to enhance low resolution Hi-C data in new cell types, especially where the high resolution Hi-C data are not available. Our study not only presents a novel approach for enhancing Hi-C data resolution, but also provides fascinating insights into disclosing complex mechanism underlying the formation of chromatin contacts. Availability and implementation We release hicGAN as an open-sourced software at https://github.com/kimmo1019/hicGAN. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Qiao Liu
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic and Systems Biology, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing, China
| | - Hairong Lv
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic and Systems Biology, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing, China
| | - Rui Jiang
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic and Systems Biology, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing, China
| |
Collapse
|
69
|
|
70
|
Xu H, Zhang S, Yi X, Plewczynski D, Li MJ. Exploring 3D chromatin contacts in gene regulation: The evolution of approaches for the identification of functional enhancer-promoter interaction. Comput Struct Biotechnol J 2020; 18:558-570. [PMID: 32226593 PMCID: PMC7090358 DOI: 10.1016/j.csbj.2020.02.013] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2019] [Revised: 02/21/2020] [Accepted: 02/22/2020] [Indexed: 12/12/2022] Open
Abstract
Mechanisms underlying gene regulation are key to understand how multicellular organisms with various cell types develop from the same genetic blueprint. Dynamic interactions between enhancers and genes are revealed to play central roles in controlling gene transcription, but the determinants to link functional enhancer-promoter pairs remain elusive. A major challenge is the lack of reliable approach to detect and verify functional enhancer-promoter interactions (EPIs). In this review, we summarized the current methods for detecting EPIs and described how developing techniques facilitate the identification of EPI through assessing the merits and drawbacks of these methods. We also reviewed recent state-of-art EPI prediction methods in terms of their rationale, data usage and characterization. Furthermore, we briefly discussed the evolved strategies for validating functional EPIs.
Collapse
Affiliation(s)
- Hang Xu
- 2011 Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, Tianjin Medical University, Tianjin, China
| | - Shijie Zhang
- 2011 Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, Tianjin Medical University, Tianjin, China
- Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
| | - Xianfu Yi
- School of Biomedical Engineering, Tianjin Medical University, Tianjin, China
| | - Dariusz Plewczynski
- Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
- Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
| | - Mulin Jun Li
- 2011 Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, Tianjin Medical University, Tianjin, China
- Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
| |
Collapse
|
71
|
Hamamoto R, Komatsu M, Takasawa K, Asada K, Kaneko S. Epigenetics Analysis and Integrated Analysis of Multiomics Data, Including Epigenetic Data, Using Artificial Intelligence in the Era of Precision Medicine. Biomolecules 2019; 10:biom10010062. [PMID: 31905969 PMCID: PMC7023005 DOI: 10.3390/biom10010062] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2019] [Revised: 12/20/2019] [Accepted: 12/27/2019] [Indexed: 12/14/2022] Open
Abstract
To clarify the mechanisms of diseases, such as cancer, studies analyzing genetic mutations have been actively conducted for a long time, and a large number of achievements have already been reported. Indeed, genomic medicine is considered the core discipline of precision medicine, and currently, the clinical application of cutting-edge genomic medicine aimed at improving the prevention, diagnosis and treatment of a wide range of diseases is promoted. However, although the Human Genome Project was completed in 2003 and large-scale genetic analyses have since been accomplished worldwide with the development of next-generation sequencing (NGS), explaining the mechanism of disease onset only using genetic variation has been recognized as difficult. Meanwhile, the importance of epigenetics, which describes inheritance by mechanisms other than the genomic DNA sequence, has recently attracted attention, and, in particular, many studies have reported the involvement of epigenetic deregulation in human cancer. So far, given that genetic and epigenetic studies tend to be accomplished independently, physiological relationships between genetics and epigenetics in diseases remain almost unknown. Since this situation may be a disadvantage to developing precision medicine, the integrated understanding of genetic variation and epigenetic deregulation appears to be now critical. Importantly, the current progress of artificial intelligence (AI) technologies, such as machine learning and deep learning, is remarkable and enables multimodal analyses of big omics data. In this regard, it is important to develop a platform that can conduct multimodal analysis of medical big data using AI as this may accelerate the realization of precision medicine. In this review, we discuss the importance of genome-wide epigenetic and multiomics analyses using AI in the era of precision medicine.
Collapse
Affiliation(s)
- Ryuji Hamamoto
- Division of Molecular Modification and Cancer Biology, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan; (M.K.); (K.T.); (K.A.); (S.K.)
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
- Correspondence: ; Tel.: +81-3-3547-5271
| | - Masaaki Komatsu
- Division of Molecular Modification and Cancer Biology, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan; (M.K.); (K.T.); (K.A.); (S.K.)
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
| | - Ken Takasawa
- Division of Molecular Modification and Cancer Biology, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan; (M.K.); (K.T.); (K.A.); (S.K.)
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
| | - Ken Asada
- Division of Molecular Modification and Cancer Biology, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan; (M.K.); (K.T.); (K.A.); (S.K.)
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
| | - Syuzo Kaneko
- Division of Molecular Modification and Cancer Biology, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan; (M.K.); (K.T.); (K.A.); (S.K.)
| |
Collapse
|
72
|
Xiao M, Zhuang Z, Pan W. Local Epigenomic Data are more Informative than Local Genome Sequence Data in Predicting Enhancer-Promoter Interactions Using Neural Networks. Genes (Basel) 2019; 11:E41. [PMID: 31905774 PMCID: PMC7016741 DOI: 10.3390/genes11010041] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2019] [Revised: 12/23/2019] [Accepted: 12/26/2019] [Indexed: 12/13/2022] Open
Abstract
Enhancer-promoter interactions (EPIs) are crucial for transcriptional regulation. Mapping such interactions proves useful for understanding disease regulations and discovering risk genes in genome-wide association studies. Some previous studies showed that machine learning methods, as computational alternatives to costly experimental approaches, performed well in predicting EPIs from local sequence and/or local epigenomic data. In particular, deep learning methods were demonstrated to outperform traditional machine learning methods, and using DNA sequence data alone could perform either better than or almost as well as only utilizing epigenomic data. However, most, if not all, of these previous studies were based on randomly splitting enhancer-promoter pairs as training, tuning, and test data, which has recently been pointed out to be problematic; due to multiple and duplicating/overlapping enhancers (and promoters) in enhancer-promoter pairs in EPI data, such random splitting does not lead to independent training, tuning, and test data, thus resulting in model over-fitting and over-estimating predictive performance. Here, after correcting this design issue, we extensively studied the performance of various deep learning models with local sequence and epigenomic data around enhancer-promoter pairs. Our results confirmed much lower performance using either sequence or epigenomic data alone, or both, than reported previously. We also demonstrated that local epigenomic features were more informative than local sequence data. Our results were based on an extensive exploration of many convolutional neural network (CNN) and feed-forward neural network (FNN) structures, and of gradient boosting as a representative of traditional machine learning.
Collapse
Affiliation(s)
- Mengli Xiao
- Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA;
| | - Zhong Zhuang
- Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA;
| | - Wei Pan
- Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA;
| |
Collapse
|
73
|
Belokopytova PS, Nuriddinov MA, Mozheiko EA, Fishman D, Fishman V. Quantitative prediction of enhancer-promoter interactions. Genome Res 2019; 30:72-84. [PMID: 31804952 PMCID: PMC6961579 DOI: 10.1101/gr.249367.119] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2019] [Accepted: 11/25/2019] [Indexed: 11/24/2022]
Abstract
Recent experimental and computational efforts have provided large data sets describing three-dimensional organization of mouse and human genomes and showed the interconnection between the expression profile, epigenetic state, and spatial interactions of loci. These interconnections were utilized to infer the spatial organization of chromatin, including enhancer–promoter contacts, from one-dimensional epigenetic marks. Here, we show that the predictive power of some of these algorithms is overestimated due to peculiar properties of the biological data. We propose an alternative approach, which provides high-quality predictions of chromatin interactions using information on gene expression and CTCF-binding alone. Using multiple metrics, we confirmed that our algorithm could efficiently predict the three-dimensional architecture of both normal and rearranged genomes.
Collapse
Affiliation(s)
- Polina S Belokopytova
- Institute of Cytology and Genetics SB RAS 630090, Novosibirsk, Russia.,Novosibirsk State University, Novosibirsk, Russia 630090
| | | | | | - Daniil Fishman
- Novosibirsk State University, Novosibirsk, Russia 630090
| | - Veniamin Fishman
- Institute of Cytology and Genetics SB RAS 630090, Novosibirsk, Russia.,Novosibirsk State University, Novosibirsk, Russia 630090
| |
Collapse
|
74
|
Luo X, Chi W, Deng M. Deepprune: Learning Efficient and Interpretable Convolutional Networks Through Weight Pruning for Predicting DNA-Protein Binding. Front Genet 2019; 10:1145. [PMID: 31824562 PMCID: PMC6879555 DOI: 10.3389/fgene.2019.01145] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2019] [Accepted: 10/21/2019] [Indexed: 12/02/2022] Open
Abstract
Convolutional neural network (CNN) based methods have outperformed conventional machine learning methods in predicting the binding preference of DNA-protein binding. Although studies in the past have shown that more convolutional kernels help to achieve better performance, visualization of the model can be obscured by the use of many kernels, resulting in overfitting and reduced interpretation because the number of motifs in true models is limited. Therefore, we aim to arrive at high performance, but with limited kernel numbers, in CNN-based models for motif inference. We herein present Deepprune, a novel deep learning framework, which prunes the weights in the dense layer and fine-tunes iteratively. These two steps enable the training of CNN-based models with limited kernel numbers, allowing easy interpretation of the learned model. We demonstrate that Deepprune significantly improves motif inference performance for the simulated datasets. Furthermore, we show that Deepprune outperforms the baseline with limited kernel numbers when inferring DNA-binding sites from ChIP-seq data.
Collapse
Affiliation(s)
- Xiao Luo
- School of Mathematical Sciences, Peking University, Beijing, China
| | - Weilai Chi
- Center for Quantitative Biology, Peking University, Beijing, China
| | - Minghua Deng
- School of Mathematical Sciences, Peking University, Beijing, China
- Center for Quantitative Biology, Peking University, Beijing, China
| |
Collapse
|
75
|
Gan M, Li W, Jiang R. EnContact: predicting enhancer-enhancer contacts using sequence-based deep learning model. PeerJ 2019; 7:e7657. [PMID: 31565573 PMCID: PMC6746221 DOI: 10.7717/peerj.7657] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2019] [Accepted: 08/10/2019] [Indexed: 01/22/2023] Open
Abstract
Chromatin contacts between regulatory elements are of crucial importance for the interpretation of transcriptional regulation and the understanding of disease mechanisms. However, existing computational methods mainly focus on the prediction of interactions between enhancers and promoters, leaving enhancer-enhancer (E-E) interactions not well explored. In this work, we develop a novel deep learning approach, named Enhancer-enhancer contacts prediction (EnContact), to predict E-E contacts using genomic sequences as input. We statistically demonstrated the predicting ability of EnContact using training sets and testing sets derived from HiChIP data of seven cell lines. We also show that our model significantly outperforms other baseline methods. Besides, our model identifies finer-mapping E-E interactions from region-based chromatin contacts, where each region contains several enhancers. In addition, we identify a class of hub enhancers using the predicted E-E interactions and find that hub enhancers tend to be active across cell lines. We summarize that our EnContact model is capable of predicting E-E interactions using features automatically learned from genomic sequences.
Collapse
Affiliation(s)
- Mingxin Gan
- Donlinks School of Economics and Management, University of Science and Technology Beijing, Beijing, China
| | - Wenran Li
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic and Systems Biology, BNRist; Department of Automation, Tsinghua University, Beijing, China
- Department of Statistics, Stanford University, Stanford, CA, USA
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
| | - Rui Jiang
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic and Systems Biology, BNRist; Department of Automation, Tsinghua University, Beijing, China
| |
Collapse
|
76
|
Yin Q, Wu M, Liu Q, Lv H, Jiang R. DeepHistone: a deep learning approach to predicting histone modifications. BMC Genomics 2019; 20:193. [PMID: 30967126 PMCID: PMC6456942 DOI: 10.1186/s12864-019-5489-4] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022] Open
Abstract
MOTIVATION Quantitative detection of histone modifications has emerged in the recent years as a major means for understanding such biological processes as chromosome packaging, transcriptional activation, and DNA damage. However, high-throughput experimental techniques such as ChIP-seq are usually expensive and time-consuming, prohibiting the establishment of a histone modification landscape for hundreds of cell types across dozens of histone markers. These disadvantages have been appealing for computational methods to complement experimental approaches towards large-scale analysis of histone modifications. RESULTS We proposed a deep learning framework to integrate sequence information and chromatin accessibility data for the accurate prediction of modification sites specific to different histone markers. Our method, named DeepHistone, outperformed several baseline methods in a series of comprehensive validation experiments, not only within an epigenome but also across epigenomes. Besides, sequence signatures automatically extracted by our method was consistent with known transcription factor binding sites, thereby giving insights into regulatory signatures of histone modifications. As an application, our method was shown to be able to distinguish functional single nucleotide polymorphisms from their nearby genetic variants, thereby having the potential to be used for exploring functional implications of putative disease-associated genetic variants. CONCLUSIONS DeepHistone demonstrated the possibility of using a deep learning framework to integrate DNA sequence and experimental data for predicting epigenomic signals. With the state-of-the-art performance, DeepHistone was expected to shed light on a variety of epigenomic studies. DeepHistone is freely available in https://github.com/QijinYin/DeepHistone .
Collapse
Affiliation(s)
- Qijin Yin
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Laboratory for Information Science and Technology, Tsinghua University, Beijing, 100084, China
| | - Mengmeng Wu
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Laboratory for Information Science and Technology, Tsinghua University, Beijing, 100084, China
| | - Qiao Liu
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Laboratory for Information Science and Technology, Tsinghua University, Beijing, 100084, China
| | - Hairong Lv
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Laboratory for Information Science and Technology, Tsinghua University, Beijing, 100084, China.
| | - Rui Jiang
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Laboratory for Information Science and Technology, Tsinghua University, Beijing, 100084, China.
| |
Collapse
|