1
Zheng Y, Li Q, Freiberger MI, Song H, Hu G, Zhang M, Gu R, Li J. Predicting the Dynamic Interaction of Intrinsically Disordered Proteins. J Chem Inf Model 2024; 64:6768-6777. [PMID: 39163306 DOI: 10.1021/acs.jcim.4c00930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/22/2024]
Abstract
Intrinsically disordered proteins (IDPs) participate in various biological processes. Interactions involving IDPs are usually dynamic and are affected by their inherent conformational fluctuations. Comprehensive characterization of these interactions with current techniques is challenging. Here, we present GSALIDP, a GraphSAGE-embedded LSTM network, to capture the dynamic nature of IDP-involved interactions and predict their behaviors. This framework models multiple conformations of an IDP as a dynamic graph, which can effectively describe the fluctuation of its flexible conformation. The dynamic interaction between IDPs is studied, and the data sets of IDP conformations and their interactions are obtained through atomistic molecular dynamics (MD) simulations. Residues of an IDP are encoded through a series of features, including their frustration. GSALIDP can effectively predict the interaction sites of an IDP and the contact residue pairs between IDPs. Its performance in predicting IDP interactions is on par with or even better than that of conventional models in predicting the interactions of structural proteins. To the best of our knowledge, this is the first model to extend protein interaction prediction to IDP-involved interactions.
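The idea of treating several conformations as a dynamic graph can be illustrated with a minimal sketch. The abstract does not say how edges are defined, so the distance-based contact rule, the 8 Å cutoff, and the function names `contact_graph` and `dynamic_graph` below are all assumptions made for illustration, not the GSALIDP implementation.

```python
from math import dist

def contact_graph(coords, cutoff=8.0):
    """Return the set of residue pairs (i, j), i < j, closer than `cutoff`.

    `coords` holds one (x, y, z) tuple per residue; the cutoff (in angstroms)
    is an assumed value, not taken from the paper.
    """
    edges = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if dist(coords[i], coords[j]) < cutoff:
                edges.add((i, j))
    return edges

def dynamic_graph(frames, cutoff=8.0):
    """Build one contact graph per conformation, in trajectory order."""
    return [contact_graph(frame, cutoff) for frame in frames]

# Two toy conformations of a three-residue chain.
frames = [
    [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)],
]
graphs = dynamic_graph(frames)
```

The node set stays fixed while the edge set changes from frame to frame, which is the kind of sequence a recurrent model could consume one snapshot at a time.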
Affiliation(s)
- Yuchuan Zheng
  - School of Physics, Zhejiang University, Hangzhou 310058, PR China
- Qixiu Li
  - School of Physics, Zhejiang University, Hangzhou 310058, PR China
- Maria I Freiberger
  - Protein Physiology Lab, Departamento de Quimica Biologica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires-CONICET-IQUIBICEN, Buenos Aires C1428EGA, Argentina
- Haoyu Song
  - School of Physics, Zhejiang University, Hangzhou 310058, PR China
- Guorong Hu
  - School of Physics, Zhejiang University, Hangzhou 310058, PR China
- Moxin Zhang
  - School of Physics, Zhejiang University, Hangzhou 310058, PR China
- Ruoxu Gu
  - School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, PR China
- Jingyuan Li
  - School of Physics, Zhejiang University, Hangzhou 310058, PR China
2
Jing Zhang F, Zhang SW, Zhang S. Prediction of Transcription Factor Binding Sites With an Attention Augmented Convolutional Neural Network. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:3614-3623. [PMID: 34752400 DOI: 10.1109/tcbb.2021.3126623] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/13/2023]
Abstract
Identification of transcription factor binding sites (TFBSs) is essential for revealing the rules of protein-DNA binding. Although some computational methods have been presented to predict TFBSs using epigenomic and sequence features, most of them ignore the features shared across cell types. It is still unclear to what extent these common features could help with this task. To this end, we propose a new method, the Attention-augmented Convolutional Neural Network (ACNN), to predict TFBSs. ACNN uses attention-augmented convolutional layers to capture global and local contexts in DNA sequences and employs convolutional layers to capture features of histone modification markers. In addition, ACNN adopts private and shared convolutional neural network (CNN) modules to learn cell-type-specific and common features, respectively. To encourage the shared CNN module to learn the common features, adversarial training is applied in ACNN. The results on 253 ChIP-seq datasets show that ACNN outperforms other existing methods. The attention-augmented convolutional layers and the adversarial training mechanism in ACNN effectively improve prediction performance. Moreover, when labeled data are limited, ACNN also performs better than a baseline method. We further visualize the convolution kernels as motifs to illustrate the interpretability of ACNN.
3
Towards a better understanding of TF-DNA binding prediction from genomic features. Comput Biol Med 2022; 149:105993. [DOI: 10.1016/j.compbiomed.2022.105993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 07/12/2022] [Accepted: 08/14/2022] [Indexed: 11/17/2022]
4
Abstract
The tremendous amount of biological sequence data available, combined with the recent methodological breakthrough in deep learning in domains such as computer vision or natural language processing, is leading today to the transformation of bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that the use of deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of the genome functions and the possibility to write synthetic genomic sequences.
5
Jing F, Zhang SW, Zhang S. Prediction of the transcription factor binding sites with meta-learning. Methods 2022; 203:207-213. [DOI: 10.1016/j.ymeth.2022.04.010] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/11/2021] [Revised: 04/01/2022] [Accepted: 04/17/2022] [Indexed: 11/26/2022]
6
InsuLock: A Weakly Supervised Learning Approach for Accurate Insulator Prediction, and Variant Impact Quantification. Genes (Basel) 2022; 13:genes13040621. [PMID: 35456427 PMCID: PMC9026820 DOI: 10.3390/genes13040621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2022] [Revised: 03/24/2022] [Accepted: 03/25/2022] [Indexed: 02/01/2023]
Abstract
Mapping chromatin insulator loops is crucial to investigating genome evolution, elucidating critical biological functions, and ultimately quantifying variant impact in diseases. However, chromatin conformation profiling assays are usually expensive, time-consuming, and may report fuzzy insulator annotations with low resolution. Therefore, we propose a weakly supervised deep learning method, InsuLock, to address these challenges. Specifically, InsuLock first utilizes a Siamese neural network to predict the existence of insulators within a given region (up to 2000 bp). Then, it uses an object detection module for precise insulator boundary localization via gradient-weighted class activation mapping (~40 bp resolution). Finally, it quantifies variant impacts by comparing the insulator score differences between the wild-type and mutant alleles. We applied InsuLock on various bulk and single-cell datasets for performance testing and benchmarking. We showed that it outperformed existing methods with an AUROC of ~0.96 and condensed insulator annotations to ~2.5% of their original size while still demonstrating higher conservation scores and better motif enrichments. Finally, we utilized InsuLock to make cell-type-specific variant impacts from brain scATAC-seq data and identified a schizophrenia GWAS variant disrupting an insulator loop proximal to a known risk gene, indicating a possible new mechanism of action for the disease.
7
Zhang Y, Wang Z, Zeng Y, Zhou J, Zou Q. High-resolution transcription factor binding sites prediction improved performance and interpretability by deep learning method. Brief Bioinform 2021; 22:6322761. [PMID: 34272562 DOI: 10.1093/bib/bbab273] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 04/30/2021] [Revised: 06/19/2021] [Accepted: 06/25/2021] [Indexed: 11/14/2022]
Abstract
Transcription factors (TFs) are essential proteins in regulating the spatiotemporal expression of genes. It is crucial to infer potential transcription factor binding sites (TFBSs) at high resolution to advance biology and realize precision medicine. Recently, deep learning-based models have shown exemplary performance in the prediction of TFBSs at the base-pair level. However, previous models fail to integrate nucleotide position information and semantic information without noisy responses, so there is still room for improvement. Moreover, both the inner mechanism and the prediction results of these models are challenging to interpret. To this end, the Deep Attentive Encoder-Decoder Neural Network (D-AEDNet) is developed to identify the location of TF-DNA binding sites in DNA sequences. In particular, our model adopts a Skip Architecture to leverage the nucleotide position information in the encoder and uses an Attention Gate to remove noisy responses in the information fusion process. In addition, Transcription Factor Motif Discovery based on Sliding Window (TF-MoDSW), an approach that discovers TF-DNA binding motifs from the output of neural networks, is proposed to understand the biological meaning of the predicted results. On ChIP-exo datasets, experimental results show that D-AEDNet performs better than competing methods. We also verify, by way of visualization analysis, that the Attention Gate can improve the interpretability of our model. Furthermore, experiments on ChIP-seq datasets confirm that D-AEDNet outperforms state-of-the-art methods in learning TF-DNA binding motifs and that TF-MoDSW can discover biological sequence motifs in TF-DNA interactions.
Affiliation(s)
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Zixuan Wang
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Yuanqi Zeng
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Jiliu Zhou
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, China
8
Degtyareva AO, Antontseva EV, Merkulova TI. Regulatory SNPs: Altered Transcription Factor Binding Sites Implicated in Complex Traits and Diseases. Int J Mol Sci 2021; 22:6454. [PMID: 34208629 PMCID: PMC8235176 DOI: 10.3390/ijms22126454] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 05/05/2021] [Revised: 06/15/2021] [Accepted: 06/15/2021] [Indexed: 12/19/2022]
Abstract
The vast majority of the genetic variants (mainly SNPs) associated with various human traits and diseases map to the noncoding part of the genome and are enriched in its regulatory compartment, suggesting that many causal variants may affect gene expression. The leading mechanism of action of these SNPs is the alteration of transcription factor binding via creation or disruption of transcription factor binding sites (TFBSs) or a change in the affinity of these regulatory proteins for their cognate sites. In this review, we first focus on the history of the discovery of regulatory SNPs (rSNPs) and give a systematic description of the existing methodological approaches to their study. Then, we briefly review recent comprehensive examples of rSNPs that were studied from the discovery of a change in the TFBS sequence caused by a nucleotide substitution, through identification of its effect on target gene expression, to the eventual phenotype. We also describe state-of-the-art genome-wide approaches to the identification of regulatory variants, including both approaches that make molecular sense of genome-wide association studies (GWAS) and alternative approaches whose primary goal is to determine the functionality of genetic variants. Among these approaches, special attention is paid to expression quantitative trait loci (eQTL) analysis and the search for allele-specific events in RNA-seq (ASE events) as well as in ChIP-seq, DNase-seq, and ATAC-seq (ASB events) data.
Affiliation(s)
- Arina O. Degtyareva
  - Department of Molecular Genetic, Institute of Cytology and Genetics, 630090 Novosibirsk, Russia
- Elena V. Antontseva
  - Department of Molecular Genetic, Institute of Cytology and Genetics, 630090 Novosibirsk, Russia
- Tatiana I. Merkulova
  - Department of Molecular Genetic, Institute of Cytology and Genetics, 630090 Novosibirsk, Russia
  - Department of Natural Sciences, Novosibirsk State University, 630090 Novosibirsk, Russia
9
Tao H, Li H, Xu K, Hong H, Jiang S, Du G, Wang J, Sun Y, Huang X, Ding Y, Li F, Zheng X, Chen H, Bo X. Computational methods for the prediction of chromatin interaction and organization using sequence and epigenomic profiles. Brief Bioinform 2021; 22:6102668. [PMID: 33454752 PMCID: PMC8424394 DOI: 10.1093/bib/bbaa405] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 08/21/2020] [Revised: 11/26/2020] [Accepted: 12/10/2020] [Indexed: 12/14/2022]
Abstract
The exploration of three-dimensional chromatin interaction and organization provides insight into mechanisms underlying gene regulation, cell differentiation and disease development. Advances in chromosome conformation capture technologies, such as high-throughput chromosome conformation capture (Hi-C) and chromatin interaction analysis by paired-end tag (ChIA-PET), have enabled the exploration of chromatin interaction and organization. However, high-resolution Hi-C and ChIA-PET data are only available for a limited number of cell lines, and their acquisition is costly, time-consuming, laborious and affected by theoretical limitations. Increasing evidence shows that DNA sequence and epigenomic features are informative predictors of regulatory interaction and chromatin architecture. Based on these features, numerous computational methods have been developed for the prediction of chromatin interaction and organization, but they are not yet widely applied in biomedical studies. A systematic study that summarizes and evaluates such methods is still needed to facilitate their application. Here, we summarize 48 computational methods for the prediction of chromatin interaction and organization using sequence and epigenomic profiles, categorize them and compare their performance. In addition, we provide a comprehensive guideline for selecting suitable methods to predict chromatin interaction and organization based on the available data and the biological question of interest.
Affiliation(s)
- Huan Tao
  - Beijing Institute of Radiation Medicine
- Hao Li
  - Beijing Institute of Radiation Medicine
- Kang Xu
  - Beijing Institute of Radiation Medicine
- Hao Hong
  - Beijing Institute of Radiation Medicine, Department of Biotechnology
- Shuai Jiang
  - Beijing Institute of Radiation Medicine, Department of Biotechnology
- Guifang Du
  - Beijing Institute of Radiation Medicine, Department of Biotechnology
- Yu Sun
  - Beijing Institute of Radiation Medicine, Department of Biotechnology
- Xin Huang
  - Beijing Institute of Radiation Medicine, Department of Biotechnology
- Yang Ding
  - Beijing Institute of Radiation Medicine
- Fei Li
  - Chinese Academy of Sciences, Department of Computer Network Information Center
10
Rozenwald MB, Galitsyna AA, Sapunov GV, Khrameeva EE, Gelfand MS. A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features. PeerJ Comput Sci 2020; 6:e307. [PMID: 33816958 PMCID: PMC7924456 DOI: 10.7717/peerj-cs.307] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/28/2020] [Accepted: 09/30/2020] [Indexed: 05/03/2023]
Abstract
Technological advances have led to the creation of large epigenetic datasets, including information about DNA-binding proteins and DNA spatial structure. Hi-C experiments have revealed that chromosomes are subdivided into sets of self-interacting domains called Topologically Associating Domains (TADs). TADs are involved in the regulation of gene expression activity, but the mechanisms of their formation are not yet fully understood. Here, we focus on machine learning methods to characterize DNA folding patterns in Drosophila based on chromatin marks across three cell lines. We present linear regression models with four types of regularization, gradient boosting, and recurrent neural networks (RNNs) as tools to study chromatin folding characteristics associated with TADs given epigenetic chromatin immunoprecipitation data. The bidirectional long short-term memory RNN architecture produced the best prediction scores and identified biologically relevant features. The distributions of the protein Chriz (Chromator) and the histone modification H3K4me3 were selected as the most informative features for the prediction of TAD characteristics. This approach may be adapted to any similar biological dataset of chromatin features across various cell lines and species. The code for the implemented pipeline, Hi-ChIP-ML, is publicly available: https://github.com/MichalRozenwald/Hi-ChIP-ML.
Affiliation(s)
- Michal B. Rozenwald
  - Faculty of Computer Science, National Research University Higher School of Economics, Moscow, Russia
- Grigory V. Sapunov
  - Faculty of Computer Science, National Research University Higher School of Economics, Moscow, Russia
  - Intento, Inc., Berkeley, CA, USA
- Mikhail S. Gelfand
  - Skolkovo Institute of Science and Technology, Moscow, Russia
  - A.A. Kharkevich Institute for Information Transmission Problems, RAS, Moscow, Russia
11
Jing F, Zhang SW, Zhang S. Prediction of enhancer-promoter interactions using the cross-cell type information and domain adversarial neural network. BMC Bioinformatics 2020; 21:507. [PMID: 33160328 PMCID: PMC7648314 DOI: 10.1186/s12859-020-03844-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 07/31/2020] [Accepted: 10/27/2020] [Indexed: 12/27/2022]
Abstract
BACKGROUND: Enhancer-promoter interactions (EPIs) play key roles in transcriptional regulation and disease progression. Although several computational methods have been developed to predict such interactions, their performance is not satisfactory when the training and testing data come from different cell lines. It is still unclear to what extent predictions across cell lines can be made from sequence-level information. RESULTS: In this work, we present a novel Sequence-based method (called SEPT) to predict enhancer-promoter interactions in a new cell line by using cross-cell information and Transfer learning. SEPT first learns the features of enhancers and promoters from DNA sequences with a convolutional neural network (CNN), and then uses the gradient reversal layer of transfer learning to reduce the cell-line-specific features while retaining the features associated with EPIs. When the locations of enhancers and promoters are provided for a new cell line, SEPT can successfully recognize EPIs in this new cell line based on labeled data from other cell lines. The experimental results show that SEPT can effectively learn the latent important EPI-related features shared between cell lines and achieves the best prediction performance in terms of AUC (the area under the receiver operating characteristic curve). CONCLUSIONS: SEPT is an effective method for predicting EPIs in a new cell line. The domain adversarial architecture of transfer learning used in SEPT can learn the latent EPI features shared among cell lines from all other existing labeled data. It can be expected that SEPT will be of interest to researchers concerned with biological interaction prediction.
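Since the results above are reported as AUC, a small worked example may help. The sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The function name `auc` and the toy labels and scores are made up for illustration and are not data from the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` are 1 (interacting pair) or 0 (non-interacting pair);
    `scores` are the model's predicted scores in the same order.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
value = auc(labels, scores)
```

The pairwise loop is quadratic in the number of examples; it is written this way for readability, not for large datasets.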
Affiliation(s)
- Fang Jing
  - Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, 127 West Youyi Road, Xi’an, 710072 Shaanxi China
- Shao-Wu Zhang
  - Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, 127 West Youyi Road, Xi’an, 710072 Shaanxi China
- Shihua Zhang
  - NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 55 Zhongguancun East Road, Beijing, 10090 China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, 100049 China
  - Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223 China
12
Xi J, Yuan X, Wang M, Li A, Li X, Huang Q. Inferring subgroup-specific driver genes from heterogeneous cancer samples via subspace learning with subgroup indication. Bioinformatics 2020; 36:1855-1863. [PMID: 31626284 DOI: 10.1093/bioinformatics/btz793] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 06/17/2019] [Revised: 09/23/2019] [Accepted: 10/16/2019] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Detecting driver genes from gene mutation data is a fundamental task for tumorigenesis research. Because cancer is a heterogeneous disease with various subgroups, subgroup-specific driver genes are key factors in the development of precision medicine for heterogeneous cancer. However, existing driver gene detection methods are not designed to identify the subgroup specificity of the driver genes they detect and therefore cannot indicate which group of patients is associated with those genes, which makes it difficult to provide specific clinical guidance for individual patients. RESULTS: By incorporating the subspace learning framework, we propose a novel bioinformatics method called DriverSub, which can efficiently predict subgroup-specific driver genes when subgroup annotations are not available. When evaluated on simulation datasets with known ground truth and compared with existing methods, DriverSub yields the best prediction of driver genes and the best inference of their related subgroups. When we apply DriverSub to the mutation data of real heterogeneous cancers, we observe that its predictions are highly enriched for experimentally validated known driver genes. Moreover, the subgroups inferred by DriverSub are significantly associated with the annotated molecular subgroups, indicating its capability of predicting subgroup-specific driver genes. AVAILABILITY AND IMPLEMENTATION: The source code is publicly available at https://github.com/JianingXi/DriverSub. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jianing Xi
  - School of Mechanical Engineering, Northwestern Polytechnical University, Xi'an, 710072, China
  - Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an, 710072, China
- Xiguo Yuan
  - School of Computer Science and Technology, Xidian University, Xi'an 710071, China
- Minghui Wang
  - School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China
- Ao Li
  - School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China
- Xuelong Li
  - Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an, 710072, China
  - School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China
- Qinghua Huang
  - School of Mechanical Engineering, Northwestern Polytechnical University, Xi'an, 710072, China
  - Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an, 710072, China
13
14
Xiao M, Zhuang Z, Pan W. Local Epigenomic Data are more Informative than Local Genome Sequence Data in Predicting Enhancer-Promoter Interactions Using Neural Networks. Genes (Basel) 2019; 11:E41. [PMID: 31905774 PMCID: PMC7016741 DOI: 10.3390/genes11010041] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/29/2019] [Revised: 12/23/2019] [Accepted: 12/26/2019] [Indexed: 12/13/2022]
Abstract
Enhancer-promoter interactions (EPIs) are crucial for transcriptional regulation. Mapping such interactions proves useful for understanding disease regulations and discovering risk genes in genome-wide association studies. Some previous studies showed that machine learning methods, as computational alternatives to costly experimental approaches, performed well in predicting EPIs from local sequence and/or local epigenomic data. In particular, deep learning methods were demonstrated to outperform traditional machine learning methods, and using DNA sequence data alone could perform either better than or almost as well as only utilizing epigenomic data. However, most, if not all, of these previous studies were based on randomly splitting enhancer-promoter pairs as training, tuning, and test data, which has recently been pointed out to be problematic; due to multiple and duplicating/overlapping enhancers (and promoters) in enhancer-promoter pairs in EPI data, such random splitting does not lead to independent training, tuning, and test data, thus resulting in model over-fitting and over-estimating predictive performance. Here, after correcting this design issue, we extensively studied the performance of various deep learning models with local sequence and epigenomic data around enhancer-promoter pairs. Our results confirmed much lower performance using either sequence or epigenomic data alone, or both, than reported previously. We also demonstrated that local epigenomic features were more informative than local sequence data. Our results were based on an extensive exploration of many convolutional neural network (CNN) and feed-forward neural network (FNN) structures, and of gradient boosting as a representative of traditional machine learning.
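The leakage problem described above, where randomly split pairs share enhancers or promoters across training and test data, can be made concrete with a short sketch. The abstract does not state how the authors corrected the split, so the approach below is only one possible remedy and an assumption: pairs that share an enhancer or a promoter are merged into groups with a union-find structure, and whole groups are then assigned to one side. The helper names `group_pairs` and `split_by_group`, the `test_fraction` parameter, and the toy identifiers are hypothetical.

```python
def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group_pairs(pairs):
    """Group enhancer-promoter pairs that share an enhancer or a promoter."""
    parent = {}
    for enh, pro in pairs:
        e, p = ("E", enh), ("P", pro)
        parent.setdefault(e, e)
        parent.setdefault(p, p)
        parent[find(parent, e)] = find(parent, p)  # merge the two components
    groups = {}
    for enh, pro in pairs:
        groups.setdefault(find(parent, ("E", enh)), []).append((enh, pro))
    return list(groups.values())

def split_by_group(pairs, test_fraction=0.3):
    """Assign whole groups to test (smallest first) until the target size is met."""
    groups = sorted(group_pairs(pairs), key=len)
    target = test_fraction * len(pairs)
    train, test = [], []
    for g in groups:
        (test if len(test) < target else train).extend(g)
    return train, test

# Toy data: e1/e2 are linked through p2, and e4 has two promoters.
pairs = [("e1", "p1"), ("e1", "p2"), ("e2", "p2"),
         ("e3", "p3"), ("e4", "p4"), ("e4", "p5")]
train, test = split_by_group(pairs)
```

Because groups are kept intact, the realized test size can overshoot `test_fraction`; the point of the sketch is only that no enhancer or promoter appears on both sides of the split.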
Affiliation(s)
- Mengli Xiao
  - Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA
- Zhong Zhuang
  - Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA
- Wei Pan
  - Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA