1
|
Kindel F, Triesch S, Schlüter U, Randarevitch LA, Reichel-Deland V, Weber APM, Denton AK. Predmoter-cross-species prediction of plant promoter and enhancer regions. BIOINFORMATICS ADVANCES 2024; 4:vbae074. [PMID: 38841126 PMCID: PMC11150885 DOI: 10.1093/bioadv/vbae074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/04/2024] [Revised: 04/10/2024] [Accepted: 05/22/2024] [Indexed: 06/07/2024]
Abstract
Motivation Identifying cis-regulatory elements (CREs) is crucial for analyzing gene regulatory networks. Next generation sequencing methods were developed to identify CREs but represent a considerable expenditure for targeted analysis of few genomic loci. Thus, predicting the outputs of these methods would significantly cut costs and time investment. Results We present Predmoter, a deep neural network that predicts base-wise Assay for Transposase Accessible Chromatin using sequencing (ATAC-seq) and histone Chromatin immunoprecipitation DNA-sequencing (ChIP-seq) read coverage for plant genomes. Predmoter uses only the DNA sequence as input. We trained our final model on 21 species for 13 of which ATAC-seq data and for 17 of which ChIP-seq data was publicly available. We evaluated our models on Arabidopsis thaliana and Oryza sativa. Our best models showed accurate predictions in peak position and pattern for ATAC- and histone ChIP-seq. Annotating putatively accessible chromatin regions provides valuable input for the identification of CREs. In conjunction with other in silico data, this can significantly reduce the search space for experimentally verifiable DNA-protein interaction pairs. Availability and implementation The source code for Predmoter is available at: https://github.com/weberlab-hhu/Predmoter. Predmoter takes a fasta file as input and outputs h5, and optionally bigWig and bedGraph files.
Collapse
Affiliation(s)
- Felicitas Kindel
- Institute of Plant Biochemistry, Math.-Nat. Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
| | - Sebastian Triesch
- Institute of Plant Biochemistry, Math.-Nat. Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
- Cluster of Excellence on Plant Sciences (CEPLAS), Germany
| | - Urte Schlüter
- Institute of Plant Biochemistry, Math.-Nat. Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
| | - Laura Alexandra Randarevitch
- Cluster of Excellence on Plant Sciences (CEPLAS), Germany
- Institute of Population Genetics, Math.-Nat. Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
| | - Vanessa Reichel-Deland
- Institute of Plant Biochemistry, Math.-Nat. Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
| | - Andreas P M Weber
- Institute of Plant Biochemistry, Math.-Nat. Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
- Cluster of Excellence on Plant Sciences (CEPLAS), Germany
| | - Alisandra K Denton
- Institute of Plant Biochemistry, Math.-Nat. Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
- Cluster of Excellence on Plant Sciences (CEPLAS), Germany
- Valence Labs, Montréal, Québec H2S 3H1, Canada
| |
Collapse
|
2
|
Ramakrishnan A, Wangensteen G, Kim S, Nestler EJ, Shen L. DeepRegFinder: deep learning-based regulatory elements finder. BIOINFORMATICS ADVANCES 2024; 4:vbae007. [PMID: 38343388 PMCID: PMC10858349 DOI: 10.1093/bioadv/vbae007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 06/14/2023] [Revised: 12/06/2023] [Accepted: 01/12/2024] [Indexed: 06/15/2024]
Abstract
Summary Enhancers and promoters are important classes of DNA regulatory elements (DREs) that govern gene expression. Identifying them at a genomic scale is a critical task in bioinformatics. The DREs often exhibit unique histone mark binding patterns, which can be captured by high-throughput ChIP-seq experiments. To account for the variations and noises among the binding sites, machine learning models are trained on known enhancer/promoter sites using histone mark ChIP-seq data and predict enhancers/promoters at other genomic regions. To this end, we have developed a highly customizable program named DeepRegFinder, which automates the entire process of data processing, model training, and prediction. We have employed convolutional and recurrent neural networks for model training and prediction. DeepRegFinder further categorizes enhancers and promoters into active and poised states, making it a unique and valuable feature for researchers. Our method demonstrates improved precision and recall in comparison to existing algorithms for enhancer prediction across multiple cell types. Moreover, our pipeline is modular and eliminates the tedious steps involved in preprocessing, making it easier for users to apply on their data quickly. Availability and implementation https://github.com/shenlab-sinai/DeepRegFinder.
Collapse
Affiliation(s)
- Aarthi Ramakrishnan
- Friedman Brain Institute and Nash Family Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| | - George Wangensteen
- Department of Computer Science, Brown University, Providence, RI 02912, United States
| | - Sarah Kim
- Cancer Program, Broad Institute, Cambridge, MA 02142, United States
| | - Eric J Nestler
- Friedman Brain Institute and Nash Family Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| | - Li Shen
- Friedman Brain Institute and Nash Family Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| |
Collapse
|
3
|
Barbero-Aparicio JA, Olivares-Gil A, Díez-Pastor JF, García-Osorio C. Deep learning and support vector machines for transcription start site identification. PeerJ Comput Sci 2023; 9:e1340. [PMID: 37346545 PMCID: PMC10280436 DOI: 10.7717/peerj-cs.1340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2022] [Accepted: 03/21/2023] [Indexed: 06/23/2023]
Abstract
Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have been proven to be exceptionally effective for this task, but their use in transcription start site identification has not yet been explored in depth. Also, the very few existing works do not compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The reduced amount of published papers in this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied in related problems with remarkable results, we compared their performance in transcription start site predictions, concluding that SVMs are computationally much slower, and deep learning methods, specially long short-term memory neural networks (LSTMs), are best suited to work with sequences than SVMs. For such a purpose, we used the reference human genome GRCh38. Additionally, we studied two different aspects related to data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models studied was also tested using the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices in transcription start site identification, as well as a method to generate transcription start site datasets including negative instances on any species available in Ensembl. We found that deep learning methods are better suited than SVMs to solve this problem, being more efficient and better adapted to long sequences and large amounts of data. We also create a transcription start site (TSS) dataset large enough to be used in deep learning experiments.
Collapse
Affiliation(s)
| | - Alicia Olivares-Gil
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
| | - José F. Díez-Pastor
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
| | - César García-Osorio
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
| |
Collapse
|
4
|
Osmala M, Eraslan G, Lähdesmäki H. ChromDMM: a Dirichlet-multinomial mixture model for clustering heterogeneous epigenetic data. Bioinformatics 2022; 38:3863-3870. [PMID: 35786716 PMCID: PMC9364382 DOI: 10.1093/bioinformatics/btac444] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2022] [Revised: 05/20/2022] [Accepted: 06/30/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Research on epigenetic modifications and other chromatin features at genomic regulatory elements elucidates essential biological mechanisms including the regulation of gene expression. Despite the growing number of epigenetic datasets, new tools are still needed to discover novel distinctive patterns of heterogeneous epigenetic signals at regulatory elements. RESULTS We introduce ChromDMM, a product Dirichlet-multinomial mixture model for clustering genomic regions that are characterized by multiple chromatin features. ChromDMM extends the mixture model framework by profile shifting and flipping that can probabilistically account for inaccuracies in the position and strand-orientation of the genomic regions. Owing to hyper-parameter optimization, ChromDMM can also regularize the smoothness of the epigenetic profiles across the consecutive genomic regions. With simulated data, we demonstrate that ChromDMM clusters, shifts and strand-orients the profiles more accurately than previous methods. With ENCODE data, we show that the clustering of enhancer regions in the human genome reveals distinct patterns in several chromatin features. We further validate the enhancer clusters by their enrichment for transcriptional regulatory factor binding sites. AVAILABILITY AND IMPLEMENTATION ChromDMM is implemented as an R package and is available at https://github.com/MariaOsmala/ChromDMM. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | | | - Harri Lähdesmäki
- Department of Computer Science, Aalto University, Espoo 02150, Finland
| |
Collapse
|
5
|
Jain M, Garg R. Enhancers as potential targets for engineering salinity stress tolerance in crop plants. PHYSIOLOGIA PLANTARUM 2021; 173:1382-1391. [PMID: 33837536 DOI: 10.1111/ppl.13421] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/08/2021] [Revised: 03/19/2021] [Accepted: 04/06/2021] [Indexed: 06/12/2023]
Abstract
Enhancers represent noncoding regulatory regions of the genome located distantly from their target genes. They regulate gene expression programs in a context-specific manner via interacting with promoters of one or more target genes and are generally associated with transcription factor binding sites and epi(genomic)/chromatin features, such as regions of chromatin accessibility and histone modifications. The enhancers are difficult to identify due to the modularity of their associated features. Although enhancers have been studied extensively in human and animals, only a handful of them has been identified in few plant species till date due to nonavailability of plant-specific experimental and computational approaches for their discovery. Being an important regulatory component of the genome, enhancers represent potential targets for engineering agronomic traits, including salinity stress tolerance in plants. Here, we provide a review of the available experimental and computational approaches along with the associated sequence and chromatin/epigenetic features for the discovery of enhancers in plants. In addition, we provide insights into the challenges and future prospects of enhancer research in plant biology with emphasis on potential applications in engineering salinity stress tolerance in crop plants.
Collapse
Affiliation(s)
- Mukesh Jain
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
| | - Rohini Garg
- Department of Life Sciences, School of Natural Sciences, Shiv Nadar University, Gautam Buddha Nagar, Uttar Pradesh, India
| |
Collapse
|
6
|
Zhang TH, Flores M, Huang Y. ES-ARCNN: Predicting enhancer strength by using data augmentation and residual convolutional neural network. Anal Biochem 2021; 618:114120. [PMID: 33535061 DOI: 10.1016/j.ab.2021.114120] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2020] [Revised: 12/13/2020] [Accepted: 01/21/2021] [Indexed: 02/06/2023]
Abstract
Enhancers are non-coding DNA sequences bound by proteins called transcription factors. They function as distant regulators of gene transcription and participate in the development and maintenance of cell types and tissues. Since experimental validation of enhancers is expensive and time-consuming, many computational methods have been developed to predict enhancers and their strength. However, most of these methods still lack good performance in the prediction of enhancer strength. Here, we present a method to predict Enhancers Strength (i.e., strong and weak) by using Augmented data and Residual Convolutional Neural Network (ES-ARCNN). To train ES-ARCNN, we used two data augmentation tricks (i.e., reverse complement and shift) to previously identified enhancers for enlarging a previously identified dataset of enhancers. We further employed a residual convolutional neural network and trained it using the augmented dataset. Compared with other state-of-the-art methods in the 10-fold cross-validation (CV) test, ES-ARCNN has the best performance with the accuracy of 66.17%, and the tricks of data augmentation can effectively improve the prediction performance. We further tested ES-ARCNN on an independent dataset and obtained 65.5% accuracy, which has more than 4% improvement over the other three existing methods. The results in 10CV and independent tests show that ES-ARCNN can effectively predict the enhancer strength. The transcription factor binding sites (TFBSs) enrichment analysis shows that from the mechanistic perspective, enhancer strength is associated with a higher density of important TFBSs in a tissue. A user-friendly web-application is also provided at http://compgenomics.utsa.edu/ES-ARCNN/.
Collapse
Affiliation(s)
- Ting-He Zhang
- Department of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX, 78249-0669, USA
| | - Mario Flores
- Department of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX, 78249-0669, USA.
| | - Yufei Huang
- Department of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX, 78249-0669, USA; Department of Populational Health Science, The University of Texas Health San Antonio, San Antonio, TX, 78229, USA.
| |
Collapse
|