1
Hu W, Li Y, Wu Y, Guan L, Li M. A deep learning model for DNA enhancer prediction based on nucleotide position aware feature encoding. iScience 2024; 27:110030. [PMID: 38868182 PMCID: PMC11167433 DOI: 10.1016/j.isci.2024.110030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2024] [Revised: 04/23/2024] [Accepted: 05/16/2024] [Indexed: 06/14/2024] Open
Abstract
Enhancers are genomic DNA elements that regulate the expression of neighboring genes, which is crucial for biological processes such as cell differentiation and stress response. However, current machine learning methods for predicting DNA enhancers often underutilize hidden features in gene sequences, limiting model accuracy. Hence, this article proposes the PDCNN model, a deep learning-based enhancer prediction method. PDCNN extracts statistical nucleotide representations from gene sequences, discerning positional distribution information of nucleotides in modifier-like DNA sequences. Built as a convolutional neural network, PDCNN employs dual convolutional and fully connected layers. The model is updated iteratively by gradient descent on a cross-entropy loss function, enhancing prediction accuracy. Model parameters are fine-tuned to select optimal combinations for training, achieving over 95% accuracy. Comparative analysis with traditional methods and existing models demonstrates PDCNN's robust feature extraction capability. It outperforms advanced machine learning methods in identifying DNA enhancers, presenting an effective method with broad implications for genomics, biology, and medical research.
Affiliation(s)
- Wenxing Hu
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, Jiangxi, China
- Yelin Li
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, Jiangxi, China
- Yan Wu
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, Jiangxi, China
- Lixin Guan
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, Jiangxi, China
- Mengshan Li
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, Jiangxi, China
2
Gaynor-Gillett SC, Cheng L, Shi M, Liu J, Wang G, Spector M, Flaherty M, Wall M, Hwang A, Gu M, Chen Z, Chen Y, Consortium P, Moran JR, Zhang J, Lee D, Gerstein M, Geschwind D, White KP. Validation of Enhancer Regions in Primary Human Neural Progenitor Cells using Capture STARR-seq. bioRxiv: the preprint server for biology 2024:2024.03.14.585066. [PMID: 38562832 PMCID: PMC10983874 DOI: 10.1101/2024.03.14.585066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Genome-wide association studies (GWAS) and expression analyses implicate noncoding regulatory regions as harboring risk factors for psychiatric disease, but functional characterization of these regions remains limited. We performed capture STARR-sequencing of over 78,000 candidate regions to identify active enhancers in primary human neural progenitor cells (phNPCs). We selected candidate regions by integrating data from NPCs, prefrontal cortex, developmental timepoints, and GWAS. Over 8,000 regions demonstrated enhancer activity in the phNPCs, and we linked these regions to over 2,200 predicted target genes. These genes are involved in neuronal and psychiatric disease-associated pathways, including dopaminergic synapse, axon guidance, and schizophrenia. We functionally validated a subset of these enhancers using mutation STARR-sequencing and CRISPR deletions, demonstrating the effects of genetic variation on enhancer activity and enhancer deletion on gene expression. Overall, we identified thousands of highly active enhancers and functionally validated a subset of these enhancers, improving our understanding of regulatory networks underlying brain function and disease.
Affiliation(s)
- Sophia C. Gaynor-Gillett
  - Tempus Labs, Inc.; Chicago, IL, 60654, USA
  - Department of Biology, Cornell College; Mount Vernon, IA, 52314, USA
- Manman Shi
  - Tempus Labs, Inc.; Chicago, IL, 60654, USA
- Jason Liu
  - Computational Biology and Bioinformatics Program, Yale University; New Haven, CT, 06511, USA
- Gaoyuan Wang
  - Computational Biology and Bioinformatics Program, Yale University; New Haven, CT, 06511, USA
- Ahyeon Hwang
  - Department of Computer Science, University of California Irvine; Irvine, CA, 92697, USA
- Mengting Gu
  - Computational Biology and Bioinformatics Program, Yale University; New Haven, CT, 06511, USA
- Zhanlin Chen
  - Computational Biology and Bioinformatics Program, Yale University; New Haven, CT, 06511, USA
- Yuhang Chen
  - Computational Biology and Bioinformatics Program, Yale University; New Haven, CT, 06511, USA
- Jing Zhang
  - Department of Computer Science, University of California Irvine; Irvine, CA, 92697, USA
- Donghoon Lee
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai; New York, NY, 10029, USA
  - Department of Psychiatry, Icahn School of Medicine at Mount Sinai; New York, NY, 10029, USA
- Mark Gerstein
  - Computational Biology and Bioinformatics Program, Yale University; New Haven, CT, 06511, USA
  - Department of Statistics and Data Science, Yale University; New Haven, CT, 06511, USA
  - Department of Molecular Biophysics and Biochemistry, Yale University; New Haven, CT, 06511, USA
  - Department of Computer Science, Yale University; New Haven, CT, 06511, USA
- Daniel Geschwind
  - Department of Neurology, David Geffen School of Medicine, University of California Los Angeles; Los Angeles, CA, 90095, USA
  - Department of Psychiatry and Semel Institute, David Geffen School of Medicine, University of California Los Angeles; Los Angeles, CA, 90095, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles; Los Angeles, CA, 90095, USA
- Kevin P. White
  - Yong Loo Lin School of Medicine, National University of Singapore; Singapore, 117597
3
Wang Y, Jin W, Pan X, Liao W, Shen Q, Cai J, Gong W, Tian Y, Xu D, Li Y, Li J, Gong J, Zhang Z, Yuan X. Pig-eRNAdb: a comprehensive enhancer and eRNA dataset of pigs. Sci Data 2024; 11:157. [PMID: 38302497 PMCID: PMC10834423 DOI: 10.1038/s41597-024-02960-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Accepted: 01/11/2024] [Indexed: 02/03/2024] Open
Abstract
Enhancers and enhancer RNAs (eRNAs) have been strongly implicated in the regulation of transcription. Built from multi-omics data (ATAC-seq, ChIP-seq and RNA-seq) in public databases, Pig-eRNAdb is a dataset that comprehensively integrates pig enhancers and eRNAs using a machine learning strategy; it incorporates 82,399 enhancers and 37,803 eRNAs from 607 samples across 15 pig tissues. This user-friendly dataset provides in-depth annotation of pig enhancers and eRNAs. The coordinates of enhancers and the expression patterns of eRNAs are downloadable. Thousands of regulators of eRNAs, the target genes of eRNAs, tissue-specific eRNAs, and housekeeping eRNAs are also accessible, as is the sequence similarity of eRNAs between pigs and humans. Tissue-specific eRNA-trait associations encompassing 652 traits are also provided. As a reference dataset, Pig-eRNAdb should facilitate investigations of enhancers and eRNAs in pigs.
Affiliation(s)
- Yifei Wang
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science, South China Agricultural University, Guangzhou, 510642, China
- Weiwei Jin
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Xiangchun Pan
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science, South China Agricultural University, Guangzhou, 510642, China
- Weili Liao
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science, South China Agricultural University, Guangzhou, 510642, China
- Qingpeng Shen
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science, South China Agricultural University, Guangzhou, 510642, China
- Jiali Cai
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science, South China Agricultural University, Guangzhou, 510642, China
- Wentao Gong
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science, South China Agricultural University, Guangzhou, 510642, China
- Yuhan Tian
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science, South China Agricultural University, Guangzhou, 510642, China
- Dantong Xu
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science, South China Agricultural University, Guangzhou, 510642, China
- Yipeng Li
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science, South China Agricultural University, Guangzhou, 510642, China
- Jiaqi Li
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science, South China Agricultural University, Guangzhou, 510642, China
- Jing Gong
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Zhe Zhang
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science, South China Agricultural University, Guangzhou, 510642, China
- Xiaolong Yuan
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science, South China Agricultural University, Guangzhou, 510642, China
4
Wang Q, Zhang J, Liu Z, Duan Y, Li C. Integrative approaches based on genomic techniques in the functional studies on enhancers. Brief Bioinform 2023; 25:bbad442. [PMID: 38048082 PMCID: PMC10694556 DOI: 10.1093/bib/bbad442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2023] [Revised: 10/22/2023] [Accepted: 11/08/2023] [Indexed: 12/05/2023] Open
Abstract
With the development of sequencing technology and the dramatic drop in sequencing cost, the functions of noncoding genes are being characterized in a wide variety of fields (e.g. biomedicine). Enhancers are noncoding DNA elements with vital transcription regulation functions. Tens of thousands of enhancers have been identified in the human genome; however, the location, function, target genes and regulatory mechanisms of most enhancers have not been elucidated thus far. As high-throughput sequencing techniques have leapt forwards, omics approaches have been extensively employed in enhancer research. Multidimensional genomic data integration enables the full exploration of the data and provides novel perspectives for screening, identification and characterization of the function and regulatory mechanisms of unknown enhancers. However, multidimensional genomic data are still difficult to integrate genome wide due to complex varieties, massive amounts, high rarity, etc. To facilitate the appropriate methods for studying enhancers with high efficacy, we delineate the principles, data processing modes and progress of various omics approaches to study enhancers and summarize the applications of traditional machine learning and deep learning in multi-omics integration in the enhancer field. In addition, the challenges encountered during the integration of multiple omics data are addressed. Overall, this review provides a comprehensive foundation for enhancer analysis.
Affiliation(s)
- Qilin Wang
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Junyou Zhang
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Zhaoshuo Liu
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Yingying Duan
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Chunyan Li
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
  - Key Laboratory of Big Data-Based Precision Medicine (Ministry of Industry and Information Technology), Beihang University, Beijing 100191, China
  - Beijing Advanced Innovation Center for Big Data-Based Precision Medicine, Beihang University, Beijing 100191, China
5
Zheng A, Shen Z, Glass CK, Gymrek M. Deep learning predicts the impact of regulatory variants on cell-type-specific enhancers in the brain. Bioinformatics Advances 2023; 3:vbad002. [PMID: 36726730 PMCID: PMC9887460 DOI: 10.1093/bioadv/vbad002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2022] [Revised: 11/11/2022] [Accepted: 01/11/2023] [Indexed: 01/13/2023]
Abstract
Motivation: Previous studies have shown that the heritability of multiple brain-related traits and disorders is highly enriched in transcriptional enhancer regions. However, these regions often contain many individual variants, while only a subset of them are likely to causally contribute to a trait. Statistical fine-mapping techniques can identify putative causal variants, but their resolution is often limited, especially in regions with multiple variants in high linkage disequilibrium. In these cases, alternative computational methods to estimate the impact of individual variants can aid in variant prioritization.
Results: Here, we develop a deep learning pipeline to predict cell-type-specific enhancer activity directly from genomic sequences and quantify the impact of individual genetic variants in these regions. We show that the variants highlighted by our deep learning models are targeted by purifying selection in the human population, likely indicating a functional role. We integrate our deep learning predictions with statistical fine-mapping results for 8 brain-related traits, identifying 63 distinct candidate causal variants predicted to contribute to these traits by modulating enhancer activity, representing 6% of all genome-wide association study signals analyzed. Overall, our study provides a valuable computational method that can prioritize individual variants based on their estimated regulatory impact, but also highlights the limitations of existing methods for variant prioritization and fine-mapping.
Availability and implementation: The data underlying this article, nucleotide-level importance scores, and code for running the deep learning pipeline are available at https://github.com/Pandaman-Ryan/AgentBind-brain.
Contact: mgymrek@ucsd.edu.
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Christopher K Glass
  - Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA 92093, USA
  - Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA
6
scEpiLock: A Weakly Supervised Learning Framework for cis-Regulatory Element Localization and Variant Impact Quantification for Single-Cell Epigenetic Data. Biomolecules 2022; 12:biom12070874. [PMID: 35883430 PMCID: PMC9312957 DOI: 10.3390/biom12070874] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2022] [Revised: 06/16/2022] [Accepted: 06/16/2022] [Indexed: 02/04/2023] Open
Abstract
Recent advances in the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) allow cellular heterogeneity dissection and regulatory landscape reconstruction with an unprecedented resolution. However, compared to bulk sequencing, its ultra-high missingness remarkably reduces usable reads in each cell type, resulting in broader, fuzzier peak boundary definitions and limiting our ability to pinpoint functional regions and interpret variant impacts precisely. We propose a weakly supervised learning method, scEpiLock, to directly identify core functional regions from coarse peak labels and quantify variant impacts in a cell-type-specific manner. First, scEpiLock uses a multi-label classifier to predict chromatin accessibility via a deep convolutional neural network. Then, its weakly supervised object detection module further refines the peak boundary definition using gradient-weighted class activation mapping (Grad-CAM). Finally, scEpiLock provides cell-type-specific variant impacts within a given peak region. We applied scEpiLock to various scATAC-seq datasets and found that it achieves an area under the receiver operating characteristic curve (AUC) of ~0.9 and an area under the precision-recall curve (AUPR) above 0.7. Moreover, scEpiLock’s object detection condenses coarse peaks to only one-third of their original size while still reporting higher conservation scores. In addition, we applied scEpiLock to brain scATAC-seq data and reported several genome-wide association study (GWAS) variants disrupting regulatory elements around known risk genes for Alzheimer’s disease, demonstrating its potential to provide cell-type-specific biological insights in disease studies.
7
Zhang L, Zhang J, Nie Q. DIRECT-NET: An efficient method to discover cis-regulatory elements and construct regulatory networks from single-cell multiomics data. Science Advances 2022; 8:eabl7393. [PMID: 35648859 PMCID: PMC9159696 DOI: 10.1126/sciadv.abl7393] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Indexed: 05/13/2023]
Abstract
The emergence of single-cell multiomics data provides unprecedented opportunities to scrutinize the transcriptional regulatory mechanisms controlling cell identity. However, how to use those datasets to dissect the cis-regulatory element (CRE)–to–gene relationships at a single-cell level remains a major challenge. Here, we present DIRECT-NET, a machine-learning method based on gradient boosting, to identify genome-wide CREs and their relationship to target genes, either from parallel single-cell gene expression and chromatin accessibility data or from single-cell chromatin accessibility data alone. By extensively evaluating and characterizing DIRECT-NET’s predicted CREs using independent functional genomics data, we find that DIRECT-NET substantially improves the accuracy of inferring CRE-to-gene relationships in comparison to existing methods. DIRECT-NET is also capable of revealing cell subpopulation–specific and dynamic regulatory linkages. Overall, DIRECT-NET provides an efficient tool for predicting transcriptional regulation codes from single-cell multiomics data.
Affiliation(s)
- Lihua Zhang
  - School of Computer Science, Wuhan University, Wuhan 430072, China
  - Department of Mathematics, University of California, Irvine, Irvine, CA 92697, USA
  - NSF-Simons Center for Multiscale Cell Fate Research, University of California, Irvine, Irvine, CA 92697, USA
- Jing Zhang
  - Department of Computer Science, University of California, Irvine, Irvine, CA 92697, USA
  - Corresponding author. (J.Z.); (Q.N.)
- Qing Nie
  - Department of Mathematics, University of California, Irvine, Irvine, CA 92697, USA
  - NSF-Simons Center for Multiscale Cell Fate Research, University of California, Irvine, Irvine, CA 92697, USA
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA 92697, USA
  - Corresponding author. (J.Z.); (Q.N.)