1
Shireen H, Batool F, Khatoon H, Parveen N, Sehar NU, Hussain I, Ali S, Abbasi AA. Predicting genome-wide tissue-specific enhancers via combinatorial transcription factor genomic occupancy analysis. FEBS Lett 2024. [PMID: 39367524 DOI: 10.1002/1873-3468.15030] [Received: 07/14/2024] [Revised: 08/27/2024] [Accepted: 09/13/2024] [Indexed: 10/06/2024]
Abstract
Enhancers are non-coding cis-regulatory elements crucial for transcriptional regulation. Mutations in enhancers can disrupt gene regulation, leading to disease phenotypes. Identifying enhancers and their tissue-specific activity is challenging due to their lack of stereotyped sequences. This study presents a sequence-based computational model that uses combinatorial transcription factor (TF) genomic occupancy to predict tissue-specific enhancers. Trained on diverse datasets, including ENCODE and Vista enhancer browser data, the model predicted 25 000 forebrain-specific cis-regulatory modules (CRMs) in the human genome. Validation using biochemical features, disease-associated SNPs, and in vivo zebrafish analysis confirmed its effectiveness. This model aids in predicting enhancers lacking well-characterized chromatin features, complementing experimental approaches in tissue-specific enhancer discovery.
Affiliation(s)
- Huma Shireen
  - National Center for Bioinformatics, Program of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, Pakistan
- Fatima Batool
  - National Center for Bioinformatics, Program of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, Pakistan
- Hizran Khatoon
  - National Center for Bioinformatics, Program of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, Pakistan
- Nazia Parveen
  - National Center for Bioinformatics, Program of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, Pakistan
- Noor Us Sehar
  - National Center for Bioinformatics, Program of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, Pakistan
- Irfan Hussain
  - Centre for Regenerative Medicine and Stem Cells Research, Agha Khan University hospital, Karachi, Pakistan
- Shahid Ali
  - Department of Organismal Biology and Anatomy, The University of Chicago, Chicago, IL, USA
- Amir Ali Abbasi
  - National Center for Bioinformatics, Program of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, Pakistan
2
Safaei M, Goodarzi A, Abpeikar Z, Farmani AR, Kouhpayeh SA, Najafipour S, Jafari Najaf Abadi MH. Determination of key hub genes in Leishmaniasis as potential factors in diagnosis and treatment based on a bioinformatics study. Sci Rep 2024; 14:22537. [PMID: 39342024 PMCID: PMC11438978 DOI: 10.1038/s41598-024-73779-w] [Received: 01/23/2024] [Accepted: 09/20/2024] [Indexed: 10/01/2024]
Abstract
Leishmaniasis is an infectious disease caused by protozoan parasites from different species of Leishmania. The disease is transmitted by female sandflies that carry these parasites. In this study, datasets on leishmaniasis published in the GEO database were analyzed and summarized. All three datasets used in this study (GSE43880, GSE55664, and GSE63931) were generated from biopsies of skin wounds of patients infected with a clinical form of Leishmania (Leishmania braziliensis). To identify differentially expressed genes (DEGs) between leishmaniasis patients and controls, the robust rank aggregation (RRA) procedure was applied. We performed gene functional annotation and protein-protein interaction (PPI) network analysis to demonstrate the putative functionalities of the DEGs. The study utilized Molecular Complex Detection (MCODE), Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG) to detect molecular complexes within the PPI network and conduct analyses on the identified functional modules. The CytoHubba plugin's results were paired with RRA analysis to determine the hub genes. Finally, the interaction between miRNAs and hub genes was predicted. Based on the RRA integrated analysis, 407 DEGs were identified (263 up-regulated genes and 144 down-regulated genes). The top three modules were listed after creating the PPI network via the MCODE plugin. Seven hub genes were found using the CytoHubba app and RRA: CXCL10, GBP1, GNLY, GZMA, GZMB, NKG7, and UBD. According to our enrichment analysis, these functional modules were primarily associated with immune pathways, cytokine activity/signaling pathways, and inflammation pathways. Interestingly, however, the hub gene UBD is involved in the ubiquitination pathways of pathogenesis. The miRNet database predicted the hub genes' interactions with miRNAs, and the results revealed several miRNAs, including miR-146a-5p, that are crucial in fighting pathogenesis. The key hub genes discovered in this work may be considered potential biomarkers for diagnosis, the development of agonists/antagonists, and novel vaccine design, and may contribute greatly to future clinical studies.
Affiliation(s)
- Mohsen Safaei
  - Department of Tissue Engineering, School of Advanced Technologies in Medicine, Fasa University of Medical Sciences, Fasa, Iran
- Arash Goodarzi
  - Department of Tissue Engineering, School of Advanced Technologies in Medicine, Fasa University of Medical Sciences, Fasa, Iran
- Zahra Abpeikar
  - Department of Tissue Engineering, School of Advanced Technologies in Medicine, Fasa University of Medical Sciences, Fasa, Iran
- Ahmad Reza Farmani
  - Department of Tissue Engineering, School of Advanced Technologies in Medicine, Fasa University of Medical Sciences, Fasa, Iran
- Seyed Amin Kouhpayeh
  - Department of Pharmacology, School of Medicine, Fasa University of Medical Sciences, Fasa, Iran
- Sohrab Najafipour
  - Department of Microbiology, Faculty of Medicine, Fasa University of Medical Sciences, Fasa, Iran
- Mohammad Hassan Jafari Najaf Abadi
  - Department of Medical Biotechnology, School of Medicine, Shahid Sadoughi University of Medical Sciences and Health Services, Yazd, Iran
  - Research Center for Health Technology Assessment and Medical Informatics, School of Public Health, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
3
Mulero-Hernández J, Mironov V, Miñarro-Giménez JA, Kuiper M, Fernández-Breis J. Integration of chromosome locations and functional aspects of enhancers and topologically associating domains in knowledge graphs enables versatile queries about gene regulation. Nucleic Acids Res 2024; 52:e69. [PMID: 38967009 PMCID: PMC11347148 DOI: 10.1093/nar/gkae566] [Received: 01/12/2023] [Revised: 06/12/2024] [Accepted: 06/19/2024] [Indexed: 07/06/2024]
Abstract
Knowledge about transcription factor binding and regulation, target genes, cis-regulatory modules and topologically associating domains is not only defined by functional associations like biological processes or diseases but also has a determinative genome location aspect. Here, we exploit these location and functional aspects together to develop new strategies to enable advanced data querying. Many databases have been developed to provide information about enhancers, but a schema that allows the standardized representation of data, securing interoperability between resources, has been lacking. In this work, we use knowledge graphs for the standardized representation of enhancers and topologically associating domains, together with data about their target genes, transcription factors, location on the human genome, and functional data about diseases and gene ontology annotations. We used this schema to integrate twenty-five enhancer datasets and two domain datasets, creating the most powerful integrative resource in this field to date. The knowledge graphs have been implemented using the Resource Description Framework and integrated within the open-access BioGateway knowledge network, generating a resource that contains an interoperable set of knowledge graphs (enhancers, TADs, genes, proteins, diseases, GO terms, and interactions between domains). We show how advanced queries, which combine functional and location restrictions, can be used to develop new hypotheses about functional aspects of gene expression regulation.
Affiliation(s)
- Juan Mulero-Hernández
  - Departamento de Informática y Sistemas, Universidad de Murcia, CEIR Campus Mare Nostrum, Instituto Murciano de Investigación Biosanitaria (IMIB), 30100 Murcia, Spain
- Vladimir Mironov
  - Department of Biology, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway
- José Antonio Miñarro-Giménez
  - Departamento de Informática y Sistemas, Universidad de Murcia, CEIR Campus Mare Nostrum, Instituto Murciano de Investigación Biosanitaria (IMIB), 30100 Murcia, Spain
- Martin Kuiper
  - Department of Biology, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway
- Jesualdo Tomás Fernández-Breis
  - Departamento de Informática y Sistemas, Universidad de Murcia, CEIR Campus Mare Nostrum, Instituto Murciano de Investigación Biosanitaria (IMIB), 30100 Murcia, Spain
4
Das S, Rai SN. Predicting the Effect of miRNA on Gene Regulation to Foster Translational Multi-Omics Research-A Review on the Role of Super-Enhancers. Noncoding RNA 2024; 10:45. [PMID: 39195574 DOI: 10.3390/ncrna10040045] [Received: 06/13/2024] [Revised: 08/12/2024] [Accepted: 08/13/2024] [Indexed: 08/29/2024]
Abstract
Gene regulation is crucial for cellular function and homeostasis. It involves diverse mechanisms controlling the production of specific gene products and contributing to tissue-specific variations in gene expression. The dysregulation of genes leads to disease, emphasizing the need to understand these mechanisms. Computational methods have jointly studied transcription factors (TFs), microRNA (miRNA), and messenger RNA (mRNA) to investigate gene regulatory networks. However, there remains a knowledge gap in comprehending gene regulatory networks. In addition to their pivotal roles in cell identity and disease progression, super-enhancers (SEs) have been implicated in miRNA biogenesis and function in recent experimental studies. Yet statistical/computational methodologies harnessing the potential of SEs in deciphering gene regulation networks remain notably absent. To understand the effect of miRNA on mRNA, existing statistical/computational methods could be updated, or novel methods could be developed, by accounting for SEs in the model. In this review, we categorize existing computational methods that utilize TF and miRNA data to understand gene regulatory networks into three broad areas and explore the challenges of integrating enhancers/SEs. The three areas include unraveling indirect regulatory networks, identifying network motifs, and enriching pathway identification by dissecting gene regulators. We hypothesize that addressing these challenges will enhance our understanding of gene regulation, aiding in the identification of therapeutic targets and disease biomarkers. We believe that constructing statistical/computational models that dissect the role of SEs in predicting the effect of miRNA on gene regulation is crucial for tackling these challenges.
Affiliation(s)
- Sarmistha Das
  - Biostatistics and Informatics Shared Resource, University of Cincinnati College of Medicine, Cincinnati, OH 45267, USA
  - Cancer Data Science Center, University of Cincinnati College of Medicine, Cincinnati, OH 45267, USA
  - Division of Biostatistics and Bioinformatics, Department of Biostatistics, Health Informatics and Data Sciences, University of Cincinnati College of Medicine, Cincinnati, OH 45267, USA
- Shesh N Rai
  - Biostatistics and Informatics Shared Resource, University of Cincinnati College of Medicine, Cincinnati, OH 45267, USA
  - Cancer Data Science Center, University of Cincinnati College of Medicine, Cincinnati, OH 45267, USA
  - Division of Biostatistics and Bioinformatics, Department of Biostatistics, Health Informatics and Data Sciences, University of Cincinnati College of Medicine, Cincinnati, OH 45267, USA
5
Ni P, Wu S, Su Z. Validated Negative Regions (VNRs) in the VISTA Database might be Truncated Forms of Bona Fide Enhancers. Advanced Genetics (Hoboken, N.J.) 2024; 5:2300209. [PMID: 38884049 PMCID: PMC11170074 DOI: 10.1002/ggn2.202300209] [Received: 12/11/2023] [Revised: 03/16/2024] [Indexed: 06/18/2024]
Abstract
The VISTA enhancer database is a valuable resource for evaluating predicted enhancers in humans and mice. In addition to thousands of validated positive regions (VPRs) in the human and mouse genomes, the database also contains similar numbers of validated negative regions (VNRs). It was previously shown that the VPRs are on average half as long as predicted overlapping enhancers that are highly conserved, and it was hypothesized that the VPRs may be truncated forms of long bona fide enhancers. Here, it is shown that, like the VPRs, the VNRs are also under strong evolutionary constraints and overlap predicted enhancers in the genomes. The VNRs are also on average half as long as predicted overlapping enhancers that are highly conserved. Moreover, the VNRs and the VPRs display similar cell/tissue-specific modification patterns of key epigenetic marks of active enhancers. Furthermore, the VNRs and the VPRs show similar impact score spectra of in silico mutagenesis. These highly similar properties between the VPRs and the VNRs suggest that, like the VPRs, the VNRs may also be truncated forms of long bona fide enhancers.
Affiliation(s)
- Pengyu Ni
  - Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC 28223, USA
  - Present address: Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, USA
- Siwen Wu
  - Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Zhengchang Su
  - Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC 28223, USA
6
Abbasi AF, Asim MN, Ahmed S, Dengel A. Long extrachromosomal circular DNA identification by fusing sequence-derived features of physicochemical properties and nucleotide distribution patterns. Sci Rep 2024; 14:9466. [PMID: 38658614 PMCID: PMC11043385 DOI: 10.1038/s41598-024-57457-5] [Received: 09/19/2023] [Accepted: 03/18/2024] [Indexed: 04/26/2024]
Abstract
Long extrachromosomal circular DNA (leccDNA) regulates several biological processes such as genomic instability, gene amplification, and oncogenesis. The identification of leccDNA holds significant importance to investigate its potential associations with cancer, autoimmune, cardiovascular, and neurological diseases. In addition, understanding these associations can provide valuable insights about disease mechanisms and potential therapeutic approaches. Conventionally, wet lab-based methods are utilized to identify leccDNA; these are hindered by the need for prior knowledge and by resource-intensive processes, potentially limiting their broader applicability. To empower the process of leccDNA identification across multiple species, the paper at hand presents the very first computational predictor. The proposed iLEC-DNA predictor makes use of an SVM classifier along with sequence-derived nucleotide distribution patterns and physicochemical properties-based features. In addition, the study introduces a set of 12 benchmark leccDNA datasets related to three species, namely Homo sapiens (HM), Arabidopsis thaliana (AT), and Saccharomyces cerevisiae (SC/YS). It performs large-scale experimentation across 12 benchmark datasets under different experimental settings using the proposed predictor, more than 140 baseline predictors, and 858 encoder ensembles. The proposed predictor outperforms baseline predictors and encoder ensembles across diverse leccDNA datasets by producing average performance values of 81.09%, 62.2% and 81.08% in terms of ACC, MCC and AUC-ROC across all the datasets. The source code of the proposed and baseline predictors is available at https://github.com/FAhtisham/Extrachrosmosomal-DNA-Prediction. To facilitate the scientific community, a web application for leccDNA identification is available at https://sds_genetic_analysis.opendfki.de/iLEC_DNA/.
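For readers comparing the ACC and MCC figures quoted in the abstract above, the two metrics can be computed from a binary confusion matrix as in the minimal sketch below. This is an illustration of the standard definitions only, not the authors' evaluation code; the function names and the example counts are hypothetical.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when the denominator is zero (a common convention when
    one row or column of the confusion matrix is empty).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the calculation.
    tp, tn, fp, fn = 40, 45, 5, 10
    print(f"ACC = {accuracy(tp, tn, fp, fn):.4f}")
    print(f"MCC = {mcc(tp, tn, fp, fn):.4f}")
```

MCC uses all four cells of the confusion matrix, which is why it can sit well below ACC on the same predictions, as it does in the averages reported above.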
Affiliation(s)
- Ahtisham Fazeel Abbasi
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, 67663, Kaiserslautern, Germany
  - German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Germany
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Germany
- Sheraz Ahmed
  - German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Germany
- Andreas Dengel
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, 67663, Kaiserslautern, Germany
  - German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Germany
7
Yao X, Ouyang S, Lian Y, Peng Q, Zhou X, Huang F, Hu X, Shi F, Xia J. PheSeq, a Bayesian deep learning model to enhance and interpret the gene-disease association studies. Genome Med 2024; 16:56. [PMID: 38627848 PMCID: PMC11020195 DOI: 10.1186/s13073-024-01330-7] [Received: 07/26/2023] [Accepted: 04/02/2024] [Indexed: 04/19/2024]
Abstract
Despite the abundance of genotype-phenotype association studies, the resulting association outcomes often lack robustness and interpretations. To address these challenges, we introduce PheSeq, a Bayesian deep learning model that enhances and interprets association studies through the integration and perception of phenotype descriptions. By implementing the PheSeq model in three case studies on Alzheimer's disease, breast cancer, and lung cancer, we identify 1024 priority genes for Alzheimer's disease and 818 and 566 genes for breast cancer and lung cancer, respectively. Benefiting from data fusion, these findings represent moderate positive rates, high recall rates, and interpretation in gene-disease association studies.
Affiliation(s)
- Xinzhi Yao
  - College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Sizhuo Ouyang
  - College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Yulong Lian
  - College of Science, Huazhong Agricultural University, Wuhan, China
- Qianqian Peng
  - College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Xionghui Zhou
  - College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Feier Huang
  - College of Life Science and Technology, Huazhong Agricultural University, Wuhan, China
- Xuehai Hu
  - College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Feng Shi
  - College of Science, Huazhong Agricultural University, Wuhan, China
- Jingbo Xia
  - College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
8
Wang Q, Zhang J, Liu Z, Duan Y, Li C. Integrative approaches based on genomic techniques in the functional studies on enhancers. Brief Bioinform 2023; 25:bbad442. [PMID: 38048082 PMCID: PMC10694556 DOI: 10.1093/bib/bbad442] [Received: 08/28/2023] [Revised: 10/22/2023] [Accepted: 11/08/2023] [Indexed: 12/05/2023]
Abstract
With the development of sequencing technology and the dramatic drop in sequencing cost, the functions of noncoding genes are being characterized in a wide variety of fields (e.g. biomedicine). Enhancers are noncoding DNA elements with vital transcription regulation functions. Tens of thousands of enhancers have been identified in the human genome; however, the location, function, target genes and regulatory mechanisms of most enhancers have not been elucidated thus far. As high-throughput sequencing techniques have leapt forwards, omics approaches have been extensively employed in enhancer research. Multidimensional genomic data integration enables the full exploration of the data and provides novel perspectives for screening, identification and characterization of the function and regulatory mechanisms of unknown enhancers. However, multidimensional genomic data are still difficult to integrate genome wide due to complex varieties, massive amounts, high rarity, etc. To facilitate the appropriate methods for studying enhancers with high efficacy, we delineate the principles, data processing modes and progress of various omics approaches to study enhancers and summarize the applications of traditional machine learning and deep learning in multi-omics integration in the enhancer field. In addition, the challenges encountered during the integration of multiple omics data are addressed. Overall, this review provides a comprehensive foundation for enhancer analysis.
Affiliation(s)
- Qilin Wang
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Junyou Zhang
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Zhaoshuo Liu
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Yingying Duan
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Chunyan Li
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
  - Key Laboratory of Big Data-Based Precision Medicine (Ministry of Industry and Information Technology), Beihang University, Beijing 100191, China
  - Beijing Advanced Innovation Center for Big Data-Based Precision Medicine, Beihang University, Beijing 100191, China
9
Gonçalves TM, Stewart CL, Baxley SD, Xu J, Li D, Gabel HW, Wang T, Avraham O, Zhao G. Towards a comprehensive regulatory map of Mammalian Genomes. Research Square 2023:rs.3.rs-3294408. [PMID: 37841836 PMCID: PMC10571623 DOI: 10.21203/rs.3.rs-3294408/v1] [Indexed: 10/17/2023]
Abstract
Genome mapping studies have generated a nearly complete collection of genes for the human genome, but we still lack an equivalently vetted inventory of human regulatory sequences. Cis-regulatory modules (CRMs) play important roles in controlling when, where, and how much a gene is expressed. We developed a training data-free CRM-prediction algorithm, the Mammalian Regulatory MOdule Detector (MrMOD), for accurate CRM prediction in mammalian genomes. MrMOD provides genome position-fixed CRM models, similar to the fixed gene models for the mouse and human genomes, using only genomic sequences as the inputs with one adjustable parameter: the significance p-value. Importantly, MrMOD predicts a comprehensive set of high-resolution CRMs in the mouse and human genomes, including all types of regulatory modules, not limited to any tissue, cell type, developmental stage, or condition. We computationally validated MrMOD predictions using a compendium of 21 orthogonal experimental data sets, including thousands of experimentally defined CRMs and millions of putative regulatory elements derived from hundreds of different tissues, cell types, and stimulus conditions obtained from multiple databases. An in ovo transgenic reporter assay demonstrates the power of our prediction in guiding experimental design. We analyzed CRMs located on chromosome 17 using unsupervised machine learning and identified groups of CRMs with multiple lines of evidence supporting their functionality, linking CRMs with upstream binding transcription factors and downstream target genes. Our work provides a comprehensive base pair resolution annotation of the functional regulatory elements and non-functional regions in the mammalian genomes.
Affiliation(s)
- Jason Xu
  - Missouri University of Science & Technology
- Daofeng Li
  - Washington University School of Medicine
- Ting Wang
  - Washington University School of Medicine
10
Liu Y, Wang Z, Yuan H, Zhu G, Zhang Y. HEAP: a task adaptive-based explainable deep learning framework for enhancer activity prediction. Brief Bioinform 2023; 24:bbad286. [PMID: 37539835 DOI: 10.1093/bib/bbad286] [Received: 05/15/2023] [Revised: 07/05/2023] [Accepted: 07/21/2023] [Indexed: 08/05/2023]
Abstract
Enhancers are crucial cis-regulatory elements that control gene expression in a cell-type-specific manner. Despite extensive genetic and computational studies, accurately predicting enhancer activity in different cell types remains a challenge, and the grammar of enhancers is still poorly understood. Here, we present HEAP (high-resolution enhancer activity prediction), an explainable deep learning framework for predicting enhancers and exploring enhancer grammar. The framework includes three modules that use grammar-based reasoning for enhancer prediction. The algorithm can incorporate DNA sequences and epigenetic modifications to obtain better accuracy. We use a novel two-step multi-task learning method, task adaptive parameter sharing (TAPS), to efficiently predict enhancers in different cell types. We first train a shared model with all cell-type datasets. Then we adapt to specific tasks by adding several task-specific subset layers. Experiments demonstrate that HEAP outperforms published methods and showcases the effectiveness of the TAPS, especially for those with limited training samples. Notably, the explainable framework HEAP utilizes post-hoc interpretation to provide insights into the prediction mechanisms from three perspectives: data, model architecture and algorithm, leading to a better understanding of model decisions and enhancer grammar. To the best of our knowledge, HEAP will be a valuable tool for insight into the complex mechanisms of enhancer activity.
Affiliation(s)
- Yuhang Liu
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Zixuan Wang
  - College of Electronics and Information Engineering, Sichuan University, 610065, Chengdu, China
- Hao Yuan
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Guiquan Zhu
  - West China Hospital of Stomatology, Sichuan University, 610041, Chengdu, China
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
11
Phan LT, Oh C, He T, Manavalan B. A comprehensive revisit of the machine-learning tools developed for the identification of enhancers in the human genome. Proteomics 2023; 23:e2200409. [PMID: 37021401 DOI: 10.1002/pmic.202200409] [Received: 02/10/2023] [Revised: 03/18/2023] [Accepted: 03/27/2023] [Indexed: 04/07/2023]
Abstract
Enhancers are non-coding DNA elements that play a crucial role in enhancing the transcription rate of a specific gene in the genome. Experiments for identifying enhancers can be restricted by their conditions and involve complicated, time-consuming, laborious, and costly steps. To overcome these challenges, computational platforms have been developed to complement experimental methods that enable high-throughput identification of enhancers. Over the last few years, the development of various enhancer computational tools has resulted in significant progress in predicting putative enhancers. Thus, researchers are now able to use a variety of strategies to enhance and advance enhancer study. In this review, an overview of machine learning (ML)-based prediction methods for enhancer identification and related databases has been provided. The existing enhancer-prediction methods have also been reviewed regarding their algorithms, feature selection processes, validation techniques, and software utility. In addition, the advantages and drawbacks of these ML approaches and guidelines for developing bioinformatic tools have been highlighted for a more efficient enhancer prediction. This review will serve as a useful resource for experimentalists in selecting the appropriate ML tool for their study, and for bioinformaticians in developing more accurate and advanced ML-based predictors.
Collapse
Affiliation(s)
- Le Thi Phan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
| | - Changmin Oh
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
| | - Tao He
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
| |
|
12
|
Maytum A, Edginton-White B, Bonifer C. Identification and characterization of enhancer elements controlling cell type-specific and signalling dependent chromatin programming during hematopoietic development. Stem Cell Investig 2023; 10:14. [PMID: 37404470 PMCID: PMC10316067 DOI: 10.21037/sci-2023-011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Accepted: 05/24/2023] [Indexed: 07/06/2023]
Abstract
The development of multi-cellular organisms from a single fertilized egg requires the differential execution of the information encoded in our DNA. This complex process is regulated by the interplay of transcription factors with a chromatin environment, both of which provide the epigenetic information maintaining cell-type specific gene expression patterns. Moreover, transcription factors and their target genes form vast interacting gene regulatory networks which can be exquisitely stable. However, all developmental processes originate from pluripotent precursor cell types. The production of terminally differentiated cells from such cells, therefore, requires successive changes of cell fates, meaning that genes relevant for the next stage of differentiation must be switched on and genes not relevant anymore must be switched off. The stimulus for the change of cell fate originates from extrinsic signals which set a cascade of intracellular processes in motion that eventually terminate at the genome leading to changes in gene expression and the development of alternate gene regulatory networks. How developmental trajectories are encoded in the genome and how the interplay between intrinsic and extrinsic processes regulates development is one of the major questions in developmental biology. The development of the hematopoietic system has long served as a model to understand how changes in gene regulatory networks drive the differentiation of the various blood cell types. In this review, we highlight the main signals and transcription factors and how they are integrated at the level of chromatin programming and gene expression control. We also highlight recent studies identifying the cis-regulatory elements such as enhancers at the global level and explain how their developmental activity is regulated by the cooperation of cell-type specific and ubiquitous transcription factors with extrinsic signals.
Affiliation(s)
- Alexander Maytum
- Institute of Cancer and Genomic Sciences, School of Medicine and Dentistry, University of Birmingham, Birmingham, UK
| | - Ben Edginton-White
- Institute of Cancer and Genomic Sciences, School of Medicine and Dentistry, University of Birmingham, Birmingham, UK
| | - Constanze Bonifer
- Institute of Cancer and Genomic Sciences, School of Medicine and Dentistry, University of Birmingham, Birmingham, UK
| |
|
13
|
Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework. PLoS Comput Biol 2022; 18:e1010779. [PMID: 36520922 PMCID: PMC9836277 DOI: 10.1371/journal.pcbi.1010779] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/27/2022] [Revised: 01/12/2023] [Accepted: 11/29/2022] [Indexed: 12/23/2022]
Abstract
Enhancers are short non-coding DNA sequences outside of the target promoter regions that can be bound by specific proteins to increase a gene's transcriptional activity, which has a crucial role in the spatiotemporal and quantitative regulation of gene expression. However, enhancers do not have specific sequence motifs or structures, and their scattered distribution in the genome makes the identification of enhancers from human cell lines particularly challenging. Here we present a novel, stacked multivariate fusion framework called SMFM, which enables a comprehensive identification and analysis of enhancers from regulatory DNA sequences as well as their interpretation. Specifically, to characterize the hierarchical relationships of enhancer sequences, multi-source biological information and dynamic semantic information are fused to represent regulatory DNA enhancer sequences. Then, we implement a deep learning-based sequence network to learn the feature representation of the enhancer sequences comprehensively and to extract the implicit relationships in the dynamic semantic information. Ultimately, an ensemble machine learning classifier is trained based on the refined multi-source features and dynamic implicit relations obtained from the deep learning-based sequence network. Benchmarking experiments demonstrated that SMFM significantly outperforms other existing methods using several evaluation metrics. In addition, an independent test set was used to validate the generalization performance of SMFM by comparing it to other state-of-the-art enhancer identification methods. Moreover, we performed motif analysis based on the contribution scores of different bases of enhancer sequences to the final identification results. In addition, we conducted interpretability analysis of the identified enhancer sequences based on attention weights of EnhancerBERT, a fine-tuned BERT model that provides new insights into exploring the gene semantic information likely to underlie the discovered enhancers in an interpretable manner. Finally, in a human placenta study with 4,562 active distal gene regulatory enhancers, SMFM successfully exposed tissue-related placental development and the differential mechanism, demonstrating the generalizability and stability of our proposed framework.
|
14
|
Ni P, Moe J, Su Z. Accurate prediction of functional states of cis-regulatory modules reveals common epigenetic rules in humans and mice. BMC Biol 2022; 20:221. [PMID: 36199141 PMCID: PMC9535988 DOI: 10.1186/s12915-022-01426-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/12/2022] [Accepted: 09/29/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND Predicting cis-regulatory modules (CRMs) in a genome and their functional states in various cell/tissue types of the organism are two related challenging computational tasks. Most current methods attempt to simultaneously achieve both using data of multiple epigenetic marks in a cell/tissue type. Though conceptually attractive, they suffer high false discovery rates and limited applications. To fill the gaps, we proposed a two-step strategy to first predict a map of CRMs in the genome, and then predict functional states of all the CRMs in various cell/tissue types of the organism. We have recently developed an algorithm for the first step that was able to more accurately and completely predict CRMs in a genome than existing methods by integrating numerous transcription factor ChIP-seq datasets in the organism. Here, we presented machine-learning methods for the second step. RESULTS We showed that functional states in a cell/tissue type of all the CRMs in the genome could be accurately predicted using data of only 1~4 epigenetic marks by a variety of machine-learning classifiers. Our predictions are substantially more accurate than the best achieved so far. Interestingly, a model trained on a cell/tissue type in humans can accurately predict functional states of CRMs in different cell/tissue types of humans as well as of mice, and vice versa. Therefore, epigenetic code that defines functional states of CRMs in various cell/tissue types is universal at least in humans and mice. Moreover, we found that from tens to hundreds of thousands of CRMs were active in a human and mouse cell/tissue type, and up to 99.98% of them were reutilized in different cell/tissue types, while as small as 0.02% of them were unique to a cell/tissue type that might define the cell/tissue type. CONCLUSIONS Our two-step approach can accurately predict functional states in any cell/tissue type of all the CRMs in the genome using data of only 1~4 epigenetic marks. 
Our approach is also more cost-effective than existing methods that typically use data of more epigenetic marks. Our results suggest common epigenetic rules for defining functional states of CRMs in various cell/tissue types in humans and mice.
Affiliation(s)
- Pengyu Ni
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
| | - Joshua Moe
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA.
| |
|
15
|
Huang S, Chen S, Zhang D, Gao J, Liu L. Enhancer-associated regulatory network and gene signature based on transcriptome and methylation data to predict the survival of patients with lung adenocarcinoma. Front Genet 2022; 13:fgene-2022-1008602. [PMID: 36212131 PMCID: PMC9538943 DOI: 10.3389/fgene.2022.1008602] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2022] [Accepted: 09/08/2022] [Indexed: 11/17/2022]
Abstract
Accumulating evidence has shown that aberrant methylation of enhancers plays regulatory roles in gene expression for various cancers including lung adenocarcinoma (LUAD). In this study, the transcriptome and methylation data of The Cancer Genome Atlas (TCGA)-LUAD cohort were comprehensively analyzed with a five-step Enhancer Linking by Methylation/Expression Relationships (ELMER) process. Step 1: 131,371 distal (2 kb upstream from the transcription start site) probes were obtained. Step 2: 10,665 distal hypomethylated probes were identified in an unsupervised mode with the get.diff.meth function. Step 3: 699 probe-gene pairs with negative correlations were screened using the get.pair function in an unsupervised mode. Step 4: After mapping with probes, 768 motifs were obtained and 24 of them were enriched. Step 5: 127 transcription factors (TFs) with differential expressions and negative correlations with methylation levels were screened, which corresponded to 21 motifs. After the ELMER process, a prognostic “TFs-motifs-genes” regulatory network was constructed. The least absolute shrinkage and selection operator (LASSO) and stepwise regression analyses were further applied to identify variables in the TCGA-LUAD cohort and an eight-gene signature was constructed for calculating the risk score. The risk score was verified in two independent validation cohorts. The area under curve values of receiver operating characteristic curves predicting 1-, 3-, and 5-year survival ranged from 0.633 to 0.764. With the increase of the risk scores, both the survival statuses and clinical traits showed a worse tendency. There were significant differences in the degrees of immune cell infiltration, TMB values, and TIDE scores between the high-risk and low-risk groups. Finally, a better-performing prognostic nomogram was integrated with the risk score and other clinical traits. 
In short, this multi-omics analysis demonstrated the application of ELMER in analyzing enhancer-associated regulatory network in LUAD, which provided promising strategies for epigenetic therapy and prognostic biomarkers.
Affiliation(s)
- Shihao Huang
- Department of Biochemistry, Institute of Glycobiology, Dalian Medical University, Dalian, Liaoning, China
| | - Shiyu Chen
- Department of Laboratory Medicine, Nanxishan Hospital of Guangxi Zhuang Autonomous Region, Guilin, China
| | - Di Zhang
- Department of Biochemistry, Institute of Glycobiology, Dalian Medical University, Dalian, Liaoning, China
| | - Jiamei Gao
- Department of Biochemistry, Institute of Glycobiology, Dalian Medical University, Dalian, Liaoning, China
| | - Linhua Liu
- Department of Biochemistry, Institute of Glycobiology, Dalian Medical University, Dalian, Liaoning, China
- *Correspondence: Linhua Liu,
| |
|
16
|
Sharov AA, Nakatake Y, Wang W. Atlas of regulated target genes of transcription factors (ART-TF) in human ES cells. BMC Bioinformatics 2022; 23:377. [PMID: 36114445 PMCID: PMC9479252 DOI: 10.1186/s12859-022-04924-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/05/2022] [Accepted: 09/12/2022] [Indexed: 12/26/2022]
Abstract
Background Transcription factors (TFs) play central roles in maintaining “stemness” of embryonic stem (ES) cells and their differentiation into several hundreds of adult cell types. The regulatory competence of TFs is routinely assessed by detecting target genes to which they bind. However, these data do not indicate which target genes are activated, repressed, or not affected by the change of TF abundance. There is a lack of large-scale studies that compare the genome binding of TFs with the expression change of target genes after manipulation of each TF. Results In this paper we associated human TFs with their target genes by two criteria: binding to genes, evaluated from published ChIP-seq data (n = 1868); and change of target gene expression shortly after induction of each TF in human ES cells. Lists of direction- and strength-specific regulated target genes are generated for 311 TFs (out of 351 TFs tested) with expected proportion of false positives less than or equal to 0.30, including 63 new TFs not present in four existing databases of target genes. Our lists of direction-specific targets for 152 TFs (80.0%) are larger than in the TRRUST database. On average, 30.9% of genes that respond greater than or equal to twofold to the induction of TFs are regulated targets. Regulated target genes indicate that the majority of TFs are either strong activators or strong repressors, whereas sets of genes that responded greater than or equal to twofold to the induction of TFs did not show strong asymmetry in the direction of expression change. The majority of human TFs (82.1%) regulated their target genes primarily via binding to enhancers. Repression of target genes is more often mediated by promoter-binding than activation of target genes. Enhancer-promoter loops are more abundant among strong activator and repressor TFs. 
Conclusions We developed an atlas of regulated targets of TFs (ART-TF) in human ES cells by combining data on TF binding with data on gene expression change after manipulation of individual TFs. Sets of regulated gene targets were identified with a controlled rate of false positives. This approach contributes to the understanding of biological functions of TFs and organization of gene regulatory networks. This atlas should be a valuable resource for ES cell-based regenerative medicine studies. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04924-3.
|
17
|
Hu J, Wang J, Li J, Hu H, Wu B, Ren H, Wang J. AHLS-pred: a novel sequence-based predictor of acyl-homoserine-lactone synthases using machine learning algorithms. Environmental Microbiology Reports 2022; 14:616-631. [PMID: 35403334 DOI: 10.1111/1758-2229.13068] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/15/2021] [Revised: 03/28/2022] [Accepted: 03/30/2022] [Indexed: 06/14/2023]
Abstract
Acyl-homoserine-lactones (AHLs), as the major quorum sensing (QS) signalling molecules in Gram-negative bacteria, have shown great application potential in regulating biological nutrient removal process. The identification of AHLs synthases plays an essential role in in-depth research on QS mechanisms and applications of biological wastewater treatment processes. This work proposed the first prediction model for AHLs synthases based on machine learning algorithms, namely, AHLS-pred. The training dataset AHLS1400 and the independent testing dataset AHLS132 for AHLSs prediction were first established. Three sequence-based feature extraction methods are utilized to generate feature descriptors, namely, amino acid composition, dipeptide composition and G-gap dipeptide composition respectively. Subsequently, the optimal features were obtained based on the sorted feature descriptors (in F-score order) and the sequential forward search strategy. By comparing five different machine learning algorithms, the final prediction model is trained with support vector machine classifier on AHLS1400 in fivefold cross-validation with the best performance (ACC = 99.43%, MCC = 0.989, AUC = 0.997). The results show that AHLS-pred achieves an ACC of 94.70%, MCC of 0.894 and AUC of 0.995 on the independent testing dataset AHLS132. It demonstrates that AHLS-pred is a promising and powerful prediction method for accelerating the process of AHLSs computational identification.
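The three sequence-based descriptors named in this abstract (amino acid composition, dipeptide composition, and G-gap dipeptide composition) can be illustrated with a minimal Python sketch. This is not the AHLS-pred implementation: the function names, the choice of the 20 standard residues, and the handling of non-standard characters are assumptions made here for illustration only.

```python
from itertools import product

# Assumption: only the 20 standard amino acids are counted; other symbols are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the fraction of each standard residue in seq (20 features)."""
    seq = seq.upper()
    total = sum(seq.count(a) for a in AMINO_ACIDS)
    return {a: (seq.count(a) / total if total else 0.0) for a in AMINO_ACIDS}


def g_gap_dipeptide_composition(seq, gap=0):
    """Return the fraction of each residue pair separated by `gap` positions.

    gap=0 gives the ordinary (adjacent) dipeptide composition; gap=g gives
    the g-gap dipeptide composition (400 features in either case).
    """
    seq = seq.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    n_pairs = 0
    for i in range(len(seq) - gap - 1):
        pair = seq[i] + seq[i + gap + 1]
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
            n_pairs += 1
    return {p: (c / n_pairs if n_pairs else 0.0) for p, c in counts.items()}
```

Concatenating the three dictionaries' values would give one fixed-length feature vector per protein, which could then be ranked (for example by F-score) and passed to a classifier such as a support vector machine, as the abstract describes.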
Affiliation(s)
- Jie Hu
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, Jiangsu, 210023, China
| | - Jin Wang
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, Jiangsu, 210023, China
| | - Jiahao Li
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, Jiangsu, 210023, China
| | - Haidong Hu
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, Jiangsu, 210023, China
| | - Bin Wu
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, Jiangsu, 210023, China
| | - Hongqiang Ren
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, Jiangsu, 210023, China
| | - Jinfeng Wang
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, Jiangsu, 210023, China
| |
|
18
|
Huang G, Luo W, Zhang G, Zheng P, Yao Y, Lyu J, Liu Y, Wei DQ. Enhancer-LSTMAtt: A Bi-LSTM and Attention-Based Deep Learning Method for Enhancer Recognition. Biomolecules 2022; 12:biom12070995. [PMID: 35883552 PMCID: PMC9313278 DOI: 10.3390/biom12070995] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2022] [Revised: 07/03/2022] [Accepted: 07/07/2022] [Indexed: 01/27/2023]
Abstract
Enhancers are short DNA segments that play a key role in biological processes, such as accelerating transcription of target genes. Since an enhancer can reside anywhere in a genome sequence, it is difficult to precisely identify enhancers. We presented a bidirectional long short-term memory (Bi-LSTM) and attention-based deep learning method (Enhancer-LSTMAtt) for enhancer recognition. Enhancer-LSTMAtt is an end-to-end deep learning model that consists mainly of a deep residual neural network, Bi-LSTM, and feed-forward attention. We extensively compared Enhancer-LSTMAtt with 19 state-of-the-art methods by 5-fold cross validation, 10-fold cross validation and independent test. Enhancer-LSTMAtt achieved competitive performances, especially in the independent test. We implemented Enhancer-LSTMAtt as a user-friendly web application. Enhancer-LSTMAtt is applicable not only to recognizing enhancers, but also to distinguishing strong enhancers from weak enhancers. Enhancer-LSTMAtt is expected to become a promising tool for identifying enhancers.
Affiliation(s)
- Guohua Huang
- School of Electrical Engineering, Shaoyang University, Shaoyang 422000, China; (W.L.); (G.Z.); (P.Z.); (J.L.)
- Correspondence:
| | - Wei Luo
- School of Electrical Engineering, Shaoyang University, Shaoyang 422000, China; (W.L.); (G.Z.); (P.Z.); (J.L.)
| | - Guiyang Zhang
- School of Electrical Engineering, Shaoyang University, Shaoyang 422000, China; (W.L.); (G.Z.); (P.Z.); (J.L.)
| | - Peijie Zheng
- School of Electrical Engineering, Shaoyang University, Shaoyang 422000, China; (W.L.); (G.Z.); (P.Z.); (J.L.)
| | - Yuhua Yao
- School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China;
| | - Jianyi Lyu
- School of Electrical Engineering, Shaoyang University, Shaoyang 422000, China; (W.L.); (G.Z.); (P.Z.); (J.L.)
| | - Yuewu Liu
- College of Information and Intelligence, Hunan Agricultural University, Changsha 410083, China;
| | - Dong-Qing Wei
- State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China;
| |
|
19
|
Mulero Hernández J, Fernández-Breis JT. Analysis of the landscape of human enhancer sequences in biological databases. Comput Struct Biotechnol J 2022; 20:2728-2744. [PMID: 35685360 PMCID: PMC9168495 DOI: 10.1016/j.csbj.2022.05.045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2022] [Revised: 05/20/2022] [Accepted: 05/21/2022] [Indexed: 12/01/2022]
Abstract
The process of gene regulation extends as a network in which both genetic sequences and proteins are involved. The levels of regulation and the mechanisms involved are multiple. Transcription is the main control mechanism for most genes, with the downstream steps responsible for refining the transcription patterns. In turn, gene transcription is mainly controlled by regulatory events that occur at promoters and enhancers. Several studies focus on analyzing the contribution of enhancers to the development of diseases and their possible use as therapeutic targets. The study of regulatory elements has advanced rapidly in recent years with the development and use of next generation sequencing techniques. This work has generated a large volume of information that has been transferred to a growing number of public repositories. In this article, we analyze the content of those public repositories that contain information about human enhancers with the aim of detecting whether the knowledge generated by scientific research is contained in those databases in a way that could be computationally exploited. The analysis will be based on three main aspects identified in the literature: types of enhancers, type of evidence about the enhancers, and methods for detecting enhancer-promoter interactions. Our results show that no single database facilitates the optimal exploitation of enhancer data, most types of enhancers are not represented in the databases and there is a need for a standardized model for enhancers. We have identified major gaps and challenges for the computational exploitation of enhancer data.
Affiliation(s)
- Juan Mulero Hernández
- Dept. Informática y Sistemas, Universidad de Murcia, CEIR Campus Mare Nostrum, IMIB-Arrixaca, Spain
| | | |
|
20
|
DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat Genet 2022; 54:613-624. [PMID: 35551305 DOI: 10.1038/s41588-022-01048-5] [Citation(s) in RCA: 69] [Impact Index Per Article: 34.5] [Received: 09/20/2021] [Accepted: 03/08/2022] [Indexed: 02/06/2023]
Abstract
Enhancer sequences control gene expression and comprise binding sites (motifs) for different transcription factors (TFs). Despite extensive genetic and computational studies, the relationship between DNA sequence and regulatory activity is poorly understood, and de novo enhancer design has been challenging. Here, we built a deep-learning model, DeepSTARR, to quantitatively predict the activities of thousands of developmental and housekeeping enhancers directly from DNA sequence in Drosophila melanogaster S2 cells. The model learned relevant TF motifs and higher-order syntax rules, including functionally nonequivalent instances of the same TF motif that are determined by motif-flanking sequence and intermotif distances. We validated these rules experimentally and demonstrated that they can be generalized to humans by testing more than 40,000 wildtype and mutant Drosophila and human enhancers. Finally, we designed and functionally validated synthetic enhancers with desired activities de novo.
|
21
|
Enhancer RNAs (eRNAs) in Cancer: The Jacks of All Trades. Cancers (Basel) 2022; 14:cancers14081978. [PMID: 35454885 PMCID: PMC9030334 DOI: 10.3390/cancers14081978] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/12/2022] [Revised: 04/09/2022] [Accepted: 04/12/2022] [Indexed: 02/04/2023]
Abstract
Simple Summary: This review focuses on eRNAs and the several mechanisms by which they can regulate gene expression. In particular, we describe here the most recent examples of eRNAs dysregulated in cancer or involved in the immune escape of tumor cells. Abstract: Enhancer RNAs (eRNAs) are non-coding RNAs (ncRNAs) transcribed in enhancer regions. They play an important role in transcriptional regulation, mainly during cellular differentiation. eRNAs are tightly tissue- and cell-type specific and are induced by specific stimuli, activating promoters of target genes in turn. eRNAs usually have a very short half-life but in some cases, once activated, they can be stably expressed and acquire additional functions. Due to their critical role, eRNAs are often dysregulated in cancer, and a growing number of interactions with chromatin modifiers, transcription factors, and splicing machinery have been described. Enhancer activation and eRNA transcription are also particularly relevant in the inflammatory response, placing eRNAs at the interplay between cancer and immune cells. Here, we summarize all the possible molecular mechanisms recently reported in association with eRNA activity.
|
22
|
Jankovic B, Gojobori T. From shallow to deep: some lessons learned from application of machine learning for recognition of functional genomic elements in human genome. Hum Genomics 2022; 16:7. [PMID: 35180894 PMCID: PMC8855580 DOI: 10.1186/s40246-022-00376-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/26/2021] [Accepted: 01/02/2022] [Indexed: 11/25/2022]
Abstract
Identification of genomic signals as indicators for functional genomic elements is one of the areas that received early and widespread application of machine learning methods. With time, the methods applied grew in variety and generally exhibited a tendency to improve their ability to identify some major genomic and transcriptomics signals. The evolution of machine learning in genomics followed a similar path to applications of machine learning in other fields. These were impacted in a major way by three dominant developments, namely an enormous increase in availability and quality of data, a significant increase in computational power available to machine learning applications, and finally, new machine learning paradigms, of which deep learning is the most well-known example. It is not easy in general to distinguish factors leading to improvements in results of applications of machine learning. This is even more so in the field of genomics, where the advent of next-generation sequencing and the increased ability to perform functional analysis of raw data have had a major effect on the applicability of machine learning in OMICS fields. In this paper, we survey the results from a subset of published work in application of machine learning in the recognition of genomic signals and regions in the human genome and summarize some lessons learnt from this endeavor. There is no doubt that significant progress has been made both in terms of accuracy and reliability of models. Questions remain, however, whether the progress has been sufficient and what these developments bring to the field of genomics in general and human genomics in particular. Improving usability, interpretability and accuracy of models remains an important open challenge for current and future research in application of machine learning and more generally of artificial intelligence methods in genomics.
Affiliation(s)
- Boris Jankovic
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Takashi Gojobori
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; Division of Biological and Environmental Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| |
|
23
|
Holm I, Nardini L, Pain A, Bischoff E, Anderson CE, Zongo S, Guelbeogo WM, Sagnon N, Gohl DM, Nowling RJ, Vernick KD, Riehle MM. Comprehensive Genomic Discovery of Non-Coding Transcriptional Enhancers in the African Malaria Vector Anopheles coluzzii. Front Genet 2022; 12:785934. [PMID: 35082832 PMCID: PMC8784733 DOI: 10.3389/fgene.2021.785934] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2021] [Accepted: 12/10/2021] [Indexed: 11/24/2022]
Abstract
Almost all regulation of gene expression in eukaryotic genomes is mediated by the action of distant non-coding transcriptional enhancers upon proximal gene promoters. Enhancer locations cannot be accurately predicted bioinformatically because of the absence of a defined sequence code, and thus functional assays are required for their direct detection. Here we used a massively parallel reporter assay, Self-Transcribing Active Regulatory Region sequencing (STARR-seq), to generate the first comprehensive genome-wide map of enhancers in Anopheles coluzzii, a major African malaria vector in the Gambiae species complex. The screen was carried out by transfecting reporter libraries created from the genomic DNA of 60 wild A. coluzzii from Burkina Faso into A. coluzzii 4a3A cells, in order to functionally query enhancer activity of the natural population within the homologous cellular context. We report a catalog of 3,288 active genomic enhancers that were significant across three biological replicates, 74% of them located in intergenic and intronic regions. The STARR-seq enhancer screen is chromatin-free and thus detects inherent activity of a comprehensive catalog of enhancers that may be restricted in vivo to specific cell types or developmental stages. Testing of a validation panel of enhancer candidates using manual luciferase assays confirmed enhancer function in 26 of 28 (93%) of the candidates over a wide dynamic range of activity from two to at least 16-fold activity above baseline. The enhancers occupy only 0.7% of the genome, and display distinct composition features. The enhancer compartment is significantly enriched for 15 transcription factor binding site signatures, and displays divergence for specific dinucleotide repeats, as compared to matched non-enhancer genomic controls. The genome-wide catalog of A. coluzzii enhancers is publicly available in a simple searchable graphic format. 
This enhancer catalogue will be valuable in linking genetic and phenotypic variation, in identifying regulatory elements that could be employed in vector manipulation, and in better targeting of chromosome editing to minimize extraneous regulation influences on the introduced sequences. Importance: Understanding the role of the non-coding regulatory genome in complex disease phenotypes is essential, but even in well-characterized model organisms, identification of regulatory regions within the vast non-coding genome remains a challenge. We used a large-scale assay to generate a genome wide map of transcriptional enhancers. Such a catalogue for the important malaria vector, Anopheles coluzzii, will be an important research tool as the role of non-coding regulatory variation in differential susceptibility to malaria infection is explored and as a public resource for research on this important insect vector of disease.
Affiliation(s)
- Inge Holm
- Institut Pasteur, Université de Paris, CNRS UMR 2000, Unit of Insect Vector Genetics and Genomics, Department of Parasites and Insect Vectors, Paris, France
| | - Luisa Nardini
- Institut Pasteur, Université de Paris, CNRS UMR 2000, Unit of Insect Vector Genetics and Genomics, Department of Parasites and Insect Vectors, Paris, France
| | - Adrien Pain
- Institut Pasteur, Université de Paris, CNRS UMR 2000, Unit of Insect Vector Genetics and Genomics, Department of Parasites and Insect Vectors, Paris, France.,Institut Pasteur, Université de Paris, Hub de Bioinformatique et Biostatistique, Paris, France
| | - Emmanuel Bischoff
- Institut Pasteur, Université de Paris, CNRS UMR 2000, Unit of Insect Vector Genetics and Genomics, Department of Parasites and Insect Vectors, Paris, France
| | - Cameron E Anderson
- Department of Microbiology and Immunology, Medical College of Wisconsin, Milwaukee, WI, United States
| | - Soumanaba Zongo
- Centre National de Recherche et de Formation sur le Paludisme (CNRFP), Ministry of Health, Ouagadougou, Burkina Faso
| | - Wamdaogo M Guelbeogo
- Centre National de Recherche et de Formation sur le Paludisme (CNRFP), Ministry of Health, Ouagadougou, Burkina Faso
| | - N'Fale Sagnon
- Centre National de Recherche et de Formation sur le Paludisme (CNRFP), Ministry of Health, Ouagadougou, Burkina Faso
| | - Daryl M Gohl
- University of Minnesota Genomics Center, Minneapolis, MN, United States.,Department of Genetics, Cell Biology and Development, University of Minnesota, Minneapolis, MN, United States
| | - Ronald J Nowling
- Department of Electrical Engineering and Computer Science, Milwaukee School of Engineering (MSOE), Milwaukee, WI, United States
| | - Kenneth D Vernick
- Institut Pasteur, Université de Paris, CNRS UMR 2000, Unit of Insect Vector Genetics and Genomics, Department of Parasites and Insect Vectors, Paris, France
| | - Michelle M Riehle
- Department of Microbiology and Immunology, Medical College of Wisconsin, Milwaukee, WI, United States
| |
|
24
|
Jain M, Garg R. Enhancers as potential targets for engineering salinity stress tolerance in crop plants. Physiol Plant 2021; 173:1382-1391. [PMID: 33837536 DOI: 10.1111/ppl.13421] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/08/2021] [Revised: 03/19/2021] [Accepted: 04/06/2021] [Indexed: 06/12/2023]
Abstract
Enhancers represent noncoding regulatory regions of the genome located distantly from their target genes. They regulate gene expression programs in a context-specific manner via interacting with promoters of one or more target genes and are generally associated with transcription factor binding sites and epi(genomic)/chromatin features, such as regions of chromatin accessibility and histone modifications. The enhancers are difficult to identify due to the modularity of their associated features. Although enhancers have been studied extensively in human and animals, only a handful of them has been identified in few plant species till date due to nonavailability of plant-specific experimental and computational approaches for their discovery. Being an important regulatory component of the genome, enhancers represent potential targets for engineering agronomic traits, including salinity stress tolerance in plants. Here, we provide a review of the available experimental and computational approaches along with the associated sequence and chromatin/epigenetic features for the discovery of enhancers in plants. In addition, we provide insights into the challenges and future prospects of enhancer research in plant biology with emphasis on potential applications in engineering salinity stress tolerance in crop plants.
Affiliation(s)
- Mukesh Jain
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
| | - Rohini Garg
- Department of Life Sciences, School of Natural Sciences, Shiv Nadar University, Gautam Buddha Nagar, Uttar Pradesh, India
| |
|
25
|
Yousefi S, Deng R, Lanko K, Salsench EM, Nikoncuk A, van der Linde HC, Perenthaler E, van Ham TJ, Mulugeta E, Barakat TS. Comprehensive multi-omics integration identifies differentially active enhancers during human brain development with clinical relevance. Genome Med 2021; 13:162. [PMID: 34663447 PMCID: PMC8524963 DOI: 10.1186/s13073-021-00980-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 04/28/2021] [Accepted: 09/29/2021] [Indexed: 12/13/2022]
Abstract
BACKGROUND Non-coding regulatory elements (NCREs), such as enhancers, play a crucial role in gene regulation, and genetic aberrations in NCREs can lead to human disease, including brain disorders. The human brain is a complex organ that is susceptible to numerous disorders; many of these are caused by genetic changes, but a multitude remain currently unexplained. Understanding NCREs acting during brain development has the potential to shed light on previously unrecognized genetic causes of human brain disease. Despite immense community-wide efforts to understand the role of the non-coding genome and NCREs, annotating functional NCREs remains challenging. METHODS Here we performed an integrative computational analysis of virtually all currently available epigenome data sets related to human fetal brain. RESULTS Our in-depth analysis unravels 39,709 differentially active enhancers (DAEs) that show dynamic epigenomic rearrangement during early stages of human brain development, indicating likely biological function. Many of these DAEs are linked to clinically relevant genes, and functional validation of selected DAEs in cell models and zebrafish confirms their role in gene regulation. Compared to enhancers without dynamic epigenomic rearrangement, DAEs are subjected to higher sequence constraints in humans, have distinct sequence characteristics and are bound by a distinct transcription factor landscape. DAEs are enriched for GWAS loci for brain-related traits and for genetic variation found in individuals with neurodevelopmental disorders, including autism. CONCLUSION This compendium of high-confidence enhancers will assist in deciphering the mechanism behind developmental genetics of human brain and will be relevant to uncover missing heritability in human genetic brain disorders.
Affiliation(s)
- Soheil Yousefi
- Department of Clinical Genetics, Erasmus MC University Medical Center, Rotterdam, The Netherlands
| | - Ruizhi Deng
- Department of Clinical Genetics, Erasmus MC University Medical Center, Rotterdam, The Netherlands
| | - Kristina Lanko
- Department of Clinical Genetics, Erasmus MC University Medical Center, Rotterdam, The Netherlands
| | - Eva Medico Salsench
- Department of Clinical Genetics, Erasmus MC University Medical Center, Rotterdam, The Netherlands
| | - Anita Nikoncuk
- Department of Clinical Genetics, Erasmus MC University Medical Center, Rotterdam, The Netherlands
| | - Herma C. van der Linde
- Department of Clinical Genetics, Erasmus MC University Medical Center, Rotterdam, The Netherlands
| | - Elena Perenthaler
- Department of Clinical Genetics, Erasmus MC University Medical Center, Rotterdam, The Netherlands
| | - Tjakko J. van Ham
- Department of Clinical Genetics, Erasmus MC University Medical Center, Rotterdam, The Netherlands
| | - Eskeatnaf Mulugeta
- Department of Cell Biology, Erasmus MC University Medical Center, Rotterdam, The Netherlands
| | - Tahsin Stefan Barakat
- Department of Clinical Genetics, Erasmus MC University Medical Center, Rotterdam, The Netherlands
| |
|
26
|
Ferré Q, Chèneby J, Puthier D, Capponi C, Ballester B. Anomaly detection in genomic catalogues using unsupervised multi-view autoencoders. BMC Bioinformatics 2021; 22:460. [PMID: 34563116 PMCID: PMC8467021 DOI: 10.1186/s12859-021-04359-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2020] [Revised: 06/04/2021] [Accepted: 08/09/2021] [Indexed: 11/13/2022]
Abstract
Background: Accurate identification of Transcriptional Regulator binding locations is essential for analysis of genomic regions, including Cis Regulatory Elements. The customary NGS approaches, predominantly ChIP-Seq, can be obscured by data anomalies and biases which are difficult to detect without supervision. Results: Here, we develop a method to leverage the usual combinations between many experimental series to mark such atypical peaks. We use deep learning to perform a lossy compression of the genomic regions’ representations with multiview convolutions. Using artificial data, we show that our method correctly identifies groups of correlating series and evaluates CRE according to group completeness. It is then applied to the ReMap database’s large volume of curated ChIP-seq data. We show that peaks lacking known biological correlators are singled out and less confirmed in real data. We propose normalization approaches useful in interpreting black-box models. Conclusion: Our approach detects peaks that are less corroborated than average. It can be extended to other similar problems, and can be interpreted to identify correlation groups. It is implemented in an open-source tool called atyPeak. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04359-2.
Affiliation(s)
- Quentin Ferré
- INSERM, TAGC, Aix Marseille University, Marseille, France.,Université de Toulon, CNRS, LIS, Aix Marseille University, Marseille, France
| | - Jeanne Chèneby
- INSERM, TAGC, Aix Marseille University, Marseille, France
| | - Denis Puthier
- INSERM, TAGC, Aix Marseille University, Marseille, France
| | - Cécile Capponi
- Université de Toulon, CNRS, LIS, Aix Marseille University, Marseille, France.
| | | |
|
27
|
Ni P, Su Z. Accurate prediction of cis-regulatory modules reveals a prevalent regulatory genome of humans. NAR Genom Bioinform 2021; 3:lqab052. [PMID: 34159315 PMCID: PMC8210889 DOI: 10.1093/nargab/lqab052] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 02/26/2021] [Revised: 05/01/2021] [Accepted: 06/14/2021] [Indexed: 02/07/2023]
Abstract
cis-regulatory modules (CRMs) formed by clusters of transcription factor (TF) binding sites (TFBSs) are as important as coding sequences in specifying phenotypes of humans. It is essential to categorize all CRMs and constituent TFBSs in the genome. In contrast to most existing methods that predict CRMs in specific cell types using epigenetic marks, we predict a largely cell type agnostic but more comprehensive map of CRMs and constituent TFBSs in the genome by integrating all available TF ChIP-seq datasets. Our method is able to partition 77.47% of genome regions covered by available 6092 datasets into a CRM candidate (CRMC) set (56.84%) and a non-CRMC set (43.16%). Intriguingly, the predicted CRMCs are under strong evolutionary constraints, while the non-CRMCs are largely selectively neutral, strongly suggesting that the CRMCs are likely cis-regulatory, while the non-CRMCs are not. Our predicted CRMs are under stronger evolutionary constraints than three state-of-the-art predictions (GeneHancer, EnhancerAtlas and ENCODE phase 3) and substantially outperform them for recalling VISTA enhancers and non-coding ClinVar variants. We estimated that the human genome might encode about 1.47M CRMs and 68M TFBSs, comprising about 55% and 22% of the genome, respectively; for both of which, we predicted 80%. Therefore, the cis-regulatory genome appears to be more prevalent than originally thought.
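The percentages quoted in this abstract can be cross-checked with a little arithmetic. The sketch below assumes, since the CRMC and non-CRMC shares sum to 100%, that both are fractions of the covered 77.47% rather than of the whole genome; that reading is an inference from the abstract, not something it states outright. All variable names are illustrative.

```python
# Figures quoted in the abstract (as fractions).
covered_fraction = 0.7747      # genome covered by the 6092 ChIP-seq datasets
crmc_share = 0.5684            # CRMC share (assumed: of the covered regions)
non_crmc_share = 0.4316        # non-CRMC share (assumed: of the covered regions)

# Under that assumption, the CRMC set as a fraction of the whole genome:
crmc_genome_fraction = covered_fraction * crmc_share

# The abstract also estimates CRMs at about 55% of the genome, 80% of which were predicted.
estimated_crm_fraction = 0.55
predicted_share = 0.80
predicted_crm_genome_fraction = estimated_crm_fraction * predicted_share

print(f"CRMCs as share of genome:      {crmc_genome_fraction:.2%}")
print(f"80% of the estimated 55% CRMs: {predicted_crm_genome_fraction:.2%}")
```

Both routes land at roughly 44% of the genome, so the two sets of figures are at least mutually consistent under this reading.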
Affiliation(s)
- Pengyu Ni
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
| |
|
28
|
Niu X, Deng K, Liu L, Yang K, Hu X. A statistical framework for predicting critical regions of p53-dependent enhancers. Brief Bioinform 2021; 22:bbaa053. [PMID: 32392580 PMCID: PMC8138796 DOI: 10.1093/bib/bbaa053] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/07/2020] [Revised: 02/26/2020] [Indexed: 12/13/2022]
Abstract
P53 is the 'guardian of the genome' and is responsible for regulating cell cycle and apoptosis. The genomic p53 binding regions, where activating transcription factors and cofactors like p300 simultaneously bind, are called 'p53-dependent enhancers', which play an important role in tumorigenesis. Current experimental assays generally provide a broad peak for each enhancer element, leaving our knowledge about critical enhancer regions (CERs) limited. Inspired by enhancer dissection using a CRISPR-Cas9 screen library on genome-wide p53 binding sites, here we introduce a statistical framework called 'Computational CRISPR Strategy' (CCS) to predict whether a given DNA fragment will be a p53-dependent CER, using 7-mers as features and a random forest as the regressor. When trained on a p53 CRISPR enhancer dataset, CCS not only accurately fitted the top-ranked enriched single guide RNAs (sgRNAs) but also successfully reproduced two known CERs that were validated by experiments. When applied to an independent testing dataset on a tiling of a 2-kb genomic region of CRISPR-deCDKN1A-Lib, the trained model shows great generalizability by identifying a CER containing five top-ranked sgRNAs. A feature importance analysis further indicates that top-ranked 7-mers map onto informative TF motifs including POU5F1 and SOX5, which are differentially enriched in p53-dependent CERs and are potential factors that turn a general p53 binding site into a p53-dependent CER, providing interpretability for the trained model. Our results demonstrate that CCS is a computational alternative to the CRISPR experiment for screening the genome to map p53-dependent CERs.
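The 7-mer featurisation mentioned in this abstract can be illustrated with a short sketch. This is not the authors' CCS code: it assumes the features are overlapping 7-mer counts over a fixed A/C/G/T vocabulary, and the function names and the example fragment are hypothetical. A regressor such as a random forest would then be fitted on vectors like these.

```python
from collections import Counter
from itertools import product

K = 7  # k-mer length named in the abstract

def kmer_counts(seq, k=K):
    """Count overlapping k-mers in a DNA string, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

def feature_vector(seq, vocabulary, k=K):
    """Map a sequence to a fixed-length list of k-mer counts, ordered as in `vocabulary`."""
    counts = kmer_counts(seq, k)
    return [counts.get(kmer, 0) for kmer in vocabulary]

# Full 7-mer vocabulary: 4**7 = 16384 features (an assumption; a model could use a subset).
VOCAB = ["".join(p) for p in product("ACGT", repeat=K)]

fragment = "ACGTACGTACGT"  # hypothetical DNA fragment, for illustration only
vec = feature_vector(fragment, VOCAB)
print(len(vec), sum(vec))  # 16384 features, 6 overlapping 7-mers in a 12-nt fragment
```

Counting overlapping windows keeps every position of the fragment represented; whether CCS uses raw counts, presence/absence, or reverse-complement merging is not stated in the abstract.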
Affiliation(s)
| | | | | | | | - Xuehai Hu
- Corresponding author: Xuehai Hu, College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, Hubei, 430070, P.R. China. Tel.: +86-18171282783; Fax: +86-27-87288509; E-mail:
| |
|
29
|
Hong J, Gao R, Yang Y. CrepHAN: Cross-species prediction of enhancers by using hierarchical attention networks. Bioinformatics 2021; 37:3436-3443. [PMID: 33978703 DOI: 10.1093/bioinformatics/btab349] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/21/2020] [Revised: 04/21/2021] [Accepted: 05/06/2021] [Indexed: 01/17/2023]
Abstract
MOTIVATION Enhancers are important functional elements in genome sequences. The identification of enhancers is a very challenging task due to the great diversity of enhancer sequences and the flexible localization on genomes. Till now, the interactions between enhancers and genes have not been fully understood yet. To speed up the studies of the regulatory roles of enhancers, computational tools for the prediction of enhancers have emerged in recent years. Especially, thanks to the ENCODE project and the advances of high-throughput experimental techniques, a large amount of experimentally verified enhancers have been annotated on the human genome, which allows large-scale predictions of unknown enhancers using data-driven methods. However, except for human and some model organisms, the validated enhancer annotations are scarce for most species, leading to more difficulties in the computational identification of enhancers for their genomes. RESULTS In this study, we propose a deep learning-based predictor for enhancers, named CrepHAN, which is featured by a hierarchical attention neural network and word embedding-based representations for DNA sequences. We use the experimentally-supported data of the human genome to train the model, and perform experiments on human and other mammals, including mouse, cow, and dog. The experimental results show that CrepHAN has more advantages on cross-species predictions, and outperforms the existing models by a large margin. Especially, for human-mouse cross-predictions, the AUC score of ROC curve is increased by 0.033∼0.145 on the combined tissue dataset and 0.032∼0.109 on tissue-specific datasets. AVAILABILITY bcmi.sjtu.edu.cn/~yangyang/CrepHAN.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jianwei Hong
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Rd., Shanghai 200240, China.,School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Ruitian Gao
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yang Yang
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Rd., Shanghai 200240, China.,Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, 200240, China
| |
|
30
|
Aboelnour E, Bonev B. Decoding the organization, dynamics, and function of the 4D genome. Dev Cell 2021; 56:1562-1573. [PMID: 33984271 DOI: 10.1016/j.devcel.2021.04.023] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 09/21/2020] [Revised: 02/15/2021] [Accepted: 04/21/2021] [Indexed: 11/15/2022]
Abstract
Understanding how complex cell-fate decisions emerge at the molecular level is a key challenge in developmental biology. Despite remarkable progress in decoding the contribution of the linear epigenome, how spatial genome architecture functionally informs changes in gene expression remains unclear. In this review, we discuss recent insights in elucidating the molecular landscape of genome folding, emphasizing the multilayered nature of the 3D genome, its importance for gene regulation, and its spatiotemporal dynamics. Finally, we discuss how these new concepts and emergent technologies will enable us to address some of the outstanding questions in development and disease.
Affiliation(s)
- Erin Aboelnour
- Helmholtz Pioneer Campus, Helmholtz Zentrum München, 85764 Neuherberg, Germany
| | - Boyan Bonev
- Helmholtz Pioneer Campus, Helmholtz Zentrum München, 85764 Neuherberg, Germany; Biomedical Center (BMC), Faculty of Medicine, LMU Munich, Germany.
| |
|
31
|
Lee JTH, Patikas N, Kiselev VY, Hemberg M. Fast searches of large collections of single-cell data using scfind. Nat Methods 2021; 18:262-271. [PMID: 33649586 PMCID: PMC7116898 DOI: 10.1038/s41592-021-01076-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/14/2019] [Accepted: 01/20/2021] [Indexed: 01/30/2023]
Abstract
Single-cell technologies have made it possible to profile millions of cells, but for these resources to be useful they must be easy to query and access. To facilitate interactive and intuitive access to single-cell data we have developed scfind, a single-cell analysis tool that facilitates fast search of biologically or clinically relevant marker genes in cell atlases. Using transcriptome data from six mouse cell atlases, we show how scfind can be used to evaluate marker genes, perform in silico gating, and identify both cell-type-specific and housekeeping genes. Moreover, we have developed a subquery optimization routine to ensure that long and complex queries return meaningful results. To make scfind more user friendly, we use indices of PubMed abstracts and techniques from natural language processing to allow for arbitrary queries. Finally, we show how scfind can be used for multi-omics analyses by combining single-cell ATAC-seq data with transcriptome data.
Affiliation(s)
| | - Nikolaos Patikas
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- UK Dementia Research Institute, Department of Clinical Neurosciences, University of Cambridge, Cambridge, UK
| | | | - Martin Hemberg
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK.
- Evergrande Center for Immunologic Disease, Harvard Medical School and Brigham and Women's Hospital, Boston, MA, USA.
| |
|
32
|
Tobias IC, Abatti LE, Moorthy SD, Mullany S, Taylor T, Khader N, Filice MA, Mitchell JA. Transcriptional enhancers: from prediction to functional assessment on a genome-wide scale. Genome 2020; 64:426-448. [PMID: 32961076 DOI: 10.1139/gen-2020-0104] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 12/20/2022]
Abstract
Enhancers are cis-regulatory sequences located distally to target genes. These sequences consolidate developmental and environmental cues to coordinate gene expression in a tissue-specific manner. Enhancer function and tissue specificity depend on the expressed set of transcription factors, which recognize binding sites and recruit cofactors that regulate local chromatin organization and gene transcription. Unlike other genomic elements, enhancers are challenging to identify because they function independently of orientation, are often distant from their promoters, have poorly defined boundaries, and display no reading frame. In addition, there are no defined genetic or epigenetic features that are unambiguously associated with enhancer activity. Over recent years there have been developments in both empirical assays and computational methods for enhancer prediction. We review genome-wide tools, CRISPR advancements, and high-throughput screening approaches that have improved our ability to both observe and manipulate enhancers in vitro at the level of primary genetic sequences, chromatin states, and spatial interactions. We also highlight contemporary animal models and their importance to enhancer validation. Together, these experimental systems and techniques complement one another and broaden our understanding of enhancer function in development, evolution, and disease.
Affiliation(s)
- Ian C Tobias
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
| | - Luis E Abatti
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
| | - Sakthi D Moorthy
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
| | - Shanelle Mullany
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
| | - Tiegh Taylor
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
| | - Nawrah Khader
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
| | - Mario A Filice
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
| | - Jennifer A Mitchell
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
| |
|
33
|
Osmala M, Lähdesmäki H. Enhancer prediction in the human genome by probabilistic modelling of the chromatin feature patterns. BMC Bioinformatics 2020; 21:317. [PMID: 32689977 PMCID: PMC7370432 DOI: 10.1186/s12859-020-03621-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/10/2019] [Accepted: 06/19/2020] [Indexed: 12/11/2022]
Abstract
Background: The binding sites of transcription factors (TFs) and the localisation of histone modifications in the human genome can be quantified by the chromatin immunoprecipitation assay coupled with next-generation sequencing (ChIP-seq). The resulting chromatin feature data has been successfully adopted for genome-wide enhancer identification by several unsupervised and supervised machine learning methods. However, the current methods predict different numbers and different sets of enhancers for the same cell type and do not utilise the pattern of the ChIP-seq coverage profiles efficiently. Results: In this work, we propose a PRobabilistic Enhancer PRedictIoN Tool (PREPRINT) that assumes characteristic coverage patterns of chromatin features at enhancers and employs a statistical model to account for their variability. PREPRINT defines probabilistic distance measures to quantify the similarity of the genomic query regions and the characteristic coverage patterns. The probabilistic scores of the enhancer and non-enhancer samples are utilised to train a kernel-based classifier. The performance of the method is demonstrated on ENCODE data for two cell lines. The predicted enhancers are computationally validated based on the transcriptional regulatory protein binding sites and compared to the predictions obtained by state-of-the-art methods. Conclusion: PREPRINT performs favorably compared to the state-of-the-art methods, especially when the methods are required to predict a larger set of enhancers. PREPRINT generalises successfully to data from a cell type not utilised for training, and often performs better than the previous methods. The PREPRINT enhancers are less sensitive to the choice of prediction threshold. PREPRINT identifies biologically validated enhancers not predicted by the competing methods. The enhancers predicted by PREPRINT can aid genome interpretation in functional genomics and clinical studies.
Affiliation(s)
- Maria Osmala
- Department of Computer Science, Aalto University, Konemiehentie 2, Espoo, 02150, Finland.
| | - Harri Lähdesmäki
- Department of Computer Science, Aalto University, Konemiehentie 2, Espoo, 02150, Finland
| |
|
34
|
Malladi VS, Nagari A, Franco HL, Kraus WL. Total Functional Score of Enhancer Elements Identifies Lineage-Specific Enhancers That Drive Differentiation of Pancreatic Cells. Bioinform Biol Insights 2020; 14:1177932220938063. [PMID: 32655276 PMCID: PMC7331761 DOI: 10.1177/1177932220938063] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/25/2020] [Accepted: 06/02/2020] [Indexed: 01/10/2023]
Abstract
The differentiation of embryonic stem cells into various lineages is highly dependent on the chromatin state of the genome and patterns of gene expression. To identify lineage-specific enhancers driving the differentiation of progenitors into pancreatic cells, we used a previously described computational framework called Total Functional Score of Enhancer Elements (TFSEE), which integrates multiple genomic assays that probe both transcriptional and epigenomic states. First, we evaluated and compared TFSEE as an enhancer-calling algorithm with enhancers called using GRO-seq-defined enhancer transcripts (method 1) versus enhancers called using histone modification ChIP-seq data (method 2). Second, we used TFSEE to define the enhancer landscape and identify transcription factors (TFs) that maintain the multipotency of a subpopulation of endodermal stem cells during differentiation into pancreatic lineages. Collectively, our results demonstrate that TFSEE is a robust enhancer-calling algorithm that can be used to perform multilayer genomic data integration to uncover cell type-specific TFs that control lineage-specific enhancers.
Affiliation(s)
- Venkat S Malladi
- Laboratory of Signaling and Gene Regulation, Cecil H. and Ida Green Center for Reproductive Biology Sciences, The University of Texas Southwestern Medical Center, Dallas, TX, USA.,Department of Bioinformatics, The University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Anusha Nagari
- Laboratory of Signaling and Gene Regulation, Cecil H. and Ida Green Center for Reproductive Biology Sciences, The University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Hector L Franco
- Laboratory of Signaling and Gene Regulation, Cecil H. and Ida Green Center for Reproductive Biology Sciences, The University of Texas Southwestern Medical Center, Dallas, TX, USA.,Department of Genetics and Lineberger Comprehensive Cancer Center, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - W Lee Kraus
- Laboratory of Signaling and Gene Regulation, Cecil H. and Ida Green Center for Reproductive Biology Sciences, The University of Texas Southwestern Medical Center, Dallas, TX, USA
| |
|
35
|
Neumayr C, Pagani M, Stark A, Arnold CD. STARR-seq and UMI-STARR-seq: Assessing Enhancer Activities for Genome-Wide-, High-, and Low-Complexity Candidate Libraries. Curr Protoc Mol Biol 2020; 128:e105. [PMID: 31503413 PMCID: PMC9286403 DOI: 10.1002/cpmb.105] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Indexed: 12/14/2022]
Abstract
The identification of transcriptional enhancers and the quantitative assessment of enhancer activities is essential to understanding how regulatory information for gene expression is encoded in animal and human genomes. Further, it is key to understanding how sequence variants affect enhancer function. STARR‐seq enables the direct and quantitative assessment of enhancer activity for millions of candidate sequences of arbitrary length and origin in parallel, allowing the screening of entire genomes and the establishment of genome‐wide enhancer activity maps. In STARR‐seq, the candidate sequences are cloned downstream of the core promoter into a reporter gene's transcription unit (i.e., the 3′ UTR). Candidates that function as active enhancers lead to the transcription of reporter mRNAs that harbor the candidates’ sequences. This direct coupling of enhancer sequence and enhancer activity in cis enables the straightforward and efficient cloning of complex candidate libraries and the assessment of enhancer activities of millions of candidates in parallel by quantifying the reporter mRNAs by deep sequencing. This article describes how to create focused and genome‐wide human STARR‐seq libraries and how to perform STARR‐seq screens in mammalian cells, and also describes a novel STARR‐seq variant (UMI‐STARR‐seq) that allows the accurate counting of reporter mRNAs for STARR‐seq libraries of low complexity. © 2019 The Authors. Basic Protocol 1: STARR‐seq plasmid library cloning; Basic Protocol 2: Mammalian STARR‐seq screening protocol; Alternate Protocol: UMI‐STARR‐seq screening protocol (unique molecular identifier integration); Support Protocol: Transfection of human cells using the MaxCyte STX scalable transfection system.
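The idea behind UMI-based counting of reporter mRNAs can be sketched in a few lines. This is a simplified illustration, not the published protocol's pipeline: it assumes reads have already been assigned to a candidate identifier and paired with their UMI sequence, and it ignores UMI sequencing errors, which real tools typically correct. The function name and the example reads are hypothetical.

```python
from collections import defaultdict

def count_unique_umis(reads):
    """Collapse PCR duplicates: count distinct UMIs observed per candidate.

    `reads` is an iterable of (candidate_id, umi) pairs.
    """
    umis_per_candidate = defaultdict(set)
    for candidate_id, umi in reads:
        umis_per_candidate[candidate_id].add(umi)
    return {cid: len(umis) for cid, umis in umis_per_candidate.items()}

# Hypothetical reads: the first two share a UMI, so they count as one molecule.
reads = [
    ("enhA", "AACGT"),
    ("enhA", "AACGT"),
    ("enhA", "GGTCA"),
    ("enhB", "TTTAC"),
]
counts = count_unique_umis(reads)
print(counts)
```

In a low-complexity library many reads derive from the same few candidates, so counting distinct UMIs rather than raw reads keeps amplification duplicates from inflating the apparent activity.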
Affiliation(s)
- Christoph Neumayr
- Research Institute of Molecular Pathology (IMP), Vienna Biocenter (VBC), Vienna, Austria
- Michaela Pagani
- Research Institute of Molecular Pathology (IMP), Vienna Biocenter (VBC), Vienna, Austria
- Alexander Stark
- Research Institute of Molecular Pathology (IMP), Vienna Biocenter (VBC), Vienna, Austria; Medical University of Vienna, Vienna Biocenter (VBC), Vienna, Austria
- Cosmas D Arnold
- Research Institute of Molecular Pathology (IMP), Vienna Biocenter (VBC), Vienna, Austria
|
36
|
Tomoyasu Y, Halfon MS. How to study enhancers in non-traditional insect models. J Exp Biol 2020; 223(Suppl_1):jeb212241. [PMID: 32034049 DOI: 10.1242/jeb.212241] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/12/2022]
Abstract
Transcriptional enhancers are central to the function and evolution of genes and gene regulation. At the organismal level, enhancers play a crucial role in coordinating tissue- and context-dependent gene expression. At the population level, changes in enhancers are thought to be a major driving force that facilitates evolution of diverse traits. An amazing array of diverse traits seen in insect morphology, physiology and behavior has been the subject of research for centuries. Although enhancer studies in insects outside of Drosophila have been limited, recent advances in functional genomic approaches have begun to make such studies possible in an increasing selection of insect species. Here, instead of comprehensively reviewing currently available technologies for enhancer studies in established model organisms such as Drosophila, we focus on a subset of computational and experimental approaches that are likely applicable to non-Drosophila insects, and discuss the pros and cons of each approach. We discuss the importance of validating enhancer function and evaluate several possible validation methods, such as reporter assays and genome editing. Key points and potential pitfalls when establishing a reporter assay system in non-traditional insect models are also discussed. We close with a discussion of how to advance enhancer studies in insects, both by improving computational approaches and by expanding the genetic toolbox in various insects. Through these discussions, this Review provides a conceptual framework for studying the function and evolution of enhancers in non-traditional insect models.
Affiliation(s)
- Marc S Halfon
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
|
37
|
Pataskar A, Vanderlinden W, Emmerig J, Singh A, Lipfert J, Tiwari VK. Deciphering the Gene Regulatory Landscape Encoded in DNA Biophysical Features. iScience 2019; 21:638-649. [PMID: 31731201 PMCID: PMC6889597 DOI: 10.1016/j.isci.2019.10.055] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 08/11/2019] [Revised: 10/20/2019] [Accepted: 10/24/2019] [Indexed: 01/24/2023]
Abstract
Gene regulation in higher organisms involves a sophisticated interplay between genetic and epigenetic mechanisms. Despite advances, the logic in selective usage of certain genomic regions as regulatory elements remains unclear. Here we show that the inherent biophysical properties of the DNA encode epigenetic state and the underlying regulatory potential. We find that the propeller twist (ProT) level is indicative of genomic location of the regulatory elements, their strength, the affinity landscape of transcription factors, and distribution in the nuclear 3D space. We experimentally show that ProT levels confer increased DNA flexibility and surface accessibility, and thus potentially prime usage of high ProT regions as regulatory elements. ProT levels also correlate with occurrence and phenotypic consequences of mutations. Interestingly, cell-fate switches involve a transient usage of low ProT regulatory elements. Altogether, our work provides unprecedented insights into the gene regulatory landscape encoded in the DNA biophysical features.
- DNA shape features encode genomic surface accessibility and flexibility
- High ProT is a deterministic feature of enhancers
- ProT levels correlate with nuclear organization of epigenetic states
- Cell-fate switches involve a transient usage of low ProT regulatory elements
Affiliation(s)
- Abhijeet Pataskar
- Netherlands Cancer Institute, Amsterdam, the Netherlands; Former Address: Institute of Molecular Biology, 55128 Mainz, Germany
- Willem Vanderlinden
- Department of Physics and Center for NanoScience, LMU Munich, 80799 Munich, Germany
- Johannes Emmerig
- Department of Physics and Center for NanoScience, LMU Munich, 80799 Munich, Germany
- Aditi Singh
- Wellcome-Wolfson Institute for Experimental Medicine, School of Medicine, Dentistry & Biomedical Science, Queen's University Belfast, Belfast BT9 7BL, UK
- Jan Lipfert
- Department of Physics and Center for NanoScience, LMU Munich, 80799 Munich, Germany
- Vijay K Tiwari
- Wellcome-Wolfson Institute for Experimental Medicine, School of Medicine, Dentistry & Biomedical Science, Queen's University Belfast, Belfast BT9 7BL, UK; Former Address: Institute of Molecular Biology, 55128 Mainz, Germany
|
38
|
Enhancer prediction with histone modification marks using a hybrid neural network model. Methods 2019; 166:48-56. [DOI: 10.1016/j.ymeth.2019.03.014] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 12/16/2018] [Revised: 02/28/2019] [Accepted: 03/16/2019] [Indexed: 01/19/2023]
|
39
|
Perenthaler E, Yousefi S, Niggl E, Barakat TS. Beyond the Exome: The Non-coding Genome and Enhancers in Neurodevelopmental Disorders and Malformations of Cortical Development. Front Cell Neurosci 2019; 13:352. [PMID: 31417368 PMCID: PMC6685065 DOI: 10.3389/fncel.2019.00352] [Citation(s) in RCA: 46] [Impact Index Per Article: 9.2] [Received: 05/31/2019] [Accepted: 07/16/2019] [Indexed: 12/22/2022]
Abstract
The development of the human cerebral cortex is a complex and dynamic process, in which neural stem cell proliferation, neuronal migration, and post-migratory neuronal organization need to occur in a well-organized fashion. Alterations at any of these crucial stages can result in malformations of cortical development (MCDs), a group of genetically heterogeneous neurodevelopmental disorders that present with developmental delay, intellectual disability and epilepsy. Recent progress in genetic technologies, such as next generation sequencing, most often focusing on all protein-coding exons (e.g., whole exome sequencing), allowed the discovery of more than 100 genes associated with various types of MCDs. Although this has considerably increased the diagnostic yield, most MCD cases remain unexplained. As whole exome sequencing investigates only a minor part of the human genome (1-2%), it is likely that patients in whom no disease-causing mutation has been identified could harbor mutations in genomic regions beyond the exome. Even though functional annotation of non-coding regions is still lagging behind that of protein-coding genes, tremendous progress has been made in the field of gene regulation. One group of non-coding regulatory regions are enhancers, which can be distantly located upstream or downstream of genes and which can mediate temporal and tissue-specific transcriptional control via long-distance interactions with promoter regions. Although some examples exist in the literature that link alterations of enhancers to genetic disorders, a widespread appreciation of the putative roles of these sequences in MCDs is still lacking. Here, we summarize the current state of knowledge on cis-regulatory regions and discuss novel technologies such as massively parallel reporter assay systems, CRISPR-Cas9-based screens and computational approaches that help to further elucidate the emerging role of the non-coding genome in disease. Moreover, we discuss existing literature on mutations or copy number alterations of regulatory regions involved in brain development. We foresee that the future implementation of the knowledge obtained through ongoing gene regulation studies will benefit patients and will provide an explanation for part of the missing heritability of MCDs and other genetic disorders.
Affiliation(s)
- Tahsin Stefan Barakat
- Department of Clinical Genetics, Erasmus MC – University Medical Center, Rotterdam, Netherlands
|
40
|
Benton ML, Talipineni SC, Kostka D, Capra JA. Genome-wide enhancer annotations differ significantly in genomic distribution, evolution, and function. BMC Genomics 2019; 20:511. [PMID: 31221079 PMCID: PMC6585034 DOI: 10.1186/s12864-019-5779-x] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 01/04/2019] [Accepted: 05/07/2019] [Indexed: 12/28/2022]
Abstract
Background: Non-coding gene regulatory enhancers are essential to transcription in mammalian cells. As a result, a large variety of experimental and computational strategies have been developed to identify cis-regulatory enhancer sequences. Given the differences in the biological signals assayed, some variation in the enhancers identified by different methods is expected; however, the concordance of enhancers identified by different methods has not been comprehensively evaluated. This is critically needed, since in practice, most studies consider enhancers identified by only a single method. Here, we compare enhancer sets from eleven representative strategies in four biological contexts. Results: All sets we evaluated overlap significantly more than expected by chance; however, there is significant dissimilarity in their genomic, evolutionary, and functional characteristics, both at the element and base-pair level, within each context. The disagreement is sufficient to influence interpretation of candidate SNPs from GWAS studies, and to lead to disparate conclusions about enhancer and disease mechanisms. Most regions identified as enhancers are supported by only one method, and we find limited evidence that regions identified by multiple methods are better candidates than those identified by a single method. As a result, we cannot recommend the use of any single enhancer identification strategy in all settings. Conclusions: Our results highlight the inherent complexity of enhancer biology and identify an important challenge to mapping the genetic architecture of complex disease. Greater appreciation of how the diverse enhancer identification strategies in use today relate to the dynamic activity of gene regulatory regions is needed to enable robust and reproducible results.
Affiliation(s)
- Mary Lauren Benton
- Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, 37235, USA
- Sai Charan Talipineni
- Department of Developmental Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15201, USA
- Dennis Kostka
- Department of Developmental Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15201, USA; Department of Computational & Systems Biology, Pittsburgh Center for Evolutionary Biology and Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15201, USA
- John A Capra
- Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, 37235, USA; Departments of Biological Sciences and Computer Science, Vanderbilt Genetics Institute, Center for Structural Biology, Vanderbilt University, Nashville, TN, 37235, USA
|
41
|
Hariprakash JM, Ferrari F. Computational Biology Solutions to Identify Enhancers-target Gene Pairs. Comput Struct Biotechnol J 2019; 17:821-831. [PMID: 31316726 PMCID: PMC6611831 DOI: 10.1016/j.csbj.2019.06.012] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 03/15/2019] [Revised: 06/04/2019] [Accepted: 06/11/2019] [Indexed: 12/12/2022]
Abstract
Enhancers are non-coding regulatory elements that are distant from their target gene. Their characterization remains elusive, especially due to challenges in achieving a comprehensive pairing of enhancers and target genes. A number of computational biology solutions have been proposed to address this problem, leveraging the increasing availability of functional genomics data and the improved mechanistic understanding of enhancer action. In this review we focus on computational methods for genome-wide definition of enhancer-target gene pairs. We outline the different classes of methods, as well as their main advantages and limitations. The types of information integrated by each method, along with details on their applicability, are presented and discussed. We especially highlight the technical challenges that are still unresolved and hamper the effective achievement of a satisfactory and comprehensive solution. We expect this field will keep evolving in the coming years due to the ever-growing availability of data and increasing insights into enhancers' crucial role in regulating genome functionality.
Affiliation(s)
- Francesco Ferrari
- IFOM, The FIRC Institute of Molecular Oncology, Milan, Italy
- Institute of Molecular Genetics, National Research Council, Pavia, Italy
|
42
|
Albalawi F, Chahid A, Guo X, Albaradei S, Magana-Mora A, Jankovic BR, Uludag M, Van Neste C, Essack M, Laleg-Kirati TM, Bajic VB. Hybrid model for efficient prediction of poly(A) signals in human genomic DNA. Methods 2019; 166:31-39. [PMID: 30991099 DOI: 10.1016/j.ymeth.2019.04.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/15/2018] [Revised: 03/12/2019] [Accepted: 04/01/2019] [Indexed: 12/15/2022]
Abstract
Polyadenylation signals (PAS) are found in most protein-coding and some non-coding genes in eukaryotes. Their accurate recognition improves understanding of gene regulation mechanisms and recognition of the 3'-end of transcribed gene regions where premature or alternate transcription ends may lead to various diseases. Although different methods and tools for in-silico prediction of genomic signals have been proposed, the correct identification of PAS in genomic DNA remains challenging due to a vast number of non-relevant hexamers identical to PAS hexamers. In this study, we developed a novel method for PAS recognition. The method is implemented in a hybrid PAS recognition model (HybPAS), which is based on deep neural networks (DNNs) and logistic regression models (LRMs). One such model is developed for each of the 12 most frequent human PAS hexamers. DNN models performed best for eight PAS types (including the two most frequent PAS hexamers), while LRMs performed best for the remaining four PAS types. The new models use different combinations of signal processing-based, statistical, and sequence-based features as input. The results obtained on human genomic data show that HybPAS outperforms the well-tuned state-of-the-art Omni-PolyA models, reducing the classification error for different PAS hexamers by up to 57.35% for 10 out of 12 PAS types, with Omni-PolyA models being better for two PAS types. For the most frequent PAS types, 'AATAAA' and 'ATTAAA', HybPAS reduced the error rate by 35.14% and 34.48%, respectively. On average, HybPAS reduces the error by 30.29%. HybPAS is implemented partly in Python and partly in MATLAB and is available at https://github.com/EMANG-KAUST/PolyA_Prediction_LRM_DNN.
Affiliation(s)
- Fahad Albalawi
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia; Taif University, Electrical Engineering, Taif 21944, Saudi Arabia
- Abderrazak Chahid
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Xingang Guo
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Somayah Albaradei
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Arturo Magana-Mora
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia; Saudi Aramco, EXPEC-ARC, Drilling Technology Team, Dhahran 31311, Saudi Arabia
- Boris R Jankovic
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Mahmut Uludag
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Christophe Van Neste
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia; Ghent University, Center for Medical Genetics Ghent (CMGG), B-9000 Ghent, Belgium
- Magbubah Essack
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Taous-Meriem Laleg-Kirati
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Vladimir B Bajic
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
|
43
|
Wu C, Chen J, Liu Y, Hu X. Improved Prediction of Regulatory Element Using Hybrid Abelian Complexity Features with DNA Sequences. Int J Mol Sci 2019; 20:1704. [PMID: 30959806 PMCID: PMC6480087 DOI: 10.3390/ijms20071704] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 02/26/2019] [Revised: 04/01/2019] [Accepted: 04/02/2019] [Indexed: 12/14/2022]
Abstract
Deciphering the code of cis-regulatory elements (CREs) is one of the core issues of current biology. As an important category of CRE, enhancers play crucial roles in regulating gene transcription at a distance. Further, the disruption of an enhancer can cause abnormal transcription and, thus, trigger human diseases, which means that its accurate identification is currently of broad interest. Here, we introduce an innovative concept, i.e., abelian complexity function (ACF), which is a more complex extension of the classic subword complexity function, for a new coding of DNA sequences. After feature selection by an upper bound estimation and integration with DNA composition features, we developed an enhancer prediction model with hybrid abelian complexity features (HACF). Compared with existing methods, HACF shows consistently superior performance on three sources of enhancer datasets. We tested the generalization ability of HACF by scanning human chromosome 22 to validate previously reported super-enhancers. Meanwhile, we identified novel candidate enhancers that are supported by enhancer-related ENCODE ChIP-seq signals. In summary, HACF improves current enhancer prediction and may be beneficial for further prioritization of functional noncoding variants.
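To make the distinction between subword complexity and abelian complexity concrete, here is a minimal Python sketch using their standard combinatorics-on-words definitions, applied to a finite DNA string. It is not the HACF feature pipeline from the paper: the function names are hypothetical, and restricting the count to the windows of one finite sequence is an assumption made for illustration. Two length-n windows are abelian-equivalent when they have the same base composition (the same Parikh vector), regardless of order.

```python
from collections import Counter

ALPHABET = "ACGT"


def parikh_vector(word):
    """Return the base counts of `word` as a tuple ordered A, C, G, T."""
    counts = Counter(word)
    return tuple(counts[base] for base in ALPHABET)


def subword_complexity(seq, n):
    """Number of distinct length-n substrings of `seq`."""
    return len({seq[i:i + n] for i in range(len(seq) - n + 1)})


def abelian_complexity(seq, n):
    """Number of distinct Parikh vectors among length-n substrings of `seq`."""
    return len({parikh_vector(seq[i:i + n]) for i in range(len(seq) - n + 1)})
```

For example, the windows `AC` and `CA` of `ACAC` are different subwords but share one Parikh vector, so the subword complexity at n = 2 is 2 while the abelian complexity is 1.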
Affiliation(s)
- Chengchao Wu
- College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan 430070, China
- Jin Chen
- College of Science, Huazhong Agricultural University, Wuhan 430070, China
- Yunxia Liu
- College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan 430070, China
- Xuehai Hu
- College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan 430070, China
|
44
|
Asma H, Halfon MS. Computational enhancer prediction: evaluation and improvements. BMC Bioinformatics 2019; 20:174. [PMID: 30953451 PMCID: PMC6451241 DOI: 10.1186/s12859-019-2781-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 02/15/2019] [Accepted: 03/27/2019] [Indexed: 12/21/2022]
Abstract
BACKGROUND: Identifying transcriptional enhancers and other cis-regulatory modules (CRMs) is an important goal of post-sequencing genome annotation. Computational approaches provide a useful complement to empirical methods for CRM discovery, but it is critical that we develop effective means to evaluate their performance in terms of estimating their sensitivity and specificity. RESULTS: We introduce here pCRMeval, a pipeline for in silico evaluation of any enhancer prediction tools that are flexible enough to be applied to the Drosophila melanogaster genome. pCRMeval compares the result of predictions with the extensive existing knowledge of experimentally validated Drosophila CRMs in order to estimate the precision and relative sensitivity of the prediction method. In the case of supervised prediction methods, in which training data composed of validated CRMs are used, pCRMeval can also assess the sensitivity of specific training sets. We demonstrate the utility of pCRMeval through evaluation of our SCRMshaw CRM prediction method and training data. By measuring the impact of different parameters on SCRMshaw performance, as assessed by pCRMeval, we develop a more robust version of SCRMshaw, SCRMshaw_HD, that improves the number of predictions while maintaining sensitivity and specificity. Our analysis also demonstrates that SCRMshaw_HD, when applied to increasingly less well-assembled genomes, maintains its strong predictive power with only a minor drop-off in performance. CONCLUSION: Our pCRMeval pipeline provides a general framework for evaluation that can be applied to any CRM prediction method, particularly a supervised method. While we make use of it here primarily to test and improve a particular method for CRM prediction, SCRMshaw, pCRMeval should provide a valuable platform to the research community not only for evaluating individual methods, but also for comparing between competing methods.
Affiliation(s)
- Hasiba Asma
- Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, 701 Ellicott St, Buffalo, NY, 14203, USA
- Marc S Halfon
- Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, 701 Ellicott St, Buffalo, NY, 14203, USA
- Department of Biochemistry, University at Buffalo-State University of New York, 701 Ellicott St, Buffalo, NY, 14203, USA
- Department of Biological Sciences, University at Buffalo-State University of New York, 701 Ellicott St, Buffalo, NY, 14203, USA
- Department of Biomedical Informatics, University at Buffalo-State University of New York, 701 Ellicott St, Buffalo, NY, 14203, USA
- NY State Center of Excellence in Bioinformatics and Life Sciences, 701 Ellicott St, Buffalo, NY, 14203, USA
- Molecular and Cellular Biology Department and Program in Cancer Genetics, Roswell Park Comprehensive Cancer Center, Buffalo, NY, 14263, USA
|
45
|
Colbran LL, Chen L, Capra JA. Sequence Characteristics Distinguish Transcribed Enhancers from Promoters and Predict Their Breadth of Activity. Genetics 2019; 211:1205-1217. [PMID: 30696717 PMCID: PMC6456323 DOI: 10.1534/genetics.118.301895] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 12/27/2018] [Accepted: 01/27/2019] [Indexed: 01/08/2023]
Abstract
Enhancers and promoters both regulate gene expression by recruiting transcription factors (TFs); however, the degree to which enhancer vs. promoter activity is due to differences in their sequences or to genomic context is the subject of ongoing debate. We examined this question by analyzing the sequences of thousands of transcribed enhancers and promoters from hundreds of cellular contexts previously identified by cap analysis of gene expression. Support vector machine classifiers trained on counts of all possible 6-bp-long sequences (6-mers) were able to accurately distinguish promoters from enhancers and distinguish their breadth of activity across tissues. Classifiers trained to predict enhancer activity also performed well when applied to promoter prediction tasks, but promoter-trained classifiers performed poorly on enhancers. This suggests that the learned sequence patterns predictive of enhancer activity generalize to promoters, but not vice versa. Our classifiers also indicate that there are functionally relevant differences in enhancer and promoter GC content beyond the influence of CpG islands. Furthermore, sequences characteristic of broad promoter or broad enhancer activity matched different TFs, with predicted ETS- and RFX-binding sites indicative of promoters, and AP-1 sites indicative of enhancers. Finally, we evaluated the ability of our models to distinguish enhancers and promoters defined by histone modifications. Separating these classes was substantially more difficult, and this difference may contribute to ongoing debates about the similarity of enhancers and promoters. In summary, our results suggest that high-confidence transcribed enhancers and promoters can largely be distinguished based on biologically relevant sequence properties.
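The feature representation mentioned in this abstract, counts of all possible 6-mers, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function name `kmer_count_vector` is hypothetical, windows containing non-ACGT symbols are simply skipped, and choices such as collapsing reverse complements or normalizing by sequence length are not stated in the abstract and are left out.

```python
from itertools import product


def kmer_count_vector(seq, k=6):
    """Return counts of every possible k-mer over ACGT, in lexicographic order.

    The vector has 4**k entries (4096 for k = 6). Overlapping windows are
    counted; windows containing any symbol outside ACGT are skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    vector = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # skip windows with N or other ambiguous bases
            vector[pos] += 1
    return vector
```

Each sequence becomes a fixed-length vector regardless of its own length, which is what allows a classifier such as a support vector machine to be trained on enhancers and promoters of varying size.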
Affiliation(s)
- Laura L Colbran
- Vanderbilt Genetics Institute, Vanderbilt University, Nashville, Tennessee 37235
- Ling Chen
- Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee 37235
- John A Capra
- Vanderbilt Genetics Institute, Vanderbilt University, Nashville, Tennessee 37235
- Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee 37235
- Center for Structural Biology, Departments of Biomedical Informatics and Computer Science, Vanderbilt University, Nashville, Tennessee 37235
|
46
|
Zehnder T, Benner P, Vingron M. Predicting enhancers in mammalian genomes using supervised hidden Markov models. BMC Bioinformatics 2019; 20:157. [PMID: 30917778 PMCID: PMC6437899 DOI: 10.1186/s12859-019-2708-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/11/2018] [Accepted: 02/27/2019] [Indexed: 12/24/2022]
Abstract
BACKGROUND: Eukaryotic gene regulation is a complex process comprising the dynamic interaction of enhancers and promoters in order to activate gene expression. In recent years, research in regulatory genomics has contributed to a better understanding of the characteristics of promoter elements and for most sequenced model organism genomes there exist comprehensive and reliable promoter annotations. For enhancers, however, a reliable description of their characteristics and location has so far proven to be elusive. With the development of high-throughput methods such as ChIP-seq, large amounts of data about epigenetic conditions have become available, and many existing methods use the information on chromatin accessibility or histone modifications to train classifiers in order to segment the genome into functional groups such as enhancers and promoters. However, these methods often do not consider prior biological knowledge about enhancers such as their diverse lengths or molecular structure. RESULTS: We developed enhancer HMM (eHMM), a supervised hidden Markov model designed to learn the molecular structure of promoters and enhancers. Both consist of a central stretch of accessible DNA flanked by nucleosomes with distinct histone modification patterns. We evaluated the performance of eHMM within and across cell types and developmental stages and found that eHMM successfully predicts enhancers with high precision and recall comparable to state-of-the-art methods, and consistently outperforms those in terms of accuracy and resolution. CONCLUSIONS: eHMM predicts active enhancers based on data from chromatin accessibility assays and a minimal set of histone modification ChIP-seq experiments. In comparison to other 'black box' methods its parameters are easy to interpret. eHMM can be used as a stand-alone tool for enhancer prediction without the need for additional training or a tuning of parameters. The high spatial precision of enhancer predictions gives valuable targets for potential knockout experiments or downstream analyses such as motif search.
Affiliation(s)
- Tobias Zehnder
- Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, Berlin, 14195 Germany
- Philipp Benner
- Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, Berlin, 14195 Germany
- Martin Vingron
- Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, Berlin, 14195 Germany
|
47
|
Ho EYK, Cao Q, Gu M, Chan RWL, Wu Q, Gerstein M, Yip KY. Shaping the nebulous enhancer in the era of high-throughput assays and genome editing. Brief Bioinform 2019; 21:836-850. [PMID: 30895290 DOI: 10.1093/bib/bbz030] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/12/2018] [Revised: 02/15/2019] [Accepted: 02/26/2019] [Indexed: 01/22/2023]
Abstract
Since the first discovery of transcriptional enhancers in 1981, their textbook definition has remained largely unchanged over the past 37 years. With the emergence of high-throughput assays and genome editing, which are switching the paradigm from bottom-up discovery and testing of individual enhancers to top-down profiling of enhancer activities genome-wide, it has become increasingly evident that this classical definition has left substantial gray areas in several respects. Here we survey a representative set of recent research articles and report the definitions of enhancers they have adopted. The results reveal that a wide spectrum of definitions is used, usually without the definition being stated explicitly, which could lead to difficulties in data interpretation and downstream analyses. Based on these findings, we discuss the practical implications and suggestions for future studies.
Collapse
Affiliation(s)
| | - Qin Cao
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong
| | - Mengting Gu
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, USA
| | - Ricky Wai-Lun Chan
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong
| | - Qiong Wu
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong.,School of Biomedical Sciences, The Chinese University of Hong Kong, Hong Kong
| | - Mark Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, USA; Program in Computational Biology and Bioinformatics; Department of Computer Science, Yale University, New Haven, Connecticut, USA
| | - Kevin Y Yip
- Department of Biomedical Engineering; Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong; Hong Kong Bioinformatics Centre; CUHK-BGI Innovation Institute of Trans-omics; Hong Kong Institute of Diabetes and Obesity, The Chinese University of Hong Kong, Hong Kong
| |
|
48
|
Mejía-Guerra MK, Buckler ES. A k-mer grammar analysis to uncover maize regulatory architecture. BMC Plant Biology 2019; 19:103. [PMID: 30876396 PMCID: PMC6419808 DOI: 10.1186/s12870-019-1693-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 04/25/2018] [Accepted: 02/21/2019] [Indexed: 05/06/2023]
Abstract
BACKGROUND: Only a small percentage of the genome sequence is involved in the regulation of gene expression, but identifying this portion biochemically is expensive and laborious. In species like maize, with diverse intergenic regions and many repetitive elements, this is an especially challenging problem that limits the transfer of data from one line to another. While regulatory regions are rare, they do have characteristic chromatin contexts and sequence organization (the grammar) by which they can be identified. RESULTS: We developed a computational framework to exploit this sequence arrangement. The models learn to classify regulatory regions based on sequence features (k-mers). To do this, we borrowed two approaches from the field of natural language processing: (1) "bag-of-words", which is commonly used for differentially weighting key words in tasks like sentiment analysis, and (2) a vector-space model using word2vec (vector-k-mers), which captures semantic and linguistic relationships between words. We built "bag-of-k-mers" and "vector-k-mers" models that distinguish between regulatory and non-regulatory regions with an average accuracy above 90%. Our "bag-of-k-mers" models achieved higher overall accuracy, while the "vector-k-mers" models were more useful in highlighting key groups of sequences within the regulatory regions. CONCLUSIONS: These models now provide powerful tools to annotate regulatory regions in other maize lines beyond the reference, at low cost and with high accuracy.
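As a rough illustration of the "bag-of-k-mers" representation mentioned in this abstract, the sketch below turns a DNA sequence into a fixed-length vector of k-mer counts. It is not the authors' implementation: the function names `kmer_counts` and `bag_of_kmers`, the choice of k = 2, and the toy sequence are assumptions made for the example, and a real model would feed such vectors (usually weighted) into a classifier.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence (hypothetical helper)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def bag_of_kmers(seq, k, alphabet="ACGT"):
    """Return k-mer counts as a vector over all possible k-mers.

    The vocabulary is every string of length k over the alphabet, in
    lexicographic order, so vectors from different sequences are comparable.
    """
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = kmer_counts(seq, k)
    return [counts.get(kmer, 0) for kmer in vocab]


# Toy example: a made-up 8-bp sequence, not real maize data.
vector = bag_of_kmers("ACGTACGA", k=2)
print(len(vector), sum(vector))  # 16 possible 2-mers, 7 overlapping windows
```

Like a bag-of-words model, this representation discards the order of the k-mers, which is the gap the word2vec-style "vector-k-mers" approach in the abstract is meant to address.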
Affiliation(s)
| | - Edward S. Buckler
- Institute for Genomic Diversity, Cornell University, 175 Biotechnology Building, Ithaca, 14853 NY USA
- USDA-ARS, Research Geneticist, USDA ARS Robert Holley Center, Ithaca, 14853 NY USA
- Department of Plant Breeding and Genetics, Cornell University, 159 Biotechnology Building, Ithaca, 14853 NY USA
| |
|
49
|
Suzuki N, Hirano K, Ogino H, Ochi H. Arid3a regulates nephric tubule regeneration via evolutionarily conserved regeneration signal-response enhancers. eLife 2019; 8:43186. [PMID: 30616715 PMCID: PMC6324879 DOI: 10.7554/elife.43186] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 10/27/2018] [Accepted: 12/18/2018] [Indexed: 12/15/2022]
Abstract
Amphibians and fish have the ability to regenerate numerous tissues, whereas mammals have a limited regenerative capacity. Although numerous developmental genes are reactivated during regeneration, no extensive analysis has yet examined whether highly regenerative animals use unique cis-regulatory elements to reactivate genes during regeneration, or how such cis-regulatory elements become activated. Here, we screened for regeneration signal-response enhancers (RSREs) at the lhx1 locus using Xenopus and found that noncoding elements conserved from fish to human function as enhancers in the regenerating nephric tubules. A DNA-binding motif of Arid3a, a component of H3K9me3 demethylases, was commonly found in RSREs. Arid3a binds to RSREs and reduces H3K9me3 levels. It promotes cell cycle progression and causes the outgrowth of nephric tubules, whereas conditional knockdown of arid3a using photo-morpholino inhibits regeneration. These results suggest that Arid3a contributes to the regeneration of nephric tubules by decreasing H3K9me3 on RSREs.
Affiliation(s)
- Nanoka Suzuki
- Institute for Promotion of Medical Science Research, Yamagata University, Faculty of Medicine, Yamagata, Japan
| | - Kodai Hirano
- Institute for Promotion of Medical Science Research, Yamagata University, Faculty of Medicine, Yamagata, Japan
| | - Hajime Ogino
- Amphibian Research Center, Hiroshima University, Higashi-hiroshima, Japan
| | - Haruki Ochi
- Institute for Promotion of Medical Science Research, Yamagata University, Faculty of Medicine, Yamagata, Japan
| |
|
50
|
TELS: A Novel Computational Framework for Identifying Motif Signatures of Transcribed Enhancers. Genomics Proteomics & Bioinformatics 2018; 16:332-341. [PMID: 30578915 PMCID: PMC6364045 DOI: 10.1016/j.gpb.2018.05.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 11/05/2017] [Revised: 04/23/2018] [Accepted: 05/15/2018] [Indexed: 12/31/2022]
Abstract
In mammalian cells, transcribed enhancers (TrEns) play important roles in the initiation of gene expression and the maintenance of gene expression levels in a spatiotemporal manner. One of the most challenging questions is how the genomic characteristics of enhancers relate to enhancer activities. To date, only a limited number of enhancer sequence characteristics have been investigated, leaving room to explore the enhancers' DNA code more systematically. To address this problem, we developed a novel computational framework, Transcribed Enhancer Landscape Search (TELS), aimed at identifying predictive cell type/tissue-specific motif signatures of TrEns. As a case study, we used TELS to compile a comprehensive catalog of motif signatures for all known TrEns identified by the FANTOM5 consortium across 112 human primary cells and tissues. Our results confirm that combinations of different short motifs optimally characterize cell type/tissue-specific TrEns. Our study is the first to report combinations of motifs that maximize performance in distinguishing TrEns exclusively transcribed in one cell type/tissue from TrEns exclusively transcribed in other cell types/tissues. Moreover, we also report 31 motif signatures predictive of enhancers' broad activity. TELS codes and material are publicly available at http://www.cbrc.kaust.edu.sa/TELS.
|