1
Asma H, Tieke E, Deem KD, Rahmat J, Dong T, Huang X, Tomoyasu Y, Halfon MS. Regulatory genome annotation of 33 insect species. eLife 2024; 13:RP96738. [PMID: 39392676 PMCID: PMC11469670 DOI: 10.7554/elife.96738] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/12/2024] Open
Abstract
Annotation of newly sequenced genomes frequently includes genes, but rarely covers important non-coding genomic features such as the cis-regulatory modules (e.g., enhancers and silencers) that regulate gene expression. Here, we begin to remedy this situation by developing a workflow for rapid initial annotation of insect regulatory sequences, and provide a searchable database resource with enhancer predictions for 33 genomes. Using our previously developed SCRMshaw computational enhancer prediction method, we predict over 2.8 million regulatory sequences along with the tissues where they are expected to be active, in a set of insect species ranging over 360 million years of evolution. Extensive analysis and validation of the data provides several lines of evidence suggesting that we achieve a high true-positive rate for enhancer prediction. One, we show that our predictions target specific loci, rather than random genomic locations. Two, we predict enhancers in orthologous loci across a diverged set of species to a significantly higher degree than random expectation would allow. Three, we demonstrate that our predictions are highly enriched for regions of accessible chromatin. Four, we achieve a validation rate in excess of 70% using in vivo reporter gene assays. As we continue to annotate both new tissues and new species, our regulatory annotation resource will provide a rich source of data for the research community and will have utility for both small-scale (single gene, single species) and large-scale (many genes, many species) studies of gene regulation. In particular, the ability to search for functionally related regulatory elements in orthologous loci should greatly facilitate studies of enhancer evolution even among distantly related species.
Affiliation(s)
- Hasiba Asma
- Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, Buffalo, United States
- Ellen Tieke
- Department of Biology, Miami University, Oxford, United States
- Kevin D Deem
- Department of Biology, Miami University, Oxford, United States
- Jabale Rahmat
- Department of Biology, Miami University, Oxford, United States
- Tiffany Dong
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, United States
- Xinbo Huang
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, United States
- Marc S Halfon
- Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, Buffalo, United States
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, United States
- Department of Biomedical Informatics, University at Buffalo-State University of New York, Buffalo, United States
- Department of Biological Sciences, University at Buffalo-State University of New York, Buffalo, United States
2
Kumar A, Schrader AW, Aggarwal B, Boroojeny AE, Asadian M, Lee J, Song YJ, Zhao SD, Han HS, Sinha S. Intracellular spatial transcriptomic analysis toolkit (InSTAnT). Nat Commun 2024; 15:7794. [PMID: 39242579 PMCID: PMC11379969 DOI: 10.1038/s41467-024-49457-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Accepted: 06/04/2024] [Indexed: 09/09/2024] Open
Abstract
Imaging-based spatial transcriptomics technologies such as Multiplexed error-robust fluorescence in situ hybridization (MERFISH) can capture cellular processes in unparalleled detail. However, rigorous and robust analytical tools are needed to unlock their full potential for discovering subcellular biological patterns. We present Intracellular Spatial Transcriptomic Analysis Toolkit (InSTAnT), a computational toolkit for extracting molecular relationships from spatial transcriptomics data at single molecule resolution. InSTAnT employs specialized statistical tests and algorithms to detect gene pairs and modules exhibiting intriguing patterns of co-localization, both within individual cells and across the cellular landscape. We showcase the toolkit on five different datasets representing two different cell lines, two brain structures, two species, and three different technologies. We perform rigorous statistical assessment of discovered co-localization patterns, find supporting evidence from databases and RNA interactions, and identify associated subcellular domains. We uncover several cell type and region-specific gene co-localizations within the brain. Intra-cellular spatial patterns discovered by InSTAnT mirror diverse molecular relationships, including RNA interactions and shared sub-cellular localization or function, providing a rich compendium of testable hypotheses regarding molecular functions.
Affiliation(s)
- Anurendra Kumar
- College of Computing, Georgia Institute of Technology, Atlanta, GA, 30332, USA
- Alex W Schrader
- Department of Chemistry, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Bhavay Aggarwal
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, 30332, USA
- Marisa Asadian
- Department of Chemistry, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- JuYeon Lee
- Department of Chemistry, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- You Jin Song
- Department of Cell and Developmental Biology, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Sihai Dave Zhao
- Department of Statistics, University of Illinois Urbana-Champaign, Urbana, IL, 61820, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Hee-Sun Han
- Department of Chemistry, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Saurabh Sinha
- H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30318, USA
- The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
3
Dyer NA, Lucas ER, Nagi SC, McDermott DP, Brenas JH, Miles A, Clarkson CS, Mawejje HD, Wilding CS, Halfon MS, Asma H, Heinz E, Donnelly MJ. Mechanisms of transcriptional regulation in Anopheles gambiae revealed by allele-specific expression. Proc Biol Sci 2024; 291:20241142. [PMID: 39288798 PMCID: PMC11407855 DOI: 10.1098/rspb.2024.1142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2024] [Revised: 07/05/2024] [Accepted: 07/24/2024] [Indexed: 09/19/2024] Open
Abstract
Malaria control relies on insecticides targeting the mosquito vector, but this is increasingly compromised by insecticide resistance, which can be achieved by elevated expression of detoxifying enzymes that metabolize the insecticide. In diploid organisms, gene expression is regulated both in cis, by regulatory sequences on the same chromosome, and by trans acting factors, affecting both alleles equally. Differing levels of transcription can be caused by mutations in cis-regulatory modules (CRM), but few of these have been identified in mosquitoes. We crossed bendiocarb-resistant and susceptible Anopheles gambiae strains to identify cis-regulated genes that might be responsible for the resistant phenotype using RNAseq, and CRM sequences controlling gene expression in insecticide resistance relevant tissues were predicted using machine learning. We found 115 genes showing allele-specific expression (ASE) in hybrids of insecticide susceptible and resistant strains, suggesting cis-regulation is an important mechanism of gene expression regulation in A. gambiae. The genes showing ASE included a higher proportion of Anopheles-specific genes on average younger than genes with balanced allelic expression.
Affiliation(s)
- Naomi A. Dyer
- Department of Vector Biology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool L3 5QA, UK
- Eric R. Lucas
- Department of Vector Biology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool L3 5QA, UK
- Sanjay C. Nagi
- Department of Vector Biology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool L3 5QA, UK
- Daniel P. McDermott
- Department of Vector Biology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool L3 5QA, UK
- Jon H. Brenas
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Alistair Miles
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Chris S. Clarkson
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Henry D. Mawejje
- Infectious Diseases Research Collaboration (IDRC), Plot 2C Nakasero Hill Road, PO Box 7475, Kampala, Uganda
- Craig S. Wilding
- School of Biological and Environmental Sciences, Liverpool John Moores University, Byrom Street, Liverpool L3 3AF, UK
- Marc S. Halfon
- Department of Biochemistry, Jacobs School of Medicine & Biomedical Sciences, University at Buffalo-State University of New York, 955 Main Street, Buffalo, NY 14203, USA
- Hasiba Asma
- Department of Biochemistry, Jacobs School of Medicine & Biomedical Sciences, University at Buffalo-State University of New York, 955 Main Street, Buffalo, NY 14203, USA
- Eva Heinz
- Department of Vector Biology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool L3 5QA, UK
- Strathclyde Institute of Pharmacy & Biomedical Sciences, University of Strathclyde, Glasgow G4 0RE, UK
- Department of Clinical Sciences, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool L3 5QA, UK
- Martin J. Donnelly
- Department of Vector Biology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool L3 5QA, UK
4
Papadadonakis S, Kioukis A, Karageorgiou C, Pavlidis P. Evolution of gene regulatory networks by means of selection and random genetic drift. PeerJ 2024; 12:e17918. [PMID: 39221262 PMCID: PMC11365478 DOI: 10.7717/peerj.17918] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2023] [Accepted: 07/23/2024] [Indexed: 09/04/2024] Open
Abstract
The evolution of a population by means of genetic drift and natural selection operating on a gene regulatory network (GRN) of an individual has not been scrutinized in depth. Thus, the relative importance of various evolutionary forces and processes on shaping genetic variability in GRNs is understudied. In this study, we implemented a simulation framework, called EvoNET, that simulates forward-in-time the evolution of GRNs in a population. The fitness effect of mutations is not constant, rather fitness of each individual is evaluated on the phenotypic level, by measuring its distance from an optimal phenotype. Each individual goes through a maturation period, where its GRN may reach an equilibrium, thus deciding its phenotype. Afterwards, individuals compete to produce the next generation. We examine properties of the GRN evolution, such as robustness against the deleterious effect of mutations and the role of genetic drift. We are able to confirm previous hypotheses regarding the effect of mutations and we provide new insights on the interplay between random genetic drift and natural selection.
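The maturation-then-fitness idea described in this abstract can be illustrated with a minimal Python sketch. This is not EvoNET's code: the binary expression states, the threshold update rule, the Manhattan distance, the exponential fitness function, and the names `mature` and `fitness` are all assumptions made for illustration.

```python
import math

def mature(weights, state, max_steps=100):
    """Iterate a toy GRN until its expression state stops changing.

    weights[i][j] is the (assumed) regulatory effect of gene j on gene i.
    Returns the equilibrium state, or None if none is reached in max_steps.
    """
    for _ in range(max_steps):
        nxt = []
        for row in weights:
            total = sum(w * s for w, s in zip(row, state))
            nxt.append(1 if total > 0 else 0)  # gene is on only if net input is positive
        if nxt == state:
            return state  # equilibrium reached: this is the phenotype
        state = nxt
    return None

def fitness(phenotype, optimum, strength=1.0):
    """Fitness decays with distance from the optimal phenotype (assumed form)."""
    if phenotype is None:
        return 0.0  # assumption: no equilibrium means zero fitness
    distance = sum(abs(p - o) for p, o in zip(phenotype, optimum))
    return math.exp(-strength * distance)

# Hypothetical three-gene network: gene 0 sustains itself and activates gene 1,
# and gene 1 represses gene 2.
weights = [[1, 0, 0],
           [1, 0, 0],
           [0, -1, 0]]
phenotype = mature(weights, [1, 0, 0])
print(phenotype, fitness(phenotype, [1, 1, 0]))
```

Because fitness is computed from the equilibrium phenotype rather than assigned per mutation, the same change to `weights` can be neutral in one genetic background and deleterious in another, which is the point the abstract makes about non-constant fitness effects.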
Affiliation(s)
- Stefanos Papadadonakis
- Institute of Computer Science, Foundation for Research and Technology Hellas, Heraklion, Crete, Greece
- Department of Biology, University of Crete, Heraklion, Crete, Greece
- Antonios Kioukis
- School of Medicine, University of Crete, Heraklion, Crete, Greece
- Pavlos Pavlidis
- Institute of Computer Science, Foundation for Research and Technology Hellas, Heraklion, Crete, Greece
- Department of Biology, University of Crete, Heraklion, Crete, Greece
5
Garza AB, Garcia R, Solis LM, Halfon MS, Girgis HZ. EnhancerTracker: Comparing cell-type-specific enhancer activity of DNA sequence triplets via an ensemble of deep convolutional neural networks. bioRxiv 2023:2023.12.23.573198. [PMID: 38187673 PMCID: PMC10769370 DOI: 10.1101/2023.12.23.573198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
Motivation: Transcriptional enhancers, unlike promoters, are unrestrained by distance or strand orientation with respect to their target genes, making their computational identification a challenge. Further, there are insufficient numbers of confirmed enhancers for many cell types, preventing robust training of machine-learning-based models for enhancer prediction for such cell types. Results: We present EnhancerTracker, a novel tool that leverages an ensemble of deep separable convolutional neural networks to identify cell-type-specific enhancers while needing only two confirmed enhancers. EnhancerTracker is trained, validated, and tested on 52,789 putative enhancers obtained from the FANTOM5 Project and control sequences derived from the human genome. Unlike available tools, which accept one sequence at a time, the input to our tool is three sequences; the first two are enhancers active in the same cell type. EnhancerTracker outputs 1 if the third sequence is an enhancer active in the same cell type(s) where the first two enhancers are active. It outputs 0 otherwise. On a held-out set (15%), EnhancerTracker achieved an accuracy of 64%, a specificity of 93%, a recall of 35%, a precision of 84%, and an F1 score of 49%. Availability and implementation: https://github.com/BioinformaticsToolsmith/EnhancerTracker. Contact: hani.girgis@tamuk.edu.
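The reported metrics can be cross-checked against the standard definitions. The sketch below recomputes the F1 score from the stated precision and recall, and recomputes accuracy from recall and specificity. The accuracy check assumes the held-out set has equal numbers of positive and negative triplets, which the abstract does not state; the function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def balanced_accuracy(recall, specificity):
    """Accuracy when positives and negatives are equally frequent (assumption)."""
    return (recall + specificity) / 2

# Values reported in the abstract for the held-out set.
precision, recall, specificity = 0.84, 0.35, 0.93

f1 = f1_score(precision, recall)
acc = balanced_accuracy(recall, specificity)
print(f"F1 = {f1:.2f}")
print(f"Accuracy (balanced classes) = {acc:.2f}")
```

Both values agree with the reported 49% and 64% after rounding, which suggests the figures are internally consistent and fits a roughly balanced evaluation set. The high specificity paired with low recall indicates a conservative classifier: it rarely outputs 1 for a non-matching third sequence but misses many true matches.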
6
Dyer NA, Lucas ER, Nagi SC, McDermott DP, Brenas JH, Miles A, Clarkson CS, Mawejje HD, Wilding CS, Halfon MS, Asma H, Heinz E, Donnelly MJ. Mechanisms of transcriptional regulation in Anopheles gambiae revealed by allele specific expression. bioRxiv 2023:2023.11.22.568226. [PMID: 38045426 PMCID: PMC10690255 DOI: 10.1101/2023.11.22.568226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2023]
Abstract
Malaria control relies on insecticides targeting the mosquito vector, but this is increasingly compromised by insecticide resistance, which can be achieved by elevated expression of detoxifying enzymes that metabolize the insecticide. In diploid organisms, gene expression is regulated both in cis, by regulatory sequences on the same chromosome, and by trans-acting factors, affecting both alleles equally. Differing levels of transcription can be caused by mutations in cis-regulatory modules (CRM), but few of these have been identified in mosquitoes. We crossed bendiocarb-resistant and susceptible Anopheles gambiae strains to identify cis-regulated genes that might be responsible for the resistant phenotype using RNAseq, and cis-regulatory module sequences controlling gene expression in insecticide resistance relevant tissues were predicted using machine learning. We found 115 genes showing allele-specific expression in hybrids of insecticide susceptible and resistant strains, suggesting cis-regulation is an important mechanism of gene expression regulation in Anopheles gambiae. The genes showing allele-specific expression included a higher proportion of Anopheles-specific genes on average younger than those with balanced allelic expression.
Affiliation(s)
- Naomi A Dyer
- Department of Vector Biology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool, L3 5QA, UK
- Eric R Lucas
- Department of Vector Biology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool, L3 5QA, UK
- Sanjay C Nagi
- Department of Vector Biology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool, L3 5QA, UK
- Daniel P McDermott
- Department of Vector Biology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool, L3 5QA, UK
- Jon H Brenas
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- Alistair Miles
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- Chris S Clarkson
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- Henry D Mawejje
- Infectious Diseases Research Collaboration (IDRC), Plot 2C Nakasero Hill Road, P.O. Box 7475, Kampala, Uganda
- Craig S Wilding
- School of Biological and Environmental Sciences, Liverpool John Moores University, Byrom Street, Liverpool, L3 3AF, UK
- Marc S Halfon
- Department of Biochemistry, Jacobs School of Medicine & Biomedical Sciences, University at Buffalo-State University of New York, 955 Main Street, Buffalo, New York 14203, USA
- Hasiba Asma
- Department of Biochemistry, Jacobs School of Medicine & Biomedical Sciences, University at Buffalo-State University of New York, 955 Main Street, Buffalo, New York 14203, USA
- Eva Heinz
- Department of Vector Biology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool, L3 5QA, UK
- Department of Clinical Sciences, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool, L3 5QA, UK
- Martin J Donnelly
- Department of Vector Biology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool, L3 5QA, UK
7
Nowling RJ, Njoya K, Peters JG, Riehle MM. Prediction accuracy of regulatory elements from sequence varies by functional sequencing technique. Front Cell Infect Microbiol 2023; 13:1182567. [PMID: 37600946 PMCID: PMC10433755 DOI: 10.3389/fcimb.2023.1182567] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 07/10/2023] [Indexed: 08/22/2023] Open
Abstract
Introduction: Various sequencing-based approaches are used to identify and characterize the activities of cis-regulatory elements in a genome-wide fashion. Some of these techniques rely on indirect markers such as histone modifications (ChIP-seq with histone antibodies) or chromatin accessibility (ATAC-seq, DNase-seq, FAIRE-seq), while other techniques use direct measures such as episomal assays measuring the enhancer properties of DNA sequences (STARR-seq) and direct measurement of the binding of transcription factors (ChIP-seq with transcription factor-specific antibodies). The activities of cis-regulatory elements such as enhancers, promoters, and repressors are determined by their sequence and secondary processes such as chromatin accessibility, DNA methylation, and bound histone markers. Methods: Here, machine learning models are employed to evaluate the accuracy with which cis-regulatory elements identified by various commonly used sequencing techniques can be predicted by their underlying sequence alone to distinguish between cis-regulatory activity that is reflective of sequence content versus secondary processes. Results and discussion: Models trained and evaluated on D. melanogaster sequences identified through DNase-seq and STARR-seq are significantly more accurate than models trained on sequences identified by H3K4me1, H3K4me3, and H3K27ac ChIP-seq, FAIRE-seq, and ATAC-seq. These results suggest that the activity detected by DNase-seq and STARR-seq can be largely explained by underlying DNA sequence, independent of secondary processes. Experimentally, a subset of DNase-seq and H3K4me1 ChIP-seq sequences were tested for enhancer activity using luciferase assays and compared with previous tests performed on STARR-seq sequences. The experimental data indicated that STARR-seq sequences are substantially enriched for enhancer-specific activity, while the DNase-seq and H3K4me1 ChIP-seq sequences are not. Taken together, these results indicate that the DNase-seq approach identifies a broad class of regulatory elements of which enhancers are a subset and the associated data are appropriate for training models for detecting regulatory activity from sequence alone, STARR-seq data are best for training enhancer-specific sequence models, and H3K4me1 ChIP-seq data are not well suited for training and evaluating sequence-based models for cis-regulatory element prediction.
Affiliation(s)
- Ronald J. Nowling
- Electrical Engineering and Computer Science, Milwaukee School of Engineering, Milwaukee, WI, United States
- Kimani Njoya
- Department of Microbiology and Immunology, Medical College of Wisconsin, Milwaukee, WI, United States
- John G. Peters
- Electrical Engineering and Computer Science, Milwaukee School of Engineering, Milwaukee, WI, United States
- Michelle M. Riehle
- Department of Microbiology and Immunology, Medical College of Wisconsin, Milwaukee, WI, United States
8
Weinstein ML, Jaenke CM, Asma H, Spangler M, Kohnen KA, Konys CC, Williams ME, Williams AV, Rebeiz M, Halfon MS, Williams TM. A novel role for trithorax in the gene regulatory network for a rapidly evolving fruit fly pigmentation trait. PLoS Genet 2023; 19:e1010653. [PMID: 36795790 PMCID: PMC9977049 DOI: 10.1371/journal.pgen.1010653] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/08/2022] [Revised: 03/01/2023] [Accepted: 02/03/2023] [Indexed: 02/17/2023] Open
Abstract
Animal traits develop through the expression and action of numerous regulatory and realizator genes that comprise a gene regulatory network (GRN). For each GRN, its underlying patterns of gene expression are controlled by cis-regulatory elements (CREs) that bind activating and repressing transcription factors. These interactions drive cell-type and developmental stage-specific transcriptional activation or repression. Most GRNs remain incompletely mapped, and a major barrier to this daunting task is CRE identification. Here, we used an in silico method to identify predicted CREs (pCREs) that comprise the GRN which governs sex-specific pigmentation of Drosophila melanogaster. Through in vivo assays, we demonstrate that many pCREs activate expression in the correct cell-type and developmental stage. We employed genome editing to demonstrate that two CREs control the pupal abdomen expression of trithorax, whose function is required for the dimorphic phenotype. Surprisingly, trithorax had no detectable effect on this GRN's key trans-regulators, but shapes the sex-specific expression of two realizator genes. Comparison of sequences orthologous to these CREs supports an evolutionary scenario where these trithorax CREs predated the origin of the dimorphic trait. Collectively, this study demonstrates how in silico approaches can shed novel insights on the GRN basis for a trait's development and evolution.
Affiliation(s)
- Michael L. Weinstein
- Department of Biology, University of Dayton, 300 College Park, Dayton, Ohio, United States of America
- Chad M. Jaenke
- Department of Biology, University of Dayton, 300 College Park, Dayton, Ohio, United States of America
- Hasiba Asma
- Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, Buffalo, New York, United States of America
- Matthew Spangler
- Department of Biology, University of Dayton, 300 College Park, Dayton, Ohio, United States of America
- Katherine A. Kohnen
- Department of Biology, University of Dayton, 300 College Park, Dayton, Ohio, United States of America
- Claire C. Konys
- Department of Biology, University of Dayton, 300 College Park, Dayton, Ohio, United States of America
- Melissa E. Williams
- Department of Biology, University of Dayton, 300 College Park, Dayton, Ohio, United States of America
- Ashley V. Williams
- West Carrollton High School, 5833 Student St., Dayton, Ohio, United States of America
- Mark Rebeiz
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Marc S. Halfon
- Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, Buffalo, New York, United States of America
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, New York, United States of America
- Thomas M. Williams
- Department of Biology, University of Dayton, 300 College Park, Dayton, Ohio, United States of America
- The Integrative Science and Engineering Center, University of Dayton, 300 College Park, Dayton, Ohio, United States of America
9
Kumar A, Schrader AW, Boroojeny AE, Asadian M, Lee J, Song YJ, Zhao SD, Han HS, Sinha S. Intracellular Spatial Transcriptomic Analysis Toolkit (InSTAnT). Research Square 2023:rs.3.rs-2481749. [PMID: 36747718 PMCID: PMC9901031 DOI: 10.21203/rs.3.rs-2481749/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023]
Abstract
Imaging-based spatial transcriptomics technologies such as MERFISH offer snapshots of cellular processes in unprecedented detail, but new analytic tools are needed to realize their full potential. We present InSTAnT, a computational toolkit for extracting molecular relationships from spatial transcriptomics data at the intra-cellular resolution. InSTAnT detects gene pairs and modules with interesting patterns of mutual co-localization within and across cells, using specialized statistical tests and graph mining. We showcase the toolkit on datasets profiling a human cancer cell line and hypothalamic preoptic region of mouse brain. We performed rigorous statistical assessment of discovered co-localization patterns, found supporting evidence from databases and RNA interactions, and identified subcellular domains associated with RNA-colocalization. We identified several novel cell type-specific gene co-localizations in the brain. Intra-cellular spatial patterns discovered by InSTAnT mirror diverse molecular relationships, including RNA interactions and shared sub-cellular localization or function, providing a rich compendium of testable hypotheses regarding molecular functions.
Affiliation(s)
- Anurendra Kumar
- College of Computing, Georgia Institute of Technology, Atlanta, GA, 30332, USA
- Alex W. Schrader
- Department of Chemistry, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Marisa Asadian
- Department of Chemistry, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Juyeon Lee
- Department of Chemistry, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- You Jin Song
- Department of Cell and Developmental Biology, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Sihai Dave Zhao
- Department of Statistics, University of Illinois Urbana-Champaign, Urbana, IL, 61820, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Hee-Sun Han
- Department of Chemistry, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Saurabh Sinha
- The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
- H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30318, USA
10
Song W, Ovcharenko I. Heterogeneity of enhancers embodies shared and representative functional groups underlying developmental and cell type-specific gene regulation. Gene 2022; 834:146640. [PMID: 35680026 PMCID: PMC9235925 DOI: 10.1016/j.gene.2022.146640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Revised: 04/20/2022] [Accepted: 06/02/2022] [Indexed: 11/04/2022]
Abstract
While enhancers in a particular tissue coordinately fulfill regulatory functions, these functions are heterogeneous in nature and comprise multiple enhancer subclasses and the associated regulatory mechanisms. In this work, we used multiple cell lines to identify enhancer subclasses linked to development, differentiation, and cellular identity. We found that enhancer functional heterogeneity during development encompasses subclasses of ubiquitous functions (11%), development-specific regulatory activity (62%), and chromatin interactions (12%). In differentiated cell lines, ubiquitous enhancers (10%) stay active across multiple cell lines. They are accompanied by a large enhancer subclass (ranging from 33% to 63%) with functions specific to the corresponding lineage. The remaining enhancers (27-40%) establish regulatory chromatin structure and facilitate interactions of cell type-specific enhancers with their target promoters. In addition to specialized functions of cell type-specific enhancers, we show that proper accounting of enhancer heterogeneity leads to a 10% increase in accuracy of enhancer classification, which significantly improves the modeling of enhancers and identification of underlying regulatory mechanisms. In summary, our observations suggest that although cell type-specific enhancers are heterogeneous and coordinate different regulatory programs, enhancers from different cell lines maintain common categories of functional groups across developmental and differentiation stages, indicating a higher order rule followed by enhancer-gene regulation.
Affiliation(s)
- Wei Song
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Ivan Ovcharenko
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
11
Keränen SVE, Villahoz-Baleta A, Bruno AE, Halfon MS. REDfly: An Integrated Knowledgebase for Insect Regulatory Genomics. Insects 2022; 13:618. [PMID: 35886794 PMCID: PMC9323752 DOI: 10.3390/insects13070618] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/16/2022] [Revised: 07/01/2022] [Accepted: 07/06/2022] [Indexed: 11/29/2022]
Abstract
We provide here an updated description of the REDfly (Regulatory Element Database for Fly) database of transcriptional regulatory elements, a unique resource that provides regulatory annotation for the genome of Drosophila and other insects. The genomic sequences regulating insect gene expression, namely transcriptional cis-regulatory modules (CRMs, e.g., "enhancers") and transcription factor binding sites (TFBSs), are not currently curated by any other major database resources. However, knowledge of such sequences is important, as CRMs play critical roles with respect to disease as well as normal development, phenotypic variation, and evolution. Characterized CRMs also provide useful tools for both basic and applied research, including developing methods for insect control. REDfly, which is the most detailed existing platform for metazoan regulatory-element annotation, includes over 40,000 experimentally verified CRMs and TFBSs along with their DNA sequences, their associated genes, and the expression patterns they direct. Here, we briefly describe REDfly's contents and data model, with an emphasis on the new features implemented since 2020. We then provide an illustrated walk-through of several common REDfly search use cases.
Affiliation(s)
- Angel Villahoz-Baleta
  - Center for Computational Research, State University of New York at Buffalo, Buffalo, NY 14203, USA
  - New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA
- Andrew E. Bruno
  - Center for Computational Research, State University of New York at Buffalo, Buffalo, NY 14203, USA
  - New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA
- Marc S. Halfon
  - New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA
  - Department of Biochemistry, State University of New York at Buffalo, Buffalo, NY 14203, USA
  - Department of Biomedical Informatics, State University of New York at Buffalo, Buffalo, NY 14203, USA
  - Department of Biological Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA
  - Department of Molecular and Cellular Biology and Program in Cancer Genetics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
|
12
|
Schember I, Halfon MS. Identification of new Anopheles gambiae transcriptional enhancers using a cross-species prediction approach. Insect Molecular Biology 2021; 30:410-419. [PMID: 33866636 PMCID: PMC8266755 DOI: 10.1111/imb.12705] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/05/2020] [Revised: 02/09/2021] [Accepted: 03/31/2021] [Indexed: 06/12/2023]
Abstract
The success of transgenic mosquito vector control approaches relies on well-targeted gene expression, requiring the identification and characterization of a diverse set of mosquito promoters and transcriptional enhancers. However, few enhancers have been characterized in Anopheles gambiae to date. Here, we employ the SCRMshaw method we previously developed to predict enhancers in the A. gambiae genome, preferentially targeting vector-relevant tissues such as the salivary glands, midgut and nervous system. We demonstrate a high overall success rate, with at least 8 of 11 (73%) tested sequences validating as enhancers in an in vivo xenotransgenic assay. Four tested sequences drive expression in either the salivary gland or the midgut, making them directly useful for probing the biology of these infection-relevant tissues. The success of our study suggests that computational enhancer prediction should serve as an effective means for identifying A. gambiae enhancers with activity in tissues involved in malaria propagation and transmission.
Affiliation(s)
- Isabella Schember
  - Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203
- Marc S. Halfon
  - Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203
  - Department of Biomedical Informatics, University at Buffalo-State University of New York, Buffalo, NY 14203
  - Department of Biological Sciences, University at Buffalo-State University of New York, Buffalo, NY 14203
  - NY State Center of Excellence in Bioinformatics & Life Sciences, Buffalo, NY 14203
  - Department of Molecular and Cellular Biology and Program in Cancer Genetics, Roswell Park Comprehensive Cancer Center, Buffalo, NY 14263
|
13
|
Asma H, Halfon MS. Annotating the Insect Regulatory Genome. Insects 2021; 12:591. [PMID: 34209769 PMCID: PMC8305585 DOI: 10.3390/insects12070591] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/28/2021] [Revised: 06/23/2021] [Accepted: 06/25/2021] [Indexed: 11/17/2022]
Abstract
An ever-growing number of insect genomes is being sequenced across the evolutionary spectrum. Comprehensive annotation of not only genes but also regulatory regions is critical for reaping the full benefits of this sequencing. Driven by developments in sequencing technologies and in both empirical and computational discovery strategies, the past few decades have witnessed dramatic progress in our ability to identify cis-regulatory modules (CRMs), sequences such as enhancers that play a major role in regulating transcription. Nevertheless, providing a timely and comprehensive regulatory annotation of newly sequenced insect genomes is an ongoing challenge. We review here the methods being used to identify CRMs in both model and non-model insect species, and focus on two tools that we have developed, REDfly and SCRMshaw. These resources can be paired together in a powerful combination to facilitate insect regulatory annotation over a broad range of species, with an accuracy equal to or better than that of other state-of-the-art methods.
Affiliation(s)
- Hasiba Asma
  - Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Marc S. Halfon
  - Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
  - Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
  - Department of Biomedical Informatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
  - Department of Biological Sciences, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
  - NY State Center of Excellence in Bioinformatics & Life Sciences, Buffalo, NY 14203, USA
|
14
|
Hong J, Gao R, Yang Y. CrepHAN: Cross-species prediction of enhancers by using hierarchical attention networks. Bioinformatics 2021; 37:3436-3443. [PMID: 33978703 DOI: 10.1093/bioinformatics/btab349] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/21/2020] [Revised: 04/21/2021] [Accepted: 05/06/2021] [Indexed: 01/17/2023]
Abstract
MOTIVATION Enhancers are important functional elements in genome sequences. The identification of enhancers is a very challenging task due to the great diversity of enhancer sequences and the flexible localization on genomes. Till now, the interactions between enhancers and genes have not been fully understood yet. To speed up the studies of the regulatory roles of enhancers, computational tools for the prediction of enhancers have emerged in recent years. Especially, thanks to the ENCODE project and the advances of high-throughput experimental techniques, a large amount of experimentally verified enhancers have been annotated on the human genome, which allows large-scale predictions of unknown enhancers using data-driven methods. However, except for human and some model organisms, the validated enhancer annotations are scarce for most species, leading to more difficulties in the computational identification of enhancers for their genomes. RESULTS In this study, we propose a deep learning-based predictor for enhancers, named CrepHAN, which is featured by a hierarchical attention neural network and word embedding-based representations for DNA sequences. We use the experimentally-supported data of the human genome to train the model, and perform experiments on human and other mammals, including mouse, cow, and dog. The experimental results show that CrepHAN has more advantages on cross-species predictions, and outperforms the existing models by a large margin. Especially, for human-mouse cross-predictions, the AUC score of ROC curve is increased by 0.033∼0.145 on the combined tissue dataset and 0.032∼0.109 on tissue-specific datasets. AVAILABILITY bcmi.sjtu.edu.cn/~yangyang/CrepHAN.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jianwei Hong
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Rd., Shanghai 200240, China
  - School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200240, China
- Ruitian Gao
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai 200240, China
- Yang Yang
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Rd., Shanghai 200240, China
  - Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, 200240, China
|
15
|
Rivera J, Keränen SVE, Gallo SM, Halfon MS. REDfly: the transcriptional regulatory element database for Drosophila. Nucleic Acids Res 2019; 47:D828-D834. [PMID: 30329093 PMCID: PMC6323911 DOI: 10.1093/nar/gky957] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 09/06/2018] [Accepted: 10/04/2018] [Indexed: 12/21/2022]
Abstract
The REDfly database provides a comprehensive curation of experimentally-validated Drosophila transcriptional cis-regulatory elements and includes information on DNA sequence, experimental evidence, patterns of regulated gene expression, and more. Now in its thirteenth year, REDfly has grown to over 23 000 records of tested reporter gene constructs and 2200 tested transcription factor binding sites. Recent developments include the start of curation of predicted cis-regulatory modules in addition to experimentally-verified ones, improved search and filtering, and increased interaction with the authors of curated papers. An expanded data model that will capture information on temporal aspects of gene regulation, regulation in response to environmental and other non-developmental cues, sexually dimorphic gene regulation, and non-endogenous (ectopic) aspects of reporter gene expression is under development and expected to be in place within the coming year. REDfly is freely accessible at http://redfly.ccr.buffalo.edu, and news about database updates and new features can be followed on Twitter at @REDfly_database.
Affiliation(s)
- John Rivera
  - Center for Computational Research, State University of New York at Buffalo, Buffalo, NY 14203, USA
  - New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA
- Steven M Gallo
  - Center for Computational Research, State University of New York at Buffalo, Buffalo, NY 14203, USA
  - New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA
- Marc S Halfon
  - New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA
  - Department of Biochemistry, State University of New York at Buffalo, Buffalo, NY 14203, USA
  - Department of Biomedical Informatics, State University of New York at Buffalo, Buffalo, NY 14203, USA
  - Department of Biological Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA
  - Department of Molecular and Cellular Biology and Program in Cancer Genetics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
|
16
|
Tomoyasu Y, Halfon MS. How to study enhancers in non-traditional insect models. J Exp Biol 2020; 223(Suppl 1):jeb212241. [PMID: 32034049 DOI: 10.1242/jeb.212241] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/12/2022]
Abstract
Transcriptional enhancers are central to the function and evolution of genes and gene regulation. At the organismal level, enhancers play a crucial role in coordinating tissue- and context-dependent gene expression. At the population level, changes in enhancers are thought to be a major driving force that facilitates evolution of diverse traits. An amazing array of diverse traits seen in insect morphology, physiology and behavior has been the subject of research for centuries. Although enhancer studies in insects outside of Drosophila have been limited, recent advances in functional genomic approaches have begun to make such studies possible in an increasing selection of insect species. Here, instead of comprehensively reviewing currently available technologies for enhancer studies in established model organisms such as Drosophila, we focus on a subset of computational and experimental approaches that are likely applicable to non-Drosophila insects, and discuss the pros and cons of each approach. We discuss the importance of validating enhancer function and evaluate several possible validation methods, such as reporter assays and genome editing. Key points and potential pitfalls when establishing a reporter assay system in non-traditional insect models are also discussed. We close with a discussion of how to advance enhancer studies in insects, both by improving computational approaches and by expanding the genetic toolbox in various insects. Through these discussions, this Review provides a conceptual framework for studying the function and evolution of enhancers in non-traditional insect models.
Affiliation(s)
- Marc S Halfon
  - Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
|
17
|
Asma H, Halfon MS. Computational enhancer prediction: evaluation and improvements. BMC Bioinformatics 2019; 20:174. [PMID: 30953451 PMCID: PMC6451241 DOI: 10.1186/s12859-019-2781-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 02/15/2019] [Accepted: 03/27/2019] [Indexed: 12/21/2022]
Abstract
BACKGROUND Identifying transcriptional enhancers and other cis-regulatory modules (CRMs) is an important goal of post-sequencing genome annotation. Computational approaches provide a useful complement to empirical methods for CRM discovery, but it is critical that we develop effective means to evaluate their performance in terms of estimating their sensitivity and specificity. RESULTS We introduce here pCRMeval, a pipeline for in silico evaluation of any enhancer prediction tools that are flexible enough to be applied to the Drosophila melanogaster genome. pCRMeval compares the result of predictions with the extensive existing knowledge of experimentally-validated Drosophila CRMs in order to estimate the precision and relative sensitivity of the prediction method. In the case of supervised prediction methods-when training data composed of validated CRMs are used-pCRMeval can also assess the sensitivity of specific training sets. We demonstrate the utility of pCRMeval through evaluation of our SCRMshaw CRM prediction method and training data. By measuring the impact of different parameters on SCRMshaw performance, as assessed by pCRMeval, we develop a more robust version of SCRMshaw, SCRMshaw_HD, that improves the number of predictions while maintaining sensitivity and specificity. Our analysis also demonstrates that SCRMshaw_HD, when applied to increasingly less well-assembled genomes, maintains its strong predictive power with only a minor drop-off in performance. CONCLUSION Our pCRMeval pipeline provides a general framework for evaluation that can be applied to any CRM prediction method, particularly a supervised method. While we make use of it here primarily to test and improve a particular method for CRM prediction, SCRMshaw, pCRMeval should provide a valuable platform to the research community not only for evaluating individual methods, but also for comparing between competing methods.
Affiliation(s)
- Hasiba Asma
  - Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, 701 Ellicott St, Buffalo, NY, 14203, USA
- Marc S Halfon
  - Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, 701 Ellicott St, Buffalo, NY, 14203, USA.
  - Department of Biochemistry, University at Buffalo-State University of New York, 701 Ellicott St, Buffalo, NY, 14203, USA.
  - Department of Biological Sciences, University at Buffalo-State University of New York, 701 Ellicott St, Buffalo, NY, 14203, USA.
  - Department of Biomedical Informatics, University at Buffalo-State University of New York, 701 Ellicott St, Buffalo, NY, 14203, USA.
  - NY State Center of Excellence in Bioinformatics and Life Sciences, 701 Ellicott St, Buffalo, NY, 14203, USA.
  - Molecular and Cellular Biology Department and Program in Cancer Genetics, Roswell Park Comprehensive Cancer Center, Buffalo, NY, 14263, USA.
|
18
|
Abstract
Although the number of sequenced insect genomes numbers in the hundreds, little is known about gene regulatory sequences in any species other than the well-studied Drosophila melanogaster. We provide here a detailed protocol for using SCRMshaw, a computational method for predicting cis-regulatory modules (CRMs, also "enhancers") in sequenced insect genomes. SCRMshaw is effective for CRM discovery throughout the range of holometabolous insects and potentially in even more diverged species, with true-positive prediction rates of 75% or better. Minimal requirements for using SCRMshaw are a genome sequence and training data in the form of known Drosophila CRMs; a comprehensive set of the latter can be obtained from the SCRMshaw download site. For basic applications, a user with only modest computational know-how can run SCRMshaw on a desktop computer. SCRMshaw can be run with a single, narrow set of training data to predict CRMs regulating a specific pattern of gene expression, or with multiple sets of training data covering a broad range of CRM activities to provide an initial rough regulatory annotation of a complete, newly-sequenced genome.
Affiliation(s)
- Majid Kazemian
  - Departments of Biochemistry and Computer Science, Purdue University, West Lafayette, IN, USA.
- Marc S Halfon
  - Departments of Biochemistry, Biomedical Informatics, and Biological Sciences, University at Buffalo-State University of New York, Buffalo, NY, USA.
  - NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY, USA.
  - Department of Molecular and Cellular Biology and Program in Cancer Genetics, Roswell Park Comprehensive Cancer Center, Buffalo, NY, USA.
|
19
|
Lai YT, Deem KD, Borràs-Castells F, Sambrani N, Rudolf H, Suryamohan K, El-Sherif E, Halfon MS, McKay DJ, Tomoyasu Y. Enhancer identification and activity evaluation in the red flour beetle, Tribolium castaneum. Development 2018. [PMID: 29540499 DOI: 10.1242/dev.160663] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Indexed: 12/13/2022]
Abstract
Evolution of cis-regulatory elements (such as enhancers) plays an important role in the production of diverse morphology. However, a mechanistic understanding is often limited by the absence of methods for studying enhancers in species other than established model systems. Here, we sought to establish methods to identify and test enhancer activity in the red flour beetle, Tribolium castaneum. To identify possible enhancer regions, we first obtained genome-wide chromatin profiles from various tissues and stages of Tribolium using FAIRE (formaldehyde-assisted isolation of regulatory elements)-sequencing. Comparison of these profiles revealed a distinct set of open chromatin regions in each tissue and at each stage. In addition, comparison of the FAIRE data with sets of computationally predicted (i.e. supervised cis-regulatory module-predicted) enhancers revealed a very high overlap between the two datasets. Second, using nubbin in the wing and hunchback in the embryo as case studies, we established the first universal reporter assay system that works in various contexts in Tribolium, and in a cross-species context. Together, these advances will facilitate investigation of cis-evolution and morphological diversity in Tribolium and other insects.
Affiliation(s)
- Yi-Ting Lai
  - Department of Biology, Miami University, Oxford, OH 45056, USA
- Kevin D Deem
  - Department of Biology, Miami University, Oxford, OH 45056, USA
- Nagraj Sambrani
  - Department of Biology, Miami University, Oxford, OH 45056, USA
- Heike Rudolf
  - Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen 91058, Germany
- Kushal Suryamohan
  - Department of Biochemistry, State University of New York at Buffalo, Buffalo, NY 14214, USA
- Ezzat El-Sherif
  - Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen 91058, Germany
- Marc S Halfon
  - Department of Biochemistry, State University of New York at Buffalo, Buffalo, NY 14214, USA
- Daniel J McKay
  - Department of Biology, Department of Genetics, Integrative Program for Biological and Genome Sciences, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
|
20
|
Herman-Izycka J, Wlasnowolski M, Wilczynski B. Taking promoters out of enhancers in sequence based predictions of tissue-specific mammalian enhancers. BMC Med Genomics 2017; 10:34. [PMID: 28589862 PMCID: PMC5461523 DOI: 10.1186/s12920-017-0264-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 12/24/2022]
Abstract
BACKGROUND Many genetic diseases are caused by mutations in non-coding regions of the genome. These mutations are frequently found in enhancer sequences, causing disruption to the regulatory program of the cell. Enhancers are short regulatory sequences in the non-coding part of the genome that are essential for the proper regulation of transcription. While the experimental methods for identification of such sequences are improving every year, our understanding of the rules behind the enhancer activity has not progressed much in the last decade. This is especially true in case of tissue-specific enhancers, where there are clear problems in predicting specificity of enhancer activity. RESULTS We show a random-forest based machine learning approach capable of matching the performance of the current state-of-the-art methods for enhancer prediction. Then we show that it is, similarly to other published methods, frequently cross-predicting enhancers as active in different tissues, making it less useful for predicting tissue specific activity. Then we proceed to show that the problem is related to the fact that the enhancer predicting models exhibit a bias towards predicting gene promoters as active enhancers. Then we show that using a two-step classifier can lead to lower cross-prediction between tissues. CONCLUSIONS We provide whole-genome predictions of human heart and brain enhancers obtained with two-step classifier.
Affiliation(s)
- Julia Herman-Izycka
  - Institute of Informatics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland
- Michal Wlasnowolski
  - Institute of Informatics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland
- Bartek Wilczynski
  - Institute of Informatics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland.
|
21
|
Perspectives on Gene Regulatory Network Evolution. Trends Genet 2017; 33:436-447. [PMID: 28528721 DOI: 10.1016/j.tig.2017.04.005] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Received: 03/07/2017] [Revised: 04/24/2017] [Accepted: 04/25/2017] [Indexed: 11/23/2022]
Abstract
Animal development proceeds through the activity of genes and their cis-regulatory modules (CRMs) working together in sets of gene regulatory networks (GRNs). The emergence of species-specific traits and novel structures results from evolutionary changes in GRNs. Recent work in a wide variety of animal models, and particularly in insects, has started to reveal the modes and mechanisms of GRN evolution. I discuss here various aspects of GRN evolution and argue that developmental system drift (DSD), in which conserved phenotype is nevertheless a result of changed genetic interactions, should regularly be viewed from the perspective of GRN evolution. Advances in methods to discover related CRMs in diverse insect species, a critical requirement for detailed GRN characterization, are also described.
|
22
|
Yang W, Sinha S. A novel method for predicting activity of cis-regulatory modules, based on a diverse training set. Bioinformatics 2016; 33:1-7. [PMID: 27609510 DOI: 10.1093/bioinformatics/btw552] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Received: 02/28/2016] [Revised: 07/26/2016] [Accepted: 08/17/2016] [Indexed: 12/31/2022]
Abstract
MOTIVATION With the rapid emergence of technologies for locating cis-regulatory modules (CRMs) genome-wide, the next pressing challenge is to assign precise functions to each CRM, i.e. to determine the spatiotemporal domains or cell-types where it drives expression. A popular approach to this task is to model the typical k-mer composition of a set of CRMs known to drive a common expression pattern, and assign that pattern to other CRMs exhibiting a similar k-mer composition. This approach does not rely on prior knowledge of transcription factors relevant to the CRM or their binding motifs, and is thus more widely applicable than motif-based methods for predicting CRM activity, but is also prone to false positive predictions. RESULTS We present a novel strategy to improve the above-mentioned approach: to predict if a CRM drives a specific gene expression pattern, assess not only how similar the CRM is to other CRMs with similar activity but also to CRMs with distinct activities. We use a state-of-the-art statistical method to quantify a CRM's sequence similarity to many different training sets of CRMs, and employ a classification algorithm to integrate these similarity scores into a single prediction of the CRM's activity. This strategy is shown to significantly improve CRM activity prediction over current approaches. AVAILABILITY AND IMPLEMENTATION Our implementation of the new method, called IMMBoost, is freely available as source code, at https://github.com/weiyangedward/IMMBoost. CONTACT sinhas@illinois.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
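The baseline idea this abstract mentions, comparing CRMs by their k-mer composition, can be illustrated with a minimal Python sketch. It is not IMMBoost and not the statistical model the authors use: the function names `kmer_counts` and `cosine_similarity`, the choice of cosine similarity as the comparison, and the toy sequences are all assumptions made purely for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two k-mer count profiles (0.0 if either is empty)."""
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy sequences, not real CRMs.
known_crm = kmer_counts("ACGTACGTTTGACA", 3)
candidate = kmer_counts("TTGACAACGTACGA", 3)
score = cosine_similarity(known_crm, candidate)
```

A higher score means the candidate shares more of its k-mer profile with the known sequence. As the abstract notes, similarity to a single class alone is prone to false positives, which is the motivation for also comparing against CRMs with distinct activities.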
Affiliation(s)
- Wei Yang
  - Department of Computer Science, University of Illinois, Urbana-Champaign, Urbana, IL, USA
- Saurabh Sinha
  - Department of Computer Science, University of Illinois, Urbana-Champaign, Urbana, IL, USA
|
23
|
Suryamohan K, Hanson C, Andrews E, Sinha S, Scheel MD, Halfon MS. Redeployment of a conserved gene regulatory network during Aedes aegypti development. Dev Biol 2016; 416:402-13. [PMID: 27341759 DOI: 10.1016/j.ydbio.2016.06.031] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 04/07/2016] [Revised: 06/13/2016] [Accepted: 06/20/2016] [Indexed: 10/21/2022]
Abstract
Changes in gene regulatory networks (GRNs) underlie the evolution of morphological novelty and developmental system drift. The fruitfly Drosophila melanogaster and the dengue and Zika vector mosquito Aedes aegypti have substantially similar nervous system morphology. Nevertheless, they show significant divergence in a set of genes co-expressed in the midline of the Drosophila central nervous system, including the master regulator single minded and downstream genes including short gastrulation, Star, and NetrinA. In contrast to Drosophila, we find that midline expression of these genes is either absent or severely diminished in A. aegypti. Instead, they are co-expressed in the lateral nervous system. This suggests that in A. aegypti this "midline GRN" has been redeployed to a new location while lost from its previous site of activity. In order to characterize the relevant GRNs, we employed the SCRMshaw method we previously developed to identify transcriptional cis-regulatory modules in both species. Analysis of these regulatory sequences in transgenic Drosophila suggests that the altered gene expression observed in A. aegypti is the result of trans-dependent redeployment of the GRN, potentially stemming from cis-mediated changes in the expression of sim and other as-yet unidentified regulators. Our results illustrate a novel "repeal, replace, and redeploy" mode of evolution in which a conserved GRN acquires a different function at a new site while its original function is co-opted by a different GRN. This represents a striking example of developmental system drift in which the dramatic shift in gene expression does not result in gross morphological changes, but in more subtle differences in development and function of the late embryonic nervous system.
Affiliation(s)
- Kushal Suryamohan
  - Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY, United States
  - NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY, United States
- Casey Hanson
  - Department of Computer Science, University of Illinois Urbana-Champaign, Champaign, IL, United States
- Emily Andrews
  - Indiana University School of Medicine, Department of Medical and Molecular Genetics, South Bend, IN, United States
- Saurabh Sinha
  - Department of Computer Science, University of Illinois Urbana-Champaign, Champaign, IL, United States
- Molly Duman Scheel
  - Indiana University School of Medicine, Department of Medical and Molecular Genetics, South Bend, IN, United States
  - University of Notre Dame, Eck Inst. for Global Health and Department of Biological Sciences, South Bend, IN, United States
- Marc S Halfon
  - Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY, United States
  - NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY, United States
  - Department of Biological Sciences and Department of Biomedical Informatics, University at Buffalo-State University of New York, Buffalo, NY, United States
  - Department of Molecular and Cellular Biology and Program in Cancer Genetics, Roswell Park Cancer Institute, Buffalo, NY, United States.
|
24
|
Abstract
Young et al., (2010) showed that due to gene length bias the popular Fisher Exact Test should not be used to study the association between a group of differentially expressed (DE) genes and a specific Gene Ontology (GO) category. Instead they suggest a test where one conditions on the genes in the GO category and draws the pseudo DE expressed genes according to a length-dependent distribution. The same model was presented in a different context by Kazemian et al., (2011) who went on to offer a dynamic programming (DP) algorithm to exactly compute the significance of the proposed test. Here we point out that while valid, the test proposed by these authors is no longer symmetric as Fisher's Exact Test is: one gets different answers if one conditions on the observed GO category than on the DE set. As an alternative we offer a symmetric generalization of Fisher's Exact Test and provide efficient algorithms to evaluate its significance.
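For reference, the classical one-sided Fisher Exact Test that this abstract takes as its starting point can be sketched as a hypergeometric tail sum. This is only the standard test, which the abstract says is affected by gene length bias; it is not the length-dependent test or the symmetric generalization the authors propose. The function name `fisher_exact_enrichment` and its argument names are hypothetical.

```python
from math import comb


def fisher_exact_enrichment(n_genes, n_category, n_de, n_overlap):
    """One-sided p-value P(X >= n_overlap) under the hypergeometric null.

    n_genes:    total number of genes considered
    n_category: genes annotated to the GO category
    n_de:       differentially expressed (DE) genes
    n_overlap:  DE genes that are also in the category
    """
    max_overlap = min(n_category, n_de)
    total = comb(n_genes, n_de)
    # Sum the probabilities of every overlap at least as large as observed.
    tail = sum(
        comb(n_category, k) * comb(n_genes - n_category, n_de - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only.
p_value = fisher_exact_enrichment(n_genes=20, n_category=6, n_de=8, n_overlap=3)
```

Swapping `n_category` and `n_de` leaves the result unchanged, which is the symmetry of Fisher's Exact Test that the abstract refers to.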
Affiliation(s)
- David Manescu
  - School of Mathematics and Statistics, University of Sydney, Sydney, Australia
- Uri Keich
  - School of Mathematics and Statistics, University of Sydney, Sydney, Australia
|
25
|
Comin M, Antonello M. On the comparison of regulatory sequences with multiple resolution Entropic Profiles. BMC Bioinformatics 2016; 17:130. [PMID: 26987840 PMCID: PMC4797186 DOI: 10.1186/s12859-016-0980-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 04/29/2015] [Accepted: 03/06/2016] [Indexed: 11/28/2022] Open
Abstract
Background Enhancers are stretches of DNA (100–1000 bp) that play a major role in development gene expression, evolution and disease. It has been recently shown that in high-level eukaryotes enhancers rarely work alone, instead they collaborate by forming clusters of cis-regulatory modules (CRMs). Although the binding of transcription factors is sequence-specific, the identification of functionally similar enhancers is very difficult and it cannot be carried out with traditional alignment-based techniques. Results The use of fast similarity measures, like alignment-free measures, to detect related regulatory sequences is crucial to understand functional correlation between two enhancers. In this paper we study the use of alignment-free measures for the classification of CRMs. However, alignment-free measures are generally tied to a fixed resolution k. Here we propose an alignment-free statistic, called $EP^{*}_{2}$, that is based on multiple resolution patterns derived from the Entropic Profiles (EPs). The Entropic Profile is a function of the genomic location that captures the importance of that region with respect to the whole genome. As a byproduct we provide a formula to compute the exact variance of variable length word counts, a result that can be of general interest also in other applications. Conclusions We evaluate several alignment-free statistics on simulated data and real mouse ChIP-seq sequences. The new statistic, $EP^{*}_{2}$, is highly successful in discriminating functionally related enhancers and, in almost all experiments, it outperforms fixed-resolution methods. We implemented the new alignment-free measures, as well as traditional ones, in a software called EP-sim that is freely available: http://www.dei.unipd.it/~ciompin/main/EP-sim.html.
Affiliation(s)
- Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
- Morris Antonello
- Department of Information Engineering, University of Padova, Padova, Italy
26
Svetlichnyy D, Imrichova H, Fiers M, Kalender Atak Z, Aerts S. Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models. PLoS Comput Biol 2015; 11:e1004590. [PMID: 26562774 PMCID: PMC4642938 DOI: 10.1371/journal.pcbi.1004590] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 05/07/2015] [Accepted: 10/10/2015] [Indexed: 02/02/2023] Open
Abstract
Cancer genomes contain vast amounts of somatic mutations, many of which are passenger mutations not involved in oncogenesis. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on their recurrence, non-coding mutations are usually not recurrent at the same position. Therefore, it is still unclear how to identify cis-regulatory driver mutations, particularly when chromatin data from the same patient is not available, thus relying only on sequence and expression information. Here we use machine-learning methods to predict functional regulatory regions using sequence information alone, and compare the predicted activity of the mutated region with the reference sequence. This way we define the Predicted Regulatory Impact of a Mutation in an Enhancer (PRIME). We find that the recently identified driver mutation in the TAL1 enhancer has a high PRIME score, representing a “gain-of-target” for MYB, whereas the highly recurrent TERT promoter mutation has a surprisingly low PRIME score. We trained Random Forest models for 45 cancer-related transcription factors, and used these to score variations in the HeLa genome and somatic mutations across more than five hundred cancer genomes. Each model predicts only a small fraction of non-coding mutations with a potential impact on the function of the encompassing regulatory region. Nevertheless, as these few candidate driver mutations are often linked to gains in chromatin activity and gene expression, they may contribute to the oncogenic program by altering the expression levels of specific oncogenes and tumor suppressor genes. Precise regulation of gene expression is controlled by cis-regulatory modules (CRM) containing binding sites for transcription factors (TF). 
The genome-wide location of all TF binding sites can often be obtained by ChIP-seq (chromatin immunoprecipitation followed by deep sequencing), yet in most cases only a minority of the binding peaks actually represent functional CRMs that control the transcription initiation of a bona fide TF target gene. Here, we investigated for 45 cancer-related TFs how machine-learning approaches can be used to predict functional TF target CRMs. After careful evaluation of their performance, we used these TF-target classifiers to predict which cis-regulatory mutations may have a significant impact on gene regulation by evaluating whether the mutation causes a significant gain or loss in the probability that the CRM is a functional TF target. We found that Random Forest classifiers can achieve more than 100-fold higher specificity for mutation prediction compared to the simple approaches based on scanning with position weight matrices. By scanning somatic mutations in breast cancer genomes and in the HeLa genome, we finally show that our TF-target classifiers can identify high impact non-coding mutations that are associated with concordant TF binding, gene expression changes and chromatin activity. In conclusion, TF-specific Random Forest classifiers can be used to prioritize cis-regulatory mutations in cancer genomes with high accuracy.
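For orientation, the baseline that this abstract contrasts with its Random Forest classifiers, scanning with a position weight matrix (PWM) and comparing reference and mutated sequence, can be sketched as follows. This is not the PRIME method or the authors' code; the 4-bp motif, the uniform background, and the function names are hypothetical choices made for illustration.

```python
import math

# Hypothetical 4-position motif with consensus ACGT (illustrative values only).
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def log_odds(window):
    """Log2-odds score of one window of len(PWM) bases against the background."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, window))


def best_hit(seq):
    """Return (score, start) of the best-scoring window in seq."""
    width = len(PWM)
    return max((log_odds(seq[i:i + width]), i) for i in range(len(seq) - width + 1))


def mutation_delta(ref_seq, alt_seq):
    """Change in best PWM score caused by a mutation (alt minus ref)."""
    return best_hit(alt_seq)[0] - best_hit(ref_seq)[0]


ref = "TTACGTTT"
alt = "TTACTTTT"  # G>T substitution inside the motif instance
print(best_hit(ref), mutation_delta(ref, alt))
```

A negative delta indicates a predicted loss of the site and a positive delta a gain. According to the abstract, classifiers trained on functional target CRMs are far more specific than this kind of single-motif score.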
Affiliation(s)
- Dmitry Svetlichnyy
- Laboratory of Computational Biology, KU Leuven Center for Human Genetics, Leuven, Belgium
- Hana Imrichova
- Laboratory of Computational Biology, KU Leuven Center for Human Genetics, Leuven, Belgium
- Mark Fiers
- VIB Center for the Biology of Disease, Leuven, Belgium
- Zeynep Kalender Atak
- Laboratory of Computational Biology, KU Leuven Center for Human Genetics, Leuven, Belgium
- Stein Aerts
- Laboratory of Computational Biology, KU Leuven Center for Human Genetics, Leuven, Belgium
27
Kazemian M, Suryamohan K, Chen JY, Zhang Y, Samee MAH, Halfon MS, Sinha S. Evidence for deep regulatory similarities in early developmental programs across highly diverged insects. Genome Biol Evol 2015; 6:2301-20. [PMID: 25173756 PMCID: PMC4217690 DOI: 10.1093/gbe/evu184] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Indexed: 12/16/2022] Open
Abstract
Many genes familiar from Drosophila development, such as the so-called gap, pair-rule, and segment polarity genes, play important roles in the development of other insects and in many cases appear to be deployed in a similar fashion, despite the fact that Drosophila-like "long germband" development is highly derived and confined to a subset of insect families. Whether or not these similarities extend to the regulatory level is unknown. Identification of regulatory regions beyond the well-studied Drosophila has been challenging as even within the Diptera (flies, including mosquitoes) regulatory sequences have diverged past the point of recognition by standard alignment methods. Here, we demonstrate that methods we previously developed for computational cis-regulatory module (CRM) discovery in Drosophila can be used effectively in highly diverged (250-350 Myr) insect species including Anopheles gambiae, Tribolium castaneum, Apis mellifera, and Nasonia vitripennis. In Drosophila, we have successfully used small sets of known CRMs as "training data" to guide the search for other CRMs with related function. We show here that although species-specific CRM training data do not exist, training sets from Drosophila can facilitate CRM discovery in diverged insects. We validate in vivo over a dozen new CRMs, roughly doubling the number of known CRMs in the four non-Drosophila species. Given the growing wealth of Drosophila CRM annotation, these results suggest that extensive regulatory sequence annotation will be possible in newly sequenced insects without recourse to costly and labor-intensive genome-scale experiments. We develop a new method, Regulus, which computes a probabilistic score of similarity based on binding site composition (despite the absence of nucleotide-level sequence alignment), and demonstrate similarity between functionally related CRMs from orthologous loci. 
Our work represents an important step toward being able to trace the evolutionary history of gene regulatory networks and defining the mechanisms underlying insect evolution.
Affiliation(s)
- Majid Kazemian
- Department of Computer Science, University of Illinois at Urbana-Champaign; Laboratory of Molecular Immunology, National Heart Lung and Blood Institute, National Institutes of Health, Bethesda, Maryland
- Kushal Suryamohan
- Department of Biochemistry, University at Buffalo-State University of New York; NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, New York
- Jia-Yu Chen
- Department of Computer Science, University of Illinois at Urbana-Champaign
- Yinan Zhang
- Department of Computer Science, University of Illinois at Urbana-Champaign
- Marc S Halfon
- Department of Biochemistry, University at Buffalo-State University of New York; NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, New York; Department of Biological Sciences, University at Buffalo-State University of New York; Molecular and Cellular Biology Department and Program in Cancer Genetics, Roswell Park Cancer Institute, Buffalo, New York
- Saurabh Sinha
- Department of Computer Science, University of Illinois at Urbana-Champaign; Institute of Genomic Biology, University of Illinois at Urbana-Champaign
28
Suryamohan K, Halfon MS. Identifying transcriptional cis-regulatory modules in animal genomes. Wiley Interdiscip Rev Dev Biol 2015; 4:59-84. [PMID: 25704908 PMCID: PMC4339228 DOI: 10.1002/wdev.168] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.2] [Received: 09/24/2014] [Revised: 11/04/2014] [Accepted: 11/16/2014] [Indexed: 11/08/2022]
Abstract
Gene expression is regulated through the activity of transcription factors (TFs) and chromatin-modifying proteins acting on specific DNA sequences, referred to as cis-regulatory elements. These include promoters, located at the transcription initiation sites of genes, and a variety of distal cis-regulatory modules (CRMs), the most common of which are transcriptional enhancers. Because regulated gene expression is fundamental to cell differentiation and acquisition of new cell fates, identifying, characterizing, and understanding the mechanisms of action of CRMs is critical for understanding development. CRM discovery has historically been challenging, as CRMs can be located far from the genes they regulate, have few readily identifiable sequence characteristics, and for many years were not amenable to high-throughput discovery methods. However, the recent availability of complete genome sequences and the development of next-generation sequencing methods have led to an explosion of both computational and empirical methods for CRM discovery in model and nonmodel organisms alike. Experimentally, CRMs can be identified through chromatin immunoprecipitation directed against TFs or histone post-translational modifications, identification of nucleosome-depleted 'open' chromatin regions, or sequencing-based high-throughput functional screening. Computational methods include comparative genomics, clustering of known or predicted TF-binding sites, and supervised machine-learning approaches trained on known CRMs. All of these methods have proven effective for CRM discovery, but each has its own considerations and limitations, and each is subject to a greater or lesser number of false-positive identifications. Experimental confirmation of predictions is essential, although shortcomings in current methods suggest that additional means of validation need to be developed.
CONFLICT OF INTEREST The authors have declared no conflicts of interest for this article.
Affiliation(s)
- Kushal Suryamohan
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY 14203, USA
- Marc S. Halfon
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biological Sciences, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biomedical Informatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY 14203, USA
- Molecular and Cellular Biology Department and Program in Cancer Genetics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
29
Taher L, Narlikar L, Ovcharenko I. Identification and computational analysis of gene regulatory elements. Cold Spring Harb Protoc 2015; 2015:pdb.top083642. [PMID: 25561628 PMCID: PMC5885252 DOI: 10.1101/pdb.top083642] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 06/04/2023]
Abstract
Over the last two decades, advances in experimental and computational technologies have greatly facilitated genomic research. Next-generation sequencing technologies have made de novo sequencing of large genomes affordable, and powerful computational approaches have enabled accurate annotations of genomic DNA sequences. Charting functional regions in genomes must account for not only the coding sequences, but also noncoding RNAs, repetitive elements, chromatin states, epigenetic modifications, and gene regulatory elements. A mix of comparative genomics, high-throughput biological experiments, and machine learning approaches has played a major role in this truly global effort. Here we describe some of these approaches and provide an account of our current understanding of the complex landscape of the human genome. We also present overviews of different publicly available, large-scale experimental data sets and computational tools, which we hope will prove beneficial for researchers working with large and complex genomes.
Affiliation(s)
- Leila Taher
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
- Institute for Biostatistics and Informatics in Medicine and Ageing Research, University of Rostock, 18051 Rostock, Germany
- Leelavati Narlikar
- Chemical Engineering and Process Development Division, National Chemical Laboratory, CSIR, Pune 411008, India
- Ivan Ovcharenko
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
30
Whitney O, Pfenning AR, Howard JT, Blatti CA, Liu F, Ward JM, Wang R, Audet JN, Kellis M, Mukherjee S, Sinha S, Hartemink AJ, West AE, Jarvis ED. Core and region-enriched networks of behaviorally regulated genes and the singing genome. Science 2014; 346:1256780. [PMID: 25504732 DOI: 10.1126/science.1256780] [Citation(s) in RCA: 75] [Impact Index Per Article: 7.5] [Indexed: 12/11/2022]
Abstract
Songbirds represent an important model organism for elucidating molecular mechanisms that link genes with complex behaviors, in part because they have discrete vocal learning circuits that have parallels with those that mediate human speech. We found that ~10% of the genes in the avian genome were regulated by singing, and we found a striking regional diversity of both basal and singing-induced programs in the four key song nuclei of the zebra finch, a vocal learning songbird. The region-enriched patterns were a result of distinct combinations of region-enriched transcription factors (TFs), their binding motifs, and presinging acetylation of histone 3 at lysine 27 (H3K27ac) enhancer activity in the regulatory regions of the associated genes. RNA interference manipulations validated the role of the calcium-response transcription factor (CaRF) in regulating genes preferentially expressed in specific song nuclei in response to singing. Thus, differential combinatorial binding of a small group of activity-regulated TFs and predefined epigenetic enhancer activity influences the anatomical diversity of behaviorally regulated gene networks.
Affiliation(s)
- Osceola Whitney
- Department of Neurobiology, Howard Hughes Medical Institute, and Duke University Medical Center, Durham, NC 27710, USA
- Andreas R Pfenning
- Department of Neurobiology, Howard Hughes Medical Institute, and Duke University Medical Center, Durham, NC 27710, USA; Computer Science and Artificial Intelligence Laboratory and the Broad Institute of MIT and Harvard, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Jason T Howard
- Department of Neurobiology, Howard Hughes Medical Institute, and Duke University Medical Center, Durham, NC 27710, USA
- Charles A Blatti
- Department of Computer Science, University of Illinois, Urbana-Champaign, IL, USA
- Fang Liu
- Department of Neurobiology, Duke University Medical Center, Durham, NC 27710, USA
- James M Ward
- Department of Neurobiology, Howard Hughes Medical Institute, and Duke University Medical Center, Durham, NC 27710, USA
- Rui Wang
- Department of Neurobiology, Howard Hughes Medical Institute, and Duke University Medical Center, Durham, NC 27710, USA
- Jean-Nicolas Audet
- Department of Biology, McGill University, Montreal, Quebec H3A 1B1, Canada
- Manolis Kellis
- Computer Science and Artificial Intelligence Laboratory and the Broad Institute of MIT and Harvard, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Saurabh Sinha
- Department of Computer Science, University of Illinois, Urbana-Champaign, IL, USA
- Anne E West
- Department of Neurobiology, Duke University Medical Center, Durham, NC 27710, USA
- Erich D Jarvis
- Department of Neurobiology, Howard Hughes Medical Institute, and Duke University Medical Center, Durham, NC 27710, USA
31
Samee MAH, Sinha S. Quantitative modeling of a gene's expression from its intergenic sequence. PLoS Comput Biol 2014; 10:e1003467. [PMID: 24604095 PMCID: PMC3945089 DOI: 10.1371/journal.pcbi.1003467] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Received: 12/26/2012] [Accepted: 12/18/2013] [Indexed: 11/18/2022] Open
Abstract
Modeling a gene's expression from its intergenic locus and trans-regulatory context is a fundamental goal in computational biology. Owing to the distributed nature of cis-regulatory information and the poorly understood mechanisms that integrate such information, gene locus modeling is a more challenging task than modeling individual enhancers. Here we report the first quantitative model of a gene's expression pattern as a function of its locus. We model the expression readout of a locus in two tiers: 1) combinatorial regulation by transcription factors bound to each enhancer is predicted by a thermodynamics-based model and 2) independent contributions from multiple enhancers are linearly combined to fit the gene expression pattern. The model does not require any prior knowledge about enhancers contributing toward a gene's expression. We demonstrate that the model captures the complex multi-domain expression patterns of anterior-posterior patterning genes in the early Drosophila embryo. Altogether, we model the expression patterns of 27 genes; these include several gap genes, pair-rule genes, and anterior, posterior, trunk, and terminal genes. We find that the model-selected enhancers for each gene overlap strongly with its experimentally characterized enhancers. Our findings also suggest the presence of sequence-segments in the locus that would contribute ectopic expression patterns and hence were "shut down" by the model. We applied our model to identify the transcription factors responsible for forming the stripe boundaries of the studied genes. The resulting network of regulatory interactions exhibits a high level of agreement with known regulatory influences on the target genes. Finally, we analyzed whether and why our assumption of enhancer independence was necessary for the genes we studied. We found a deterioration of expression when binding sites in one enhancer were allowed to influence the readout of another enhancer. 
Thus, interference between enhancer activities was a possible factor necessitating enhancer independence in our model.
Affiliation(s)
- Md. Abul Hassan Samee
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Saurabh Sinha
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
32
Behnam E, Waterman MS, Smith AD. A geometric interpretation for local alignment-free sequence comparison. J Comput Biol 2013; 20:471-85. [PMID: 23829649 PMCID: PMC3704055 DOI: 10.1089/cmb.2012.0280] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022] Open
Abstract
Local alignment-free sequence comparison arises in the context of identifying similar segments of sequences that may not be alignable in the traditional sense. We propose a randomized approximation algorithm that is both accurate and efficient. We show that under D2 and its important variant [Formula: see text] as the similarity measure, local alignment-free comparison between a pair of sequences can be formulated as the problem of finding the maximum bichromatic dot product between two sets of points in high dimensions. We introduce a geometric framework that reduces this problem to that of finding the bichromatic closest pair (BCP), allowing the properties of the underlying metric to be leveraged. Local alignment-free sequence comparison can be solved by making a quadratic number of alignment-free substring comparisons. We show both theoretically and through empirical results on simulated data that our approximation algorithm requires a subquadratic number of such comparisons and trades only a small amount of accuracy to achieve this efficiency. Therefore, our algorithm can extend the current usage of alignment-free-based methods and can also be regarded as a substitute for local alignment algorithms in many biological studies.
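The D2 measure named in this abstract is commonly defined as the inner product of the k-mer count vectors of two sequences, which is what makes the dot-product formulation natural. The sketch below computes plain global D2 for a pair of sequences; it is not the authors' local, randomized approximation algorithm, and the function names and the choice of k are illustrative assumptions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """D2 statistic: dot product of the two k-mer count vectors."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


print(d2("ACGT", "ACGT", 2))  # three shared 2-mers, each counted once
print(d2("AAAA", "AAA", 2))   # "AA" occurs 3 times and 2 times
```

A local comparison in the sense of the abstract would apply such a score to pairs of substrings; doing so exhaustively requires the quadratic number of comparisons that the paper's approximation is designed to avoid.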
Affiliation(s)
- Ehsan Behnam
- Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089-2910, USA
33
Satija R, Bradley RK. The TAGteam motif facilitates binding of 21 sequence-specific transcription factors in the Drosophila embryo. Genome Res 2012; 22:656-65. [PMID: 22247430 DOI: 10.1101/gr.130682.111] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Indexed: 01/03/2023]
Abstract
Highly overlapping patterns of genome-wide binding of many distinct transcription factors have been observed in worms, insects, and mammals, but the origins and consequences of this overlapping binding remain unclear. While analyzing chromatin immunoprecipitation data sets from 21 sequence-specific transcription factors active in the Drosophila embryo, we found that binding of all factors exhibits a dose-dependent relationship with "TAGteam" sequence motifs bound by the zinc finger protein Vielfaltig, also known as Zelda, a recently discovered activator of the zygotic genome. TAGteam motifs are present and well conserved in highly bound regions, and are associated with transcription factor binding even in the absence of canonical recognition motifs for these factors. Furthermore, levels of binding in promoters and enhancers of zygotically transcribed genes are correlated with RNA polymerase II occupancy and gene expression levels. Our results suggest that Vielfaltig acts as a master regulator of early development by facilitating the genome-wide establishment of overlapping patterns of binding of diverse transcription factors that drive global gene expression.
Affiliation(s)
- Rahul Satija
- Department of Statistics, Oxford University, Oxford OX1 3TG, United Kingdom
34
Gruel J, LeBorgne M, LeMeur N, Théret N. Simple Shared Motifs (SSM) in conserved region of promoters: a new approach to identify co-regulation patterns. BMC Bioinformatics 2011; 12:365. [PMID: 21910886 PMCID: PMC3215511 DOI: 10.1186/1471-2105-12-365] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/30/2010] [Accepted: 09/12/2011] [Indexed: 01/07/2023] Open
Abstract
Background Regulation of gene expression plays a pivotal role in cellular functions. However, understanding the dynamics of transcription remains a challenging task. A host of computational approaches have been developed to identify regulatory motifs, mainly based on the recognition of DNA sequences for transcription factor binding sites. Recent integration of additional data from genomic analyses or phylogenetic footprinting has significantly improved these methods. Results Here, we propose a different approach based on the compilation of Simple Shared Motifs (SSM), groups of sequences defined by their length and similarity and present in conserved sequences of gene promoters. We developed an original algorithm to search and count SSM in pairs of genes. An exceptional number of SSM is considered as a common regulatory pattern. The SSM approach is applied to a sample set of genes and validated using functional gene-set enrichment analyses. We demonstrate that the SSM approach selects genes that are over-represented in specific biological categories (Ontology and Pathways) and are enriched in co-expressed genes. Finally we show that genes co-expressed in the same tissue or involved in the same biological pathway have increased SSM values. Conclusions Using unbiased clustering of genes, Simple Shared Motifs analysis constitutes an original contribution to provide a clearer definition of expression networks.
Affiliation(s)
- Jérémy Gruel
- EA 4427 SeRAIC IFR140, Université de Rennes 1, 2 avenue du Pr. Léon Bernard, Rennes 35043, France