1
Li Y, Ma A, Wang Y, Guo Q, Wang C, Fu H, Liu B, Ma Q. Enhancer-driven gene regulatory networks inference from single-cell RNA-seq and ATAC-seq data. Brief Bioinform 2024; 25:bbae369. [PMID: 39082647 PMCID: PMC11289686 DOI: 10.1093/bib/bbae369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 06/19/2024] [Accepted: 07/15/2024] [Indexed: 08/03/2024] Open
Abstract
Deciphering the intricate relationships between transcription factors (TFs), enhancers, and genes through the inference of enhancer-driven gene regulatory networks (eGRNs) is crucial in understanding gene regulatory programs in a complex biological system. This study introduces STREAM, a novel method that leverages a Steiner forest problem model, a hybrid biclustering pipeline, and submodular optimization to infer eGRNs from jointly profiled single-cell transcriptome and chromatin accessibility data. Compared to existing methods, STREAM demonstrates enhanced performance in terms of TF recovery, TF-enhancer linkage prediction, and enhancer-gene relation discovery. Application of STREAM to an Alzheimer's disease dataset and a diffuse small lymphocytic lymphoma dataset reveals its ability to identify TF-enhancer-gene relations associated with pseudotime, as well as key TF-enhancer-gene relations and TF cooperation underlying tumor cells.
Affiliation(s)
- Yang Li
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, United States
- Anjun Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, United States
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, United States
- Yizhong Wang
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- Qi Guo
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, United States
- Cankun Wang
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, United States
- Hongjun Fu
  - Department of Neuroscience, College of Medicine, The Ohio State University, Columbus, OH 43210, United States
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- Qin Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, United States
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, United States
2
Agarwal V, Inoue F, Schubach M, Martin BK, Dash PM, Zhang Z, Sohota A, Noble WS, Yardimci GG, Kircher M, Shendure J, Ahituv N. Massively parallel characterization of transcriptional regulatory elements in three diverse human cell types. bioRxiv 2023:2023.03.05.531189. [PMID: 36945371 PMCID: PMC10028905 DOI: 10.1101/2023.03.05.531189] [Citation(s) in RCA: 19] [Impact Index Per Article: 19.0] [Indexed: 03/11/2023]
Abstract
The human genome contains millions of candidate cis-regulatory elements (CREs) with cell-type-specific activities that shape both health and myriad disease states. However, we lack a functional understanding of the sequence features that control the activity and cell-type-specific features of these CREs. Here, we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of over 680,000 sequences, representing a nearly comprehensive set of all annotated CREs among three cell types (HepG2, K562, and WTC11), finding 41.7% to be functional. By testing sequences in both orientations, we find promoters to have significant strand orientation effects. We also observe that their 200 nucleotide cores function as non-cell-type-specific 'on switches' providing similar expression levels to their associated gene. In contrast, enhancers have weaker orientation effects, but increased tissue-specific characteristics. Utilizing our lentiMPRA data, we develop sequence-based models to predict CRE function with high accuracy and delineate regulatory motifs. Testing an additional lentiMPRA library encompassing 60,000 CREs in all three cell types, we further identified factors that determine cell-type specificity. Collectively, our work provides an exhaustive catalog of functional CREs in three widely used cell lines, and showcases how large-scale functional measurements can be used to dissect regulatory grammar.
Affiliation(s)
- Vikram Agarwal
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
  - mRNA Center of Excellence, Sanofi Pasteur Inc., Waltham, MA 02451, USA
- Fumitaka Inoue
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA 94158, USA
  - Institute for Human Genetics, University of California San Francisco, San Francisco, CA 94158, USA
  - Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
- Max Schubach
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, 10178, Berlin, Germany
- Beth K. Martin
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Pyaree Mohan Dash
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, 10178, Berlin, Germany
- Zicong Zhang
  - Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
- Ajuni Sohota
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA 94158, USA
  - Institute for Human Genetics, University of California San Francisco, San Francisco, CA 94158, USA
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Galip Gürkan Yardimci
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
  - Knight Cancer Institute, Oregon Health and Science University, Portland, OR, USA
  - Cancer Early Detection Advanced Research Center, Oregon Health and Science University, Portland, OR, USA
- Martin Kircher
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, 10178, Berlin, Germany
  - Institute of Human Genetics, University Medical Center Schleswig-Holstein, University of Lübeck, Lübeck, Germany
- Jay Shendure
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
  - Howard Hughes Medical Institute, Seattle, WA 98195, USA
  - Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA 98195, USA
  - Allen Center for Cell Lineage Tracing, University of Washington, Seattle, WA 98195, USA
- Nadav Ahituv
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA 94158, USA
  - Institute for Human Genetics, University of California San Francisco, San Francisco, CA 94158, USA
3
Schreiber J, Bilmes J, Noble WS. Prioritizing transcriptomic and epigenomic experiments using an optimization strategy that leverages imputed data. Bioinformatics 2021; 37:439-447. [PMID: 32966546 PMCID: PMC8088321 DOI: 10.1093/bioinformatics/btaa830] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/17/2019] [Revised: 07/28/2020] [Accepted: 09/09/2020] [Indexed: 12/03/2022] Open
Abstract
Motivation: Successful science often involves not only performing experiments well, but also choosing well among many possible experiments. In a hypothesis generation setting, choosing an experiment well means choosing an experiment whose results are interesting or novel. In this work, we formalize this selection procedure in the context of genomics and epigenomics data generation. Specifically, we consider the task faced by a scientific consortium such as the National Institutes of Health ENCODE Consortium, whose goal is to characterize all of the functional elements in the human genome. Given a list of possible cell types or tissue types (‘biosamples’) and a list of possible high-throughput sequencing assays, where at least one experiment has been performed in each biosample and for each assay, we ask ‘Which experiments should ENCODE perform next?’
Results: We demonstrate how to represent this task as a submodular optimization problem, where the goal is to choose a panel of experiments that maximizes the facility location function. A key aspect of our approach is that we use imputed data, rather than experimental data, to directly answer the posed question. We find that, across several evaluations, our method chooses a panel of experiments that spans a diversity of biochemical activity. Finally, we propose two modifications of the facility location function, including a novel submodular–supermodular function, that allow incorporation of domain knowledge or constraints into the optimization procedure.
Availability and implementation: Our method is available as a Python package at https://github.com/jmschrei/kiwano and can be installed using the command pip install kiwano. The source code used here and the similarity matrix can be found at http://doi.org/10.5281/zenodo.3708538.
Supplementary information: Supplementary data are available at Bioinformatics online.
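The facility location objective named in this abstract can be illustrated with a minimal sketch. It assumes a non-negative pairwise similarity matrix and uses the standard greedy heuristic for monotone submodular maximization; the function names (`facility_location`, `greedy_select`) and the toy matrix are hypothetical and are not taken from the kiwano package or from the paper's data.

```python
import numpy as np

def facility_location(similarity, selected):
    """f(S) = sum over all items i of max over j in S of similarity[i, j]."""
    if not selected:
        return 0.0
    return float(similarity[:, selected].max(axis=1).sum())

def greedy_select(similarity, k):
    """Pick k items, each time adding the one with the largest marginal gain.

    Assumes similarity is square and non-negative (illustrative assumption).
    """
    n = similarity.shape[0]
    selected = []
    best_cover = np.zeros(n)  # best similarity of each item to the chosen set
    for _ in range(k):
        # Marginal gain of candidate j: sum_i max(similarity[i, j] - best_cover[i], 0)
        gains = np.maximum(similarity - best_cover[:, None], 0.0).sum(axis=0)
        if selected:
            gains[selected] = -np.inf  # never pick the same item twice
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, similarity[:, j])
    return selected

# Toy similarity matrix: items 0-1 resemble each other, as do items 2-3.
toy = np.array([
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])
panel = greedy_select(toy, 2)
print(panel, facility_location(toy, panel))
```

On this toy input the greedy choice takes one item from each group rather than two near-duplicates, which is the diversity-seeking behavior the abstract describes for a panel of experiments.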
4
Yilmaz S, Tastan O, Cicek AE. SPADIS: An Algorithm for Selecting Predictive and Diverse SNPs in GWAS. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:1208-1216. [PMID: 31443041 DOI: 10.1109/tcbb.2019.2935437] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 06/10/2023]
Abstract
Phenotypic heritability of complex traits and diseases is seldom explained by individual genetic variants identified in genome-wide association studies (GWAS). Many methods have been developed to select a subset of variant loci that are associated with or predictive of the phenotype. Selecting connected SNPs on SNP-SNP networks has proven successful in finding biologically interpretable and predictive SNPs. However, we argue that the connectedness constraint favors selecting redundant features that affect similar biological processes and therefore does not necessarily yield better predictive performance. In this paper, we propose a novel method called SPADIS that favors the selection of remotely located SNPs in order to account for their complementary effects in explaining a phenotype. SPADIS selects a diverse set of loci on a SNP-SNP network. This is achieved by maximizing a submodular set function with a greedy algorithm that ensures a constant factor approximation to the optimal solution. We compare SPADIS to the state-of-the-art method SConES on a dataset of Arabidopsis thaliana with continuous flowering time phenotypes. SPADIS has better average phenotype prediction performance in 15 out of 17 phenotypes when the same number of SNPs is selected and provides consistent improvements across multiple networks and settings on average. Moreover, it identifies more candidate genes and runs faster.
5
Wang X, Rai N, Merchel Piovesan Pereira B, Eetemadi A, Tagkopoulos I. Accelerated knowledge discovery from omics data by optimal experimental design. Nat Commun 2020; 11:5026. [PMID: 33024104 PMCID: PMC7538421 DOI: 10.1038/s41467-020-18785-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 12/31/2019] [Accepted: 08/27/2020] [Indexed: 12/15/2022] Open
Abstract
How to design experiments that accelerate knowledge discovery on complex biological landscapes remains a tantalizing question. We present an optimal experimental design method (coined OPEX) to identify informative omics experiments using machine learning models for both experimental space exploration and model training. OPEX-guided exploration of Escherichia coli populations exposed to biocide and antibiotic combinations led to more accurate predictive models of gene expression with 44% less data. Analysis of the proposed experiments shows that broad exploration of the experimental space followed by fine-tuning emerges as the optimal strategy. Additionally, analysis of the experimental data reveals 29 cases of cross-stress protection and 4 cases of cross-stress vulnerability. Further validation reveals the central role of chaperones, stress response proteins and transport pumps in cross-stress exposure. This work demonstrates how active learning can be used to guide omics data collection for training predictive models, making evidence-driven decisions and accelerating knowledge discovery in life sciences. Here, the authors present OPEX, an optimal experimental design method to identify informative omics experiments for both experimental space exploration and model training.
Affiliation(s)
- Xiaokang Wang
  - Department of Biomedical Engineering, University of California, Davis, CA, 95616, USA
  - Genome Center, University of California, Davis, CA, 95616, USA
- Navneet Rai
  - Genome Center, University of California, Davis, CA, 95616, USA
  - Department of Computer Science, University of California, Davis, CA, 95616, USA
- Beatriz Merchel Piovesan Pereira
  - Genome Center, University of California, Davis, CA, 95616, USA
  - Microbiology Graduate Group, University of California, Davis, CA, 95616, USA
- Ameen Eetemadi
  - Genome Center, University of California, Davis, CA, 95616, USA
  - Department of Computer Science, University of California, Davis, CA, 95616, USA
- Ilias Tagkopoulos
  - Genome Center, University of California, Davis, CA, 95616, USA
  - Department of Computer Science, University of California, Davis, CA, 95616, USA
6
Bai W, Noble WS, Bilmes JA. Submodular Maximization via Gradient Ascent: The Case of Deep Submodular Functions. Adv Neural Inf Process Syst 2018; 2018:7989-7999. [PMID: 30705579 PMCID: PMC6351064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
We study the problem of maximizing deep submodular functions (DSFs) [13, 3] subject to a matroid constraint. DSFs are an expressive class of submodular functions that include, as strict subfamilies, the facility location, weighted coverage, and sums of concave composed with modular functions. We use a strategy similar to the continuous greedy approach [6], but we show that the multilinear extension of any DSF has a natural and computationally attainable concave relaxation that we can optimize using gradient ascent. Our results show a guarantee of $\max_{0<\delta<1}\left(1-\epsilon-\delta-e^{-\delta^{2}\Omega(k)}\right)$ with a running time of $O(n^{2}/\epsilon^{2})$ plus time for pipage rounding [6] to recover a discrete solution, where $k$ is the rank of the matroid constraint. This bound is often better than the standard $1-1/e$ guarantee of the continuous greedy algorithm, and our algorithm runs much faster. Our bound also holds even for fully curved ($c = 1$) functions, where the guarantee of $1-c/e$ degenerates to $1-1/e$, where $c$ is the curvature of $f$ [37]. We perform computational experiments that support our theoretical results.
Affiliation(s)
- Wenruo Bai
  - Depts. of Electrical & Computer Engineering, Seattle, WA 98195
- William S Noble
  - Genome Sciences Seattle, WA 98195
  - Computer Science and Engineering, Seattle, WA 98195
- Jeff A Bilmes
  - Depts. of Electrical & Computer Engineering, Seattle, WA 98195
  - Computer Science and Engineering, Seattle, WA 98195
7
PREDICTD PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition. Nat Commun 2018; 9:1402. [PMID: 29643364 PMCID: PMC5895786 DOI: 10.1038/s41467-018-03635-9] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 04/10/2017] [Accepted: 03/02/2018] [Indexed: 11/24/2022] Open
Abstract
The Encyclopedia of DNA Elements (ENCODE) and the Roadmap Epigenomics Project seek to characterize the epigenome in diverse cell types using assays that identify, for example, genomic regions with modified histones or accessible chromatin. These efforts have produced thousands of datasets but cannot possibly measure each epigenomic factor in all cell types. To address this, we present a method, PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition (PREDICTD), to computationally impute missing experiments. PREDICTD leverages an elegant model called “tensor decomposition” to impute many experiments simultaneously. Compared with the current state-of-the-art method, ChromImpute, PREDICTD produces lower overall mean squared error, and combining the two methods yields further improvement. We show that PREDICTD data captures enhancer activity at noncoding human accelerated regions. PREDICTD provides reference imputed data and open-source software for investigating new cell types, and demonstrates the utility of tensor decomposition and cloud computing, both promising technologies for bioinformatics. Assays to characterize the epigenome and interrogate chromatin state genome wide have so far been performed in a selected set of conditions. Here, Durham et al. develop a computational method based on tensor decomposition to impute missing experiments in collections of epigenomics experiments.
8
Libbrecht MW, Bilmes JA, Noble WS. Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization. Proteins 2018; 86:454-466. [PMID: 29345009 DOI: 10.1002/prot.25461] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 07/13/2017] [Revised: 12/15/2017] [Accepted: 01/08/2018] [Indexed: 11/10/2022]
Abstract
Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
Affiliation(s)
- Maxwell W Libbrecht
  - Department of Genome Sciences, University of Washington, Seattle, Washington
- Jeffrey A Bilmes
  - Department of Electrical Engineering, University of Washington, Seattle, Washington
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington
  - Department of Computer Science and Engineering, University of Washington, Seattle, Washington