1
Gugulothu P, Bhukya R. Exploring coronavirus sequence motifs through convolutional neural network for accurate identification of COVID-19. Comput Methods Biomech Biomed Engin 2024:1-15. [PMID: 39508163 DOI: 10.1080/10255842.2024.2404149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 04/22/2024] [Accepted: 09/05/2024] [Indexed: 11/08/2024]
Abstract
The SARS-CoV-2 virus reportedly originated in Wuhan in 2019 and caused the coronavirus disease outbreak (COVID-19), which was formally declared a pandemic. Numerous studies have been carried out to diagnose and treat COVID-19 during the disease's spread. However, the genetic similarity between SARS-CoV-2 and other coronaviruses makes it challenging to differentiate between them. It is therefore essential to identify swiftly whether an epidemic is caused by a novel virus or a well-known pathogen. In the present article, the DeepCoV deep-learning (DL) approach uses layered convolutional neural networks (CNNs) to distinguish severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) from other viral diseases. Additionally, various motifs linked with SARS-CoV-2 can be located by examining the learned convolutional filters. In identifying these important motifs, DeepCoV demonstrates the interpretability of CNNs. Experiments were conducted using the 2019nCoVR datasets, and the results indicate that DeepCoV performed more accurately than several benchmark ML models. Additionally, DeepCoV achieved maximum areas under the precision-recall curve (AUCPR) and the receiver operating characteristic curve (AUC-ROC) of 98.62% and 98.58%, respectively. Overall, these investigations support the use of DL algorithms as a valuable alternative for identifying SARS-CoV-2 and for detecting disease-related patterns in SARS-CoV-2 genes.
Affiliation(s)
- Praveen Gugulothu
- Computer Science and Engineering, National Institute of Technology, Warangal, India
- Raju Bhukya
- Computer Science and Engineering, National Institute of Technology, Warangal, India
2
Toneyan S, Koo PK. Interpreting cis-regulatory interactions from large-scale deep neural networks. Nat Genet 2024; 56:2517-2527. [PMID: 39284975 DOI: 10.1038/s41588-024-01923-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2023] [Accepted: 08/21/2024] [Indexed: 09/25/2024]
Abstract
The rise of large-scale, sequence-based deep neural networks (DNNs) for predicting gene expression has introduced challenges in their evaluation and interpretation. Current evaluations align DNN predictions with orthogonal experimental data, providing insights into generalization but offering limited insights into their decision-making process. Existing model explainability tools focus mainly on motif analysis, which becomes complex when interpreting longer sequences. Here we present cis-regulatory element model explanations (CREME), an in silico perturbation toolkit that interprets the rules of gene regulation learned by a genomic DNN. Applying CREME to Enformer, a state-of-the-art DNN, we identify cis-regulatory elements that enhance or silence gene expression and characterize their complex interactions. CREME can provide interpretations across multiple scales of genomic organization, from cis-regulatory elements to fine-mapped functional sequence elements within them, offering high-resolution insights into the regulatory architecture of the genome. CREME provides a powerful toolkit for translating the predictions of genomic DNNs into mechanistic insights of gene regulation.
Affiliation(s)
- Shushan Toneyan
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, New York, NY, USA
- Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, New York, NY, USA
3
Zhu W, Li W, Zhang H, Li L. Big data and artificial intelligence-aided crop breeding: Progress and prospects. Journal of Integrative Plant Biology 2024. [PMID: 39467106 DOI: 10.1111/jipb.13791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2024] [Revised: 08/25/2024] [Accepted: 09/10/2024] [Indexed: 10/30/2024]
Abstract
The past decade has witnessed rapid developments in gene discovery, biological big data (BBD), artificial intelligence (AI)-aided technologies, and molecular breeding. These advancements are expected to accelerate crop breeding under the pressure of increasing demands for food. Here, we first summarize current breeding methods and discuss the need for new ways to support breeding efforts. Then, we review how to combine BBD and AI technologies for genetic dissection, exploring functional genes, predicting regulatory elements and functional domains, and phenotypic prediction. Finally, we propose the concept of intelligent precision design breeding (IPDB) driven by AI technology and offer ideas about how to implement IPDB. We hope that IPDB will improve the predictability and efficiency, and reduce the cost, of crop breeding compared with current technologies. As an example of IPDB, we explore the possibilities offered by CropGPT, which combines biological techniques, bioinformatics, and breeding art from breeders, and presents an open, shareable, and cooperative breeding system. IPDB provides integrated services and communication platforms for biologists, bioinformatics experts, germplasm resource specialists, breeders, dealers, and farmers, and should be well suited for future breeding.
Affiliation(s)
- Wanchao Zhu
- Key Laboratory of Biology and Genetic Improvement of Maize in Arid Area of Northwest Region, College of Agronomy, Northwest A&F University, Yangling, 712100, China
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Weifu Li
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, 430070, China
- Hongwei Zhang
- State Key Laboratory of Crop Gene Resources and Breeding, National Key Facility for Crop Gene Resources and Genetic Improvement, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Lin Li
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
4
La Fleur A, Shi Y, Seelig G. Decoding biology with massively parallel reporter assays and machine learning. Genes Dev 2024; 38:843-865. [PMID: 39362779 PMCID: PMC11535156 DOI: 10.1101/gad.351800.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/05/2024]
Abstract
Massively parallel reporter assays (MPRAs) are powerful tools for quantifying the impacts of sequence variation on gene expression. Reading out molecular phenotypes with sequencing enables interrogating the impact of sequence variation beyond genome scale. Machine learning models integrate and codify information learned from MPRAs and enable generalization by predicting sequences outside the training data set. Models can provide a quantitative understanding of cis-regulatory codes controlling gene expression, enable variant stratification, and guide the design of synthetic regulatory elements for applications from synthetic biology to mRNA and gene therapy. This review focuses on cis-regulatory MPRAs, particularly those that interrogate cotranscriptional and post-transcriptional processes: alternative splicing, cleavage and polyadenylation, translation, and mRNA decay.
Affiliation(s)
- Alyssa La Fleur
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
- Yongsheng Shi
- Department of Microbiology and Molecular Genetics, School of Medicine, University of California, Irvine, Irvine, California 92697, USA
- Georg Seelig
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
- Department of Electrical & Computer Engineering, University of Washington, Seattle, Washington 98195, USA
5
Li RZ, Han CZ, Glass CK. TIANA: transcription factors cooperativity inference analysis with neural attention. BMC Bioinformatics 2024; 25:274. [PMID: 39174927 PMCID: PMC11342676 DOI: 10.1186/s12859-024-05852-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Accepted: 07/01/2024] [Indexed: 08/24/2024] Open
Abstract
BACKGROUND Growing evidence suggests that distal regulatory elements are essential for cellular function and states. The sequences within these distal elements, especially motifs for transcription factor binding, provide critical information about the underlying regulatory programs. However, cooperativities between transcription factors that recognize these motifs are nonlinear and multiplexed, rendering traditional modeling methods insufficient to capture the underlying mechanisms. Recent developments in attention mechanisms, which exhibit superior performance in capturing dependencies across input sequences, make them well suited to uncover and decipher intricate dependencies between regulatory elements. RESULT We present Transcription factors cooperativity Inference Analysis with Neural Attention (TIANA), a deep learning framework that focuses on interpretability. In this study, we demonstrated that TIANA could discover biologically relevant insights into co-occurring pairs of transcription factor motifs. Compared with existing tools, TIANA showed superior interpretability and robust performance in identifying putative transcription factor cooperativities from co-occurring motifs. CONCLUSION Our results suggest that TIANA can be an effective tool to decipher transcription factor cooperativities from distal sequence data. TIANA can be accessed through: https://github.com/rzzli/TIANA.
Affiliation(s)
- Rick Z Li
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, 92093, USA
- Claudia Z Han
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, 92093, USA
- Christopher K Glass
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, 92093, USA
6
van Hilten A, Katz S, Saccenti E, Niessen WJ, Roshchupkin GV. Designing interpretable deep learning applications for functional genomics: a quantitative analysis. Brief Bioinform 2024; 25:bbae449. [PMID: 39293804 PMCID: PMC11410376 DOI: 10.1093/bib/bbae449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2024] [Revised: 08/07/2024] [Accepted: 08/28/2024] [Indexed: 09/20/2024] Open
Abstract
Deep learning applications have had a profound impact on many scientific fields, including functional genomics. Deep learning models can learn complex interactions between and within omics data; however, interpreting and explaining these models can be challenging. Interpretability is essential not only to help progress our understanding of the biological mechanisms underlying traits and diseases but also for establishing trust in these models' efficacy for healthcare applications. Recognizing this importance, recent years have seen the development of numerous diverse interpretability strategies, making it increasingly difficult to navigate the field. In this review, we present a quantitative analysis of the challenges arising when designing interpretable deep learning solutions in functional genomics. We explore design choices related to the characteristics of genomics data, the neural network architectures applied, and strategies for interpretation. By quantifying the current state of the field with a predefined set of criteria, we find the most frequent solutions, highlight exceptional examples, and identify unexplored opportunities for developing interpretable deep learning models in genomics.
Affiliation(s)
- Arno van Hilten
- Department of Radiology and Nuclear Medicine, Erasmus MC, 3015 GD Rotterdam, The Netherlands
- Sonja Katz
- Department of Radiology and Nuclear Medicine, Erasmus MC, 3015 GD Rotterdam, The Netherlands
- Laboratory of Systems and Synthetic Biology, Wageningen University & Research, 6700 HB Wageningen WE, The Netherlands
- Edoardo Saccenti
- Laboratory of Systems and Synthetic Biology, Wageningen University & Research, 6700 HB Wageningen WE, The Netherlands
- Wiro J Niessen
- Department of Imaging Physics, Delft University of Technology, 2628 CD Delft, The Netherlands
- Gennady V Roshchupkin
- Department of Radiology and Nuclear Medicine, Erasmus MC, 3015 GD Rotterdam, The Netherlands
- Department of Epidemiology, Erasmus MC, 3015 GD Rotterdam, The Netherlands
7
Gonzalez-Avalos E, Onodera A, Samaniego-Castruita D, Rao A, Ay F. Predicting gene expression state and prioritizing putative enhancers using 5hmC signal. Genome Biol 2024; 25:142. [PMID: 38825692 PMCID: PMC11145787 DOI: 10.1186/s13059-024-03273-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Accepted: 05/11/2024] [Indexed: 06/04/2024] Open
Abstract
BACKGROUND Like its parent base 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC) is a direct epigenetic modification of cytosines in the context of CpG dinucleotides. 5hmC is the most abundant oxidized form of 5mC, generated through the action of TET dioxygenases at gene bodies of actively transcribed genes and at active or lineage-specific enhancers. Although such enrichments are reported for 5hmC, to date, predictive models of gene expression state or putative regulatory regions for genes using 5hmC have not been developed. RESULTS Here, by using only 5hmC enrichment in genic regions and their vicinity, we develop neural network models that predict gene expression state across 49 cell types. We show that our deep neural network models distinguish high vs low expression state utilizing only 5hmC levels and that these predictive models generalize to unseen cell types. Further, in order to leverage 5hmC signal in distal enhancers for expression prediction, we employ an Activity-by-Contact model and also develop a graph convolutional neural network model, both of which utilize Hi-C data and 5hmC enrichment to prioritize enhancer-promoter links. These approaches identify known and novel putative enhancers for key genes in multiple immune cell subsets. CONCLUSIONS Our work highlights the importance of 5hmC in gene regulation through proximal and distal mechanisms and provides a framework to link it to genome function. With the recent advances in 6-letter DNA sequencing by short and long-read techniques, profiling of 5mC and 5hmC may be done routinely in the near future, providing a broad range of applications for the methods developed here.
Affiliation(s)
- Edahi Gonzalez-Avalos
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA
- Atsushi Onodera
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA
- Department of Immunology, Graduate School of Medicine, Chiba University, Chiba, 260-8670, Japan
- Daniela Samaniego-Castruita
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA
- Biological Sciences Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA
- Anjana Rao
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA.
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA.
- Department of Pharmacology, University of California San Diego, La Jolla, CA, 92093, USA.
- Sanford Consortium for Regenerative Medicine, La Jolla, CA, 92093, USA.
- Moores Cancer Center, University of California San Diego, La Jolla, CA, 92093, USA.
- Ferhat Ay
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA.
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA.
- Moores Cancer Center, University of California San Diego, La Jolla, CA, 92093, USA.
- Department of Pediatrics, University of California San Diego, La Jolla, CA, 92093, USA.
8
Cochran K, Yin M, Mantripragada A, Schreiber J, Marinov GK, Kundaje A. Dissecting the cis-regulatory syntax of transcription initiation with deep learning. bioRxiv: the preprint server for biology 2024:2024.05.28.596138. [PMID: 38853896 PMCID: PMC11160661 DOI: 10.1101/2024.05.28.596138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Despite extensive characterization of mammalian Pol II transcription, the DNA sequence determinants of transcription initiation at a third of human promoters and most enhancers remain poorly understood. Hence, we trained and interpreted a neural network called ProCapNet that accurately models base-resolution initiation profiles from PRO-cap experiments using local DNA sequence. ProCapNet learns sequence motifs with distinct effects on initiation rates and TSS positioning and uncovers context-specific cryptic initiator elements intertwined within other TF motifs. ProCapNet annotates predictive motifs in nearly all actively transcribed regulatory elements across multiple cell-lines, revealing a shared cis-regulatory logic across promoters and enhancers mediated by a highly epistatic sequence syntax of cooperative and competitive motif interactions. ProCapNet models of RAMPAGE profiles measuring steady-state RNA abundance at TSSs distill initiation signals on par with models trained directly on PRO-cap profiles. ProCapNet learns a largely cell-type-agnostic cis-regulatory code of initiation complementing sequence drivers of cell-type-specific chromatin state critical for accurate prediction of cell-type-specific transcription initiation.
Affiliation(s)
- Kelly Cochran
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Jacob Schreiber
- Department of Genetics, Stanford University, Stanford, CA, USA
- Anshul Kundaje
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
9
Toneyan S, Koo PK. Interpreting Cis-Regulatory Interactions from Large-Scale Deep Neural Networks for Genomics. bioRxiv: the preprint server for biology 2024:2023.07.03.547592. [PMID: 37461616 PMCID: PMC10349992 DOI: 10.1101/2023.07.03.547592] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/28/2023]
Abstract
The rise of large-scale, sequence-based deep neural networks (DNNs) for predicting gene expression has introduced challenges in their evaluation and interpretation. Current evaluations align DNN predictions with experimental perturbation assays, which provides insights into the generalization capabilities within the studied loci but offers a limited perspective of what drives their predictions. Moreover, existing model explainability tools focus mainly on motif analysis, which becomes complex when interpreting longer sequences. Here we introduce CREME, an in silico perturbation toolkit that interrogates large-scale DNNs to uncover the rules of gene regulation that they learn. Using CREME, we investigate Enformer, a prominent DNN in gene expression prediction, revealing cis-regulatory elements (CREs) that directly enhance or silence target genes. We explore the complexity of higher-order CRE interactions, the effect of CRE distance from transcription start sites on gene expression, as well as the biochemical features of enhancers and silencers learned by Enformer. Moreover, we demonstrate the flexibility of CREME to efficiently uncover a higher-resolution view of functional sequence elements within CREs. This work demonstrates how CREME can be employed to translate the powerful predictions of large-scale DNNs to study open questions in gene regulation.
Affiliation(s)
- Shushan Toneyan
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
- Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
10
Santorsola M, Lescai F. The promise of explainable deep learning for omics data analysis: Adding new discovery tools to AI. N Biotechnol 2023; 77:1-11. [PMID: 37329982 DOI: 10.1016/j.nbt.2023.06.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/27/2023] [Revised: 06/01/2023] [Accepted: 06/14/2023] [Indexed: 06/19/2023]
Abstract
Deep learning has already revolutionised the way a wide range of data is processed in many areas of daily life. The ability to learn abstractions and relationships from heterogeneous data has provided impressively accurate prediction and classification tools to handle increasingly big datasets. This has a significant impact on the growing wealth of omics datasets, with the unprecedented opportunity for a better understanding of the complexity of living organisms. While this revolution is transforming the way these data are analyzed, explainable deep learning is emerging as an additional tool with the potential to change the way biological data is interpreted. Explainability addresses critical issues such as transparency, which is especially important when computational tools are introduced in clinical environments. Moreover, it empowers artificial intelligence with the capability to provide new insights into the input data, thus adding an element of discovery to these already powerful resources. In this review, we provide an overview of the transformative effects explainable deep learning is having on multiple sectors, ranging from genome engineering and genomics to radiomics, drug design and clinical trials. We offer a perspective to life scientists, to better understand the potential of these tools, and a motivation to implement them in their research, by suggesting learning resources they can use to take their first steps in this field.
Affiliation(s)
- Francesco Lescai
- Department of Biology and Biotechnology, University of Pavia, Pavia, Italy
11
Wang X, Zeng H, Lin L, Huang Y, Lin H, Que Y. Deep learning-empowered crop breeding: intelligent, efficient and promising. Frontiers in Plant Science 2023; 14:1260089. [PMID: 37860239 PMCID: PMC10583549 DOI: 10.3389/fpls.2023.1260089] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/18/2023] [Accepted: 09/13/2023] [Indexed: 10/21/2023]
Abstract
Crop breeding is one of the main approaches to increase crop yield and improve crop quality. However, the breeding process faces challenges such as complex data, difficulties in data acquisition, and low prediction accuracy, resulting in low breeding efficiency and long breeding cycles. Deep learning-based crop breeding is a strategy that applies deep learning techniques to improve and optimize the breeding process, leading to accelerated crop improvement, enhanced breeding efficiency, and the development of higher-yielding, more adaptive, and disease-resistant varieties for agricultural production. This perspective briefly discusses the mechanisms, key applications, and impact of deep learning in crop breeding. We also highlight the current challenges associated with this topic and provide insights into its future application prospects.
Affiliation(s)
- Xiaoding Wang
- Fujian Provincial Key Lab of Network Security & Cryptology, College of Computer and Cyber Security, Fujian Normal University, Fuzhou, China
- Haitao Zeng
- Fujian Provincial Key Lab of Network Security & Cryptology, College of Computer and Cyber Security, Fujian Normal University, Fuzhou, China
- Limei Lin
- Fujian Provincial Key Lab of Network Security & Cryptology, College of Computer and Cyber Security, Fujian Normal University, Fuzhou, China
- Yanze Huang
- School of Computer Science and Mathematics, Fujian Provincial Key Laboratory of Big Data Mining and Applications, Fujian University of Technology, Fuzhou, China
- Hui Lin
- Fujian Provincial Key Lab of Network Security & Cryptology, College of Computer and Cyber Security, Fujian Normal University, Fuzhou, China
- Youxiong Que
- Key Laboratory of Sugarcane Biology and Genetic Breeding, Ministry of Agriculture and Rural Affairs, Fujian Agriculture and Forestry University, Fuzhou, China
- National Key Laboratory for Tropical Crop Breeding, Institute of Tropical Bioscience and Biotechnology, Chinese Academy of Tropical Agricultural Sciences, Hainan, China
12
Recio PS, Mitra NJ, Shively CA, Song D, Jaramillo G, Lewis KS, Chen X, Mitra R. Zinc cluster transcription factors frequently activate target genes using a non-canonical half-site binding mode. Nucleic Acids Res 2023; 51:5006-5021. [PMID: 37125648 PMCID: PMC10250231 DOI: 10.1093/nar/gkad320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Revised: 04/11/2023] [Accepted: 04/14/2023] [Indexed: 05/02/2023] Open
Abstract
Gene expression changes are orchestrated by transcription factors (TFs), which bind to DNA to regulate gene expression. It remains surprisingly difficult to predict basic features of the transcriptional process, including in vivo TF occupancy. Existing thermodynamic models of TF function are often not concordant with experimental measurements, suggesting undiscovered biology. Here, we analyzed one of the most well-studied TFs, the yeast zinc cluster Gal4, constructed a Shea-Ackers thermodynamic model to describe its binding, and compared the results of this model to experimentally measured Gal4p binding in vivo. We found that at many promoters, the model predicted no Gal4p binding, yet substantial binding was observed. These outlier promoters lacked canonical binding motifs, and subsequent investigation revealed Gal4p binds unexpectedly to DNA sequences with high densities of its half site (CGG). We confirmed this novel mode of binding through multiple experimental and computational paradigms; we also found most other zinc cluster TFs we tested frequently utilize this binding mode, at 27% of their targets on average. Together, these results demonstrate a novel mode of binding where zinc clusters, the largest class of TFs in yeast, bind DNA sequences with high densities of half sites.
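The notion of a sequence with a "high density" of the CGG half site can be illustrated with a minimal Python sketch. The abstract does not say how the authors define or threshold density, so the function names and the choice to count matches on either strand and divide by sequence length are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch only: the density definition below (matches on either
# strand per base pair) is an assumption, not the definition used in the paper.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def half_site_density(seq: str, half_site: str = "CGG") -> float:
    """Count windows matching the half site on either strand, per base pair."""
    seq = seq.upper()
    if not seq:
        return 0.0
    site = half_site.upper()
    rc_site = reverse_complement(site)  # "CCG" for the default "CGG"
    k = len(site)
    hits = sum(
        1
        for i in range(len(seq) - k + 1)
        if seq[i:i + k] in (site, rc_site)
    )
    return hits / len(seq)


# One CGG and one CCG (CGG on the opposite strand) in 8 bp.
print(half_site_density("CGGAACCG"))  # 0.25
```

A score like this could be used to rank promoters that lack a full canonical motif, although any cutoff for calling a region "high density" would have to come from the paper itself.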
Affiliation(s)
- Pamela S Recio
- Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- Nikhil J Mitra
- Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- Christian A Shively
- Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- David Song
- Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- Grace Jaramillo
- Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- Kristine Shady Lewis
- Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- Xuhua Chen
- Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- Robi D Mitra
- Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- McDonnell Genome Institute, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
13
Kim DS. Deep Learning on Chromatin Accessibility. Methods Mol Biol 2023; 2611:325-333. [PMID: 36807077 DOI: 10.1007/978-1-0716-2899-7_18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/23/2023]
Abstract
DNA accessibility has been a powerful tool in locating active regulatory elements in a cell type, but dissecting the combinatorial logic within these regulatory elements has been a continued challenge in the field. Deep learning models have been shown to be highly predictive models of regulatory DNA and have led to new biological insights on regulatory syntax and logic. Here, we provide a framework for deep learning in genomics that implements best practices and focuses on ease of use, versatility, and compatibility with existing tools for inference on DNA sequence.
Affiliation(s)
- Daniel S Kim
- Biomedical Informatics Program, Stanford University School of Medicine, Stanford, CA, USA
14
Predicting the prevalence of complex genetic diseases from individual genotype profiles using capsule networks. Nat Mach Intell 2023. [DOI: 10.1038/s42256-022-00604-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/15/2023]
Abstract
Diseases that have a complex genetic architecture tend to involve considerable numbers of genetic variants that, although playing a role in the disease, have not yet been revealed as such. Two major causes for this phenomenon are genetic variants that do not stack up effects, but interact in complex ways; in addition, as recently suggested, the omnigenic model postulates that variants interact in a holistic manner to establish disease phenotypes. Here we present DiseaseCapsule, a capsule-network-based approach that explicitly aims to capture the hierarchical structure of the underlying genome data, and has the potential to fully capture the non-linear relationships between variants and disease. DiseaseCapsule is the first such approach to operate in a whole-genome manner when predicting disease occurrence from individual genotype profiles. In experiments, we evaluated DiseaseCapsule on amyotrophic lateral sclerosis (ALS) and Parkinson’s disease, with a particular emphasis on ALS, which is known to have a complex genetic architecture and is affected by 40% missing heritability. On ALS, DiseaseCapsule achieves 86.9% accuracy on hold-out test data in predicting disease occurrence, thereby outperforming all other approaches by large margins. Also, DiseaseCapsule required substantially less training data to reach optimal performance. Last but not least, the systematic exploitation of the network architecture yielded 922 genes of particular interest, and 644 ‘non-additive’ genes that are crucial factors in DiseaseCapsule, but remain masked within linear schemes.
|
15
|
Novakovsky G, Dexter N, Libbrecht MW, Wasserman WW, Mostafavi S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat Rev Genet 2023; 24:125-137. [PMID: 36192604 DOI: 10.1038/s41576-022-00532-2] [Citation(s) in RCA: 63] [Impact Index Per Article: 63.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 08/31/2022] [Indexed: 01/24/2023]
Abstract
Artificial intelligence (AI) models based on deep learning now represent the state of the art for making functional predictions in genomics research. However, the underlying basis on which predictive models make such predictions is often unknown. For genomics researchers, this missing explanatory information would frequently be of greater value than the predictions themselves, as it can enable new insights into genetic processes. We review progress in the emerging area of explainable AI (xAI), a field with the potential to empower life science researchers to gain mechanistic insights into complex deep learning models. We discuss and categorize approaches for model interpretation, including an intuitive understanding of how each approach works and their underlying assumptions and limitations in the context of typical high-throughput biological datasets.
Affiliation(s)
- Gherman Novakovsky
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada; Bioinformatics Graduate Program, University of British Columbia, Vancouver, British Columbia, Canada
| | - Nick Dexter
- Department of Mathematics, Simon Fraser University, Burnaby, British Columbia, Canada; School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
| | - Maxwell W Libbrecht
- School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada.
| | - Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada.
| | - Sara Mostafavi
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA; Canadian Institute for Advanced Research, Toronto, Ontario, Canada.
|
16
|
Chen Z, Liao M, Yang Z, Chen W, Wei S, Zou J, Peng Z. Co-expression network analysis of genes and networks associated with wheat pistillody. PeerJ 2022; 10:e13902. [PMID: 36039368 PMCID: PMC9419718 DOI: 10.7717/peerj.13902] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2022] [Accepted: 07/24/2022] [Indexed: 01/19/2023] Open
Abstract
Crop male sterility has great value in theoretical research and breeding applications. HTS-1, whose stamens are transformed into pistils or pistil-like structures, is an important male-sterile material selected from Chinese Spring three-pistil (CSTP) wheat. However, the molecular mechanism of pistillody development in HTS-1 remains a mystery. RNA-seq data of 11 wheat tissues were obtained from the National Center for Biotechnology Information (NCBI), including the stamens of CSTP and the pistils and pistillodic stamens of HTS-1. The Salmon program was used to quantify the gene expression levels of the 11 wheat tissues, and the gene quantification results were normalized as transcripts per million (TPM). In total, 58,576 genes were used to construct a block-wise network with the weighted gene co-expression network analysis (WGCNA) R package. We obtained all modules significantly associated with the 11 wheat tissues. AgriGO V2.0 was used for Gene Ontology (GO) enrichment analysis, and genes and transcription factors (TFs) related to wheat pistillody development in these significant modules were identified from the GO enrichment results. The basic local alignment search tool (BLAST) was used to align HTS-1 proteins with published pistillody-related proteins and TFs. Genes related to wheat pistillody development were analyzed and validated by qRT-PCR. The MEturquoise, MEsaddlebrown, MEplum, MEcoral1, MElightsteelblue1, and MEdarkslateblue modules were significantly correlated with pistillodic stamens (correlation p < 0.05). Moreover, 206 genes related to carpel development (GO:0048440) or gynoecium development (GO:0048467) were identified only in the MEturquoise module by GO analysis, and 42 of these 206 genes were hub genes in the MEturquoise module. qRT-PCR results showed that 38 of the 42 hub genes were more highly expressed in pistils and pistillodic stamens than in stamens. A total of 15 pistillody development-related proteins were validated by BLAST.
TFs were also analyzed in the MEturquoise module, and 618 TFs were identified. In total, 56 TFs from 11 families were considered to regulate the development of pistillodic stamens. The co-expression network showed that six HB genes and three BES1 genes were among the 42 hub genes. This indicated that TFs play important roles in wheat pistillody development. In addition, 11 ethylene-related genes were connected with TFs or hub genes, suggesting important roles for ethylene-related genes in pistillody development. These results provide important insights into the molecular interactions underlying pistillody development.
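The TPM normalization mentioned in the abstract above can be illustrated with a minimal sketch. It assumes plain transcript lengths in base pairs and made-up read counts; Salmon itself reports TPM directly and works with effective lengths, so the function `tpm` below is a hypothetical helper for illustration only, not the authors' pipeline.

```python
import numpy as np

def tpm(counts, lengths_bp):
    """Convert read counts to transcripts per million (TPM).

    Assumption: plain transcript lengths are used here; quantifiers such as
    Salmon use effective lengths instead.
    """
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1000.0
    rate = counts / lengths_kb            # reads per kilobase of transcript
    return rate / rate.sum() * 1e6        # rescale so the values sum to one million

# Illustrative counts and lengths (not data from the study).
example = tpm([100, 200, 300], [1000, 2000, 1500])
print(example)
```

Because every sample is rescaled to the same total, TPM values are comparable across tissues, which is why they are a common input for co-expression analysis.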
Affiliation(s)
- Zhenyong Chen
- Key Laboratory of Southwest China Wildlife Resources Conservation (Ministry of Education), College of Life Science, China West Normal University, Nanchong, Sichuan, People’s Republic of China
| | - Mingli Liao
- Key Laboratory of Southwest China Wildlife Resources Conservation (Ministry of Education), College of Life Science, China West Normal University, Nanchong, Sichuan, People’s Republic of China
| | - Zaijun Yang
- Key Laboratory of Southwest China Wildlife Resources Conservation (Ministry of Education), College of Life Science, China West Normal University, Nanchong, Sichuan, People’s Republic of China
| | - Weiying Chen
- Key Laboratory of Southwest China Wildlife Resources Conservation (Ministry of Education), College of Life Science, China West Normal University, Nanchong, Sichuan, People’s Republic of China
| | - Shuhong Wei
- Key Laboratory of Southwest China Wildlife Resources Conservation (Ministry of Education), College of Life Science, China West Normal University, Nanchong, Sichuan, People’s Republic of China
| | - Jian Zou
- Key Laboratory of Southwest China Wildlife Resources Conservation (Ministry of Education), College of Life Science, China West Normal University, Nanchong, Sichuan, People’s Republic of China
| | - Zhengsong Peng
- School of Agricultural Science, Xichang University, Xichang, Sichuan, People’s Republic of China
|
17
|
Barshai M, Aubert A, Orenstein Y. G4detector: Convolutional Neural Network to Predict DNA G-Quadruplexes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1946-1955. [PMID: 33872156 DOI: 10.1109/tcbb.2021.3073595] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
G-quadruplexes (G4s) are nucleic acid secondary structures that form within guanine-rich DNA or RNA sequences. G4 formation can affect chromatin architecture and gene regulation, and has been associated with genomic instability, genetic diseases, and cancer progression. The experimental data produced by the G4-seq experiment provides unprecedented details on G4 formation in the genome. Still, running the experimental protocol on a whole genome is an expensive and time-consuming process. Thus, it is highly desirable to have a computational method to predict G4 formation in new DNA sequences or whole genomes. Here, we present G4detector, a new method based on a convolutional neural network to predict G4s from DNA sequences. On top of the sequence information, we improved prediction accuracy by the addition of RNA secondary structure information. To train and test G4detector, we compiled novel high-throughput benchmarks over multiple species genomes measured by the G4-seq protocol. We show that G4detector outperforms extant methods for the same task on all benchmark datasets, can detect G4s genome-wide with high accuracy, and is able to extrapolate human-trained measurements to various non-human species. The code and benchmarks are publicly available on github.com/OrensteinLab/G4detector.
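As a rough illustration of how a convolutional filter reads a DNA sequence, the sketch below one-hot encodes a sequence and slides a single filter along it. The helper names `one_hot` and `scan` and the hand-built GGG filter are assumptions made for this example; they are not taken from the G4detector code, where filters are learned during training.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) matrix; unknown bases stay all-zero."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded

def scan(encoded, kernel):
    """Slide one filter along the sequence (valid cross-correlation)."""
    width = kernel.shape[0]
    n_windows = encoded.shape[0] - width + 1
    return np.array([np.sum(encoded[i:i + width] * kernel) for i in range(n_windows)])

# A hand-built filter that responds to a run of three guanines (hypothetical).
ggg_filter = one_hot("GGG")
activations = scan(one_hot("ACGGGT"), ggg_filter)
print(activations)
```

The activation peaks where the window matches the filter exactly, which is the basic mechanism that lets a trained network respond to guanine-rich stretches.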
|
18
|
Abstract
The tremendous amount of biological sequence data available, combined with the recent methodological breakthrough in deep learning in domains such as computer vision or natural language processing, is leading today to the transformation of bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that the use of deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of the genome functions and the possibility to write synthetic genomic sequences.
|
19
|
Wang X, Cao X, Feng Y, Guo M, Yu G, Wang J. ELSSI: parallel SNP-SNP interactions detection by ensemble multi-type detectors. Brief Bioinform 2022; 23:6607749. [PMID: 35696639 DOI: 10.1093/bib/bbac213] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2022] [Revised: 04/18/2022] [Accepted: 05/07/2022] [Indexed: 12/11/2022] Open
Abstract
With the development of high-throughput genotyping technology, detecting single nucleotide polymorphism (SNP)-SNP interactions (SSIs) has become an essential way of understanding disease susceptibility. Various methods have been proposed to detect SSIs. However, given the complexity of disease and the bias of individual SSI detectors, these single-detector-based methods generally do not scale to real genome-wide data and yield unfavorable results. We propose a novel ensemble learning-based approach (ELSSI) that can significantly reduce the bias of individual detectors and their computational load. ELSSI randomly divides SNPs into different subsets and evaluates them with multi-type detectors in parallel. In particular, ELSSI introduces a four-stage pipeline (generate, score, switch and filter) to iteratively generate new SNP combination subsets from SNP subsets, score each combination subset with individual detectors, switch high-scoring combinations to other detectors for re-scoring, and then filter out combinations with low scores. This pipeline enables ELSSI to detect high-order SSIs in large genome-wide datasets. Experimental results on various simulated and real genome-wide datasets show that ELSSI is more effective than state-of-the-art methods in detecting SSIs, especially high-order ones. ELSSI can run on moderately equipped PCs over the Internet and can flexibly incorporate new detectors. The code of ELSSI is available at https://www.sdu-idea.cn/codes.php?name=ELSSI.
Affiliation(s)
- Xin Wang
- School of Software, Shandong University, Jinan 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Xia Cao
- College of Computer and Information Sciences, Southwest University, Chongqing 400715, China
| | - Yuantao Feng
- College of Computer and Information Sciences, Southwest University, Chongqing 400715, China
| | - Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
| | - Guoxian Yu
- School of Software, Shandong University, Jinan 250101, China
| | - Jun Wang
- Joint SDU-NTU Centre for Artificial Intelligence Research(C-FAIR), Shandong University, Jinan 250101, China
|
20
|
DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat Genet 2022; 54:613-624. [PMID: 35551305 DOI: 10.1038/s41588-022-01048-5] [Citation(s) in RCA: 69] [Impact Index Per Article: 34.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2021] [Accepted: 03/08/2022] [Indexed: 02/06/2023]
Abstract
Enhancer sequences control gene expression and comprise binding sites (motifs) for different transcription factors (TFs). Despite extensive genetic and computational studies, the relationship between DNA sequence and regulatory activity is poorly understood, and de novo enhancer design has been challenging. Here, we built a deep-learning model, DeepSTARR, to quantitatively predict the activities of thousands of developmental and housekeeping enhancers directly from DNA sequence in Drosophila melanogaster S2 cells. The model learned relevant TF motifs and higher-order syntax rules, including functionally nonequivalent instances of the same TF motif that are determined by motif-flanking sequence and intermotif distances. We validated these rules experimentally and demonstrated that they can be generalized to humans by testing more than 40,000 wildtype and mutant Drosophila and human enhancers. Finally, we designed and functionally validated synthetic enhancers with desired activities de novo.
|
21
|
Kaplow IM, Banerjee A, Foo CS. Neural network modeling of differential binding between wild-type and mutant CTCF reveals putative binding preferences for zinc fingers 1-2. BMC Genomics 2022; 23:295. [PMID: 35410161 PMCID: PMC9004084 DOI: 10.1186/s12864-022-08486-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2021] [Accepted: 03/21/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Many transcription factors (TFs), such as multi zinc-finger (ZF) TFs, have multiple DNA binding domains (DBDs), and deciphering the DNA binding motifs of individual DBDs is a major challenge. One example of such a TF is CCCTC-binding factor (CTCF), a TF with eleven ZFs that plays a variety of roles in transcriptional regulation, most notably anchoring DNA loops. Previous studies found that CTCF ZFs 3-7 bind CTCF's core motif and ZFs 9-11 bind a specific upstream motif, but the motifs of ZFs 1-2 have yet to be identified. RESULTS We developed a new approach to identifying the binding motifs of individual DBDs of a TF by analyzing chromatin immunoprecipitation sequencing (ChIP-seq) experiments in which a single DBD is mutated: we train a deep convolutional neural network to predict whether wild-type TF binding sites are preserved in the mutant TF dataset and interpret the model. We applied this approach to mouse CTCF ChIP-seq data and identified the known binding preferences of CTCF ZFs 3-11 as well as a putative GAG binding motif for ZF 1. We analyzed other CTCF datasets to provide additional evidence that ZF 1 is associated with binding at the motif we identified, and we found that the presence of the motif for ZF 1 is associated with CTCF ChIP-seq peak strength. CONCLUSIONS Our approach can be applied to any TF for which in vivo binding data from both the wild-type and mutated versions of the TF are available, and our findings provide potential new insights into the binding preferences of CTCF's DBDs.
Affiliation(s)
- Irene M Kaplow
- Departments of Computer Science, Stanford University, 240 Pasteur Drive, Stanford, California, 94305, USA; Present address: Department of Computational Biology, Carnegie Mellon University, 5000 Forbes Avenue, Gates-Hillman Building Room 7703, Pittsburgh, PA, 15213, USA.
| | - Abhimanyu Banerjee
- Departments of Physics, Stanford University, 240 Pasteur Drive, Stanford, California, 94305, USA
| | - Chuan Sheng Foo
- Departments of Computer Science, Stanford University, 240 Pasteur Drive, Stanford, California, 94305, USA; Present address: Machine Intellection Department, Institute for Infocomm Research, 1 Fusionopolis Way, #21-01 Connexis South Tower, Singapore, 138632, Singapore.
|
22
|
Yaish O, Orenstein Y. Computational modeling of mRNA degradation dynamics using deep neural networks. Bioinformatics 2022; 38:1087-1101. [PMID: 34849591 DOI: 10.1093/bioinformatics/btab800] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2021] [Revised: 11/12/2021] [Accepted: 11/22/2021] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION Messenger RNA (mRNA) degradation plays critical roles in post-transcriptional gene regulation. A major component of mRNA degradation is determined by 3'-UTR elements. Hence, researchers are interested in studying mRNA dynamics as a function of 3'-UTR elements. A recent study measured the mRNA degradation dynamics of tens of thousands of 3'-UTR sequences using a massively parallel reporter assay. However, the computational approach used to model mRNA degradation was based on a simplifying assumption of a linear degradation rate. Consequently, the underlying mechanism of 3'-UTR elements is still not fully understood. RESULTS Here, we developed deep neural networks to predict mRNA degradation dynamics and interpreted the networks to identify regulatory elements in the 3'-UTR and their positional effect. Given an input of a 110 nt-long 3'-UTR sequence and an initial mRNA level, the model predicts mRNA levels at eight consecutive time points. Our deep neural networks significantly improved prediction performance of mRNA degradation dynamics compared with extant methods for the task. Moreover, we demonstrated that models predicting the dynamics of two identical 3'-UTR sequences, differing by their poly(A) tail, performed better than single-task models. On the interpretability front, by using Integrated Gradients, our convolutional neural network (CNN) models identified known and novel cis-regulatory sequence elements of mRNA degradation. By applying a novel systematic evaluation of model interpretability, we demonstrated that the recurrent neural network models are inferior to the CNN models in terms of interpretability and that random initialization ensemble improves both prediction and interpretability performance. Moreover, using a mutagenesis analysis, we newly discovered the positional effect of various 3'-UTR elements. AVAILABILITY AND IMPLEMENTATION All the code developed through this study is available at github.com/OrensteinLab/DeepUTR/.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
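Integrated Gradients, the attribution method named in the abstract above, averages a model's gradients along a straight path from a baseline input to the actual input and scales them by the input difference. The sketch below applies it to a toy logistic model with an analytic gradient; the model, the weights `w`, and the all-zero baseline are assumptions for illustration and have nothing to do with the DeepUTR networks.

```python
import numpy as np

def model(x, w):
    """Toy differentiable model: logistic function of a weighted sum."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def model_grad(x, w):
    """Analytic gradient of the toy model with respect to the input x."""
    prob = model(x, w)
    return prob * (1.0 - prob) * w

def integrated_gradients(x, baseline, w, steps=200):
    """Midpoint Riemann-sum approximation of the Integrated Gradients path integral."""
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)    # points between baseline and x
    grads = np.array([model_grad(point, w) for point in path])
    return (x - baseline) * grads.mean(axis=0)

w = np.array([1.0, -2.0, 0.5])          # hypothetical weights
x = np.array([1.0, 0.0, 1.0])           # hypothetical input
baseline = np.zeros_like(x)
attributions = integrated_gradients(x, baseline, w)
print(attributions, attributions.sum())
```

A useful sanity check is the completeness property: the attributions should sum approximately to the difference between the model output at the input and at the baseline.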
Affiliation(s)
- Ofir Yaish
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
| | - Yaron Orenstein
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
|
23
|
Musolf AM, Holzinger ER, Malley JD, Bailey-Wilson JE. What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics. Hum Genet 2021; 141:1515-1528. [PMID: 34862561 PMCID: PMC9360120 DOI: 10.1007/s00439-021-02402-z] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2021] [Accepted: 11/08/2021] [Indexed: 01/26/2023]
Abstract
Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative solution. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods also evaluate which features (variables) are responsible for creating a good prediction; this is called feature importance. This is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) are responsible for a good prediction. This allows for deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. Feature importance further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms are described: k nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.
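One widely used, model-agnostic way to obtain a feature importance score is permutation importance: shuffle one feature and measure how much the prediction error grows. The sketch below is only an illustration of that general idea on simulated SNP genotypes coded 0/1/2; the data, the least-squares model, and the function name `permutation_importance` are assumptions, and the review may emphasize other metrics.

```python
import numpy as np

def permutation_importance(predict, X, y, rng, repeats=10):
    """Mean increase in squared error when each feature column is shuffled."""
    base_error = np.mean((predict(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(repeats):
            shuffled = X.copy()
            shuffled[:, j] = rng.permutation(shuffled[:, j])
            scores[j] += np.mean((predict(shuffled) - y) ** 2) - base_error
    return scores / repeats

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 3)).astype(float)    # simulated genotypes 0/1/2
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.1, size=500)     # only SNP 0 affects the trait

# Fit an ordinary least-squares model with an intercept column.
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
predict = lambda M: np.c_[M, np.ones(len(M))] @ coef

importance = permutation_importance(predict, X, y, rng)
print(importance)
```

In this simulation the first SNP should receive by far the largest score, since shuffling the two uninformative SNPs barely changes the error.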
Affiliation(s)
- Anthony M Musolf
- Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA
| | - Emily R Holzinger
- Target Sciences, Informatics and Predictive Sciences, Bristol Myers Squibb, Cambridge, MA, USA
| | - James D Malley
- Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA
| | - Joan E Bailey-Wilson
- Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA.
|
24
|
Abd El Hamid MM, Shaheen M, Mabrouk MS, Omar YMK. MACHINE LEARNING FOR DETECTING EPISTASIS INTERACTIONS AND ITS RELEVANCE TO PERSONALIZED MEDICINE IN ALZHEIMER’S DISEASE: SYSTEMATIC REVIEW. BIOMEDICAL ENGINEERING: APPLICATIONS, BASIS AND COMMUNICATIONS 2021; 33. [DOI: 10.4015/s1016237221500472] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/02/2023]
Abstract
Alzheimer’s disease (AD) is a progressive disease that attacks the brain’s neurons and causes problems in memory, thinking, and reasoning skills. Personalized Medicine (PM) needs a better and more accurate understanding of the relationship between human genetic data and complex diseases like AD. The goal of PM is to tailor treatment to each patient’s individual characteristics. PM requires the prediction of a person’s disease from genetic data, and its success depends on the accurate detection of genetic biomarkers. Single nucleotide polymorphisms (SNPs) are considered the most prevalent type of variation in the human genome. Epistasis has biological relevance to complex diseases and has an important impact on PM. Detecting the most significant epistatic interactions associated with complex diseases is a big challenge. This paper reviews several machine learning techniques and algorithms for detecting the most significant epistatic interactions in Alzheimer’s disease. We discuss many machine learning techniques that can be used for detecting SNP combinations, such as Random Forests, Support Vector Machines, Multifactor Dimensionality Reduction, Neural Networks, and Deep Learning. This review highlights the pros and cons of these techniques and explains how they can be combined in an efficient framework for knowledge discovery and data mining in AD.
Affiliation(s)
- Marwa M. Abd El Hamid
- The Higher Institute of Computer Science & Information Technology, El-Shorouk Academy, El Shorouk City, Cairo, Egypt
- College of Computing and Information Technology AASTMT, Egypt
| | - Mohamed Shaheen
- College of Computing and Information Technology AASTMT, Egypt
| | - Mai S. Mabrouk
- Biomedical Engineering Department Misr University for Science and Technology 6th of October City, Egypt
|
25
|
Claussnitzer M, Susztak K. Gaining insight into metabolic diseases from human genetic discoveries. Trends Genet 2021; 37:1081-1094. [PMID: 34315631 PMCID: PMC8578350 DOI: 10.1016/j.tig.2021.07.005] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2021] [Revised: 06/29/2021] [Accepted: 07/05/2021] [Indexed: 12/30/2022]
Abstract
Large-scale human genetic association studies have identified sequence variations at thousands of genetic risk loci that are more common in patients with diverse metabolic diseases compared with healthy controls. While these genetic associations have been replicated in multiple large cohorts and sometimes can explain up to 50% of heritability, the molecular and cellular mechanisms affected by common genetic variation associated with metabolic disease remain mostly unknown. A variety of new genome-wide data types, in conjunction with novel biostatistical and computational analytical methodologies and foundational experimental technologies, are paving the way for a principled approach to systematic variant-to-function (V2F) studies for metabolic diseases, turning associated regions into causal variants, cell types and states of action, effector genes, and cellular and physiological mechanisms. Identification of new target genes and cellular programs for metabolic risk loci will improve mechanistic understanding of disease biology and identification of novel therapeutic strategies.
Affiliation(s)
- Melina Claussnitzer
- Beth Israel Deaconess Medical Center, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
| | - Katalin Susztak
- Department of Medicine and Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA.
|
26
|
The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation. Nat Genet 2021; 53:1564-1576. [PMID: 34650237 PMCID: PMC8763320 DOI: 10.1038/s41588-021-00947-3] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2020] [Accepted: 09/01/2021] [Indexed: 01/24/2023]
Abstract
Transcription factors bind DNA sequence motif vocabularies in cis-regulatory elements (CREs) to modulate chromatin state and gene expression during cell state transitions. A quantitative understanding of how motif lexicons influence dynamic regulatory activity has been elusive due to the combinatorial nature of the cis-regulatory code. To address this, we undertook multiomic data profiling of chromatin and expression dynamics across epidermal differentiation to identify 40,103 dynamic CREs associated with 3,609 dynamically expressed genes, then applied an interpretable deep-learning framework to model the cis-regulatory logic of chromatin accessibility. This analysis framework identified cooperative DNA sequence rules in dynamic CREs regulating synchronous gene modules with diverse roles in skin differentiation. Massively parallel reporter assay analysis validated temporal dynamics and cooperative cis-regulatory logic. Variants linked to human polygenic skin disease were enriched in these time-dependent combinatorial motif rules. This integrative approach shows the combinatorial cis-regulatory lexicon of epidermal differentiation and represents a general framework for deciphering the organizational principles of the cis-regulatory code of dynamic gene regulation.
|
27
|
Morilla I. Repairing the human with artificial intelligence in oncology. Artif Intell Cancer 2021; 2:60-68. [DOI: 10.35713/aic.v2.i5.60] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/15/2021] [Revised: 10/26/2021] [Accepted: 10/27/2021] [Indexed: 02/06/2023] Open
Abstract
Artificial intelligence is a groundbreaking tool for learning and analysing higher-level features extracted from any dataset at large scale. This ability makes it ideal for tackling complex problems that arise in the biomedical domain in general and in oncology in particular. In this work, we aim to provide a global view of the growth of this mathematical discipline by linking it to related subdomains such as transfer, reinforcement, and federated learning. In addition, we introduce the recently popular method of topological data analysis, which improves the performance of learning models.
Affiliation(s)
- Ian Morilla
- Laboratoire Analyse, Géométrie et Applications - Institut Galilée, Sorbonne Paris Nord University, Paris 75006, France
|
28
|
Ullah F, Ben-Hur A. A self-attention model for inferring cooperativity between regulatory features. Nucleic Acids Res 2021; 49:e77. [PMID: 33950192 PMCID: PMC8287919 DOI: 10.1093/nar/gkab349] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2020] [Revised: 04/15/2021] [Accepted: 04/20/2021] [Indexed: 11/14/2022] Open
Abstract
Deep learning has demonstrated its predictive power in modeling complex biological phenomena such as gene expression. The value of these models hinges not only on their accuracy, but also on the ability to extract biologically relevant information from the trained models. While there has been much recent work on developing feature attribution methods that discover the most important features for a given sequence, inferring cooperativity between regulatory elements, which is the hallmark of phenomena such as gene expression, remains an open problem. We present SATORI, a Self-ATtentiOn based model to detect Regulatory element Interactions. Our approach combines convolutional layers with a self-attention mechanism that helps us capture a global view of the landscape of interactions between regulatory elements in a sequence. A comprehensive evaluation demonstrates the ability of SATORI to identify numerous statistically significant TF-TF interactions, many of which have been previously reported. Our method is able to detect higher numbers of experimentally verified TF-TF interactions than existing methods, and has the advantage of not requiring a computationally expensive post-processing step. Finally, SATORI can be used for detection of any type of feature interaction in models that use a similar attention mechanism, and is not limited to the detection of TF-TF interactions.
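The self-attention mechanism referred to above can be sketched as single-head scaled dot-product attention applied to a matrix of per-position features, such as the output of a convolutional layer. Everything in the block (the random features, the projection matrices `w_q`, `w_k`, `w_v`, and the single head) is an assumption for illustration; the abstract does not specify SATORI's actual architecture or hyperparameters.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(features, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over sequence positions."""
    q, k, v = features @ w_q, features @ w_k, features @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[1]))   # (positions, positions)
    return weights @ v, weights

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 8))                      # 6 positions, 8 channels (hypothetical)
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
output, weights = self_attention(features, w_q, w_k, w_v)
print(weights.round(2))
```

Each row of `weights` describes how strongly one position attends to every other position; pairwise matrices of this kind are the sort of quantity an interaction analysis could inspect.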
Affiliation(s)
- Fahad Ullah
- Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA
| | - Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA
|
29
|
Koo PK, Majdandzic A, Ploenzke M, Anand P, Paul SB. Global importance analysis: An interpretability method to quantify importance of genomic features in deep neural networks. PLoS Comput Biol 2021; 17:e1008925. [PMID: 33983921 PMCID: PMC8118286 DOI: 10.1371/journal.pcbi.1008925] [Citation(s) in RCA: 40] [Impact Index Per Article: 13.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2020] [Accepted: 03/30/2021] [Indexed: 12/15/2022] Open
Abstract
Deep neural networks have demonstrated improved performance at predicting the sequence specificities of DNA- and RNA-binding proteins compared to previous methods that rely on k-mers and position weight matrices. To gain insights into why a DNN makes a given prediction, model interpretability methods, such as attribution methods, can be employed to identify motif-like representations along a given sequence. Because explanations are given on an individual sequence basis and can vary substantially across sequences, deducing generalizable trends across the dataset and quantifying their effect size remains a challenge. Here we introduce global importance analysis (GIA), a model interpretability method that quantifies the population-level effect size that putative patterns have on model predictions. GIA provides an avenue to quantitatively test hypotheses of putative patterns and their interactions with other patterns, as well as map out specific functions the network has learned. As a case study, we demonstrate the utility of GIA on the computational task of predicting RNA-protein interactions from sequence. We first introduce a convolutional network, which we call ResidualBind, and benchmark its performance against previous methods on RNAcompete data. Using GIA, we then demonstrate that in addition to sequence motifs, ResidualBind learns a model that considers the number of motifs, their spacing, and sequence context, such as RNA secondary structure and GC-bias.
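The population-level effect size described in this abstract can be illustrated with a minimal sketch: embed a putative pattern into a set of background sequences and average the change in model output. Everything below is an assumption for illustration and not the authors' code: `toy_model` is a hypothetical stand-in for a trained network, the backgrounds are sampled uniformly at random, and the function names are invented here.

```python
import random

ALPHABET = "ACGU"

def random_background(n_seqs, length, rng):
    """Sample background sequences uniformly (an assumed null model)."""
    return ["".join(rng.choice(ALPHABET) for _ in range(length))
            for _ in range(n_seqs)]

def embed(seq, pattern, position):
    """Overwrite part of seq with pattern, starting at position."""
    return seq[:position] + pattern + seq[position + len(pattern):]

def global_importance(model, backgrounds, pattern, position):
    """Mean change in model output when pattern is embedded in each background."""
    deltas = [model(embed(s, pattern, position)) - model(s) for s in backgrounds]
    return sum(deltas) / len(deltas)

def toy_model(seq):
    """Hypothetical stand-in for a trained network: counts UGCAUG occurrences."""
    motif = "UGCAUG"
    return float(sum(seq.startswith(motif, i) for i in range(len(seq))))

rng = random.Random(0)
backgrounds = random_background(200, 40, rng)
effect_motif = global_importance(toy_model, backgrounds, "UGCAUG", 10)
effect_control = global_importance(toy_model, backgrounds, "AAAAAA", 10)
print(effect_motif, effect_control)
```

Averaging over many backgrounds is what separates this kind of estimate from a per-sequence attribution map: the effect of the surrounding context is marginalized out, so the number reflects the pattern itself. With a real model, the choice of background distribution matters and should be stated alongside the result.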
Affiliation(s)
- Peter K. Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
| | - Antonio Majdandzic
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
| | - Matthew Ploenzke
- Department of Biostatistics, Harvard University, Cambridge, Massachusetts, United States of America
| | - Praveen Anand
- Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
| | - Steffan B. Paul
- Bioinformatics and Integrative Genomics Program, Harvard Medical School, Boston, Massachusetts, United States of America
| |
|
30
|
Abstract
Motivation The universal expressibility assumption of Deep Neural Networks (DNNs) is the key motivation behind recent works in the systems biology community to employ DNNs to solve important problems in functional genomics and molecular genetics. Typically, such investigations have taken a ‘black box’ approach in which the internal structure of the model used is set purely by machine learning considerations with little consideration of representing the internal structure of the biological system by the mathematical structure of the DNN. DNNs have not yet been applied to the detailed modeling of transcriptional control in which mRNA production is controlled by the binding of specific transcription factors to DNA, in part because such models are in part formulated in terms of specific chemical equations that appear different in form from those used in neural networks. Results In this paper, we give an example of a DNN which can model the detailed control of transcription in a precise and predictive manner. Its internal structure is fully interpretable and is faithful to underlying chemistry of transcription factor binding to DNA. We derive our DNN from a systems biology model that was not previously recognized as having a DNN structure. Although we apply our DNN to data from the early embryo of the fruit fly Drosophila, this system serves as a test bed for analysis of much larger datasets obtained by systems biology studies on a genomic scale. Availability and implementation The implementation and data for the models used in this paper are in a zip file in the supplementary material. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yi Liu
- Department of Statistics, Ecology and Evolution, Molecular Genetics & Cell Biology, Institute of Genomics and Systems Biology, University of Chicago, Chicago, IL 60637, USA
| | - Kenneth Barr
- Department of Human Genetics, Ecology and Evolution, Molecular Genetics & Cell Biology, Institute of Genomics and Systems Biology, University of Chicago, Chicago, IL 60637, USA
| | - John Reinitz
- Departments of Statistics, Ecology and Evolution, Molecular Genetics & Cell Biology, Institute of Genomics and Systems Biology, University of Chicago, Chicago, IL 60637, USA
| |
|
31
|
Bartoszewicz JM, Seidel A, Renard BY. Interpretable detection of novel human viruses from genome sequencing data. NAR Genom Bioinform 2021; 3:lqab004. [PMID: 33554119 PMCID: PMC7849996 DOI: 10.1093/nargab/lqab004] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 10/12/2020] [Revised: 01/04/2021] [Accepted: 01/15/2021] [Indexed: 01/21/2023] Open
Abstract
Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can be applied also to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused a COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages not only enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.
Affiliation(s)
- Jakub M Bartoszewicz
- Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
- Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
- Data Analytics and Computational Statistics, Hasso Plattner Institute for Digital Engineering, 14482 Potsdam, Brandenburg, Germany
- Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Brandenburg, Germany
| | - Anja Seidel
- Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
- Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
| | - Bernhard Y Renard
- Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
- Data Analytics and Computational Statistics, Hasso Plattner Institute for Digital Engineering, 14482 Potsdam, Brandenburg, Germany
- Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Brandenburg, Germany
| |
|
32
|
Avsec Ž, Weilert M, Shrikumar A, Krueger S, Alexandari A, Dalal K, Fropf R, McAnany C, Gagneur J, Kundaje A, Zeitlinger J. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet 2021; 53:354-366. [PMID: 33603233 PMCID: PMC8812996 DOI: 10.1038/s41588-021-00782-6] [Citation(s) in RCA: 262] [Impact Index Per Article: 87.3] [Received: 07/19/2020] [Accepted: 01/07/2021] [Indexed: 01/30/2023]
Abstract
The arrangement (syntax) of transcription factor (TF) binding motifs is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution chromatin immunoprecipitation (ChIP)-nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using clustered regularly interspaced short palindromic repeat (CRISPR)-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data.
Affiliation(s)
- Žiga Avsec
- Department of Informatics, Technical University of Munich, Garching, Germany; Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians-Universität München, Munich, Germany; currently at DeepMind, London, UK
| | - Melanie Weilert
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Avanti Shrikumar
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Sabrina Krueger
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Amr Alexandari
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Khyati Dalal
- Stowers Institute for Medical Research, Kansas City, MO, USA; The University of Kansas Medical Center, Kansas City, KS, USA
| | - Robin Fropf
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Charles McAnany
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Julien Gagneur
- Department of Informatics, Technical University of Munich, Garching, Germany
| | - Anshul Kundaje
- Department of Computer Science, Stanford University, Stanford, CA, USA; Department of Genetics, Stanford University, Stanford, CA, USA; corresponding author
| | - Julia Zeitlinger
- Stowers Institute for Medical Research, Kansas City, MO, USA; The University of Kansas Medical Center, Kansas City, KS, USA; corresponding author
| |
|
33
|
Teo YYA, Danilevsky A, Shomron N. Overcoming Interpretability in Deep Learning Cancer Classification. Methods Mol Biol 2021; 2243:297-309. [PMID: 33606264 DOI: 10.1007/978-1-0716-1103-6_15] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/12/2022]
Abstract
Since its inception, deep learning has revolutionized the field of machine learning and data-driven science. One such data-driven science to be transformed by deep learning is genomics. In the past decade, numerous genomics studies have adopted deep learning, and its applications range from predicting regulatory elements to cancer classification. Despite its dominating efficacy in these applications, deep learning is not without drawbacks. A prominent shortcoming of deep learning is the lack of interpretability. Hence, the main objective of this study is to address this obstacle in deep learning cancer classification. Here we adopt a feature importance scoring methodology (gradient-weighted class activation mapping, or Grad-CAM) on a quasi-recurrent neural network model that classifies cancer based on FASTA sequencing data. In this study, we formulate a nucleotide-to-genomic-region Grad-CAM scoring methodology and validate the use of this methodology for the chosen model. Consequently, this allows for the utilization of the Grad-CAM scoring methodology for feature importance in deep learning cancer classification. The results from our study identify potential novel candidate genes, genomic elements, and mechanisms for future cancer research.
Affiliation(s)
| | | | - Noam Shomron
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel.
| |
|
34
|
Tack DS, Romantseva EF, Tonner PD, Pressman A, Rammohan J, Strychalski EA. Measurements drive progress in directed evolution for precise engineering of biological systems. Current Opinion in Systems Biology 2020; 23:32-37. [PMID: 34611570 PMCID: PMC8489032 DOI: 10.1016/j.coisb.2020.09.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Precise engineering of biological systems requires quantitative, high-throughput measurements, exemplified by progress in directed evolution. New approaches allow high-throughput measurements of phenotypes and their corresponding genotypes. When integrated into directed evolution, these quantitative approaches enable the precise engineering of biological function. At the same time, the increasingly routine availability of large, high-quality data sets supports the integration of machine learning with directed evolution. Together, these advances herald striking capabilities for engineering biology.
Affiliation(s)
- Drew S Tack
- National Institute of Standards and Technology, Gaithersburg, MD, 20898, USA
| | | | - Peter D Tonner
- National Institute of Standards and Technology, Gaithersburg, MD, 20898, USA
| | - Abe Pressman
- National Institute of Standards and Technology, Gaithersburg, MD, 20898, USA
| | - Jayan Rammohan
- National Institute of Standards and Technology, Gaithersburg, MD, 20898, USA
| | | |
|
35
|
Bartoszewicz JM, Seidel A, Rentzsch R, Renard BY. DeePaC: predicting pathogenic potential of novel DNA with reverse-complement neural networks. Bioinformatics 2020; 36:81-89. [PMID: 31298694 DOI: 10.1093/bioinformatics/btz541] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 04/09/2019] [Revised: 06/22/2019] [Accepted: 07/10/2019] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION We expect novel pathogens to arise due to their fast-paced evolution, and new species to be discovered thanks to advances in DNA sequencing and metagenomics. Moreover, recent developments in synthetic biology raise concerns that some strains of bacteria could be modified for malicious purposes. Traditional approaches to open-view pathogen detection depend on databases of known organisms, which limits their performance on unknown, unrecognized and unmapped sequences. In contrast, machine learning methods can infer pathogenic phenotypes from single NGS reads, even though the biological context is unavailable. RESULTS We present DeePaC, a Deep Learning Approach to Pathogenicity Classification. It includes a flexible framework allowing easy evaluation of neural architectures with reverse-complement parameter sharing. We show that convolutional neural networks and LSTMs outperform the state-of-the-art based on both sequence homology and machine learning. Combining a deep learning approach with integrating the predictions for both mates in a read pair results in cutting the error rate almost in half in comparison to the previous state-of-the-art. AVAILABILITY AND IMPLEMENTATION The code and the models are available at: https://gitlab.com/rki_bioinformatics/DeePaC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
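The idea of reverse-complement parameter sharing mentioned in this abstract can be sketched in a few lines: the same filter weights are applied to a read and to its reverse complement, so the resulting score does not depend on which strand was sequenced. This is a simplified illustration under stated assumptions, not DeePaC's architecture: the single hand-written filter `weights`, the max-pooling step, and all function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def scan(seq, weights):
    """Slide one filter along seq; weights holds one base-to-weight dict per position."""
    k = len(weights)
    return [sum(weights[j][seq[i + j]] for j in range(k))
            for i in range(len(seq) - k + 1)]

def rc_shared_score(seq, weights):
    """Maximum activation of the same filter over both strands."""
    return max(scan(seq, weights) + scan(reverse_complement(seq), weights))

# Hypothetical filter that prefers the 3-mer "GAT".
weights = [
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
]

read = "CCATCGG"
forward_only = max(scan(read, weights))
shared = rc_shared_score(read, weights)
print(forward_only, shared)
```

In this toy case the motif occurs only on the opposite strand, so a forward-only scan underestimates the match while the shared scan finds it. Because the weights are reused rather than duplicated, strand invariance comes without extra parameters.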
Affiliation(s)
- Jakub M Bartoszewicz
- Bioinformatics Unit (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
- Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
| | - Anja Seidel
- Bioinformatics Unit (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
- Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
| | - Robert Rentzsch
- Bioinformatics Unit (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
| | - Bernhard Y Renard
- Bioinformatics Unit (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
| |
|
36
|
Linder J, Bogard N, Rosenberg AB, Seelig G. A Generative Neural Network for Maximizing Fitness and Diversity of Synthetic DNA and Protein Sequences. Cell Syst 2020; 11:49-62.e16. [PMID: 32711843 PMCID: PMC8694568 DOI: 10.1016/j.cels.2020.05.007] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Received: 10/07/2019] [Revised: 04/06/2020] [Accepted: 05/19/2020] [Indexed: 11/29/2022]
Abstract
Engineering gene and protein sequences with defined functional properties is a major goal of synthetic biology. Deep neural network models, together with gradient ascent-style optimization, show promise for sequence design. The generated sequences can however get stuck in local minima and often have low diversity. Here, we develop deep exploration networks (DENs), a class of activation-maximizing generative models, which minimize the cost of a neural network fitness predictor by gradient descent. By penalizing any two generated patterns on the basis of a similarity metric, DENs explicitly maximize sequence diversity. To avoid drifting into low-confidence regions of the predictor, we incorporate variational autoencoders to maintain the likelihood ratio of generated sequences. Using DENs, we engineered polyadenylation signals with more than 10-fold higher selection odds than the best gradient ascent-generated patterns, identified splice regulatory sequences predicted to result in highly differential splicing between cell lines, and improved on state-of-the-art results for protein design tasks.
Affiliation(s)
- Johannes Linder
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA.
| | - Nicholas Bogard
- Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195, USA
| | - Alexander B Rosenberg
- Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195, USA
| | - Georg Seelig
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA; Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195, USA
| |
|
37
|
Kopp W, Monti R, Tamburrini A, Ohler U, Akalin A. Deep learning for genomics using Janggu. Nat Commun 2020; 11:3488. [PMID: 32661261 PMCID: PMC7359359 DOI: 10.1038/s41467-020-17155-y] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 10/23/2019] [Accepted: 06/12/2020] [Indexed: 11/09/2022] Open
Abstract
In recent years, numerous applications have demonstrated the potential of deep learning for an improved understanding of biological processes. However, most deep learning tools developed so far are designed to address a specific question on a fixed dataset and/or by a fixed model architecture. Here we present Janggu, a Python library that facilitates deep learning for genomics applications, aiming to ease data acquisition and model evaluation. Among its key features are special dataset objects, which form a unified and flexible data acquisition and pre-processing framework for genomics data that enables streamlining of future research applications through reusable components. Through a numpy-like interface, these dataset objects are directly compatible with popular deep learning libraries, including keras or pytorch. Janggu offers the possibility to visualize predictions as genomic tracks or by exporting them to the bigWig format as well as utilities for keras-based models. We illustrate the functionality of Janggu on several deep learning genomics applications. First, we evaluate different model topologies for the task of predicting binding sites for the transcription factor JunD. Second, we demonstrate the framework on published models for predicting chromatin effects. Third, we show that promoter usage measured by CAGE can be predicted using DNase hypersensitivity, histone modifications and DNA sequence features. We improve the performance of these models due to a novel feature in Janggu that allows us to include high-order sequence features. We believe that Janggu will help to significantly reduce repetitive programming overhead for deep learning applications in genomics, and will enable computational biologists to rapidly assess biological hypotheses.
Affiliation(s)
- Wolfgang Kopp
- Berlin Institute for Medical Systems Biology, Max Delbrueck Center for Molecular Medicine, 10115, Berlin, Germany.
| | - Remo Monti
- Berlin Institute for Medical Systems Biology, Max Delbrueck Center for Molecular Medicine, 10115, Berlin, Germany; Digital Health Machine Learning, Hasso Plattner Institute, University of Potsdam, 14482, Potsdam, Germany
| | - Annalaura Tamburrini
- Berlin Institute for Medical Systems Biology, Max Delbrueck Center for Molecular Medicine, 10115, Berlin, Germany; Department of Biology, Centro di Bioinformatica Molecolare, University of Rome 'Tor Vergata', 00133, Rome, Italy
| | - Uwe Ohler
- Berlin Institute for Medical Systems Biology, Max Delbrueck Center for Molecular Medicine, 10115, Berlin, Germany; Department of Biology, Humboldt University, 10115, Berlin, Germany
| | - Altuna Akalin
- Berlin Institute for Medical Systems Biology, Max Delbrueck Center for Molecular Medicine, 10115, Berlin, Germany.
| |
|
38
|
Pratt H, Weng Z. LogoJS: a Javascript package for creating sequence logos and embedding them in web applications. Bioinformatics 2020; 36:3573-3575. [PMID: 32181813 PMCID: PMC7267833 DOI: 10.1093/bioinformatics/btaa192] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 10/18/2019] [Revised: 02/13/2020] [Accepted: 03/13/2020] [Indexed: 11/14/2022] Open
Abstract
SUMMARY Sequence logos were introduced nearly 30 years ago as a human-readable format for representing consensus sequences, and they remain widely used. As new experimental and computational techniques have developed, logos have been extended: extra symbols represent covalent modifications to nucleotides, logos with multiple letters at each position illustrate models with multi-nucleotide features and symbols extending below the x-axis may represent a binding energy penalty for a residue or a negative weight output from a neural network. Web-based visualization tools for genomic data are increasingly taking advantage of modern web technology to offer dynamic, interactive figures to users, but support for sequence logos remains limited. Here, we present LogoJS, a Javascript package for rendering customizable, interactive, vector-graphic sequence logos and embedding them in web applications. LogoJS supports all the aforementioned logo extensions and is bundled with a companion web application for creating and sharing logos. AVAILABILITY AND IMPLEMENTATION LogoJS is implemented both in plain Javascript and ReactJS, a popular user-interface framework. The web application is hosted at logojs.wenglab.org. All major browsers and operating systems are supported. The package and application are open-source; code is available at GitHub. CONTACT zhiping.weng@umassmed.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
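For readers unfamiliar with what a classic sequence logo encodes, the following sketch computes the letter heights for a small DNA alignment: each column's information content in bits (log2 of the alphabet size minus the column's Shannon entropy), scaled by each letter's frequency. It is written in Python rather than with LogoJS, omits the small-sample correction, and uses made-up example sites; the function name `logo_heights` is hypothetical.

```python
import math
from collections import Counter

def logo_heights(aligned, alphabet="ACGT"):
    """Per-column letter heights in bits for equal-length aligned sequences."""
    columns = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column)
        freqs = {b: counts.get(b, 0) / total for b in alphabet}
        # Shannon entropy of the column; zero-frequency letters contribute nothing.
        entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
        info = math.log2(len(alphabet)) - entropy
        columns.append({b: p * info for b, p in freqs.items()})
    return columns

# Made-up aligned sites, for illustration only.
sites = ["TATA", "TATT", "TACA", "TATA"]
heights = logo_heights(sites)
for position, column in enumerate(heights):
    print(position, {b: round(h, 3) for b, h in column.items() if h > 0})
```

A fully conserved DNA column reaches the maximum of 2 bits, while a column with mixed letters is shorter overall and splits its height in proportion to letter frequency. The extensions listed in the abstract, such as symbols below the x-axis for negative weights, go beyond this basic calculation.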
Affiliation(s)
- Henry Pratt
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01605, USA
| | - Zhiping Weng
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01605, USA
| |
|
39
|
Wang H, Cimen E, Singh N, Buckler E. Deep learning for plant genomics and crop improvement. Current Opinion in Plant Biology 2020; 54:34-41. [PMID: 31986354 DOI: 10.1016/j.pbi.2019.12.010] [Citation(s) in RCA: 69] [Impact Index Per Article: 17.3] [Received: 08/01/2019] [Revised: 11/28/2019] [Accepted: 12/18/2019] [Indexed: 05/26/2023]
Abstract
Our era has witnessed tremendous advances in plant genomics, characterized by an explosion of high-throughput techniques to identify multi-dimensional genome-wide molecular phenotypes at low costs. More importantly, genomics is not merely acquiring molecular phenotypes, but also leveraging powerful data mining tools to predict and explain them. In recent years, deep learning has been found extremely effective in these tasks. This review highlights two prominent questions at the intersection of genomics and deep learning: 1) how can the flow of information from genomic DNA sequences to molecular phenotypes be modeled; 2) how can we identify functional variants in natural populations using deep learning models? Additionally, we discuss the possibility of unleashing the power of deep learning in synthetic biology to create novel genomic elements with desirable functions. Taken together, we propose a central role of deep learning in future plant genomics research and crop genetic improvement.
Affiliation(s)
- Hai Wang
- National Maize Improvement Center, Key Laboratory of Crop Heterosis and Utilization, Joint Laboratory for International Cooperation in Crop Molecular Breeding, China Agricultural University, Beijing 100193, China; Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA; Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China.
| | - Emre Cimen
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA; Computational Intelligence and Optimization Laboratory, Industrial Engineering Department, Eskisehir Technical University, Eskisehir 26000, Turkey
| | - Nisha Singh
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA; ICAR-National Institute for Plant Biotechnology, New Delhi 110012, India
| | - Edward Buckler
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA; United States Department of Agriculture, Agricultural Research Service, Ithaca, NY 14853, USA
| |
|
40
|
Xiang G, Keller CA, Heuston E, Giardine BM, An L, Wixom AQ, Miller A, Cockburn A, Sauria MEG, Weaver K, Lichtenberg J, Göttgens B, Li Q, Bodine D, Mahony S, Taylor J, Blobel GA, Weiss MJ, Cheng Y, Yue F, Hughes J, Higgs DR, Zhang Y, Hardison RC. An integrative view of the regulatory and transcriptional landscapes in mouse hematopoiesis. Genome Res 2020; 30:472-484. [PMID: 32132109 PMCID: PMC7111515 DOI: 10.1101/gr.255760.119] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 08/09/2019] [Accepted: 02/21/2020] [Indexed: 01/29/2023]
Abstract
Thousands of epigenomic data sets have been generated in the past decade, but it is difficult for researchers to effectively use all the data relevant to their projects. Systematic integrative analysis can help meet this need, and the VISION project was established for validated systematic integration of epigenomic data in hematopoiesis. Here, we systematically integrated extensive data recording epigenetic features and transcriptomes from many sources, including individual laboratories and consortia, to produce a comprehensive view of the regulatory landscape of differentiating hematopoietic cell types in mouse. By using IDEAS as our integrative and discriminative epigenome annotation system, we identified and assigned epigenetic states simultaneously along chromosomes and across cell types, precisely and comprehensively. Combining nuclease accessibility and epigenetic states produced a set of more than 200,000 candidate cis-regulatory elements (cCREs) that efficiently capture enhancers and promoters. The transitions in epigenetic states of these cCREs across cell types provided insights into mechanisms of regulation, including decreases in numbers of active cCREs during differentiation of most lineages, transitions from poised to active or inactive states, and shifts in nuclease accessibility of CTCF-bound elements. Regression modeling of epigenetic states at cCREs and gene expression produced a versatile resource to improve selection of cCREs potentially regulating target genes. These resources are available from our VISION website to aid research in genomics and hematopoiesis.
Affiliation(s)
- Guanjue Xiang
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Cheryl A Keller
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Elisabeth Heuston
- NHGRI Hematopoiesis Section, Genetics and Molecular Biology Branch, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Belinda M Giardine
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Lin An
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Alexander Q Wixom
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Amber Miller
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - April Cockburn
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Michael E G Sauria
- Departments of Biology and Computer Science, Johns Hopkins University, Baltimore, Maryland 20218, USA
| | - Kathryn Weaver
- Departments of Biology and Computer Science, Johns Hopkins University, Baltimore, Maryland 20218, USA
| | - Jens Lichtenberg
- NHGRI Hematopoiesis Section, Genetics and Molecular Biology Branch, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Berthold Göttgens
- Wellcome and MRC Cambridge Stem Cell Institute, University of Cambridge, Cambridge CB2 1TN, United Kingdom
| | - Qunhua Li
- Department of Statistics, Program in Bioinformatics and Genomics, Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - David Bodine
- NHGRI Hematopoiesis Section, Genetics and Molecular Biology Branch, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Shaun Mahony
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - James Taylor
- Departments of Biology and Computer Science, Johns Hopkins University, Baltimore, Maryland 20218, USA
| | - Gerd A Blobel
- Department of Pediatrics, Children's Hospital of Philadelphia and University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania 19104, USA
| | - Mitchell J Weiss
- Department of Hematology, St. Jude Children's Research Hospital, Memphis, Tennessee 38105, USA
| | - Yong Cheng
- Department of Hematology, St. Jude Children's Research Hospital, Memphis, Tennessee 38105, USA
| | - Feng Yue
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
| | - Jim Hughes
- MRC Weatherall Institute of Molecular Medicine, Oxford University, Oxford OX3 9DS, United Kingdom
| | - Douglas R Higgs
- MRC Weatherall Institute of Molecular Medicine, Oxford University, Oxford OX3 9DS, United Kingdom
| | - Yu Zhang
- Department of Statistics, Program in Bioinformatics and Genomics, Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Ross C Hardison
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| |
41
Koo PK, Ploenzke M. Deep learning for inferring transcription factor binding sites. Curr Opin Syst Biol 2020; 19:16-23. [PMID: 32905524 PMCID: PMC7469942 DOI: 10.1016/j.coisb.2020.04.001] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Indexed: 01/28/2023]
Abstract
Deep learning is a powerful tool for predicting transcription factor binding sites from DNA sequence. Despite the high predictive accuracy of deep learning models, there is no guarantee that a high-performing model will learn causal sequence-function relationships. Thus, a move beyond performance comparisons on benchmark datasets is needed. Interpreting model predictions is a powerful approach to identify which features drive performance gains and, ideally, to provide insight into the underlying biological mechanisms. Here we highlight timely advances in deep learning for genomics, with a focus on inferring transcription factor binding sites. We describe recent applications, model architectures, and advances in local and global model interpretability methods, then conclude with a discussion of future research directions.
Affiliation(s)
- Peter K Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Matt Ploenzke
  - Department of Biostatistics, Harvard University, Cambridge, MA, USA

42
Penzar DD, Zinkevich AO, Vorontsov IE, Sitnik VV, Favorov AV, Makeev VJ, Kulakovskiy IV. What Do Neighbors Tell About You: The Local Context of Cis-Regulatory Modules Complicates Prediction of Regulatory Variants. Front Genet 2019; 10:1078. [PMID: 31737053 PMCID: PMC6834773 DOI: 10.3389/fgene.2019.01078] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/15/2019] [Accepted: 10/09/2019] [Indexed: 02/05/2023] Open
Abstract
Many problems of modern genetics and functional genomics require the assessment of functional effects of sequence variants, including gene expression changes. Machine learning is considered to be a promising approach for solving this task, but its practical applications remain a challenge due to the insufficient volume and diversity of training data. A promising source of valuable data is a saturation mutagenesis massively parallel reporter assay, which quantitatively measures changes in transcription activity caused by sequence variants. Here, we explore the computational predictions of the effects of individual single-nucleotide variants on gene transcription measured in the massively parallel reporter assays, based on the data from the recent "Regulation Saturation" Critical Assessment of Genome Interpretation challenge. We show that the estimated prediction quality strongly depends on the structure of the training and validation data. Particularly, training on the sequence segments located next to the validation data results in the "information leakage" caused by the local context. This information leakage allows reproducing the prediction quality of the best CAGI challenge submissions with a fairly simple machine learning approach, and even obtaining notably better-than-random predictions using irrelevant genomic regions. Validation scenarios preventing such information leakage dramatically reduce the measured prediction quality. The performance at independent regulatory regions entirely excluded from the training set appears to be much lower than needed for practical applications, and even the performance estimation will become reliable only in the future with richer data from multiple reporters. The source code and data are available at https://bitbucket.org/autosomeru_cagi2018/cagi2018_regsat and https://genomeinterpretation.org/content/expression-variants.
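The leakage effect described in this abstract comes down to how variants are assigned to training and validation sets. A minimal sketch, assuming hypothetical variant records tagged with a regulatory-region name (the record layout, region names, and helper functions are illustrative and are not taken from the authors' repository), contrasts a per-variant random split with a split that holds out whole regions:

```python
import random

def random_split(variants, val_fraction=0.25, seed=0):
    """Per-variant split: neighbours from one region can land on both sides."""
    rng = random.Random(seed)
    shuffled = variants[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def split_by_region(variants, held_out_regions):
    """Region-level split: every variant of a held-out region goes to validation."""
    train = [v for v in variants if v["region"] not in held_out_regions]
    val = [v for v in variants if v["region"] in held_out_regions]
    return train, val

def shared_regions(train, val):
    """Regions contributing variants to both sets (a potential leakage path)."""
    return {v["region"] for v in train} & {v["region"] for v in val}

# Hypothetical toy data: 4 regions with 50 variant positions each.
variants = [{"region": f"region_{r}", "pos": p} for r in range(4) for p in range(50)]

train_r, val_r = random_split(variants)
train_g, val_g = split_by_region(variants, {"region_3"})

print("regions shared under random split:", len(shared_regions(train_r, val_r)))
print("regions shared under region split:", len(shared_regions(train_g, val_g)))
```

Holding out whole regions removes the shared local context between the two sets, which is the kind of validation scenario the abstract reports as giving much lower, but more realistic, performance estimates.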
Affiliation(s)
- Dmitry D. Penzar
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
  - Department of Medical and Biological Physics, Moscow Institute of Physics and Technology (State University), Dolgoprudny, Russia
- Arsenii O. Zinkevich
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
- Ilya E. Vorontsov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- Vasily V. Sitnik
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- Alexander V. Favorov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, The Johns Hopkins University School of Medicine, Baltimore, MD, United States
- Vsevolod J. Makeev
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
  - Department of Medical and Biological Physics, Moscow Institute of Physics and Technology (State University), Dolgoprudny, Russia
  - Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia
- Ivan V. Kulakovskiy
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
  - Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia
  - Institute of Mathematical Problems of Biology RAS - the Branch of Keldysh Institute of Applied Mathematics of Russian Academy of Sciences, Pushchino, Russia

43
Eraslan G, Avsec Ž, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet 2019; 20:389-403. [PMID: 30971806 DOI: 10.1038/s41576-019-0122-6] [Citation(s) in RCA: 526] [Impact Index Per Article: 105.2] [Indexed: 12/17/2022]
Abstract
As a data-driven science, genomics largely utilizes machine learning to capture dependencies in data and derive novel biological hypotheses. However, the ability to extract new insights from the exponentially increasing volume of genomics data requires more expressive machine learning models. By effectively leveraging large data sets, deep learning has transformed fields such as computer vision and natural language processing. Now, it is becoming the method of choice for many genomics modelling tasks, including predicting the impact of genetic variation on gene regulatory mechanisms such as DNA accessibility and splicing.
Affiliation(s)
- Gökcen Eraslan
  - Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany
  - School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
- Žiga Avsec
  - Department of Informatics, Technical University of Munich, Garching, Germany
- Julien Gagneur
  - Department of Informatics, Technical University of Munich, Garching, Germany
- Fabian J Theis
  - Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany
  - School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
  - Department of Mathematics, Technical University of Munich, Garching, Germany

44
Liu G, Zeng H, Gifford DK. Visualizing complex feature interactions and feature sharing in genomic deep neural networks. BMC Bioinformatics 2019; 20:401. [PMID: 31324140 PMCID: PMC6642501 DOI: 10.1186/s12859-019-2957-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 11/14/2018] [Accepted: 06/18/2019] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND: Visualization tools for deep learning models typically focus on discovering key input features without considering how such low level features are combined in intermediate layers to make decisions. Moreover, many of these methods examine a network's response to specific input examples that may be insufficient to reveal the complexity of model decision making. RESULTS: We present DeepResolve, an analysis framework for deep convolutional models of genome function that visualizes how input features contribute individually and combinatorially to network decisions. Unlike other methods, DeepResolve does not depend upon the analysis of a predefined set of inputs. Rather, it uses gradient ascent to stochastically explore intermediate feature maps to 1) discover important features, 2) visualize their contribution and interaction patterns, and 3) analyze feature sharing across tasks that suggests shared biological mechanism. We demonstrate the visualization of decision making using our proposed method on deep neural networks trained on both experimental and synthetic data. DeepResolve is competitive with existing visualization tools in discovering key sequence features, and identifies certain negative features and non-additive feature interactions that are not easily observed with existing tools. It also recovers similarities between poorly correlated classes which are not observed by traditional methods. DeepResolve reveals that DeepSEA's learned decision structure is shared across genome annotations including histone marks, DNase hypersensitivity, and transcription factor binding. We identify groups of TFs that suggest known shared biological mechanism, and recover correlation between DNA hypersensitivities and TF/Chromatin marks. CONCLUSIONS: DeepResolve is capable of visualizing complex feature contribution patterns and feature interactions that contribute to decision making in genomic deep convolutional networks. It also recovers feature sharing and class similarities which suggest interesting biological mechanisms. DeepResolve is compatible with existing visualization tools and provides complementary insights.
Affiliation(s)
- Ge Liu
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Haoyang Zeng
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- David K Gifford
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

45
Movva R, Greenside P, Marinov GK, Nair S, Shrikumar A, Kundaje A. Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PLoS One 2019; 14:e0218073. [PMID: 31206543 PMCID: PMC6576758 DOI: 10.1371/journal.pone.0218073] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 08/21/2018] [Accepted: 05/24/2019] [Indexed: 11/19/2022] Open
Abstract
The relationship between noncoding DNA sequence and gene expression is not well-understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.
Affiliation(s)
- Rajiv Movva
  - The Harker School, San Jose, CA, United States of America
  - Department of Genetics, Stanford University, Stanford, CA, United States of America
- Peyton Greenside
  - Biomedical Informatics Training Program, Stanford University, Stanford, CA, United States of America
- Georgi K. Marinov
  - Department of Genetics, Stanford University, Stanford, CA, United States of America
- Surag Nair
  - Department of Computer Science, Stanford University, Stanford, CA, United States of America
- Avanti Shrikumar
  - Department of Computer Science, Stanford University, Stanford, CA, United States of America
- Anshul Kundaje
  - Department of Genetics, Stanford University, Stanford, CA, United States of America
  - Department of Computer Science, Stanford University, Stanford, CA, United States of America

46
Lai X, Stigliani A, Vachon G, Carles C, Smaczniak C, Zubieta C, Kaufmann K, Parcy F. Building Transcription Factor Binding Site Models to Understand Gene Regulation in Plants. Mol Plant 2019; 12:743-763. [PMID: 30447332 DOI: 10.1016/j.molp.2018.10.010] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Received: 06/29/2018] [Revised: 09/20/2018] [Accepted: 10/30/2018] [Indexed: 06/09/2023]
Abstract
Transcription factors (TFs) are key cellular components that control gene expression. They recognize specific DNA sequences, the TF binding sites (TFBSs), and thus are targeted to specific regions of the genome where they can recruit transcriptional co-factors and/or chromatin regulators to fine-tune spatiotemporal gene regulation. Therefore, the identification of TFBSs in genomic sequences and their subsequent quantitative modeling is of crucial importance for understanding and predicting gene expression. Here, we review how TFBSs can be determined experimentally, how the TFBS models can be constructed in silico, and how they can be optimized by taking into account features such as position interdependence within TFBSs, DNA shape, and/or by introducing state-of-the-art computational algorithms such as deep learning methods. In addition, we discuss the integration of context variables into the TFBS modeling, including nucleosome positioning, chromatin states, methylation patterns, 3D genome architectures, and TF cooperative binding, in order to better predict TF binding under cellular contexts. Finally, we explore the possibilities of combining the optimized TFBS model with technological advances, such as targeted TFBS perturbation by CRISPR, to better understand gene regulation, evolution, and plant diversity.
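A common baseline for the in silico TFBS models discussed in this review is the position weight matrix (PWM), which treats positions within a site as independent; that independence assumption is what models with position interdependence relax. A minimal sketch, assuming a hypothetical set of aligned binding sites, a uniform background frequency of 0.25, and a pseudocount of 1 (none of these values come from the review):

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds matrix (one dict per position) from aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def score(pwm, window):
    """Sum per-position log-odds; positions are assumed independent."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the best window; sequence must be at least PWM-length."""
    width = len(pwm)
    hits = [(score(pwm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)

# Hypothetical aligned sites, for illustration only.
sites = ["CACGTG", "CACGTG", "CACGTT", "CATGTG"]
pwm = build_pwm(sites)
print(best_hit(pwm, "TTTCACGTGAAA"))
```

The sketch only handles the bases A, C, G, and T on one strand; scanning the reverse complement and calibrating a score threshold would be needed before using such a model on genomic sequence.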
Affiliation(s)
- Xuelei Lai
  - CNRS, Univ. Grenoble Alpes, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
- Arnaud Stigliani
  - CNRS, Univ. Grenoble Alpes, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
- Gilles Vachon
  - CNRS, Univ. Grenoble Alpes, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
- Cristel Carles
  - CNRS, Univ. Grenoble Alpes, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
- Cezary Smaczniak
  - Department for Plant Cell and Molecular Biology, Institute for Biology, Humboldt-Universität zu Berlin, Berlin, Germany
- Chloe Zubieta
  - CNRS, Univ. Grenoble Alpes, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
- Kerstin Kaufmann
  - Department for Plant Cell and Molecular Biology, Institute for Biology, Humboldt-Universität zu Berlin, Berlin, Germany
- François Parcy
  - CNRS, Univ. Grenoble Alpes, CEA, INRA, BIG-LPCV, 38000 Grenoble, France