1
Kim S, Yuan JB, Woods WS, Newton DA, Perez-Pinera P, Song JS. Chromatin structure and context-dependent sequence features control prime editing efficiency. Front Genet 2023; 14:1222112. [PMID: 37456665 PMCID: PMC10344898 DOI: 10.3389/fgene.2023.1222112] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/13/2023] [Accepted: 06/16/2023] [Indexed: 07/18/2023] Open
Abstract
Prime editing (PE) is a highly versatile CRISPR-Cas9 genome editing technique. The current constructs, however, have variable efficiency and may require laborious experimental optimization. This study presents statistical models for learning the salient epigenomic and sequence features of target sites modulating the editing efficiency and provides guidelines for designing optimal PEs. We found that both regional constitutive heterochromatin and local nucleosome occlusion of target sites impede editing, while position-specific G/C nucleotides in the primer-binding site (PBS) and reverse transcription (RT) template regions of PE guide RNA (pegRNA) yield high editing efficiency, especially for short PBS designs. The presence of G/C nucleotides was most critical immediately 5' to the protospacer adjacent motif (PAM) site for all designs. The effects of different last templated nucleotides were quantified and observed to depend on the length of both PBS and RT templates. Our models found AGG to be the preferred PAM and detected a guanine nucleotide four bases downstream of the PAM to facilitate editing, suggesting a hitherto-unrecognized interaction with Cas9. A neural network interpretation method based on nonextensive statistical mechanics further revealed multi-nucleotide preferences, indicating dependency among several bases across pegRNA. Our work clarifies previous conflicting observations and uncovers context-dependent features important for optimizing PE designs.
Affiliation(s)
- Somang Kim
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Jimmy B. Yuan
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Wendy S. Woods
  - Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Destry A. Newton
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Pablo Perez-Pinera
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Department of Biomedical and Translational Sciences, Carle-Illinois College of Medicine, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Department of Molecular and Integrative Physiology, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Jun S. Song
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Center for Theoretical Physics, Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, United States
  - Department of Statistics, Harvard University, Cambridge, MA, United States
2
Kim S, Yuan JB, Woods WS, Newton DA, Perez-Pinera P, Song JS. Chromatin structure and context-dependent sequence features control prime editing efficiency. bioRxiv: The Preprint Server for Biology 2023:2023.04.15.536944. [PMID: 37162994 PMCID: PMC10168420 DOI: 10.1101/2023.04.15.536944] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
Prime editor (PE) is a highly versatile CRISPR-Cas9 genome editing technique. The current constructs, however, have variable efficiency and may require laborious experimental optimization. This study presents statistical models for learning the salient epigenomic and sequence features of target sites modulating the editing efficiency and provides guidelines for designing optimal PEs. We found that both regional constitutive heterochromatin and local nucleosome occlusion of target sites impede editing, while position-specific G/C nucleotides in the primer binding site (PBS) and reverse transcription (RT) template regions of PE guide-RNA (pegRNA) yield high editing efficiency, especially for short PBS designs. The presence of G/C nucleotides was most critical immediately 5' to the protospacer adjacent motif (PAM) site for all designs. The effects of different last templated nucleotides were quantified and seen to depend on both PBS and RT template lengths. Our models found AGG to be the preferred PAM and detected a guanine nucleotide four bases downstream of PAM to facilitate editing, suggesting a hitherto-unrecognized interaction with Cas9. A neural network interpretation method based on nonextensive statistical mechanics further revealed multi-nucleotide preferences, indicating dependency among several bases across pegRNA. Our work clarifies previous conflicting observations and uncovers context-dependent features important for optimizing PE designs.
3
Novakovsky G, Dexter N, Libbrecht MW, Wasserman WW, Mostafavi S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat Rev Genet 2023; 24:125-137. [PMID: 36192604 DOI: 10.1038/s41576-022-00532-2] [Citation(s) in RCA: 63] [Impact Index Per Article: 63.0] [Accepted: 08/31/2022] [Indexed: 01/24/2023]
Abstract
Artificial intelligence (AI) models based on deep learning now represent the state of the art for making functional predictions in genomics research. However, the underlying basis on which predictive models make such predictions is often unknown. For genomics researchers, this missing explanatory information would frequently be of greater value than the predictions themselves, as it can enable new insights into genetic processes. We review progress in the emerging area of explainable AI (xAI), a field with the potential to empower life science researchers to gain mechanistic insights into complex deep learning models. We discuss and categorize approaches for model interpretation, including an intuitive understanding of how each approach works and their underlying assumptions and limitations in the context of typical high-throughput biological datasets.
Affiliation(s)
- Gherman Novakovsky
  - Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, British Columbia, Canada
- Nick Dexter
  - Department of Mathematics, Simon Fraser University, Burnaby, British Columbia, Canada
  - School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
- Maxwell W Libbrecht
  - School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
- Wyeth W Wasserman
  - Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada
- Sara Mostafavi
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
  - Canadian Institute for Advanced Research, Toronto, Ontario, Canada
4
Mukul Das M, Sarkar K. Evaluation of machine learning classifiers for predicting essential genes in Mycobacterium tuberculosis strains. Bioinformation 2022; 18:1126-1130. [PMID: 37701504 PMCID: PMC10492903 DOI: 10.6026/973206300181126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2022] [Revised: 12/20/2022] [Accepted: 12/31/2022] [Indexed: 09/14/2023] Open
Abstract
Accurate investigation and prediction of essential genes in bacterial genomes is very important, as such genes may be explored as effective targets for antimicrobial drugs and for understanding the biological mechanisms of a cell. A subset of key features was obtained from 14 genome sequence-based features of 20 strains of Mycobacterium tuberculosis, whose essential gene information was downloaded from the ePath and NCBI databases; essential genes were mapped and matched using a genome extraction program. The key features were selected using a Genetic Algorithm. For each of three classifiers, 80%, 10% and 10% of the key-feature subset were used for training, validation and testing, respectively. Experimental results (10-fold cross-validation) showed that DNN (proposed), DT, and SVM achieved AUCs of 0.98, 0.88 and 0.82, respectively; DNN (proposed) thus outperformed DT and SVM. The higher prediction accuracy of the classifiers was attributed to using only key features, which also supports the better generalizability of the classifiers and the relevance of the key features to gene essentiality. In addition, DNN (proposed) showed the best prediction performance when compared with other predictors used in previous studies. The genome extraction program was developed for mapping and matching essential genes between the ePath and NCBI databases.
Affiliation(s)
- Monish Mukul Das
  - Department of Computer Science and Engineering, University of Kalyani, Kalyani, Nadia - 741235
- Keka Sarkar
  - Department of Microbiology, University of Kalyani, Kalyani, Nadia - 741235
5
On the Solutions of a Quadratic Integral Equation of the Urysohn Type of Fractional Variable Order. Entropy 2022; 24:e24070886. [PMID: 35885108 PMCID: PMC9316200 DOI: 10.3390/e24070886] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/08/2022] [Revised: 06/23/2022] [Accepted: 06/23/2022] [Indexed: 02/04/2023]
Abstract
In this manuscript we introduce a quadratic integral equation of the Urysohn type of fractional variable order. The existence and uniqueness of solutions of the proposed fractional model are studied by transforming it into an integral equation of fractional constant order. The new results are based on Schauder's fixed-point theorem and the Banach contraction principle, with the help of piecewise constant functions. Although these methods are very powerful, they have not previously been applied to the quadratic integral equation of the Urysohn type of fractional variable order. With this research we extend the applicability of these techniques to the introduced Urysohn-type model of fractional variable order. The applicability of the new results is demonstrated by providing Ulam–Hyers stability criteria and an example. Moreover, the presented results lead to future progress and expansion of the theory of fractional-order models, as well as of the concept of entropy in the framework of fractional calculus. Further, an example is constructed to demonstrate the reasonableness and effectiveness of the observed results.
6
Zhang S, Leistico JR, Cho RJ, Cheng JB, Song JS. Spectral clustering of single-cell multi-omics data on multilayer graphs. Bioinformatics 2022; 38:3600-3608. [PMID: 35652725 DOI: 10.1093/bioinformatics/btac378] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/24/2022] [Revised: 05/20/2022] [Accepted: 05/30/2022] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Single-cell sequencing technologies that simultaneously generate multimodal cellular profiles present opportunities for improved understanding of cell heterogeneity in tissues. How the multimodal information can be integrated to obtain a common cell type identification, however, poses a computational challenge. Multilayer graphs provide a natural representation of multi-omic single-cell sequencing datasets, and finding cell clusters may be understood as a multilayer graph partition problem. RESULTS We introduce two spectral algorithms on multilayer graphs, spectral clustering on multilayer graphs (SCML) and the weighted locally linear (WLL) method, to cluster cells in multi-omic single-cell sequencing datasets. We connect these algorithms through a unifying mathematical framework that represents each layer using a Hamiltonian operator and a mixture of its eigenstates to integrate the multiple graph layers, demonstrating in the process that the WLL method is a rigorous multilayer spectral graph theoretic reformulation of the popular Seurat weighted nearest neighbor (WNN) algorithm. Implementing our algorithms and applying them to a CITE-seq dataset of cord blood mononuclear cells yields results similar to the Seurat WNN analysis. Our work thus extends spectral methods to multimodal single-cell data analysis. AVAILABILITY The code used in this study can be found at https://github.com/jssong-lab/sc-spectrum. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shuyi Zhang
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Jacob R Leistico
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Raymond J Cho
  - Department of Dermatology, University of California, San Francisco, San Francisco, CA, USA
- Jeffrey B Cheng
  - Department of Dermatology, University of California, San Francisco, San Francisco, CA, USA
- Jun S Song
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
7
You Y, Lai X, Pan Y, Zheng H, Vera J, Liu S, Deng S, Zhang L. Artificial intelligence in cancer target identification and drug discovery. Signal Transduct Target Ther 2022; 7:156. [PMID: 35538061 PMCID: PMC9090746 DOI: 10.1038/s41392-022-00994-0] [Citation(s) in RCA: 68] [Impact Index Per Article: 34.0] [Received: 07/04/2021] [Revised: 03/14/2022] [Accepted: 04/05/2022] [Indexed: 02/08/2023] Open
Abstract
Artificial intelligence is an advanced method to identify novel anticancer targets and discover novel drugs from biology networks because the networks can effectively preserve and quantify the interaction between components of cell systems underlying human diseases such as cancer. Here, we review and discuss how to employ artificial intelligence approaches to identify novel anticancer targets and discover drugs. First, we describe the scope of artificial intelligence biology analysis for novel anticancer target investigations. Second, we review and discuss the basic principles and theory of commonly used network-based and machine learning-based artificial intelligence algorithms. Finally, we showcase the applications of artificial intelligence approaches in cancer target identification and drug discovery. Taken together, the artificial intelligence models have provided us with a quantitative framework to study the relationship between network characteristics and cancer, thereby leading to the identification of potential anticancer targets and the discovery of novel drug candidates.
Affiliation(s)
- Yujie You
  - College of Computer Science, Sichuan University, Chengdu, 610065, China
- Xin Lai
  - Laboratory of Systems Tumor Immunology, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Universitätsklinikum Erlangen, Erlangen, 91052, Germany
- Yi Pan
  - Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Room D513, 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen, 518055, China
- Huiru Zheng
  - School of Computing, Ulster University, Belfast, BT15 1ED, UK
- Julio Vera
  - Laboratory of Systems Tumor Immunology, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Universitätsklinikum Erlangen, Erlangen, 91052, Germany
- Suran Liu
  - College of Computer Science, Sichuan University, Chengdu, 610065, China
- Senyi Deng
  - Institute of Thoracic Oncology, Department of Thoracic Surgery, West China Hospital, Sichuan University, Chengdu, 610065, China
- Le Zhang
  - College of Computer Science, Sichuan University, Chengdu, 610065, China
  - Key Laboratory of Systems Biology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou, 310024, China
  - Key Laboratory of Systems Health Science of Zhejiang Province, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, 310024, China
8
Hartmann D, Franzen D, Brodehl S. Studying the Evolution of Neural Activation Patterns During Training of Feed-Forward ReLU Networks. Front Artif Intell 2021; 4:642374. [PMID: 35005614 PMCID: PMC8733739 DOI: 10.3389/frai.2021.642374] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2020] [Accepted: 11/10/2021] [Indexed: 11/26/2022] Open
Abstract
The ability of deep neural networks to form powerful emergent representations of complex statistical patterns in data is as remarkable as imperfectly understood. For deep ReLU networks, these are encoded in the mixed discrete–continuous structure of linear weight matrices and non-linear binary activations. Our article develops a new technique for instrumenting such networks to efficiently record activation statistics, such as information content (entropy) and similarity of patterns, in real-world training runs. We then study the evolution of activation patterns during training for networks of different architecture using different training and initialization strategies. As a result, we see characteristic- and general-related as well as architecture-related behavioral patterns: in particular, most architectures form bottom-up structure, with the exception of highly tuned state-of-the-art architectures and methods (PyramidNet and FixUp), where layers appear to converge more simultaneously. We also observe intermediate dips in entropy in conventional CNNs that are not visible in residual networks. A reference implementation is provided under a free license.
9
Abstract
This review surveys the literature on drug discovery through ML tools and techniques that are applied in every phase of drug development to accelerate the research process and reduce the risk and expenditure in clinical trials. Machine learning techniques improve decision-making on pharmaceutical data across various applications, such as QSAR analysis, hit discovery, and de novo drug architectures, to retrieve accurate outcomes. Target validation, prognostic biomarkers, and digital pathology are considered as problem statements in this review. ML challenges must be applicable for the main cause of inadequacy in interpretability outcomes that may restrict the applications in drug discovery. In clinical trials, absolute and methodological data must be generated to tackle many puzzles in validating ML techniques, improving decision-making, promoting awareness of ML approaches, and reducing risk failures in drug discovery.
Affiliation(s)
- Suresh Dara
  - Department of Computer Science and Engineering, B V Raju Institute of Technology, Narsapur, Medak, 502313 Telangana India
- Swetha Dhamercherla
  - Department of Computer Science and Engineering, B V Raju Institute of Technology, Narsapur, Medak, 502313 Telangana India
- Surender Singh Jadav
  - Centre for Molecular Cancer Research (CMCR) and Vishnu Institute of Pharmaceutical Education and Research (VIPER), Narsapur, Medak, 502313 Telangana India
- CH Madhu Babu
  - Department of Computer Science and Engineering, B V Raju Institute of Technology, Narsapur, Medak, 502313 Telangana India
- Mohamed Jawed Ahsan
  - Department of Pharmaceutical Chemistry, Maharishi Arvind College of Pharmacy, Jaipur, 302023 Rajasthan India
10
Manjunath M, Yan J, Youn Y, Drucker KL, Kollmeyer TM, McKinney AM, Zazubovich V, Zhang Y, Costello JF, Eckel-Passow J, Selvin PR, Jenkins RB, Song JS. Functional analysis of low-grade glioma genetic variants predicts key target genes and transcription factors. Neuro Oncol 2021; 23:638-649. [PMID: 33130899 DOI: 10.1093/neuonc/noaa248] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Large-scale genome-wide association studies (GWAS) have implicated thousands of germline genetic variants in modulating individuals' risk to various diseases, including cancer. At least 25 risk loci have been identified for low-grade gliomas (LGGs), but their molecular functions remain largely unknown. METHODS We hypothesized that GWAS loci contain causal single nucleotide polymorphisms (SNPs) that reside in accessible open chromatin regions and modulate the expression of target genes by perturbing the binding affinity of transcription factors (TFs). We performed an integrative analysis of genomic and epigenomic data from The Cancer Genome Atlas and other public repositories to identify candidate causal SNPs within linkage disequilibrium blocks of LGG GWAS loci. We assessed their potential regulatory role via in silico TF binding sequence perturbations, convolutional neural network trained on TF binding data, and simulated annealing-based interpretation methods. RESULTS We built an interactive website (http://education.knoweng.org/alg3/) summarizing the functional footprinting of 280 variants in 25 LGG GWAS regions, providing rich information for further computational and experimental scrutiny. We identified as case studies PHLDB1 and SLC25A26 as candidate target genes of rs12803321 and rs11706832, respectively, and predicted the GWAS variant rs648044 to be the causal SNP modulating ZBTB16, a known tumor suppressor in multiple cancers. We showed that rs648044 likely perturbed the binding affinity of the TF MAFF, as supported by RNA interference and in vitro MAFF binding experiments. CONCLUSIONS The identified candidate (causal SNP, target gene, TF) triplets and the accompanying resource will help accelerate our understanding of the molecular mechanisms underlying genetic risk factors for gliomas.
Affiliation(s)
- Mohith Manjunath
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Jialu Yan
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Yeoan Youn
  - Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Kristen L Drucker
  - Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota, USA
- Thomas M Kollmeyer
  - Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota, USA
- Andrew M McKinney
  - Department of Neurological Surgery, University of California San Francisco, San Francisco, California, USA
- Valter Zazubovich
  - Department of Physics, Concordia University, Montreal, Québec, Canada
- Yi Zhang
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
  - Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
- Joseph F Costello
  - Department of Neurological Surgery, University of California San Francisco, San Francisco, California, USA
- Paul R Selvin
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
  - Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Robert B Jenkins
  - Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota, USA
- Jun S Song
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
11
Koo PK, Majdandzic A, Ploenzke M, Anand P, Paul SB. Global importance analysis: An interpretability method to quantify importance of genomic features in deep neural networks. PLoS Comput Biol 2021; 17:e1008925. [PMID: 33983921 PMCID: PMC8118286 DOI: 10.1371/journal.pcbi.1008925] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Received: 08/18/2020] [Accepted: 03/30/2021] [Indexed: 12/15/2022] Open
Abstract
Deep neural networks have demonstrated improved performance at predicting the sequence specificities of DNA- and RNA-binding proteins compared to previous methods that rely on k-mers and position weight matrices. To gain insights into why a DNN makes a given prediction, model interpretability methods, such as attribution methods, can be employed to identify motif-like representations along a given sequence. Because explanations are given on an individual sequence basis and can vary substantially across sequences, deducing generalizable trends across the dataset and quantifying their effect size remains a challenge. Here we introduce global importance analysis (GIA), a model interpretability method that quantifies the population-level effect size that putative patterns have on model predictions. GIA provides an avenue to quantitatively test hypotheses of putative patterns and their interactions with other patterns, as well as map out specific functions the network has learned. As a case study, we demonstrate the utility of GIA on the computational task of predicting RNA-protein interactions from sequence. We first introduce a convolutional network, we call ResidualBind, and benchmark its performance against previous methods on RNAcompete data. Using GIA, we then demonstrate that in addition to sequence motifs, ResidualBind learns a model that considers the number of motifs, their spacing, and sequence context, such as RNA secondary structure and GC-bias.
Affiliation(s)
- Peter K. Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Antonio Majdandzic
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Matthew Ploenzke
  - Department of Biostatistics, Harvard University, Cambridge, Massachusetts, United States of America
- Praveen Anand
  - Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
- Steffan B. Paul
  - Bioinformatics and Integrative Genomics Program, Harvard Medical School, Boston, Massachusetts, United States of America
12
Luu AM, Leistico JR, Miller T, Kim S, Song JS. Predicting TCR-Epitope Binding Specificity Using Deep Metric Learning and Multimodal Learning. Genes (Basel) 2021; 12:genes12040572. [PMID: 33920780 PMCID: PMC8071129 DOI: 10.3390/genes12040572] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/26/2021] [Revised: 04/12/2021] [Accepted: 04/13/2021] [Indexed: 12/18/2022] Open
Abstract
Understanding the recognition of specific epitopes by cytotoxic T cells is a central problem in immunology. Although predicting binding between peptides and the class I Major Histocompatibility Complex (MHC) has had success, predicting interactions between T cell receptors (TCRs) and MHC class I-peptide complexes (pMHC) remains elusive. This paper utilizes a convolutional neural network model employing deep metric learning and multimodal learning to perform two critical tasks in TCR-epitope binding prediction: identifying the TCRs that bind a given epitope from a TCR repertoire, and identifying the binding epitope of a given TCR from a list of candidate epitopes. Our model can perform both tasks simultaneously and reveals that inconsistent preprocessing of TCR sequences can confound binding prediction. Applying a neural network interpretation method identifies key amino acid sequence patterns and positions within the TCR, important for binding specificity. Contrary to common assumption, known crystal structures of TCR-pMHC complexes show that the predicted salient amino acid positions are not necessarily the closest to the epitopes, implying that physical proximity may not be a good proxy for importance in determining TCR-epitope specificity. Our work thus provides an insight into the learned predictive features of TCR-epitope binding specificity and advances the associated classification tasks.
Affiliation(s)
- Alan M. Luu
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Jacob R. Leistico
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Tim Miller
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Somang Kim
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Jun S. Song
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
  - Cancer Center at Illinois, University of Illinois, Urbana, IL 61801, USA
13
|
Yang S, Zhu F, Ling X, Liu Q, Zhao P. Intelligent Health Care: Applications of Deep Learning in Computational Medicine. Front Genet 2021; 12:607471. [PMID: 33912213 PMCID: PMC8075004 DOI: 10.3389/fgene.2021.607471] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 09/17/2020] [Accepted: 03/05/2021] [Indexed: 12/24/2022]
Abstract
With advances in medical technology, the biomedical field has entered the era of big data, and computational medicine has emerged on this basis, driven by artificial intelligence. Effective information must be extracted from these large biomedical datasets to promote the development of precision medicine. Traditionally, machine learning methods have been used to mine biomedical data for features; these methods generally rely on feature engineering and expert domain knowledge and require substantial time and human resources. In contrast, deep learning, a cutting-edge branch of machine learning, can automatically learn complex and robust features from raw data without feature engineering. The applications of deep learning to medical images, electronic health records, genomics, and drug development are examined, suggesting that deep learning has a clear advantage in making full use of biomedical data and improving healthcare. Deep learning plays an increasingly important role in medicine and health and has broad prospects for application. However, problems and challenges remain for deep learning in computational medicine, including insufficient data, interpretability, data privacy, and heterogeneity. Analysis and discussion of these problems provide a reference for improving the application of deep learning in healthcare.
Affiliation(s)
- Sijie Yang
- School of Computer Science and Technology, Soochow University, Suzhou, China
- Fei Zhu
- School of Computer Science and Technology, Soochow University, Suzhou, China
- Xinghong Ling
- School of Computer Science and Technology, Soochow University, Suzhou, China
- WenZheng College of Soochow University, Suzhou, China
- Quan Liu
- School of Computer Science and Technology, Soochow University, Suzhou, China
- Peiyao Zhao
- School of Computer Science and Technology, Soochow University, Suzhou, China
14
15
Abstract
BACKGROUND: Essential genes are those genes that are critical for the survival of an organism. The prediction of essential genes in bacteria can provide targets for the design of novel antibiotic compounds or antimicrobial strategies. RESULTS: We propose a deep neural network for predicting essential genes in microbes. Our architecture, called DEEPLYESSENTIAL, makes minimal assumptions about the input data (i.e., it uses only the gene primary sequence and the corresponding protein sequence) to carry out the prediction, thus maximizing its practical application compared to existing predictors that require structural or topological features which might not be readily available. We also expose and study a hidden performance bias that affected previous classifiers. Extensive results show that DEEPLYESSENTIAL outperforms existing classifiers that either employ down-sampling to balance the training set or use clustering to exclude multiple copies of orthologous genes. CONCLUSION: Deep neural network architectures can efficiently predict whether a microbial gene is essential (or not) using only its sequence information.
Affiliation(s)
- Md Abid Hasan
- Department of Computer Science and Engineering, University of California Riverside, 900 University Ave, Riverside, CA 92507, USA
- Stefano Lonardi
- Department of Computer Science and Engineering, University of California Riverside, 900 University Ave, Riverside, CA 92507, USA
16
Tuladhar R, Santamaria F, Stamova I. Fractional Lotka-Volterra-Type Cooperation Models: Impulsive Control on Their Stability Behavior. Entropy (Basel) 2020; 22:E970. [PMID: 33286739 PMCID: PMC7597273 DOI: 10.3390/e22090970] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/01/2020] [Revised: 08/26/2020] [Accepted: 08/27/2020] [Indexed: 11/29/2022]
Abstract
We present a biological fractional n-species delayed cooperation model of Lotka-Volterra type. The considered fractional derivatives are in the Caputo sense. Impulsive control strategies are applied for several stability properties of the states, namely Mittag-Leffler stability, practical stability and stability with respect to sets. The proposed results extend the existing stability results for integer-order n-species delayed Lotka-Volterra cooperation models to the fractional-order case under impulsive control.
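For readers unfamiliar with the terms used in this abstract, the two standard objects it names can be written out as follows. These are the textbook forms for an order 0 < α < 1 and a lower limit t_0; the abstract does not state the order range or notation used in the paper, so the symbols below are assumptions.

```latex
% Caputo fractional derivative of order \alpha, with 0 < \alpha < 1 (standard definition)
{}^{C}_{t_0}D^{\alpha}_{t}\, x(t)
  = \frac{1}{\Gamma(1-\alpha)} \int_{t_0}^{t} \frac{x'(s)}{(t-s)^{\alpha}}\, ds

% One-parameter Mittag-Leffler function
E_{\alpha}(z) = \sum_{k=0}^{\infty} \frac{z^{k}}{\Gamma(\alpha k + 1)}, \qquad \alpha > 0
```

Since E_1(z) = e^z, Mittag-Leffler stability can be read as the fractional-order counterpart of exponential stability.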
Affiliation(s)
- Rohisha Tuladhar
- Department of Biology, University of Texas at San Antonio, San Antonio, TX 78249, USA
- Fidel Santamaria
- Department of Biology, University of Texas at San Antonio, San Antonio, TX 78249, USA
- Ivanka Stamova
- Department of Mathematics, University of Texas at San Antonio, San Antonio, TX 78249, USA
17
Finnegan AI, Kim S, Jin H, Gapinske M, Woods WS, Perez-Pinera P, Song JS. Epigenetic engineering of yeast reveals dynamic molecular adaptation to methylation stress and genetic modulators of specific DNMT3 family members. Nucleic Acids Res 2020; 48:4081-4099. [PMID: 32187373 PMCID: PMC7192628 DOI: 10.1093/nar/gkaa161] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 10/29/2019] [Revised: 02/16/2020] [Accepted: 03/13/2020] [Indexed: 12/21/2022]
Abstract
Cytosine methylation is a ubiquitous modification in mammalian DNA generated and maintained by several DNA methyltransferases (DNMTs) with partially overlapping functions and genomic targets. To systematically dissect the factors specifying each DNMT's activity, we engineered combinatorial knock-in of human DNMT genes in Komagataella phaffii, a yeast species lacking endogenous DNA methylation. Time-course expression measurements captured dynamic network-level adaptation of cells to DNMT3B1-induced DNA methylation stress and showed that coordinately modulating the availability of S-adenosyl methionine (SAM), the essential metabolite for DNMT-catalyzed methylation, is an evolutionarily conserved epigenetic stress response, also implicated in several human diseases. Convolutional neural networks trained on genome-wide CpG-methylation data learned distinct sequence preferences of DNMT3 family members. A simulated annealing interpretation method resolved these preferences into individual flanking nucleotides and periodic poly(A) tracts that rotationally position highly methylated cytosines relative to phased nucleosomes. Furthermore, the nucleosome repeat length defined the spatial unit of methylation spreading. Gene methylation patterns were similar to those in mammals, and hypo- and hypermethylation were predictive of increased and decreased transcription relative to control, respectively, in the absence of mammalian readers of DNA methylation. Introducing controlled epigenetic perturbations in yeast thus enabled characterization of fundamental genomic features directing specific DNMT3 proteins.
Affiliation(s)
- Alex I Finnegan
- Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Somang Kim
- Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Hu Jin
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Michael Gapinske
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Wendy S Woods
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Pablo Perez-Pinera
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Department of Biomedical and Translational Sciences, Carle-Illinois College of Medicine, University of Illinois, Urbana, IL 61801, USA
- Cancer Center at Illinois, University of Illinois, Urbana, IL 61801, USA
- Jun S Song
- Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Cancer Center at Illinois, University of Illinois, Urbana, IL 61801, USA
18
Koo PK, Ploenzke M. Deep learning for inferring transcription factor binding sites. Curr Opin Syst Biol 2020; 19:16-23. [PMID: 32905524 PMCID: PMC7469942 DOI: 10.1016/j.coisb.2020.04.001] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Indexed: 01/28/2023]
Abstract
Deep learning is a powerful tool for predicting transcription factor binding sites from DNA sequence. Despite high predictive accuracy, there is no guarantee that a high-performing deep learning model will learn causal sequence-function relationships. Thus, a move beyond performance comparisons on benchmark datasets is needed. Interpreting model predictions is a powerful approach to identify which features drive performance gains and, ideally, to provide insight into the underlying biological mechanisms. Here we highlight timely advances in deep learning for genomics, with a focus on inferring transcription factor binding sites. We describe recent applications, model architectures, and advances in local and global model interpretability methods, then conclude with a discussion of future research directions.
Affiliation(s)
- Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Matt Ploenzke
- Department of Biostatistics, Harvard University, Cambridge, MA, USA
19
Vamathevan J, Clark D, Czodrowski P, Dunham I, Ferran E, Lee G, Li B, Madabhushi A, Shah P, Spitzer M, Zhao S. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 2019; 18:463-477. [PMID: 30976107 DOI: 10.1038/s41573-019-0024-5] [Citation(s) in RCA: 968] [Impact Index Per Article: 193.6] [Indexed: 02/07/2023]
Abstract
Drug discovery and development pipelines are long, complex and depend on numerous factors. Machine learning (ML) approaches provide a set of tools that can improve discovery and decision making for well-specified questions with abundant, high-quality data. Opportunities to apply ML occur in all stages of drug discovery. Examples include target validation, identification of prognostic biomarkers and analysis of digital pathology data in clinical trials. Applications have ranged in context and methodology, with some approaches yielding accurate predictions and insights. The challenges of applying ML lie primarily with the lack of interpretability and repeatability of ML-generated results, which may limit their application. In all areas, systematic and comprehensive high-dimensional data still need to be generated. With ongoing efforts to tackle these issues, as well as increasing awareness of the factors needed to validate ML approaches, the application of ML can promote data-driven decision making and has the potential to speed up the process and reduce failure rates in drug discovery and development.
Affiliation(s)
- Jessica Vamathevan
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Dominic Clark
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Ian Dunham
- Open Targets and European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Edgardo Ferran
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- George Lee
- Bristol-Myers Squibb, Princeton, NJ, USA
- Bin Li
- Takeda Pharmaceuticals International Co., Cambridge, MA, USA
- Anant Madabhushi
- Case Western Reserve University, Cleveland, OH, USA
- Louis Stokes Cleveland Veterans Affairs Medical Center, Cleveland, OH, USA
- Michaela Spitzer
- Open Targets and European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Shanrong Zhao
- Pfizer Worldwide Research and Development, Cambridge, MA, USA
20
Durstewitz D, Koppe G, Meyer-Lindenberg A. Deep neural networks in psychiatry. Mol Psychiatry 2019; 24:1583-1598. [PMID: 30770893 DOI: 10.1038/s41380-019-0365-9] [Citation(s) in RCA: 108] [Impact Index Per Article: 21.6] [Received: 06/05/2018] [Revised: 01/02/2019] [Accepted: 01/24/2019] [Indexed: 01/03/2023]
Abstract
Machine and deep learning methods, today's core of artificial intelligence, have been applied with increasing success and impact in many commercial and research settings. They are powerful tools for large scale data analysis, prediction and classification, especially in very data-rich environments ("big data"), and have started to find their way into medical applications. Here we will first give an overview of machine learning methods, with a focus on deep and recurrent neural networks, their relation to statistics, and the core principles behind them. We will then discuss and review directions along which (deep) neural networks can be, or already have been, applied in the context of psychiatry, and will try to delineate their future potential in this area. We will also comment on an emerging area that so far has been much less well explored: by embedding semantically interpretable computational models of brain dynamics or behavior into a statistical machine learning context, insights into dysfunction beyond mere prediction and classification may be gained. Especially this marriage of computational models with statistical inference may offer insights into neural and behavioral mechanisms that could open completely novel avenues for psychiatric treatment.
Affiliation(s)
- Daniel Durstewitz
- Department of Theoretical Neuroscience, Central Institute of Mental Health, Medical Faculty Mannheim/Heidelberg University, 68159, Mannheim, Germany
- Georgia Koppe
- Department of Theoretical Neuroscience, Central Institute of Mental Health, Medical Faculty Mannheim/Heidelberg University, 68159, Mannheim, Germany
- Department of Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim/Heidelberg University, 68159, Mannheim, Germany
- Andreas Meyer-Lindenberg
- Department of Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim/Heidelberg University, 68159, Mannheim, Germany
21
Greenside P, Shimko T, Fordyce P, Kundaje A. Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics 2019; 34:i629-i637. [PMID: 30423062 PMCID: PMC6129272 DOI: 10.1093/bioinformatics/bty575] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Indexed: 11/15/2022]
Abstract
Motivation: Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models. Results: We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics. Availability and implementation: Code is available at: https://github.com/kundajelab/dfim. Supplementary information: Supplementary data are available at Bioinformatics online.
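The notion of an epistatic (non-additive) interaction between two sequence features can be illustrated with a generic double-mutant score. The sketch below is not DFIM and does not use its code; it is a plain second-order difference applied to a hypothetical toy scoring function (`toy_score`), whose motif strings, synergy bonus, and example sequence are all assumptions made for illustration.

```python
def mutate(seq, pos, base):
    """Return a copy of seq with the nucleotide at pos replaced by base."""
    return seq[:pos] + base + seq[pos + 1:]


def toy_score(seq, synergy=2.0):
    """Hypothetical stand-in for a trained model: one point per motif present,
    plus a synergy bonus when both motifs occur in the same sequence."""
    has_gata = "GATA" in seq
    has_ebox = "CAGCTG" in seq
    bonus = synergy if (has_gata and has_ebox) else 0.0
    return float(has_gata) + float(has_ebox) + bonus


def epistasis(score_fn, seq, i, base_i, j, base_j):
    """Second-order difference f(ij) - f(i) - f(j) + f(wt).
    It is zero when the effects of the two mutations are purely additive."""
    wt = score_fn(seq)
    single_i = score_fn(mutate(seq, i, base_i))
    single_j = score_fn(mutate(seq, j, base_j))
    double = score_fn(mutate(mutate(seq, i, base_i), j, base_j))
    return double - single_i - single_j + wt


# Toy sequence containing GATA at positions 2-5 and CAGCTG at positions 9-14.
seq = "TTGATAACCCAGCTGTT"
synergistic = epistasis(toy_score, seq, 2, "C", 11, "T")
additive = epistasis(lambda s: toy_score(s, synergy=0.0), seq, 2, "C", 11, "T")
print(synergistic, additive)
```

With the synergy bonus the score is non-zero, and with the bonus switched off it vanishes, which is the distinction between interacting and merely co-occurring features.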
Affiliation(s)
- Peyton Greenside
- Biomedical Informatics Training Program, Stanford University, Stanford, CA
- Polly Fordyce
- Genetics, Stanford University, Stanford, CA
- Bioengineering, Stanford University, Stanford, CA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
- Chem-H Institute, Stanford University, Stanford, CA, USA
- Anshul Kundaje
- Genetics, Stanford University, Stanford, CA
- Computer Science, Stanford University, Stanford, CA, USA
22
Liu G, Zeng H, Gifford DK. Visualizing complex feature interactions and feature sharing in genomic deep neural networks. BMC Bioinformatics 2019; 20:401. [PMID: 31324140 PMCID: PMC6642501 DOI: 10.1186/s12859-019-2957-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 11/14/2018] [Accepted: 06/18/2019] [Indexed: 12/31/2022]
Abstract
BACKGROUND: Visualization tools for deep learning models typically focus on discovering key input features without considering how such low-level features are combined in intermediate layers to make decisions. Moreover, many of these methods examine a network's response to specific input examples that may be insufficient to reveal the complexity of model decision making. RESULTS: We present DeepResolve, an analysis framework for deep convolutional models of genome function that visualizes how input features contribute individually and combinatorially to network decisions. Unlike other methods, DeepResolve does not depend upon the analysis of a predefined set of inputs. Rather, it uses gradient ascent to stochastically explore intermediate feature maps to 1) discover important features, 2) visualize their contribution and interaction patterns, and 3) analyze feature sharing across tasks that suggests shared biological mechanisms. We demonstrate the visualization of decision making using our proposed method on deep neural networks trained on both experimental and synthetic data. DeepResolve is competitive with existing visualization tools in discovering key sequence features, and identifies certain negative features and non-additive feature interactions that are not easily observed with existing tools. It also recovers similarities between poorly correlated classes which are not observed by traditional methods. DeepResolve reveals that DeepSEA's learned decision structure is shared across genome annotations including histone marks, DNase hypersensitivity, and transcription factor binding. We identify groups of TFs that suggest known shared biological mechanisms, and recover correlation between DNA hypersensitivities and TF/Chromatin marks. CONCLUSIONS: DeepResolve is capable of visualizing complex feature contribution patterns and feature interactions that contribute to decision making in genomic deep convolutional networks. It also recovers feature sharing and class similarities which suggest interesting biological mechanisms. DeepResolve is compatible with existing visualization tools and provides complementary insights.
Affiliation(s)
- Ge Liu
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Haoyang Zeng
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- David K Gifford
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
23
Dense neural networks for predicting chromatin conformation. BMC Bioinformatics 2018; 19:372. [PMID: 30314429 PMCID: PMC6186068 DOI: 10.1186/s12859-018-2286-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 04/30/2018] [Accepted: 07/16/2018] [Indexed: 12/12/2022]
Abstract
Background: DNA inside eukaryotic cells wraps around histones to form the 11 nm chromatin fiber that can further fold into higher-order DNA loops, which may depend on the binding of architectural factors. Predicting how the DNA will fold given a distribution of bound factors, here viewed as a type of sequence, is currently an unsolved problem, and several heterogeneous polymer models have shown that many features of the measured structure can be reproduced from simulations. However, a model that determines the optimal connection between sequence and structure and that can rapidly assess the effects of varying either one is still lacking. Results: Here we train a dense neural network to solve for the local folding of chromatin, connecting structure, represented as a contact map, to a sequence of bound chromatin factors. The network includes a convolutional filter that compresses the large number of bound chromatin factors into a single 1D sequence representation that is optimized for predicting structure. We also train a network to solve the inverse problem, namely, given only structural information in the form of a contact map, predict the likely sequence of chromatin states that generated it. Conclusions: By carrying out sensitivity analysis on both networks, we are able to highlight the importance of chromatin contexts and neighborhoods for regulating long-range contacts, along with critical alterations that affect contact formation. Our analysis shows that the networks have learned physical insights that are informative and intuitive about this complex polymer problem.
24
Abstract
Nucleosomes form the fundamental building blocks of eukaryotic chromatin, and previous attempts to understand the principles governing their genome-wide distribution have spurred much interest and debate in biology. In particular, the precise role of DNA sequence in shaping local chromatin structure has been controversial. This paper rigorously quantifies the contribution of hitherto-debated sequence features (including G+C content, 10.5 bp periodicity, and poly(dA:dT) tracts) to three distinct aspects of the genome-wide nucleosome landscape: occupancy, translational positioning and rotational positioning. Our computational framework simultaneously learns nucleosome number and nucleosome-positioning energy from genome-wide nucleosome maps. In contrast to previous studies, our model can predict both in vitro and in vivo nucleosome maps in Saccharomyces cerevisiae. We find that although G+C content is the primary determinant of MNase-derived nucleosome occupancy, MNase digestion biases may substantially influence this GC dependence. By contrast, poly(dA:dT) tracts are seen to deter nucleosome formation, regardless of the experimental method used. We further show that the 10.5 bp nucleotide periodicity facilitates rotational but not translational positioning. Applying our method to in vivo nucleosome maps demonstrates that, for a subset of genes, the regularly spaced nucleosome arrays observed around transcription start sites can be partially recapitulated by DNA sequence alone. Finally, in vivo nucleosome occupancy derived from MNase-seq experiments around transcription termination sites can be mostly explained by the genomic sequence. Implications of these results and potential extensions of the proposed computational framework are discussed.
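Two of the sequence features named in this abstract, G+C content and poly(dA:dT) tracts, are simple to compute from a DNA string. The sketch below is only an illustration of those two quantities and is not the paper's computational framework; the function names, the minimum tract length of 5, and the example sequence are assumptions.

```python
import re


def gc_content(seq):
    """Fraction of G or C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def poly_da_dt_tracts(seq, min_len=5):
    """Return (start, end) half-open intervals of runs of at least min_len
    consecutive A's or consecutive T's on the given strand. A run of T's
    corresponds to a run of A's on the complementary strand."""
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_len, min_len))
    return [(m.start(), m.end()) for m in pattern.finditer(seq.upper())]


# Hypothetical 19-nt example: 8 G/C bases, one A-tract and one T-tract.
example = "GCGCAAAAAAGCTTTTTGC"
gc = gc_content(example)
tracts = poly_da_dt_tracts(example)
print(round(gc, 3), tracts)
```

In a genome-wide setting these quantities would typically be evaluated in sliding windows; the window size is a separate modeling choice not specified here.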
Affiliation(s)
- Hu Jin
- Department of Physics, University of Illinois, Urbana-Champaign, Urbana, IL 61801
- Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Urbana, IL 61801
- Alex I. Finnegan
- Department of Physics, University of Illinois, Urbana-Champaign, Urbana, IL 61801
- Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Urbana, IL 61801
- Jun S. Song
- Department of Physics, University of Illinois, Urbana-Champaign, Urbana, IL 61801
- Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Urbana, IL 61801