1
|
Friedman RZ, Ramu A, Lichtarge S, Myers CA, Granas DM, Gause M, Corbo JC, Cohen BA, White MA. Active learning of enhancer and silencer regulatory grammar in photoreceptors. bioRxiv 2023:2023.08.21.554146. [PMID: 37662358 PMCID: PMC10473580 DOI: 10.1101/2023.08.21.554146] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 09/05/2023]
Abstract
Cis-regulatory elements (CREs) direct gene expression in health and disease, and models that can accurately predict their activities from DNA sequences are crucial for biomedicine. Deep learning represents one emerging strategy to model the regulatory grammar that relates CRE sequence to function. However, these models require training data on a scale that exceeds the number of CREs in the genome. We address this problem using active machine learning to iteratively train models on multiple rounds of synthetic DNA sequences assayed in live mammalian retinas. During each round of training the model actively selects sequence perturbations to assay, thereby efficiently generating informative training data. We iteratively trained a model that predicts the activities of sequences containing binding motifs for the photoreceptor transcription factor Cone-rod homeobox (CRX) using an order of magnitude less training data than current approaches. The model's internal confidence estimates of its predictions are reliable guides for designing sequences with high activity. The model correctly identified critical sequence differences between active and inactive sequences with nearly identical transcription factor binding sites, and revealed order and spacing preferences for combinations of motifs. Our results establish active learning as an effective method to train accurate deep learning models of cis-regulatory function after exhausting naturally occurring training examples in the genome.
Affiliation(s)
- Ryan Z. Friedman
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Avinash Ramu
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Sara Lichtarge
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Connie A. Myers
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO, 63110
| | - David M. Granas
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Maria Gause
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Joseph C. Corbo
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Barak A. Cohen
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Michael A. White
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| |
|
2
|
Mukadum F, Nguyen Q, Adrion DM, Appleby G, Chen R, Dang H, Chang R, Garnett R, Lopez SA. Efficient Discovery of Visible Light-Activated Azoarene Photoswitches with Long Half-Lives Using Active Search. J Chem Inf Model 2021; 61:5524-5534. [PMID: 34752100 DOI: 10.1021/acs.jcim.1c00954] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/07/2023]
Abstract
Photoswitches are molecules that undergo a reversible, structural isomerization after exposure to certain wavelengths of light. The dynamic control offered by molecular photoswitches is favorable for materials chemistry, photopharmacology, and catalysis applications. Ideal photoswitches absorb visible light and have long-lived metastable isomers. We used high-throughput virtual screening to predict the absorption maxima (λmax) of the E-isomer and half-life (t1/2) of the Z-isomer. However, computing the photophysical and kinetic stabilities with density functional theory of each entry of a virtual molecular library containing thousands or millions of molecules is prohibitively time-consuming. We applied active search, a machine-learning technique, to intelligently search a chemical search space of 255 991 photoswitches based on 29 known azoarenes and their derivatives. We iteratively trained the active search algorithm on whether a candidate absorbed visible light (λmax > 450 nm). Active search was found to triple the discovery rate compared to random search. Further, we projected 1962 photoswitches to 2D using the Uniform Manifold Approximation and Projection algorithm and found that λmax depends on the core, which is tunable by substituents. We then incorporated a second stage of screening to predict the stabilities of the Z-isomers for the top candidates of each core. We identified four ideal photoswitches that concurrently satisfy the following criteria: λmax > 450 nm and t1/2 > 2 h. These candidates had λmax values ranging from 465 to 531 nm and t1/2 values ranging from hours to days.
Affiliation(s)
- Fatemah Mukadum
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, Massachusetts 02115, United States
| | - Quan Nguyen
- Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, Missouri 63130, United States
| | - Daniel M Adrion
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, Massachusetts 02115, United States
| | - Gabriel Appleby
- Department of Computer Science, Tufts University, Medford, Massachusetts 02155, United States
| | - Rui Chen
- Department of Computer Science, Tufts University, Medford, Massachusetts 02155, United States
| | - Haley Dang
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, Massachusetts 02115, United States
| | - Remco Chang
- Department of Computer Science, Tufts University, Medford, Massachusetts 02155, United States
| | - Roman Garnett
- Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, Missouri 63130, United States
| | - Steven A Lopez
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, Massachusetts 02115, United States
| |
|
3
|
Gong Y, Xue D, Chuai G, Yu J, Liu Q. DeepReac+: deep active learning for quantitative modeling of organic chemical reactions. Chem Sci 2021; 12:14459-14472. [PMID: 34880997 PMCID: PMC8580052 DOI: 10.1039/d1sc02087k] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/14/2021] [Accepted: 10/08/2021] [Indexed: 11/21/2022] Open
Abstract
Various computational methods have been developed for quantitative modeling of organic chemical reactions; however, the lack of universality as well as the requirement of large amounts of experimental data limit their broad applications. Here, we present DeepReac+, an efficient and universal computational framework for prediction of chemical reaction outcomes and identification of optimal reaction conditions based on deep active learning. Under this framework, DeepReac is designed as a graph-neural-network-based model, which directly takes 2D molecular structures as inputs and automatically adapts to different prediction tasks. In addition, carefully-designed active learning strategies are incorporated to substantially reduce the number of necessary experiments for model training. We demonstrate the universality and high efficiency of DeepReac+ by achieving the state-of-the-art results with a minimum of labeled data on three diverse chemical reaction datasets in several scenarios. Collectively, DeepReac+ has great potential and utility in the development of AI-aided chemical synthesis. DeepReac+ is freely accessible at https://github.com/bm2-lab/DeepReac.
Affiliation(s)
- Yukang Gong
- Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200072, China
| | - Dongyu Xue
- Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200072, China
| | - Guohui Chuai
- Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200072, China
| | - Jing Yu
- Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200072, China
| | - Qi Liu
- Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200072, China
| |
|
4
|
Predicting kinase inhibitors using bioactivity matrix derived informer sets. PLoS Comput Biol 2019; 15:e1006813. [PMID: 31381559 PMCID: PMC6695194 DOI: 10.1371/journal.pcbi.1006813] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/24/2019] [Revised: 08/15/2019] [Accepted: 07/13/2019] [Indexed: 12/21/2022] Open
Abstract
Prediction of compounds that are active against a desired biological target is a common step in drug discovery efforts. Virtual screening methods seek some active-enriched fraction of a library for experimental testing. Where data are too scarce to train supervised learning models for compound prioritization, initial screening must provide the necessary data. Commonly, such an initial library is selected on the basis of chemical diversity by some pseudo-random process (for example, the first few plates of a larger library) or by selecting an entire smaller library. These approaches may not produce a sufficient number or diversity of actives. An alternative approach is to select an informer set of screening compounds on the basis of chemogenomic information from previous testing of compounds against a large number of targets. We compare different ways of using chemogenomic data to choose a small informer set of compounds based on previously measured bioactivity data. We develop this Informer-Based-Ranking (IBR) approach using the Published Kinase Inhibitor Sets (PKIS) as the chemogenomic data to select the informer sets. We test the informer compounds on a target that is not part of the chemogenomic data, then predict the activity of the remaining compounds based on the experimental informer data and the chemogenomic data. Through new chemical screening experiments, we demonstrate the utility of IBR strategies in a prospective test on three kinase targets not included in the PKIS.
In the early stages of drug discovery efforts, computational models are used to predict activity and prioritize compounds for experimental testing. New targets commonly lack the data necessary to build effective models, and the screening needed to generate that experimental data can be costly. We seek to improve the efficiency of the initial screening phase, and of the process of prioritizing compounds for subsequent screening. We choose a small informer set of compounds based on publicly available prior screening data on distinct targets. We then collect experimental data on these informer compounds and use that data to predict the activity of other compounds in the set for the target of interest. Computational and statistical tools are needed to identify informer compounds and to prioritize other compounds for subsequent phases of screening. We find that selection of informer compounds on the basis of bioactivity data from previous screening efforts is superior to the traditional approach of selection of a chemically diverse subset of compounds. We demonstrate the success of this approach in retrospective tests on the Published Kinase Inhibitor Sets (PKIS) chemogenomic data and in prospective experimental screens against three additional non-human kinase targets.
|
5
|
Miyao T, Funatsu K. Iterative Screening Methods for Identification of Chemical Compounds with Specific Values of Various Properties. J Chem Inf Model 2019; 59:2626-2641. [PMID: 31058504 DOI: 10.1021/acs.jcim.9b00093] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/28/2022]
Abstract
Identification of chemical compounds having desirable properties is a central goal of screening campaigns. Iterative screening is a means of surveying a set of compounds, during which their property values are determined and used as feedback for regression models. Quantitative models that assess the relationships between chemical structures and property/activity are repeatedly updated through this type of cycle, and the efficient sampling of compounds for the subsequent test is a key factor in the early identification of target compounds. Nevertheless, methodological approaches to comparisons and to establishing the degree of extrapolation of sampled compounds, including the effects of applicability domains, are still required. In the present study, we conducted a series of virtual experiments to assess the characteristics of different iterative screening methods. Genetic algorithm-based partial least-squares regression, support vector regression, Bayesian optimization with Gaussian Process (GP), and batch-based Bayesian optimization with GP (GP_batch) were all compared, based on the analysis of one million compounds extracted from the ZINC database. Our results show that, irrespective of the diversity of the initial set of compounds, it was possible to identify a compound having the desired property value using the appropriate screening method. However, overall, the GP_batch method was found to be preferable when evaluating properties that are either difficult to predict or for which a key factor is present in the set of molecular descriptors.
Affiliation(s)
- Tomoyuki Miyao
- Data Science Center and Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
| | - Kimito Funatsu
- Data Science Center and Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
- Department of Chemical System Engineering, School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
| |
|
6
|
Cortés-Ciriano I, Firth NC, Bender A, Watson O. Discovering Highly Potent Molecules from an Initial Set of Inactives Using Iterative Screening. J Chem Inf Model 2018; 58:2000-2014. [PMID: 30130102 DOI: 10.1021/acs.jcim.8b00376] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Indexed: 02/07/2023]
Abstract
The versatility of similarity searching and quantitative structure-activity relationships to model the activity of compound sets within given bioactivity ranges (i.e., interpolation) is well established. However, their relative performance in the common scenario in early stage drug discovery where lots of inactive data but no active data points are available (i.e., extrapolation from the low-activity to the high-activity range) has not yet been thoroughly examined. To this end, we have designed an iterative virtual screening strategy which was evaluated on 25 diverse bioactivity data sets from ChEMBL. We benchmark the efficiency of random forest (RF), multiple linear regression, ridge regression, similarity searching, and random selection of compounds to identify a highly active molecule in the test set among a large number of low-potency compounds. We use the number of iterations required to find this active molecule to evaluate the performance of each experimental setup. We show that linear and ridge regression often outperform RF and similarity searching, reducing the number of iterations to find an active compound by a factor of 2 or more. Even simple regression methods seem better able to extrapolate to high-bioactivity ranges than RF, which only provides output values in the range covered by the training set. In addition, examination of the scaffold diversity in the data sets used shows that in some cases similarity searching and RF require two times as many iterations as random selection depending on the chemical space covered in the initial training data. Lastly, we show using bioactivity data for COX-1 and COX-2 that our framework can be extended to multitarget drug discovery, where compounds are selected by concomitantly considering their activity against multiple targets. Overall, this study provides an approach for iterative screening where only inactive data are present in early stages of drug discovery in order to discover highly potent compounds and the best experimental setup in which to do so.
Affiliation(s)
- Isidro Cortés-Ciriano
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
| | - Nicholas C Firth
- Centre for Medical Image Computing, Department of Computer Science, UCL, London WC1E 6BT, United Kingdom
- Evariste Technologies Ltd, Goring on Thames RG8 9AL, United Kingdom
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
| | - Oliver Watson
- Evariste Technologies Ltd, Goring on Thames RG8 9AL, United Kingdom
| |
|
7
|
Oglic D, Oatley SA, Macdonald SJF, Mcinally T, Garnett R, Hirst JD, Gärtner T. Active Search for Computer-aided Drug Design. Mol Inform 2018; 37. [PMID: 29388736 DOI: 10.1002/minf.201700130] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 11/01/2017] [Accepted: 01/03/2018] [Indexed: 01/08/2023]
Abstract
We consider lead discovery as active search in a space of labelled graphs. In particular, we extend our recent data-driven adaptive Markov chain approach, and evaluate it on a focused drug design problem, where we search for an antagonist of an αv integrin, a target protein that belongs to a group of Arg-Gly-Asp integrin receptors. This group of integrin receptors is thought to play a key role in idiopathic pulmonary fibrosis, a chronic lung disease of significant pharmaceutical interest. As an in silico proxy of the binding affinity, we use a molecular docking score to an experimentally determined αvβ6 protein structure. The search is driven by a probabilistic surrogate of the activity of all molecules from that space. As the process evolves and the algorithm observes the activity scores of the previously designed molecules, the hypothesis of the activity is refined. The algorithm is guaranteed to converge in probability to the best hypothesis from an a priori specified hypothesis space. In our empirical evaluations, the approach achieves a large structural variety of designed molecular structures for which the docking score is better than the desired threshold. Some novel molecules, suggested to be active by the surrogate model, provoke significant interest from the perspective of medicinal chemistry and warrant prioritization for synthesis. Moreover, the approach discovered 19 of the 24 compounds known to be active from previous biological assays.
Affiliation(s)
- Dino Oglic
- School of Computer Science, University of Nottingham, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, United Kingdom
- Institut für Informatik III, Universität Bonn, Römerstraße 164, 53117, Bonn, Germany
| | - Steven A Oatley
- School of Chemistry, University of Nottingham, University Park, Nottingham, NG7 2RD, United Kingdom
| | - Simon J F Macdonald
- GlaxoSmithKline, Medicines Research Centre, Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2NY, United Kingdom
| | - Thomas Mcinally
- School of Chemistry, University of Nottingham, University Park, Nottingham, NG7 2RD, United Kingdom
| | - Roman Garnett
- Department of Computer Science and Engineering, Washington University in St. Louis, One Brookings Drive CB 1045, St. Louis, MO, 63130, USA
| | - Jonathan D Hirst
- School of Chemistry, University of Nottingham, University Park, Nottingham, NG7 2RD, United Kingdom
| | - Thomas Gärtner
- School of Computer Science, University of Nottingham, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, United Kingdom
| |
|
8
|
Lang T, Flachsenberg F, von Luxburg U, Rarey M. Feasibility of Active Machine Learning for Multiclass Compound Classification. J Chem Inf Model 2016; 56:12-20. [DOI: 10.1021/acs.jcim.5b00332] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Indexed: 11/29/2022]
Affiliation(s)
| | | | - Ulrike von Luxburg
- Department of Computer Science, University of Tübingen, 72076 Tübingen, Germany
| | | |
|