1
Snyder SH, Vignaux PA, Ozalp MK, Gerlach J, Puhl AC, Lane TR, Corbett J, Urbina F, Ekins S. The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications. Commun Chem 2024; 7:134. [PMID: 38866916 PMCID: PMC11169557 DOI: 10.1038/s42004-024-01220-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2023] [Accepted: 06/04/2024] [Indexed: 06/14/2024]
Abstract
Recent advances in machine learning (ML) have led to newer model architectures, including transformers (large language models, LLMs), which show state-of-the-art results in text generation and image analysis, and few-shot learning (FSLC) models, which offer predictive power with extremely small datasets. These new architectures may offer promise, yet the 'no-free-lunch' theorem suggests that no single algorithm can outperform all others at every possible task. Here, we explore the capabilities of classical (SVR), FSLC, and transformer (MolBART) models over a range of dataset tasks and show a 'goldilocks zone' for each model type, in which dataset size and feature distribution (i.e. dataset "diversity") determine the optimal algorithm strategy. When datasets are small (<50 molecules), FSLC models tend to outperform both classical ML and transformers. When datasets are small-to-medium sized (50-240 molecules) and diverse, transformers outperform both classical models and few-shot learning. Finally, when datasets are larger and of sufficient size, classical models perform best, suggesting that the optimal model likely depends on the available dataset, its size and its diversity. These findings may help to answer the perennial question of which ML algorithm to use when faced with a new dataset.
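The size and diversity regimes reported in this abstract can be read as a simple selection rule. The following is a minimal Python sketch of that reading, not the authors' code: the function name `suggest_model`, the boolean `is_diverse` flag, and the interpretation of "larger" as more than 240 molecules are all assumptions made for illustration. The abstract gives no quantitative definition of diversity and makes no claim about non-diverse sets of 50-240 molecules, so the sketch returns `None` there.

```python
from typing import Optional

# Thresholds quoted in the abstract; everything else here is illustrative.
SMALL_MAX = 50     # "small" datasets: fewer than 50 molecules
MEDIUM_MAX = 240   # "small-to-medium" datasets: 50-240 molecules


def suggest_model(n_molecules: int, is_diverse: bool) -> Optional[str]:
    """Return the model family the abstract reports as strongest,
    or None where the abstract makes no claim for that regime."""
    if n_molecules < 0:
        raise ValueError("n_molecules must be non-negative")
    if n_molecules < SMALL_MAX:
        return "few-shot learning (FSLC)"
    if n_molecules <= MEDIUM_MAX:
        # A transformer advantage is only reported for diverse datasets.
        return "transformer (MolBART)" if is_diverse else None
    # Assumption: "larger" is read as more than 240 molecules.
    return "classical ML (SVR)"
```

For example, `suggest_model(100, True)` falls in the transformer regime, while `suggest_model(100, False)` returns `None` because the abstract is silent on that case.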
Affiliation(s)
- Scott H Snyder
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
- Patricia A Vignaux
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
- Mustafa Kemal Ozalp
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
- Jacob Gerlach
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
- Ana C Puhl
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
- Thomas R Lane
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
- John Corbett
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
- Fabio Urbina
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA.
- Sean Ekins
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA.
2
Alves VM, Korn D, Pervitsky V, Thieme A, Capuzzi SJ, Baker N, Chirkova R, Ekins S, Muratov EN, Hickey A, Tropsha A. Knowledge-based approaches to drug discovery for rare diseases. Drug Discov Today 2022; 27:490-502. [PMID: 34718207 PMCID: PMC9124594 DOI: 10.1016/j.drudis.2021.10.014] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 06/28/2021] [Revised: 09/13/2021] [Accepted: 10/21/2021] [Indexed: 02/03/2023]
Abstract
The conventional drug discovery pipeline has proven to be unsustainable for rare diseases. Herein, we discuss recent advances in biomedical knowledge mining applied to discovering therapeutics for rare diseases. We summarize current chemogenomics data of relevance to rare diseases and provide a perspective on the effectiveness of machine learning (ML) and biomedical knowledge graph mining in rare disease drug discovery. We illustrate the power of these methodologies using a chordoma case study. We expect that a broader application of knowledge graph mining and artificial intelligence (AI) approaches will expedite the discovery of viable drug candidates against both rare and common diseases.
Affiliation(s)
- Vinicius M Alves
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599, USA; UNC Catalyst for Rare Diseases, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599, USA
- Daniel Korn
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599, USA
- Vera Pervitsky
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599, USA
- Andrew Thieme
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599, USA
- Stephen J Capuzzi
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599, USA
- Nancy Baker
- ParlezChem, 123 W Union Street, Hillsborough, NC 27278, USA
- Rada Chirkova
- Department of Computer Science, North Carolina State University, Raleigh, NC 27695-8206, USA
- Sean Ekins
- Collaborations Pharmaceuticals Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
- Eugene N Muratov
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599, USA; Department of Pharmaceutical Sciences, Federal University of Paraiba, Joao Pessoa, PB, Brazil
- Anthony Hickey
- UNC Catalyst for Rare Diseases, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599, USA.
- Alexander Tropsha
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599, USA.
3
GU C, GAO Y, HAN R, GUO M, LIU H, GAO J, LIU Y, LI B, SUN L, BU R, LIU Y, HAO J, MENG Y, AN M, CAO X, SU C, LI G. Metabolomics of clinical samples reveal the treatment mechanism of lanthanum hydroxide on vascular calcification in chronic kidney disease. PROCEEDINGS OF THE JAPAN ACADEMY. SERIES B, PHYSICAL AND BIOLOGICAL SCIENCES 2022; 98:361-377. [PMID: 35908957 PMCID: PMC9363596 DOI: 10.2183/pjab.98.019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2022] [Accepted: 05/17/2022] [Indexed: 06/15/2023]
Abstract
Previous studies showed that lanthanum hydroxide (LH) has a therapeutic effect on chronic kidney disease (CKD) and vascular calcification, which suggests that it might have clinical value. However, the target and mechanism of action of LH are unclear. Metabolomics of clinical samples can be used to predict the mechanism of drug action. In this study, metabolomic profiles in patients with end-stage renal disease (ESRD) were used to screen related signaling pathways, and we verified the influence of LH on the ROS-PI3K-AKT-mTOR-HIF-1α signaling pathway by western blotting and RT-qPCR in vivo and in vitro. We found that ROS and SLC16A10 genes were activated in patients with ESRD. The SLC16A10 gene is associated with six significant metabolites (L-cysteine, L-cystine, L-isoleucine, L-arginine, L-aspartic acid, and L-phenylalanine) and the PI3K-AKT signaling pathway. The results showed that LH inhibits the ESRD process and its cardiovascular complications by inhibiting the ROS-PI3K-AKT-mTOR-HIF-1α signaling pathway. Collectively, LH may be a candidate phosphorus binder for the treatment of vascular calcification in ESRD.
Affiliation(s)
- Chao GU
- Department of Pharmacology, College of Pharmacy, Inner Mongolian Medical University, Hohhot, Inner Mongolia Autonomous Region, China
- Yuan GAO
- Department of Pharmacology, College of Pharmacy, Inner Mongolian Medical University, Hohhot, Inner Mongolia Autonomous Region, China
- Ruilan HAN
- Department of Pharmacology, College of Pharmacy, Inner Mongolian Medical University, Hohhot, Inner Mongolia Autonomous Region, China
- Min GUO
- Department of Clinical Pharmacy, Ordos Central Hospital, Ordos City, Inner Mongolia Autonomous Region, China
- Hong LIU
- Department of Pharmacology, College of Pharmacy, Inner Mongolian Medical University, Hohhot, Inner Mongolia Autonomous Region, China
- Jie GAO
- Department of Pharmacology, College of Pharmacy, Inner Mongolian Medical University, Hohhot, Inner Mongolia Autonomous Region, China
- Yang LIU
- Department of Pharmacology, College of Pharmacy, Inner Mongolian Medical University, Hohhot, Inner Mongolia Autonomous Region, China
- Bing LI
- Department of Pharmacology, College of Pharmacy, Inner Mongolian Medical University, Hohhot, Inner Mongolia Autonomous Region, China
- Lijun SUN
- Department of Pharmacology, College of Pharmacy, Inner Mongolian Medical University, Hohhot, Inner Mongolia Autonomous Region, China
- Ren BU
- Department of Pharmacology, College of Pharmacy, Inner Mongolian Medical University, Hohhot, Inner Mongolia Autonomous Region, China
- Yang LIU
- Department of Clinical Pharmacy, Ordos Central Hospital, Ordos City, Inner Mongolia Autonomous Region, China
- Jian HAO
- Renal Division, The First Affiliated Hospital of Inner Mongolia Medical University, Hohhot, Inner Mongolia Autonomous Region, China
- Yan MENG
- Renal Division, The First Affiliated Hospital of Inner Mongolia Medical University, Hohhot, Inner Mongolia Autonomous Region, China
- Ming AN
- Department of Pharmaceutical analysis, School of Pharmacy, Baotou Medical College, Baotou, Inner Mongolia Autonomous Region, China
- Xiaodong CAO
- Department of Pharmacology, GLP Center, Inner Mongolian Medical University, Hohhot, Inner Mongolia Autonomous Region, China
- Changhai SU
- Department of Clinical Pharmacy, Ordos Central Hospital, Ordos City, Inner Mongolia Autonomous Region, China
- Gang LI
- Department of Pharmacology, College of Pharmacy, Inner Mongolian Medical University, Hohhot, Inner Mongolia Autonomous Region, China
- Mongolian Medicine Collaborative Innovation Center, Inner Mongolia Medical University, Hohhot, Inner Mongolia Autonomous Region, China
4
Haldar R, Narayanan SJ. A novel ensemble based recommendation approach using network based analysis for identification of effective drugs for Tuberculosis. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:873-891. [PMID: 34903017 DOI: 10.3934/mbe.2022040] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/14/2023]
Abstract
Tuberculosis (TB) is a fatal infectious disease that has affected millions of people worldwide for many decades, and with mutating drug-resistant strains it now poses even bigger challenges for treating patients. Computational techniques might play a crucial role in rapidly developing new or modified anti-tuberculosis drugs that can tackle these mutating strains of TB. This research work applied a computational approach to generate a unique recommendation list of possible TB drugs as alternatives to a popular drug, EMB, by first securing an initial list of drugs from a popular online database, PubChem, and thereafter applying an ensemble of ranking mechanisms. As a novelty, both the pharmacokinetic properties and some network-based attributes of the chemical structure of the drugs are considered for generating separate recommendation lists. The work also provides customized modifications of a popular, traditional ensemble ranking technique to cater to the specific dataset and requirements. The final recommendation list provides established chemical structures along with their ranks, which could be used as alternatives to EMB. It is believed that incorporating both pharmacokinetic and network-based properties in the ensemble ranking process added to the effectiveness and relevance of the final recommendation.
Affiliation(s)
- Rishin Haldar
- School of Computer Science and Engineering, Vellore Institute of Technology (VIT), Vellore - 632014, Tamil Nadu, India
- Swathi Jamjala Narayanan
- School of Computer Science and Engineering, Vellore Institute of Technology (VIT), Vellore - 632014, Tamil Nadu, India
5
Coley CW, Eyke NS, Jensen KF. Autonome Entdeckung in den chemischen Wissenschaften, Teil II: Ausblick. Angew Chem Int Ed Engl 2020. [DOI: 10.1002/ange.201909989] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/25/2022]
Affiliation(s)
- Connor W. Coley
- Department of Chemical Engineering Massachusetts Institute of Technology Cambridge MA 02139 USA
- Natalie S. Eyke
- Department of Chemical Engineering Massachusetts Institute of Technology Cambridge MA 02139 USA
- Klavs F. Jensen
- Department of Chemical Engineering Massachusetts Institute of Technology Cambridge MA 02139 USA
6
Lane TR, Foil DH, Minerali E, Urbina F, Zorn KM, Ekins S. Bioactivity Comparison across Multiple Machine Learning Algorithms Using over 5000 Datasets for Drug Discovery. Mol Pharm 2020; 18:403-415. [PMID: 33325717 DOI: 10.1021/acs.molpharmaceut.0c01013] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Indexed: 12/14/2022]
Abstract
Machine learning methods are attracting considerable attention from the pharmaceutical industry for use in drug discovery and applications beyond. In recent studies, we and others have applied multiple machine learning algorithms and modeling metrics and, in some cases, compared molecular descriptors to build models for individual targets or properties on a relatively small scale. Several research groups have used large numbers of datasets from public databases such as ChEMBL in order to evaluate machine learning methods of interest to them. The largest of these types of studies used on the order of 1400 datasets. We have now extracted well over 5000 datasets from ChEMBL for use with the ECFP6 fingerprint and to compare our proprietary software Assay Central with random forest, k-nearest neighbors, support vector classification, naïve Bayesian, AdaBoosted decision trees, and deep neural networks (three layers). Model performance was assessed using an array of fivefold cross-validation metrics including area-under-the-curve, F1 score, Cohen's kappa, and Matthews correlation coefficient. Based on ranked normalized scores for the metrics or datasets, all methods appeared comparable, while the distance from the top indicated that Assay Central and support vector classification were comparable. Unlike prior studies which have placed considerable emphasis on deep neural networks (deep learning), no advantage was seen in this case. If anything, Assay Central may have been at a slight advantage as the activity cutoff for each of the over 5000 datasets representing over 570,000 unique compounds was based on Assay Central performance, although support vector classification seems to be a strong competitor. We also applied Assay Central to perform prospective predictions for the toxicity targets PXR and hERG to further validate these models. This work appears to be the largest scale comparison of these machine learning algorithms to date. Future studies will likely evaluate additional databases, descriptors, and machine learning algorithms and further refine the methods for evaluating and comparing such models.
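Three of the classification metrics named in this abstract (F1 score, Cohen's kappa, and Matthews correlation coefficient) can be computed directly from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function names and the zero-denominator fallbacks are illustrative choices, not taken from the paper or from Assay Central. Area-under-the-curve is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def f1_score(tp: int, fp: int, fn: int, tn: int) -> float:
    """Harmonic mean of precision and recall (tn is unused by definition)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Observed agreement corrected for agreement expected by chance."""
    n = tp + fp + fn + tn
    p_obs = (tp + tn) / n
    # Chance agreement from the marginal totals of predictions and labels.
    p_exp = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp) if p_exp != 1 else 0.0


def matthews_corrcoef(tp: int, fp: int, fn: int, tn: int) -> float:
    """Correlation between predicted and true binary labels."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for one cross-validation fold.
counts = dict(tp=40, fp=10, fn=5, tn=45)
scores = {
    "f1": f1_score(**counts),
    "kappa": cohen_kappa(**counts),
    "mcc": matthews_corrcoef(**counts),
}
```

In a fivefold cross-validation setting, one would typically compute these per fold and average them, though the abstract does not say how the folds were aggregated.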
Affiliation(s)
- Thomas R Lane
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Daniel H Foil
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Eni Minerali
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Fabio Urbina
- Department of Cell Biology and Physiology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599-7545, United States
- Kimberley M Zorn
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Sean Ekins
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
7
Coley CW, Eyke NS, Jensen KF. Autonomous Discovery in the Chemical Sciences Part II: Outlook. Angew Chem Int Ed Engl 2020; 59:23414-23436. [PMID: 31553509 DOI: 10.1002/anie.201909989] [Citation(s) in RCA: 104] [Impact Index Per Article: 26.0] [Received: 08/06/2019] [Indexed: 01/19/2023]
Abstract
This two-part Review examines how automation has contributed to different aspects of discovery in the chemical sciences. In this second part, we reflect on a selection of exemplary studies. It is increasingly important to articulate what the role of automation and computation has been in the scientific process and how that has or has not accelerated discovery. One can argue that even the best automated systems have yet to "discover" despite being incredibly useful as laboratory assistants. We must carefully consider how they have been and can be applied to future problems of chemical discovery in order to effectively design and interact with future autonomous platforms. The majority of this Review defines a large set of open research directions, including improving our ability to work with complex data, build empirical models, automate both physical and computational experiments for validation, select experiments, and evaluate whether we are making progress towards the ultimate goal of autonomous discovery. Addressing these practical and methodological challenges will greatly advance the extent to which autonomous systems can make meaningful discoveries.
Affiliation(s)
- Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Natalie S Eyke
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Klavs F Jensen
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
8
Raschka S. Automated discovery of GPCR bioactive ligands. Curr Opin Struct Biol 2019; 55:17-24. [PMID: 30909105 DOI: 10.1016/j.sbi.2019.02.011] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 12/15/2018] [Accepted: 02/19/2019] [Indexed: 12/22/2022]
Abstract
While G-protein-coupled receptors (GPCRs) constitute the largest class of membrane proteins, structures and endogenous ligands of a large portion of GPCRs remain unknown. Because of the involvement of GPCRs in various signaling pathways and physiological roles, the identification of endogenous ligands as well as designing novel drugs is of high interest to the research and medical communities. Along with highlighting the recent advances in structure-based ligand discovery, including docking and molecular dynamics, this article focuses on the latest advances for automating the discovery of bioactive ligands using machine learning. Machine learning is centered around the development and applications of algorithms that can learn from data automatically. Such an approach offers immense opportunities for bioactivity prediction as well as quantitative structure-activity relationship studies. This review describes the most recent and successful applications of machine learning for bioactive ligand discovery, concluding with an outlook on deep learning methods that are capable of automatically extracting salient information from structural data as a promising future direction for rapid and efficient bioactive ligand discovery.
Affiliation(s)
- Sebastian Raschka
- Department of Statistics, University of Wisconsin-Madison, 1300 Medical Sciences Center, Madison, WI 53706, USA.
9
Sanchez DA, Martinez LR. Underscoring interstrain variability and the impact of growth conditions on associated antimicrobial susceptibilities in preclinical testing of novel antimicrobial drugs. Crit Rev Microbiol 2019; 45:51-64. [PMID: 30522365 PMCID: PMC6905375 DOI: 10.1080/1040841x.2018.1538934] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 05/21/2018] [Revised: 08/22/2018] [Accepted: 10/12/2018] [Indexed: 01/12/2023]
Abstract
In the era of multidrug resistant (MDR) organisms, reliable efficacy testing of novel antimicrobials during developmental stages is of paramount concern prior to introduction in clinical trials. Unfortunately, interstrain variability is often underappreciated when appraising the efficacy of innovative antimicrobials, as preclinical testing of a limited number of standardized strains in unvarying conditions does not account for the vastness and potential for hyperdiversity among and within microbial populations. In this review, the importance of accounting for interstrain variability's potential to impact the breadth of novel drug efficacy evaluation in the early stages of drug development will be discussed. Additionally, testing under varying microenvironmental conditions that may influence drug efficacy will be discussed. Biofilm growth, the influence of polymicrobial growth, mechanisms of antimicrobial resistance, pH, anaerobic conditions, and other virulence factors are some of the critical issues that require more attention and standardization during preclinical drug efficacy evaluation. Furthermore, potential solutions for addressing this issue in preclinical antimicrobial development are proposed via centralization of microbial characterization and drug target databases, testing of a large number of clinical strains, inclusion of mutator strains in testing and the use of growth parameter mathematical models for testing.
Affiliation(s)
- David A. Sanchez
- Howard University College of Medicine, Washington, DC, USA
- Brigham and Women’s Hospital, Boston, MA, USA
- Luis R. Martinez
- Department of Biological Sciences, The Border Biomedical Research Center, University of Texas at El Paso, TX, USA
10
Korotcov A, Tkachenko V, Russo DP, Ekins S. Comparison of Deep Learning With Multiple Machine Learning Methods and Metrics Using Diverse Drug Discovery Data Sets. Mol Pharm 2017; 14:4462-4475. [PMID: 29096442 PMCID: PMC5741413 DOI: 10.1021/acs.molpharmaceut.7b00578] [Citation(s) in RCA: 184] [Impact Index Per Article: 26.3] [Indexed: 11/30/2022]
Abstract
Machine learning methods have been applied to many data sets in pharmaceutical research for several decades. The relative ease and availability of fingerprint type molecular descriptors paired with Bayesian methods resulted in the widespread use of this approach for a diverse array of end points relevant to drug discovery. Deep learning is the latest machine learning algorithm attracting attention for many pharmaceutical applications from docking to virtual screening. Deep learning is based on an artificial neural network with multiple hidden layers and has found considerable traction for many artificial intelligence applications. We have previously suggested the need for a comparison of different machine learning methods with deep learning across an array of varying data sets that is applicable to pharmaceutical research. End points relevant to pharmaceutical research include absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) properties, as well as activity against pathogens and drug discovery data sets. In this study, we have used data sets for solubility, probe-likeness, hERG, KCNQ1, bubonic plague, Chagas, tuberculosis, and malaria to compare different machine learning methods using FCFP6 fingerprints. These data sets represent whole cell screens, individual proteins, physicochemical properties as well as a data set with a complex end point. Our aim was to assess whether deep learning offered any improvement in testing when assessed using an array of metrics including AUC, F1 score, Cohen's kappa, Matthews correlation coefficient and others. Based on ranked normalized scores for the metrics or data sets, Deep Neural Networks (DNN) ranked higher than SVM, which in turn was ranked higher than all the other machine learning methods. Visualizing these properties for training and test sets using radar type plots indicates when models are inferior or perhaps overtrained. These results also suggest the need for assessing deep learning further using multiple metrics with much larger scale comparisons, prospective testing as well as assessment of different fingerprints and DNN architectures beyond those used.
Affiliation(s)
- Alexandru Korotcov
- Science Data Software, LLC, 14914 Bradwill Court, Rockville, MD 20850, USA
- Valery Tkachenko
- Science Data Software, LLC, 14914 Bradwill Court, Rockville, MD 20850, USA
- Daniel P Russo
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
- The Rutgers Center for Computational and Integrative Biology, Camden, NJ, 08102, USA
- Sean Ekins
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
11
Capuzzi SJ, Kim ISJ, Lam WI, Thornton TE, Muratov EN, Pozefsky D, Tropsha A. Chembench: A Publicly Accessible, Integrated Cheminformatics Portal. J Chem Inf Model 2017; 57:105-108. [PMID: 28045544 DOI: 10.1021/acs.jcim.6b00462] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Indexed: 12/14/2022]
Abstract
The enormous increase in the amount of publicly available chemical genomics data and the growing emphasis on data sharing and open science mandate that cheminformaticians also make their models publicly available for broad use by the scientific community. Chembench is one of the first publicly accessible, integrated cheminformatics Web portals. It has been extensively used by researchers from different fields for curation, visualization, analysis, and modeling of chemogenomics data. Since its launch in 2008, Chembench has been accessed more than 1 million times by more than 5000 users from a total of 98 countries. We report on the recent updates and improvements that increase the simplicity of use, computational efficiency, accuracy, and accessibility of a broad range of tools and services for computer-assisted drug design and computational toxicology available on Chembench. Chembench remains freely accessible at https://chembench.mml.unc.edu.
Affiliation(s)
- Stephen J Capuzzi
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, and Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina 27599, United States
- Ian Sang-June Kim
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, and Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina 27599, United States
- Wai In Lam
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, and Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina 27599, United States
- Thomas E Thornton
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, and Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina 27599, United States
- Eugene N Muratov
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, and Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina 27599, United States
- Diane Pozefsky
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, and Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina 27599, United States
- Alexander Tropsha
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, and Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina 27599, United States
12
Collaborative drug discovery for More Medicines for Tuberculosis (MM4TB). Drug Discov Today 2016; 22:555-565. [PMID: 27884746 DOI: 10.1016/j.drudis.2016.10.009] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 05/13/2016] [Revised: 10/11/2016] [Accepted: 10/21/2016] [Indexed: 01/30/2023]
Abstract
Neglected disease drug discovery is generally poorly funded compared with major diseases and hence there is an increasing focus on collaboration and precompetitive efforts such as public-private partnerships (PPPs). The More Medicines for Tuberculosis (MM4TB) project is one such collaboration funded by the EU with the goal of discovering new drugs for tuberculosis. Collaborative Drug Discovery has provided a commercial web-based platform called CDD Vault which is a hosted collaborative solution for securely sharing diverse chemistry and biology data. Using CDD Vault alongside other commercial and free cheminformatics tools has enabled support of this and other large collaborative projects, aiding drug discovery efforts and fostering collaboration. We will describe CDD's efforts in assisting with the MM4TB project.
13
Ekins S. The Next Era: Deep Learning in Pharmaceutical Research. Pharm Res 2016; 33:2594-603. [PMID: 27599991 DOI: 10.1007/s11095-016-2029-7] [Citation(s) in RCA: 99] [Impact Index Per Article: 12.4] [Received: 07/10/2016] [Accepted: 08/23/2016] [Indexed: 01/22/2023]
Abstract
Over the past decade we have witnessed the increasing sophistication of machine learning algorithms applied in daily use from internet searches, voice recognition, social network software to machine vision software in cameras, phones, robots and self-driving cars. Pharmaceutical research has also seen its fair share of machine learning developments. For example, applying such methods to mine the growing datasets that are created in drug discovery not only enables us to learn from the past but to predict a molecule's properties and behavior in future. The latest machine learning algorithm garnering significant attention is deep learning, which is an artificial neural network with multiple hidden layers. Publications over the last 3 years suggest that this algorithm may have advantages over previous machine learning methods and offer a slight but discernable edge in predictive performance. The time has come for a balanced review of this technique but also to apply machine learning methods such as deep learning across a wider array of endpoints relevant to pharmaceutical research for which the datasets are growing such as physicochemical property prediction, formulation prediction, absorption, distribution, metabolism, excretion and toxicity (ADME/Tox), target prediction and skin permeation, etc. We also show that there are many potential applications of deep learning beyond cheminformatics. It will be important to perform prospective testing (which has been carried out rarely to date) in order to convince skeptics that there will be benefits from investing in this technique.
Affiliation(s)
- Sean Ekins
- Collaborations Pharmaceuticals, Inc, 5616 Hilltop Needmore Road, Fuquay-Varina, North Carolina, 27526, USA; Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, California, 94010, USA.
|
14
|
Ai N, Fan X, Ekins S. In silico methods for predicting drug-drug interactions with cytochrome P-450s, transporters and beyond. Adv Drug Deliv Rev 2015; 86:46-60. [PMID: 25796619 DOI: 10.1016/j.addr.2015.03.006] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 09/19/2014] [Revised: 01/05/2015] [Accepted: 03/11/2015] [Indexed: 12/13/2022]
Abstract
Drug-drug interactions (DDIs) are associated with severe adverse effects that may lead to the patient requiring alternative therapeutics and could ultimately lead to drug withdrawal from the market if they are severe. To prevent the occurrence of DDI in the clinic, experimental systems to evaluate drug interaction have been integrated into the various stages of the drug discovery and development process. A large body of knowledge about DDI has also accumulated through these studies and pharmacovigilance systems. Much of this work to date has focused on the drug metabolizing enzymes such as cytochrome P-450s as well as drug transporters, ion channels and occasionally other proteins. This combined knowledge provides a foundation for a hypothesis-driven in silico approach, using either cheminformatics or physiologically based pharmacokinetics (PK) modeling methods to assess DDI potential. Here we review recent advances in these approaches with emphasis on hypothesis-driven mechanistic models for important protein targets involved in PK-based DDI. Recent efforts with other informatics approaches to detect DDI are highlighted. Besides DDI, we also briefly introduce drug interactions with other substances, such as Traditional Chinese Medicines, to illustrate how in silico modeling can be useful in this domain. We also summarize valuable data sources and web-based tools that are available for DDI prediction. We finally explore the challenges we see faced by in silico approaches for predicting DDI and propose future directions to make these computational models more reliable, accurate, and publicly accessible.
Affiliation(s)
- Ni Ai
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, 866 Yuhangtang Road, Hangzhou, Zhejiang 310058, PR China
- Xiaohui Fan
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, 866 Yuhangtang Road, Hangzhou, Zhejiang 310058, PR China.
- Sean Ekins
- Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA.
|
15
|
Clark AM, Dole K, Coulon-Spektor A, McNutt A, Grass G, Freundlich JS, Reynolds RC, Ekins S. Open Source Bayesian Models. 1. Application to ADME/Tox and Drug Discovery Datasets. J Chem Inf Model 2015; 55:1231-45. [PMID: 25994950 PMCID: PMC4478615 DOI: 10.1021/acs.jcim.5b00143] [Citation(s) in RCA: 84] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
On the order of hundreds of absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) models have been described in the literature in the past decade which are more often than not inaccessible to anyone but their authors. Public accessibility is also an issue with computational models for bioactivity, and the ability to share such models still remains a major challenge limiting drug discovery. We describe the creation of a reference implementation of a Bayesian model-building software module, which we have released as an open source component that is now included in the Chemistry Development Kit (CDK) project, as well as implemented in the CDD Vault and in several mobile apps. We use this implementation to build an array of Bayesian models for ADME/Tox, in vitro and in vivo bioactivity, and other physicochemical properties. We show that these models possess cross-validation receiver operator curve values comparable to those generated previously in prior publications using alternative tools. We have now described how the implementation of Bayesian models with FCFP6 descriptors generated in the CDD Vault enables the rapid production of robust machine learning models from public data or the user’s own datasets. The current study sets the stage for generating models in proprietary software (such as CDD) and exporting these models in a format that could be run in open source software using CDK components. This work also demonstrates that we can enable biocomputation across distributed private or public datasets to enhance drug discovery.
Affiliation(s)
- Alex M Clark
- †Molecular Materials Informatics, Inc., 1900 St. Jacques No. 302, Montreal H3J 2S1, Quebec, Canada
- Krishna Dole
- ‡Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, California 94010, United States
- Anna Coulon-Spektor
- ‡Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, California 94010, United States
- Andrew McNutt
- ‡Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, California 94010, United States
- George Grass
- §G2 Research, Inc., P.O. Box 1242, Tahoe City, California 96145, United States
- Robert C Reynolds
- #Department of Chemistry, College of Arts and Sciences, University of Alabama at Birmingham, 1530 Third Avenue South, Birmingham, Alabama 35294-1240, United States
- Sean Ekins
- ‡Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, California 94010, United States; ∇Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, North Carolina 27526, United States
|
16
|
Clark AM, Ekins S. Open Source Bayesian Models. 2. Mining a "Big Dataset" To Create and Validate Models with ChEMBL. J Chem Inf Model 2015; 55:1246-60. [PMID: 25995041 DOI: 10.1021/acs.jcim.5b00144] [Citation(s) in RCA: 67] [Impact Index Per Article: 7.4] [Indexed: 02/05/2023]
Abstract
In an associated paper, we have described a reference implementation of Laplacian-corrected naïve Bayesian model building using extended connectivity (ECFP)- and molecular function class fingerprints of maximum diameter 6 (FCFP)-type fingerprints. As a follow-up, we have now undertaken a large-scale validation study in order to ensure that the technique generalizes to a broad variety of drug discovery datasets. To achieve this, we have used the ChEMBL (version 20) database and split it into more than 2000 separate datasets, each of which consists of compounds and measurements with the same target and activity measurement. In order to test these datasets with the two-state Bayesian classification, we developed an automated algorithm for detecting a suitable threshold for active/inactive designation, which we applied to all collections. With these datasets, we were able to establish that our Bayesian model implementation is effective for the large majority of cases, and we were able to quantify the impact of fingerprint folding on the receiver operator curve cross-validation metrics. We were also able to study the impact that the choice of training/testing set partitioning has on the resulting recall rates. The datasets have been made publicly available to be downloaded, along with the corresponding model data files, which can be used in conjunction with the CDK and several mobile apps. We have also explored some novel visualization methods which leverage the structural origins of the ECFP/FCFP fingerprints to attribute regions of a molecule responsible for positive and negative contributions to activity. The ability to score molecules across thousands of relevant datasets across organisms also may help to access desirable and undesirable off-target effects as well as suggest potential targets for compounds derived from phenotypic screens.
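As an illustration of the technique named in this abstract, the following is a minimal sketch of a Laplacian-corrected naive Bayesian scorer over folded fingerprint features. It is not the authors' CDK implementation. The per-feature weight log((A_F + 1) / (N_F * P + 1)) is one commonly cited form of the Laplacian correction, modulo folding is an assumed folding scheme, and the function names (`fold`, `train`, `score`) and the integer features in the toy data are hypothetical.

```python
import math
from collections import Counter


def fold(features, n_bits):
    """Fold raw integer feature hashes into n_bits buckets.

    Assumption: simple modulo folding; a real implementation may use a bit mask.
    """
    return {f % n_bits for f in features}


def train(fingerprints, labels):
    """Return a dict mapping each feature to its Laplacian-corrected log weight.

    fingerprints: list of sets of integer features, one set per molecule
    labels: list of booleans (True = active), same length as fingerprints
    """
    n_total = len(fingerprints)
    n_active = sum(1 for is_active in labels if is_active)
    base_rate = n_active / n_total  # P: overall fraction of actives

    seen = Counter()         # N_F: molecules containing feature F
    seen_active = Counter()  # A_F: active molecules containing feature F
    for fp, is_active in zip(fingerprints, labels):
        for f in fp:
            seen[f] += 1
            if is_active:
                seen_active[f] += 1

    # The +1 terms pull the estimate for rarely seen features toward the base rate.
    return {
        f: math.log((seen_active[f] + 1) / (seen[f] * base_rate + 1))
        for f in seen
    }


def score(model, fingerprint):
    """Sum the weights of the features present; unseen features contribute 0."""
    return sum(model.get(f, 0.0) for f in fingerprint)


# Toy data: features 1-3 occur only in actives, features 4-6 only in inactives.
toy_fps = [{1, 2}, {1, 3}, {4, 5}, {4, 6}]
toy_labels = [True, True, False, False]
toy_model = train(toy_fps, toy_labels)
```

Scores from such a model are unnormalized sums of log weights, so they rank molecules rather than give probabilities. A threshold on the score is still needed to make an active/inactive call.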
Affiliation(s)
- Alex M Clark
- †Molecular Materials Informatics, Inc., 1900 St. Jacques No. 302, Montreal H3J 2S1, Quebec, Canada
- Sean Ekins
- ‡Collaborations Pharmaceuticals, Inc., 5616 Hilltop Needmore Road, Fuquay-Varina, North Carolina 27526, United States; §Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, North Carolina 27526, United States; ∥Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, California 94010, United States
|
17
|
Ekins S, Litterman NK, Arnold RJG, Burgess RW, Freundlich JS, Gray SJ, Higgins JJ, Langley B, Willis DE, Notterpek L, Pleasure D, Sereda MW, Moore A. A brief review of recent Charcot-Marie-Tooth research and priorities. F1000Res 2015; 4:53. [PMID: 25901280 PMCID: PMC4392824 DOI: 10.12688/f1000research.6160.1] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Accepted: 02/24/2015] [Indexed: 12/14/2022] Open
Abstract
This brief review of current research progress on Charcot-Marie-Tooth (CMT) disease is a summary of discussions initiated at the Hereditary Neuropathy Foundation (HNF) scientific advisory board meeting on November 7, 2014. It covers recent published and unpublished in vitro and in vivo research. We discuss recent promising preclinical work for CMT1A, the development of new biomarkers, the characterization of different animal models, and the analysis of the frequency of gene mutations in patients with CMT. We also describe how progress in related fields may benefit CMT therapeutic development, including the potential of gene therapy and stem cell research. We also discuss the potential to assess and improve the quality of life of CMT patients. This summary of CMT research identifies some of the gaps which may have an impact on upcoming clinical trials. We provide some priorities for CMT research and areas which HNF can support. The goal of this review is to inform the scientific community about ongoing research and to avoid unnecessary overlap, while also highlighting areas ripe for further investigation. The general collaborative approach we have taken may be useful for other rare neurological diseases.
Affiliation(s)
- Sean Ekins
- Hereditary Neuropathy Foundation, New York, NY, 10016, USA; Collaborations in Chemistry, Fuquay Varina, NC, 27526, USA; Collaborative Drug Discovery, Burlingame, CA, 94010, USA
- Renée J G Arnold
- Arnold Consultancy & Technology LLC, New York, NY, 10023, USA; Master of Public Health Program, Mount Sinai School of Medicine, New York, NY, 10029, USA; Quorum Consulting, Inc, San Francisco, CA, 94104, USA
- Robert W Burgess
- The Jackson Laboratory in Bar Harbor, Bar Harbor, ME, 04609, USA
- Joel S Freundlich
- Department of Medicine, Center for Emerging and Reemerging Pathogens, Rutgers University - New Jersey Medical School, Newark, NJ, 07103, USA
- Steven J Gray
- Gene Therapy Center and Dept. of Ophthalmology, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599-7352, USA
- Brett Langley
- Burke-Cornell Medical Research Institute, White Plains, NY, 10605, USA; Department of Neurology and Neuroscience, Weill Medical College of Cornell University, New York, NY, 10065, USA
- Dianna E Willis
- Burke-Cornell Medical Research Institute, White Plains, NY, 10605, USA
- Lucia Notterpek
- Department of Neuroscience, College of Medicine, McKnight Brain Institute, University of Florida, Gainesville, FL, 32611, USA
- David Pleasure
- Institute for Pediatric Regenerative Medicine, University of California Davis, School of Medicine, Sacramento, CA, 95817, USA; Department of Neurology, University of California, Davis, School of Medicine, c/o Shriners Hospital, Sacramento, CA, 95817, USA
- Michael W Sereda
- Department of Neurogenetics, Max Planck Institute (MPI) of Experimental Medicine, Göttingen, 37075, Germany; Department of Clinical Neurophysiology, University Medical Center (UMG), Göttingen, D-37075, Germany
- Allison Moore
- Hereditary Neuropathy Foundation, New York, NY, 10016, USA
|
18
|
Warr WA. App-etite for change. J Comput Aided Mol Des 2014; 29:297-303. [PMID: 25515639 DOI: 10.1007/s10822-014-9824-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2014] [Accepted: 12/10/2014] [Indexed: 10/24/2022]
Affiliation(s)
- Wendy A Warr
- Wendy Warr & Associates, Holmes Chapel, Cheshire, UK
|
19
|
Abstract
Rare disease research has reached a tipping point, with the confluence of scientific and technologic developments that if appropriately harnessed, could lead to key breakthroughs and treatments for this set of devastating disorders. Industry-wide trends have revealed that the traditional drug discovery research and development (R&D) model is no longer viable, and drug companies are evolving their approach. Rather than only pursue blockbuster therapeutics for heterogeneous, common diseases, drug companies have increasingly begun to shift their focus to rare diseases. In academia, advances in genetics analyses and disease mechanisms have allowed scientific understanding to mature, but the lack of funding and translational capability severely limits the rare disease research that leads to clinical trials. Simultaneously, there is a movement towards increased research collaboration, more data sharing, and heightened engagement and active involvement by patients, advocates, and foundations. The growth in networks and social networking tools presents an opportunity to help reach other patients but also find researchers and build collaborations. The growth of collaborative software that can enable researchers to share their data could also enable rare disease patients and foundations to manage their portfolio of funded projects for developing new therapeutics and suggest drug repurposing opportunities. Still there are many thousands of diseases without treatments and with only fragmented research efforts. We will describe some recent progress in several rare diseases used as examples and propose how collaborations could be facilitated. We propose that the development of a center of excellence that integrates and shares informatics resources for rare diseases sponsored by all of the stakeholders would help foster these initiatives.
Affiliation(s)
- Michele Rhee
- National Brain Tumor Society, Newton, MA, 02458, USA
- David C Swinney
- Institute for Rare and Neglected Diseases Drug Discovery (iRND3), Mountain View, CA, 94043, USA
- Sean Ekins
- Collaborative Drug Discovery, Inc., Burlingame, CA, 94010, USA; Collaborations in Chemistry, Fuquay Varina, NC, 27526, USA; Phoenix Nest Inc., Brooklyn, NY, 11215, USA; Hereditary Neuropathy Foundation, New York, NY, 10016, USA; Hannah's Hope Fund, Rexford, NY, 12148, USA
|