1
Gangwal A, Ansari A, Ahmad I, Azad AK, Wan Sulaiman WMA. Current strategies to address data scarcity in artificial intelligence-based drug discovery: A comprehensive review. Comput Biol Med 2024; 179:108734. [PMID: 38964243 DOI: 10.1016/j.compbiomed.2024.108734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Revised: 06/01/2024] [Accepted: 06/08/2024] [Indexed: 07/06/2024]
Abstract
Artificial intelligence (AI) has played a vital role in computer-aided drug design (CADD). This development has been further accelerated by the increasing use of machine learning (ML), mainly deep learning (DL), and by advances in computing hardware and software. As a result, initial doubts about the application of AI in drug discovery have been dispelled, leading to significant benefits in medicinal chemistry. At the same time, it is crucial to recognize that AI is still in its infancy and faces a few limitations that need to be addressed to harness its full potential in drug discovery. Some notable limitations are insufficient, unlabeled, and non-uniform data; the resemblance of some AI-generated molecules to existing molecules; the unavailability of adequate benchmarks; intellectual property rights (IPR)-related hurdles in data sharing; poor understanding of biology; a focus on proxy data and ligands; and the lack of holistic methods to represent input (molecular structures) that would avoid pre-processing of input molecules (feature engineering). The major component of AI infrastructure is input data, as most of the successes of AI-driven efforts to improve drug discovery depend, besides a few other factors, on the quality and quantity of the data used to train and test AI algorithms. Additionally, data-hungry DL approaches may fail to live up to their promise without sufficient data. The current literature suggests a few methods that, to a certain extent, effectively handle low data for better output from AI models in the context of drug discovery. These are transfer learning (TL), active learning (AL), single or one-shot learning (OSL), multi-task learning (MTL), data augmentation (DA), and data synthesis (DS), among others. A different method, which enables sharing of proprietary data on a common platform (without compromising data privacy) to train ML models, is federated learning (FL).
In this review, we compare and discuss these methods, their recent applications, and their limitations in modeling small-molecule data to improve the output of AI methods in drug discovery. The article also summarizes some other novel methods for handling inadequate data.
Affiliation(s)
- Amit Gangwal
  - Department of Natural Product Chemistry, Shri Vile Parle Kelavani Mandal's Institute of Pharmacy, Dhule, 424001, Maharashtra, India.
- Azim Ansari
  - Computer Aided Drug Design Center, Shri Vile Parle Kelavani Mandal's Institute of Pharmacy, Dhule, 424001, Maharashtra, India
- Iqrar Ahmad
  - Department of Pharmaceutical Chemistry, Prof. Ravindra Nikam College of Pharmacy, Gondur, Dhule, 424002, Maharashtra, India.
- Abul Kalam Azad
  - Faculty of Pharmacy, University College of MAIWP International, Batu Caves, 68100, Kuala Lumpur, Malaysia.

2
Sikder R, Zhang H, Gao P, Ye T. Machine learning framework for predicting cytotoxicity and identifying toxicity drivers of disinfection byproducts. JOURNAL OF HAZARDOUS MATERIALS 2024; 469:133989. [PMID: 38461660 DOI: 10.1016/j.jhazmat.2024.133989] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2023] [Revised: 03/06/2024] [Accepted: 03/06/2024] [Indexed: 03/12/2024]
Abstract
Drinking water disinfection can result in the formation of disinfection byproducts (DBPs; more than 700 have been identified to date), many of which are reportedly cytotoxic, genotoxic, or developmentally toxic. Analyzing the toxicity levels of these contaminants experimentally is challenging; however, a predictive model could rapidly and effectively assess their toxicity. In this study, machine learning models were developed to predict DBP cytotoxicity based on their chemical information and exposure experiments. The Random Forest model achieved the best performance (coefficient of determination of 0.62 and root mean square error of 0.63) among all the algorithms screened. Also, the results of a probabilistic model demonstrated reliable model predictions. According to the model interpretation, halogen atoms are the most prominent features for DBP cytotoxicity compared to other chemical substructures. The presence of iodine and bromine is associated with increased cytotoxicity levels, while the presence of chlorine is linked to a reduction in cytotoxicity levels. Other factors including chemical substructures (CC, N, CN, and 6-member ring), cell line, and exposure duration can significantly affect the cytotoxicity of DBPs. The similarity calculation indicated that the model has a large applicability domain and can provide reliable predictions for DBPs with unknown cytotoxicity. Finally, this study showed the effectiveness of data augmentation in the scenario of data scarcity.
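The two figures of merit quoted in this abstract, the coefficient of determination (R²) and the root mean square error (RMSE), can be computed as in the minimal sketch below. The function names and the toy values are hypothetical and are not taken from the study; the abstract does not state the units or the transformation of the cytotoxicity values, so none are assumed here.

```python
import math

def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    squared_errors = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_measured = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print("RMSE:", rmse(measured, predicted))
print("R^2:", r_squared(measured, predicted))
```

RMSE is expressed in the units of the predicted quantity, whereas R² is unitless, which is why the two are usually reported together.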
Affiliation(s)
- Rabbi Sikder
  - Department of Civil and Environmental Engineering, South Dakota School of Mines and Technology, Rapid City, SD 57701, United States
- Huichun Zhang
  - Department of Civil and Environmental Engineering, Case Western Reserve University, Cleveland, OH 44106, United States
- Peng Gao
  - Department of Environmental and Occupational Health, and Department of Civil and Environmental Engineering, University of Pittsburgh, Pittsburgh, PA 15261, United States; UPMC Hillman Cancer Center, Pittsburgh, PA 15232, United States
- Tao Ye
  - Department of Civil and Environmental Engineering, South Dakota School of Mines and Technology, Rapid City, SD 57701, United States.

3
Almeida RL, Maltarollo VG, Coelho FGF. Overcoming class imbalance in drug discovery problems: Graph neural networks and balancing approaches. J Mol Graph Model 2024; 126:108627. [PMID: 37801808 DOI: 10.1016/j.jmgm.2023.108627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Revised: 09/12/2023] [Accepted: 09/12/2023] [Indexed: 10/08/2023]
Abstract
This research investigates the application of Graph Neural Networks (GNNs) to enhance the cost-effectiveness of drug development, addressing the limitations of cost and time. Class imbalances within classification datasets, such as the discrepancy between active and inactive compounds, give rise to difficulties that can be resolved through strategies like oversampling, undersampling, and manipulation of the loss function. A comparison is conducted between three distinct datasets using three different GNN architectures. This benchmarking research can steer future investigations and enhance the efficacy of GNNs in drug discovery and design. Three hundred models for each combination of architecture and dataset were trained using hyperparameter tuning techniques and evaluated using a range of metrics. Notably, the oversampling technique outperforms eight experiments, showcasing its potential. While balancing techniques boost imbalanced dataset models, their efficacy depends on dataset specifics and problem type. Although oversampling aids molecular graph datasets, more research is needed to optimize its usage and explore other class imbalance solutions.
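As an illustration of the oversampling strategy named in this abstract, the sketch below balances a toy dataset by duplicating randomly chosen minority-class samples until every class matches the largest one. This is plain random oversampling with replacement, which is an assumption: the abstract does not say which oversampling variant the authors used. The function name, the SMILES-like strings, and the labels are hypothetical.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical toy data: five inactive compounds (0) and one active (1).
molecules = ["CCO", "CCN", "CCC", "CCCl", "c1ccccc1", "CC(=O)O"]
activity = [0, 0, 0, 0, 1, 0]

balanced_x, balanced_y = random_oversample(molecules, activity)
print(Counter(balanced_y))
```

Oversampling is normally applied to the training split only, so that duplicated samples do not leak into validation or test data.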
Affiliation(s)
- Rafael Lopes Almeida
  - Graduate Program in Electrical Engineering - Universidade Federal de Minas Gerais, Av. Antônio Carlos 6627, Belo Horizonte, 31270-901, MG, Brazil
- Vinícius Gonçalves Maltarollo
  - Department of Pharmaceutical Products - Universidade Federal de Minas Gerais, Av. Antônio Carlos 6627, Belo Horizonte, 31270-901, MG, Brazil.
- Frederico Gualberto Ferreira Coelho
  - Department of Electronical Engineering - Universidade Federal de Minas Gerais, Av. Antônio Carlos 6627, Belo Horizonte, 31270-901, MG, Brazil

4
Loh C, Christensen T, Dangovski R, Kim S, Soljačić M. Surrogate- and invariance-boosted contrastive learning for data-scarce applications in science. Nat Commun 2022; 13:4223. [PMID: 35864122 PMCID: PMC9304370 DOI: 10.1038/s41467-022-31915-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/14/2021] [Accepted: 07/07/2022] [Indexed: 11/18/2022] Open
Abstract
Deep learning techniques have been increasingly applied to the natural sciences, e.g., for property prediction and optimization or material discovery. A fundamental ingredient of such approaches is the vast quantity of labeled data needed to train the model. This poses severe challenges in data-scarce settings where obtaining labels requires substantial computational or labor resources. Noting that problems in natural sciences often benefit from easily obtainable auxiliary information sources, we introduce surrogate- and invariance-boosted contrastive learning (SIB-CL), a deep learning framework which incorporates three inexpensive and easily obtainable auxiliary information sources to overcome data scarcity. Specifically, these are: abundant unlabeled data, prior knowledge of symmetries or invariances, and surrogate data obtained at near-zero cost. We demonstrate SIB-CL's effectiveness and generality on various scientific problems, e.g., predicting the density-of-states of 2D photonic crystals and solving the 3D time-independent Schrödinger equation. SIB-CL consistently results in orders of magnitude reduction in the number of labels needed to achieve the same network accuracies.
Affiliation(s)
- Charlotte Loh
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Thomas Christensen
  - Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, USA
- Rumen Dangovski
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Samuel Kim
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Marin Soljačić
  - Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, USA

5
Serov N, Vinogradov V. Artificial intelligence to bring nanomedicine to life. Adv Drug Deliv Rev 2022; 184:114194. [PMID: 35283223 DOI: 10.1016/j.addr.2022.114194] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 12/16/2021] [Revised: 03/04/2022] [Accepted: 03/07/2022] [Indexed: 12/13/2022]
Abstract
The technology of drug delivery systems (DDSs) has demonstrated outstanding performance and effectiveness in the production of pharmaceuticals, as proved by the many FDA-approved nanomedicines that have enhanced selectivity, manageable drug release kinetics, and synergistic therapeutic actions. Nonetheless, to date, the rational design and high-throughput development of nanomaterial-based DDSs for specific purposes is far from routine practice and is still in its infancy, mainly due to the limitations in scientists' capabilities to effectively acquire, analyze, manage, and comprehend complex and ever-growing sets of experimental data, which is vital to develop DDSs with a set of desired functionalities. At the same time, this task is feasible for data-driven approaches, high-throughput experimentation techniques, process automatization, artificial intelligence (AI) technology, and machine learning (ML) approaches, which together are referred to as The Fourth Paradigm of scientific research. Therefore, an integration of these approaches with nanomedicine and nanotechnology can potentially accelerate the rational design and high-throughput development of highly efficient nanoformulated drugs and smart materials with pre-defined functionalities. In this Review, we survey the important results and milestones achieved to date in the application of data science, high-throughput, and automatization approaches, combined with AI and ML, to design and optimize DDSs and related nanomaterials. This manuscript's mission is not only to reflect the state of the art in data-driven nanomedicine, but also to show how recent findings in the related fields can transform nanomedicine's image.
We discuss how all these results can be used to boost nanomedicine translation to the clinic, and highlight future directions for the data-driven, high-throughput experimentation-assisted, and AI-assisted design, development, and production of nanoformulated drugs and smart materials with pre-defined properties and behavior. This Review will be of high interest to chemists involved in materials science, nanotechnology, and DDS development for biomedical applications, although the general nature of the presented approaches enables knowledge translation to many other fields of science.
Affiliation(s)
- Nikita Serov
  - International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, Saint-Petersburg 191002, Russian Federation
- Vladimir Vinogradov
  - International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, Saint-Petersburg 191002, Russian Federation.

6
Robinson RLM, Sarimveis H, Doganis P, Jia X, Kotzabasaki M, Gousiadou C, Harper SL, Wilkins T. Identifying diverse metal oxide nanomaterials with lethal effects on embryonic zebrafish using machine learning. BEILSTEIN JOURNAL OF NANOTECHNOLOGY 2021; 12:1297-1325. [PMID: 34934606 PMCID: PMC8649207 DOI: 10.3762/bjnano.12.97] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2021] [Accepted: 10/28/2021] [Indexed: 06/14/2023]
Abstract
Manufacturers of nanomaterial-enabled products need models of endpoints that are relevant to human safety to support the "safe by design" paradigm and avoid late-stage attrition. Increasingly, embryonic zebrafish (Danio rerio) are recognised as a key human safety relevant in vivo test system. Hence, machine learning models were developed for identifying metal oxide nanomaterials causing lethality to embryonic zebrafish up to 24 hours post-fertilisation, or excess lethality in the period of 24-120 hours post-fertilisation, at concentrations of 250 ppm or less. Models were developed using data from the Nanomaterial Biological-Interactions Knowledgebase for a dataset of 44 diverse, coated and uncoated metal or, in one case, metalloid oxide nanomaterials. Different modelling approaches were evaluated using nested cross-validation on this dataset. Models were initially developed for both lethality endpoints using multiple descriptors representing the composition of the core, shell and surface functional groups, as well as particle characteristics. However, interestingly, the 24 hours post-fertilisation data were found to be harder to predict, which could reflect different exposure routes. Hence, subsequent analysis focused on the prediction of excess lethality at 120 hours post-fertilisation. The use of two data augmentation approaches, applied for the first time in nano-QSAR research, was explored, yet both failed to boost predictive performance. Interestingly, it was found that comparable results to those originally obtained using multiple descriptors could be obtained using a model based upon a single, simple descriptor: the Pauling electronegativity of the metal atom. Since it is widely recognised that a variety of intrinsic and extrinsic nanomaterial characteristics contribute to their toxicological effects, this is a surprising finding. This may partly reflect the need to investigate more sophisticated descriptors in future studies.
Future studies are also required to examine how robust these modelling results are on truly external data, which were not used to select the single descriptor model. This will require further laboratory work to generate comparable data to those studied herein.
Affiliation(s)
- Haralambos Sarimveis
  - School of Chemical Engineering, National Technical University of Athens, 9 Heroon Polytechniou str. Zografou Campus, 15780 Athens, Greece
- Philip Doganis
  - School of Chemical Engineering, National Technical University of Athens, 9 Heroon Polytechniou str. Zografou Campus, 15780 Athens, Greece
- Xiaodong Jia
  - School of Chemical and Process Engineering, University of Leeds, Leeds, LS2 9JT, United Kingdom
- Marianna Kotzabasaki
  - School of Chemical Engineering, National Technical University of Athens, 9 Heroon Polytechniou str. Zografou Campus, 15780 Athens, Greece
- Christiana Gousiadou
  - School of Chemical Engineering, National Technical University of Athens, 9 Heroon Polytechniou str. Zografou Campus, 15780 Athens, Greece
- Stacey Lynn Harper
  - School of Chemical, Biological, and Environmental Engineering, Oregon State University, Corvallis, Oregon, USA
  - Department of Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, USA
  - Oregon Nanoscience and Microtechnologies Institute, Eugene, Oregon, USA
- Terry Wilkins
  - School of Chemical and Process Engineering, University of Leeds, Leeds, LS2 9JT, United Kingdom

7
Kim J, Park S, Min D, Kim W. Comprehensive Survey of Recent Drug Discovery Using Deep Learning. Int J Mol Sci 2021; 22:9983. [PMID: 34576146 PMCID: PMC8470987 DOI: 10.3390/ijms22189983] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Received: 08/24/2021] [Revised: 09/09/2021] [Accepted: 09/10/2021] [Indexed: 02/07/2023] Open
Abstract
Drug discovery based on artificial intelligence has been in the spotlight recently as it significantly reduces the time and cost required for developing novel drugs. With the advancement of deep learning (DL) technology and the growth of drug-related data, numerous deep-learning-based methodologies are emerging at all steps of drug development processes. In particular, pharmaceutical chemists have faced significant issues with regard to selecting and designing potential drugs for a target of interest to enter preclinical testing. The two major challenges are prediction of interactions between drugs and druggable targets and generation of novel molecular structures suitable for a target of interest. Therefore, we reviewed recent deep-learning applications in drug-target interaction (DTI) prediction and de novo drug design. In addition, we introduce a comprehensive summary of a variety of drug and protein representations, DL models, and commonly used benchmark datasets or tools for model training and testing. Finally, we present the remaining challenges for the promising future of DL-based DTI prediction and de novo drug design.
Affiliation(s)
- Jintae Kim
  - KaiPharm Co., Ltd., Seoul 03759, Korea
- Sera Park
  - KaiPharm Co., Ltd., Seoul 03759, Korea
- Dongbo Min
  - Computer Vision Lab, Department of Computer Science and Engineering, Ewha Womans University, Seoul 03760, Korea
- Wankyu Kim
  - KaiPharm Co., Ltd., Seoul 03759, Korea
  - System Pharmacology Lab, Department of Life Sciences, Ewha Womans University, Seoul 03760, Korea

8
Kim J, Kim Y, Lee EK, Chae CH, Lee K, Kim WJ, Choi IS. Rotational Variance-Based Data Augmentation in 3D Graph Convolutional Network. Chem Asian J 2021; 16:2610-2613. [PMID: 34369653 DOI: 10.1002/asia.202100789] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 07/12/2021] [Revised: 07/30/2021] [Indexed: 01/17/2023]
Abstract
This work proposes data augmentation by molecular rotation, on the consideration that protein-ligand binding events are rotation-variant. As a proof of concept, known active (i.e., 1-labeled) ligands to human β-secretase 1 (BACE-1) are rotated to generate 0-labeled data, and the rotation-dependent prediction accuracy of a 3D graph convolutional network (3DGCN) is investigated after data augmentation. The data augmentation significantly improves the orientation-recognizing ability of 3DGCN in the classification task for BACE-1/ligand binding. Furthermore, the data-augmented 3DGCN is capable of predicting active ligands from a candidate dataset, via improved performance of orientation recognition, which could be applied to virtual drug screening and discovery.
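The labeling idea in this abstract (the bound pose of an active ligand keeps label 1, while rotated copies of the same ligand become label 0) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the rotation axis (z), the angles, the function names, and the three-atom toy coordinates are all hypothetical, and the abstract does not specify how rotations were sampled.

```python
import math

def rotate_z(coords, angle_deg):
    """Rotate a list of (x, y, z) atom coordinates about the z axis."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(cos_t * x - sin_t * y, sin_t * x + cos_t * y, z)
            for x, y, z in coords]

def augment_with_rotations(bound_pose, angles_deg):
    """Keep the bound pose as a 1-labeled sample and add one 0-labeled
    sample for each rotated copy of it."""
    dataset = [(bound_pose, 1)]
    for angle in angles_deg:
        dataset.append((rotate_z(bound_pose, angle), 0))
    return dataset

# Hypothetical three-atom "ligand" in its bound orientation.
bound_pose = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.5), (-1.0, 0.0, 1.0)]
augmented = augment_with_rotations(bound_pose, [90, 180, 270])

for coords, label in augmented:
    print(label, [tuple(round(v, 3) for v in atom) for atom in coords])
```

A rotation preserves all interatomic distances, so the rotated copies differ from the bound pose only in orientation, which is the property the classifier is meant to learn to recognize.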
Affiliation(s)
- Jihoo Kim
  - Department of Chemistry, KAIST, Daejeon, 34141, Korea
- Yeji Kim
  - Department of Chemistry, KAIST, Daejeon, 34141, Korea
- Eok Kyun Lee
  - Department of Chemistry, KAIST, Daejeon, 34141, Korea
- Chong Hak Chae
  - Data Convergence Drug Research Center, Korea Research Institute of Chemical Technology, Daejeon, 34114, Korea
- Kwangho Lee
  - Data Convergence Drug Research Center, Korea Research Institute of Chemical Technology, Daejeon, 34114, Korea
- Won June Kim
  - Department of Biology and Chemistry, Changwon National University, Changwon, 51140, Korea
- Insung S Choi
  - Department of Chemistry, KAIST, Daejeon, 34141, Korea

9
GPCR_LigandClassify.py; a rigorous machine learning classifier for GPCR targeting compounds. Sci Rep 2021; 11:9510. [PMID: 33947911 PMCID: PMC8097070 DOI: 10.1038/s41598-021-88939-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/05/2020] [Accepted: 04/12/2021] [Indexed: 02/02/2023] Open
Abstract
The current study describes the construction of various ligand-based machine learning models to be used for drug-repurposing against the family of G-Protein Coupled Receptors (GPCRs). In building these models, we collected > 500,000 data points, encompassing experimentally measured molecular association data of > 160,000 unique ligands against > 250 GPCRs. These data points were retrieved from the GPCR-Ligand Association (GLASS) database. We have used diverse molecular featurization methods to describe the input molecules. Multiple supervised ML algorithms were developed, tested and compared for their accuracy, F scores, as well as for their Matthews' correlation coefficient scores (MCC). Our data suggest that combined with molecular fingerprinting, ensemble decision trees and gradient boosted trees ML algorithms are on the accuracy border of the rather sophisticated deep neural nets (DNNs)-based algorithms. On a test dataset, these models displayed an excellent performance, reaching a ~ 90% classification accuracy. Additionally, we showcase a few examples where our models were able to identify interesting connections between known drugs from the Drug-Bank database and members of the GPCR family of receptors. Our findings are in excellent agreement with previously reported experimental observations in the literature. We hope the models presented in this paper synergize with the currently ongoing interest of applying machine learning modeling in the field of drug repurposing and computational drug discovery in general.
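The three metrics this abstract compares (accuracy, F score, and Matthews' correlation coefficient) can all be derived from a confusion matrix, as in the minimal sketch below. It assumes a binary classification setting and uses hypothetical labels and function names; the abstract does not say whether the study's models were binary or multi-class, so this is an illustration of the metric definitions only.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, tn, fp, fn):
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical labels: 1 = binds the target, 0 = does not.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("F1:", f1_score(*counts))
print("MCC:", mcc(*counts))
```

MCC uses all four cells of the confusion matrix, so it is less easily inflated by a dominant class than accuracy is.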
10
Augmentation in Healthcare: Augmented Biosignal Using Deep Learning and Tensor Representation. JOURNAL OF HEALTHCARE ENGINEERING 2021; 2021:6624764. [PMID: 33575018 PMCID: PMC7861952 DOI: 10.1155/2021/6624764] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/26/2020] [Revised: 11/22/2020] [Accepted: 01/12/2021] [Indexed: 11/25/2022]
Abstract
In healthcare applications, deep learning is a highly valuable tool. It extracts features from raw data to save time and effort for health practitioners. A deep learning model is capable of learning and extracting the features from raw data by itself without any external intervention. On the other hand, shallow learning feature extraction techniques depend on user experience in selecting a powerful feature extraction algorithm. In this article, we proposed a multistage model that is based on the spectrogram of biosignal. The proposed model provides an appropriate representation of the input raw biosignal that boosts the accuracy of training and testing dataset. In the next stage, smaller datasets are augmented as larger data sets to enhance the accuracy of the classification for biosignal datasets. After that, the augmented dataset is represented in the TensorFlow that provides more services and functionalities, which give more flexibility. The proposed model was compared with different approaches. The results show that the proposed approach is better in terms of testing and training accuracy.
11
Lentelink NJ, Palkovits S. Transfer Learning as Tool to Enhance Predictions of Molecular Properties Based on 2D Projections. ADVANCED THEORY AND SIMULATIONS 2020. [DOI: 10.1002/adts.202000148] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/14/2022]
Affiliation(s)
- Niklas Julian Lentelink
  - Institute of Technical and Macromolecular Chemistry RWTH Aachen University Worringer Weg 2 Aachen 52074 Germany
- Stefan Palkovits
  - Institute of Technical and Macromolecular Chemistry RWTH Aachen University Worringer Weg 2 Aachen 52074 Germany

12
Jablonka K, Ongari D, Moosavi SM, Smit B. Big-Data Science in Porous Materials: Materials Genomics and Machine Learning. Chem Rev 2020; 120:8066-8129. [PMID: 32520531 PMCID: PMC7453404 DOI: 10.1021/acs.chemrev.0c00004] [Citation(s) in RCA: 154] [Impact Index Per Article: 38.5] [Received: 01/03/2020] [Indexed: 12/16/2022]
Abstract
By combining metal nodes with organic linkers we can potentially synthesize millions of possible metal-organic frameworks (MOFs). The fact that we have so many materials opens many exciting avenues but also creates new challenges. We simply have too many materials to be processed using conventional, brute-force methods. In this review, we show that having so many materials allows us to use big-data methods as a powerful technique to study these materials and to discover complex correlations. The first part of the review gives an introduction to the principles of big-data science. We show how to select appropriate training sets, survey approaches that are used to represent these materials in feature space, and review different learning architectures, as well as evaluation and interpretation strategies. In the second part, we review how the different approaches of machine learning have been applied to porous materials. In particular, we discuss applications in the field of gas storage and separation, the stability of these materials, their electronic properties, and their synthesis. Given the increasing interest of the scientific community in machine learning, we expect this list to rapidly expand in the coming years.
Affiliation(s)
- Kevin Maik Jablonka
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland
- Daniele Ongari
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland
- Seyed Mohamad Moosavi
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland
- Berend Smit
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland

13
Cortés-Ciriano I, Škuta C, Bender A, Svozil D. QSAR-derived affinity fingerprints (part 2): modeling performance for potency prediction. J Cheminform 2020; 12:41. [PMID: 33431016 PMCID: PMC7339533 DOI: 10.1186/s13321-020-00444-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/07/2019] [Accepted: 05/16/2020] [Indexed: 01/22/2023] Open
Abstract
Affinity fingerprints report the activity of small molecules across a set of assays, and thus make it possible to gather information about the bioactivities of structurally dissimilar compounds, where models based on chemical structure alone are often limited, and to model complex biological endpoints, such as human toxicity and in vitro cancer cell line sensitivity. Here, we propose to model in vitro compound activity using computationally predicted bioactivity profiles as compound descriptors. To this end, we apply and validate a framework for the calculation of QSAR-derived affinity fingerprints (QAFFP) using a set of 1360 QSAR models generated using Ki, Kd, IC50 and EC50 data from the ChEMBL database. QAFFP thus represent a method to encode and relate compounds on the basis of their similarity in bioactivity space. To benchmark the predictive power of QAFFP, we assembled IC50 data from the ChEMBL database for 18 diverse cancer cell lines widely used in preclinical drug discovery, and 25 diverse protein target data sets. This study complements part 1, where the performance of QAFFP in similarity searching, scaffold hopping, and bioactivity classification is evaluated. Although QAFFP are inherently noisy, we show that using them as descriptors leads to errors in prediction on the test set in the ~0.65-0.95 pIC50 units range, which are comparable to the estimated uncertainty of bioactivity data in ChEMBL (0.76-1.00 pIC50 units). We find that the predictive power of QAFFP is slightly worse than that of Morgan2 fingerprints and 1D and 2D physicochemical descriptors, with an effect size in the 0.02-0.08 pIC50 units range. Including QSAR models with low predictive power in the generation of QAFFP does not lead to improved predictive power. Given that the QSAR models we used to compute the QAFFP were selected on the basis of data availability alone, we anticipate better modeling results for QAFFP generated using more diverse and biologically meaningful targets.
Data sets and Python code are publicly available at https://github.com/isidroc/QAFFP_regression.
Affiliation(s)
- Isidro Cortés-Ciriano
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK.
- Ctibor Škuta
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague, Czech Republic
- Andreas Bender
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- Daniel Svozil
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague, Czech Republic
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic

14
|
Li X, Fourches D. Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT. J Cheminform 2020; 12:27. [PMID: 33430978 PMCID: PMC7178569 DOI: 10.1186/s13321-020-00430-x] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2020] [Accepted: 04/15/2020] [Indexed: 12/25/2022] Open
Abstract
Deep neural networks can directly learn from chemical structures without extensive, user-driven selection of descriptors in order to predict molecular properties/activities with high reliability. But these approaches typically require large training sets to learn the endpoint-specific structural features and ensure reasonable prediction accuracy. Even though large datasets are becoming the new normal in drug discovery, especially when it comes to high-throughput screening or metabolomics datasets, one should also consider smaller datasets with challenging endpoints to model and forecast. Thus, it would be highly relevant to better utilize the tremendous compendium of unlabeled compounds from publicly-available datasets for improving the model performances for the user’s particular series of compounds. In this study, we propose the Molecular Prediction Model Fine-Tuning (MolPMoFiT) approach, an effective transfer learning method based on self-supervised pre-training + task-specific fine-tuning for QSPR/QSAR modeling. A large-scale molecular structure prediction model is pre-trained using one million unlabeled molecules from ChEMBL in a self-supervised learning manner, and can then be fine-tuned on various QSPR/QSAR tasks for smaller chemical datasets with specific endpoints. Herein, the method is evaluated on four benchmark datasets (lipophilicity, FreeSolv, HIV, and blood–brain barrier penetration). The results showed the method can achieve strong performances for all four datasets compared to other state-of-the-art machine learning modeling techniques reported in the literature so far.
Affiliation(s)
- Xinhao Li
- Department of Chemistry, Bioinformatics Research Center, North Carolina State University, Raleigh, NC, 27695, USA
- Denis Fourches
- Department of Chemistry, Bioinformatics Research Center, North Carolina State University, Raleigh, NC, 27695, USA
15
Drakakis G, Cortés-Ciriano I, Alexander-Dann B, Bender A. Elucidating Compound Mechanism of Action and Predicting Cytotoxicity Using Machine Learning Approaches, Taking Prediction Confidence into Account. Curr Protoc Chem Biol 2020; 11:e73. [PMID: 31483099 DOI: 10.1002/cpch.73] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/06/2023]
Abstract
The modes of action (MoAs) of drugs frequently are unknown, because many are small molecules initially identified from phenotypic screens, giving rise to the need to elucidate their MoAs. In addition, the high attrition rate for candidate drugs in preclinical studies due to intolerable toxicity has motivated the development of computational approaches to predict drug candidate (cyto)toxicity as early as possible in the drug-discovery process. Here, we provide detailed instructions for capitalizing on bioactivity predictions to elucidate the MoAs of small molecules and infer their underlying phenotypic effects. We illustrate how these predictions can be used to infer the underlying antidepressive effects of marketed drugs. We also provide the necessary functionalities to model cytotoxicity data using single and ensemble machine-learning algorithms. Finally, we give detailed instructions on how to calculate confidence intervals for individual predictions using the conformal prediction framework. © 2019 by John Wiley & Sons, Inc.
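As a rough illustration of how the conformal prediction framework attaches an interval to an individual prediction, the sketch below implements one simple variant (split, or inductive, conformal regression with absolute residuals as nonconformity scores). This variant, the function names and the toy pIC50 numbers are assumptions made for illustration; the protocol itself may use a different nonconformity measure or normalisation.

```python
import math

def conformal_halfwidth(calib_true, calib_pred, confidence=0.8):
    """Half-width of a split-conformal prediction interval.

    Absolute residuals on a held-out calibration set serve as
    nonconformity scores (an assumed, commonly used choice).
    """
    scores = sorted(abs(t - p) for t, p in zip(calib_true, calib_pred))
    n = len(scores)
    # Finite-sample corrected rank of the score to use.
    rank = math.ceil((n + 1) * confidence)
    if rank > n:
        # Too few calibration points to guarantee this confidence level.
        return float("inf")
    return scores[rank - 1]

def predict_interval(point_prediction, halfwidth):
    """Symmetric interval around the model's point prediction."""
    return (point_prediction - halfwidth, point_prediction + halfwidth)

# Toy calibration set (illustrative pIC50 values, not real data).
calib_true = [5.0, 6.2, 7.1, 4.8, 6.5, 5.9, 7.4, 6.0, 5.5]
calib_pred = [5.3, 6.0, 6.5, 5.0, 6.9, 5.8, 7.0, 6.7, 5.4]

halfwidth = conformal_halfwidth(calib_true, calib_pred, confidence=0.8)
low, high = predict_interval(6.3, halfwidth)
```

With this simple nonconformity score every compound receives an interval of the same width; normalised scores, which scale the residual by an estimate of prediction difficulty, give compound-specific widths.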
Affiliation(s)
- Georgios Drakakis
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
- Isidro Cortés-Ciriano
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
- Ben Alexander-Dann
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
- Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
16
Cardoso-Silva J, Papageorgiou LG, Tsoka S. Network-based piecewise linear regression for QSAR modelling. J Comput Aided Mol Des 2019; 33:831-844. [PMID: 31628660 PMCID: PMC6825651 DOI: 10.1007/s10822-019-00228-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 03/25/2019] [Accepted: 09/28/2019] [Indexed: 02/07/2023]
Abstract
Quantitative Structure-Activity Relationship (QSAR) models are critical in various areas of drug discovery, for example in lead optimisation and virtual screening. Recently, the need for models that are not only predictive but also interpretable has been highlighted. In this paper, a new methodology is proposed to build interpretable QSAR models by combining elements of network analysis and piecewise linear regression. The algorithm presented, modSAR, splits data using a two-step procedure. First, compounds associated with a common target are represented as a network in terms of their structural similarity, revealing modules of similar chemical properties. Second, each module is subdivided into subsets (regions), each of which is modelled by an independent linear equation. Comparative analysis of QSAR models across five data sets of protein inhibitors obtained from ChEMBL is reported and it is shown that modSAR offers similar predictive accuracy to popular algorithms, such as Random Forest and Support Vector Machine. Moreover, we show that models built by modSAR are interpretable, capable of evaluating the applicability domain of the compounds, and well suited to tasks such as virtual screening and the development of new drug leads.
Affiliation(s)
- Jonathan Cardoso-Silva
- Department of Informatics, Faculty of Natural and Mathematical Sciences, King's College London, Bush House, 30 Aldwych, London, WC2B 4BG, UK
- Lazaros G Papageorgiou
- Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, Roberts Building, Torrington Place, London, WC1E 7JE, UK
- Sophia Tsoka
- Department of Informatics, Faculty of Natural and Mathematical Sciences, King's College London, Bush House, 30 Aldwych, London, WC2B 4BG, UK
17
Simeon S, Montanari D, Gleeson MP. Investigation of Factors Affecting the Performance of in silico Volume Distribution QSAR Models for Human, Rat, Mouse, Dog & Monkey. Mol Inform 2019; 38:e1900059. [DOI: 10.1002/minf.201900059] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 05/23/2019] [Accepted: 07/03/2019] [Indexed: 01/09/2023]
Affiliation(s)
- Saw Simeon
- Interdisciplinary Graduate Program in Bioscience, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand
- Center for Advanced Studies in Nanotechnology for Chemical, Food and Agricultural Industries, KU Institute for Advanced Studies, Kasetsart University, Bangkok 10900, Thailand
- Dino Montanari
- DMPK and Bioanalysis, Aptuit, Via Alessandro Fleming 4, 37135 Verona VR, Italy
- Matthew Paul Gleeson
- Department of Chemistry, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand
- Department of Biomedical Engineering, Faculty of Engineering, King Mongkut's Institute of Technology Ladkrabang, Bangkok 10520, Thailand
18
Cortés-Ciriano I, Bender A. KekuleScope: prediction of cancer cell line sensitivity and compound potency using convolutional neural networks trained on compound images. J Cheminform 2019; 11:41. [PMID: 31218493 PMCID: PMC6582521 DOI: 10.1186/s13321-019-0364-5] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 02/19/2019] [Accepted: 06/09/2019] [Indexed: 02/08/2023]
Abstract
The application of convolutional neural networks (ConvNets) to harness high-content screening images or 2D compound representations is gaining increasing attention in drug discovery. However, existing applications often require large data sets for training, or sophisticated pretraining schemes. Here, we show using 33 IC50 data sets from ChEMBL 23 that the in vitro activity of compounds on cancer cell lines and protein targets can be accurately predicted on a continuous scale from their Kekulé structure representations alone by extending existing architectures (AlexNet, DenseNet-201, ResNet152 and VGG-19), which were pretrained on unrelated image data sets. We show that the predictive power of the generated models, which just require standard 2D compound representations as input, is comparable to that of Random Forest (RF) models and fully-connected Deep Neural Networks trained on circular (Morgan) fingerprints. Notably, including additional fully-connected layers further increases the predictive power of the ConvNets by up to 10%. Analysis of the predictions generated by RF models and ConvNets shows that simply averaging the outputs of the RF models and ConvNets yields significantly lower errors in prediction for multiple data sets than those obtained with either model alone, although the effect size is small. This indicates that the features extracted by the convolutional layers of the ConvNets provide complementary predictive signal to Morgan fingerprints. Lastly, we show that multi-task ConvNets trained on compound images make it possible to model COX isoform selectivity on a continuous scale with errors in prediction comparable to the uncertainty of the data. Overall, in this work we present a set of ConvNet architectures for the prediction of compound activity from their Kekulé structure representations with state-of-the-art performance, which require no generation of compound descriptors or use of sophisticated image processing techniques.
The code needed to reproduce the results presented in this study and all the data sets are provided at https://github.com/isidroc/kekulescope .
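The effect of averaging the outputs of two different model types can be illustrated with a small sketch. The numbers below are made-up pIC50 values in which the two models' errors are deliberately anti-correlated, so the gain is much larger than the small effect size reported in the abstract; the variable names `rf_pred` and `convnet_pred` are placeholders and nothing here comes from the KekuleScope repository.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def average_predictions(pred_a, pred_b):
    """Unweighted mean of two models' predictions, compound by compound."""
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]

# Toy test set (illustrative values only).
y_true = [6.0, 7.0, 5.0, 8.0]
rf_pred = [6.4, 6.5, 5.3, 7.6]        # errors: +0.4, -0.5, +0.3, -0.4
convnet_pred = [5.7, 7.4, 4.6, 8.3]   # errors: -0.3, +0.4, -0.4, +0.3

ensemble_pred = average_predictions(rf_pred, convnet_pred)

rmse_rf = rmse(y_true, rf_pred)
rmse_convnet = rmse(y_true, convnet_pred)
rmse_ensemble = rmse(y_true, ensemble_pred)
```

Averaging reduces the error only to the extent that the two models make partly uncorrelated mistakes, which is why a gain from averaging is read as evidence of complementary signal.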
Affiliation(s)
- Isidro Cortés-Ciriano
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW UK
- Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW UK
19
Grenet I, Merlo K, Comet JP, Tertiaux R, Rouquié D, Dayan F. Stacked Generalization with Applicability Domain Outperforms Simple QSAR on in Vitro Toxicological Data. J Chem Inf Model 2019; 59:1486-1496. [PMID: 30735402 DOI: 10.1021/acs.jcim.8b00553] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 01/17/2023]
Abstract
The development of in silico tools able to predict bioactivity and toxicity of chemical substances is a powerful solution envisioned to assess toxicity as early as possible. To enable the development of such tools, the ToxCast program has generated and made publicly available in vitro bioactivity data for thousands of compounds. The goal of the present study is to characterize and explore the data from ToxCast in terms of Machine Learning capability. For this, a large scale analysis on the entire database has been performed to build models to predict bioactivities measured in in vitro assays. Simple classical QSAR algorithms (ANN, SVM, LDA, random forest, and Bayesian) were first applied on the data, and the results of these algorithms suggested that they do not seem to be well-suited for data sets with a high proportion of inactive compounds. The study then showed for the first time that the use of an ensemble method named "Stacked generalization" could improve the model performance on this type of data. Indeed, for 61% of 483 models, the Stacked method led to models with higher performance. Moreover, the combination of this ensemble method with an applicability domain filter allows one to assess the reliability of the predictions for further compound prioritization. In particular we showed that for 50% of the models, the ROC score is better if we do not consider the compounds that are not within the applicability domain.
Affiliation(s)
- Ingrid Grenet
- University Côte d'Azur, I3S Laboratory, UMR CNRS 7271, CS 40121, 06903 Sophia Antipolis Cedex, France; Bayer SAS, 06903 Sophia Antipolis Cedex, France
- Kevin Merlo
- Dassault Systèmes SE, 06906 Sophia Antipolis, Biot, France
- Jean-Paul Comet
- University Côte d'Azur, I3S Laboratory, UMR CNRS 7271, CS 40121, 06903 Sophia Antipolis Cedex, France
- Romain Tertiaux
- Dassault Systèmes SE, 06906 Sophia Antipolis, Biot, France
- Frédéric Dayan
- Dassault Systèmes SE, 06906 Sophia Antipolis, Biot, France
20
Lagunin AA, Geronikaki A, Eleftheriou P, Pogodin PV, Zakharov AV. Rational Use of Heterogeneous Data in Quantitative Structure-Activity Relationship (QSAR) Modeling of Cyclooxygenase/Lipoxygenase Inhibitors. J Chem Inf Model 2019; 59:713-730. [PMID: 30688458 DOI: 10.1021/acs.jcim.8b00617] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 01/22/2023]
Abstract
Numerous studies have been published in recent years with acceptable quantitative structure-activity relationship (QSAR) modeling based on heterogeneous data. In many cases, the training sets for QSAR modeling were constructed from compounds tested by different biological assays, contradicting the opinion that QSAR modeling should be based on the data measured by a single protocol. We attempted to develop approaches that help to determine how heterogeneous data should be used for the creation of QSAR models on the basis of different sets of compounds tested by different experimental methods for the same target and the same endpoint. To this end, more than 100 QSAR models for the IC50 values of ligands interacting with cyclooxygenase 1,2 (COX) and seed lipoxygenase (LOX), obtained from the ChEMBL database, were created using the GUSAR software. The QSAR models were tested on an external set of 26 new thiazolidinone derivatives, which were experimentally tested for COX-1,2/LOX inhibition. The IC50 values of the derivatives varied from 89 μM to 26 μM for LOX, from 200 μM to 0.018 μM for COX-1, and from 210 μM to 1 μM for COX-2. This study showed that the accuracy of the models depends on the distribution of IC50 values of low-activity compounds in the training sets. In most cases, QSAR models created from the combined training sets had advantages over QSAR models based on a single publication. We introduced a new method for combining quantitative data from different experimental studies based on the data of reference compounds, which was called "scaling".
Affiliation(s)
- Alexey A Lagunin
- Pirogov Russian National Research Medical University, Ostrovitianov str. 1, Moscow, 117997, Russia
- Institute of Biomedical Chemistry, Pogodinskaya Str., 10/8, Moscow, 119121, Russia
- Athina Geronikaki
- School of Pharmacy, Aristotle University, Thessaloniki, 54124, Greece
- Phaedra Eleftheriou
- School of Health and Medical Care, Alexander Technological Educational Institute of Thessaloniki, Thessaloniki, 57400, Greece
- Pavel V Pogodin
- Institute of Biomedical Chemistry, Pogodinskaya Str., 10/8, Moscow, 119121, Russia
- Alexey V Zakharov
- National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, Rockville, Maryland 20850, United States
21
Lagunin AA, Romanova MA, Zadorozhny AD, Kurilenko NS, Shilov BV, Pogodin PV, Ivanov SM, Filimonov DA, Poroikov VV. Comparison of Quantitative and Qualitative (Q)SAR Models Created for the Prediction of Ki and IC50 Values of Antitarget Inhibitors. Front Pharmacol 2018; 9:1136. [PMID: 30364128 PMCID: PMC6192375 DOI: 10.3389/fphar.2018.01136] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 05/02/2018] [Accepted: 09/18/2018] [Indexed: 12/20/2022]
Abstract
Estimation of interaction of drug-like compounds with antitargets is important for the assessment of possible toxic effects during drug development. Publicly available online databases provide data on the experimental results of chemical interactions with antitargets, which can be used for the creation of (Q)SAR models. The structures and experimental Ki and IC50 values for compounds tested on the inhibition of 30 antitargets from the ChEMBL 20 database were used. Data sets with Ki and IC50 values including more than 100 compounds were created for each antitarget. The (Q)SAR models were created by GUSAR software using quantitative neighborhoods of atoms (QNA), multilevel neighborhoods of atoms (MNA) descriptors, and self-consistent regression. The accuracy of (Q)SAR models was validated by the fivefold cross-validation procedure. The balanced accuracy was higher for qualitative SAR models (0.80 and 0.81 for Ki and IC50 values, respectively) than for quantitative QSAR models (0.73 and 0.76 for Ki and IC50 values, respectively). In most cases, sensitivity was higher for SAR models than for QSAR models, but specificity was higher for QSAR models. The mean R² and RMSE were 0.64 and 0.77 for Ki values and 0.59 and 0.73 for IC50 values, respectively. The number of compounds falling within the applicability domain was higher for SAR models than for the test sets.
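Comparing a quantitative QSAR model with a qualitative SAR model on balanced accuracy requires turning continuous predictions into active/inactive calls. The sketch below shows one way to do this by thresholding; the threshold of 6.0 log units and the toy values are assumptions for illustration, not the cut-off or data used in the study.

```python
def binarize(values, threshold):
    """Label a compound active when its value reaches the threshold."""
    return [v >= threshold for v in values]

def classification_metrics(actual, predicted):
    """Return (sensitivity, specificity, balanced accuracy) for boolean labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2

# Toy data on a negative-log scale (illustrative values only).
measured = [7.2, 5.1, 6.8, 4.9, 6.1, 5.5]
qsar_predicted = [6.9, 5.4, 5.8, 5.2, 6.3, 6.2]
ACTIVITY_THRESHOLD = 6.0  # assumed cut-off between active and inactive

actual_labels = binarize(measured, ACTIVITY_THRESHOLD)
predicted_labels = binarize(qsar_predicted, ACTIVITY_THRESHOLD)
sensitivity, specificity, balanced_acc = classification_metrics(
    actual_labels, predicted_labels
)
```

Balanced accuracy is the mean of sensitivity and specificity, so it is not inflated when inactive compounds greatly outnumber active ones.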
Affiliation(s)
- Alexey A. Lagunin
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Department of Bioinformatics, Pirogov Russian National Research Medical University, Moscow, Russia
- Maria A. Romanova
- Department of Bioinformatics, Pirogov Russian National Research Medical University, Moscow, Russia
- Anton D. Zadorozhny
- Department of Bioinformatics, Pirogov Russian National Research Medical University, Moscow, Russia
- Natalia S. Kurilenko
- Department of Bioinformatics, Pirogov Russian National Research Medical University, Moscow, Russia
- Boris V. Shilov
- Department of Bioinformatics, Pirogov Russian National Research Medical University, Moscow, Russia
- Pavel V. Pogodin
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Sergey M. Ivanov
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Department of Bioinformatics, Pirogov Russian National Research Medical University, Moscow, Russia
- Dmitry A. Filimonov
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
22
Cardoso‐Silva J, Papadatos G, Papageorgiou LG, Tsoka S. Optimal Piecewise Linear Regression Algorithm for QSAR Modelling. Mol Inform 2018; 38:e1800028. [DOI: 10.1002/minf.201800028] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 03/06/2018] [Accepted: 08/02/2018] [Indexed: 12/20/2022]
Affiliation(s)
- Jonathan Cardoso‐Silva
- Department of Informatics, Faculty of Natural and Mathematical Sciences, King's College London, Bush House, London WC2B 4BG, UK
- George Papadatos
- European Molecular Biology Laboratory – European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- GlaxoSmithKline, Gunnels Wood Road, Stevenage, Hertfordshire SG1 2NY, UK
- Lazaros G. Papageorgiou
- Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, UK
- Sophia Tsoka
- Department of Informatics, Faculty of Natural and Mathematical Sciences, King's College London, Bush House, London WC2B 4BG, UK
23
Svensson F, Aniceto N, Norinder U, Cortes-Ciriano I, Spjuth O, Carlsson L, Bender A. Conformal Regression for Quantitative Structure–Activity Relationship Modeling—Quantifying Prediction Uncertainty. J Chem Inf Model 2018; 58:1132-1140. [DOI: 10.1021/acs.jcim.8b00054] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Indexed: 12/21/2022]
Affiliation(s)
- Fredrik Svensson
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K
- IOTA Pharmaceuticals, St Johns Innovation Centre, Cowley Road, Cambridge CB4 0WS, U.K
- Natalia Aniceto
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K
- Ulf Norinder
- Swetox, Unit of Toxicology Sciences, Karolinska Institutet, Forskargatan 20, SE-151 36 Södertälje, Sweden
- Department of Computer and Systems Sciences, Stockholm University, Box 7003, SE-164 07 Kista, Sweden
- Isidro Cortes-Ciriano
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K
- Ola Spjuth
- Department of Pharmaceutical Biosciences, Uppsala University, Box 591, SE-75124, Uppsala Sweden
- Lars Carlsson
- Quantitative Biology, Discovery Sciences, IMED Biotech Unit, AstraZeneca, SE-43183, Mölndal, Sweden
- Department of Computer Science, Royal Holloway, University of London, Egham Hill, Surrey, U.K
- Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K
24
Fukunishi Y, Yamasaki S, Yasumatsu I, Takeuchi K, Kurosawa T, Nakamura H. Quantitative Structure-activity Relationship (QSAR) Models for Docking Score Correction. Mol Inform 2017; 36:1600013. [PMID: 28001004 PMCID: PMC5297997 DOI: 10.1002/minf.201600013] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 01/27/2016] [Accepted: 04/01/2016] [Indexed: 01/26/2023]
Abstract
In order to improve docking score correction, we developed several structure-based quantitative structure activity relationship (QSAR) models by protein-drug docking simulations and applied these models to public affinity data. The prediction models used descriptor-based regression, and the compound descriptor was a set of docking scores against multiple (∼600) proteins including nontargets. The binding free energy that corresponded to the docking score was approximated by a weighted average of docking scores for multiple proteins, and we tried linear, weighted linear and polynomial regression models considering the compound similarities. In addition, we tried a combination of these regression models for individual data sets such as IC50, Ki, and %inhibition values. The cross-validation results showed that the weighted linear model was more accurate than the simple linear regression model. Thus, the QSAR approaches based on the affinity data of public databases should improve docking scores.
Affiliation(s)
- Yoshifumi Fukunishi
- Molecular Profiling Research Center for Drug Discovery (molprof), National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26, Aomi, Koto-ku, Tokyo, 135-0064, Japan
- Satoshi Yamasaki
- Technology Research Association for Next-Generation Natural Products Chemistry, 2-3-26, Aomi, Koto-ku, Tokyo, 135-0064, Japan
- Isao Yasumatsu
- Technology Research Association for Next-Generation Natural Products Chemistry, 2-3-26, Aomi, Koto-ku, Tokyo, 135-0064, Japan
- Daiichi Sankyo RD Novare Co., Ltd., 1-16-13, Kita-Kasai, Edogawa-ku, Tokyo, 134-8630, Japan
- Koh Takeuchi
- Molecular Profiling Research Center for Drug Discovery (molprof), National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26, Aomi, Koto-ku, Tokyo, 135-0064, Japan
- Takashi Kurosawa
- Technology Research Association for Next-Generation Natural Products Chemistry, 2-3-26, Aomi, Koto-ku, Tokyo, 135-0064, Japan
- Hitachi Solutions East Japan, 12-1 Ekimaehoncho, Kawasaki-ku, Kanagawa, 210-0007, Japan
- Haruki Nakamura
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan