1
Minot M, Reddy ST. Meta learning addresses noisy and under-labeled data in machine learning-guided antibody engineering. Cell Syst 2024; 15:4-18.e4. [PMID: 38194961 DOI: 10.1016/j.cels.2023.12.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Revised: 07/21/2023] [Accepted: 12/07/2023] [Indexed: 01/11/2024]
Abstract
Machine learning-guided protein engineering is rapidly progressing; however, collecting high-quality, large datasets remains a bottleneck. Directed evolution and protein engineering studies often require extensive experimental processes to eliminate noise and label protein sequence-function data. Meta learning has proven effective in other fields in learning from noisy data via bi-level optimization given the availability of a small dataset with trusted labels. Here, we leverage meta learning approaches to overcome noisy and under-labeled data and expedite workflows in antibody engineering. We generate yeast display antibody mutagenesis libraries and screen them for target antigen binding followed by deep sequencing. We then create representative learning tasks, including learning from noisy training data, positive and unlabeled learning, and learning out of distribution properties. We demonstrate that meta learning has the potential to reduce experimental screening time and improve the robustness of machine learning models by training with noisy and under-labeled training data.
Affiliation(s)
- Mason Minot
  - ETH Zurich, Department of Biosystems Science and Engineering, Basel 4056, Switzerland
- Sai T Reddy
  - ETH Zurich, Department of Biosystems Science and Engineering, Basel 4056, Switzerland

2
Carneiro J, Magalhães RP, de la Oliva Roque VM, Simões M, Pratas D, Sousa SF. TargIDe: a machine-learning workflow for target identification of molecules with antibiofilm activity against Pseudomonas aeruginosa. J Comput Aided Mol Des 2023; 37:265-278. [PMID: 37085636 DOI: 10.1007/s10822-023-00505-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/10/2023] [Accepted: 04/12/2023] [Indexed: 04/23/2023]
Abstract
Bacterial biofilms are a source of infectious human diseases and are heavily linked to antibiotic resistance. Pseudomonas aeruginosa is a multidrug-resistant bacterium widely present and implicated in several hospital-acquired infections. Over the last years, the development of new drugs able to inhibit Pseudomonas aeruginosa by interfering with its ability to form biofilms has become a promising strategy in drug discovery. Identifying molecules able to interfere with biofilm formation is difficult, but further developing these molecules by rationally improving their activity is particularly challenging, as it requires knowledge of the specific protein target that is inhibited. This work describes the development of a machine learning multitechnique consensus workflow to predict the protein targets of molecules with confirmed inhibitory activity against biofilm formation by Pseudomonas aeruginosa. It uses a specialized database containing all the known targets implicated in biofilm formation by Pseudomonas aeruginosa. The experimentally confirmed inhibitors available on ChEMBL, together with chemical descriptors, were used as the input features for a combination of nine different classification models, yielding a consensus method to predict the most likely target of a ligand. The implemented algorithm is freely available at https://github.com/BioSIM-Research-Group/TargIDe under licence GNU General Public Licence (GPL) version 3 and can easily be improved as more data become available.
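The consensus step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a plain majority vote over the per-model predictions; the abstract does not state the actual combination rule, and the function name `consensus_target` and the target labels in the usage note are hypothetical, not taken from the TargIDe repository.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def consensus_target(predictions: Sequence[str]) -> List[Tuple[str, float]]:
    """Rank predicted targets by the fraction of models that voted for them.

    `predictions` holds one predicted target label per classification model.
    The first element of the result is the consensus (most voted) target.
    """
    if not predictions:
        raise ValueError("at least one model prediction is required")
    counts = Counter(predictions)
    total = len(predictions)
    # most_common() sorts by vote count, highest first.
    return [(target, votes / total) for target, votes in counts.most_common()]
```

For example, if five of nine hypothetical models vote for one label, `consensus_target(["LasR"] * 5 + ["PqsR"] * 3 + ["RhlR"])` ranks `"LasR"` first with a vote share of 5/9. Returning the full ranking rather than a single label keeps the level of agreement between models visible.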
Affiliation(s)
- João Carneiro
  - Interdisciplinary Centre of Marine and Environmental Research, CIIMAR, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, Porto, 4450-208, Portugal
- Rita P Magalhães
  - Faculty of Medicine, Associate Laboratory i4HB-Institute for Health and Bioeconomy, University of Porto, 4200-319, Porto, Portugal
  - Department of Biomedicine, Faculty of Medicine, UCIBIO-Applied Molecular Biosciences Unit, University of Porto, BioSIM, Porto, 4200-319, Portugal
- Victor M de la Oliva Roque
  - Faculty of Medicine, Associate Laboratory i4HB-Institute for Health and Bioeconomy, University of Porto, 4200-319, Porto, Portugal
  - Department of Biomedicine, Faculty of Medicine, UCIBIO-Applied Molecular Biosciences Unit, University of Porto, BioSIM, Porto, 4200-319, Portugal
- Manuel Simões
  - Faculty of Engineering, LEPABE Laboratory for Process Engineering, Environment, Biotechnology and Energy, University of Porto, Rua Dr. Roberto Frias, s/n, Porto, 4200-465, Portugal
  - Faculty of Engineering, ALiCE-Associate Laboratory in Chemical Engineering, University of Porto, Rua Dr. Roberto Frias, 4200-465, Porto, Portugal
- Diogo Pratas
  - Institute of Electronics and Informatics Engineering of Aveiro, IEETA, University of Aveiro, Aveiro, Portugal
  - Department of Electronics, Telecommunications and Informatics, DETI, University of Aveiro, Aveiro, Portugal
  - Department of Virology, DoV, University of Helsinki, Helsinki, Finland
- Sérgio F Sousa
  - Faculty of Medicine, Associate Laboratory i4HB-Institute for Health and Bioeconomy, University of Porto, 4200-319, Porto, Portugal
  - Department of Biomedicine, Faculty of Medicine, UCIBIO-Applied Molecular Biosciences Unit, University of Porto, BioSIM, Porto, 4200-319, Portugal

3
Deep learning model for classification and bioactivity prediction of essential oil-producing plants from Egypt. Sci Rep 2020; 10:21349. [PMID: 33288845 PMCID: PMC7721748 DOI: 10.1038/s41598-020-78449-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/18/2020] [Accepted: 11/20/2020] [Indexed: 11/29/2022] Open
Abstract
Deep learning has become an important tool in several scientific domains, including the biological sciences, owing to its proven efficiency in handling big data that are often characterized by non-linear processes and complicated relationships. In this study, a Convolutional Neural Network (CNN), one such deep learning technique, is used to classify essential oil-producing plants and predict their biological activities from their chemical compositions. The model is built on the available chemical composition data for a set of endemic Egyptian plants and their biological activities. To evaluate the proposed CNN model fairly, another machine learning algorithm, a Multiclass Neural Network (MNN), was applied to the same Essential Oils (EO) dataset. The recorded testing accuracies for the CNN and the MNN are 98.13% and 81.88%, respectively. The CNN was therefore adopted as a reliable model for classifying and predicting the bioactivities of Egyptian EO-containing plants. The overall accuracy for the final prediction process is reported as approximately 97%. The proposed deep learning model could thus serve as an efficient model for predicting the bioactivities of EO-producing plants, at least Egyptian ones.

4
Berenger F, Yamanishi Y. Ranking Molecules with Vanishing Kernels and a Single Parameter: Active Applicability Domain Included. J Chem Inf Model 2020; 60:4376-4387. [PMID: 32281797 DOI: 10.1021/acs.jcim.9b01075] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Abstract
In ligand-based virtual screening, high-throughput screening (HTS) data sets can be exploited to train classification models. Such models can be used to prioritize yet untested molecules, from the most likely active (against a protein target of interest) to the least likely active. In this study, a single-parameter ranking method with an Applicability Domain (AD) is proposed. In effect, Kernel Density Estimates (KDE) are revisited to improve their computational efficiency and incorporate an AD. Two modifications are proposed: (i) using vanishing kernels (i.e., kernel functions with a finite support) and (ii) using the Tanimoto distance between molecular fingerprints as a radial basis function. This construction is termed "Vanishing Ranking Kernels" (VRK). Using VRK on 21 HTS assays, it is shown that VRK can compete in performance with a graph convolutional deep neural network. VRK are conceptually simple and fast to train. During training, they require optimizing a single parameter. A trained VRK model usually defines an active AD. Exploiting this AD can significantly increase the screening frequency of a VRK model. Software: https://github.com/UnixJunkie/rankers. Data sets: https://zenodo.org/record/1320776 and https://zenodo.org/record/3540423.
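The two ingredients named in the abstract above, a kernel with finite support and the Tanimoto distance between fingerprints, can be combined as in the following sketch. This is not the authors' implementation (see the linked repository for that): the quadratic kernel shape, the active-fraction score, and the names `tanimoto_distance`, `vanishing_kernel` and `rank_score` are all assumptions made for illustration. Fingerprints are modelled as sets of "on" bit indices.

```python
from typing import FrozenSet, Optional, Sequence

Fingerprint = FrozenSet[int]  # indices of the bits set in a molecular fingerprint


def tanimoto_distance(a: Fingerprint, b: Fingerprint) -> float:
    """1 - |A & B| / |A | B|; two empty fingerprints are treated as identical."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def vanishing_kernel(distance: float, bandwidth: float) -> float:
    """Quadratic kernel that is exactly zero at or beyond the bandwidth (assumed shape)."""
    u = distance / bandwidth
    return 1.0 - u * u if u < 1.0 else 0.0


def rank_score(
    query: Fingerprint,
    actives: Sequence[Fingerprint],
    inactives: Sequence[Fingerprint],
    bandwidth: float,
) -> Optional[float]:
    """Share of kernel mass contributed by active training molecules.

    Returns None when no training molecule lies within the bandwidth, i.e. the
    query falls outside the applicability domain implied by the finite support.
    """
    active_mass = sum(vanishing_kernel(tanimoto_distance(query, m), bandwidth) for m in actives)
    inactive_mass = sum(vanishing_kernel(tanimoto_distance(query, m), bandwidth) for m in inactives)
    total = active_mass + inactive_mass
    if total == 0.0:
        return None
    return active_mass / total
```

In this sketch `bandwidth` is the single tunable parameter. Because the kernel vanishes beyond it, distant training molecules contribute nothing to the score, and a query with no neighbour inside the bandwidth is reported as outside the applicability domain instead of receiving an arbitrary score.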
Affiliation(s)
- Francois Berenger
  - Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, Kawazu, 680-4 Iizuka, Japan
- Yoshihiro Yamanishi
  - Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, Kawazu, 680-4 Iizuka, Japan

5
de la Vega de León A, Chen B, Gillet VJ. Effect of missing data on multitask prediction methods. J Cheminform 2018; 10:26. [PMID: 29789977 PMCID: PMC5964064 DOI: 10.1186/s13321-018-0281-z] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 03/19/2018] [Accepted: 05/14/2018] [Indexed: 01/05/2023] Open
Abstract
There has been a growing interest in multitask prediction in chemoinformatics, helped by the increasing use of deep neural networks in this field. This technique is applied to multitarget data sets, where compounds have been tested against different targets, with the aim of developing models to predict a profile of biological activities for a given compound. However, multitarget data sets tend to be sparse; i.e., not all compound-target combinations have experimental values. There has been little research on the effect of missing data on the performance of multitask methods. We have used two complete data sets to simulate sparseness by removing data from the training set. Different models to remove the data were compared. These sparse sets were used to train two different multitask methods, deep neural networks and Macau, which is a Bayesian probabilistic matrix factorization technique. Results from both methods were remarkably similar and showed that the performance decrease because of missing data is at first small before accelerating after large amounts of data are removed. This work provides a first approximation to assess how much data is required to produce good performance in multitask prediction exercises.
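The idea of simulating sparseness from a complete data set can be sketched as follows. The abstract says several removal models were compared without describing them, so this example assumes the simplest one, uniform random removal of compound-target values; the function name `remove_at_random` and the fixed seed are illustrative choices.

```python
import random
from typing import List, Optional

# Rows are compounds, columns are targets; None marks a missing measurement.
Matrix = List[List[Optional[float]]]


def remove_at_random(activities: Matrix, fraction: float, seed: int = 0) -> Matrix:
    """Return a copy of `activities` with `fraction` of its known values set to None."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # local generator so results are reproducible
    known = [
        (i, j)
        for i, row in enumerate(activities)
        for j, value in enumerate(row)
        if value is not None
    ]
    to_remove = set(rng.sample(known, round(fraction * len(known))))
    return [
        [None if (i, j) in to_remove else value for j, value in enumerate(row)]
        for i, row in enumerate(activities)
    ]
```

Calling the function with increasing `fraction` values on the same complete training matrix yields a series of progressively sparser training sets, which is the kind of series needed to trace how model performance changes as data are removed.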
Affiliation(s)
- Beining Chen
  - Department of Chemistry, University of Sheffield, Dainton Building, Brook Hill, Sheffield, S3 7HF, UK
- Valerie J Gillet
  - Information School, University of Sheffield, Regent Court, 211 Portobello, Sheffield, S1 4DP, UK

6
Afolabi LT, Saeed F, Hashim H, Petinrin OO. Ensemble learning method for the prediction of new bioactive molecules. PLoS One 2018; 13:e0189538. [PMID: 29329334 PMCID: PMC5766097 DOI: 10.1371/journal.pone.0189538] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 07/24/2017] [Accepted: 11/27/2017] [Indexed: 12/31/2022] Open
Abstract
Pharmacologically active molecules can provide remedies for a range of different illnesses and infections. Therefore, the search for such bioactive molecules has been an enduring mission. As such, there is a need to employ a more suitable, reliable, and robust classification method for enhancing the prediction of the existence of new bioactive molecules. In this paper, we adopt a recently developed combination of different boosting methods (Adaboost) for the prediction of new bioactive molecules. We conducted the research experiments utilizing the widely used MDL Drug Data Report (MDDR) database. The proposed boosting method generated better results than other machine learning methods. This finding suggests that the method is suitable for inclusion among the in silico tools for use in cheminformatics, computational chemistry and molecular biology.
Affiliation(s)
- Faisal Saeed
  - College of Computer Science and Engineering, Taibah University, Medina, Saudi Arabia
  - Information Systems Department, Faculty of Computing, Universiti Teknologi Malaysia, Skudai, Johor, Malaysia
- Haslinda Hashim
  - Information Systems Department, Faculty of Computing, Universiti Teknologi Malaysia, Skudai, Johor, Malaysia
  - Kolej Yayasan Pelajaran Johor, KM16, Jalan Kulai-Kota Tinggi, Kota Tinggi, Johor, Malaysia

7
Riniker S, Landrum GA, Montanari F, Villalba SD, Maier J, Jansen JM, Walters WP, Shelat AA. Virtual-screening workflow tutorials and prospective results from the Teach-Discover-Treat competition 2014 against malaria. F1000Res 2017; 6:1136. [PMID: 28928948 DOI: 10.12688/f1000research.11905.1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Accepted: 07/11/2017] [Indexed: 12/21/2022] Open
Abstract
The first challenge in the 2014 competition launched by the Teach-Discover-Treat (TDT) initiative asked for the development of a tutorial for ligand-based virtual screening, based on data from a primary phenotypic high-throughput screen (HTS) against malaria. The resulting Workflows were applied to select compounds from a commercial database, and a subset of those were purchased and tested experimentally for anti-malaria activity. Here, we present the two most successful Workflows, both using machine-learning approaches, and report the results for the 114 compounds tested in the follow-up screen. Excluding the two known anti-malarials quinidine and amodiaquine and 31 compounds already present in the primary HTS, a high hit rate of 57% was found.
Affiliation(s)
- Sereina Riniker
  - Laboratory of Physical Chemistry, ETH Zürich, Zürich, Switzerland
- Floriane Montanari
  - Pharmacoinformatics Research Group, Department of Pharmaceutical Chemistry, University of Vienna, Vienna, Austria
- Santiago D Villalba
  - IMP - Research Institute of Molecular Pathology, Vienna Biocenter, Vienna, Austria
- Julie Maier
  - Department of Chemical Biology & Therapeutics, St. Jude Children's Research Hospital, Memphis, TN, USA
- Johanna M Jansen
  - Department of Global Discovery Chemistry, Novartis Institutes for BioMedical Research, Emeryville, CA, USA
- Anang A Shelat
  - Department of Chemical Biology & Therapeutics, St. Jude Children's Research Hospital, Memphis, TN, USA

8
Riniker S, Landrum GA, Montanari F, Villalba SD, Maier J, Jansen JM, Walters WP, Shelat AA. Virtual-screening workflow tutorials and prospective results from the Teach-Discover-Treat competition 2014 against malaria. F1000Res 2017. [PMID: 28928948 PMCID: PMC5580409 DOI: 10.12688/f1000research.11905.2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 01/14/2023] Open

9
Babajide Mustapha I, Saeed F. Bioactive Molecule Prediction Using Extreme Gradient Boosting. Molecules 2016; 21:983. [PMID: 27483216 PMCID: PMC6273295 DOI: 10.3390/molecules21080983] [Citation(s) in RCA: 98] [Impact Index Per Article: 12.3] [Received: 05/01/2016] [Revised: 07/19/2016] [Accepted: 07/22/2016] [Indexed: 01/29/2023] Open
Abstract
Following the explosive growth in chemical and biological data, the shift from traditional methods of drug discovery to computer-aided means has made data mining and machine learning methods integral parts of today's drug discovery process. In this paper, extreme gradient boosting (Xgboost), which is an ensemble of Classification and Regression Tree (CART) and a variant of the Gradient Boosting Machine, was investigated for the prediction of biological activity based on quantitative description of the compound's molecular structure. Seven datasets, well known in the literature were used in this paper and experimental results show that Xgboost can outperform machine learning algorithms like Random Forest (RF), Support Vector Machines (LSVM), Radial Basis Function Neural Network (RBFN) and Naïve Bayes (NB) for the prediction of biological activities. In addition to its ability to detect minority activity classes in highly imbalanced datasets, it showed remarkable performance on both high and low diversity datasets.
Affiliation(s)
- Ismail Babajide Mustapha
  - UTM Big Data Centre, Ibnu Sina Institute for Scientific and Industrial Research, Universiti Teknologi Malaysia, Skudai, Johor 81310, Malaysia
- Faisal Saeed
  - Information Systems Department, Faculty of Computing, Universiti Teknologi Malaysia, Skudai, Johor 81310, Malaysia

10
Gilson MK, Liu T, Baitaluk M, Nicola G, Hwang L, Chong J. BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res 2016; 44:D1045-53. [PMID: 26481362 PMCID: PMC4702793 DOI: 10.1093/nar/gkv1072] [Citation(s) in RCA: 804] [Impact Index Per Article: 100.5] [Received: 08/20/2015] [Revised: 10/02/2015] [Accepted: 10/05/2015] [Indexed: 12/12/2022] Open
Abstract
BindingDB, www.bindingdb.org, is a publicly accessible database of experimental protein-small molecule interaction data. Its collection of over a million data entries derives primarily from scientific articles and, increasingly, US patents. BindingDB provides many ways to browse and search for data of interest, including an advanced search tool, which can cross searches of multiple query types, including text, chemical structure, protein sequence and numerical affinities. The PDB and PubMed provide links to data in BindingDB, and vice versa; and BindingDB provides links to pathway information, the ZINC catalog of available compounds, and other resources. The BindingDB website offers specialized tools that take advantage of its large data collection, including ones to generate hypotheses for the protein targets bound by a bioactive compound, and for the compounds bound by a new protein of known sequence; and virtual compound screening by maximal chemical similarity, binary kernel discrimination, and support vector machine methods. Specialized data sets are also available, such as binding data for hundreds of congeneric series of ligands, drawn from BindingDB and organized for use in validating drug design methods. BindingDB offers several forms of programmatic access, and comes with extensive background material and documentation. Here, we provide the first update of BindingDB since 2007, focusing on new and unique features and highlighting directions of importance to the field as a whole.
Affiliation(s)
- Michael K Gilson
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0736, USA
- Tiqing Liu
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0736, USA
- Michael Baitaluk
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0736, USA
- George Nicola
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0736, USA
- Linda Hwang
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0736, USA
- Jenny Chong
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0736, USA

11
Abstract
We revisit the Parzen Window approach widely employed in pattern recognition. The Parzen Window approach can suffer from a severe computational bottleneck. This manuscript introduces a new scheme to ameliorate this computational drawback.
Pattern classification methods assign an object to one of several predefined classes/categories based on features extracted from observed attributes of the object (pattern). When L discriminatory features for the pattern can be accurately determined, the pattern classification problem presents no difficulty. However, precise identification of the relevant features for a classification algorithm (classifier) to be able to categorize real world patterns without errors is generally infeasible. In this case, the pattern classification problem is often cast as devising a classifier that minimizes the misclassification rate. One way of doing this is to consider both the pattern attributes and its class label as random variables, estimate the posterior class probabilities for a given pattern and then assign the pattern to the class/category for which the posterior class probability value estimated is maximum. More often than not, the form of the posterior class probabilities is unknown. The so-called Parzen Window approach is widely employed to estimate class-conditional probability (class-specific probability) densities for a given pattern. These probability densities can then be utilized to estimate the appropriate posterior class probabilities for that pattern. However, the Parzen Window scheme can become computationally impractical when the size of the training dataset is in the tens of thousands and L is also large (a few hundred or more). Over the years, various schemes have been suggested to ameliorate the computational drawback of the Parzen Window approach, but the problem still remains outstanding and unresolved. In this paper, we revisit the Parzen Window technique and introduce a novel approach that may circumvent the aforementioned computational bottleneck. The current paper presents the mathematical aspect of our idea. Practical realizations of the proposed scheme will be given elsewhere.
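For orientation, the standard Parzen Window classifier that the abstract above revisits can be sketched as follows. This shows only the textbook baseline, not the new scheme the authors propose; the Gaussian window, the use of class frequencies as priors, and all function names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence

Pattern = Sequence[float]  # one pattern described by L features


def gaussian_window(x: Pattern, center: Pattern, h: float) -> float:
    """Isotropic Gaussian window of width h centred on one training pattern."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, center))
    normaliser = (2.0 * math.pi * h * h) ** (len(x) / 2.0)
    return math.exp(-squared_distance / (2.0 * h * h)) / normaliser


def class_conditional_density(x: Pattern, samples: List[Pattern], h: float) -> float:
    """Parzen estimate of p(x | class): the average window value over the class samples."""
    return sum(gaussian_window(x, s, h) for s in samples) / len(samples)


def posterior_probabilities(x: Pattern, training: Dict[str, List[Pattern]], h: float) -> Dict[str, float]:
    """Bayes' rule with class frequencies as priors (an assumption of this sketch)."""
    n_total = sum(len(samples) for samples in training.values())
    joint = {
        label: (len(samples) / n_total) * class_conditional_density(x, samples, h)
        for label, samples in training.items()
    }
    evidence = sum(joint.values())
    if evidence == 0.0:
        raise ValueError("all densities underflowed; try a larger window width h")
    return {label: value / evidence for label, value in joint.items()}


def classify(x: Pattern, training: Dict[str, List[Pattern]], h: float) -> str:
    """Assign x to the class with the largest estimated posterior probability."""
    posteriors = posterior_probabilities(x, training, h)
    return max(posteriors, key=posteriors.get)
```

Each call to `classify` evaluates the window at every training pattern, so the cost per query grows with the number of training patterns times L. That per-query cost is the computational bottleneck the abstract refers to for training sets in the tens of thousands with several hundred features.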

12
Abstract
The emphasis of this review is particularly on multivariate statistical methods currently used in quantitative structure–activity relationship (QSAR) studies.
Affiliation(s)
- Somayeh Pirhadi
  - Drug Design in Silico Lab, Chemistry Faculty, K. N. Toosi University of Technology, Tehran, Iran
- Jahan B. Ghasemi
  - Drug Design in Silico Lab, Chemistry Faculty, K. N. Toosi University of Technology, Tehran, Iran

13
Lewis RA, Wood D. Modern 2D QSAR for drug discovery. Wiley Interdisciplinary Reviews: Computational Molecular Science 2014. [DOI: 10.1002/wcms.1187] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Indexed: 01/26/2023]
Affiliation(s)
- Richard A. Lewis
  - Novartis Institutes for BioMedical Research; Novartis Pharma AG; Basel Switzerland
- David Wood
  - Novartis Institutes for BioMedical Research; Novartis Horsham Research Centre; Horsham UK

14
Abdo A, Leclère V, Jacques P, Salim N, Pupin M. Prediction of new bioactive molecules using a Bayesian belief network. J Chem Inf Model 2014; 54:30-6. [PMID: 24392938 DOI: 10.1021/ci4004909] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Indexed: 11/30/2022]
Abstract
Natural products and synthetic compounds are a valuable source of new small molecules leading to novel drugs to cure diseases. However identifying new biologically active small molecules is still a challenge. In this paper, we introduce a new activity prediction approach using Bayesian belief network for classification (BBNC). The roots of the network are the fragments composing a compound. The leaves are, on one side, the activities to predict and, on another side, the unknown compound. The activities are represented by sets of known compounds, and sets of inactive compounds are also used. We calculated a similarity between an unknown compound and each activity class. The more similar activity is assigned to the unknown compound. We applied this new approach on eight well-known data sets extracted from the literature and compared its performance to three classical machine learning algorithms. Experiments showed that BBNC provides interesting prediction rates (from 79% accuracy for high diverse data sets to 99% for low diverse ones) with a short time calculation. Experiments also showed that BBNC is particularly effective for homogeneous data sets but has been found to perform less well with structurally heterogeneous sets. However, it is important to stress that we believe that using several approaches whenever possible for activity prediction can often give a broader understanding of the data than using only one approach alone. Thus, BBNC is a useful addition to the computational chemist's toolbox.
Affiliation(s)
- Ammar Abdo
  - LIFL UMR CNRS 8022 Université Lille1 and INRIA Lille Nord Europe, 59655 Villeneuve d'Ascq cedex, France

15
Riniker S, Fechner N, Landrum GA. Heterogeneous Classifier Fusion for Ligand-Based Virtual Screening: Or, How Decision Making by Committee Can Be a Good Thing. J Chem Inf Model 2013; 53:2829-36. [DOI: 10.1021/ci400466r] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Indexed: 01/16/2023]
Affiliation(s)
- Sereina Riniker
  - Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, CH-4056 Basel, Switzerland
- Nikolas Fechner
  - Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, CH-4056 Basel, Switzerland
- Gregory A. Landrum
  - Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, CH-4056 Basel, Switzerland

16
Bauman JD, Patel D, Dharia C, Fromer MW, Ahmed S, Frenkel Y, Vijayan RSK, Eck JT, Ho WC, Das K, Shatkin AJ, Arnold E. Detecting allosteric sites of HIV-1 reverse transcriptase by X-ray crystallographic fragment screening. J Med Chem 2013; 56:2738-46. [PMID: 23342998 PMCID: PMC3906421 DOI: 10.1021/jm301271j] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Indexed: 01/23/2023]
Abstract
HIV-1 reverse transcriptase (RT) undergoes a series of conformational changes during viral replication and is a central target for antiretroviral therapy. The intrinsic flexibility of RT can provide novel allosteric sites for inhibition. Crystals of RT that diffract X-rays to better than 2 Å resolution facilitated the probing of RT for new druggable sites using fragment screening by X-ray crystallography. A total of 775 fragments were grouped into 143 cocktails, which were soaked into crystals of RT in complex with the non-nucleoside drug rilpivirine (TMC278). Seven new sites were discovered, including the Incoming Nucleotide Binding, Knuckles, NNRTI Adjacent, and 399 sites, located in the polymerase region of RT, and the 428, RNase H Primer Grip Adjacent, and 507 sites, located in the RNase H region. Three of these sites (Knuckles, NNRTI Adjacent, and Incoming Nucleotide Binding) are inhibitory and provide opportunities for discovery of new anti-AIDS drugs.
Collapse
Affiliation(s)
- Joseph D. Bauman
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey
- Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey
| | - Disha Patel
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey
- Department of Medicinal Chemistry, Rutgers University, Piscataway, New Jersey
| | - Chhaya Dharia
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey
- Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey
| | - Marc W. Fromer
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey
- Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey
| | - Sameer Ahmed
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey
- Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey
| | - Yulia Frenkel
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey
- Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey
| | - R. S. K. Vijayan
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey
- Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey
| | - J. Thomas Eck
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey
- Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey
| | - William C. Ho
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey
- Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey
| | - Kalyan Das
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey
- Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey
| | - Aaron J. Shatkin
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey
| | - Eddy Arnold
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey
- Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey
- Department of Medicinal Chemistry, Rutgers University, Piscataway, New Jersey
|
17
|
Tyzack JD, Mussa HY, Glen RC. Probabilistic classifier: generated using randomised sub-sampling of the feature space. J Cheminform 2012. [PMCID: PMC3341313 DOI: 10.1186/1758-2946-4-s1-p40] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
|
18
|
Wang Z, Mussa HY, Lowe R, Glen RC, Yan A. Probability Based hERG Blocker Classifiers. Mol Inform 2012; 31:679-85. [DOI: 10.1002/minf.201200011] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 02/02/2012] [Accepted: 07/03/2012] [Indexed: 11/11/2022]
|
19
|
Nicola G, Liu T, Gilson MK. Public domain databases for medicinal chemistry. J Med Chem 2012; 55:6987-7002. [PMID: 22731701 DOI: 10.1021/jm300501t] [Citation(s) in RCA: 64] [Impact Index Per Article: 5.3] [Indexed: 12/11/2022]
Affiliation(s)
- George Nicola
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California-San Diego, 9500 Gilman Drive, La Jolla, California 92093, United States
|
20
|
Zhang S. Application of Machine Leaning in Drug Discovery and Development. Mach Learn 2012. [DOI: 10.4018/978-1-60960-818-7.ch517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Machine learning techniques have been widely used in drug discovery and development, particularly in the areas of cheminformatics, bioinformatics and other types of pharmaceutical research. It has been demonstrated they are suitable for large high dimensional data, and the models built with these methods can be used for robust external predictions. However, various problems and challenges still exist, and new approaches are in great need. In this Chapter, the authors will review the current development of machine learning techniques, and especially focus on several machine learning techniques they developed as well as their application to model building, lead discovery via virtual screening, integration with molecular docking, and prediction of off-target properties. The authors will suggest some potential different avenues to unify different disciplines, such as cheminformatics, bioinformatics and systems biology, for the purpose of developing integrated in silico drug discovery and development approaches.
Affiliation(s)
- Shuxing Zhang
- The University of Texas at M.D. Anderson Cancer Center, USA
|
21
|
He J, Yang G, Rao H, Li Z, Ding X, Chen Y. Prediction of human major histocompatibility complex class II binding peptides by continuous kernel discrimination method. Artif Intell Med 2011; 55:107-15. [PMID: 22134095 DOI: 10.1016/j.artmed.2011.10.005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/22/2011] [Revised: 10/12/2011] [Accepted: 10/21/2011] [Indexed: 11/25/2022]
Abstract
OBJECTIVE: Accurate prediction of major histocompatibility complex (MHC) class II binding peptides helps reduce the experimental cost of identifying helper T cell epitopes, which has been a challenging problem partly because of the variable length of the binding peptides. This work aims to develop an accurate model for predicting MHC-binding peptides using machine learning methods. METHODS: In this work, a machine learning method, continuous kernel discrimination (CKD), was used for predicting MHC class II binders of variable lengths. The composition, transition, and distribution features were used for encoding peptide sequences, and the Metropolis Monte Carlo simulated annealing approach was used for feature selection. RESULTS: Feature selection was found to significantly improve the performance of the model. For benchmark dataset Dataset-1, the number of features is reduced from 147 to 24 and the area under the receiver operating characteristic curve (AUC) is improved from 0.8088 to 0.9034, while for benchmark dataset Dataset-2, the number of features is reduced from 147 to 44 and the AUC is improved from 0.7349 to 0.8499. An optimal CKD model was derived from the feature selection and bandwidth optimization using 10-fold cross-validation. Its AUC values are between 0.831 and 0.980 evaluated on benchmark datasets BM-Set1 and are between 0.806 and 0.949 on benchmark datasets BM-Set2 for MHC class II alleles. These results indicate a significantly better performance for our CKD model over other earlier models based on the training and testing of the same datasets. CONCLUSIONS: Our study suggested that the CKD method outperforms other machine learning methods proposed earlier in the prediction of MHC class II binding peptides. Moreover, the choice of the cut-off for the CKD classifier is crucial for its performance.
Affiliation(s)
- Ju He
- College of Chemistry, Sichuan University, Chengdu 610064, People's Republic of China
|
22
|
Lowe R, Mussa HY, Mitchell JBO, Glen RC. Classifying Molecules Using a Sparse Probabilistic Kernel Binary Classifier. J Chem Inf Model 2011; 51:1539-44. [DOI: 10.1021/ci200128w] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Indexed: 11/28/2022]
Affiliation(s)
- Robert Lowe
- Unilever Centre for Molecular Sciences Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
| | - Hamse Y. Mussa
- Unilever Centre for Molecular Sciences Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
| | - John B. O. Mitchell
- EaStCHEM School of Chemistry and Biomedical Sciences Research Complex, University of St. Andrews, North Haugh, St. Andrews, Scotland KY16 9ST, United Kingdom
| | - Robert C. Glen
- Unilever Centre for Molecular Sciences Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
|
23
|
|
24
|
Abstract
This chapter reviews the application of fragment descriptors at different stages of virtual screening: filtering, similarity search, and direct activity assessment using QSAR/QSPR models. Several case studies are considered. It is demonstrated that the power of fragment descriptors stems from their universality, very high computational efficiency, simplicity of interpretation, and versatility.
Affiliation(s)
- Alexandre Varnek
- Laboratory of Chemoinformatics, UMR7177 CNRS, University of Strasbourg, Strasbourg, France
|
25
|
Mussa HY, Hawizy L, Nigsch F, Glen RC. Classifying large chemical data sets: using a regularized potential function method. J Chem Inf Model 2010; 51:4-14. [PMID: 21155612 DOI: 10.1021/ci100022u] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 11/30/2022]
Abstract
In recent years, classifiers generated with kernel-based methods, such as support vector machines (SVM), Gaussian processes (GP), regularization networks (RN), and binary kernel discrimination (BKD), have been very popular in chemoinformatics data analysis. Aizerman et al. were the first to introduce the notion of employing kernel-based classifiers in the area of pattern recognition. Their original scheme, which they termed the potential function method (PFM), can basically be viewed as a kernel-based perceptron procedure and arguably subsumes the modern kernel-based algorithms. PFM can be computationally much cheaper than modern kernel-based classifiers; furthermore, PFM is far simpler conceptually and easier to implement than the SVM, GP, and RN algorithms. Unfortunately, unlike, e.g., SVM, GP, and RN, PFM is not endowed with both theoretical guarantees and practical strategies to safeguard it against generating overfitting classifiers. This is, in our opinion, the reason why this simple and elegant method has not been taken up in chemoinformatics. In this paper we empirically address this drawback: while maintaining its simplicity, we demonstrate that PFM combined with a simple regularization scheme may yield binary classifiers that can be, in practice, as efficient as classifiers obtained by employing state-of-the-art kernel-based methods. Using a realistic classification example, the augmented PFM was used to generate binary classifiers. Using a large chemical data set, the generalization ability of PFM classifiers was then compared with the prediction power of Laplacian-modified naive Bayesian (LmNB), Winnow (WN), and SVM classifiers.
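The view of PFM as a kernel-based perceptron can be illustrated with a minimal sketch. It assumes a Gaussian kernel, class labels of +1 and -1, and a toy XOR data set; all function names are hypothetical, and the regularization scheme studied in the paper is not reproduced because the abstract does not specify it.

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    diff = a - b
    return float(np.exp(-gamma * np.dot(diff, diff)))

def train_kernel_perceptron(X, y, epochs=10, gamma=1.0):
    """Return per-sample mistake counts; labels in y must be +1 or -1."""
    n = len(X)
    # Precompute the Gram matrix once so each update is a cheap dot product.
    K = np.array([[gaussian_kernel(X[i], X[j], gamma) for j in range(n)]
                  for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            score = np.sum(alpha * y * K[:, i])
            if y[i] * score <= 0:  # misclassified (or undecided): reinforce sample i
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # a full pass without errors means training has converged
            break
    return alpha

def predict(X_train, y_train, alpha, x, gamma=1.0):
    """Classify x by the sign of the kernel-weighted sum over training samples."""
    score = sum(a * yi * gaussian_kernel(xi, x, gamma)
                for a, yi, xi in zip(alpha, y_train, X_train))
    return 1 if score > 0 else -1

# Toy XOR problem, which a linear perceptron cannot separate.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1, -1, 1, 1])
alpha = train_kernel_perceptron(X, y, epochs=20, gamma=2.0)
preds = [predict(X, y, alpha, x, gamma=2.0) for x in X]
```

Without any regularization, a procedure like this keeps adding weight to every misclassified sample, which is one way to see why the abstract stresses the risk of overfitting.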
Affiliation(s)
- Hamse Y Mussa
- Unilever Centre for Molecular Sciences Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom.
|
26
|
Kutchukian PS, Shakhnovich EI. De novo design: balancing novelty and confined chemical space. Expert Opin Drug Discov 2010; 5:789-812. [PMID: 22827800 DOI: 10.1517/17460441.2010.497534] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.4] [Indexed: 11/05/2022]
Abstract
IMPORTANCE OF THE FIELD: De novo drug design serves as a tool for the discovery of new ligands for macromolecular targets as well as optimization of known ligands. Recently developed tools aim to address the multi-objective nature of drug design in an unprecedented manner. AREAS COVERED IN THIS REVIEW: This article discusses recent advances in de novo drug design programs and accessory programs used to evaluate compounds post-generation. WHAT THE READER WILL GAIN: The reader is introduced to the challenges inherent in de novo drug design and will become familiar with current trends in de novo design. Furthermore, the reader will be better prepared to assess the value of a tool, and be equipped to design more elegant tools in the future. TAKE HOME MESSAGE: De novo drug design can assist in the efficient discovery of new compounds with a high affinity for a given target. The inclusion of existing chemoinformatic methods with current structure-based de novo design tools provides a means of enhancing the therapeutic value of these generated compounds.
Affiliation(s)
- Peter S Kutchukian
- Harvard University, Chemistry and Chemical Biology Department, 12 Oxford Street, Cambridge, MA 02138, USA
|
27
|
Geppert H, Vogt M, Bajorath J. Current trends in ligand-based virtual screening: molecular representations, data mining methods, new application areas, and performance evaluation. J Chem Inf Model 2010; 50:205-16. [PMID: 20088575 DOI: 10.1021/ci900419k] [Citation(s) in RCA: 231] [Impact Index Per Article: 16.5] [Indexed: 12/19/2022]
Affiliation(s)
- Hanna Geppert
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universitat, Dahlmannstrasse 2, D-53113 Bonn, Germany
|
28
|
Ranu S, Singh AK. Mining Statistically Significant Molecular Substructures for Efficient Molecular Classification. J Chem Inf Model 2009; 49:2537-50. [DOI: 10.1021/ci900035z] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/30/2022]
Affiliation(s)
- Sayan Ranu
- Department of Computer Science, University of California, Santa Barbara, California
| | - Ambuj K. Singh
- Department of Computer Science, University of California, Santa Barbara, California
|
29
|
Kutchukian PS, Lou D, Shakhnovich EI. FOG: Fragment Optimized Growth Algorithm for the de Novo Generation of Molecules Occupying Druglike Chemical Space. J Chem Inf Model 2009; 49:1630-42. [DOI: 10.1021/ci9000458] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Indexed: 02/06/2023]
Affiliation(s)
- Peter S. Kutchukian
- Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, Massachusetts 02138
| | - David Lou
- Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, Massachusetts 02138
| | - Eugene I. Shakhnovich
- Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, Massachusetts 02138
|
30
|
Nasr RJ, Swamidass SJ, Baldi PF. Large scale study of multiple-molecule queries. J Cheminform 2009; 1:7. [PMID: 20298525 PMCID: PMC3225883 DOI: 10.1186/1758-2946-1-7] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Received: 06/01/2009] [Accepted: 06/04/2009] [Indexed: 12/04/2022] Open
Abstract
Background: In ligand-based screening, as well as in other chemoinformatics applications, one seeks to effectively search large repositories of molecules in order to retrieve molecules that are similar, typically, to a single lead molecule. However, in some cases, multiple molecules from the same family are available to seed the query and search for other members of the same family. Multiple-molecule query methods have been less studied than single-molecule query methods. Furthermore, the previous studies have relied on proprietary data and sometimes have not used proper cross-validation methods to assess the results. In contrast, here we develop and compare multiple-molecule query methods using several large publicly available data sets and background. We also create a framework based on a strict cross-validation protocol to allow unbiased benchmarking for direct comparison in future studies across several performance metrics. Results: Fourteen different multiple-molecule query methods were defined and benchmarked using: (1) 41 publicly available data sets of related molecules with similar biological activity; and (2) publicly available background data sets consisting of up to 175,000 molecules randomly extracted from the ChemDB database and other sources. Eight of the fourteen methods were parameter free, and six of them fit one or two free parameters to the data using a careful cross-validation protocol. All the methods were assessed and compared for their ability to retrieve members of the same family against the background data set by using several performance metrics including the Area Under the Accumulation Curve (AUAC), Area Under the Curve (AUC), F1-measure, and BEDROC metrics. Consistent with the previous literature, the best parameter-free methods are the MAX-SIM and MIN-RANK methods, which score a molecule to a family by the maximum similarity, or minimum ranking, obtained across the family. Two new parameterized methods introduced in this study and one previously defined method, the Exponential Tanimoto Discriminant (ETD), the Tanimoto Power Discriminant (TPD), and the Binary Kernel Discriminant (BKD), outperform most other methods but are more complex, requiring one or two parameters to be fit to the data. Conclusion: Fourteen methods for multiple-molecule querying of chemical databases, including the novel methods ETD and TPD, are validated using publicly available data sets, standard cross-validation protocols, and established metrics. The best results are obtained with ETD, TPD, BKD, MAX-SIM, and MIN-RANK. These results can be replicated and compared with the results of future studies using data freely downloadable from http://cdb.ics.uci.edu/.
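The two parameter-free scores named in the abstract, MAX-SIM and MIN-RANK, can be sketched as follows. The sketch assumes a precomputed similarity matrix with one row per family (query) molecule and one column per database molecule; the function names and the toy values are hypothetical and are not taken from the paper.

```python
import numpy as np

def max_sim(S):
    """MAX-SIM: score each database molecule by its highest similarity to any family member."""
    return S.max(axis=0)

def min_rank(S):
    """MIN-RANK: score each database molecule by its best (lowest) rank across family members.

    Rank 1 is the database molecule most similar to a given family member.
    """
    # Sort each row by decreasing similarity; a stable sort keeps ties in input order.
    order = np.argsort(-S, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(S.shape[0])[:, None]
    # Invert the permutation so ranks[i, j] is the rank of molecule j for family member i.
    ranks[rows, order] = np.arange(1, S.shape[1] + 1)
    return ranks.min(axis=0)

# Toy similarity matrix: 2 family members (rows) x 4 database molecules (columns).
S = np.array([[0.9, 0.2, 0.4, 0.1],
              [0.3, 0.8, 0.5, 0.2]])
max_scores = max_sim(S)     # higher is better
rank_scores = min_rank(S)   # lower is better
```

Both scores reward a database molecule for being close to at least one family member, so a single strong match is enough to place it near the top of the list.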
Affiliation(s)
- Ramzi J Nasr
- The Bren School of Information and Computer Science, Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697-3435, USA.
|
31
|
|
32
|
Ma XH, Wang R, Yang SY, Li ZR, Xue Y, Wei YC, Low BC, Chen YZ. Evaluation of virtual screening performance of support vector machines trained by sparsely distributed active compounds. J Chem Inf Model 2008; 48:1227-37. [PMID: 18533644 DOI: 10.1021/ci800022e] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.0] [Indexed: 01/04/2023]
Abstract
Virtual screening performance of support vector machines (SVM) depends on the diversity of training active and inactive compounds. While diverse inactive compounds can be routinely generated, the number and diversity of known actives are typically low. We evaluated the performance of SVM trained by sparsely distributed actives in six MDDR biological target classes composed of a high number of known actives (983-1645) of high, intermediate, and low structural diversity (muscarinic M1 receptor agonists, NMDA receptor antagonists, thrombin inhibitors, HIV protease inhibitors, cephalosporins, and renin inhibitors). SVM trained by regularly sparse data sets of 100 actives show improved yields at substantially reduced false-hit rates compared to those of published studies and those of Tanimoto-based similarity searching method based on the same data sets and molecular descriptors. SVM trained by very sparse data sets of 40 actives (2.4%-4.1% of the known actives) predicted 17.5-39.5%, 23.0-48.1%, and 70.2-92.4% of the remaining 943-1605 actives in the high, intermediate, and low diversity classes, respectively, 13.8-68.7% of which are outside the training compound families. SVM predicted 99.97% and 97.1% of the 9.997 M PUBCHEM and 167K remaining MDDR compounds as inactive and 2.6%-8.3% of the 19,495-38,483 MDDR compounds similar to the known actives as active. These suggest that SVM has substantial capability in identifying novel active compounds from sparse active data sets at low false-hit rates.
Affiliation(s)
- X H Ma
- Centre for Computational Science and Engineering, National University of Singapore, Singapore
|
33
|
Reid D, Sadjad BS, Zsoldos Z, Simon A. LASSO—ligand activity by surface similarity order: a new tool for ligand based virtual screening. J Comput Aided Mol Des 2008; 22:479-87. [DOI: 10.1007/s10822-007-9164-5] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 10/01/2007] [Accepted: 12/18/2007] [Indexed: 10/22/2022]
|
34
|
Han LY, Ma XH, Lin HH, Jia J, Zhu F, Xue Y, Li ZR, Cao ZW, Ji ZL, Chen YZ. A support vector machines approach for virtual screening of active compounds of single and multiple mechanisms from large libraries at an improved hit-rate and enrichment factor. J Mol Graph Model 2007; 26:1276-86. [PMID: 18218332 DOI: 10.1016/j.jmgm.2007.12.002] [Citation(s) in RCA: 65] [Impact Index Per Article: 3.8] [Received: 04/29/2007] [Revised: 12/05/2007] [Accepted: 12/05/2007] [Indexed: 01/04/2023]
Abstract
Support vector machines (SVM) and other machine-learning (ML) methods have been explored as ligand-based virtual screening (VS) tools for facilitating lead discovery. While exhibiting good hit selection performance, in screening large compound libraries these methods tend to produce lower hit-rates than the best performing VS tools, partly because their training-sets contain a limited spectrum of inactive compounds. We tested whether the performance of SVM can be improved by using training-sets of diverse inactive compounds. In retrospective database screening of active compounds of single mechanism (HIV protease inhibitors, DHFR inhibitors, dopamine antagonists) and multiple mechanisms (CNS active agents) from large libraries of 2.986 million compounds, the yields, hit-rates, and enrichment factors of our SVM models are 52.4-78.0%, 4.7-73.8%, and 214-10,543, respectively, compared to those of 62-95%, 0.65-35%, and 20-1200 by structure-based VS and 55-81%, 0.2-0.7%, and 110-795 by other ligand-based VS tools in screening libraries of ≥1 million compounds. The hit-rates are comparable and the enrichment factors are substantially better than the best results of other VS tools. 24.3-87.6% of the predicted hits are outside the known hit families. SVM appears to be potentially useful for facilitating lead discovery in VS of large compound libraries.
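The three figures of merit quoted above can be related with a short worked sketch. It assumes the conventional definitions (yield as the fraction of known actives recovered, hit-rate as the fraction of predicted hits that are known actives, and enrichment factor as the hit-rate divided by the fraction of actives in the whole library); the abstract does not spell these out, and the function name and the counts below are hypothetical.

```python
def screening_metrics(n_library, n_actives, n_predicted_hits, n_true_hits):
    """Return (yield, hit_rate, enrichment_factor) for one retrospective screen.

    n_library:        total compounds screened
    n_actives:        known actives hidden in the library
    n_predicted_hits: compounds the model flagged as active
    n_true_hits:      flagged compounds that are known actives
    """
    yield_fraction = n_true_hits / n_actives
    hit_rate = n_true_hits / n_predicted_hits
    # Enrichment factor = hit_rate / (n_actives / n_library), written as one
    # ratio of integer products to avoid an intermediate rounding step.
    enrichment = (n_true_hits * n_library) / (n_predicted_hits * n_actives)
    return yield_fraction, hit_rate, enrichment

# Hypothetical screen: 1,000,000 compounds, 500 known actives,
# 2,000 predicted hits of which 300 are known actives.
yield_fraction, hit_rate, enrichment = screening_metrics(1_000_000, 500, 2_000, 300)
```

Because the enrichment factor divides by the base rate of actives, very large libraries with few actives can produce enrichment factors in the hundreds or thousands even at modest hit-rates.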
Affiliation(s)
- L Y Han
- Bioinformatics and Drug Design Group, Department of Pharmacy, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore 117543, Singapore
|
35
|
Li H, Yap CW, Ung CY, Xue Y, Li ZR, Han LY, Lin HH, Chen YZ. Machine learning approaches for predicting compounds that interact with therapeutic and ADMET related proteins. J Pharm Sci 2007; 96:2838-60. [PMID: 17786989 DOI: 10.1002/jps.20985] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.7] [Indexed: 02/06/2023]
Abstract
Computational methods for predicting compounds with specific pharmacodynamic and ADMET (absorption, distribution, metabolism, excretion and toxicity) properties are useful for facilitating drug discovery and evaluation. Recently, machine learning methods such as neural networks and support vector machines have been explored for predicting inhibitors, antagonists, blockers, agonists, activators and substrates of proteins related to specific therapeutic and ADMET properties. These methods are particularly useful for compounds of diverse structures to complement QSAR methods, and for cases of unavailable receptor 3D structure to complement structure-based methods. A number of studies have demonstrated the potential of these methods for predicting such compounds as substrates of P-glycoprotein and cytochrome P450 CYP isoenzymes, inhibitors of protein kinases and CYP isoenzymes, and agonists of serotonin receptor and estrogen receptor. This article reviews the strategies, current progress and underlying difficulties in using machine learning methods for predicting these protein binders and as potential virtual screening tools. Algorithms for proper representation of the structural and physicochemical properties of compounds are also evaluated.
Affiliation(s)
- H Li
- Bioinformatics and Drug Design Group, Department of Pharmacy and Department of Computational Science, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore 117543, Singapore
|
36
|
Willett P, Wilton D, Hartzoulakis B, Tang R, Ford J, Madge D. Prediction of Ion Channel Activity Using Binary Kernel Discrimination. J Chem Inf Model 2007; 47:1961-6. [PMID: 17622131 DOI: 10.1021/ci700087v] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.3] [Indexed: 01/21/2023]
Abstract
Voltage-gated ion channels are a diverse family of pharmaceutically important membrane proteins for which limited 3D information is available. A number of virtual screening tools have been used to assist with the discovery of new leads and with the analysis of screening results. One such tool, and the subject of this paper, is binary kernel discrimination (BKD), a machine-learning approach that has recently been applied to applications in chemoinformatics. It uses a training set of compounds, for which both structural and qualitative activity data are known, to produce a model that can then be used to rank another set of compounds in order of likely activity. Here, we report the use of BKD to build models for the prediction of five different ion channel targets using two types of activity data. The results obtained suggest that the approach provides an effective way of prioritizing compounds for acquisition and testing.
Affiliation(s)
- Peter Willett
- Department of Information Studies, University of Sheffield, 211 Portobello Street, Sheffield S1 4DP, United Kingdom.
|
37
|
Pasupa K, Harrison RF, Willett P. Parsimonious Kernel Fisher Discrimination. Pattern Recognition and Image Analysis 2007. [DOI: 10.1007/978-3-540-72847-4_68] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022]
|
38
|
Eckert H, Bajorath J. Molecular similarity analysis in virtual screening: foundations, limitations and novel approaches. Drug Discov Today 2007; 12:225-33. [PMID: 17331887 DOI: 10.1016/j.drudis.2007.01.011] [Citation(s) in RCA: 312] [Impact Index Per Article: 18.4] [Received: 11/19/2006] [Revised: 12/22/2006] [Accepted: 01/23/2007] [Indexed: 11/27/2022]
Abstract
The success of ligand-based virtual-screening calculations is influenced highly by the nature of target-specific structure-activity relationships. This might pose severe constraints on the ability to recognize diverse structures with similar activity. Accordingly, the performance of similarity-based methods strongly depends on the class of compound that is studied, and approaches of different design and complexity often produce, overall, equally good (or bad) results. However, it is also found that there is often little overlap in the similarity relationships detected by different approaches, which rationalizes the need to develop alternative similarity methods. Among others, these include novel algorithms to navigate high-dimensional chemical spaces, train similarity calculations on specific compound classes, and detect remote similarity relationships.
Affiliation(s)
- Hanna Eckert
- Department of Life Science Informatics, B-IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Dahlmannstrasse 2, D-53113 Bonn, Germany
|
39
|
Jensen BF, Vind C, Padkjaer SB, Brockhoff PB, Refsgaard HHF. In Silico Prediction of Cytochrome P450 2D6 and 3A4 Inhibition Using Gaussian Kernel Weighted k-Nearest Neighbor and Extended Connectivity Fingerprints, Including Structural Fragment Analysis of Inhibitors versus Noninhibitors. J Med Chem 2007; 50:501-11. [PMID: 17266202 DOI: 10.1021/jm060333s] [Citation(s) in RCA: 87] [Impact Index Per Article: 5.1] [Indexed: 02/03/2023]
Abstract
Inhibition of cytochrome P450 (CYP) enzymes is unwanted because of the risk of severe side effects due to drug-drug interactions. We present two in silico Gaussian kernel weighted k-nearest neighbor models based on extended connectivity fingerprints that classify CYP2D6 and CYP3A4 inhibition. Data used for modeling consisted of diverse sets of 1153 and 1382 drug candidates tested for CYP2D6 and CYP3A4 inhibition in human liver microsomes. For CYP2D6, 82% of the classified test set compounds were predicted to the correct class. For CYP3A4, 88% of the classified compounds were correctly classified. CYP2D6 and CYP3A4 inhibition were additionally classified for an external test set on 14 drugs, and multidimensional scaling plots showed that the drugs in the external test set were in the periphery of the training sets. Furthermore, fragment analyses were performed and structural fragments frequent in CYP2D6 and CYP3A4 inhibitors and noninhibitors are presented.
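A Gaussian kernel weighted k-nearest neighbor classifier of the general kind named in the abstract can be sketched as below. The abstract does not give the distance measure, the kernel width, or k, so the Tanimoto distance on binary fingerprints, `sigma=0.3`, and `k=3` are assumptions, as are the function names and the toy fingerprints.

```python
import numpy as np

def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for two 0/1 integer fingerprint vectors."""
    both = np.sum(a & b)
    either = np.sum(a | b)
    return 0.0 if either == 0 else 1.0 - both / either

def predict_weighted_knn(X_train, y_train, x, k=3, sigma=0.3):
    """Classify x by a Gaussian-weighted vote among its k nearest training fingerprints."""
    d = np.array([tanimoto_distance(row, x) for row in X_train])
    nearest = np.argsort(d, kind="stable")[:k]
    # Closer neighbours get exponentially larger weights.
    w = np.exp(-d[nearest] ** 2 / (2 * sigma ** 2))
    votes = {}
    for label, weight in zip(y_train[nearest], w):
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)

# Toy data: 1 = inhibitor, 0 = noninhibitor (hypothetical 6-bit fingerprints).
X_train = np.array([[1, 1, 0, 0, 1, 0],
                    [1, 1, 1, 0, 0, 0],
                    [0, 0, 0, 1, 1, 1],
                    [0, 0, 1, 1, 0, 1]])
y_train = np.array([1, 1, 0, 0])
pred_a = predict_weighted_knn(X_train, y_train, np.array([1, 1, 0, 0, 0, 0]))
pred_b = predict_weighted_knn(X_train, y_train, np.array([0, 0, 0, 1, 0, 1]))
```

Weighting the vote by distance means a distant third neighbour contributes almost nothing, which makes the result less sensitive to the exact choice of k than an unweighted vote.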
Affiliation(s)
- Berith F Jensen
- Exploratory ADME, Diabetes Research Unit, Novo Nordisk A/S, 2760 Måløv, Denmark
|
40
|
Chen B, Harrison RF, Papadatos G, Willett P, Wood DJ, Lewell XQ, Greenidge P, Stiefl N. Evaluation of machine-learning methods for ligand-based virtual screening. J Comput Aided Mol Des 2007; 21:53-62. [PMID: 17205373 DOI: 10.1007/s10822-006-9096-5] [Citation(s) in RCA: 93] [Impact Index Per Article: 5.5] [Received: 11/02/2006] [Accepted: 12/04/2006] [Indexed: 01/28/2023]
Abstract
Machine-learning methods can be used for virtual screening by analysing the structural characteristics of molecules of known (in)activity, and we here discuss the use of kernel discrimination and naive Bayesian classifier (NBC) methods for this purpose. We report a kernel method that allows the processing of molecules represented by binary, integer and real-valued descriptors, and show that it is little different in screening performance from a previously described kernel that had been developed specifically for the analysis of binary fingerprint representations of molecular structure. We then evaluate the performance of an NBC when the training-set contains only a very few active molecules. In such cases, a simpler approach based on group fusion would appear to provide superior screening performance, especially when structurally heterogeneous datasets are to be processed.
Affiliation(s)
- Beining Chen
- Department of Chemistry, University of Sheffield, Western Bank, Sheffield, UK
|
41
|
Willett P. Similarity-based virtual screening using 2D fingerprints. Drug Discov Today 2006; 11:1046-53. [PMID: 17129822 DOI: 10.1016/j.drudis.2006.10.005] [Citation(s) in RCA: 547] [Impact Index Per Article: 30.4] [Received: 04/18/2006] [Revised: 09/04/2006] [Accepted: 10/09/2006] [Indexed: 11/19/2022]
Abstract
This paper summarizes recent work at the University of Sheffield on virtual screening methods that use 2D fingerprint measures of structural similarity. A detailed comparison of a large number of similarity coefficients demonstrates that the well-known Tanimoto coefficient remains the method of choice for the computation of fingerprint-based similarity, despite possessing some inherent biases related to the sizes of the molecules that are being sought. Group fusion involves combining the results of similarity searches based on multiple reference structures and a single similarity measure. We demonstrate the effectiveness of this approach to screening, and also describe an approximate form of group fusion, turbo similarity searching, that can be used when just a single reference structure is available.
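The Tanimoto coefficient mentioned above is the number of bits two fingerprints share divided by the number of bits set in either. A minimal sketch of a single-reference similarity search follows; it assumes fingerprints are stored as Python sets of on-bit indices, and the function names, molecule names, and bit values are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bit indices."""
    shared = len(a & b)
    total = len(a) + len(b) - shared
    return shared / total if total else 0.0

def rank_by_similarity(reference, database):
    """Return (name, score) pairs sorted from most to least similar to the reference."""
    scored = [(name, tanimoto(reference, fp)) for name, fp in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

reference = {1, 4, 7, 9}
database = {
    "mol_a": {1, 4, 7, 9, 12},  # shares all four reference bits
    "mol_b": {1, 4},            # small molecule sharing two bits
    "mol_c": {2, 3, 5},         # no bits in common
}
ranking = rank_by_similarity(reference, database)
```

Because the denominator grows with the number of bits set, the score depends on fingerprint size as well as on shared features, which relates to the molecule-size bias the abstract notes.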
Affiliation(s)
- Peter Willett
- Krebs Institute for Biomolecular Research and Department of Information Studies, University of Sheffield, 211 Portobello, Sheffield S1 4DP, UK.
|
42
|
Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res 2006; 35:D198-201. [PMID: 17145705 PMCID: PMC1751547 DOI: 10.1093/nar/gkl999] [Citation(s) in RCA: 1216] [Impact Index Per Article: 67.6] [Indexed: 11/21/2022]
Abstract
BindingDB is a publicly accessible database currently containing ∼20 000 experimentally determined binding affinities of protein–ligand complexes, for 110 protein targets including isoforms and mutational variants, and ∼11 000 small molecule ligands. The data are extracted from the scientific literature, data collection focusing on proteins that are drug-targets or candidate drug-targets and for which structural data are present in the Protein Data Bank. The BindingDB website supports a range of query types, including searches by chemical structure, substructure and similarity; protein sequence; ligand and protein names; affinity ranges and molecular weight. Data sets generated by BindingDB queries can be downloaded in the form of annotated SDfiles for further analysis, or used as the basis for virtual screening of a compound database uploaded by the user. The data in BindingDB are linked both to structural data in the PDB via PDB IDs and chemical and sequence searches, and to the literature in PubMed via PubMed IDs.
Affiliation(s)
- Michael K. Gilson
- To whom correspondence should be addressed. Tel: +1 240 314 6217; Fax: +1 240 314 6255
|
43
|
Willett P. Enhancing the Effectiveness of Ligand-Based Virtual Screening Using Data Fusion. QSAR Comb Sci 2006. [DOI: 10.1002/qsar.200610084] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.3] [Indexed: 11/06/2022]
|
44
|
Ganguly M, Brown N, Schuffenhauer A, Ertl P, Gillet VJ, Greenidge PA. Introducing the consensus modeling concept in genetic algorithms: application to interpretable discriminant analysis. J Chem Inf Model 2006; 46:2110-24. [PMID: 16995742 DOI: 10.1021/ci050529l] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Indexed: 11/29/2022]
Abstract
An evolutionary statistical learning method was applied to classify drugs according to their biological target and also to discriminate between a compilation of oral and nonoral drugs. The emphasis was placed not only on how well the models predict but also on their interpretability. In an enhancement to previous studies, the consistency of the model weights over several runs of the genetic algorithm was considered with the goal of producing comprehensible models. Via this approach, the descriptors and their ranges that contribute most to class discrimination were identified. Selecting a bin step size that enables the average descriptor properties of the class being trained to be captured improves the interpretability and discriminatory power of a model. The performance, consistency, and robustness of such models were further enhanced by using two novel approaches that reduce the variability between individual solutions: consensus and splice modeling. Finally, the ability of the genetic algorithm to discriminate between activity classes was compared with a similarity searching method, while naïve Bayes classifiers and support vector machines were applied in discriminating the oral and nonoral drugs.
Affiliation(s)
- Milan Ganguly
- Novartis Institutes for BioMedical Research, Basel, CH-4002, Switzerland
|
45
|
Auer J, Bajorath J. Emerging Chemical Patterns: A New Methodology for Molecular Classification and Compound Selection. J Chem Inf Model 2006; 46:2502-14. [PMID: 17125191 DOI: 10.1021/ci600301t] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.4] [Indexed: 11/28/2022]
Abstract
A concept termed Emerging Chemical Patterns (ECPs) is introduced as a novel approach to molecular classification. The methodology makes it possible to extract key molecular features from very few known active compounds and classify molecules according to different potency levels. The approach was developed in light of the situation often faced during the early stages of lead optimization efforts: too few active reference molecules are available to build computational models for the prediction of potent compounds. The ECP method generates high-resolution signatures of active compounds. Predictive ECP models can be built based on the information provided by sets of only three molecules with potency in the nanomolar and micromolar range. In addition to individual compound predictions, an iterative ECP scheme has been designed. When applied to different sets of active molecules, iterative ECP classification produced compound selection sets with increases in average potency of up to 3 orders of magnitude.
Affiliation(s)
- Jens Auer
- Department of Life Science Informatics, B-IT, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstrasse 2, D-53113 Bonn, Germany
|
46
|
Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E, Schuffenhauer A. New methods for ligand-based virtual screening: use of data fusion and machine learning to enhance the effectiveness of similarity searching. J Chem Inf Model 2006; 46:462-70. [PMID: 16562973 DOI: 10.1021/ci050348j] [Citation(s) in RCA: 165] [Impact Index Per Article: 9.2] [Indexed: 11/28/2022]
Abstract
Similarity searching using a single bioactive reference structure is a well-established technique for accessing chemical structure databases. This paper describes two extensions of the basic approach. First, we discuss the use of group fusion to combine the results of similarity searches when multiple reference structures are available. We demonstrate that this technique is notably more effective than conventional similarity searching in scaffold-hopping searches for structurally diverse sets of active molecules; conversely, the technique will do little to improve the search performance if the actives are structurally homogeneous. Second, we make the assumption that the nearest neighbors resulting from a similarity search, using a single bioactive reference structure, are also active and use this assumption to implement approximate forms of group fusion, substructural analysis, and binary kernel discrimination. This approach, called turbo similarity searching, is notably more effective than conventional similarity searching.
Affiliation(s)
- Jérôme Hert
- Krebs Institute for Biomolecular Research and Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, U.K.
|
47
|
Eckert H, Vogt I, Bajorath J. Mapping Algorithms for Molecular Similarity Analysis and Ligand-Based Virtual Screening: Design of DynaMAD and Comparison with MAD and DMC. J Chem Inf Model 2006; 46:1623-34. [PMID: 16859294 DOI: 10.1021/ci060083o] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Indexed: 11/28/2022]
Abstract
Here, we introduce the DynaMAD algorithm that is designed to map database compounds to combinations of activity-class-dependent descriptor value ranges in order to identify novel active molecules. The method combines and extends key features of two previously developed algorithms, MAD and DMC. These methods were first described as compound-mapping algorithms for large-scale virtual screening applications. DynaMAD and DMC operate in chemical spaces of stepwise increasing dimensionality. However, in contrast to DMC, which utilizes binary transformed descriptors, DynaMAD uses unmodified descriptor value distributions. The performance of these mapping methods was compared in detail in virtual screening trials on 24 different compound activity classes against a background of about 2 million database compounds. In these calculations, all three approaches produced results of considerable predictive value, and the enrichment of active molecules in small selection sets consisting of only about 20 or fewer database compounds emerged as a common feature. Furthermore, mapping methods were capable of recognizing remote molecular similarity relationships. Overall, DynaMAD performed better than MAD and DMC, producing average hit and recovery rates of 55% and 33%, respectively, over all 24 classes. Taken together, our findings suggest that dynamic compound mapping to combinations of activity-class-selective descriptor settings has significant potential for molecular similarity analysis and ligand-based virtual screening.
Affiliation(s)
- Hanna Eckert
- Department of Life Science Informatics, B-IT, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, D-53113 Bonn, Germany
|
48
|
Wilton DJ, Harrison RF, Willett P, Delaney J, Lawson K, Mullier G. Virtual Screening Using Binary Kernel Discrimination: Analysis of Pesticide Data. J Chem Inf Model 2006; 46:471-7. [PMID: 16562974 DOI: 10.1021/ci050397w] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.1] [Indexed: 11/28/2022]
Abstract
This paper discusses the use of binary kernel discrimination (BKD) for identifying potential active compounds in lead-discovery programs. BKD was compared with established virtual screening methods in a series of experiments using pesticide data from the Syngenta corporate database. It was found to be superior to methods based on similarity searching and substructural analysis but inferior to a support vector machine. Similar conclusions resulted from application of the methods to a pesticide data set for which categorical activity data were available.
Affiliation(s)
- David J Wilton
- Department of Information Studies, University of Sheffield, Sheffield S10 2TN, UK
|
49
|
Chen B, Harrison RF, Pasupa K, Willett P, Wilton DJ, Wood DJ, Lewell XQ. Virtual Screening Using Binary Kernel Discrimination: Effect of Noisy Training Data and the Optimization of Performance. J Chem Inf Model 2006; 46:478-86. [PMID: 16562975 DOI: 10.1021/ci0505426] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Indexed: 11/29/2022]
Abstract
Binary kernel discrimination (BKD) uses a training set of compounds, for which structural and qualitative activity data are available, to produce a model that can then be applied to the structures of other compounds in order to predict their likely activity. Experiments with the MDL Drug Data Report database show that the optimal value of the smoothing parameter, and hence the predictive power of BKD, is crucially dependent on the number of false positives in the training set. It is also shown that the best results for BKD are achieved using one particular optimization method for the determination of the smoothing parameter that lies at the heart of the method and using the Jaccard/Tanimoto coefficient in the kernel function that is used to compute the similarity between a test set molecule and the members of the training set.
Affiliation(s)
- Beining Chen
- Department of Chemistry, University of Sheffield, Sheffield S10 2TN, UK
|
50
|
Capelli AM, Feriani A, Tedesco G, Pozzan A. Generation of a Focused Set of GSK Compounds Biased toward Ligand-Gated Ion-Channel Ligands. J Chem Inf Model 2006; 46:659-64. [PMID: 16562996 DOI: 10.1021/ci050353n] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Abstract
A "data mining" methodology based on substructural analysis and standard 1024 Daylight fingerprints as descriptors was applied to a set of known antagonists of a subfamily of ligand-gated ion channels comprising nicotinic acetylcholine receptors (nAChR's), 5-hydroxytryptamine, gamma-amino butyric acid-A, and glycine receptors. The derived scoring function was used to generate a focused set that was screened for alpha7 nAChR, resulting in the identification of novel alpha7 ligands easily amenable to chemical modification. Finally, the same scoring function was applied retrospectively to other in-house sets screened for the same target in the same assay. The results and performance of the method are described in detail.
Affiliation(s)
- Anna Maria Capelli
- Computational, Analytical, and Structural Sciences, GlaxoSmithKline S.p.A. Medicines Research Centre, Via A. Fleming, 4-37135 Verona, Italy.
|