1
Luo D, Zhao T, Cheng W, Xu D, Han F, Yu W, Liu X, Chen H, Zhang X. Towards Inductive and Efficient Explanations for Graph Neural Networks. IEEE Trans Pattern Anal Mach Intell 2024; 46:5245-5259. [PMID: 38319773 DOI: 10.1109/tpami.2024.3362584] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/08/2024]
Abstract
Despite recent progress in Graph Neural Networks (GNNs), explaining predictions made by GNNs remains a challenging and nascent problem. The leading method mainly considers local explanations, i.e., the important subgraph structure and node features, to interpret why a GNN model makes its prediction for a single instance, e.g., a node or a graph. As a result, the generated explanation is painstakingly customized at the instance level. Explanations that interpret each instance independently are not sufficient to provide a global understanding of the learned GNN model, which limits generalizability and hinders use in the inductive setting. Besides, training a separate explanation model for each instance is time-consuming for large-scale real-life datasets. In this study, we address these key challenges and propose PGExplainer, a parameterized explainer for GNNs. PGExplainer adopts a deep neural network to parameterize the generation process of explanations, which renders PGExplainer a natural approach to multi-instance explanations. Compared to the existing work, PGExplainer has better generalization ability and can be utilized in an inductive setting without training the model for new instances. Thus, PGExplainer is much more efficient than the leading method, with significant speed-up. In addition, the explanation networks can also be utilized as a regularizer to improve the generalization power of existing GNNs when jointly trained with downstream tasks. Experiments on both synthetic and real-life datasets show highly competitive performance, with up to 24.7% relative improvement in AUC on explaining graph classification over the leading baseline.
2
Siddique F, Anwaar A, Bashir M, Nadeem S, Rawat R, Eyupoglu V, Afzal S, Bibi M, Bin Jardan YA, Bourhia M. Revisiting methotrexate and phototrexate Zinc15 library-based derivatives using deep learning in-silico drug design approach. Front Chem 2024; 12:1380266. [PMID: 38576849 PMCID: PMC10991842 DOI: 10.3389/fchem.2024.1380266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2024] [Accepted: 03/05/2024] [Indexed: 04/06/2024] [Open Access]
Abstract
Introduction: Cancer is the second most prevalent cause of mortality in the world, despite the availability of several medications for cancer treatment. Therefore, the cancer research community has emphasized computational techniques to speed up the discovery of novel anticancer drugs. Methods: In the current study, QSAR-based virtual screening was performed on the Zinc15 compound library (271 derivatives of methotrexate (MTX) and phototrexate (PTX)) to predict their inhibitory activity against dihydrofolate reductase (DHFR), a potential anticancer drug target. The deep learning-based ADMET parameters were employed to generate a 2D QSAR model using the multiple linear regression (MLR) method, with leave-one-out cross-validated (LOO-CV) Q2 and correlation coefficient R2 values as high as 0.77 and 0.81, respectively. Results: From the QSAR model and virtual screening analysis, the top hits (09, 27, 41, 68, 74, 85, 99, 180) exhibited pIC50 ranging from 5.85 to 7.20 with a minimum binding score of -11.6 to -11.0 kcal/mol and were subjected to further investigation. The ADMET attributes obtained using the message-passing neural network (MPNN) model demonstrated the potential of the selected hits as oral medications based on their lipophilic profile, Log P (0.19-2.69), and bioavailability (76.30% to 78.46%). The clinical toxicity score was 31.24% to 35.30%, with the lowest toxicity score (8.30%) observed for compound 180. DFT calculations were carried out to determine the stability, physicochemical parameters and chemical reactivity of the selected compounds. The docking results were further validated by 100 ns molecular dynamics simulation analysis. Conclusion: The promising lead compounds identified compare favorably with the standard reference drugs MTX and PTX for anticancer activity and can lead to novel therapies after experimental validation. Furthermore, it is suggested that the inhibitory potential of the identified hits be confirmed via in-vitro and in-vivo approaches.
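The two statistics quoted in the Methods (R2 on the fitted data and the leave-one-out cross-validated Q2) can be illustrated with a minimal sketch. It assumes ordinary least squares as the multiple linear regression and the common convention Q2 = 1 - PRESS / SS_tot; the function names and the descriptor matrix are hypothetical and are not taken from the paper.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least squares with an intercept column (assumed MLR variant)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    """Apply fitted coefficients to new descriptor rows."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


def r2_and_loo_q2(X, y):
    """Return (R2 on the training fit, leave-one-out Q2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    ss_tot = np.sum((y - y.mean()) ** 2)

    # R2: residuals of the model fitted on all compounds.
    fitted = predict_mlr(fit_mlr(X, y), X)
    r2 = 1.0 - np.sum((y - fitted) ** 2) / ss_tot

    # Q2: each compound is predicted by a model trained without it.
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef = fit_mlr(X[keep], y[keep])
        press += (y[i] - predict_mlr(coef, X[i:i + 1])[0]) ** 2
    q2 = 1.0 - press / ss_tot
    return r2, q2
```

For least-squares fits the leave-one-out residuals are never smaller in magnitude than the fitted residuals, so Q2 cannot exceed R2, which matches the ordering of the reported values (0.77 and 0.81).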
Affiliation(s)
- Farhan Siddique
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, China
  - Department of Pharmaceutical Chemistry, Faculty of Pharmacy, Bahauddin Zakariya University, Multan, Pakistan
- Ahmar Anwaar
  - Faculty of Pharmacy, Bahauddin Zakariya University, Multan, Pakistan
- Maryam Bashir
  - Department of Pharmaceutical Chemistry, Faculty of Pharmacy, Bahauddin Zakariya University, Multan, Pakistan
  - Southern Punjab Institute of Health Sciences, Multan, Pakistan
- Sumaira Nadeem
  - Department of Pharmacy, The Women University, Multan, Pakistan
- Ravi Rawat
  - School of Health Sciences & Technology, UPES University, Dehradun, India
- Volkan Eyupoglu
  - Department of Chemistry, Çankırı Karatekin University, Çankırı, Türkiye
- Samina Afzal
  - Department of Pharmaceutical Chemistry, Faculty of Pharmacy, Bahauddin Zakariya University, Multan, Pakistan
- Mehvish Bibi
  - Department of Pharmaceutical Chemistry, Faculty of Pharmacy, Bahauddin Zakariya University, Multan, Pakistan
- Yousef A. Bin Jardan
  - Department of Pharmaceutics, College of Pharmacy, King Saud University, Riyadh, Saudi Arabia
- Mohammed Bourhia
  - Laboratory of Biotechnology and Natural Resources Valorization, Faculty of Sciences, Ibn Zohr University, Agadir, Morocco
3
Liu M, Srivastava G, Ramanujam J, Brylinski M. Augmented drug combination dataset to improve the performance of machine learning models predicting synergistic anticancer effects. Sci Rep 2024; 14:1668. [PMID: 38238448 PMCID: PMC10796434 DOI: 10.1038/s41598-024-51940-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Accepted: 01/11/2024] [Indexed: 01/22/2024] [Open Access]
Abstract
Combination therapy has gained popularity in cancer treatment as it enhances the treatment efficacy and overcomes drug resistance. Although machine learning (ML) techniques have become an indispensable tool for discovering new drug combinations, the data on drug combination therapy currently available may be insufficient to build high-precision models. We developed a data augmentation protocol to unbiasedly scale up the existing anti-cancer drug synergy dataset. Using a new drug similarity metric, we augmented the synergy data by substituting a compound in a drug combination instance with another molecule that exhibits highly similar pharmacological effects. Using this protocol, we were able to upscale the AZ-DREAM Challenges dataset from 8798 to 6,016,697 drug combinations. Comprehensive performance evaluations show that ML models trained on the augmented data consistently achieve higher accuracy than those trained solely on the original dataset. Our data augmentation protocol provides a systematic and unbiased approach to generating more diverse and larger-scale drug combination datasets, enabling the development of more precise and effective ML models. The protocol presented in this study could serve as a foundation for future research aimed at discovering novel and effective drug combinations for cancer treatment.
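The substitution idea described above can be sketched as follows. This is only an illustration under stated assumptions: the similarity table, the 0.9 threshold, and the choice to copy the synergy score unchanged to each substituted record are hypothetical, and the paper's actual similarity metric is not reproduced here.

```python
def augment_by_substitution(combos, similar, threshold=0.9):
    """Expand drug-combination records by swapping in similar drugs.

    combos:  list of (drug_a, drug_b, cell_line, synergy) tuples.
    similar: dict mapping a drug to a list of (other_drug, similarity) pairs
             (hypothetical stand-in for a pharmacological similarity metric).
    """
    # Track unordered drug pairs per cell line so no duplicate is emitted.
    seen = {(frozenset((a, b)), cell) for a, b, cell, _ in combos}
    augmented = list(combos)

    for a, b, cell, synergy in combos:
        for position, drug in enumerate((a, b)):
            for other, score in similar.get(drug, []):
                if score < threshold:
                    continue  # not similar enough to substitute
                pair = (other, b) if position == 0 else (a, other)
                if pair[0] == pair[1]:
                    continue  # a drug paired with itself is not a combination
                key = (frozenset(pair), cell)
                if key in seen:
                    continue
                seen.add(key)
                # Assumption: the substituted record inherits the synergy score.
                augmented.append((pair[0], pair[1], cell, synergy))
    return augmented
```

Only the original records are used as substitution seeds, so each new record differs from a measured one by a single compound.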
Affiliation(s)
- Mengmeng Liu
  - Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA, 70803, USA
- Gopal Srivastava
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, 70803, USA
- J Ramanujam
  - Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA, 70803, USA
  - Center for Computation and Technology, Louisiana State University, Baton Rouge, LA, 70803, USA
- Michal Brylinski
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, 70803, USA
  - Center for Computation and Technology, Louisiana State University, Baton Rouge, LA, 70803, USA
4
Liu M, Srivastava G, Ramanujam J, Brylinski M. Augmented drug combination dataset to improve the performance of machine learning models predicting synergistic anticancer effects. Res Sq 2023:rs.3.rs-3481858. [PMID: 37961281 PMCID: PMC10635365 DOI: 10.21203/rs.3.rs-3481858/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
Combination therapy has gained popularity in cancer treatment as it enhances the treatment efficacy and overcomes drug resistance. Although machine learning (ML) techniques have become an indispensable tool for discovering new drug combinations, the data on drug combination therapy currently available may be insufficient to build high-precision models. We developed a data augmentation protocol to unbiasedly scale up the existing anti-cancer drug synergy dataset. Using a new drug similarity metric, we augmented the synergy data by substituting a compound in a drug combination instance with another molecule that exhibits highly similar pharmacological effects. Using this protocol, we were able to upscale the AZ-DREAM Challenges dataset from 8,798 to 6,016,697 drug combinations. Comprehensive performance evaluations show that Random Forest and Gradient Boosting Trees models trained on the augmented data achieve higher accuracy than those trained solely on the original dataset. Our data augmentation protocol provides a systematic and unbiased approach to generating more diverse and larger-scale drug combination datasets, enabling the development of more precise and effective ML models. The protocol presented in this study could serve as a foundation for future research aimed at discovering novel and effective drug combinations for cancer treatment.
5
Wei L, Zhao H, He Z, Yao Q. Neural Architecture Search for GNN-based Graph Classification. ACM Trans Inf Syst 2023. [DOI: 10.1145/3584945] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/22/2023]
Abstract
Graph classification is an important problem with applications across many domains, for which the graph neural networks (GNNs) have been state-of-the-art (SOTA) methods. In the literature, to adopt GNNs for the graph classification task, there are two groups of methods: global pooling and hierarchical pooling. The global pooling methods obtain the graph representation vectors by globally pooling all the node embeddings together at the end of several GNN layers, while the hierarchical pooling methods provide one extra pooling operation between the GNN layers to extract the hierarchical information and improve the graph representations. Both global and hierarchical pooling methods are effective in different scenarios. Due to highly diverse applications, it is challenging to design data-specific pooling methods with human expertise. To address this problem, we propose PAS (Pooling Architecture Search) to design adaptive pooling architectures by using the neural architecture search (NAS). To enable the search space design, we propose a unified pooling framework consisting of four modules: Aggregation, Pooling, Readout, and Merge. Two variants PAS-G and PAS-NE are provided to design the pooling operations in different scales. A set of candidate operations are designed in the search space on top of this framework, and then existing human-designed pooling methods, including global and hierarchical ones, can be incorporated. To enable efficient search, a coarsening strategy is developed to continuously relax the search space, and then a differentiable search method can be adopted. We conduct extensive experiments on six real-world datasets, including the large-scale datasets MR and ogbg-molhiv. Experimental results in this paper demonstrate the effectiveness and efficiency of the proposed PAS in designing the pooling architectures for graph classification. 
Besides, the Top-1 performance on two Open Graph Benchmark (OGB) datasets further indicates the utility of PAS when facing diverse realistic data. The implementation of PAS is available at: https://github.com/AutoML-Research/PAS.
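The distinction between the two pooling families can be made concrete with a toy sketch. It is not the PAS search space: the layer below has no learned weights, and scoring nodes by their feature sum in `topk_pool` is a hypothetical stand-in for a learned scoring function.

```python
import numpy as np


def gnn_layer(adj, h):
    """Mean aggregation over each node's neighbourhood (with self-loop), then ReLU."""
    a_hat = adj + np.eye(adj.shape[0])
    degree = a_hat.sum(axis=1, keepdims=True)
    return np.maximum((a_hat @ h) / degree, 0.0)


def global_readout(h):
    """Collapse all node embeddings into one graph vector."""
    return h.mean(axis=0)


def topk_pool(adj, h, ratio=0.5):
    """Keep the highest-scoring nodes and the subgraph they induce."""
    scores = h.sum(axis=1)  # hypothetical score; real methods learn it
    k = max(1, int(np.ceil(ratio * h.shape[0])))
    keep = np.sort(np.argsort(-scores)[:k])
    return adj[np.ix_(keep, keep)], h[keep]


def global_pooling_model(adj, x, layers=2):
    """Global pooling: a single readout after all GNN layers."""
    h = x
    for _ in range(layers):
        h = gnn_layer(adj, h)
    return global_readout(h)


def hierarchical_pooling_model(adj, x, layers=2, ratio=0.5):
    """Hierarchical pooling: coarsen the graph between GNN layers."""
    h = x
    for _ in range(layers):
        h = gnn_layer(adj, h)
        adj, h = topk_pool(adj, h, ratio)
    return global_readout(h)
```

Both models return a vector of the same dimension as the node features; they differ only in whether the graph is coarsened between layers.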
Affiliation(s)
- Lanning Wei
  - Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China
- Zhiqiang He
  - Institute of Computing Technology, Chinese Academy of Sciences; Lenovo, China
- Quanming Yao
  - Department of Electronic Engineering, Tsinghua University, China
6
Li M, Yu J, Xu H, Meng C. Efficient Approximation of Gromov-Wasserstein Distance Using Importance Sparsification. J Comput Graph Stat 2023. [DOI: 10.1080/10618600.2023.2165500] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/11/2023]
Affiliation(s)
- Mengyu Li
  - Institute of Statistics and Big Data, Renmin University of China
- Jun Yu
  - School of Mathematics and Statistics, Beijing Institute of Technology
- Hongteng Xu
  - Gaoling School of Artificial Intelligence, Renmin University of China
- Cheng Meng
  - Center for Applied Statistics, Institute of Statistics and Big Data, Renmin University of China
7
Salim A, Shiju SS, Sumitra S. Neighborhood Preserving Kernels for Attributed Graphs. IEEE Trans Pattern Anal Mach Intell 2023; 45:828-840. [PMID: 35041594 DOI: 10.1109/tpami.2022.3143806] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/14/2023]
Abstract
We describe the design of a reproducing kernel suitable for attributed graphs, in which the similarity between two graphs is defined based on the neighborhood information of the graph nodes with the aid of a product graph formulation. Attributed graphs are those which contain a piece of vector information and a discrete label over the nodes and edges. We represent the proposed kernel as the weighted sum of two other kernels of which one is an R-convolution kernel that processes the attribute information of the graph and the other is an optimal assignment kernel that processes label information. They are formulated in such a way that the edges processed as part of the kernel computation have the same neighborhood properties and hence the kernel proposed makes a well-defined correspondence between regions processed in graphs. These concepts are also extended to the case of the shortest paths. We identified the state-of-the-art kernels that can be mapped to such a neighborhood preserving framework. We found that the kernel value of the argument graphs in each iteration of the Weisfeiler-Lehman color refinement algorithm can be obtained recursively from the product graph formulated in our method. By incorporating the proposed kernel on support vector machines we analyzed the real-world data sets and it showed superior performance in comparison with that of the other state-of-the-art graph kernels.
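For readers unfamiliar with the Weisfeiler-Lehman color refinement mentioned above, here is a minimal sketch of the standard refinement and the usual WL subtree kernel built from it. It is the textbook construction, not the product-graph formulation proposed in the paper, and the graph encoding (adjacency dict plus label dict) is an assumption made for the example.

```python
from collections import Counter


def wl_refine(adj, labels, iterations=2):
    """Return the color histogram of a graph after each refinement round.

    adj:    dict mapping a node to a list of its neighbours.
    labels: dict mapping a node to its initial (hashable, sortable) label.
    """
    colors = dict(labels)
    history = [Counter(colors.values())]
    for _ in range(iterations):
        # New color = own color plus the sorted multiset of neighbour colors.
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        history.append(Counter(colors.values()))
    return history


def wl_kernel(g1, g2, iterations=2):
    """WL subtree kernel: sum over rounds of histogram dot products."""
    h1 = wl_refine(g1[0], g1[1], iterations)
    h2 = wl_refine(g2[0], g2[1], iterations)
    return sum(
        sum(count * c2[color] for color, count in c1.items())
        for c1, c2 in zip(h1, h2)
    )
```

Colors are kept as nested tuples rather than compressed to integers, so the same structural signature yields the same color in both graphs without a shared lookup table.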
8
Oloulade BM, Gao J, Chen J, Al-Sabri R, Lyu T. Neural predictor-based automated graph classifier framework. Mach Learn 2022. [DOI: 10.1007/s10994-022-06287-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
9
A family of pairwise multi-marginal optimal transports that define a generalized metric. Mach Learn 2022. [DOI: 10.1007/s10994-022-06280-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
10
Hentabli H, Bengherbia B, Saeed F, Salim N, Nafea I, Toubal A, Nasser M. Convolutional Neural Network Model Based on 2D Fingerprint for Bioactivity Prediction. Int J Mol Sci 2022; 23:13230. [PMID: 36362018 PMCID: PMC9657591 DOI: 10.3390/ijms232113230] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/14/2022] [Revised: 10/22/2022] [Accepted: 10/27/2022] [Indexed: 10/15/2023] [Open Access]
Abstract
Determining and modeling the possible behaviour and actions of molecules requires investigating the basic structural features and physicochemical properties that determine their behaviour during chemical, physical, biological, and environmental processes. Computational approaches such as machine learning methods are alternatives for predicting the physicochemical properties of molecules based on their structures. However, the limited accuracy and high error rates of such predictions restrict their use. In this paper, a novel technique based on a deep learning convolutional neural network (CNN) for the prediction of chemical compounds' bioactivity is proposed and developed. The molecules are represented in the new matrix format Mol2mat, a molecular matrix representation adapted from the well-known 2D-fingerprint descriptors. To evaluate the performance of the proposed methods, a series of experiments were conducted using two standard datasets, namely the MDL Drug Data Report (MDDR) and Sutherland datasets, comprising 10 homogeneous and 14 heterogeneous activity classes. After analysing the eight fingerprints, all the probable combinations were investigated using the five best descriptors. The results showed that a combination of three fingerprints, ECFP4, EPFP4, and ECFC4, along with a CNN activity prediction process, achieved the highest performance of 98% AUC when compared to the state-of-the-art ML algorithms NaiveB, LSVM, and RBFN.
Affiliation(s)
- Hamza Hentabli
  - Laboratory of Advanced Electronics Systems (LSEA), University of Medea, Medea 26000, Algeria
  - UTM Big Data Centre, Ibnu Sina Institute for Scientific and Industrial Research, Universiti Teknologi Malaysia, Johor Bahru 81310, Johor, Malaysia
- Billel Bengherbia
  - Laboratory of Advanced Electronics Systems (LSEA), University of Medea, Medea 26000, Algeria
- Faisal Saeed
  - UTM Big Data Centre, Ibnu Sina Institute for Scientific and Industrial Research, Universiti Teknologi Malaysia, Johor Bahru 81310, Johor, Malaysia
  - DAAI Research Group, Department of Computing and Data Science, School of Computing and Digital Technology, Birmingham City University, Birmingham B4 7XG, UK
- Naomie Salim
  - UTM Big Data Centre, Ibnu Sina Institute for Scientific and Industrial Research, Universiti Teknologi Malaysia, Johor Bahru 81310, Johor, Malaysia
- Ibtehal Nafea
  - College of Computer Science and Engineering, Taibah University, Medina 41477, Saudi Arabia
- Abdelmoughni Toubal
  - Laboratory of Advanced Electronics Systems (LSEA), University of Medea, Medea 26000, Algeria
- Maged Nasser
  - School of Computer Sciences, Universiti Sains Malaysia, Gelugor 11800, Penang, Malaysia
11
Song H, Dai Z, Xu P, Ren L. Interactive Visual Pattern Search on Graph Data via Graph Representation Learning. IEEE Trans Vis Comput Graph 2022; 28:335-345. [PMID: 34587078 DOI: 10.1109/tvcg.2021.3114857] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Graphs are a ubiquitous data structure to model processes and relations in a wide range of domains. Examples include control-flow graphs in programs and semantic scene graphs in images. Identifying subgraph patterns in graphs is an important approach to understand their structural properties. We propose a visual analytics system GraphQ to support human-in-the-loop, example-based, subgraph pattern search in a database containing many individual graphs. To support fast, interactive queries, we use graph neural networks (GNNs) to encode a graph as fixed-length latent vector representation, and perform subgraph matching in the latent space. Due to the complexity of the problem, it is still difficult to obtain accurate one-to-one node correspondences in the matching results that are crucial for visualization and interpretation. We, therefore, propose a novel GNN for node-alignment called NeuroAlign, to facilitate easy validation and interpretation of the query results. GraphQ provides a visual query interface with a query editor and a multi-scale visualization of the results, as well as a user feedback mechanism for refining the results with additional constraints. We demonstrate GraphQ through two example usage scenarios: analyzing reusable subroutines in program workflows and semantic scene graph search in images. Quantitative experiments show that NeuroAlign achieves 19%-29% improvement in node-alignment accuracy compared to baseline GNN and provides up to 100× speedup compared to combinatorial algorithms. Our qualitative study with domain experts confirms the effectiveness for both usage scenarios.
12
Tang H, Ma G, He L, Huang H, Zhan L. CommPOOL: An interpretable graph pooling framework for hierarchical graph representation learning. Neural Netw 2021; 143:669-677. [PMID: 34375808 DOI: 10.1016/j.neunet.2021.07.028] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 12/04/2020] [Revised: 06/01/2021] [Accepted: 07/22/2021] [Indexed: 10/20/2022]
Abstract
Recent years have witnessed the emergence and flourishing of hierarchical graph pooling neural networks (HGPNNs) which are effective graph representation learning approaches for graph level tasks such as graph classification. However, current HGPNNs do not take full advantage of the graph's intrinsic structures (e.g., community structure). Moreover, the pooling operations in existing HGPNNs are difficult to be interpreted. In this paper, we propose a new interpretable graph pooling framework - CommPOOL, that can capture and preserve the hierarchical community structure of graphs in the graph representation learning process. Specifically, the proposed community pooling mechanism in CommPOOL utilizes an unsupervised approach for capturing the inherent community structure of graphs in an interpretable manner. CommPOOL is a general and flexible framework for hierarchical graph representation learning that can further facilitate various graph-level tasks. Evaluations on five public benchmark datasets and one synthetic dataset demonstrate the superior performance of CommPOOL in graph representation learning for graph classification compared to the state-of-the-art baseline methods, and its effectiveness in capturing and preserving the community structure of graphs.
Affiliation(s)
- Haoteng Tang
  - Department of Electrical and Computer Engineering, University of Pittsburgh, 3700 O'Hara St, Pittsburgh, PA, 15260, USA
- Guixiang Ma
  - Intel Labs, 2111 NE 25th Ave, Hillsboro, OR, 97124, USA
- Lifang He
  - Department of Computer Science and Engineering, Lehigh University, 113 Research Dr, Bethlehem, PA, 18015, USA
- Heng Huang
  - Department of Electrical and Computer Engineering, University of Pittsburgh, 3700 O'Hara St, Pittsburgh, PA, 15260, USA
- Liang Zhan
  - Department of Electrical and Computer Engineering, University of Pittsburgh, 3700 O'Hara St, Pittsburgh, PA, 15260, USA
13
14
Muzio G, O'Bray L, Borgwardt K. Biological network analysis with deep learning. Brief Bioinform 2021; 22:1515-1530. [PMID: 33169146 DOI: 10.1145/3447548.3467442] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/30/2020] [Revised: 08/26/2020] [Accepted: 09/11/2020] [Indexed: 05/28/2023] [Open Access]
Abstract
Recent advancements in experimental high-throughput technologies have expanded the availability and quantity of molecular data in biology. Given the importance of interactions in biological processes, such as the interactions between proteins or the bonds within a chemical compound, this data is often represented in the form of a biological network. The rise of this data has created a need for new computational tools to analyze networks. One major trend in the field is to use deep learning for this goal and, more specifically, to use methods that work with networks, the so-called graph neural networks (GNNs). In this article, we describe biological networks and review the principles and underlying algorithms of GNNs. We then discuss domains in bioinformatics in which graph neural networks are frequently being applied at the moment, such as protein function prediction, protein-protein interaction prediction and in silico drug discovery and development. Finally, we highlight application areas such as gene regulatory networks and disease diagnosis where deep learning is emerging as a new tool to answer classic questions like gene interaction prediction and automatic disease prediction from data.
Affiliation(s)
- Giulia Muzio
  - Machine Learning and Computational Biology Lab at ETH Zürich
- Leslie O'Bray
  - Machine Learning and Computational Biology Lab at ETH Zürich
15
Subgraph feature extraction based on multi-view dictionary learning for graph classification. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2020.106716] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/20/2022]
16
Ortega-Tenezaca B, Quevedo-Tumailli V, Bediaga H, Collados J, Arrasate S, Madariaga G, Munteanu CR, Cordeiro MND, González-Díaz H. PTML Multi-Label Algorithms: Models, Software, and Applications. Curr Top Med Chem 2020; 20:2326-2337. [DOI: 10.2174/1568026620666200916122616] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/18/2020] [Revised: 07/19/2020] [Accepted: 07/20/2020] [Indexed: 12/17/2022]
Abstract
By combining Machine Learning (ML) methods with Perturbation Theory (PT), it is possible to develop predictive models for a variety of response targets. Such a combination, often known as Perturbation Theory Machine Learning (PTML) modeling, comprises a set of techniques that can handle various physical and chemical properties of different organisms and complex biological or material systems under multiple input conditions. In so doing, these techniques effectively integrate a manifold of diverse chemical and biological data into a single computational framework that can then be applied for screening lead chemicals as well as for finding clues to improve the targeted response(s). PTML models have thus been extremely helpful in drug or material design efforts and have been found to be predictive and applicable across a broad space of systems. After a brief outline of the applied methodology, this work reviews the different uses of PTML in Medicinal Chemistry, as well as in other applications. Finally, we cover the development of software available nowadays for setting up PTML models from large datasets.
Affiliation(s)
- Harbil Bediaga
  - Department of Organic and Inorganic Chemistry, University of Basque Country UPV/EHU, 48940 Leioa, Spain
- Jon Collados
  - Department of Organic and Inorganic Chemistry, University of Basque Country UPV/EHU, 48940 Leioa, Spain
- Sonia Arrasate
  - Department of Organic and Inorganic Chemistry, University of Basque Country UPV/EHU, 48940 Leioa, Spain
- Gotzon Madariaga
  - Department of Condensed Matter Physics, University of Basque Country UPV/EHU, 48940 Leioa, Spain
- Cristian R Munteanu
  - RNASA-IMEDIR, Computer Science Faculty, University of A Coruna, 15071 A Coruna, Spain
- M. Natália D.S. Cordeiro
  - LAQV@REQUIMTE, Department of Chemistry and Biochemistry, University of Porto, 4169-007 Porto, Portugal
- Humbert González-Díaz
  - Department of Organic and Inorganic Chemistry, University of Basque Country UPV/EHU, 48940 Leioa, Spain
17
18
Tran QH, Vo VT, Hasegawa Y. Scale-variant topological information for characterizing the structure of complex networks. Phys Rev E 2019; 100:032308. [PMID: 31640058 DOI: 10.1103/physreve.100.032308] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 01/11/2019] [Indexed: 06/10/2023]
Abstract
The structure of real-world networks is usually difficult to characterize owing to the variation of topological scales, the nondyadic complex interactions, and the fluctuations in the network. We aim to address these problems by introducing a general framework using a method based on topological data analysis. By considering the diffusion process at a single specified timescale in a network, we map the network nodes to a finite set of points that contains the topological information of the network at a single scale. Subsequently, we study the shape of these point sets over variable timescales that provide scale-variant topological information, to understand the varying topological scales and the complex interactions in the network. We conduct experiments on synthetic and real-world data to demonstrate the effectiveness of the proposed framework in identifying network models, classifying real-world networks, and detecting transition points in time-evolving networks. Overall, our study presents a unified analysis that can be applied to more complex network structures, as in the case of multilayer and multiplex networks.
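One way to picture "mapping network nodes to a set of points via diffusion at a given timescale" is the heat kernel of the graph Laplacian. The sketch below is an assumption-laden illustration, not the authors' construction: it uses the combinatorial Laplacian, takes each row of exp(-tL) as a node's coordinates, and stops at the pairwise distance matrix that a topological data analysis tool would then consume. The function names are hypothetical.

```python
import numpy as np


def heat_kernel(adj, t):
    """Return exp(-t * L) for the combinatorial Laplacian L = D - A."""
    adj = np.asarray(adj, dtype=float)
    laplacian = np.diag(adj.sum(axis=1)) - adj
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # L is symmetric
    return (eigvecs * np.exp(-t * eigvals)) @ eigvecs.T


def diffusion_point_cloud(adj, t):
    """Map each node to its heat-kernel row and return pairwise distances."""
    points = heat_kernel(adj, t)
    diff = points[:, None, :] - points[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=-1))
    return points, distances


def sweep_timescales(adj, timescales):
    """Distance matrices across timescales (the scale-variant view)."""
    return {t: diffusion_point_cloud(adj, t)[1] for t in timescales}
```

At small t the points stay close to the standard basis vectors and are far apart; at large t on a connected graph they all approach the uniform vector, so the point cloud contracts as the timescale grows.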
Affiliation(s)
- Quoc Hoan Tran
  - Department of Information and Communication Engineering, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8656, Japan
- Van Tuan Vo
  - Department of Information and Communication Engineering, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8656, Japan
- Yoshihiko Hasegawa
  - Department of Information and Communication Engineering, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8656, Japan
19
Ambure P, Halder AK, González Díaz H, Cordeiro MNDS. QSAR-Co: An Open Source Software for Developing Robust Multitasking or Multitarget Classification-Based QSAR Models. J Chem Inf Model 2019; 59:2538-2544. [PMID: 31083984 DOI: 10.1021/acs.jcim.9b00295] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Indexed: 12/19/2022]
Abstract
Quantitative structure-activity relationships (QSAR) modeling is a well-known computational technique with wide applications in fields such as drug design, toxicity predictions, nanomaterials, etc. However, QSAR researchers still face certain problems to develop robust classification-based QSAR models, especially while handling response data pertaining to diverse experimental and/or theoretical conditions. In the present work, we have developed an open source standalone software "QSAR-Co" (available to download at https://sites.google.com/view/qsar-co ) to setup classification-based QSAR models that allow mining the response data coming from multiple conditions. The software comprises two modules: (1) the Model development module and (2) the Screen/Predict module. This user-friendly software provides several functionalities required for developing a robust multitasking or multitarget classification-based QSAR model using linear discriminant analysis or random forest techniques, with appropriate validation, following the principles set by the Organisation for Economic Co-operation and Development (OECD) for applying QSAR models in regulatory assessments.
Affiliation(s)
- Pravin Ambure
  - LAQV@REQUIMTE, Department of Chemistry and Biochemistry, University of Porto, 4169-007 Porto, Portugal
- Amit Kumar Halder
  - LAQV@REQUIMTE, Department of Chemistry and Biochemistry, University of Porto, 4169-007 Porto, Portugal
- Humbert González Díaz
  - Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940 Leioa, Spain
- M Natália D S Cordeiro
  - LAQV@REQUIMTE, Department of Chemistry and Biochemistry, University of Porto, 4169-007 Porto, Portugal

20
Liu R, Glover KP, Feasel MG, Wallqvist A. General Approach to Estimate Error Bars for Quantitative Structure–Activity Relationship Predictions of Molecular Activity. J Chem Inf Model 2018; 58:1561-1575. [DOI: 10.1021/acs.jcim.8b00114] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 11/28/2022]
Affiliation(s)
- Ruifeng Liu
  - Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland 21702, United States
- Kyle P. Glover
  - Defense Threat Reduction Agency, Aberdeen Proving Ground, Maryland 21010, United States
- Michael G. Feasel
  - U.S. Army Edgewood Chemical Biological Center, Operational Toxicology, Aberdeen Proving Ground, Maryland 21010, United States
- Anders Wallqvist
  - Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland 21702, United States

21

22
Abstract
BACKGROUND Molecular descriptors have been widely used to predict biological activities and physicochemical properties or to analyze chemical libraries on the basis of similarity. Although fingerprints and properties are generally used as descriptors, neither is perfect for these purposes. In certain cases a fingerprint can distinguish between molecules where a property cannot, and vice versa. When the training set is especially small, the construction of good predictive models is difficult. Herein, a novel descriptor integrating mutually compensating fingerprint and property characteristics is described. The format of this descriptor is not conventional: it has two dimensions, with variable length in one dimension, to represent one molecule. This format is not accepted by any machine learning method. Therefore, the distance between molecules has been newly defined for application to machine learning techniques. The evaluation of this descriptor, as applied to classification tasks, was performed using a support vector machine after the features of the descriptor had been optimized by a genetic algorithm. RESULTS Because feature optimization is time-intensive due to the complicated calculation of distances between molecules, the optimization was forced to stop before it was completed. As a result, no remarkable improvement was observed in the classification results for the new descriptor compared with those for other descriptors in any evaluation set used in this work. However, extremely low accuracies were also not found for any set. CONCLUSIONS The novel descriptor proposed in this work can potentially be used to make highly accurate predictive models. This new concept in descriptors is expected to be useful for developing novel predictive methods with quick training and high accuracy.
Affiliation(s)
- Masataka Kuroda
  - Discovery Technology Laboratories, Innovative Research Division, Mitsubishi Tanabe Pharma Corporation, 1000 Kamoshida, Aoba-ku, Yokohama, 227-0033 Japan

23
Babajide Mustapha I, Saeed F. Bioactive Molecule Prediction Using Extreme Gradient Boosting. Molecules 2016; 21:983. [PMID: 27483216 PMCID: PMC6273295 DOI: 10.3390/molecules21080983] [Citation(s) in RCA: 98] [Impact Index Per Article: 12.3] [Received: 05/01/2016] [Revised: 07/19/2016] [Accepted: 07/22/2016] [Indexed: 01/29/2023] Open
Abstract
Following the explosive growth in chemical and biological data, the shift from traditional methods of drug discovery to computer-aided means has made data mining and machine learning methods integral parts of today's drug discovery process. In this paper, extreme gradient boosting (Xgboost), which is an ensemble of Classification and Regression Trees (CART) and a variant of the Gradient Boosting Machine, was investigated for the prediction of biological activity based on a quantitative description of the compound's molecular structure. Seven datasets well known in the literature were used in this paper, and experimental results show that Xgboost can outperform machine learning algorithms like Random Forest (RF), Support Vector Machines (LSVM), Radial Basis Function Neural Network (RBFN) and Naïve Bayes (NB) for the prediction of biological activities. In addition to its ability to detect minority activity classes in highly imbalanced datasets, it showed remarkable performance on both high and low diversity datasets.
Affiliation(s)
- Ismail Babajide Mustapha
  - UTM Big Data Centre, Ibnu Sina Institute for Scientific and Industrial Research, Universiti Teknologi Malaysia, Skudai, Johor 81310, Malaysia.
- Faisal Saeed
  - Information Systems Department, Faculty of Computing, Universiti Teknologi Malaysia, Skudai, Johor 81310, Malaysia.

24
Nahum OE, Yosipof A, Senderowitz H. A Multi-Objective Genetic Algorithm for Outlier Removal. J Chem Inf Model 2015; 55:2507-18. [PMID: 26553402 DOI: 10.1021/acs.jcim.5b00515] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022]
Abstract
Quantitative structure activity relationship (QSAR) or quantitative structure property relationship (QSPR) models are developed to correlate activities for sets of compounds with their structure-derived descriptors by means of mathematical models. The presence of outliers, namely, compounds that differ in some respect from the rest of the data set, compromises the ability of statistical methods to derive QSAR models with good prediction statistics. Hence, outliers should be removed from data sets prior to model derivation. Here we present a new multi-objective genetic algorithm for the identification and removal of outliers based on the k nearest neighbors (kNN) method. The algorithm was used to remove outliers from three different data sets of pharmaceutical interest (logBBB, factor 7 inhibitors, and dihydrofolate reductase inhibitors), and its performance was compared with that of five other methods for outlier removal. The results suggest that the new algorithm provides filtered data sets that (1) better maintain the internal diversity of the parent data sets and (2) give rise to QSAR models with much better prediction statistics. Equally good filtered data sets in terms of these metrics were obtained when another objective function (termed "preservation") was added to the algorithm, forcing it to remove certain compounds only with low probability. This option is highly useful when specific compounds should preferably be kept in the final data set, either because they have favorable activities or because they represent interesting molecular scaffolds. We expect this new algorithm to be useful in future QSAR applications.
Affiliation(s)
- Oren E Nahum
  - Department of Management, Bar-Ilan University, Ramat-Gan 52900, Israel
  - School of Management and Economics, The Academic College of Tel-Aviv - Yafo, Yafo 61083, Israel
  - Department of Chemistry, Bar-Ilan University, Ramat-Gan 52900, Israel
- Abraham Yosipof
  - Department of Business Administration, Peres Academic Center, Rehovot 76102, Israel

25
Yosipof A, Senderowitz H. k-Nearest neighbors optimization-based outlier removal. J Comput Chem 2015; 36:493-506. [PMID: 25503870 DOI: 10.1002/jcc.23803] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 08/27/2014] [Revised: 11/13/2014] [Accepted: 11/13/2014] [Indexed: 01/21/2023]
Abstract
Datasets of molecular compounds often contain outliers, that is, compounds which are different from the rest of the dataset. Outliers, while often interesting, may affect data interpretation, model generation, and decision making, and therefore should be removed from the dataset prior to modeling efforts. Here, we describe a new method for the iterative identification and removal of outliers based on a k-nearest neighbors optimization algorithm. We demonstrate for three different datasets that the removal of outliers using the new algorithm provides filtered datasets which are better than those provided by four alternative outlier removal procedures, as well as by random compound removal, in two important aspects: (1) they better maintain the diversity of the parent datasets; (2) they give rise to quantitative structure activity relationship (QSAR) models with much better prediction statistics. The new algorithm is, therefore, suitable for the pretreatment of datasets prior to QSAR modeling.
Affiliation(s)
- Abraham Yosipof
  - Department of Chemistry, Bar Ilan University, Ramat-Gan, 52900, Israel

26
Moser D, Wittmann SK, Kramer J, Blöcher R, Achenbach J, Pogoryelov D, Proschak E. PENG: A Neural Gas-Based Approach for Pharmacophore Elucidation. Method Design, Validation, and Virtual Screening for Novel Ligands of LTA4H. J Chem Inf Model 2015; 55:284-93. [DOI: 10.1021/ci500618u] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Indexed: 01/06/2023]
Affiliation(s)
- Daniel Moser
  - Institute of Pharmaceutical Chemistry, Goethe University, 60438 Frankfurt, Germany
  - German Cancer Consortium (DKTK), 60590 Frankfurt, Germany
  - German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany
- Sandra K. Wittmann
  - Institute of Pharmaceutical Chemistry, Goethe University, 60438 Frankfurt, Germany
- Jan Kramer
  - Institute of Pharmaceutical Chemistry, Goethe University, 60438 Frankfurt, Germany
- René Blöcher
  - Institute of Pharmaceutical Chemistry, Goethe University, 60438 Frankfurt, Germany
- Janosch Achenbach
  - Institute of Pharmaceutical Chemistry, Goethe University, 60438 Frankfurt, Germany
  - BASF SE, 67056 Ludwigshafen, Germany
- Denys Pogoryelov
  - Institute of Biochemistry, Goethe University, 60438 Frankfurt, Germany
- Ewgenij Proschak
  - Institute of Pharmaceutical Chemistry, Goethe University, 60438 Frankfurt, Germany
  - German Cancer Consortium (DKTK), 60590 Frankfurt, Germany
  - German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany

27
Abstract
Background: Sound statistical validation is important to evaluate and compare the overall performance of (Q)SAR models. However, classical validation does not support the user in better understanding the properties of the model or the underlying data. Even though a number of visualization tools for analyzing (Q)SAR information in small molecule datasets exist, integrated visualization methods that allow the investigation of model validation results are still lacking. Results: We propose visual validation as an approach for the graphical inspection of (Q)SAR model validation results. The approach applies the 3D viewer CheS-Mapper, an open-source application for the exploration of small molecules in virtual 3D space. The present work describes the new functionalities in CheS-Mapper 2.0 that facilitate the analysis of (Q)SAR information and allow the visual validation of (Q)SAR models. The tool enables the comparison of model predictions to the actual activity in feature space. The approach is generic: it is model-independent and can handle physico-chemical and structural input features as well as quantitative and qualitative endpoints. Conclusions: Visual validation with CheS-Mapper enables analyzing (Q)SAR information in the data and indicates how this information is employed by the (Q)SAR model. It reveals whether the endpoint is modeled too specifically or too generically and highlights common properties of misclassified compounds. Moreover, the researcher can use CheS-Mapper to inspect how the (Q)SAR model predicts activity cliffs. The CheS-Mapper software is freely available at http://ches-mapper.org. Graphical abstract: Comparing actual and predicted activity values with CheS-Mapper.

28
Teixeira AL, Falcao AO. Structural Similarity Based Kriging for Quantitative Structure Activity and Property Relationship Modeling. J Chem Inf Model 2014; 54:1833-49. [DOI: 10.1021/ci500110v] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Indexed: 01/21/2023]
Affiliation(s)
- Ana L. Teixeira
  - LaSIGE, Faculty of Sciences, University of Lisbon, 1749-016 Lisbon, Portugal
  - CQB - Centro de Química e Bioquímica, Faculty of Sciences, University of Lisbon, 1749-016 Lisbon, Portugal
- Andre O. Falcao
  - LaSIGE, Faculty of Sciences, University of Lisbon, 1749-016 Lisbon, Portugal
  - Department of Informatics, Faculty of Sciences, University of Lisbon, 1749-016 Lisbon, Portugal

29
Carrió P, Pinto M, Ecker G, Sanz F, Pastor M. Applicability Domain ANalysis (ADAN): a robust method for assessing the reliability of drug property predictions. J Chem Inf Model 2014; 54:1500-11. [PMID: 24821140 DOI: 10.1021/ci500172z] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Indexed: 12/29/2022]
Abstract
We report a novel method called ADAN (Applicability Domain ANalysis) for assessing the reliability of drug property predictions obtained by in silico methods. The assessment provided by ADAN is based on the comparison of the query compound with the training set, using six diverse similarity criteria. For every criterion, the query compound is considered out of range when the similarity value obtained is larger than the 95th percentile of the values obtained for the training set. The final outcome is a number in the range of 0-6 that expresses the number of unmet similarity criteria and allows classifying the query compound within seven reliability categories. Such categories can be further exploited to assign simpler reliability classes using a traffic light schema, to assign approximate confidence intervals, or to mark the predictions as unreliable. The entire methodology has been validated by simulating realistic conditions, where query compounds are structurally diverse from those in the training set. The validation exercise involved the construction of more than 1000 models. These models were built using a combination of training sets, molecular descriptors, and modeling methods representative of the real predictive tasks performed in the eTOX project (a project whose objective is to predict in vivo toxicological end points in drug development). Validation results confirm the robustness of the proposed assessment methodology, which compares favorably with other classical methods based solely on the structural similarity of the compounds. ADAN characteristics make the method well suited for estimating the quality of drug predictions obtained in extremely unfavorable conditions, like the prediction of drug toxicity end points.
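The percentile rule described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say what the six criteria are, so each one is represented here by a plain number per compound, and the function names `percentile` and `count_unmet_criteria` as well as the linear-interpolation percentile are hypothetical choices.

```python
def percentile(values, q):
    """Percentile with linear interpolation (an assumed convention)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def count_unmet_criteria(train_rows, query_row, q=95.0):
    """Count criteria for which the query exceeds the training percentile.

    train_rows: one row per training compound, one value per criterion.
    query_row:  the values of the same criteria for the query compound.
    With six criteria the result lies in the range 0-6.
    """
    unmet = 0
    for k, query_value in enumerate(query_row):
        column = [row[k] for row in train_rows]
        if query_value > percentile(column, q):
            unmet += 1
    return unmet


# Hypothetical data: 101 training compounds, six criteria each.
train = [[float(i)] * 6 for i in range(101)]
score = count_unmet_criteria(train, [10, 96, 50, 100, 95, 0])
```

A score of 0 would correspond to the most reliable category and 6 to the least reliable; how the seven categories map to traffic-light classes or confidence intervals is not specified in the abstract and is left out of the sketch.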
Affiliation(s)
- Pau Carrió
  - Research Programme on Biomedical Informatics (GRIB), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, IMIM (Hospital del Mar Medical Research Institute), Dr. Aiguader, 88, E-08003 Barcelona, Spain

30
Yamashita H, Higuchi T, Yoshida R. Atom environment kernels on molecules. J Chem Inf Model 2014; 54:1289-300. [PMID: 24802375 DOI: 10.1021/ci400403w] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/30/2022]
Abstract
The measurement of molecular similarity is an essential part of various machine learning tasks in chemical informatics. Graph kernels provide good similarity measures between molecules. Conventional graph kernels are based on counting common subgraphs of specific types in the molecular graphs. This approach has two primary limitations: (i) only exact subgraph matching is considered in the counting operation, and (ii) most of the subgraphs will be less relevant to a given task. In order to address the above-mentioned limitations, we propose a new graph kernel as an extension of the subtree kernel initially proposed by Ramon and Gärtner (2003). The proposed kernel tolerates an inexact match between subgraphs by allowing matching between atoms with similar local environments. In addition, the proposed kernel provides a method to assign an importance weight to each subgraph according to the relevance to the task, which is predetermined by a statistical test. These extensions are evaluated for classification and regression tasks of predicting a wide range of pharmaceutical properties from molecular structures, with promising results.
Affiliation(s)
- Hiroshi Yamashita
  - The Graduate University for Advanced Studies, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan

31
Abdo A, Leclère V, Jacques P, Salim N, Pupin M. Prediction of new bioactive molecules using a Bayesian belief network. J Chem Inf Model 2014; 54:30-6. [PMID: 24392938 DOI: 10.1021/ci4004909] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Indexed: 11/30/2022]
Abstract
Natural products and synthetic compounds are a valuable source of new small molecules leading to novel drugs to cure diseases. However, identifying new biologically active small molecules is still a challenge. In this paper, we introduce a new activity prediction approach using a Bayesian belief network for classification (BBNC). The roots of the network are the fragments composing a compound. The leaves are, on one side, the activities to predict and, on the other side, the unknown compound. The activities are represented by sets of known compounds, and sets of inactive compounds are also used. We calculated a similarity between an unknown compound and each activity class. The most similar activity is assigned to the unknown compound. We applied this new approach to eight well-known data sets extracted from the literature and compared its performance to three classical machine learning algorithms. Experiments showed that BBNC provides interesting prediction rates (from 79% accuracy for highly diverse data sets to 99% for low-diversity ones) with a short calculation time. Experiments also showed that BBNC is particularly effective for homogeneous data sets but has been found to perform less well with structurally heterogeneous sets. However, it is important to stress that we believe that using several approaches whenever possible for activity prediction can often give a broader understanding of the data than using one approach alone. Thus, BBNC is a useful addition to the computational chemist's toolbox.
Affiliation(s)
- Ammar Abdo
  - LIFL UMR CNRS 8022 Université Lille1 and INRIA Lille Nord Europe, 59655 Villeneuve d'Ascq cedex, France

32
Keefer CE, Kauffman GW, Gupta RR. Interpretable, Probability-Based Confidence Metric for Continuous Quantitative Structure–Activity Relationship Models. J Chem Inf Model 2013; 53:368-83. [DOI: 10.1021/ci300554t] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Indexed: 11/28/2022]
Affiliation(s)
- Gregory W. Kauffman
  - Worldwide Medicinal Chemistry, Neuroscience Research Unit, Pfizer Inc., Cambridge, Massachusetts 02139, United States

33
Pérez-Castillo Y, Lazar C, Taminau J, Froeyen M, Cabrera-Pérez MÁ, Nowé A. GA(M)E-QSAR: A Novel, Fully Automatic Genetic-Algorithm-(Meta)-Ensembles Approach for Binary Classification in Ligand-Based Drug Design. J Chem Inf Model 2012; 52:2366-86. [DOI: 10.1021/ci300146h] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 01/15/2023]
Affiliation(s)
- Yunierkis Pérez-Castillo
  - Computational Modeling Lab (CoMo), Department of Computer Sciences, Faculty of Sciences, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussel, Belgium
  - Molecular Simulations and Drug Design Group, Centro de Bioactivos Químicos, Universidad Central “Marta Abreu” de Las Villas, Santa Clara, Cuba
  - Laboratory for Medicinal Chemistry, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Minderbroedersstraat 10, B-3000 Leuven, Belgium
- Cosmin Lazar
  - Computational Modeling Lab (CoMo), Department of Computer Sciences, Faculty of Sciences, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussel, Belgium
- Jonatan Taminau
  - Computational Modeling Lab (CoMo), Department of Computer Sciences, Faculty of Sciences, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussel, Belgium
- Mathy Froeyen
  - Laboratory for Medicinal Chemistry, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Minderbroedersstraat 10, B-3000 Leuven, Belgium
- Miguel Ángel Cabrera-Pérez
  - Molecular Simulations and Drug Design Group, Centro de Bioactivos Químicos, Universidad Central “Marta Abreu” de Las Villas, Santa Clara, Cuba
  - Engineering Department, Pharmacy and Pharmaceutical Technology Area, Faculty of Pharmacy, University Miguel Hernandez, Alicante 03550, Spain
- Ann Nowé
  - Computational Modeling Lab (CoMo), Department of Computer Sciences, Faculty of Sciences, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussel, Belgium

34
Szaleniec M. Prediction of enzyme activity with neural network models based on electronic and geometrical features of substrates. Pharmacol Rep 2012; 64:761-81. [DOI: 10.1016/s1734-1140(12)70873-3] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 02/07/2012] [Revised: 04/16/2012] [Indexed: 11/26/2022]

35
Guha R. Exploring Structure-Activity Data Using the Landscape Paradigm. Wiley Interdisciplinary Reviews: Computational Molecular Science 2012; 2. [PMID: 24163705 DOI: 10.1002/wcms.1087] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Indexed: 12/15/2022]
Abstract
In this article we present an overview of the origin and applications of the activity landscape view of structure-activity relationship data as conceived by Maggiora. Within this landscape, different regions exemplify different aspects of SAR trends, ranging from smoothly varying trends to discontinuous trends (also termed activity cliffs). We discuss the various definitions of landscapes and cliffs that have been proposed as well as different approaches to the numerical quantification of a landscape. We then highlight some of the landscape visualization approaches that have been developed, followed by a review of the various applications of activity landscapes and cliffs to topics in medicinal chemistry and SAR analysis.
Affiliation(s)
- Rajarshi Guha
  - NIH Center for Translational Therapeutics, 9800 Medical Center Drive, Rockville, MD 20850

36
Liew CY, Lim YC, Yap CW. Mixed learning algorithms and features ensemble in hepatotoxicity prediction. J Comput Aided Mol Des 2011; 25:855-71. [DOI: 10.1007/s10822-011-9468-3] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Received: 02/07/2011] [Accepted: 08/23/2011] [Indexed: 12/22/2022]

37
Myint KZ, Ma C, Wang L, Xie XQ. Fragment-similarity-based QSAR (FS-QSAR) algorithm for ligand biological activity predictions. SAR and QSAR in Environmental Research 2011; 22:385-410. [PMID: 21598200 DOI: 10.1080/1062936x.2011.569943] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 05/30/2023]
Abstract
Quantitative structure-activity relationship (QSAR) studies are useful computational tools often used in drug discovery research and in many scientific disciplines. In this study, a robust fragment-similarity-based QSAR (FS-QSAR) algorithm was developed to correlate structures with biological activities by integrating the fragment-based drug design concept and a multiple linear regression method. Similarity between any pair of training and testing fragments was determined by calculating the difference of the lowest or highest eigenvalues of the chemistry space BCUT matrices of the corresponding fragments. In addition to the BCUT-similarity function, a molecular fingerprint Tanimoto coefficient (Tc) similarity function was also used as an alternative for comparison. For validation studies, the FS-QSAR algorithm was applied to several case studies, including a dataset of COX2 inhibitors and a dataset of cannabinoid CB2 triaryl bis-sulfone antagonist analogues, to build predictive models achieving average coefficients of determination (r²) of 0.62 and 0.68, respectively. The developed FS-QSAR method is shown to give more accurate predictions than the traditional and one-nearest-neighbour QSAR methods and can be a useful tool in fragment-based drug discovery for ligand activity prediction.
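For readers unfamiliar with the Tanimoto coefficient (Tc) mentioned in this abstract, a minimal Python sketch of its standard definition for binary fingerprints follows. It assumes each fingerprint is given as the set of its on-bit indices; the function name `tanimoto` and the convention of returning 0.0 for two empty fingerprints are assumptions made here, and the sketch does not reproduce how FS-QSAR itself combines fragment similarities.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two binary fingerprints.

    Each fingerprint is a collection of on-bit indices. The coefficient is
    the number of shared on-bits divided by the number of distinct on-bits.
    """
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Assumed convention: two empty fingerprints have similarity 0.0.
        return 0.0
    return len(a & b) / union


# Hypothetical fingerprints: 2 shared bits out of 4 distinct bits -> 0.5
similarity = tanimoto({1, 2, 3}, {2, 3, 4})
```

The value ranges from 0.0 (no shared on-bits) to 1.0 (identical fingerprints).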
Affiliation(s)
- K Z Myint
  - Department of Computational Biology, School of Medicine, University of Pittsburgh, Pittsburgh, USA

38
Hariharan R, Janakiraman A, Nilakantan R, Singh B, Varghese S, Landrum G, Schuffenhauer A. MultiMCS: A Fast Algorithm for the Maximum Common Substructure Problem on Multiple Molecules. J Chem Inf Model 2011; 51:788-806. [DOI: 10.1021/ci100297y] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Indexed: 11/29/2022]
Affiliation(s)
- Ramesh Hariharan
  - Strand Life Sciences, Fifth Floor, Kirloskar Business Park, Bellary Road, Hebbal, Bangalore 560024, India
- Anand Janakiraman
  - Strand Life Sciences, Fifth Floor, Kirloskar Business Park, Bellary Road, Hebbal, Bangalore 560024, India
- Ramaswamy Nilakantan
  - Strand Life Sciences, Fifth Floor, Kirloskar Business Park, Bellary Road, Hebbal, Bangalore 560024, India
- Bhupender Singh
  - Strand Life Sciences, Fifth Floor, Kirloskar Business Park, Bellary Road, Hebbal, Bangalore 560024, India
- Sajith Varghese
  - Strand Life Sciences, Fifth Floor, Kirloskar Business Park, Bellary Road, Hebbal, Bangalore 560024, India
- Gregory Landrum
  - Novartis Institutes for BioMedical Research, CH-4002, Basel, Switzerland

39
Hinselmann G, Rosenbaum L, Jahn A, Fechner N, Ostermann C, Zell A. Large-Scale Learning of Structure–Activity Relationships Using a Linear Support Vector Machine and Problem-Specific Metrics. J Chem Inf Model 2011; 51:203-13. [DOI: 10.1021/ci100073w] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Indexed: 02/07/2023]
Affiliation(s)
- Georg Hinselmann
  - Center for Bioinformatics (ZBIT), University of Tübingen, Tübingen, Germany
- Lars Rosenbaum
  - Center for Bioinformatics (ZBIT), University of Tübingen, Tübingen, Germany
- Andreas Jahn
  - Center for Bioinformatics (ZBIT), University of Tübingen, Tübingen, Germany
- Nikolas Fechner
  - Center for Bioinformatics (ZBIT), University of Tübingen, Tübingen, Germany
- Andreas Zell
  - Center for Bioinformatics (ZBIT), University of Tübingen, Tübingen, Germany

40
Rathke F, Hansen K, Brefeld U, Müller KR. StructRank: A New Approach for Ligand-Based Virtual Screening. J Chem Inf Model 2010; 51:83-92. [DOI: 10.1021/ci100308f] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Indexed: 02/05/2023]
Affiliation(s)
- Fabian Rathke
  - Department of Machine Learning, University of Technology, Berlin, Germany, Department of Image and Pattern Analysis, University of Heidelberg, Germany, and Yahoo! Research, Avinguda Diagonal 177, 08018 Barcelona, Spain
- Katja Hansen
  - Department of Machine Learning, University of Technology, Berlin, Germany, Department of Image and Pattern Analysis, University of Heidelberg, Germany, and Yahoo! Research, Avinguda Diagonal 177, 08018 Barcelona, Spain
- Ulf Brefeld
  - Department of Machine Learning, University of Technology, Berlin, Germany, Department of Image and Pattern Analysis, University of Heidelberg, Germany, and Yahoo! Research, Avinguda Diagonal 177, 08018 Barcelona, Spain
- Klaus-Robert Müller
  - Department of Machine Learning, University of Technology, Berlin, Germany, Department of Image and Pattern Analysis, University of Heidelberg, Germany, and Yahoo! Research, Avinguda Diagonal 177, 08018 Barcelona, Spain

41
Ma EYT, Cameron CJF, Kremer SC. Classifying and scoring of molecules with the NGN: new datasets, significance tests, and generalization. BMC Bioinformatics 2010; 11 Suppl 8:S4. [PMID: 21034429 PMCID: PMC2966291 DOI: 10.1186/1471-2105-11-s8-s4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
This paper demonstrates how a Neural Grammar Network learns to classify and score molecules for a variety of tasks in chemistry and toxicology. In addition to a more detailed analysis on datasets previously studied, we introduce three new datasets (BBB, FXa, and toxicology) to show the generality of the approach. A new experimental methodology is developed and applied to both the new datasets and the previously studied datasets. This methodology is rigorous and statistically grounded, and ultimately culminates in a Wilcoxon significance test that proves the effectiveness of the system. We further include a complete generalization of the specific technique to arbitrary grammars and datasets using a mathematical abstraction that allows researchers in different domains to apply the method to their own work. Background: Our work can be viewed as an alternative to existing methods to solve the quantitative structure-activity relationship (QSAR) problem. To this end, we review a number of approaches from both a methodological and a performance perspective. In addition to these approaches, we also examined a number of chemical properties that can be used by generic classifier systems, such as feed-forward artificial neural networks. In studying these approaches, we identified a set of interesting benchmark problem sets to which many of the above approaches had been applied. These included: ACE, AChE, AR, BBB, BZR, Cox2, DHFR, ER, FXa, GPB, Therm, and Thr. Finally, we developed our own benchmark set by collecting data on toxicology. Results: Our results show that our system performs better than, or comparably to, the existing methods over a broad range of problem types. Our method does not require the expert knowledge that is necessary to apply the other methods to novel problems.
Conclusions: We conclude that our success is due to the ability of our system to: 1) encode molecules losslessly before presentation to the learning system, and 2) leverage the design of molecular description languages to facilitate the identification of relevant structural attributes of the molecules over different problem domains.
|
42
|
Spjuth O, Willighagen EL, Guha R, Eklund M, Wikberg JE. Towards interoperable and reproducible QSAR analyses: Exchange of datasets. J Cheminform 2010; 2:5. [PMID: 20591161 PMCID: PMC2909924 DOI: 10.1186/1758-2946-2-5] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.3] [Received: 03/19/2010] [Accepted: 06/30/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. RESULTS We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. CONCLUSIONS Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusion regarding descriptors by defining them crisply. This makes it easy to join, extend, and combine datasets and hence work collectively, but also allows for analyzing the effect descriptors have on the statistical model's performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible for the community.
Affiliation(s)
- Ola Spjuth
- Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden.
|
43
|
Obrezanova O, Segall MD. Gaussian Processes for Classification: QSAR Modeling of ADMET and Target Activity. J Chem Inf Model 2010; 50:1053-61. [DOI: 10.1021/ci900406x] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.2] [Indexed: 12/30/2022]
Affiliation(s)
- Olga Obrezanova
- Optibrium Ltd., 7226 IQ Cambridge, Beach Drive, Cambridge, CB25 9TL, United Kingdom
- Matthew D. Segall
- Optibrium Ltd., 7226 IQ Cambridge, Beach Drive, Cambridge, CB25 9TL, United Kingdom
|
44
|
In silico search for multi-target anti-inflammatories in Chinese herbs and formulas. Bioorg Med Chem 2010; 18:2204-2218. [DOI: 10.1016/j.bmc.2010.01.070] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.6] [Received: 08/13/2009] [Revised: 01/21/2010] [Accepted: 01/29/2010] [Indexed: 11/19/2022]
|
45
|
Pérez-Garrido A, Helguera AM, Cordeiro MND, Escudero AG. QSPR modelling with the topological substructural molecular design approach: β-cyclodextrin complexation. J Pharm Sci 2009; 98:4557-76. [DOI: 10.1002/jps.21747] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Indexed: 01/10/2023]
|
46
|
Nasr RJ, Swamidass SJ, Baldi PF. Large scale study of multiple-molecule queries. J Cheminform 2009; 1:7. [PMID: 20298525 PMCID: PMC3225883 DOI: 10.1186/1758-2946-1-7] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Received: 06/01/2009] [Accepted: 06/04/2009] [Indexed: 12/04/2022] Open
Abstract
Background In ligand-based screening, as well as in other chemoinformatics applications, one seeks to effectively search large repositories of molecules in order to retrieve molecules that are similar, typically, to a single lead molecule. However, in some cases, multiple molecules from the same family are available to seed the query and search for other members of the same family. Multiple-molecule query methods have been less studied than single-molecule query methods. Furthermore, the previous studies have relied on proprietary data and sometimes have not used proper cross-validation methods to assess the results. In contrast, here we develop and compare multiple-molecule query methods using several large publicly available data sets and background. We also create a framework based on a strict cross-validation protocol to allow unbiased benchmarking for direct comparison in future studies across several performance metrics. Results Fourteen different multiple-molecule query methods were defined and benchmarked using: (1) 41 publicly available data sets of related molecules with similar biological activity; and (2) publicly available background data sets consisting of up to 175,000 molecules randomly extracted from the ChemDB database and other sources. Eight of the fourteen methods were parameter free, and six of them fit one or two free parameters to the data using a careful cross-validation protocol. All the methods were assessed and compared for their ability to retrieve members of the same family against the background data set by using several performance metrics including the Area Under the Accumulation Curve (AUAC), Area Under the Curve (AUC), F1-measure, and BEDROC metrics. Consistent with the previous literature, the best parameter-free methods are the MAX-SIM and MIN-RANK methods, which score a molecule to a family by the maximum similarity, or minimum ranking, obtained across the family.
One new parameterized method introduced in this study and two previously defined methods, the Exponential Tanimoto Discriminant (ETD), the Tanimoto Power Discriminant (TPD), and the Binary Kernel Discriminant (BKD), outperform most other methods but are more complex, requiring one or two parameters to be fit to the data. Conclusion Fourteen methods for multiple-molecule querying of chemical databases, including novel methods, (ETD) and (TPD), are validated using publicly available data sets, standard cross-validation protocols, and established metrics. The best results are obtained with ETD, TPD, BKD, MAX-SIM, and MIN-RANK. These results can be replicated and compared with the results of future studies using data freely downloadable from http://cdb.ics.uci.edu/.
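The two parameter-free scores named in the abstract above can be illustrated with a minimal sketch. It assumes fingerprints are represented as Python sets of "on" bits and that similarity is the Tanimoto coefficient; the function names `tanimoto`, `max_sim`, and `min_rank` and the toy data are hypothetical and are not taken from the paper.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_sim(candidate, family):
    """MAX-SIM: highest similarity of the candidate to any family member."""
    return max(tanimoto(candidate, query) for query in family)


def min_rank(database, family):
    """MIN-RANK: for each database molecule, the best (smallest) rank it
    reaches in any of the per-query similarity rankings (rank 1 = most similar)."""
    best = [len(database)] * len(database)
    for query in family:
        # Rank the whole database against this single query molecule.
        order = sorted(range(len(database)),
                       key=lambda i: -tanimoto(database[i], query))
        for rank, i in enumerate(order, start=1):
            best[i] = min(best[i], rank)
    return best


# Toy data (assumed for illustration only).
family = [{1, 2, 3, 4}, {2, 3, 4, 5}]
database = [{1, 2, 3}, {7, 8, 9}, {3, 4, 5, 6}]

max_sim_scores = [max_sim(m, family) for m in database]
min_rank_scores = min_rank(database, family)
print(max_sim_scores)   # higher is better
print(min_rank_scores)  # lower is better
```

Both scores need no fitted parameters, which is why the abstract groups them among the parameter-free methods; ETD, TPD, and BKD are not sketched here because the abstract does not give their formulas.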
Affiliation(s)
- Ramzi J Nasr
- The Bren School of Information and Computer Science, Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697-3435, USA.
|
47
|
Santos-Filho OA, Cherkasov A. Using Molecular Docking, 3D-QSAR, and Cluster Analysis for Screening Structurally Diverse Data Sets of Pharmacological Interest. J Chem Inf Model 2008; 48:2054-65. [DOI: 10.1021/ci8001952] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/28/2022]
Affiliation(s)
- Osvaldo A. Santos-Filho
- Jack Bell Research Centre at Vancouver General Hospital, Faculty of Medicine, University of British Columbia, 2660 Oak Street, Vancouver, British Columbia V6H 3Z6, Canada
- Artem Cherkasov
- Jack Bell Research Centre at Vancouver General Hospital, Faculty of Medicine, University of British Columbia, 2660 Oak Street, Vancouver, British Columbia V6H 3Z6, Canada
|
48
|
Simmons K, Kinney J, Owens A, Kleier D, Bloch K, Argentar D, Walsh A, Vaidyanathan G. Comparative study of machine-learning and chemometric tools for analysis of in-vivo high-throughput screening data. J Chem Inf Model 2008; 48:1663-8. [PMID: 18681397 DOI: 10.1021/ci800142d] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 11/28/2022]
Abstract
High-throughput screening (HTS) has become a central tool of many pharmaceutical and crop-protection discovery operations. If HTS is carried out at the level of the intact organism, as is commonly done in crop protection, this strategy has the potential to uncover completely new mechanisms of action. The challenge in running a cost-effective HTS operation is to identify ways in which to improve the overall success rate in discovering new biologically active compounds. To this end, we describe our efforts directed at making full use of the data stream arising from HTS. This paper describes a comparative study in which several machine learning and chemometric methodologies were used to develop classifiers on the same data sets derived from in vivo HTS campaigns, and their predictive performances were compared in terms of false negative and false positive error profiles.
Affiliation(s)
- Kirk Simmons
- Simmons Consulting, 52 Windybush Way, Titusville, NJ 08560, USA.
|
49
|
Han LY, Ma XH, Lin HH, Jia J, Zhu F, Xue Y, Li ZR, Cao ZW, Ji ZL, Chen YZ. A support vector machines approach for virtual screening of active compounds of single and multiple mechanisms from large libraries at an improved hit-rate and enrichment factor. J Mol Graph Model 2007; 26:1276-86. [PMID: 18218332 DOI: 10.1016/j.jmgm.2007.12.002] [Citation(s) in RCA: 65] [Impact Index Per Article: 3.8] [Received: 04/29/2007] [Revised: 12/05/2007] [Accepted: 12/05/2007] [Indexed: 01/04/2023]
Abstract
Support vector machines (SVM) and other machine-learning (ML) methods have been explored as ligand-based virtual screening (VS) tools for facilitating lead discovery. While exhibiting good hit selection performance, these methods tend to produce lower hit-rates than the best-performing VS tools when screening large compound libraries, partly because their training-sets contain a limited spectrum of inactive compounds. We tested whether the performance of SVM can be improved by using training-sets of diverse inactive compounds. In retrospective database screening of active compounds of single mechanism (HIV protease inhibitors, DHFR inhibitors, dopamine antagonists) and multiple mechanisms (CNS active agents) from large libraries of 2.986 million compounds, the yields, hit-rates, and enrichment factors of our SVM models are 52.4-78.0%, 4.7-73.8%, and 214-10,543, respectively, compared to those of 62-95%, 0.65-35%, and 20-1200 by structure-based VS and 55-81%, 0.2-0.7%, and 110-795 by other ligand-based VS tools in screening libraries of ≥1 million compounds. The hit-rates are comparable and the enrichment factors are substantially better than the best results of other VS tools. 24.3-87.6% of the predicted hits are outside the known hit families. SVM appears to be potentially useful for facilitating lead discovery in VS of large compound libraries.
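The abstract above reports yields, hit-rates, and enrichment factors without defining them. The sketch below uses the definitions common in the virtual-screening literature, which are assumed here rather than quoted from the paper: yield is the fraction of known actives recovered, hit-rate is the fraction of selected compounds that are active, and the enrichment factor is the hit-rate divided by the fraction of actives in the whole library. The function name and the example counts are hypothetical.

```python
def screening_metrics(n_library, n_actives, n_selected, n_true_hits):
    """Return (yield, hit_rate, enrichment_factor) for one screening run.

    n_library   -- total compounds screened
    n_actives   -- known actives hidden in the library
    n_selected  -- compounds the model flags as hits
    n_true_hits -- flagged compounds that are truly active
    """
    recovered = n_true_hits / n_actives          # yield
    hit_rate = n_true_hits / n_selected          # precision of the selection
    base_rate = n_actives / n_library            # hit-rate of random picking
    return recovered, hit_rate, hit_rate / base_rate


# Made-up counts for illustration; these are not results from the paper.
recovered, hit_rate, enrichment = screening_metrics(
    n_library=1_000_000, n_actives=100, n_selected=500, n_true_hits=60)
print(f"yield={recovered:.1%} hit-rate={hit_rate:.1%} EF={enrichment:.0f}")
```

Under these definitions a very large library with few actives can give a high enrichment factor even at a modest hit-rate, which is consistent with the wide ranges quoted in the abstract.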
Affiliation(s)
- L Y Han
- Bioinformatics and Drug Design Group, Department of Pharmacy, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore 117543, Singapore
|
50
|
Abstract
Clustering is a common task in the field of cheminformatics. A key parameter that needs to be set for nonhierarchical clustering methods, such as k-means, is the number of clusters, k. Traditionally, the value of k is obtained by performing the clustering with different values of k and selecting the value that leads to the optimal clustering. In this study, we describe an approach to selecting k, a priori, based on the R-NN curve algorithm described by Guha et al. (J. Chem. Inf. Model., 2006, 46, 1713-722), which uses a nearest-neighbor technique to characterize the spatial location of compounds in arbitrary descriptor spaces. The algorithm generates a set of curves for the data set which are then analyzed to estimate the natural number of clusters. We then performed k-means clustering with the predicted value of k as well as with similar values to check that the correct number of clusters was obtained. In addition, we compared the predicted value to the number indicated by the average silhouette width as a cluster quality measure. We tested the algorithm on simulated data as well as on two chemical data sets. Our results indicate that the R-NN curve algorithm is able to determine the natural number of clusters and is in general agreement with the average silhouette width in identifying the optimal number of clusters.
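The average silhouette width that the abstract above uses as a reference quality measure can be sketched as follows. This is the standard silhouette definition, s(i) = (b - a) / max(a, b), with Euclidean distance assumed; it is not the R-NN curve algorithm, which the abstract does not specify in enough detail to reproduce. The function name and the toy points are hypothetical, and the sketch assumes at least two clusters.

```python
def average_silhouette(points, labels):
    """Mean silhouette width over all points (requires >= 2 clusters)."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes s(i) = 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        total += (b - a) / denom if denom else 0.0
    return total / len(points)


# Toy 1-D data with two obvious groups (assumed for illustration only).
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = average_silhouette(points, [0, 0, 1, 1])  # natural grouping
bad = average_silhouette(points, [0, 1, 0, 1])   # groups mixed together
print(good, bad)
```

A labelling that matches the natural grouping scores close to 1, while a mixed labelling scores below 0, so scanning candidate values of k and keeping the one with the highest average width gives the kind of reference estimate the abstract compares against.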
Affiliation(s)
- Rajarshi Guha
- School of Informatics, Indiana University, Bloomington, Indiana 47406, USA.
|