1
|
Moreira-Filho JT, Ranganath D, Conway M, Schmitt C, Kleinstreuer N, Mansouri K. Democratizing cheminformatics: interpretable chemical grouping using an automated KNIME workflow. J Cheminform 2024; 16:101. [PMID: 39152469 PMCID: PMC11330086 DOI: 10.1186/s13321-024-00894-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Accepted: 08/06/2024] [Indexed: 08/19/2024] Open
Abstract
With the increased availability of chemical data in public databases, innovative techniques and algorithms have emerged for the analysis, exploration, visualization, and extraction of information from these data. One such technique is chemical grouping, where chemicals with common characteristics are categorized into distinct groups based on physicochemical properties, use, biological activity, or a combination. However, existing tools for chemical grouping often require specialized programming skills or the use of commercial software packages. To address these challenges, we developed a user-friendly chemical grouping workflow implemented in KNIME, a free, open-source, low/no-code, data analytics platform. The workflow serves as an all-encompassing tool, expertly incorporating a range of processes such as molecular descriptor calculation, feature selection, dimensionality reduction, hyperparameter search, and supervised and unsupervised machine learning methods, enabling effective chemical grouping and visualization of results. Furthermore, we implemented tools for interpretation, identifying key molecular descriptors for the chemical groups, and using natural language summaries to clarify the rationale behind these groupings. The workflow was designed to run seamlessly in both the KNIME local desktop version and KNIME Server WebPortal as a web application. It incorporates interactive interfaces and guides to assist users in a step-by-step manner. We demonstrate the utility of this workflow through a case study using an eye irritation and corrosion dataset.
Scientific contributions
This work presents a novel, comprehensive chemical grouping workflow in KNIME, enhancing accessibility by integrating a user-friendly graphical interface that eliminates the need for extensive programming skills. This workflow uniquely combines several features such as automated molecular descriptor calculation, feature selection, dimensionality reduction, and machine learning algorithms (both supervised and unsupervised), with hyperparameter optimization to refine chemical grouping accuracy. Moreover, we have introduced an innovative interpretative step and natural language summaries to elucidate the underlying reasons for chemical groupings, significantly advancing the usability of the tool and interpretability of the results.
Affiliation(s)
- José T Moreira-Filho
  - National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, Division of Translational Toxicology, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina, USA
- Dhruv Ranganath
  - University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Mike Conway
  - National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina, USA
- Charles Schmitt
  - Division of Translational Toxicology, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina, USA
- Nicole Kleinstreuer
  - National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, Division of Translational Toxicology, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina, USA
- Kamel Mansouri
  - National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, Division of Translational Toxicology, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina, USA
|
2
|
Contreras-Torres E, Marrero-Ponce Y, Terán JE, Agüero-Chapin G, Antunes A, García-Jacas CR. Fuzzy spherical truncation-based multi-linear protein descriptors: From their definition to application in structural-related predictions. Front Chem 2022; 10:959143. [PMID: 36277354 PMCID: PMC9585278 DOI: 10.3389/fchem.2022.959143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Accepted: 08/15/2022] [Indexed: 11/13/2022] Open
Abstract
This study introduces a set of fuzzy spherically truncated three-dimensional (3D) multi-linear descriptors for proteins. These indices codify geometric structural information from kth spherically truncated spatial-(dis)similarity two-tuple and three-tuple tensors. The coefficients of these truncated tensors are calculated by applying a smoothing value to the 3D structural encoding based on the relationships between two and three amino acids of a protein embedded into a sphere. Assuming that the geometrical center of the protein coincides with the center of the sphere, the distance between each amino acid involved in any specific interaction and the geometrical center of the protein can be computed. Then, the fuzzy membership degree of each amino acid with respect to a spherical region of interest is computed by fuzzy membership functions (FMFs). The truncation value is finally a combination of the membership degrees of the interacting amino acids, using the arithmetic mean as the fusion rule. Several fuzzy membership functions with diverse biases in the calculation of amino acid memberships (e.g., Z-shaped (close to the center), PI-shaped (middle region), and A-Gaussian (far from the center)) were considered, as well as traditional truncation functions (e.g., Switching). These truncation functions were comparatively evaluated by exploring: 1) the frequency of membership degrees, 2) their variability and orthogonality, based on Shannon entropy and principal component analysis, respectively, and 3) the performance of alignment-free prediction of protein folding rates and structural classes. These analyses revealed the singularity of the proposed fuzzy spherically truncated MDs with respect to the classical (non-truncated) ones and to the MDs truncated with traditional functions. They also showed improved predictive power, attaining an external correlation coefficient of 95.82% in folding rate modelling and an accuracy of 100% in distinguishing structural protein classes. These outcomes are better than the ones attained by existing approaches, justifying the theoretical contribution of this report. Thus, the fuzzy spherically truncated-based protein descriptors from MuLiMs-MCoMPAs (http://tomocomd.com/mulims-mcompas) are promising alignment-free predictors for modeling protein functions and properties.
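The truncation step described in this abstract can be illustrated with a minimal Python sketch. It assumes the standard Z-shaped membership function with two breakpoints `a` and `b`; the breakpoints, the toy coordinates, and the function names are hypothetical and are not taken from the paper or from MuLiMs-MCoMPAs.

```python
import math

def z_shaped(x, a, b):
    """Standard Z-shaped membership: 1 for x <= a, 0 for x >= b, smooth in between."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    mid = (a + b) / 2.0
    if x <= mid:
        return 1.0 - 2.0 * ((x - a) / (b - a)) ** 2
    return 2.0 * ((x - b) / (b - a)) ** 2

def centroid(coords):
    """Geometrical center of a list of 3D points (assumed to be the sphere center)."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def truncation_value(coords, residues, a, b):
    """Arithmetic mean of the membership degrees of the interacting residues."""
    center = centroid(coords)
    degrees = [z_shaped(math.dist(coords[i], center), a, b) for i in residues]
    return sum(degrees) / len(degrees)

# Hypothetical residue coordinates (one point per amino acid); centroid is the origin.
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (-2.0, 0.0, 0.0),
          (0.0, 6.0, 0.0), (0.0, -6.0, 0.0)]

# A pair near the center keeps most of its weight; a pair reaching the surface is damped.
inner_pair = truncation_value(coords, [0, 1], a=1.0, b=5.0)
outer_pair = truncation_value(coords, [1, 3], a=1.0, b=5.0)
```

With a Z-shaped function, interactions between residues close to the center are weighted most heavily, which matches the "close to the center" bias the abstract attributes to this function.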
Affiliation(s)
- Ernesto Contreras-Torres
  - Grupo de Medicina Molecular y Traslacional (MeM&T), Colegio de Ciencias de la Salud (COCSA), Escuela de Medicina, Universidad San Francisco de Quito (USFQ), Quito, Pichincha, Ecuador
  - Instituto de Simulación Computacional (ISC-USFQ), Quito, Pichincha, Ecuador
  - BCAM - Basque Center for Applied Mathematics, Bilbao, Spain
- Yovani Marrero-Ponce
  - Grupo de Medicina Molecular y Traslacional (MeM&T), Colegio de Ciencias de la Salud (COCSA), Escuela de Medicina, Universidad San Francisco de Quito (USFQ), Quito, Pichincha, Ecuador
  - Instituto de Simulación Computacional (ISC-USFQ), Quito, Pichincha, Ecuador
  - Computer-Aided Molecular “Biosilico” Discovery and Bioinformatics Research International Network (CAMD-BIR IN), Quito, Ecuador
  - *Correspondence: Yovani Marrero-Ponce, César R. García-Jacas
- Julio E. Terán
  - Grupo de Medicina Molecular y Traslacional (MeM&T), Colegio de Ciencias de la Salud (COCSA), Escuela de Medicina, Universidad San Francisco de Quito (USFQ), Quito, Pichincha, Ecuador
  - Instituto de Simulación Computacional (ISC-USFQ), Quito, Pichincha, Ecuador
  - Department of Textile Engineering, Chemistry and Science, College of Textiles, North Carolina State University, Raleigh, NC, United States
- Guillermin Agüero-Chapin
  - CIIMAR - Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Porto, Portugal
  - Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Porto, Portugal
- Agostinho Antunes
  - CIIMAR - Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Porto, Portugal
  - Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Porto, Portugal
- César R. García-Jacas
  - Cátedras Conacyt - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Ensenada, Baja California, Mexico
  - *Correspondence: Yovani Marrero-Ponce, César R. García-Jacas
|
3
|
Prada Gori DN, Llanos MA, Bellera CL, Talevi A, Alberca LN. iRaPCA and SOMoC: Development and Validation of Web Applications for New Approaches for the Clustering of Small Molecules. J Chem Inf Model 2022; 62:2987-2998. [PMID: 35687523 DOI: 10.1021/acs.jcim.2c00265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Abstract
The clustering of small molecules implies the organization of a group of chemical structures into smaller subgroups with similar features. Clustering has important applications to sample chemical datasets or libraries in a representative manner (e.g., to choose, from a virtual screening hit list, a chemically diverse subset of compounds to be submitted to experimental confirmation, or to split datasets into representative training and validation sets when implementing machine learning models). Most strategies for clustering molecules are based on molecular fingerprints and hierarchical clustering algorithms. Here, two open-source in-house methodologies for clustering of small molecules are presented: iterative Random subspace Principal Component Analysis clustering (iRaPCA), an iterative approach based on feature bagging, dimensionality reduction, and K-means optimization; and Silhouette Optimized Molecular Clustering (SOMoC), which combines molecular fingerprints with the Uniform Manifold Approximation and Projection (UMAP) and Gaussian Mixture Model algorithm (GMM). In a benchmarking exercise, the performance of both clustering methods has been examined across 29 datasets containing between 100 and 5000 small molecules, comparing these results with those given by two other well-known clustering methods, Ward and Butina. iRaPCA and SOMoC consistently showed the best performance across these 29 datasets, both in terms of within-cluster and between-cluster distances. Both iRaPCA and SOMoC have been implemented as free Web Apps and standalone applications, to allow their use to a wide audience within the scientific community.
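Both methods in this abstract rely on an internal validity score to choose among candidate clusterings (SOMoC is silhouette-optimized by name). The following Python sketch shows only that selection criterion, assuming Euclidean distances on toy 2D points; it is not the iRaPCA or SOMoC implementation, and the data and candidate labelings are hypothetical.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient of a partition, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical 2D embedding of four molecules and two candidate partitions.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
candidates = {"compact": [0, 0, 1, 1], "mixed": [0, 1, 0, 1]}

# Keep the candidate partition with the highest mean silhouette.
best = max(candidates, key=lambda name: silhouette(points, candidates[name]))
```

In a full pipeline the candidates would come from varying the number of clusters or the embedding parameters; the score itself is computed the same way.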
Affiliation(s)
- Denis N Prada Gori
  - Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
- Manuel A Llanos
  - Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
- Carolina L Bellera
  - Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
- Alan Talevi
  - Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
- Lucas N Alberca
  - Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
|
4
|
Abstract
Molecular descriptors encode a variety of molecular representations for computer-assisted drug discovery. Here, we focus on the Weighted Holistic Atom Localization and Entity Shape (WHALES) descriptors, which were originally designed for scaffold hopping from natural products to synthetic molecules. WHALES descriptors capture molecular shape and partial charges simultaneously. We introduce the key aspects of the WHALES concept and provide a step-by-step guide on how to use these descriptors for virtual compound screening and scaffold hopping. The results presented can be reproduced by using the code freely available at github.com/ETHmodlab/scaffold_hopping_whales.
Affiliation(s)
- Francesca Grisoni
  - Department of Chemistry and Applied Biosciences, RETHINK, ETH Zurich, Zurich, Switzerland
- Gisbert Schneider
  - Department of Chemistry and Applied Biosciences, RETHINK, ETH Zurich, Zurich, Switzerland
|
5
|
A unified view of density-based methods for semi-supervised clustering and classification. Data Min Knowl Discov 2020; 33:1894-1952. [PMID: 32831623 PMCID: PMC7410108 DOI: 10.1007/s10618-019-00651-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/19/2018] [Accepted: 08/08/2019] [Indexed: 11/23/2022]
Abstract
Semi-supervised learning is drawing increasing attention in the era of big data, as the gap between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data that are laborious and expensive to obtain is dramatically increasing. In this paper, we first introduce a unified view of density-based clustering algorithms. We then build upon this view and bridge the areas of semi-supervised clustering and classification under a common umbrella of density-based techniques. We show that there are close relations between density-based clustering algorithms and the graph-based approach for transductive classification. These relations are then used as a basis for a new framework for semi-supervised classification based on building-blocks from density-based clustering. This framework is not only efficient and effective, but it is also statistically sound. In addition, we generalize the core algorithm in our framework, HDBSCAN*, so that it can also perform semi-supervised clustering by directly taking advantage of any fraction of labeled data that may be available. Experimental results on a large collection of datasets show the advantages of the proposed approach both for semi-supervised classification as well as for semi-supervised clustering.
|
6
|
Diéguez-Santana K, Rivera-Borroto OM, Puris A, Pham-The H, Le-Thi-Thu H, Rasulev B, Casañola-Martin GM. Beyond model interpretability using LDA and decision trees for α-amylase and α-glucosidase inhibitor classification studies. Chem Biol Drug Des 2019; 94:1414-1421. [PMID: 30908888 DOI: 10.1111/cbdd.13518] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 09/22/2018] [Revised: 02/17/2019] [Accepted: 03/03/2019] [Indexed: 12/17/2022]
Abstract
Two data sets involving the main antidiabetic enzyme targets, α-amylase and α-glucosidase, are used in this report. The α-amylase and α-glucosidase inhibitory activity of candidate antidiabetic compounds is predicted using LDA and classification trees (CT). A large data set of 640 compounds for α-amylase and 1546 compounds for α-glucosidase was selected to develop the tree model. The CT-J48 models showed the better classification performance for both targets, with values above 80%-90% for the training and prediction sets, respectively. The best model shows an accuracy higher than 95% for the training set; the model was also validated using a 10-fold cross-validation procedure and a test set, achieving accuracy values of 85.32% and 86.80%, respectively. Additionally, the obtained model compares favorably with other approaches previously published in the international literature. Finally, the present results provide a double-target approach for improving the identification of antidiabetic chemicals through a two-way workflow in virtual screening pipelines.
Affiliation(s)
- Oscar M Rivera-Borroto
  - Departamento de Química Física Aplicada, Facultad de Ciencias, Universidad Autónoma de Madrid, Madrid, Spain
- Amilkar Puris
  - Facultad de Ciencias de La Ingeniería, Universidad Técnica Estatal de Quevedo, Quevedo, Ecuador
- Huong Le-Thi-Thu
  - School of Medicine and Pharmacy, Vietnam National University, Hanoi, Vietnam
- Bakhtiyor Rasulev
  - Department of Coatings and Polymeric Materials, North Dakota State University, Fargo, North Dakota
|
7
|
Kaneko H. Sparse Generative Topographic Mapping for Both Data Visualization and Clustering. J Chem Inf Model 2018; 58:2528-2535. [DOI: 10.1021/acs.jcim.8b00528] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/28/2022]
Affiliation(s)
- Hiromasa Kaneko
  - Department of Applied Chemistry, School of Science and Technology, Meiji University, 1-1-1 Higashi-Mita, Tama-ku, Kawasaki, Kanagawa 214-8571, Japan
|
8
|
Prathipati P, Mizuguchi K. Integration of Ligand and Structure Based Approaches for CSAR-2014. J Chem Inf Model 2015; 56:974-87. [PMID: 26492437 DOI: 10.1021/acs.jcim.5b00477] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 01/17/2023]
Abstract
The prediction of binding poses and affinities is an area of active interest in computer-aided drug design (CADD). Given the documented limitations with either ligand or structure based approaches, we employed an integrated approach and developed a rapid protocol for binding mode and affinity predictions. This workflow was applied to the three protein targets of Community Structure-Activity Resource-2014 (CSAR-2014) exercise: Factor Xa (FXa), Spleen Tyrosine Kinase (SYK), and tRNA (guanine-N(1))-methyltransferase (TrmD). Our docking and scoring workflow incorporates compound clustering and ligand and protein structure based pharmacophore modeling, followed by local docking, minimization, and scoring. While the former part of the protocol ensures high-quality ligand alignments and mapping, the subsequent minimization and scoring provides the predicted binding modes and affinities. We made blind predictions of docking pose for 1, 5, and 14 ligands docked into 1, 2, and 12 crystal structures of FXa, SYK, and TrmD, respectively. The resulting 174 poses were compared with cocrystallized structures (1, 5, and 14 complexes) made available at the end of CSAR. Our predicted poses were related to the experimentally determined structures with a mean root-mean-square deviation value of 3.4 Å. Further, we were able to classify high and low affinity ligands with the area under the curve values of 0.47, 0.60, and 0.69 for FXa, SYK, and TrmD, respectively, indicating the validity of our approach in at least two of the three systems. Detailed critical analysis of the results and CSAR methodology ranking procedures suggested that a straightforward application of our workflow has limitations, as some of the performance measures do not reflect the actual utility of pose and affinity predictions in the biological context of individual systems.
Affiliation(s)
- Philip Prathipati
  - National Institutes of Biomedical Innovation, Health and Nutrition, 7-6-8 Saito-Asagi, Ibaraki City, Osaka 567-0085, Japan
- Kenji Mizuguchi
  - National Institutes of Biomedical Innovation, Health and Nutrition, 7-6-8 Saito-Asagi, Ibaraki City, Osaka 567-0085, Japan
|
9
|
Saeed F, Salim N, Abdo A. Consensus methods for combining multiple clusterings of chemical structures. J Chem Inf Model 2013; 53:1026-34. [PMID: 23581471 DOI: 10.1021/ci300442u] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/30/2022]
Abstract
The goal of consensus clustering methods is to find a consensus partition that optimally summarizes an ensemble and improves the quality of clustering compared with single clustering algorithms. In this paper, an enhanced voting-based consensus method was introduced and compared with other consensus clustering methods, including co-association-based, graph-based, and voting-based consensus methods. The MDDR and MUV data sets were used for the experiments and were represented by three 2D fingerprints: ALOGP, ECFP_4, and ECFC_4. The results were evaluated based on the ability of the clustering method to separate active from inactive molecules in each cluster using four criteria: F-measure, Quality Partition Index (QPI), Rand Index (RI), and Fowlkes-Mallows Index (FMI). The experiments suggest that the consensus methods can deliver significant improvements for the effectiveness of chemical structures clustering.
Affiliation(s)
- Faisal Saeed
  - Faculty of Computing, Universiti Teknologi Malaysia, Malaysia
|
10
|
MacCuish JD, MacCuish NE. Chemoinformatics applications of cluster analysis. Wiley Interdiscip Rev Comput Mol Sci 2013. [DOI: 10.1002/wcms.1152] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 01/02/2023]
|
11
|
Palacios-Bejarano B, Cerruela García G, Luque Ruiz I, Gómez-Nieto MÁ. QSAR model based on weighted MCS trees approach for the representation of molecule data sets. J Comput Aided Mol Des 2013; 27:185-201. [DOI: 10.1007/s10822-013-9637-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/12/2012] [Accepted: 02/01/2013] [Indexed: 11/28/2022]
|
12
|
Saeed F, Salim N, Abdo A, Hentabli H. Graph-Based Consensus Clustering for Combining Multiple Clusterings of Chemical Structures. Mol Inform 2013; 32:165-78. [PMID: 27481278 DOI: 10.1002/minf.201200110] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 09/20/2012] [Accepted: 12/09/2012] [Indexed: 11/10/2022]
Abstract
Consensus clustering methods have been successfully used for combining multiple classifiers in many areas such as machine learning, applied statistics, pattern recognition and bioinformatics. In this paper, consensus clustering is used for combining the clusterings of chemical structures to enhance the ability of separating biologically active molecules from inactive ones in each cluster. Two graph-based consensus clustering methods were examined. The Quality Partition Index method (QPI) was used to evaluate the clusterings and the results were compared to the Ward's clustering method. Two homogeneous and heterogeneous subsets DS1-DS2 of MDL Drug Data Report database (MDDR) were used for experiments and represented by two 2D fingerprints. The results, obtained by a combination of multiple runs of an individual clustering and a single run of multiple individual clusterings, showed that graph-based consensus clustering methods can improve the effectiveness of chemical structures clusterings.
Affiliation(s)
- Faisal Saeed
  - Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, Malaysia
  - Information Technology Department, Sanhan Community College, Sanaa, Yemen
- Naomie Salim
  - Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, Malaysia
- Ammar Abdo
  - Computer Science Department, Hodeidah University, Hodeidah, Yemen
  - LIFL UMR CNRS 8022 Université Lille 1 and INRIA Lille Nord Europe, 59655 Villeneuve d'Ascq cedex, France
- Hamza Hentabli
  - Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, Malaysia
|
13
|
Saeed F, Salim N, Abdo A. Voting-based consensus clustering for combining multiple clusterings of chemical structures. J Cheminform 2012; 4:37. [PMID: 23244782 PMCID: PMC3541359 DOI: 10.1186/1758-2946-4-37] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 09/24/2012] [Accepted: 12/11/2012] [Indexed: 11/26/2022] Open
Abstract
Background: Although many consensus clustering methods have been successfully used for combining multiple classifiers in many areas such as machine learning, applied statistics, pattern recognition and bioinformatics, few consensus clustering methods have been applied for combining multiple clusterings of chemical structures. It is known that any individual clustering method will not always give the best results for all types of applications. So, in this paper, three voting and graph-based consensus clusterings were used for combining multiple clusterings of chemical structures to enhance the ability of separating biologically active molecules from inactive ones in each cluster.
Results: The cumulative voting-based aggregation algorithm (CVAA), cluster-based similarity partitioning algorithm (CSPA) and hyper-graph partitioning algorithm (HGPA) were examined. The F-measure and Quality Partition Index method (QPI) were used to evaluate the clusterings and the results were compared to the Ward's clustering method. The MDL Drug Data Report (MDDR) dataset was used for experiments and was represented by two 2D fingerprints, ALOGP and ECFP_4. The performance of voting-based consensus clustering method outperformed the Ward's method using F-measure and QPI method for both ALOGP and ECFP_4 fingerprints, while the graph-based consensus clustering methods outperformed the Ward's method only for ALOGP using QPI. The Jaccard and Euclidean distance measures were the methods of choice to generate the ensembles, which give the highest values for both criteria.
Conclusions: The results of the experiments show that consensus clustering methods can improve the effectiveness of chemical structures clusterings. The cumulative voting-based aggregation algorithm (CVAA) was the method of choice among consensus clustering methods.
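The general idea of combining several clusterings of the same molecules can be sketched with a co-association matrix: the fraction of runs in which two items share a cluster. The Python sketch below merges items that co-occur in more than half of the runs. This is a deliberately simple stand-in and is not CVAA, CSPA, or HGPA as evaluated in the paper; the threshold, function names, and toy partitions are assumptions.

```python
from itertools import combinations

def co_association(partitions):
    """Fraction of partitions in which each pair of items shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(partitions)
    return [[c / runs for c in row] for row in counts]

def consensus_labels(partitions, threshold=0.5):
    """Link items whose co-association exceeds the threshold (union-find)."""
    n = len(partitions[0])
    matrix = co_association(partitions)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if matrix[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel the resulting components as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]

# Three hypothetical clustering runs over five molecules; label values are
# arbitrary within each run, only co-membership matters.
partitions = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
]
consensus = consensus_labels(partitions)
```

Because only co-membership is counted, the runs do not need aligned label values, which is the label-correspondence issue that voting-based methods have to solve explicitly.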
Affiliation(s)
- Faisal Saeed
  - Faculty of Computer Science and Information Systems, University Technology of Malaysia, Johor, Malaysia
  - Information Technology Department, Sanhan Community College, Sana'a, Yemen
- Naomie Salim
  - Faculty of Computer Science and Information Systems, University Technology of Malaysia, Johor, Malaysia
- Ammar Abdo
  - Department of Computer Science, Alhodaida University, Alhodaida, Yemen
  - LIFL UMR CNRS 8022 Université Lille 1 and INRIA Lille Nord Europe, 59655 Villeneuve d'Ascq cedex, Lille, France
|
14
|
Hechinger M, Leonhard K, Marquardt W. What is Wrong with Quantitative Structure–Property Relations Models Based on Three-Dimensional Descriptors? J Chem Inf Model 2012; 52:1984-93. [DOI: 10.1021/ci300246m] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.9] [Indexed: 12/12/2022]
Affiliation(s)
- M. Hechinger
  - AVT-Process Systems Engineering and ‡Chair of Technical Thermodynamics, RWTH Aachen University, 52064 Aachen, Germany
- K. Leonhard
  - AVT-Process Systems Engineering and ‡Chair of Technical Thermodynamics, RWTH Aachen University, 52064 Aachen, Germany
- W. Marquardt
  - AVT-Process Systems Engineering and ‡Chair of Technical Thermodynamics, RWTH Aachen University, 52064 Aachen, Germany
|
15
|
Rivera-Borroto OM, Rabassa-Gutiérrez M, Grau-Ábalo RDC, Marrero-Ponce Y, García-de la Vega JM. Dunn's index for cluster tendency assessment of pharmacological data sets. Can J Physiol Pharmacol 2012; 90:425-33. [PMID: 22443093 DOI: 10.1139/y2012-002] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 11/22/2022]
Abstract
Cluster tendency assessment is an important stage in cluster analysis. In this sense, a group of promising techniques named visual assessment of tendency (VAT) has emerged in the literature. The presence of clusters can be detected easily through the direct observation of a dark blocks structure along the main diagonal of the intensity image. Alternatively, if the Dunn's index for a single linkage partition is greater than 1, then it is a good indication of the blocklike structure. In this report, the Dunn's index is applied as a novel measure of tendency on 8 pharmacological data sets, represented by machine-learning-selected molecular descriptors. In all cases, observed values are less than 1, thus indicating a weak tendency for data to form compact clusters. Other results suggest that there is an increasing relationship between the Dunn's index as a measure of cluster separability and the classification accuracy of various cluster algorithms tested on the same data sets.
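The "greater than 1" criterion in this abstract is easier to read with the index written out. The Python sketch below computes Dunn's index as the smallest between-cluster distance divided by the largest within-cluster diameter, assuming Euclidean distance and labels supplied by the caller (the paper uses a single-linkage partition; producing that partition is not shown here). The toy points are hypothetical.

```python
import math
from itertools import combinations

def dunn_index(points, labels):
    """Smallest between-cluster distance over largest within-cluster diameter."""
    clusters = {}
    for point, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(point)
    if len(clusters) < 2:
        raise ValueError("Dunn's index needs at least two clusters")
    # Largest distance between two points of the same cluster.
    diameter = max(
        (math.dist(p, q)
         for members in clusters.values()
         for p, q in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between two points of different clusters.
    separation = min(
        math.dist(p, q)
        for a, b in combinations(list(clusters.values()), 2)
        for p in a
        for q in b
    )
    if diameter == 0.0:
        return math.inf  # every cluster is a single point or coincident points
    return separation / diameter

# Well-separated toy data: the gap between clusters exceeds every diameter.
separated = dunn_index([(0, 0), (1, 0), (5, 0), (6, 0)], [0, 0, 1, 1])

# Stretched clusters that nearly touch: the index drops below 1.
stretched = dunn_index([(0, 0), (3, 0), (4, 0), (7, 0)], [0, 0, 1, 1])
```

A value above 1 means every cluster is narrower than the gap to its nearest neighbor, which corresponds to the dark-block structure along the diagonal of a VAT image.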
Affiliation(s)
- Oscar Miguel Rivera-Borroto
  - Laboratorio de Bioinformática, Centro de Estudios de Informática, Física y Computación, Universidad Central "Marta Abreu" de Las Villas, Santa Clara, Cuba
|