1
|
Dutschmann TM, Schlenker V, Baumann K. Chemoinformatic regression methods and their applicability domain. Mol Inform 2024; 43:e202400018. [PMID: 38803302 DOI: 10.1002/minf.202400018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Revised: 03/24/2024] [Accepted: 03/25/2024] [Indexed: 05/29/2024]
Abstract
The growing interest in chemoinformatic model uncertainty calls for a summary of the most widely used regression techniques and how to estimate their reliability. Regression models learn a mapping from the space of explanatory variables to the space of continuous output values. Among other limitations, the predictive performance of the model is restricted by the training data used for model fitting. Identification of unusual objects by outlier detection methods can improve model performance. Additionally, proper model evaluation necessitates defining the limitations of the model, often called the applicability domain. Comparable to certain classifiers, some regression techniques come with built-in methods or augmentations to quantify their (un)certainty, while others rely on generic procedures. The theoretical background of their working principles and how to deduce specific and general definitions for their domain of applicability shall be explained.
Affiliation(s)
- Thomas-Martin Dutschmann
- Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, 38106, Braunschweig, Germany
- Valerie Schlenker
- Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, 38106, Braunschweig, Germany
- Knut Baumann
- Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, 38106, Braunschweig, Germany
|
2
|
Arockiaraj M, Kavitha SRJ, Klavžar S, Fiona JC, Balasubramanian K. Topological, Spectroscopic and Energetic Properties of Cycloparaphenylene Series. Polycycl Aromat Compd 2023. [DOI: 10.1080/10406638.2023.2186442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/14/2023]
Affiliation(s)
- Sandi Klavžar
- Faculty of Mathematics and Physics, University of Ljubljana, Ljubljana, Slovenia
- Faculty of Natural Sciences and Mathematics, University of Maribor, Slovenia
- Institute of Mathematics, Physics and Mechanics, Ljubljana, Slovenia
- J. Celin Fiona
- Department of Mathematics, Loyola College, Chennai, India
|
3
|
On Neighborhood Degree-Based Topological Analysis over Melamine-Based TriCF Structure. Symmetry (Basel) 2023. [DOI: 10.3390/sym15030635] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2023] Open
Abstract
Triazine-based covalent organic frameworks (TriCFs), synthesized from melamine and cyanuric acid, are a new class of synthetic lubricant that is thermostable and has a lamellar structure. This article shows how topological descriptors of the TriCF structure can be evaluated exactly from the degree sums of end-vertex neighbors; molecular descriptors based on multiplicative neighborhood degree sums are also evaluated. Neighborhood entropy measures for the results are provided as well. The results are compared using the graph theoretical method.
|
4
|
Rajpoot A, Selvaganesh L. Potential application of novel AL-indices as molecular descriptors. J Mol Graph Model 2023; 118:108353. [PMID: 36265269 DOI: 10.1016/j.jmgm.2022.108353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Revised: 09/26/2022] [Accepted: 09/27/2022] [Indexed: 11/06/2022]
Abstract
A topological index, or descriptor, is a graph invariant that describes the structure of a graph as a numerical value. This paper proposes eight novel indices called AL-indices, based on distance and degree, and analyzes their behavior to show that they can serve as potentially useful molecular descriptors. We find that many of the proposed indices have very good discriminative power compared with existing vertex-degree-based (VDB) indices, which justifies introducing them. Further, we propose a method to compute these indices and the VDB indices from a recently proposed graph matrix, referred to as the neighborhood matrix. Computationally, we assess the proposed indices by correlating them with the physicochemical properties of octane isomers and polychlorobiphenyls (PCBs), and we compare them with a few well-known VDB indices. The proposed indices are also shown to have higher discriminative power on the considered datasets. Among all the indices under study, four indices, namely AL4, AL5, AL7, and AL8, show the highest discriminative power on the set of octane isomers, while AL4 and AL7 have the highest discriminative power on the set of PCB molecules compared with the other VDB indices. Among the proposed and considered indices, we show that the first index, AL1, correlates well with the acentric factor and entropy of octane isomers and with the total surface area, log water solubility, relative retention time, octanol-water partition coefficient, and log water activity coefficient of PCBs.
Affiliation(s)
- Abhay Rajpoot
- Department of Mathematical Sciences, Indian Institute of Technology (BHU), Varanasi 221005, India.
- Lavanya Selvaganesh
- Department of Mathematical Sciences, Indian Institute of Technology (BHU), Varanasi 221005, India.
|
5
|
Abstract
A topological index is a numerical value associated with a chemical constitution, used to correlate chemical structure with physical properties, chemical reactivity, or biological activity. In this work, some new indices based on the neighborhood degree sum of nodes are proposed. To make the computation of the novel indices convenient, an algorithm is designed. Quantitative structure-property relationship (QSPR) study is a good statistical method for investigating drug activity or binding mode for different receptors. A QSPR analysis of the newly introduced indices is carried out here and reveals their predictive power. The novel indices are compared with some well-known and widely used indices in structure-property modelling and isomer discrimination. Some mathematical properties of these indices are also discussed.
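As a rough illustration of the quantity this abstract builds on, the sketch below computes the neighborhood degree sum of each node (the sum of the degrees of its neighbors) for a small hydrogen-suppressed molecular graph. The adjacency-list input, the function names, and the example edge-sum index are assumptions made for illustration only; they are not the indices or the algorithm proposed in the paper.

```python
def neighborhood_degree_sums(adjacency):
    """Return {node: sum of the degrees of that node's neighbors}."""
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    return {v: sum(degree[u] for u in nbrs) for v, nbrs in adjacency.items()}


def edge_sum_index(adjacency):
    """Hypothetical index for illustration: sum over edges uv of S(u) + S(v),
    where S is the neighborhood degree sum."""
    s = neighborhood_degree_sums(adjacency)
    total = 0
    for v, nbrs in adjacency.items():
        for u in nbrs:
            if v < u:  # count each undirected edge once
                total += s[u] + s[v]
    return total


# Carbon skeleton of 2-methylbutane: C0-C1, C1-C2, C2-C3, C1-C4
isopentane = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}

sums = neighborhood_degree_sums(isopentane)
index_value = edge_sum_index(isopentane)
```

Any index of this family can be obtained by swapping the per-edge expression in `edge_sum_index` for another symmetric function of the two neighborhood degree sums.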
|
6
|
Ciura K, Ulenberg S, Kapica H, Kawczak P, Belka M, Bączek T. Drug affinity to human serum albumin prediction by retention of cetyltrimethylammonium bromide pseudostationary phase in micellar electrokinetic chromatography and chemically advanced template search descriptors. J Pharm Biomed Anal 2020; 188:113423. [PMID: 32623315 DOI: 10.1016/j.jpba.2020.113423] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/20/2020] [Revised: 06/08/2020] [Accepted: 06/09/2020] [Indexed: 01/12/2023]
Abstract
The development of high-throughput methods for estimating the physicochemical and biological properties of drug candidates is highly desired in the pharmaceutical landscape. Affinity to plasma protein is one of the most important biological properties and should be taken into consideration during the design and assessment of potential future medicines. The main goal of this study was to develop a quantitative retention-activity relationship model, with a rationalized in vivo and in silico approach, to predict the affinity to human serum albumin (HSA), which is one of the most important plasma proteins. To achieve this goal, a set of 27 chemically diverse drugs with known affinity to HSA was analyzed by micellar electrokinetic chromatography (MEKC). The proposed model for HSA affinity assessment was based on retention in a hexadecyltrimethylammonium bromide (CTAB) pseudostationary phase and chemically advanced template search (CATS) pharmacophore descriptors. Several regression methods, namely multiple linear regression (MLR), partial least squares regression (PLS), orthogonal partial least squares (OPLS), and support vector machine (SVM), were compared to develop the model with the highest predictability. The obtained models are suitable for predicting drug affinity to human serum albumin from the retention factor determined by MEKC and CATS descriptors. They differ only slightly in terms of the coefficient of determination, the Q2 value calculated using leave-one-out cross-validation, the root-mean-square error of cross-validation (RMSECV), and the root-mean-square error of prediction (RMSEP) obtained by external validation.
Affiliation(s)
- Krzesimir Ciura
- Department of Physical Chemistry, Faculty of Pharmacy, Medical University of Gdansk, 107 J. Hallera Avenue, 80-416, Gdansk, Poland.
- Szymon Ulenberg
- Department of Pharmaceutical Chemistry, Faculty of Pharmacy, Medical University of Gdansk, 107 J. Hallera Avenue, 80-416 Gdansk, Poland
- Hanna Kapica
- Department of Physical Chemistry, Faculty of Pharmacy, Medical University of Gdansk, 107 J. Hallera Avenue, 80-416, Gdansk, Poland
- Piotr Kawczak
- Department of Pharmaceutical Chemistry, Faculty of Pharmacy, Medical University of Gdansk, 107 J. Hallera Avenue, 80-416 Gdansk, Poland
- Mariusz Belka
- Department of Pharmaceutical Chemistry, Faculty of Pharmacy, Medical University of Gdansk, 107 J. Hallera Avenue, 80-416 Gdansk, Poland
- Tomasz Bączek
- Department of Pharmaceutical Chemistry, Faculty of Pharmacy, Medical University of Gdansk, 107 J. Hallera Avenue, 80-416 Gdansk, Poland
|
7
|
Schaduangrat N, Lampa S, Simeon S, Gleeson MP, Spjuth O, Nantasenamat C. Towards reproducible computational drug discovery. J Cheminform 2020; 12:9. [PMID: 33430992 PMCID: PMC6988305 DOI: 10.1186/s13321-020-0408-x] [Citation(s) in RCA: 78] [Impact Index Per Article: 19.5] [Received: 07/17/2019] [Accepted: 01/02/2020] [Indexed: 12/11/2022] Open
Abstract
The reproducibility of experiments has been a long-standing impediment to further scientific progress. Computational methods have been instrumental in drug discovery efforts owing to their multifaceted use in data collection, pre-processing, analysis and inference. This article provides in-depth coverage of the reproducibility of computational drug discovery. The review explores the following topics: (1) the current state of the art in reproducible research, (2) research documentation (e.g. electronic laboratory notebook, Jupyter notebook, etc.), (3) the science of reproducible research (i.e. comparison and contrast with related concepts such as replicability, reusability and reliability), (4) model development in computational drug discovery, (5) computational issues in model development and deployment, (6) use case scenarios for streamlining the computational drug discovery protocol. In computational disciplines, it has become common practice to share the data and programming code used for numerical calculations, not only to facilitate reproducibility but also to foster collaboration (i.e. to drive the project further by introducing new ideas, growing the data, augmenting the code, etc.). It is therefore inevitable that the field of computational drug design will adopt an open approach towards the collection, curation and sharing of data and code.
Affiliation(s)
- Nalini Schaduangrat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, 10700, Bangkok, Thailand
- Samuel Lampa
- Department of Pharmaceutical Biosciences, Uppsala University, 751 24, Uppsala, Sweden
- Saw Simeon
- Interdisciplinary Graduate Program in Bioscience, Faculty of Science, Kasetsart University, 10900, Bangkok, Thailand
- Matthew Paul Gleeson
- Department of Biomedical Engineering, Faculty of Engineering, King Mongkut's Institute of Technology Ladkrabang, 10520, Bangkok, Thailand.
- Ola Spjuth
- Department of Pharmaceutical Biosciences, Uppsala University, 751 24, Uppsala, Sweden.
- Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, 10700, Bangkok, Thailand.
|
8
|
Importance of proper statistical practices in the use of chemodescriptors and biodescriptors in the twenty-first century. Future Med Chem 2019; 11:2755-2758. [PMID: 31686545 DOI: 10.4155/fmc-2019-0250] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/17/2022] Open
|
9
|
Dmitriev AV, Lagunin AA, Karasev DA, Rudik AV, Pogodin PV, Filimonov DA, Poroikov VV. Prediction of Drug-Drug Interactions Related to Inhibition or Induction of Drug-Metabolizing Enzymes. Curr Top Med Chem 2019; 19:319-336. [PMID: 30674264 DOI: 10.2174/1568026619666190123160406] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 12/18/2018] [Revised: 01/02/2019] [Accepted: 01/07/2019] [Indexed: 02/07/2023]
Abstract
Drug-drug interaction (DDI) is the alteration of the pharmacological activity of one or more drugs when another drug or drugs are co-administered, as happens in so-called polypharmacy. There are three types of DDIs: pharmacokinetic (PK), pharmacodynamic, and pharmaceutical. PK is the most frequent type of DDI and often results from the inhibition or induction of drug-metabolising enzymes (DMEs). In this review, we summarise in silico methods that may be applied to predict the inhibition or induction of DMEs and describe appropriate computational methods for DDI prediction, showing the current situation and perspectives of these approaches in medicinal and pharmaceutical chemistry. We review sources of information on DDI that can be used in pharmaceutical investigations and medicinal practice and/or for the creation of computational models. The problem of the inaccuracy and redundancy of these data is discussed. We provide information on state-of-the-art physiologically based pharmacokinetic (PBPK) modelling approaches and DME-based in silico methods. In the section on ligand-based methods, we describe pharmacophore models, molecular field analysis, quantitative structure-activity relationships (QSAR), and similarity analysis applied to the prediction of DDIs related to the inhibition or induction of DMEs. In conclusion, we discuss the problems of DDI severity assessment, mention factors that influence severity, and highlight the issues, perspectives and practical use of in silico methods.
Affiliation(s)
- Alexey A Lagunin
- Institute of Biomedical Chemistry, Moscow, Russian Federation; Pirogov Russian National Research Medical University, Moscow, Russian Federation
- Pavel V Pogodin
- Institute of Biomedical Chemistry, Moscow, Russian Federation
|
10
|
Basak SC. Editor's Perspective: Molecular Descriptor Landscape in the Twenty First Century and its Proper Use for Computer-Aided Drug Design. Curr Comput Aided Drug Des 2018; 15:1-2. [PMID: 30569845 DOI: 10.2174/157340991501181214103556] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/22/2022]
Affiliation(s)
- Subhash C Basak
- Natural Resources Research Institute Department of Chemistry & Biochemistry University of Minnesota Duluth Duluth, MN 55811, United States
|
11
|
Andreeva EP, Proshin AN, Serkov IV, Petrova LN, Bachurin SO. Application of Molecular Topological Descriptors for Clustering a Database of Isothiourea Derivatives in Studying Structure – Activity Relationships. Pharm Chem J 2017. [DOI: 10.1007/s11094-017-1595-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
|
12
|
Bradley AR, Wall ID, Green DVS, Deane CM, Marsden BD. OOMMPPAA: a tool to aid directed synthesis by the combined analysis of activity and structural data. J Chem Inf Model 2014; 54:2636-46. [PMID: 25244105 PMCID: PMC4372120 DOI: 10.1021/ci500245d] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 01/13/2023]
Abstract
There is an ever increasing resource in terms of both structural information and activity data for many protein targets. In this paper we describe OOMMPPAA, a novel computational tool designed to inform compound design by combining such data. OOMMPPAA uses 3D matched molecular pairs to generate 3D ligand conformations. It then identifies pharmacophoric transformations between pairs of compounds and associates them with their relevant activity changes. OOMMPPAA presents this data in an interactive application providing the user with a visual summary of important interaction regions in the context of the binding site. We present validation of the tool using openly available data for CDK2 and a GlaxoSmithKline data set for a SAM-dependent methyl-transferase. We demonstrate OOMMPPAA's application in optimizing both potency and cell permeability and use OOMMPPAA to highlight nuanced and cross-series SAR. OOMMPPAA is freely available to download at http://oommppaa.sgc.ox.ac.uk/OOMMPPAA/ .
Affiliation(s)
- Anthony R Bradley
- SGC, Nuffield Department of Medicine, University of Oxford, Old Road Campus Research Building, Roosevelt Drive, Headington, Oxford OX3 7DQ, U.K.
|
13
|
Varnek A, Baskin I. Machine learning methods for property prediction in chemoinformatics: Quo Vadis? J Chem Inf Model 2012; 52:1413-37. [PMID: 22582859 DOI: 10.1021/ci200409x] [Citation(s) in RCA: 148] [Impact Index Per Article: 12.3] [Indexed: 01/19/2023]
Abstract
This paper is focused on modern approaches to machine learning, most of which are as yet used infrequently or not at all in chemoinformatics. Machine learning methods are characterized in terms of the "modes of statistical inference" and "modeling levels" nomenclature and by considering different facets of the modeling with respect to input/output matching, data types, models duality, and models inference. Particular attention is paid to new approaches and concepts that may provide efficient solutions of common problems in chemoinformatics: improvement of predictive performance of structure-property (activity) models, generation of structures possessing desirable properties, model applicability domain, modeling of properties with functional endpoints (e.g., phase diagrams and dose-response curves), and accounting for multiple molecular species (e.g., conformers or tautomers).
Affiliation(s)
- Alexandre Varnek
- Laboratoire d'Infochimie, UMR 7177 CNRS, Université de Strasbourg, 4, rue B. Pascal, Strasbourg 67000, France.
|
14
|
Li C, Colosi LM. Molecular similarity analysis as tool to prioritize research among emerging contaminants in the environment. Sep Purif Technol 2012. [DOI: 10.1016/j.seppur.2011.02.030] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/16/2022]
|
15
|
Cerruela García G, Luque Ruiz I, Gómez-Nieto MÁ. Analysis and Study of Molecule Data Sets Using Snowflake Diagrams of Weighted Maximum Common Subgraph Trees. J Chem Inf Model 2011; 51:1216-32. [DOI: 10.1021/ci100484z] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/28/2022]
Affiliation(s)
- Gonzalo Cerruela García
- Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
- Irene Luque Ruiz
- Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
- Miguel Ángel Gómez-Nieto
- Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
|
16
|
Basak SC. Role of mathematical chemodescriptors and proteomics-based biodescriptors in drug discovery. Drug Dev Res 2010. [DOI: 10.1002/ddr.20428] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/06/2022]
|
17
|
Kompany-Zareh M, Omidikia N. Jackknife-Based Selection of Gram−Schmidt Orthogonalized Descriptors in QSAR. J Chem Inf Model 2010; 50:2055-66. [DOI: 10.1021/ci100169p] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/30/2022]
Affiliation(s)
- Mohsen Kompany-Zareh
- Department of Chemistry, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan 45137-66731, Iran and Department of Food Science, Faculty of Life Sciences, University of Copenhagen, Rolighedsvej 30, 1958 Frederiksberg C, Denmark
- Nematollah Omidikia
- Department of Chemistry, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan 45137-66731, Iran and Department of Food Science, Faculty of Life Sciences, University of Copenhagen, Rolighedsvej 30, 1958 Frederiksberg C, Denmark
|
18
|
Katritzky AR, Kuanar M, Slavov S, Hall CD, Karelson M, Kahn I, Dobchev DA. Quantitative Correlation of Physical and Chemical Properties with Chemical Structure: Utility for Prediction. Chem Rev 2010; 110:5714-89. [DOI: 10.1021/cr900238d] [Citation(s) in RCA: 386] [Impact Index Per Article: 27.6] [Indexed: 11/28/2022]
Affiliation(s)
- Alan R. Katritzky
- Center for Heterocyclic Compounds, Department of Chemistry, University of Florida, Gainesville, Florida 32611
- Minati Kuanar
- Center for Heterocyclic Compounds, Department of Chemistry, University of Florida, Gainesville, Florida 32611
- Svetoslav Slavov
- Center for Heterocyclic Compounds, Department of Chemistry, University of Florida, Gainesville, Florida 32611
- C. Dennis Hall
- Center for Heterocyclic Compounds, Department of Chemistry, University of Florida, Gainesville, Florida 32611
- Mati Karelson
- Institute of Chemistry, Tallinn University of Technology, Akadeemia tee 15, Tallinn 19086, Estonia, and MolCode, Ltd., Soola 8, Tartu 51013, Estonia
- Iiris Kahn
- Institute of Chemistry, Tallinn University of Technology, Akadeemia tee 15, Tallinn 19086, Estonia, and MolCode, Ltd., Soola 8, Tartu 51013, Estonia
- Dimitar A. Dobchev
- Institute of Chemistry, Tallinn University of Technology, Akadeemia tee 15, Tallinn 19086, Estonia, and MolCode, Ltd., Soola 8, Tartu 51013, Estonia
|
19
|
Ma B, Chen H, Xu M, Hayat T, He Y, Xu J. Quantitative structure-activity relationship (QSAR) models for polycyclic aromatic hydrocarbons (PAHs) dissipation in rhizosphere based on molecular structure and effect size. ENVIRONMENTAL POLLUTION (BARKING, ESSEX : 1987) 2010; 158:2773-2777. [PMID: 20537774 DOI: 10.1016/j.envpol.2010.04.011] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 11/27/2009] [Revised: 04/08/2010] [Accepted: 04/12/2010] [Indexed: 05/29/2023]
Abstract
Rhizoremediation is a significant form of bioremediation for polycyclic aromatic hydrocarbons (PAHs). This study examined the role of molecular structure in determining the rhizosphere effect on PAHs dissipation. Effect size in meta-analysis was employed as activity dataset for building quantitative structure-activity relationship (QSAR) models and accumulative effect sizes of 16 PAHs were used for validation of these models. Based on the genetic algorithm combined with partial least square regression, models for comprehensive dataset, Poaceae dataset, and Fabaceae dataset were built. The results showed that information indices, calculated as information content of molecules based on the calculation of equivalence classes from the molecular graph, were the most important molecular structural indices for QSAR models of rhizosphere effect on PAHs dissipation. The QSAR model, based on the molecular structure indices and effect size, has potential to be used in studying and predicting the rhizosphere effect of PAHs dissipation.
Affiliation(s)
- Bin Ma
- Zhejiang Provincial Key Laboratory of Subtropical Soil and Plant Nutrition, College of Environmental and Natural Resource Sciences, Zhejiang University, Hangzhou 310029, China
|
20
|
Frimayanti N, Zain SM, Rahman NA. Discovering new competitive dengue DEN2 NS2B/NS3 inhibitors using similarity searching. 2010 INTERNATIONAL CONFERENCE ON CHEMISTRY AND CHEMICAL ENGINEERING 2010. [DOI: 10.1109/iccceng.2010.5560354] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 09/01/2023]
|
21
|
Ma J, Tong C, Liaw A, Sheridan R, Szumiloski J, Svetnik V. Generating hypotheses about molecular structure-activity relationships (SARs) by solving an optimization problem. Stat Anal Data Min 2009. [DOI: 10.1002/sam.10040] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/06/2023]
|
22
|
Basak S, Mills D, Hawkins D, Kraker J. Quantitative Structure-Activity Relationship (QSAR) Modeling of Human Blood:Air Partitioning with Proper Statistical Methods and Validation. Chem Biodivers 2009; 6:487-502. [DOI: 10.1002/cbdv.200800111] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 01/13/2023]
|
23
|
Rajappan R, Shingade PD, Natarajan R, Jayaraman VK. Quantitative Structure−Property Relationship (QSPR) Prediction of Liquid Viscosities of Pure Organic Compounds Employing Random Forest Regression. Ind Eng Chem Res 2009. [DOI: 10.1021/ie8018406] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 01/13/2023]
Affiliation(s)
- Remya Rajappan
- Centre for Mathematical Sciences Pala Campus, Arunapuram, Kerala, India 686 574, Chemical Engineering and Process Development Division, National Chemical Laboratory, Pune, India 411 008, and Department of Chemical Engineering, Lakehead University, 955 Oliver Road Thunder Bay, ON, Canada P7B 5E1
- Prashant D. Shingade
- Centre for Mathematical Sciences Pala Campus, Arunapuram, Kerala, India 686 574, Chemical Engineering and Process Development Division, National Chemical Laboratory, Pune, India 411 008, and Department of Chemical Engineering, Lakehead University, 955 Oliver Road Thunder Bay, ON, Canada P7B 5E1
- Ramanathan Natarajan
- Centre for Mathematical Sciences Pala Campus, Arunapuram, Kerala, India 686 574, Chemical Engineering and Process Development Division, National Chemical Laboratory, Pune, India 411 008, and Department of Chemical Engineering, Lakehead University, 955 Oliver Road Thunder Bay, ON, Canada P7B 5E1
- Valadi K. Jayaraman
- Centre for Mathematical Sciences Pala Campus, Arunapuram, Kerala, India 686 574, Chemical Engineering and Process Development Division, National Chemical Laboratory, Pune, India 411 008, and Department of Chemical Engineering, Lakehead University, 955 Oliver Road Thunder Bay, ON, Canada P7B 5E1
|
24
|
Basak SC, Mills D. Predicting the vapour pressure of chemicals from structure: a comparison of graph theoretic versus quantum chemical descriptors. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2009; 20:119-132. [PMID: 19343587 DOI: 10.1080/10629360902726007] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/27/2023]
Abstract
In this paper a set of graph theoretic molecular descriptors was used to predict the normal vapour pressure of a collection of 121 chlorinated organic chemicals. The easily calculated topological descriptors resulted in a robust quantitative structure-property relationship (QSPR) model with a q² of 0.988, which is comparable to a previously published model developed using the computationally expensive density functional theory (DFT) method at the B3LYP level (Becke three-parameter exchange, Lee-Yang-Parr correlation). The addition of computer-intensive quantum chemical descriptors, including polarizability, to the set of topological descriptors did not improve the predictive ability of the model.
Affiliation(s)
- S C Basak
- University of Minnesota Duluth, Natural Resources Research Institute, Center for Water and the Environment, Duluth, MN 55811, USA.
|
25
|
Boik JC, Newman RA. Structure-activity models of oral clearance, cytotoxicity, and LD50: a screen for promising anticancer compounds. BMC Pharmacol 2008; 8:12. [PMID: 18554402 PMCID: PMC2442056 DOI: 10.1186/1471-2210-8-12] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 07/12/2007] [Accepted: 06/13/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Quantitative structure-activity relationship (QSAR) models have become popular tools to help identify promising lead compounds in anticancer drug development. Few QSAR studies have investigated multitask learning, however. Multitask learning is an approach that allows distinct but related data sets to be used in training. In this paper, a suite of three QSAR models is developed to identify compounds that are likely to (a) exhibit cytotoxic behavior against cancer cells, (b) exhibit high rat LD50 values (low systemic toxicity), and (c) exhibit low to modest human oral clearance (favorable pharmacokinetic characteristics). Models were constructed using Kernel Multitask Latent Analysis (KMLA), an approach that can effectively handle a large number of correlated data features, nonlinear relationships between features and responses, and multitask learning. Multitask learning is particularly useful when the number of available training records is small relative to the number of features, as was the case with the oral clearance data. RESULTS Multitask learning modestly but significantly improved the classification precision for the oral clearance model. For the cytotoxicity model, which was constructed using a large number of records, multitask learning did not affect precision but did reduce computation time. The models developed here were used to predict activities for 115,000 natural compounds. Hundreds of natural compounds, particularly in the anthraquinone and flavonoids groups, were predicted to be cytotoxic, have high LD50 values, and have low to moderate oral clearance. CONCLUSION Multitask learning can be useful in some QSAR models. A suite of QSAR models was constructed and used to screen a large drug library for compounds likely to be cytotoxic to multiple cancer cell lines in vitro, have low systemic toxicity in rats, and have favorable pharmacokinetic properties in humans.
Affiliation(s)
- John C Boik
- Department of Experimental Therapeutics, University of Texas M. D. Anderson Cancer Center, 8000 El Rio, Houston, TX 77054, USA
- Robert A Newman
- Department of Experimental Therapeutics, University of Texas M. D. Anderson Cancer Center, 8000 El Rio, Houston, TX 77054, USA
|
26
|
Basak SC, Mills D, Hawkins DM. Predicting allergic contact dermatitis: a hierarchical structure-activity relationship (SAR) approach to chemical classification using topological and quantum chemical descriptors. J Comput Aided Mol Des 2008; 22:339-43. [PMID: 18338224 DOI: 10.1007/s10822-008-9202-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 09/25/2007] [Accepted: 02/20/2008] [Indexed: 11/27/2022]
Abstract
A hierarchical classification study was carried out on a set of 70 chemicals, 35 of which produce allergic contact dermatitis (ACD) and 35 of which do not. This approach was implemented using a regular ridge regression computer code, followed by conversion of the regression output to binary data values. The hierarchical descriptor classes used in the modeling include topostructural (TS), topochemical (TC), and quantum chemical (QC), all of which are based solely on chemical structure. The concordance, sensitivity, and specificity are reported. The model based on the TC descriptors was found to be the best, while the TS model was extremely poor.
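The conversion of a continuous regression output to binary calls, together with the three statistics named in the abstract, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the 0.5 cutoff, the function name `binary_metrics`, and the example scores are hypothetical, and concordance is taken here to mean overall agreement (accuracy).

```python
def binary_metrics(scores, labels, cutoff=0.5):
    """Threshold continuous regression outputs and score the binary calls.

    scores: predicted values from a regression model fitted to 0/1 labels.
    labels: observed classes (1 = ACD-positive, 0 = ACD-negative).
    cutoff: hypothetical decision threshold; the abstract does not state one.
    """
    calls = [1 if s >= cutoff else 0 for s in scores]
    tp = sum(1 for c, y in zip(calls, labels) if c == 1 and y == 1)
    tn = sum(1 for c, y in zip(calls, labels) if c == 0 and y == 0)
    fp = sum(1 for c, y in zip(calls, labels) if c == 1 and y == 0)
    fn = sum(1 for c, y in zip(calls, labels) if c == 0 and y == 1)
    return {
        "concordance": (tp + tn) / len(labels),  # fraction of correct calls
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Made-up scores for six chemicals: three positives followed by three negatives.
example = binary_metrics([0.9, 0.6, 0.4, 0.2, 0.7, 0.1], [1, 1, 1, 0, 0, 0])
```

With a balanced set such as the 35/35 split described above, concordance is simply the average of sensitivity and specificity.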
Affiliation(s)
- Subhash C Basak
- Natural Resources Research Institute, Center for Water and Environment, University of Minnesota, Duluth, 5013 Miller Trunk Hwy, Duluth, MN, 55811, USA.
|
27
|
Ceroni A, Costa F, Frasconi P. Classification of small molecules by two- and three-dimensional decomposition kernels. Bioinformatics 2007; 23:2038-45. [PMID: 17550912 DOI: 10.1093/bioinformatics/btm298] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.4] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: Several kernel-based methods have been recently introduced for the classification of small molecules. Most available kernels on molecules are based on 2D representations obtained from chemical structures, but far less work has focused so far on the definition of effective kernels that can also exploit 3D information. RESULTS: We introduce new ideas for building kernels on small molecules that can effectively use and combine 2D and 3D information. We tested these kernels in conjunction with support vector machines for binary classification on the 60 NCI cancer screening datasets as well as on the NCI HIV data set. Our results show that 3D information leveraged by these kernels can consistently improve prediction accuracy in all datasets. AVAILABILITY: An implementation of the small molecule classifier is available from http://www.dsi.unifi.it/neural/src/3DDK.
Affiliation(s)
- Alessio Ceroni
- Machine Learning and Neural Networks Group, Dipartimento di Sistemi e Informatica, Università degli Studi di Firenze, Italy
|
28
|
Lagunin AA, Zakharov AV, Filimonov DA, Poroikov VV. A new approach to QSAR modelling of acute toxicity. SAR and QSAR in Environmental Research 2007; 18:285-98. [PMID: 17514571 DOI: 10.1080/10629360701304253] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.1] [Indexed: 05/15/2023]
Abstract
A new QSAR approach based on a Quantitative Neighbourhoods of Atoms description of molecular structures and self-consistent regression was developed. Its prediction accuracy, advantages and limitations were analysed using three sets of published experimental data on acute toxicity: 56 phenylsulfonyl carboxylates for Vibrio fischeri; 65 aromatic compounds for the alga Chlorella vulgaris; and 200 phenols for the ciliated protozoan Tetrahymena pyriformis. According to our findings, the proposed approach provides good correlation and prediction accuracy for the set of 56 phenylsulfonyl carboxylates (r² = 0.908, Q² = 0.866) and for the 65 aromatic compounds tested on C. vulgaris (r² = 0.885, Q² = 0.849). For the 200 phenols tested on T. pyriformis, the prediction accuracy was r² = 0.685 and Q² = 0.651. This is at least as good as the best results obtained with the other QSAR methods originally used on the same data sets.
Affiliation(s)
- A A Lagunin
- Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Moscow, Russia.
|
29
|
Glen R, Adams S. Similarity Metrics and Descriptor Spaces – Which Combinations to Choose? QSAR & Combinatorial Science 2006. [DOI: 10.1002/qsar.200610097] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.2] [Indexed: 11/07/2022]
|
30
|
Ivanciuc T, Ivanciuc O, Klein DJ. Modeling the bioconcentration factors and bioaccumulation factors of polychlorinated biphenyls with posetic quantitative super-structure/activity relationships (QSSAR). Mol Divers 2006; 10:133-45. [PMID: 16710809 DOI: 10.1007/s11030-005-9003-3] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.7] [Received: 09/01/2005] [Accepted: 10/19/2005] [Indexed: 12/01/2022]
Abstract
During bioconcentration, chemical pollutants from water are absorbed by aquatic animals via the skin or a respiratory surface, while the entry routes of chemicals during bioaccumulation are both directly from the environment (skin or a respiratory surface) and indirectly from food. The bioconcentration factor (BCF) and the bioaccumulation factor (BAF) for a particular chemical compound are each defined as the ratio of the concentration of the chemical inside an organism to its concentration in the surrounding environment. Because the experimental determination of BAF and BCF is time-consuming and expensive, it is efficacious to develop models that provide reliable activity predictions for a large number of chemical compounds. Polychlorinated biphenyls (PCBs) released from industrial activities are persistent pollutants of the environment that produce widespread contamination of water and soil. PCBs can bioaccumulate in the food chain, constituting a potential source of exposure for the general population. To predict the bioconcentration and bioaccumulation factors for PCBs we make use of the biphenyl substitution-reaction network for the sequential substitution of H-atoms by Cl-atoms. Each PCB structure then occurs as a node of this reaction network, which is a kind of super-structure that turns out mathematically to be a partially ordered set (poset). Rather than dealing with the molecular structure via ordinary QSAR, we use only this poset, yielding different quantitative super-structure/activity relationships (QSSAR). We then developed cluster expansion and splinoid QSSARs for PCB bioconcentration and bioaccumulation factors. The predictive ability of the BAF and BCF models generated for 20 data sets (representing different conditions and fish species) was evaluated with leave-one-out cross-validation, which shows that the splinoid QSSARs (r between 0.903 and 0.935) perform better than the models computed with the cluster expansion (r between 0.745 and 0.887). The splinoid QSSAR models for BAF and BCF yield predictions for the missing PCBs in the investigated data sets.
Affiliation(s)
- Teodora Ivanciuc
- Department of Marine Sciences, Texas A&M University, Galveston, Texas, 77551, USA.
|
31
|
González-Díaz H, Pérez-Bello A, Uriarte E, González-Díaz Y. QSAR study for mycobacterial promoters with low sequence homology. Bioorg Med Chem Lett 2006; 16:547-53. [PMID: 16275068 DOI: 10.1016/j.bmcl.2005.10.057] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.4] [Received: 06/28/2005] [Revised: 10/13/2005] [Accepted: 10/18/2005] [Indexed: 11/27/2022]
Abstract
The general belief is that quantitative structure-activity relationship (QSAR) techniques work only for small molecules, protein sequences or, more recently, DNA sequences. However, with the non-branched graphs used for proteins and DNA sequences, QSAR models often have to be based on powerful non-linear techniques such as support vector machines. In our opinion, linear QSAR models based on RNA could be useful to assign biological activity when alignment techniques fail due to low sequence homology. The idea is based on the high level of branching of the RNA graph. This work introduces the so-called Markov electrostatic potentials (k)ξ(M) as a new class of RNA 2D-structure descriptors. Subsequently, we validate these molecular descriptors by solving a QSAR classification problem for mycobacterial promoter sequences (mps), which constitute a very low sequence homology problem. The model developed (mps = -4.664·(0)ξ(M) + 0.991·(1)ξ(M) - 2.432) was intended to predict whether a naturally occurring sequence is an mps or not on the basis of the calculated (k)ξ(M) values for the corresponding RNA secondary structure. The RNA-QSAR approach recognises 115/135 mps (85.2%) and 100% of control sequences. Average predictability and robustness were greater than 95%. A previous non-linear model predicts mps with a slightly higher accuracy (97%) but uses a very large parameter space for DNA sequences. Conversely, the (k)ξ(M)-based RNA-QSAR encodes more structural information and needs only two variables.
|
32
|
Gute BD, Basak SC. Optimal neighbor selection in molecular similarity: comparison of arbitrary versus tailored prediction spaces. SAR and QSAR in Environmental Research 2006; 17:37-51. [PMID: 16513551 DOI: 10.1080/10659360600560933] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/06/2023]
Abstract
Three classes of arbitrary quantitative molecular similarity analysis (QMSA) methods have been computed using atom pairs (APs), topological indices (TIs), and principal components (PCs) derived from topological indices. Tailored QMSA models have been developed from TIs selected through ridge regression. K-nearest neighbor (kNN) based estimation has been applied to all of the methods to estimate normal vapor pressure (p(vap)) and water solubility (sol) for a set of 194 chemicals. Results show that the tailored QMSA methods are superior to arbitrary similarity methods in estimating both of these properties for the given set of chemicals.
Affiliation(s)
- B D Gute
- Natural Resources Research Institute, University of Minnesota Duluth, 5013 Miller Trunk Hwy., 55811, USA
|
33
|
Affiliation(s)
- Douglas M Hawkins
- School of Statistics, University of Minnesota, Minneapolis, Minnesota 55455, USA.
|
34
|
Basak SC, Mills D, El-Masri HA, Mumtaz MM, Hawkins DM. Predicting blood:air partition coefficients using theoretical molecular descriptors. Environmental Toxicology and Pharmacology 2004; 16:45-55. [PMID: 21782693 DOI: 10.1016/j.etap.2003.09.002] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Received: 05/01/2003] [Accepted: 09/08/2003] [Indexed: 05/31/2023]
Abstract
Three regression methods, namely ridge regression (RR), partial least squares (PLS), and principal components regression (PCR), were used to develop models for the prediction of rat blood:air partition coefficient for increasingly diverse data sets. Initially, modeling was performed for a set of 13 chlorocarbons. To this set, 10 additional hydrophobic compounds were added, including aromatic and non-aromatic hydrocarbons. A set of 16 hydrophilic compounds was also modeled separately. Finally, all 39 compounds were combined into one data set for which comprehensive models were developed. A large set of diverse, theoretical molecular descriptors was calculated for use in the current study. The topostructural (TS), topochemical (TC), and geometrical or 3-dimensional (3D) indices were used hierarchically in model development. In addition, single-class models were developed using the TS, TC, and 3D descriptors. In most cases, RR outperformed PLS and PCR, and the models developed using TC indices were superior to those developed using other classes of descriptors.
Affiliation(s)
- Subhash C Basak
- Natural Resources Research Institute, University of Minnesota Duluth, 5013 Miller Trunk Highway, Duluth, MN 55811, USA
|
35
|
Vracko M, Mills D, Basak SC. Structure-mutagenicity modelling using counter propagation neural networks. Environmental Toxicology and Pharmacology 2004; 16:25-36. [PMID: 21782691 DOI: 10.1016/j.etap.2003.09.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Received: 05/01/2003] [Accepted: 09/08/2003] [Indexed: 05/31/2023]
Abstract
A set of 95 aromatic amines and their mutagenic potency was treated with a counter propagation neural network, which enables analysis of self-organising maps (SOMs) as well as the prediction of mutagenicity. Compounds were described with four classes of descriptors: topostructural (TS), topochemical (TC), geometrical, and quantum chemical (QC). The models were tested for their predictive ability with the leave-one-out (LOO) cross-validation method. The squared correlation coefficients lie between 0.65 and 0.75 and are comparable with those of models obtained by linear methods. In addition, we analysed the self-organising maps and found clusters of structurally similar compounds.
Affiliation(s)
- Marjan Vracko
- Laboratory for Chemometrics, National Institute of Chemistry, Hajdrihova 19, 1000 Ljubljana, Slovenia
|
36
|
Vraćko M, Szymoszek A, Barbieri P. Structure-Mutagenicity Study of 12 Trimethylimidazopyridine Isomers Using Orbital Energies and “Spectrum-like Representation” As Descriptors. Journal of Chemical Information and Computer Sciences 2004; 44:352-8. [PMID: 15032511 DOI: 10.1021/ci030420i] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/28/2022]
Abstract
The set of 12 trimethylimidazopyridine isomers with mutagenic potency toward two strains of Salmonella was treated in this study. Ten isomers with known mutagenic properties were taken to build the models. Fifteen molecular orbital energies, or a "spectrum-like" representation of 3D structures, were taken as descriptors. As modeling techniques the multiple linear regression and the counter propagation neural network were applied. Models were tested with the recall ability test and the leave-one-out cross-validation tests. For two isomers, which have not been synthesized yet, we report predicted values for both mutagenic potencies obtained with different models. The best models were found when unoccupied molecular orbital energies are among the descriptors.
Affiliation(s)
- M Vraćko
- National Institute of Chemistry, Hajdrihova 19, Ljubljana, Slovenia.
|
37
|
Application of Breiman’s Random Forest to Modeling Structure-Activity Relationships of Pharmaceutical Molecules. Multiple Classifier Systems 2004. [DOI: 10.1007/978-3-540-25966-4_33] [Citation(s) in RCA: 95] [Impact Index Per Article: 4.8] [Indexed: 12/13/2022]
|
38
|
Basak SC, Mills D, Hawkins DM, El-Masri HA. Prediction of human blood:air partition coefficient: a comparison of structure-based and property-based methods. Risk Analysis 2003; 23:1173-1184. [PMID: 14641892 DOI: 10.1111/j.0272-4332.2003.00390.x] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Indexed: 05/24/2023]
Abstract
In recent years, there has been increased interest in the development and use of quantitative structure-activity/property relationship (QSAR/QSPR) models. For the most part, this is due to the fact that experimental data is sparse and obtaining such data is costly, while theoretical structural descriptors can be obtained quickly and inexpensively. In this study, three linear regression methods, viz. principal component regression (PCR), partial least squares (PLS), and ridge regression (RR), were used to develop QSPR models for the estimation of human blood:air partition coefficient (logPblood:air) for a group of 31 diverse low-molecular weight volatile chemicals from their computed molecular descriptors. In general, RR was found to be superior to PCR or PLS. Comparisons were made between models developed using parameters based solely on molecular structure and linear regression (LR) models developed using experimental properties, including saline:air partition coefficient (logPsaline:air) and olive oil:air partition coefficient (logPolive oil:air), as independent variables, indicating that the structure-property correlations are comparable to the property-property correlations. The best models, however, were those that used rat logPblood:air as the independent variable. Haloalkane subgroups were modeled separately for comparative purposes and, although models based on the congeneric compounds were superior, the models developed on the complete set of diverse compounds were of acceptable quality. The structural descriptors were placed into one of three classes based on level of complexity: topostructural (TS), topochemical (TC), or three-dimensional/geometrical (3D). Modeling was performed using the structural descriptor classes both in a hierarchical fashion and separately. The results indicate that highest quality structure-based models, in terms of descriptor classes, were those derived using TC descriptors.
Affiliation(s)
- Subhash C Basak
- Natural Resources Research Institute, University of Minnesota Duluth, MN 55811, USA.
|
39
|
Hawkins DM, Basak SC, Mills D. Assessing model fit by cross-validation. Journal of Chemical Information and Computer Sciences 2003; 43:579-86. [PMID: 12653524 DOI: 10.1021/ci025626i] [Citation(s) in RCA: 385] [Impact Index Per Article: 18.3] [Indexed: 11/29/2022]
Abstract
When QSAR models are fitted, it is important to validate any fitted model, that is, to check that it is plausible that its predictions will carry over to fresh data not used in the model fitting exercise. There are two standard ways of doing this: using a separate hold-out test sample, and the computationally much more burdensome leave-one-out cross-validation, in which the entire pool of available compounds is used both to fit the model and to assess its validity. We show by theoretical argument and empirical study of a large QSAR data set that when the available sample size is small (in the dozens or scores rather than the hundreds), holding a portion of it back for testing is wasteful, and that it is much better to use cross-validation, provided that this is done properly.
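Leave-one-out cross-validation as described above can be sketched in a few lines. The example below is a hedged illustration rather than the procedure used in the paper: it assumes a single descriptor and ordinary least squares, the helper names `fit_line` and `loo_q2` and the data are hypothetical, and Q² is computed as 1 - PRESS / SS_total. The abstract does not spell out what "done properly" requires; the sketch only shows the basic requirement that the model is refitted from scratch each time a compound is left out.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single descriptor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def loo_q2(xs, ys):
    """Leave-one-out cross-validated Q2 = 1 - PRESS / SS_total."""
    press = 0.0
    for i in range(len(xs)):
        # Refit without compound i, then predict the left-out compound.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        press += (ys[i] - (a + b * xs[i])) ** 2
    my = sum(ys) / len(ys)
    ss_total = sum((y - my) ** 2 for y in ys)
    return 1.0 - press / ss_total


# Hypothetical descriptor values and activities for five compounds.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
q2_exact = loo_q2(xs, [2.0, 4.0, 6.0, 8.0, 10.0])   # noise-free line
q2_noisy = loo_q2(xs, [2.1, 3.9, 6.2, 7.8, 10.1])   # slightly noisy line
```

Every compound serves once as a test case and n - 1 times as training data, which is why no portion of a small sample has to be set aside.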
Affiliation(s)
- Douglas M Hawkins
- School of Statistics, University of Minnesota, Minneapolis, Minnesota 55455, USA.
|
40
|
Basak SC, Gute BD, Mills D, Hawkins DM. Quantitative molecular similarity methods in the property/toxicity estimation of chemicals: a comparison of arbitrary versus tailored similarity spaces. Journal of Molecular Structure: THEOCHEM 2003. [DOI: 10.1016/s0166-1280(02)00624-3] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.2] [Indexed: 10/27/2022]
|
41
|
Basak SC, Gute BD, Mills D. Quantitative molecular similarity analysis (QMSA) methods for property estimation: a comparison of property-based, arbitrary, and tailored similarity spaces. SAR and QSAR in Environmental Research 2002; 13:727-742. [PMID: 12570049 DOI: 10.1080/1062936021000043463] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 05/24/2023]
Abstract
Three classes of arbitrary quantitative molecular similarity analysis (QMSA) methods have been computed using atom pairs, topological indices (TIs), and physicochemical properties. Tailored QMSA models have been developed using a selected number of TIs chosen by ridge regression. The methods have been applied to the K-nearest neighbor based estimation of log P of two sets of chemicals. Results show that the property-based and tailored QMSA methods are superior to the arbitrary similarity methods in estimating log P of both sets of chemicals.
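K-nearest neighbor property estimation in a similarity space can be illustrated with a short sketch. It rests on assumptions that go beyond the abstract: Euclidean distance as the dissimilarity measure, an unweighted mean over the neighbors, and k = 3 are illustrative choices, and the function name `knn_estimate` and the data are hypothetical. A tailored space in the sense used above would restrict the descriptor vectors to indices selected by ridge regression, which is not shown here.

```python
import math


def knn_estimate(query, descriptors, values, k=3):
    """Estimate a property as the mean over the k most similar chemicals.

    query:       descriptor vector of the chemical to be estimated.
    descriptors: descriptor vectors of chemicals with known property values.
    values:      known property values (for example log P), same order.
    """
    # Rank the reference chemicals by Euclidean distance to the query.
    ranked = sorted(
        range(len(descriptors)),
        key=lambda i: math.dist(query, descriptors[i]),
    )
    nearest = ranked[:k]
    return sum(values[i] for i in nearest) / len(nearest)


# Hypothetical two-descriptor space with four reference chemicals.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
log_p = [1.0, 2.0, 3.0, 10.0]
estimate = knn_estimate((0.1, 0.1), reference, log_p, k=3)
```

The distant fourth chemical does not contribute to the estimate, which is the point of restricting the average to near neighbors.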
Affiliation(s)
- S C Basak
- Natural Resources Research Institute, University of Minnesota at Duluth, 5013 Miller Trunk Hwy., Duluth, MN 55811, USA.
|
42
|
Basak SC, Mills D, Hawkins DM, El-Masri HA. Prediction of tissue-air partition coefficients: a comparison of structure-based and property-based methods. SAR and QSAR in Environmental Research 2002; 13:649-665. [PMID: 12570043 DOI: 10.1080/1062936021000043409] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.3] [Indexed: 05/24/2023]
Abstract
Three linear regression methods were used to develop models for the prediction of rat tissue-air partition coefficient (P). In general, ridge regression (RR) was found to be superior to principal component regression (PCR) and partial least squares regression (PLS). A set of 46 diverse low molecular-weight volatile chemicals was used to model fat-air, liver-air and muscle-air partition coefficients for male Fischer 344 rats. Comparisons were made between models developed using descriptors based solely on molecular structure and those developed using experimental properties, including saline-air and olive oil-air partition coefficients, as independent variables, indicating that the structure-property correlations are comparable to the property-property correlations. Multiple structure-based models were developed utilizing various classes of structural descriptors based on level of complexity, i.e. topostructural (TS), topochemical (TC), 3-dimensional (3D) and calculated octanol-water partition coefficient. In most cases, the structure-based models developed using only the TC descriptors were found to be superior to those developed using other structural descriptor classes. Haloalkane subgroups were modeled separately for comparative purposes, and although models based on the congeneric compounds were superior, the models developed on the complete sets of diverse compounds were acceptable. Comparisons were also made with respect to the types of descriptors important for partitioning across the various media.
Affiliation(s)
- S C Basak
- Natural Resources Research Institute, University of Minnesota Duluth, 5013 Miller Trunk Highway, Duluth, MN 55811, USA.
|
43
|
|