1
Ruiz IL, Gómez-Nieto MÁ. Prototype Selection Method Based on the Rivality and Reliability Indexes for the Improvement of the Classification Models and External Predictions. J Chem Inf Model 2020; 60:3009-3021. [PMID: 32337999 DOI: 10.1021/acs.jcim.0c00176] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/30/2022]
Abstract
Prototype (or instance) selection is an important field of research in knowledge discovery, data mining, and machine learning. In QSAR, applying prototype selection in the preprocessing stage of model construction favors data set curation, improving the interpretability and accuracy of the models as well as the performance of the algorithms. In this paper, we propose an efficient prototype selection method for the preprocessing stage of the construction of QSAR classification models. The proposed method achieves very high reduction rates in the cardinality of the training set while maintaining or even increasing the accuracy of the classification models. The method has been validated through the prediction of external molecules, demonstrating that predictive performance on new molecules is also maintained or even improved. The method has been tested on 40 benchmark data sets of different sizes and balancing ratios; the results demonstrate its wide applicability domain.
Affiliation(s)
- Irene Luque Ruiz, Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
- Miguel Ángel Gómez-Nieto, Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
2
Sheridan RP, Karnachi P, Tudor M, Xu Y, Liaw A, Shah F, Cheng AC, Joshi E, Glick M, Alvarez J. Experimental Error, Kurtosis, Activity Cliffs, and Methodology: What Limits the Predictivity of Quantitative Structure-Activity Relationship Models? J Chem Inf Model 2020; 60:1969-1982. [PMID: 32207612 DOI: 10.1021/acs.jcim.9b01067] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Indexed: 01/08/2023]
Abstract
Given a particular descriptor/method combination, some quantitative structure-activity relationship (QSAR) datasets are very predictive by random-split cross-validation while others are not. Recent literature in modelability suggests that the limiting issue for predictivity is in the data, not the QSAR methodology, and the limits are due to activity cliffs. Here, we investigate, on in-house data, the relative usefulness of experimental error, distribution of the activities, and activity cliff metrics in determining how predictive a dataset is likely to be. We include unmodified in-house datasets, datasets that should be perfectly predictive based only on the chemical structure, datasets where the distribution of activities is manipulated, and datasets that include a known amount of added noise. We find that activity cliff metrics determine predictivity better than the other metrics we investigated, whatever the type of dataset, consistent with the modelability literature. However, such metrics cannot distinguish real activity cliffs due to large uncertainties in the activities. We also show that a number of modern QSAR methods, and some alternative descriptors, are equally bad at predicting the activities of compounds on activity cliffs, consistent with the assumptions behind "modelability." Finally, we relate time-split predictivity with random-split predictivity and show that different coverages of chemical space are at least as important as uncertainty in activity and/or activity cliffs in limiting predictivity.
Affiliation(s)
- Robert P Sheridan, Computational and Structural Chemistry, Merck & Company Inc., Kenilworth, New Jersey 07033, United States
- Prabha Karnachi, Computational and Structural Chemistry, Merck & Company Inc., Kenilworth, New Jersey 07033, United States
- Matthew Tudor, Computational and Structural Chemistry, Merck & Company Inc., West Point, Pennsylvania 19486, United States
- Yuting Xu, Biometrics Research, Merck & Company Inc., Rahway, New Jersey 07065, United States
- Andy Liaw, Biometrics Research, Merck & Company Inc., Rahway, New Jersey 07065, United States
- Falgun Shah, Computational and Structural Chemistry, Merck & Company Inc., West Point, Pennsylvania 19486, United States
- Alan C Cheng, Computational and Structural Chemistry, Merck & Company Inc., South San Francisco, California 94080, United States
- Elizabeth Joshi, Pharmacokinetics, Pharmacodynamics & Drug Metabolism, Merck & Company Inc., West Point, Pennsylvania 19486, United States
- Meir Glick, Computational and Structural Chemistry, Merck & Company Inc., Boston, Massachusetts 02115, United States
- Juan Alvarez, Computational and Structural Chemistry, Merck & Company Inc., Boston, Massachusetts 02115, United States
3
Matsuzaka Y, Hosaka T, Ogaito A, Yoshinari K, Uesawa Y. Prediction Model of Aryl Hydrocarbon Receptor Activation by a Novel QSAR Approach, DeepSnap-Deep Learning. Molecules 2020; 25:1317. [PMID: 32183141 PMCID: PMC7144728 DOI: 10.3390/molecules25061317] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/17/2020] [Revised: 03/05/2020] [Accepted: 03/09/2020] [Indexed: 12/31/2022]
Abstract
The aryl hydrocarbon receptor (AhR) is a ligand-dependent transcription factor that senses environmental exogenous and endogenous ligands or xenobiotic chemicals. In particular, exposure of the liver to environmental metabolism-disrupting chemicals contributes to the development and propagation of steatosis and hepatotoxicity. However, the mechanisms for AhR-induced hepatotoxicity and tumor propagation in the liver remain to be revealed, due to the wide variety of AhR ligands. Recently, quantitative structure–activity relationship (QSAR) analysis using deep neural network (DNN) has shown superior performance for the prediction of chemical compounds. Therefore, this study proposes a novel QSAR analysis using deep learning (DL), called the DeepSnap–DL method, to construct prediction models of chemical activation of AhR. Compared with conventional machine learning (ML) techniques, such as the random forest, XGBoost, LightGBM, and CatBoost, the proposed method achieves high-performance prediction of AhR activation. Thus, the DeepSnap–DL method may be considered a useful tool for achieving high-throughput in silico evaluation of AhR-induced hepatotoxicity.
Affiliation(s)
- Yasunari Matsuzaka, Department of Medical Molecular Informatics, Meiji Pharmaceutical University, 204-8588 Tokyo, Japan
- Takuomi Hosaka, Laboratory of Molecular Toxicology, School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka 422-8529, Japan
- Anna Ogaito, Laboratory of Molecular Toxicology, School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka 422-8529, Japan
- Kouichi Yoshinari, Laboratory of Molecular Toxicology, School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka 422-8529, Japan
- Yoshihiro Uesawa, Department of Medical Molecular Informatics, Meiji Pharmaceutical University, 204-8588 Tokyo, Japan
4
Ruiz IL, Gómez-Nieto MÁ. Building Highly Reliable Quantitative Structure–Activity Relationship Classification Models Using the Rivality Index Neighborhood Algorithm with Feature Selection. J Chem Inf Model 2020; 60:133-151. [DOI: 10.1021/acs.jcim.9b00706] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/23/2022]
Affiliation(s)
- Irene Luque Ruiz, Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
- Miguel Ángel Gómez-Nieto, Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
5
Luque Ruiz I, Gómez-Nieto MÁ. Rivality index neighbourhood algorithm with density and distances weighted schemes for the building of robust QSAR classification models with high reliable applicability domain. SAR and QSAR in Environmental Research 2019; 30:587-615. [PMID: 31469296 DOI: 10.1080/1062936x.2019.1644666] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/15/2019] [Accepted: 07/14/2019] [Indexed: 06/10/2023]
Abstract
The rivality index (RI) is a normalized distance measurement between a molecule and its first nearest neighbours, providing a robust prediction of the activity of a molecule based on the known activity of those neighbours. Negative values of the RI describe molecules that would be correctly classified by a statistical algorithm; conversely, positive values describe molecules detected as outliers by the classification algorithms. In this paper, we describe a classification algorithm based on the RI and propose four weighting schemes (kernels) for its calculation, each based on measuring different characteristics of the neighbourhood of every molecule in the dataset at established values of the neighbour threshold. The results demonstrate that the proposed RI-based classification algorithm generates more reliable and robust classification models than many of the most widely used machine learning algorithms. These results have been validated and corroborated using 20 balanced and unbalanced benchmark datasets of different sizes and modelability. The classification models generated provide valuable information about the molecules of the dataset, the applicability domain of the models, and the reliability of the predictions.
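The sign convention described in this abstract can be illustrated with a minimal sketch. It is not the authors' published formulation: it assumes a simplified form RI = (d_same - d_other) / (d_same + d_other), where d_same is the distance from a molecule to its nearest neighbour of the same class and d_other is the distance to its nearest neighbour of a different class. The function name `rivality_index`, the Euclidean metric, and the toy one-dimensional descriptors are all hypothetical choices made for illustration.

```python
import numpy as np


def rivality_index(X, y):
    """Hypothetical rivality-like index in [-1, 1] for each sample.

    Assumed form (an illustration, not taken from the paper):
        RI_i = (d_same - d_other) / (d_same + d_other)
    Assumes at least two classes, at least two samples per class,
    and no duplicate points (otherwise the ratio is undefined).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour

    same = y[:, None] == y[None, :]
    # Nearest neighbour with the same label and with a different label.
    d_same = np.where(same, dist, np.inf).min(axis=1)
    d_other = np.where(~same, dist, np.inf).min(axis=1)
    return (d_same - d_other) / (d_same + d_other)


# Toy data: the last sample is labelled 1 but sits next to class 0.
X = [[0.0], [0.5], [4.0], [4.5], [1.5]]
y = [0, 0, 1, 1, 1]
ri = rivality_index(X, y)
print(ri)
```

Under this assumed definition, the first four samples receive negative values because their nearest same-class neighbour is closer than any rival, while the last sample receives a positive value and would be flagged as a likely outlier.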
Affiliation(s)
- I Luque Ruiz, Department of Computing and Numerical Analysis, Campus de Rabanales, University of Córdoba, Córdoba, Spain
- M Á Gómez-Nieto, Department of Computing and Numerical Analysis, Campus de Rabanales, University of Córdoba, Córdoba, Spain