1
|
Daghighi A, Casanola-Martin GM, Iduoku K, Kusic H, González-Díaz H, Rasulev B. Multi-Endpoint Acute Toxicity Assessment of Organic Compounds Using Large-Scale Machine Learning Modeling. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2024; 58:10116-10127. [PMID: 38797941 DOI: 10.1021/acs.est.4c01017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2024]
Abstract
In recent years, alternative animal testing methods such as computational and machine learning approaches have become increasingly crucial for toxicity testing. However, the complexity and scarcity of available biomedical data challenge the development of predictive models. Combining nonlinear machine learning together with multicondition descriptors offers a solution for using data from various assays to create a robust model. This work applies multicondition descriptors (MCDs) to develop a QSTR (Quantitative Structure-Toxicity Relationship) model based on a large toxicity data set comprising more than 80,000 compounds and 59 different end points (122,572 data points). The prediction capabilities of developed single-task multi-end point machine learning models as well as a novel data analysis approach with the use of Convolutional Neural Networks (CNN) are discussed. The results show that using MCDs significantly improves the model and using them with CNN-1D yields the best result (R2train = 0.93, R2ext = 0.70). Several structural features showed a high level of contribution to the toxicity, including van der Waals surface area (VSA), number of nitrogen-containing fragments (nN+), presence of S-P fragments, ionization potential, and presence of C-N fragments. The developed models can be very useful tools to predict the toxicity of various compounds under different conditions, enabling quick toxicity assessment of new compounds.
Collapse
Affiliation(s)
- Amirreza Daghighi
- Department of Coatings and Polymeric Materials, North Dakota State University, Fargo, North Dakota 58102, United States
- Biomedical Engineering Program, North Dakota State University, Fargo, North Dakota 58102, United States
| | - Gerardo M Casanola-Martin
- Department of Coatings and Polymeric Materials, North Dakota State University, Fargo, North Dakota 58102, United States
| | - Kweeni Iduoku
- Department of Coatings and Polymeric Materials, North Dakota State University, Fargo, North Dakota 58102, United States
- Biomedical Engineering Program, North Dakota State University, Fargo, North Dakota 58102, United States
| | - Hrvoje Kusic
- Faculty of Chemical Engineering and Technology, University of Zagreb, Marulicev Trg 19, Zagreb 10000, Croatia
| | - Humberto González-Díaz
- Department of Organic and Inorganic Chemistry, University of Basque Country UPV/EHU, Leioa 48940, Spain
- BIOFISIKA, Basque Center for Biophysics CSIC-UPVEH, Leioa 48940, Spain
- IKERBASQUE, Basque Foundation for Science,Bilbao, Biscay 48011, Spain
| | - Bakhtiyor Rasulev
- Department of Coatings and Polymeric Materials, North Dakota State University, Fargo, North Dakota 58102, United States
- Biomedical Engineering Program, North Dakota State University, Fargo, North Dakota 58102, United States
| |
Collapse
|
2
|
Hatanaka M, Kato H, Sakai M, Kariya K, Nakatani S, Yoshimura T, Inagaki T. Insights into the Luminescence Quantum Yields of Cyclometalated Iridium(III) Complexes: A Density Functional Theory and Machine Learning Approach. J Phys Chem A 2023; 127:7630-7637. [PMID: 37651718 DOI: 10.1021/acs.jpca.3c02179] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/02/2023]
Abstract
Cyclometalated iridium(III) complexes have been used in various optical materials, including organic light-emitting diodes (OLEDs) and photocatalysts, and a deeper understanding and prediction of their luminescence quantum yields (LQYs) greatly aid in accelerating material design. In this study, we integrated density functional theory (DFT) calculations with machine learning (ML) techniques to extract factors controlling LQY. Although a substantial data set of Ir(III) complexes and their LQYs is indispensable for constructing accurate ML models to predict LQYs, generating this type of data set is challenging due to the complexities associated with ab initio calculations of LQYs. To address this issue, we investigated the nonradiative decay process of nine Ir(III) complexes emitting blue to green, each exhibiting varying experimental LQYs, by using DFT calculations. For all nine complexes, the quenching process was induced by the rotation of the single bond in one of the ligands, which converted the six-coordinate structure to the five-coordinate structure. Since the decay mechanism was common for the nine Ir(III) complexes, parameters correlated with LQYs could be used as objective variables instead of LQYs. Based on this idea, we collected a data set featuring Ir(III) complexes and the energy differences between their six- and five-coordinate triplet structures, which correlated with LQYs. We also constructed ML models using the calculated LQYs as the objective variables with the parameters from the ground-state calculations as explanatory variables. The analyses of the constructed model revealed that the LUMO energy of the ligand made the most significant negative contribution to LQY. This suggests that the potential energy surface of the metal-to-ligand charge transfer (MLCT) excited state, which stabilizes the six-coordinate structure, is reduced by decreasing the energy of the unoccupied orbitals.
Collapse
Affiliation(s)
- Miho Hatanaka
- Department of Chemistry, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan
| | - Hiromoto Kato
- Department of Chemistry, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan
| | - Minami Sakai
- Division of Materials Science, Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikokma, Nara 630-0192, Japan
| | - Kosuke Kariya
- Department of Chemistry, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan
| | - Shunsuke Nakatani
- Department of Chemistry, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan
| | - Takayoshi Yoshimura
- Department of Chemistry, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan
| | - Taichi Inagaki
- Department of Chemistry, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan
| |
Collapse
|
3
|
Chung E, Russo DP, Ciallella HL, Wang YT, Wu M, Aleksunes LM, Zhu H. Data-Driven Quantitative Structure-Activity Relationship Modeling for Human Carcinogenicity by Chronic Oral Exposure. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2023; 57:6573-6588. [PMID: 37040559 PMCID: PMC10134506 DOI: 10.1021/acs.est.3c00648] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/24/2023] [Revised: 03/28/2023] [Accepted: 03/29/2023] [Indexed: 06/19/2023]
Abstract
Traditional methodologies for assessing chemical toxicity are expensive and time-consuming. Computational modeling approaches have emerged as low-cost alternatives, especially those used to develop quantitative structure-activity relationship (QSAR) models. However, conventional QSAR models have limited training data, leading to low predictivity for new compounds. We developed a data-driven modeling approach for constructing carcinogenicity-related models and used these models to identify potential new human carcinogens. To this goal, we used a probe carcinogen dataset from the US Environmental Protection Agency's Integrated Risk Information System (IRIS) to identify relevant PubChem bioassays. Responses of 25 PubChem assays were significantly relevant to carcinogenicity. Eight assays inferred carcinogenicity predictivity and were selected for QSAR model training. Using 5 machine learning algorithms and 3 types of chemical fingerprints, 15 QSAR models were developed for each PubChem assay dataset. These models showed acceptable predictivity during 5-fold cross-validation (average CCR = 0.71). Using our QSAR models, we can correctly predict and rank 342 IRIS compounds' carcinogenic potentials (PPV = 0.72). The models predicted potential new carcinogens, which were validated by a literature search. This study portends an automated technique that can be applied to prioritize potential toxicants using validated QSAR models based on extensive training sets from public data resources.
Collapse
Affiliation(s)
- Elena Chung
- Department
of Chemistry and Biochemistry, Rowan University, 201 Mullica Hill Road, Glassboro, New Jersey 08028, United States
| | - Daniel P. Russo
- Department
of Chemistry and Biochemistry, Rowan University, 201 Mullica Hill Road, Glassboro, New Jersey 08028, United States
| | - Heather L. Ciallella
- Department
of Toxicology, Cuyahoga County Medical Examiner’s
Office, 11001 Cedar Avenue, Cleveland, Ohio 44106, United States
| | - Yu-Tang Wang
- Institute
of Agro-Products Processing Science and Technology, Chinese Academy of Agricultural Sciences/Key Laboratory of Agro-Products
Processing, Ministry of Agriculture, Beijing 100193, China
| | - Min Wu
- School
of Life Science and Technology, China Pharmaceutical
University, No. 24, Tong Jia Xiang, Nanjing 210009, China
| | - Lauren M. Aleksunes
- Department
of Pharmacology and Toxicology, Rutgers
University, Ernest Mario School of Pharmacy, 170 Frelinghuysen Road, Piscataway, New Jersey 08854, United States
| | - Hao Zhu
- Department
of Chemistry and Biochemistry, Rowan University, 201 Mullica Hill Road, Glassboro, New Jersey 08028, United States
| |
Collapse
|
4
|
Jain S, Siramshetty VB, Alves VM, Muratov EN, Kleinstreuer N, Tropsha A, Nicklaus MC, Simeonov A, Zakharov AV. Large-Scale Modeling of Multispecies Acute Toxicity End Points Using Consensus of Multitask Deep Learning Methods. J Chem Inf Model 2021; 61:653-663. [PMID: 33533614 DOI: 10.1021/acs.jcim.0c01164] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Abstract
Computational methods to predict molecular properties regarding safety and toxicology represent alternative approaches to expedite drug development, screen environmental chemicals, and thus significantly reduce associated time and costs. There is a strong need and interest in the development of computational methods that yield reliable predictions of toxicity, and many approaches, including the recently introduced deep neural networks, have been leveraged towards this goal. Herein, we report on the collection, curation, and integration of data from the public data sets that were the source of the ChemIDplus database for systemic acute toxicity. These efforts generated the largest publicly available such data set comprising > 80,000 compounds measured against a total of 59 acute systemic toxicity end points. This data was used for developing multiple single- and multitask models utilizing random forest, deep neural networks, convolutional, and graph convolutional neural network approaches. For the first time, we also reported the consensus models based on different multitask approaches. To the best of our knowledge, prediction models for 36 of the 59 end points have never been published before. Furthermore, our results demonstrated a significantly better performance of the consensus model obtained from three multitask learning approaches that particularly predicted the 29 smaller tasks (less than 300 compounds) better than other models developed in the study. The curated data set and the developed models have been made publicly available at https://github.com/ncats/ld50-multitask, https://predictor.ncats.io/, and https://cactus.nci.nih.gov/download/acute-toxicity-db (data set only) to support regulatory and research applications.
Collapse
Affiliation(s)
- Sankalp Jain
- National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
| | - Vishal B Siramshetty
- National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
| | - Vinicius M Alves
- UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| | - Eugene N Muratov
- UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| | - Nicole Kleinstreuer
- Division of Intramural Research, Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, 111 T.W. Alexander Drive, Durham, North Carolina 27709, United States.,National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, 111 T.W. Alexander Drive, Durham, North Carolina 27709, United States
| | - Alexander Tropsha
- UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| | - Marc C Nicklaus
- Computer-Aided Drug Design (CADD) Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, DHHS, NCI-Frederick, 376 Boyles Street, Frederick, Maryland 21702, United States
| | - Anton Simeonov
- National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
| | - Alexey V Zakharov
- National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
| |
Collapse
|
5
|
Wang MWH, Goodman JM, Allen TEH. Machine Learning in Predictive Toxicology: Recent Applications and Future Directions for Classification Models. Chem Res Toxicol 2020; 34:217-239. [PMID: 33356168 DOI: 10.1021/acs.chemrestox.0c00316] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
In recent times, machine learning has become increasingly prominent in predictive toxicology as it has shifted from in vivo studies toward in silico studies. Currently, in vitro methods together with other computational methods such as quantitative structure-activity relationship modeling and absorption, distribution, metabolism, and excretion calculations are being used. An overview of machine learning and its applications in predictive toxicology is presented here, including support vector machines (SVMs), random forest (RF) and decision trees (DTs), neural networks, regression models, naïve Bayes, k-nearest neighbors, and ensemble learning. The recent successes of these machine learning methods in predictive toxicology are summarized, and a comparison of some models used in predictive toxicology is presented. In predictive toxicology, SVMs, RF, and DTs are the dominant machine learning methods due to the characteristics of the data available. Lastly, this review describes the current challenges facing the use of machine learning in predictive toxicology and offers insights into the possible areas of improvement in the field.
Collapse
Affiliation(s)
- Marcus W H Wang
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
| | - Jonathan M Goodman
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
| | - Timothy E H Allen
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom.,MRC Toxicology Unit, University of Cambridge, Hodgkin Building, Lancaster Road, Leicester LE1 7HB, United Kingdom
| |
Collapse
|
6
|
Antelo-Collado A, Carrasco-Velar R, García-Pedrajas N, Cerruela-García G. Effective Feature Selection Method for Class-Imbalance Datasets Applied to Chemical Toxicity Prediction. J Chem Inf Model 2020; 61:76-94. [PMID: 33350301 DOI: 10.1021/acs.jcim.0c00908] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
During the drug development process, it is common to carry out toxicity tests and adverse effect studies, which are essential to guarantee patient safety and the success of the research. The use of in silico quantitative structure-activity relationship (QSAR) approaches for this task involves processing a huge amount of data that, in many cases, have an imbalanced distribution of active and inactive samples. This is usually termed the class-imbalance problem and may have a significant negative effect on the performance of the learned models. The performance of feature selection (FS) for QSAR models is usually damaged by the class-imbalance nature of the involved datasets. This paper proposes the use of an FS method focused on dealing with the class-imbalance problems. The method is based on the use of FS ensembles constructed by boosting and using two well-known FS methods, fast clustering-based FS and the fast correlation-based filter. The experimental results demonstrate the efficiency of the proposal in terms of the classification performance compared to standard methods. The proposal can be extended to other FS methods and applied to other problems in cheminformatics.
Collapse
Affiliation(s)
| | | | - Nicolás García-Pedrajas
- Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
| | - Gonzalo Cerruela-García
- Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
| |
Collapse
|
7
|
Idakwo G, Thangapandian S, Luttrell J, Li Y, Wang N, Zhou Z, Hong H, Yang B, Zhang C, Gong P. Structure-activity relationship-based chemical classification of highly imbalanced Tox21 datasets. J Cheminform 2020; 12:66. [PMID: 33372637 PMCID: PMC7592558 DOI: 10.1186/s13321-020-00468-x] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2019] [Accepted: 10/13/2020] [Indexed: 12/14/2022] Open
Abstract
The specificity of toxicant-target biomolecule interactions lends to the very imbalanced nature of many toxicity datasets, causing poor performance in Structure–Activity Relationship (SAR)-based chemical classification. Undersampling and oversampling are representative techniques for handling such an imbalance challenge. However, removing inactive chemical compound instances from the majority class using an undersampling technique can result in information loss, whereas increasing active toxicant instances in the minority class by interpolation tends to introduce artificial minority instances that often cross into the majority class space, giving rise to class overlapping and a higher false prediction rate. In this study, in order to improve the prediction accuracy of imbalanced learning, we employed SMOTEENN, a combination of Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, to oversample the minority class by creating synthetic samples, followed by cleaning the mislabeled instances. We chose the highly imbalanced Tox21 dataset, which consisted of 12 in vitro bioassays for > 10,000 chemicals that were distributed unevenly between binary classes. With Random Forest (RF) as the base classifier and bagging as the ensemble strategy, we applied four hybrid learning methods, i.e., RF without imbalance handling (RF), RF with Random Undersampling (RUS), RF with SMOTE (SMO), and RF with SMOTEENN (SMN). The performance of the four learning methods was compared using nine evaluation metrics, among which F1 score, Matthews correlation coefficient and Brier score provided a more consistent assessment of the overall performance across the 12 datasets. The Friedman’s aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that SMN significantly outperformed the other three methods. We also found that a strong negative correlation existed between the prediction accuracy and the imbalance ratio (IR), which is defined as the number of inactive compounds divided by the number of active compounds. SMN became less effective when IR exceeded a certain threshold (e.g., > 28). The ability to separate the few active compounds from the vast amounts of inactive ones is of great importance in computational toxicology. This work demonstrates that the performance of SAR-based, imbalanced chemical toxicity classification can be significantly improved through the use of data rebalancing.
Collapse
Affiliation(s)
- Gabriel Idakwo
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
| | - Sundar Thangapandian
- Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA
| | - Joseph Luttrell
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
| | - Yan Li
- Bennett Aerospace Inc, Cary, NC, 27518, USA
| | - Nan Wang
- Department of Computer Science, New Jersey City University, Jersey City, NJ, 07305, USA
| | - Zhaoxian Zhou
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
| | - Huixiao Hong
- Division of Bioinformatics and Biostatistics, National Centre for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
| | - Bei Yang
- School of Information & Engineering, Zhengzhou University, Zhengzhou, 450000, China
| | - Chaoyang Zhang
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, 39406, USA.
| | - Ping Gong
- Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.
| |
Collapse
|
8
|
Idakwo G, Luttrell J, Chen M, Hong H, Zhou Z, Gong P, Zhang C. A review on machine learning methods for in silico toxicity prediction. JOURNAL OF ENVIRONMENTAL SCIENCE AND HEALTH. PART C, ENVIRONMENTAL CARCINOGENESIS & ECOTOXICOLOGY REVIEWS 2019; 36:169-191. [PMID: 30628866 DOI: 10.1080/10590501.2018.1537118] [Citation(s) in RCA: 66] [Impact Index Per Article: 13.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
In silico toxicity prediction plays an important role in the regulatory decision making and selection of leads in drug design as in vitro/vivo methods are often limited by ethics, time, budget, and other resources. Many computational methods have been employed in predicting the toxicity profile of chemicals. This review provides a detailed end-to-end overview of the application of machine learning algorithms to Structure-Activity Relationship (SAR)-based predictive toxicology. From raw data to model validation, the importance of data quality is stressed as it greatly affects the predictive power of derived models. Commonly overlooked challenges such as data imbalance, activity cliff, model evaluation, and definition of applicability domain are highlighted, and plausible solutions for alleviating these challenges are discussed.
Collapse
Affiliation(s)
- Gabriel Idakwo
- a School of Computing Sciences and Computer Engineering , University of Southern Mississippi , Hattiesburg , Mississippi , USA
| | - Joseph Luttrell
- a School of Computing Sciences and Computer Engineering , University of Southern Mississippi , Hattiesburg , Mississippi , USA
| | - Minjun Chen
- b Division of Bioinformatics and Biostatistics, National Center for Toxicological Science , US Food and Drug Administration , Jefferson , Arkansas , USA
| | - Huixiao Hong
- b Division of Bioinformatics and Biostatistics, National Center for Toxicological Science , US Food and Drug Administration , Jefferson , Arkansas , USA
| | - Zhaoxian Zhou
- a School of Computing Sciences and Computer Engineering , University of Southern Mississippi , Hattiesburg , Mississippi , USA
| | - Ping Gong
- c Environmental Laboratory , US Army Engineer Research and Development Center , Vicksburg , Mississippi , USA
| | - Chaoyang Zhang
- a School of Computing Sciences and Computer Engineering , University of Southern Mississippi , Hattiesburg , Mississippi , USA
| |
Collapse
|
9
|
Jain S, Kotsampasakou E, Ecker GF. Comparing the performance of meta-classifiers-a case study on selected imbalanced data sets relevant for prediction of liver toxicity. J Comput Aided Mol Des 2018; 32:583-590. [PMID: 29626291 PMCID: PMC5919997 DOI: 10.1007/s10822-018-0116-z] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2017] [Accepted: 03/29/2018] [Indexed: 12/28/2022]
Abstract
Abstract Cheminformatics datasets used in classification problems, especially those related to biological or physicochemical properties, are often imbalanced. This presents a major challenge in development of in silico prediction models, as the traditional machine learning algorithms are known to work best on balanced datasets. The class imbalance introduces a bias in the performance of these algorithms due to their preference towards the majority class. Here, we present a comparison of the performance of seven different meta-classifiers for their ability to handle imbalanced datasets, whereby Random Forest is used as base-classifier. Four different datasets that are directly (cholestasis) or indirectly (via inhibition of organic anion transporting polypeptide 1B1 and 1B3) related to liver toxicity were chosen for this purpose. The imbalance ratio in these datasets ranges between 4:1 and 20:1 for negative and positive classes, respectively. Three different sets of molecular descriptors for model development were used, and their performance was assessed in 10-fold cross-validation and on an independent validation set. Stratified bagging, MetaCost and CostSensitiveClassifier were found to be the best performing among all the methods. While MetaCost and CostSensitiveClassifier provided better sensitivity values, Stratified Bagging resulted in high balanced accuracies. Graphical Abstract ![]()
Electronic supplementary material The online version of this article (10.1007/s10822-018-0116-z) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Sankalp Jain
- Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, 1090, Vienna, Austria
| | - Eleni Kotsampasakou
- Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, 1090, Vienna, Austria.,Computational Toxicology Group, CMS, R&D Platform Technology & Science, GSK, Park Road, Ware, Hertfordshire, SG12 0DP, UK
| | - Gerhard F Ecker
- Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, 1090, Vienna, Austria.
| |
Collapse
|
10
|
Abbasitabar F, Zare-Shahabadi V. In silico prediction of toxicity of phenols to Tetrahymena pyriformis by using genetic algorithm and decision tree-based modeling approach. CHEMOSPHERE 2017; 172:249-259. [PMID: 28081509 DOI: 10.1016/j.chemosphere.2016.12.095] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/03/2016] [Revised: 11/29/2016] [Accepted: 12/19/2016] [Indexed: 05/27/2023]
Abstract
Risk assessment of chemicals is an important issue in environmental protection; however, there is a huge lack of experimental data for a large number of end-points. The experimental determination of toxicity of chemicals involves high costs and time-consuming process. In silico tools such as quantitative structure-toxicity relationship (QSTR) models, which are constructed on the basis of computational molecular descriptors, can predict missing data for toxic end-points for existing or even not yet synthesized chemicals. Phenol derivatives are known to be aquatic pollutants. With this background, we aimed to develop an accurate and reliable QSTR model for the prediction of toxicity of 206 phenols to Tetrahymena pyriformis. A multiple linear regression (MLR)-based QSTR was obtained using a powerful descriptor selection tool named Memorized_ACO algorithm. Statistical parameters of the model were 0.72 and 0.68 for Rtraining2 and Rtest2, respectively. To develop a high-quality QSTR model, classification and regression tree (CART) was employed. Two approaches were considered: (1) phenols were classified into different modes of action using CART and (2) the phenols in the training set were partitioned to several subsets by a tree in such a manner that in each subset, a high-quality MLR could be developed. For the first approach, the statistical parameters of the resultant QSTR model were improved to 0.83 and 0.75 for Rtraining2 and Rtest2, respectively. Genetic algorithm was employed in the second approach to obtain an optimal tree, and it was shown that the final QSTR model provided excellent prediction accuracy for the training and test sets (Rtraining2 and Rtest2 were 0.91 and 0.93, respectively). The mean absolute error for the test set was computed as 0.1615.
Collapse
Affiliation(s)
- Fatemeh Abbasitabar
- Department of Chemistry, Marvdasht Branch, Islamic Azad University, Marvdasht, Iran.
| | - Vahid Zare-Shahabadi
- Department of Chemistry, Mahshahr Branch, Islamic Azad University, Mahshahr, Iran
| |
Collapse
|
11
|
Ma L, Fan S. CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests. BMC Bioinformatics 2017; 18:169. [PMID: 28292263 PMCID: PMC5351181 DOI: 10.1186/s12859-017-1578-z] [Citation(s) in RCA: 59] [Impact Index Per Article: 8.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2016] [Accepted: 03/03/2017] [Indexed: 01/04/2023] Open
Abstract
Background The random forests algorithm is a type of classifier with prominent universality, a wide application range, and robustness for avoiding overfitting. But there are still some drawbacks to random forests. Therefore, to improve the performance of random forests, this paper seeks to improve imbalanced data processing, feature selection and parameter optimization. Results We propose the CURE-SMOTE algorithm for the imbalanced data classification problem. Experiments on imbalanced UCI data reveal that the combination of Clustering Using Representatives (CURE) enhances the original synthetic minority oversampling technique (SMOTE) algorithms effectively compared with the classification results on the original data using random sampling, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE, and k-means-SMOTE. Additionally, the hybrid RF (random forests) algorithm has been proposed for feature selection and parameter optimization, which uses the minimum out of bag (OOB) data error as its objective function. Simulation results on binary and higher-dimensional data indicate that the proposed hybrid RF algorithms, hybrid genetic-random forests algorithm, hybrid particle swarm-random forests algorithm and hybrid fish swarm-random forests algorithm can achieve the minimum OOB error and show the best generalization ability. Conclusion The training set produced from the proposed CURE-SMOTE algorithm is closer to the original data distribution because it contains minimal noise. Thus, better classification results are produced from this feasible and effective algorithm. Moreover, the hybrid algorithm's F-value, G-mean, AUC and OOB scores demonstrate that they surpass the performance of the original RF algorithm. Hence, this hybrid algorithm provides a new way to perform feature selection and parameter optimization.
Collapse
Affiliation(s)
- Li Ma
- School of Information Science and Technology, Jinan University, Guangzhou, 510632, China
| | - Suohai Fan
- School of Information Science and Technology, Jinan University, Guangzhou, 510632, China.
| |
Collapse
|
12
|
Dieguez-Santana K, Pham-The H, Villegas-Aguilar PJ, Le-Thi-Thu H, Castillo-Garit JA, Casañola-Martin GM. Prediction of acute toxicity of phenol derivatives using multiple linear regression approach for Tetrahymena pyriformis contaminant identification in a median-size database. CHEMOSPHERE 2016; 165:434-441. [PMID: 27668720 DOI: 10.1016/j.chemosphere.2016.09.041] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/26/2016] [Revised: 09/10/2016] [Accepted: 09/12/2016] [Indexed: 06/06/2023]
Abstract
In this article, the modeling of inhibitory grown activity against Tetrahymena pyriformis is described. The 0-2D Dragon descriptors based on structural aspects to gain some knowledge of factors influencing aquatic toxicity are mainly used. Besides, it is done by some enlarged data of phenol derivatives described for the first time and composed of 358 chemicals. It overcomes the previous datasets with about one hundred compounds. Moreover, the results of the model evaluation by the parameters in the training, prediction and validation give adequate results comparable with those of the previous works. The more influential descriptors included in the model are: X3A, MWC02, MWC10 and piPC03 with positive contributions to the dependent variable; and MWC09, piPC02 and TPC with negative contributions. In a next step, a median-size database of nearly 8000 phenolic compounds extracted from ChEMBL was evaluated with the quantitative-structure toxicity relationship (QSTR) model developed providing some clues (SARs) for identification of ecotoxicological compounds. The outcome of this report is very useful to screen chemical databases for finding the compounds responsible of aquatic contamination in the biomarker used in the current work.
Collapse
Affiliation(s)
- Karel Dieguez-Santana
- Universidad Estatal Amazónica, Facultad de Ingeniería Ambiental, Paso Lateral Km 21/2 Via Napo, Puyo, Ecuador.
| | - Hai Pham-The
- Hanoi University of Pharmacy, 13-15 Le Thanh Tong, Hoan Kiem, Hanoi, Viet Nam
| | | | - Huong Le-Thi-Thu
- School of Medicine and Pharmacy, Vietnam National University, Hanoi (VNU) 144 Xuan Thuy, Cau Giay, Hanoi, Viet Nam
| | - Juan A Castillo-Garit
- Unidad de Toxicologia Experimental, Universidad de Ciencias Médicas Dr. Serafin Ruiz de Zárate Ruiz Santa Clara, 50200, Villa Clara, Cuba
| | - Gerardo M Casañola-Martin
- Universidad Estatal Amazónica, Facultad de Ingeniería Ambiental, Paso Lateral Km 21/2 Via Napo, Puyo, Ecuador; Hanoi University of Pharmacy, 13-15 Le Thanh Tong, Hoan Kiem, Hanoi, Viet Nam; Unidad de Investigación de Diseño de Fármacos y Conectividad Molecular, Departamento de Química Física, Facultad de Farmacia, Universitat de València, Spain.
| |
Collapse
|
13
|
Zakharov A, Peach ML, Sitzmann M, Nicklaus MC. QSAR modeling of imbalanced high-throughput screening data in PubChem. J Chem Inf Model 2014; 54:705-12. [PMID: 24524735 PMCID: PMC3985743 DOI: 10.1021/ci400737s] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2013] [Indexed: 01/19/2023]
Abstract
Many of the structures in PubChem are annotated with activities determined in high-throughput screening (HTS) assays. Because of the nature of these assays, the activity data are typically strongly imbalanced, with a small number of active compounds contrasting with a very large number of inactive compounds. We have used several such imbalanced PubChem HTS assays to test and develop strategies to efficiently build robust QSAR models from imbalanced data sets. Different descriptor types [Quantitative Neighborhoods of Atoms (QNA) and "biological" descriptors] were used to generate a variety of QSAR models in the program GUSAR. The models obtained were compared using external test and validation sets. We also report on our efforts to incorporate the most predictive of our models in the publicly available NCI/CADD Group Web services ( http://cactus.nci.nih.gov/chemical/apps/cap).
Collapse
Affiliation(s)
- Alexey
V. Zakharov
- CADD
Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes
of Health, DHHS, NCI-Frederick, 376 Boyles St., Frederick, Maryland 21702, United
States
| | - Megan L. Peach
- Basic
Science Program, Leidos Biomedical, Inc., Computer-Aided Drug Design Group, Chemical Biology Laboratory, Frederick
National Laboratory for Cancer Research, 376 Boyles St., Frederick, Maryland 21702, United States
| | - Markus Sitzmann
- CADD
Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes
of Health, DHHS, NCI-Frederick, 376 Boyles St., Frederick, Maryland 21702, United
States
| | - Marc C. Nicklaus
- CADD
Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes
of Health, DHHS, NCI-Frederick, 376 Boyles St., Frederick, Maryland 21702, United
States
| |
Collapse
|
14
|
Zang Q, Rotroff DM, Judson RS. Binary Classification of a Large Collection of Environmental Chemicals from Estrogen Receptor Assays by Quantitative Structure–Activity Relationship and Machine Learning Methods. J Chem Inf Model 2013; 53:3244-61. [DOI: 10.1021/ci400527b] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
Affiliation(s)
| | - Daniel M. Rotroff
- Bioinformatics
Research Center, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, United States
| | | |
Collapse
|