1. Banerjee A, Kar S, Roy K, Patlewicz G, Charest N, Benfenati E, Cronin MTD. Molecular similarity in chemical informatics and predictive toxicity modeling: from quantitative read-across (q-RA) to quantitative read-across structure-activity relationship (q-RASAR) with the application of machine learning. Crit Rev Toxicol 2024; 54:659-684. [PMID: 39225123 DOI: 10.1080/10408444.2024.2386260] [Received: 06/03/2024] [Revised: 07/25/2024] [Accepted: 07/25/2024] [Indexed: 09/04/2024]
Abstract
This article aims to provide a comprehensive critical, yet readable, review of general interest to the chemistry community on molecular similarity as applied to chemical informatics and predictive modeling with a special focus on read-across (RA) and read-across structure-activity relationships (RASAR). Molecular similarity-based computational tools, such as quantitative structure-activity relationships (QSARs) and RA, are routinely used to fill the data gaps for a wide range of properties including toxicity endpoints for regulatory purposes. This review will explore the background of RA starting from how structural information has been used through to how other similarity contexts such as physicochemical, absorption, distribution, metabolism, and elimination (ADME) properties, and biological aspects are being characterized. More recent developments of RA's integration with QSAR have resulted in the emergence of novel models such as ToxRead, generalized read-across (GenRA), and quantitative RASAR (q-RASAR). Conventional QSAR techniques have been excluded from this review except where necessary for context.
Affiliation(s)
- Arkaprava Banerjee
  - Department of Pharmaceutical Technology, Drug Theoretics and Cheminformatics (DTC) Laboratory, Jadavpur University, Kolkata, India
- Supratik Kar
  - Department of Chemistry and Physics, Chemometrics & Molecular Modeling Laboratory, Kean University, Union, NJ, USA
- Kunal Roy
  - Department of Pharmaceutical Technology, Drug Theoretics and Cheminformatics (DTC) Laboratory, Jadavpur University, Kolkata, India
- Grace Patlewicz
  - Center for Computational Toxicology and Exposure, US Environmental Protection Agency, Research Triangle Park, NC, USA
- Nathaniel Charest
  - Center for Computational Toxicology and Exposure, US Environmental Protection Agency, Research Triangle Park, NC, USA
- Emilio Benfenati
  - Department of Environmental Health Sciences, Istituto di Ricerche Farmacologiche Mario Negri IRCCS, Milan, Italy
- Mark T D Cronin
  - School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool, UK
2. Banerjee A, Roy K. The application of chemical similarity measures in an unconventional modeling framework c-RASAR along with dimensionality reduction techniques to a representative hepatotoxicity dataset. Sci Rep 2024; 14:20812. [PMID: 39242880 PMCID: PMC11379871 DOI: 10.1038/s41598-024-71892-4] [Received: 07/19/2024] [Accepted: 09/02/2024] [Indexed: 09/09/2024]
Abstract
With the exponential progress in the field of cheminformatics, the conventional modeling approaches have so far been to employ supervised and unsupervised machine learning (ML) and deep learning models, utilizing the standard molecular descriptors, which represent the structural, physicochemical, and electronic properties of a particular compound. Deviating from the conventional approach, in this investigation, we have employed the classification Read-Across Structure-Activity Relationship (c-RASAR), which involves the amalgamation of the concepts of classification-based quantitative structure-activity relationship (QSAR) and Read-Across to incorporate Read-Across-derived similarity and error-based descriptors into a statistical and machine learning modeling framework. ML models developed from these RASAR descriptors use similarity-based information from the close source neighbors of a particular query compound. We have employed different classification modeling algorithms on the selected QSAR and RASAR descriptors to develop predictive models for efficient prediction of query compounds' hepatotoxicity. The predictivity of each of these models was evaluated on a large number of test set compounds. The best-performing model was also used to screen a true external data set. The concepts of explainable AI (XAI) coupled with Read-Across were used to interpret the contributions of the RASAR descriptors in the best c-RASAR model and to explain the chemical diversity in the dataset. The application of various unsupervised dimensionality reduction techniques like t-SNE and UMAP and the supervised ARKA framework showed the usefulness of the RASAR descriptors over the selected QSAR descriptors in their ability to group similar compounds, enhancing the modelability of the dataset and efficiently identifying activity cliffs. 
Furthermore, the activity cliffs were also identified from Read-Across by observing the nature of compounds constituting the nearest neighbors for a particular query compound. On comparing our simple linear c-RASAR model with the previously reported models developed using the same dataset derived from the US FDA Orange Book ( https://www.accessdata.fda.gov/scripts/cder/ob/index.cfm ), it was observed that our model is simple, reproducible, transferable, and highly predictive. The performance of the LDA c-RASAR model on the true external set supersedes that of the previously reported work. Therefore, the present simple LDA c-RASAR model can efficiently be used to predict the hepatotoxicity of query chemicals.
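The abstract describes RASAR descriptors as similarity-based information drawn from the close source neighbors of a query compound. The following is only a minimal sketch of that idea, not the authors' implementation: the Gaussian kernel, the value of `sigma`, the choice of `k`, the toy data, and the descriptor names (`ra_score`, `avg_sim`, `max_sim_pos`, `max_sim_neg`) are all assumptions made for illustration.

```python
import math


def gaussian_similarity(a, b, sigma=1.0):
    """Gaussian-kernel similarity between two descriptor vectors (1.0 = identical)."""
    dist_sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-dist_sq / (2.0 * sigma ** 2))


def rasar_like_descriptors(query, sources, k=3, sigma=1.0):
    """Summarise the k most similar source compounds for one query compound.

    sources: non-empty list of (descriptor_vector, label) pairs,
    with label 1 = toxic and 0 = non-toxic.
    The returned descriptor names are hypothetical, chosen for this sketch.
    """
    scored = sorted(
        ((gaussian_similarity(query, vec, sigma), label) for vec, label in sources),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total > 0:
        # Similarity-weighted average of neighbor labels (read-across score).
        ra_score = sum(sim * label for sim, label in scored) / total
    else:
        # All similarities underflowed to zero: fall back to a plain average.
        ra_score = sum(label for _, label in scored) / len(scored)
    return {
        "ra_score": ra_score,
        "avg_sim": total / len(scored),
        "max_sim_pos": max((s for s, y in scored if y == 1), default=0.0),
        "max_sim_neg": max((s for s, y in scored if y == 0), default=0.0),
    }


# Toy source set: two-dimensional descriptor vectors with binary labels.
sources = [
    ([0.0, 0.0], 0),
    ([0.1, 0.0], 0),
    ([1.0, 1.0], 1),
    ([1.1, 1.0], 1),
    ([5.0, 5.0], 1),
]
desc = rasar_like_descriptors([1.0, 1.05], sources, k=3)
print(desc)
```

In this toy setting, a query whose nearest neighbors carry mixed labels would show similar values of `max_sim_pos` and `max_sim_neg`, which is one simple way a neighbor-based view can flag a potential activity cliff.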
Affiliation(s)
- Arkaprava Banerjee
  - Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata, 700 032, India
- Kunal Roy
  - Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata, 700 032, India
3. Banerjee A, Roy K. How to correctly develop q-RASAR models for predictive cheminformatics. Expert Opin Drug Discov 2024; 19:1017-1022. [PMID: 38966910 DOI: 10.1080/17460441.2024.2376651] [Received: 05/17/2024] [Accepted: 07/02/2024] [Indexed: 07/06/2024]
Affiliation(s)
- Arkaprava Banerjee
  - Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata, India
- Kunal Roy
  - Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata, India
4. Wu S, Li SX, Qiu J, Zhao HM, Li YW, Feng NX, Liu BL, Cai QY, Xiang L, Mo CH, Li QX. Accurate Prediction of Rat Acute Oral Toxicity and Reference Dose for Thousands of Polycyclic Aromatic Hydrocarbon Derivatives Based on Chemometric QSAR and Machine Learning. Environmental Science & Technology 2024. [PMID: 39137267 DOI: 10.1021/acs.est.4c03966] [Indexed: 08/15/2024]
Abstract
Acute oral toxicity data are currently not available for most polycyclic aromatic hydrocarbons (PAHs), especially their derivatives, because it is cost-prohibitive to determine all of them experimentally. Here, quantitative structure-activity relationship (QSAR) models using machine learning (ML) for predicting the toxicity of PAH derivatives were developed, based on rat oral toxicity data points for 788 individual substances. Both the individual ML algorithm gradient boosting regression trees (GBRT) and the stacking ML algorithm (extreme gradient boosting + GBRT + random forest regression) provided the best prediction results with satisfactory determination coefficients for both cross-validation and the test set. It was found that those PAH derivatives with fewer polar hydrogens, more large-sized atoms, more branches, and lower polarizability have higher toxicity. Software based on the optimal ML-QSAR model was successfully developed to expand the application potential of the developed model, obtaining reliable prediction of pLD50 values and reference doses for 6893 external PAH derivatives. Among these chemicals, 472 were identified as moderately or highly toxic; 10 of them had clear environmental detection or use records. The findings provide valuable insights into the toxicity of PAHs and their derivatives, offering a standard platform for effectively evaluating chemical toxicity using ML-QSAR models.
Affiliation(s)
- Shuang Wu
  - Guangdong Provincial Research Center for Environment Pollution Control and Remediation Materials, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Shi-Xin Li
  - Guangdong Provincial Research Center for Environment Pollution Control and Remediation Materials, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Jing Qiu
  - Guangdong Provincial Research Center for Environment Pollution Control and Remediation Materials, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Hai-Ming Zhao
  - Guangdong Provincial Research Center for Environment Pollution Control and Remediation Materials, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Yan-Wen Li
  - Guangdong Provincial Research Center for Environment Pollution Control and Remediation Materials, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Nai-Xian Feng
  - Guangdong Provincial Research Center for Environment Pollution Control and Remediation Materials, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Bai-Lin Liu
  - Guangdong Provincial Research Center for Environment Pollution Control and Remediation Materials, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Quan-Ying Cai
  - Guangdong Provincial Research Center for Environment Pollution Control and Remediation Materials, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Lei Xiang
  - Guangdong Provincial Research Center for Environment Pollution Control and Remediation Materials, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Ce-Hui Mo
  - Guangdong Provincial Research Center for Environment Pollution Control and Remediation Materials, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Qing X Li
  - Department of Molecular Biosciences and Bioengineering, University of Hawaii at Manoa, Honolulu, Hawaii, 96822, United States
5. Jiang JR, Cai WX, Chen ZF, Liao XL, Cai Z. Prediction of acute toxicity for Chlorella vulgaris caused by tire wear particle-derived compounds using quantitative structure-activity relationship models. Water Research 2024; 256:121643. [PMID: 38663211 DOI: 10.1016/j.watres.2024.121643] [Received: 11/29/2023] [Revised: 04/16/2024] [Accepted: 04/17/2024] [Indexed: 05/12/2024]
Abstract
Tire wear particles (TWPs) enter aquatic ecosystems through various pathways, such as rainwater and urban runoff. Additives in TWPs can harm aquatic organisms in these ecosystems. Therefore, it is essential to investigate their toxicity to aquatic organisms. In our study, we initially recorded the median effective concentrations of 21 TWP-derived compounds on Chlorella vulgaris growth, ranging from 0.04 to 8.60 mg/L. Subsequently, through an extensive review of the literature, we incorporated 112 compounds with specific toxicity endpoints to construct the QSAR model using genetic algorithm and multiple linear regression techniques, followed by the construction of the consensus model and the quantitative read-across structure-activity relationship (q-RASAR) model. Meanwhile, we employed rigorous internal and external validation measures to assess the performance of the model. The results indicated that the developed q-RASAR model exhibited strong adaptation, robustness, and reliable prediction, with q-RASAR indicators of Q2LOO = 0.7673, R2tr = 0.8079, R2test = 0.8610, Q2Fn = 0.8285-0.8614, and CCCtest = 0.9222. Based on an external dataset containing 128 emerging TWP-derived compounds, the model's applicability domain coverage was 90.6%. The q-RASAR model predicted that the structure of diphenylamine was associated with higher toxicity, possibly linked to the SpMax2_Bhm and LogBCF descriptors. The established model reliably provides prediction and fills a critical data gap. These findings highlight the potential risks posed by emerging TWP-derived compounds to aquatic organisms.
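The validation indicators quoted in the abstract (R2, Q2LOO, CCC) follow standard definitions, and a small sketch may help readers unfamiliar with them. The code below is an illustration under stated assumptions, not the authors' workflow: it uses a one-descriptor least-squares line instead of the genetic algorithm and multiple linear regression model, the function names and toy numbers are hypothetical, and Q2LOO is computed against the full training-set mean, which is one common convention.

```python
def mean(values):
    return sum(values) / len(values)


def r2(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    m = mean(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - m) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


def ccc(y_obs, y_pred):
    """Lin's concordance correlation coefficient between observed and predicted."""
    mo, mp = mean(y_obs), mean(y_pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(y_obs, y_pred))
    so = sum((o - mo) ** 2 for o in y_obs)
    sp = sum((p - mp) ** 2 for p in y_pred)
    return 2.0 * cov / (so + sp + len(y_obs) * (mo - mp) ** 2)


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (x values must vary)."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def q2_loo(x, y):
    """Leave-one-out Q2: refit without each point, then predict that point."""
    press = 0.0
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    m = mean(y)
    return 1.0 - press / sum((v - m) ** 2 for v in y)


# Toy data (hypothetical): one descriptor and a noisy linear response.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]
slope, intercept = fit_line(x, y)
y_fit = [slope * v + intercept for v in x]
print(r2(y, y_fit), q2_loo(x, y), ccc(y, y_fit))
```

Unlike R2, the CCC penalizes systematic offsets between predicted and observed values, which is why it is often reported alongside R2 for test sets.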
Affiliation(s)
- Jie-Ru Jiang
  - Guangdong Key Laboratory of Environmental Catalysis and Health Risk Control, Guangdong-Hong Kong-Macao Joint Laboratory for Contaminants Exposure and Health, School of Environmental Science and Engineering, Guangdong University of Technology, Guangzhou 510006, China
- Wen-Xi Cai
  - Guangdong Key Laboratory of Environmental Catalysis and Health Risk Control, Guangdong-Hong Kong-Macao Joint Laboratory for Contaminants Exposure and Health, School of Environmental Science and Engineering, Guangdong University of Technology, Guangzhou 510006, China
- Zhi-Feng Chen
  - Guangdong Key Laboratory of Environmental Catalysis and Health Risk Control, Guangdong-Hong Kong-Macao Joint Laboratory for Contaminants Exposure and Health, School of Environmental Science and Engineering, Guangdong University of Technology, Guangzhou 510006, China
- Xiao-Liang Liao
  - Guangdong Key Laboratory of Environmental Catalysis and Health Risk Control, Guangdong-Hong Kong-Macao Joint Laboratory for Contaminants Exposure and Health, School of Environmental Science and Engineering, Guangdong University of Technology, Guangzhou 510006, China
- Zongwei Cai
  - Guangdong Key Laboratory of Environmental Catalysis and Health Risk Control, Guangdong-Hong Kong-Macao Joint Laboratory for Contaminants Exposure and Health, School of Environmental Science and Engineering, Guangdong University of Technology, Guangzhou 510006, China; State Key Laboratory of Environmental and Biological Analysis, Department of Chemistry, Hong Kong Baptist University, Hong Kong 999077, China
6. Pore S, Banerjee A, Roy K. Application of machine learning-based read-across structure-property relationship (RASPR) as a new tool for predictive modelling: Prediction of power conversion efficiency (PCE) for selected classes of organic dyes in dye-sensitized solar cells (DSSCs). Mol Inform 2024; 43:e202300210. [PMID: 38374528 DOI: 10.1002/minf.202300210] [Received: 08/14/2023] [Revised: 12/31/2023] [Accepted: 02/04/2024] [Indexed: 02/21/2024]
Abstract
The application of various in-silico-based approaches for the prediction of various properties of materials has been an effective alternative to experimental methods. Recently, the concepts of quantitative structure-property relationship (QSPR) and read-across (RA) methods were merged to develop a new emerging chemoinformatic tool: read-across structure-property relationship (RASPR). The RASPR method is applicable to both large and small datasets as it uses various similarity and error-based measures. It has also been observed that RASPR models tend to have an increased external predictivity compared to the corresponding QSPR models. In this study, we have modeled the power conversion efficiency (PCE) of organic dyes used in dye-sensitized solar cells (DSSCs) by using the quantitative RASPR (q-RASPR) method. We have used relatively larger classes of organic dyes, namely phenothiazines (n=207), porphyrins (n=281), and triphenylamines (n=229), for the modelling purpose. We have divided each of the datasets into training and test sets in 3 different combinations, and with the training sets we have developed three different QSPR models with structural and physicochemical descriptors and validated them with the corresponding test sets. These corresponding modeled descriptors were used to calculate the RASPR descriptors using a Java-based tool RASAR Descriptor Calculator v2.0 (https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home), and then data fusion was performed by pooling the previously selected structural and physicochemical descriptors with the calculated RASPR descriptors. A further feature selection algorithm was then employed to develop the final RASPR PLS models.
Here, we also developed different machine learning (ML) models with the descriptors selected in the QSPR PLS and RASPR PLS models, and it was found that models with RASPR descriptors superseded in external predictivity the models with only structural and physicochemical descriptors: RMSEP reduced for phenothiazines from 1.16-1.25 to 1.07-1.18, for porphyrins from 1.60-1.79 to 1.45-1.53, for triphenylamines from 1.27-1.54 to 1.20-1.47.
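Two steps in this abstract are easy to show concretely: the data fusion step, where the selected structural and physicochemical descriptors are pooled with the calculated RASPR descriptors, and the RMSEP used to compare external predictivity. The sketch below is only an illustration under assumptions. The column names and numbers are invented, the helper names `fuse_descriptors` and `rmsep` are hypothetical, and it does not reproduce the RASAR Descriptor Calculator or the PLS modelling.

```python
import math


def fuse_descriptors(structural, raspr):
    """Pool two descriptor tables column-wise.

    Each table is a dict mapping a column name to a list of per-compound values.
    Column names must be unique and all columns must cover the same compounds.
    """
    overlap = set(structural) & set(raspr)
    if overlap:
        raise ValueError(f"duplicate descriptor names: {sorted(overlap)}")
    lengths = {len(col) for col in list(structural.values()) + list(raspr.values())}
    if len(lengths) != 1:
        raise ValueError("all descriptor columns must have the same length")
    fused = dict(structural)
    fused.update(raspr)
    return fused


def rmsep(y_obs, y_pred):
    """Root mean square error of prediction on an external (test) set."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs))


# Hypothetical descriptor tables for three compounds.
structural = {"desc_a": [0.2, 0.5, 0.9], "desc_b": [1.0, 0.0, 1.0]}
raspr = {"ra_pred": [4.1, 5.0, 6.2], "avg_sim": [0.8, 0.7, 0.9]}
fused = fuse_descriptors(structural, raspr)
print(sorted(fused))

# Hypothetical test-set PCE values and predictions from two models.
y_obs = [4.0, 5.5, 6.0]
print(rmsep(y_obs, [5.0, 4.5, 7.0]), rmsep(y_obs, [4.5, 5.0, 6.5]))
```

A lower RMSEP on the same test set indicates better external predictivity, which is the sense in which the abstract compares the QSPR-only and RASPR-based models.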
Affiliation(s)
- Souvik Pore
  - Drug Theoretics and Chemoinformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, 188 Raja S C Mullick Road, 700032, Kolkata, India
- Arkaprava Banerjee
  - Drug Theoretics and Chemoinformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, 188 Raja S C Mullick Road, 700032, Kolkata, India
- Kunal Roy
  - Drug Theoretics and Chemoinformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, 188 Raja S C Mullick Road, 700032, Kolkata, India