Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

Total Articles

43
(from Reference Citation Analysis)

Article PDFs (18)

Cited by > 0 (30)

Searched Name

Molecular fingerprints

Number	Citation Analysis
1	One chiral fingerprint to find them all. J Cheminform 2024;16:53. [PMID: 38741153 DOI: 10.1186/s13321-024-00849-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2023] [Accepted: 04/28/2024] [Indexed: 05/16/2024]
Abstract: Molecular fingerprints are indispensable tools in cheminformatics. However, stereochemistry is generally not considered, which is problematic for large molecules, which are almost all chiral. Herein we report MAP4C, a chiral version of our previously reported fingerprint MAP4, which lists MinHashes computed from character strings containing the SMILES of all pairs of circular substructures up to a diameter of four bonds and the shortest topological distance between their central atoms. MAP4C includes the Cahn-Ingold-Prelog (CIP) annotation (R, S, r or s) whenever the chiral atom is the center of a circular substructure, a question mark for undefined stereocenters, and double bond cis-trans information if specified. MAP4C performs slightly better than the achiral MAP4, ECFP and AP fingerprints in non-stereoselective virtual screening benchmarks. Furthermore, MAP4C distinguishes between stereoisomers in chiral molecules from small molecule drugs to large natural products and peptides comprising thousands of diastereomers, with a degree of distinction smaller than between structural isomers and proportional to the number of chirality changes. Due to its excellent performance across diverse molecular classes and its ability to handle stereochemistry, MAP4C is recommended as a generally applicable chiral molecular fingerprint.
Scientific contribution: The ability of our chiral fingerprint MAP4C to handle stereoisomers from small molecules to large natural products and peptides is unprecedented and opens the way for cheminformatics to include stereochemistry as an important molecular parameter across all fields of molecular design.
Key Words: Atom-pairs; Chemical space; Molecular fingerprints; Stereochemistry; Virtual screening
Grants: 885076, HORIZON EUROPE European Research Council; 200020_178998, Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung
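The MinHash step described in the abstract above can be illustrated with a minimal sketch. It is not the MAP4C implementation: the shingle strings, the `$R`/`$S` stand-in for a CIP label, the helper names, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib

def minhash(shingles, num_perm=64):
    """MinHash signature: for each seeded hash, keep the smallest value over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big")
            for s in shingles))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions; estimates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def strip_stereo(shingles):
    """Drop the hypothetical stereo labels to mimic an achiral fingerprint."""
    return {s.replace("$R", "").replace("$S", "") for s in shingles}

# Hypothetical shingles of the form "substructure|distance|substructure".
isomer_r = {"CC(N)C$R|1|C(=O)O", "CC(N)C$R|2|CO", "C(=O)O|3|CO", "CN|1|CC"}
isomer_s = {"CC(N)C$S|1|C(=O)O", "CC(N)C$S|2|CO", "C(=O)O|3|CO", "CN|1|CC"}

sig_r, sig_s = minhash(isomer_r), minhash(isomer_s)
print(estimated_jaccard(sig_r, sig_s))  # below 1.0: the labels separate the two isomers
print(estimated_jaccard(minhash(strip_stereo(isomer_r)),
                        minhash(strip_stereo(isomer_s))))  # 1.0: identical without labels
```

In this toy setting the two isomers share only the shingles that carry no stereo label, so the signatures differ; once the labels are stripped the shingle sets coincide and the signatures match exactly.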
2	Machine learning framework for predicting cytotoxicity and identifying toxicity drivers of disinfection byproducts. JOURNAL OF HAZARDOUS MATERIALS 2024;469:133989. [PMID: 38461660 DOI: 10.1016/j.jhazmat.2024.133989] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2023] [Revised: 03/06/2024] [Accepted: 03/06/2024] [Indexed: 03/12/2024]
Abstract: Drinking water disinfection can result in the formation of disinfection byproducts (DBPs, > 700 have been identified to date), many of which are reportedly cytotoxic, genotoxic, or developmentally toxic. Analyzing the toxicity levels of these contaminants experimentally is challenging; however, a predictive model could rapidly and effectively assess their toxicity. In this study, machine learning models were developed to predict DBP cytotoxicity based on their chemical information and exposure experiments. The Random Forest model achieved the best performance (coefficient of determination of 0.62 and root mean square error of 0.63) among all the algorithms screened. Also, the results of a probabilistic model demonstrated reliable model predictions. According to the model interpretation, halogen atoms are the most prominent features for DBP cytotoxicity compared to other chemical substructures. The presence of iodine and bromine is associated with increased cytotoxicity levels, while the presence of chlorine is linked to a reduction in cytotoxicity levels. Other factors including chemical substructures (CC, N, CN, and 6-member ring), cell line, and exposure duration can significantly affect the cytotoxicity of DBPs. The similarity calculation indicated that the model has a large applicability domain and can provide reliable predictions for DBPs with unknown cytotoxicity. Finally, this study showed the effectiveness of data augmentation in the scenario of data scarcity.
Key Words: Applicability domain; Chemical toxicity prediction; Data augmentation; Machine learning assisted QSAR models; Molecular fingerprints
MESH Headings: Animals; Cricetinae; Disinfection; Disinfectants/toxicity; Disinfectants/analysis; Water Purification; Halogenation; Water Pollutants, Chemical/toxicity; Water Pollutants, Chemical/analysis; Halogens; Chlorine; Drinking Water/analysis; CHO Cells
3	Geometric deep learning for the prediction of magnesium-binding sites in RNA structures. Int J Biol Macromol 2024;262:130150. [PMID: 38365157 DOI: 10.1016/j.ijbiomac.2024.130150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Revised: 01/24/2024] [Accepted: 02/11/2024] [Indexed: 02/18/2024]
Abstract: Magnesium ions (Mg2+) are essential for the folding, functional expression, and structural stability of RNA molecules. However, predicting Mg2+-binding sites in RNA molecules based solely on RNA structures is still challenging. The molecular surface, characterized by a continuous shape with geometric and chemical properties, is important for RNA modelling and carries essential information for understanding the interactions between RNAs and Mg2+ ions. Here, we propose an approach named RNA-magnesium ion surface interaction fingerprinting (RMSIF), a geometric deep learning-based conceptual framework to predict magnesium ion binding sites in RNA structures. To evaluate the performance of RMSIF, we systematically enumerated decoy Mg2+ ions across a full-space grid within the range of 2 to 10 Å from the RNA molecule and made predictions accordingly. Visualization techniques were used to validate the prediction results and calculate success rates. Comparative assessments against state-of-the-art methods like MetalionRNA, MgNet, and Metal3DRNA revealed that RMSIF achieved superior success rates and accuracy in predicting Mg2+-binding sites. Additionally, in terms of the spatial distribution of Mg2+ ions within the RNA structures, a majority were situated in the deep grooves, while a minority occupied the shallow grooves. Collectively, the conceptual framework developed in this study holds promise for advancing insights into drug design, RNA co-transcriptional folding, and structure prediction.
Key Words: Geometric deep learning; Magnesium binding site; Molecular fingerprints; RNA
MESH Headings: RNA/chemistry; Magnesium/chemistry; Deep Learning; Binding Sites; Ions/chemistry
4	Deep Learning Algorithm Based on Molecular Fingerprint for Prediction of Drug-Induced Liver Injury. Toxicology 2024;502:153736. [PMID: 38307192 DOI: 10.1016/j.tox.2024.153736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Revised: 01/02/2024] [Accepted: 01/23/2024] [Indexed: 02/04/2024]
Abstract: Drug-induced liver injury (DILI) is one of the rare adverse drug reactions (ADRs) and a multifactorial endpoint. Current preclinical animal models struggle to anticipate it, and in silico methods have emerged as an approach with significant potential for doing so. In this study, a high-quality dataset of 1573 compounds was assembled. The 48 classification models, which depended on six different molecular fingerprints, were built via deep neural network (DNN) and seven machine learning algorithms. Comparing the results of the DNN and machine learning models, the optimal-performing model was the one developed based on the DNN with ECFP_6 as input, which achieved an area under the receiver operating characteristic curve (AUC) of 0.713, balanced accuracy (BA) of 0.680, and F1 of 0.753. In addition, we used the SHapley Additive exPlanations (SHAP) algorithm to interpret the models, identified the crucial structural fragments related to DILI risk, and selected the top ten substructures with the highest contribution rankings to serve as warning indicators for subsequent drug hepatotoxicity screening studies. The study demonstrates that the DNN models developed based on molecular fingerprints can be a trustworthy and efficient tool for determining the risk of DILI during the pre-development of novel medications.
Key Words: Deep neural network; Drug-induced liver injury; Machine learning; Molecular fingerprints; Shapley additive explanation
MESH Headings: Animals; Deep Learning; Algorithms; Chemical and Drug Induced Liver Injury/diagnosis; Chemical and Drug Induced Liver Injury/etiology; Machine Learning; Neural Networks, Computer
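For readers unfamiliar with the balanced accuracy (BA) and F1 metrics quoted in the abstract above, the following sketch shows how both follow from a binary confusion matrix. The counts are hypothetical and are not taken from the study; the function names are chosen here for illustration.

```python
def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of raw counts."""
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical confusion matrix for a DILI-positive vs. DILI-negative classifier.
tp, fp, tn, fn = 80, 30, 70, 20
print(balanced_accuracy(tp, fp, tn, fn))  # (80/100 + 70/100) / 2
print(f1_score(tp, fp, fn))               # 160 / 210
```

Balanced accuracy weights both classes equally, which matters when DILI-positive and DILI-negative compounds are unevenly represented, whereas F1 ignores true negatives altogether.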
5	Large-scale comparison of machine learning methods for profiling prediction of kinase inhibitors. J Cheminform 2024;16:13. [PMID: 38291477 PMCID: PMC10829268 DOI: 10.1186/s13321-023-00799-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2023] [Accepted: 12/22/2023] [Indexed: 02/01/2024]
Abstract: Conventional machine learning (ML) and deep learning (DL) play a key role in the selectivity prediction of kinase inhibitors. A number of models based on available datasets can be used to predict the kinase profile of compounds, but there is still controversy about the advantages and disadvantages of ML and DL for such tasks. In this study, we constructed a comprehensive benchmark dataset of kinase inhibitors, involving 141,086 unique compounds and 216,823 well-defined bioassay data points for 354 kinases. We then systematically compared the performance of 12 ML and DL methods on the kinase profiling prediction task. Extensive experimental results reveal that (1) Descriptor-based ML models generally slightly outperform fingerprint-based ML models in terms of predictive performance. RF as an ensemble learning approach displays the overall best predictive performance. (2) Single-task graph-based DL models are generally inferior to conventional descriptor- and fingerprint-based ML models; however, the corresponding multi-task models generally improve the average accuracy of kinase profile prediction. For example, the multi-task FP-GNN model outperforms the conventional descriptor- and fingerprint-based ML models with an average AUC of 0.807. (3) Fusion models based on voting and stacking methods can further improve the performance of the kinase profiling prediction task; specifically, the RF::AtomPairs + FP2 + RDKitDes fusion model performs best with the highest average AUC value of 0.825 on the test sets. These findings provide useful information for guiding choices of the ML and DL methods for the kinase profiling prediction tasks. Finally, an online platform called KIPP (https://kipp.idruglab.cn) and python software are developed based on the best models to support the kinase profiling prediction, as well as various kinase inhibitor identification tasks including virtual screening, compound repositioning and target fishing.
Key Words: Deep learning; Kinase profiling; Machine learning; Molecular fingerprints; Molecular graphs
Grants: 81973241, National Natural Science Foundation of China; 2020A1515010548, Natural Science Foundation of Guangdong Province
6	Predicting drug-induced liver injury using graph attention mechanism and molecular fingerprints. Methods 2024;221:18-26. [PMID: 38040204 DOI: 10.1016/j.ymeth.2023.11.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Revised: 11/14/2023] [Accepted: 11/25/2023] [Indexed: 12/03/2023]
Abstract: Drug-induced liver injury (DILI) is a significant issue in drug development and clinical treatment due to its potential to cause liver dysfunction or damage, which, in severe cases, can lead to liver failure or even fatality. DILI has numerous pathogenic factors, many of which remain incompletely understood. Consequently, it is imperative to devise methodologies and tools for anticipatory assessment of DILI risk in the initial phases of drug development. In this study, we present DMFPGA, a novel deep learning predictive model designed to predict DILI. To provide a comprehensive description of molecular properties, we employ a multi-head graph attention mechanism to extract features from the molecular graphs, representing characteristics at the level of compound nodes. Additionally, we combine multiple fingerprints of molecules to capture features at the molecular level of compounds. The fusion of molecular fingerprints and graph features can more fully express the properties of compounds. Subsequently, we employ a fully connected neural network to classify compounds as either DILI-positive or DILI-negative. To rigorously evaluate DMFPGA's performance, we conduct a 5-fold cross-validation experiment. The obtained results demonstrate the superiority of our method over four existing state-of-the-art computational approaches, exhibiting an average AUC of 0.935 and an average ACC of 0.934. We believe that DMFPGA is helpful for early-stage DILI prediction and assessment in drug development.
Key Words: Deep learning; Drug-induced liver injury; Feature fusion; Graph attention mechanism; Molecular fingerprints; Molecular graph features
MESH Headings: Humans; Chemical and Drug Induced Liver Injury/etiology; Drug Development; Models, Chemical; Deep Learning
7	FIAMol-AB: A feature fusion and attention-based deep learning method for enhanced antibiotic discovery. Comput Biol Med 2024;168:107762. [PMID: 38056212 DOI: 10.1016/j.compbiomed.2023.107762] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2023] [Revised: 10/31/2023] [Accepted: 11/21/2023] [Indexed: 12/08/2023]
Abstract: Antibiotic resistance continues to be a growing concern for global health, accentuating the need for novel antibiotic discoveries. Traditional methodologies in this field have relied heavily on extensive experimental screening, which is often time-consuming and costly. In contrast, computer-assisted drug screening offers rapid, cost-effective solutions. In this work, we propose FIAMol-AB, a deep learning model that combines graph neural networks, text convolutional networks and molecular fingerprint techniques. This method also uses an attention mechanism to fuse multiple forms of information within the model. The experiments show that FIAMol-AB may offer potential advantages in antibiotic discovery tasks over some existing methods. We conducted some analysis based on our model's results, which helps highlight the potential significance of certain features in the model's predictive performance. Compared to different models, ours demonstrates promising results, indicating potential robustness and versatility. This suggests that by integrating multi-view information and attention mechanisms, FIAMol-AB might better learn complex molecular structures, potentially improving the precision and efficiency of antibiotic discovery. We hope our FIAMol-AB can be used as a useful method in the ongoing fight against antibiotic resistance.
Key Words: Attention mechanism; Computational biology; Graph neural networks; Molecular fingerprints
MESH Headings: Deep Learning; Anti-Bacterial Agents/pharmacology; Drug Evaluation, Preclinical; Neural Networks, Computer
8	A new perspective on predicting the reaction rate constants of hydrated electrons for organic contaminants: Exploring molecular structure characterization methods and ambient conditions. THE SCIENCE OF THE TOTAL ENVIRONMENT 2023;904:166316. [PMID: 37591396 DOI: 10.1016/j.scitotenv.2023.166316] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/01/2023] [Revised: 07/26/2023] [Accepted: 08/12/2023] [Indexed: 08/19/2023]
Abstract: Hydrated electrons (eaq-) exhibit rapid degradation of diverse persistent organic contaminants (OCs) and hold great promise as a formidable reducing agent in water treatment. However, the diverse structures of compounds exert different influences on the second-order rate constant of hydrated electron reactions (keaq-), while the same OCs demonstrate notable discrepancies in keaq- values across different pH levels. This study aims to develop machine learning (ML) models that can effectively simulate the intricate reaction kinetics between eaq- and OCs. Furthermore, the introduction of the pH variable enables a comprehensive investigation into the impact of ambient conditions on this process, thereby improving the practicality of the model. A dataset encompassing 701 keaq- values derived from 351 peer-reviewed publications was compiled. To comprehensively investigate compound properties, this study introduced molecular descriptor (MD), molecular fingerprint (MF), and the integration of both (MD + MF) as model variables. Furthermore, 60 sets of predictive models were established utilizing two variable screening methodologies (MLR and RF) and ten prominent algorithms. Through statistical parameter analysis, it was determined that descriptors combined with MD and MF, the RF screening method, and the symbolism algorithm exhibited the best predictive efficacy. Importantly, the combination of descriptor models exhibited significantly superior performance compared to individual MF and MD models. Notably, the optimal model, denoted as RF - (MF + MD) - LGB, exhibited highly satisfactory predictive results (R2tra = 0.967, Q2tra = 0.840, R2ext = 0.761). The mechanistic explanation study based on Shapley Additive Explanations (SHAP) values further elucidated the crucial influences of polarity, pH, molecular weight, electronegativity, carbon-carbon double bonds, and molecular topology on the degradation of OCs by eaq-. The proposed modeling approach, particularly the integration of MF and MD, alongside the introduction of pH, may furnish innovative ideas for advanced reduction or oxidation processes (ARPs/AOPs) and machine learning applications in other domains.
Key Words: Advanced reduction; Hydrated electrons; Machine learning; Molecular fingerprints; Molecular structure descriptors; Organic contaminants
9	Developing a hybrid model for predicting the reaction kinetics between chlorine and micropollutants in water. WATER RESEARCH 2023;247:120794. [PMID: 37918199 DOI: 10.1016/j.watres.2023.120794] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/20/2023] [Revised: 10/03/2023] [Accepted: 10/27/2023] [Indexed: 11/04/2023]
Abstract: Understanding the reactivities of chlorine towards micropollutants is crucial for assessing the fate of micropollutants in water chlorination. In this study, we integrated machine learning with kinetic modeling to predict the reaction kinetics between micropollutants and chlorine in deionized water and real surface water. We first established a framework to predict the apparent second-order rate constants for micropollutants with chlorine by combining Morgan molecular fingerprints with machine learning algorithms. The framework was tuned using Bayesian optimization and showed high prediction accuracy. It was validated through experiments and used to predict the unreported apparent second-order rate constants for 103 emerging micropollutants with chlorine. The framework also improved the understanding of the structure-dependence of micropollutants' reactivity with chlorine. We incorporated the predicted apparent second-order rate constants into the Kintecus software to establish a hybrid model to profile the time-dependent changes of micropollutant concentrations by chlorination. The hybrid model was validated by experiments conducted in real surface water in the presence of natural organic matter. The hybrid model could predict how much micropollutants were degraded by chlorination with varied chlorine contact times and/or initial chlorine dosages. This study advances fundamental understanding of the reaction kinetics between chlorine and emerging micropollutants, and also offers a valuable tool to assess the fate of micropollutants during chlorination of drinking water.
Key Words: Chlorine; Machine learning; Micropollutants; Molecular fingerprints; Rate constants; Reaction kinetics
MESH Headings: Chlorine; Bayes Theorem; Water Purification; Drinking Water; Kinetics; Water Pollutants, Chemical/analysis; Halogenation
10	A structural similarity networking assisted collision cross-section prediction interval filtering strategy for multi-compound identification of complex matrix by ion-mobility mass spectrometry. Anal Chim Acta 2023;1278:341720. [PMID: 37709461 DOI: 10.1016/j.aca.2023.341720] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Revised: 07/28/2023] [Accepted: 08/14/2023] [Indexed: 09/16/2023]
Abstract: Ion mobility coupled with mass spectrometry (IM-MS), an emerging technology for analysis of complex matrices, has been facing challenges due to the complexities of chemical structures and original data, as well as the low efficiency and error-proneness of manual operations. In this study, we developed a structural similarity networking assisted collision cross-section prediction interval filtering (SSN-CCSPIF) strategy. We first carried out a structural similarity networking (SSN) based on Tanimoto similarities among Morgan fingerprints to classify the authentic compounds potentially existing in a complex matrix. By performing automatic regressive prediction statistics on mass-to-charge ratios (m/z) and collision cross-sections (CCS) with self-built Python software, we explored the IM-MS feature trendlines, established filtering intervals and filtered potential compounds for each SSN classification. Chemical structures of all filtered compounds were further characterized by interpreting their multidimensional IM-MS data. To evaluate the applicability of SSN-CCSPIF, we selected Ginkgo biloba extract and dripping pills. The SSN-CCSPIF subtracted more background interferences (43.24% to 43.92%) than other similar strategies with conventional ClassyFire criteria (10.71% to 12.13%) or without compound classification (35.73% to 36.63%). In total, 229 compounds, including eight potential new compounds, were characterized. Among them, seven isomeric pairs were discriminated with the integration of IM-separation. Using SSN-CCSPIF, we can achieve highly efficient analysis of complex IM-MS data and comprehensive chemical profiling of complex matrices to reveal their material basis.
Key Words: Collision cross-section; Ginkgo biloba; Ion-mobility mass spectrometry; Molecular fingerprints; Structural similarity networking
11	A molecule perturbation software library and its application to study the effects of molecular design constraints. J Cheminform 2023;15:89. [PMID: 37752561 PMCID: PMC10523775 DOI: 10.1186/s13321-023-00761-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Accepted: 09/15/2023] [Indexed: 09/28/2023]
Abstract: Computational molecular design can yield chemically unreasonable compounds when performed carelessly. A popular strategy to mitigate this risk is mimicking reference chemistry. This is commonly achieved by restricting the way in which molecules are constructed or modified. While it is well established that such an approach helps in designing chemically appealing molecules, concerns about these restrictions impacting chemical space exploration negatively linger. In this work we present a software library for constrained graph-based molecule manipulation and showcase its functionality by developing a molecule generator. Said generator designs molecules mimicking reference chemical features of differing granularity. We find that restricting molecular construction lightly, beyond the usual positive effects on drug-likeness and synthesizability of designed molecules, provides guidance to optimization algorithms navigating chemical space. Nonetheless, restricting molecular construction excessively can indeed hinder effective chemical space exploration.
Key Words: Chemical space; Constraints; De novo molecule generation; Molecular design; Molecular fingerprints; RDKit; Software library; Topological perturbations
Grants: 39461, Fonds Wetenschappelijk Onderzoek
12	Behavior of organic components and the migration of heavy metals during sludge dewatering by different advanced oxidation processes via optical spectroscopy and molecular fingerprint analysis. WATER RESEARCH 2023;243:120336. [PMID: 37454458 DOI: 10.1016/j.watres.2023.120336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2023] [Revised: 07/04/2023] [Accepted: 07/08/2023] [Indexed: 07/18/2023]
Abstract: A comparative study of different advanced oxidation processes (Fe(II)-Oxone, Fe(II)-H2O2, and Fe(II)-NaClO) was carried out to analyze the characteristics of organic components and the migration of heavy metals in waste activated sludge. With the Fe(II)-Oxone and Fe(II)-H2O2 treatments, sludge dewaterability was significantly improved; however, it deteriorated with the Fe(II)-NaClO treatment. The enhanced sludge dewaterability by the Fe(II)-Oxone and Fe(II)-H2O2 treatments was strongly correlated with the shifted organic components, particularly proteins, in soluble extracellular polymeric substances (S-EPS), while the deteriorated sludge dewaterability by the Fe(II)-NaClO treatment was strongly correlated with the over-release of organic components from bound EPS (B-EPS) to S-EPS. For both the Fe(II)-Oxone and Fe(II)-H2O2 treatments, the radicals preferentially attacked humic acid-like organic components over the protein-like organic components in S-EPS, while for the Fe(II)-NaClO treatment, interestingly, the radicals preferentially attacked the protein-like organic components in both S-EPS and B-EPS. The hydrophilic functional groups like phenolic OH and CO of polysaccharides may have migrated more preferentially to S-EPS of sludge with the Fe(II)-NaClO treatment than with the other two treatments. With the Fe(II)-Oxone and Fe(II)-H2O2 treatments, the proportion of aliphatic compounds as well as of highly oxygenated organic components with a low desaturation and a low molecular weight increased. In contrast, with the Fe(II)-NaClO treatment, the proportion of low oxygenated organic components with a high desaturation and a high molecular weight increased. The concentration of total organic carbon, particularly the concentration of proteins, may be the key factor determining the shift of Zn and Cu from sludge solid to liquid phase, along with the high oxidation extent of organic components and close binding to CHOS and CHON compounds as indicated by density functional theory (DFT) calculation. This study systematically revealed the simultaneous sludge dewatering and migration of heavy metals when the role of organic components was factored in.
Key Words: Advanced oxidation processes; Heavy metals; Molecular fingerprints; Organic components; Sludge dewatering
MESH Headings: Sewage/chemistry; Hydrogen Peroxide/chemistry; Waste Disposal, Fluid/methods; Water/chemistry; Metals, Heavy; Oxidation-Reduction; Spectrum Analysis; Proteins; Ferrous Compounds/chemistry
13	Identification of 5-nitroindazole as a multitargeted inhibitor for CDK and transferase kinase in lung cancer: a multisampling algorithm-based structural study. Mol Divers 2023:10.1007/s11030-023-10648-0. [PMID: 37058176 DOI: 10.1007/s11030-023-10648-0] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 03/12/2023] [Accepted: 04/05/2023] [Indexed: 04/15/2023]
Abstract: Lung cancer is the second most common cancer, which is the leading cause of cancer death worldwide. The FDA has approved almost 100 drugs against lung cancer, but it is still not curable as most drugs target a single protein and block a single pathway. In this study, we screened the Drug Bank library against three major proteins of lung cancer, ribosomal protein S6 kinase alpha-6 (6G77), cyclic-dependent protein kinase 2 (1AQ1), and insulin-like growth factor 1 (1K3A), and identified the compound 5-nitroindazole (DB04534) as a multitargeted inhibitor that potentially can treat lung cancer. For the screening, we deployed multisampling algorithms such as HTVS, SP and XP, followed by the MM/GBSA calculation, and the study was extended to molecular fingerprinting analysis, pharmacokinetics prediction, and Molecular Dynamics simulation to understand the complex's stability. The docking scores against the proteins 6G77, 1AQ1, and 1K3A were -6.884 kcal/mol, -7.515 kcal/mol, and -6.754 kcal/mol, respectively. Also, the compound has shown all the values satisfying the ADMET criteria, the fingerprint analysis has shown wide similarities, and the WaterMap analysis helped justify the compound's suitability. The molecular dynamics of each complex have shown a cumulative deviation of less than 2 Å, which is considered best for the biomolecules, especially for the protein-ligand complexes. The best feature of the identified drug candidate is that it targets multiple proteins that control cell division and growth hormone mediates simultaneously, reducing the burden of the pharmaceutical industry by reducing the resistance chance.
Key Words: 5-Nitroindazole; Molecular dynamics simulation; Molecular fingerprints; Multitargeted docking; WaterMap
14	Chemoinformatics-driven classification of Angiosperms using sulfur-containing compounds and machine learning algorithm. PLANT METHODS 2022;18:118. [PMID: 36335358 PMCID: PMC9636760 DOI: 10.1186/s13007-022-00951-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/16/2022] [Accepted: 10/14/2022] [Indexed: 06/16/2023] Abstract BACKGROUND Phytochemicals or secondary metabolites are low molecular weight organic compounds with little function in plant growth and development. Nevertheless, metabolite diversity not only governs the phenetics of an organism but may also inform the evolutionary pattern and adaptation of green plants to the changing environment. Plant chemoinformatics analyzes the chemical system of natural products using computational tools and robust mathematical algorithms. It has been a powerful approach for species-level differentiation and is widely employed for species classifications and reinforcement of previous classifications. RESULTS This study attempts to classify Angiosperms using plant sulfur-containing compound (SCC) or sulphated compound information. The SCC dataset of 692 plant species was collected from the comprehensive species-metabolite relationship family (KNApSAck) database. The structural similarity scores of metabolite pairs under all possible combinations (plant species-metabolite) were determined, and metabolite pairs with a Tanimoto coefficient value > 0.85 were selected for clustering using a machine learning algorithm. Metabolite clustering showed an association between the clusters of structurally similar metabolites and metabolite content among the plant species. Phylogenetic tree construction of Angiosperms displayed three major clades, of which clade 1 and clade 2 represented the eudicots only, and clade 3 a mixture of both eudicots and monocots. 
The SCC-based construction of Angiosperm phylogeny is a subset of the existing monocot-dicot classification. The majority of eudicots present in clade 1 and 2 were represented by glucosinolate compounds. These clades with SCC may have been a mixture of ancestral species, whilst the combinatorial presence of monocot-dicot in clade 3 suggests sulphated-chemical structure diversification in the event of adaptation during evolutionary change. CONCLUSIONS Sulphated chemoinformatics informs classification of Angiosperms via a machine learning technique. Key Words: Angiosperms; Chemoinformatics; KNApSAck database; Molecular fingerprints; Monocot-dicot; Sulfur-containing compounds. Grants: Malaysian Ministry of Higher Education and Ministry of Science, Technology and Innovation
15	Concepts and applications of chemical fingerprint for hit and lead screening. Drug Discov Today 2022;27:103356. [PMID: 36113834 DOI: 10.1016/j.drudis.2022.103356] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 12/07/2021] [Revised: 07/28/2022] [Accepted: 09/08/2022] [Indexed: 11/22/2022] Abstract Molecular fingerprints are used to represent chemical (structural, physicochemical, etc.) properties of large-scale chemical sets at low computational cost. They have a prominent role in transforming chemical data sets into consistent input formats (bit strings or numeric values) suitable for in silico approaches. In this review, we summarize and classify common and state-of-the-art fingerprints into eight different types (dictionary based, circular, topological, pharmacophore, protein-ligand interaction, shape based, reinforced, and multi). We also highlight applications of fingerprints in early drug research and development (R&D). Thus, this review provides a guide for the selection of appropriate fingerprints of compounds (or ligand-protein complexes) for use in drug R&D. Key Words: Computational chemistry; Descriptors; Molecular fingerprints; drug R&D
16	Identification of Chemical-Disease Associations Through Integration of Molecular Fingerprint, Gene Ontology and Pathway Information. Interdiscip Sci 2022;14:683-696. [PMID: 35391615 DOI: 10.1007/s12539-022-00511-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2021] [Revised: 03/16/2022] [Accepted: 03/17/2022] [Indexed: 06/14/2023] Abstract The identification of chemical-disease association types is helpful not only to discover lead compounds and study drug repositioning, but also to treat disease and decipher pathomechanisms. There is an urgent need to develop computational methods for identifying potential chemical-disease association types, since wet-lab methods are usually expensive, laborious and time-consuming. In this study, molecular fingerprint, gene ontology and pathway information are utilized to characterize chemicals and diseases. A novel predictor is proposed to recognize potential chemical-disease associations at the first layer, and further distinguish whether their relationships belong to biomarker or therapeutic relations at the second layer. The prediction performance of the current method is assessed using the benchmark dataset based on ten-fold cross-validation. The practical prediction accuracies of the first layer and the second layer are 78.47% and 72.07%, respectively. The recognition ability for lead compounds, new drug indications, and potential and true chemical-disease association pairs has also been investigated and confirmed by constructing a variety of datasets and performing a series of experiments. It is anticipated that the current method can be considered a powerful high-throughput virtual screening tool for drug research and development. 
Key Words: Chemical–disease associations; Gene ontology; Molecular fingerprints; Pathway; Random forest. MESH Headings: Drug Repositioning; Gene Ontology
17	Classifying natural products from plants, fungi or bacteria using the COCONUT database and machine learning. J Cheminform 2021;13:82. [PMID: 34663470 PMCID: PMC8524952 DOI: 10.1186/s13321-021-00559-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 06/25/2021] [Accepted: 10/02/2021] [Indexed: 01/13/2023] Open Abstract Natural products (NPs) represent one of the most important resources for discovering new drugs. Here we asked whether NP origin can be assigned from their molecular structure in a subset of 60,171 NPs in the recently reported Collection of Open Natural Products (COCONUT) database assigned to plants, fungi, or bacteria. Visualizing this subset in an interactive tree-map (TMAP) calculated using MAP4 (MinHashed atom pair fingerprint) clustered NPs according to their assigned origin (https://tm.gdb.tools/map4/coconut_tmap/), and a support vector machine (SVM) trained with MAP4 correctly assigned the origin for 94% of plant, 89% of fungal, and 89% of bacterial NPs in this subset. An online tool based on an SVM trained with the entire subset correctly assigned the origin of further NPs with similar performance (https://np-svm-map4.gdb.tools/). Origin information might be useful when searching for biosynthetic genes of NPs isolated from plants but produced by endophytic microorganisms. Key Words: Chemical space; Cheminformatics; Machine learning; Molecular fingerprints; Natural products; Support vector machine; Visualization. Grants: 200020_178998 schweizerischer nationalfonds zur förderung der wissenschaftlichen forschung; 885076 h2020 european research council
18	FP-ADMET: a compendium of fingerprint-based ADMET prediction models. J Cheminform 2021;13:75. [PMID: 34583740 PMCID: PMC8479898 DOI: 10.1186/s13321-021-00557-5] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 04/07/2021] [Accepted: 09/20/2021] [Indexed: 12/11/2022] Open Abstract MOTIVATION The absorption, distribution, metabolism, excretion, and toxicity (ADMET) of drugs plays a key role in determining which among the potential candidates are to be prioritized. In silico approaches based on machine learning methods are becoming increasingly popular, but are nonetheless limited by the availability of data. With a view to making both data and models available to the scientific community, we have developed FPADMET, which is a repository of molecular fingerprint-based predictive models for ADMET properties. In this article, we have examined the efficacy of fingerprint-based machine learning models for a large number of ADMET-related properties. The predictive ability of a set of 20 different binary fingerprints (based on substructure keys, atom pairs, local path environments, as well as custom fingerprints such as all-shortest paths) for over 50 ADMET and ADMET-related endpoints has been evaluated as part of the study. We find that for a majority of the properties, fingerprint-based random forest models yield comparable or better performance compared with traditional 2D/3D molecular descriptors. AVAILABILITY The models are made available as part of open access software that can be downloaded from https://gitlab.com/vishsoft/fpadmet. Key Words: ADMET; Machine learning; Molecular fingerprints
19	Extended similarity indices: the benefits of comparing more than two objects simultaneously. Part 2: speed, consistency, diversity selection. J Cheminform 2021;13:33. [PMID: 33892799 PMCID: PMC8067665 DOI: 10.1186/s13321-021-00504-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 11/20/2020] [Accepted: 03/12/2021] [Indexed: 11/10/2022] Open Abstract Despite being a central concept in cheminformatics, molecular similarity has so far been limited to the simultaneous comparison of only two molecules at a time and using one index, generally the Tanimoto coefficient. In a recent contribution we have not only introduced a complete mathematical framework for extended similarity calculations (i.e. comparisons of more than two molecules at a time) but also defined a series of novel indices. Part 1 is a detailed analysis of the effects of various parameters on the similarity values calculated by the extended formulas. Their features were revealed by sum of ranking differences and ANOVA. Here, in addition to characterizing several important aspects of the newly introduced similarity metrics, we will highlight their applicability and utility in real-life scenarios using datasets with popular molecular fingerprints. Remarkably, for large datasets, the use of extended similarity measures provides an unprecedented speed-up over “traditional” pairwise similarity matrix calculations. We also provide illustrative examples of a more direct algorithm based on the extended Tanimoto similarity to select diverse compound sets, resulting in much higher levels of diversity than traditional approaches. We discuss the inner and outer consistency of our indices, which are key in practical applications, showing whether the n-ary and binary indices rank the data in the same way. 
We demonstrate the use of the new n-ary similarity metrics on t-distributed stochastic neighbor embedding (t-SNE) plots of datasets of varying diversity, or corresponding to ligands of different pharmaceutical targets, which show that our indices provide a better measure of set compactness than standard binary measures. We also present a conceptual example of the applicability of our indices in agglomerative hierarchical algorithms. The Python code for calculating the extended similarity metrics is freely available at: https://github.com/ramirandaq/MultipleComparisons. Key Words: Computational complexity; Consistency; Extended similarity indices; Molecular fingerprints; Multiple comparisons; Rankings; Scaling; Sum of ranking differences
20	Extended similarity indices: the benefits of comparing more than two objects simultaneously. Part 1: Theory and characteristics. J Cheminform 2021;13:32. [PMID: 33892802 PMCID: PMC8067658 DOI: 10.1186/s13321-021-00505-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 09/15/2020] [Accepted: 03/12/2021] [Indexed: 12/14/2022] Open Abstract Quantification of the similarity of objects is a key concept in many areas of computational science. This includes cheminformatics, where molecular similarity is usually quantified based on binary fingerprints. While there is a wide selection of available molecular representations and similarity metrics, there were no previous efforts to extend the computational framework of similarity calculations to the simultaneous comparison of more than two objects (molecules) at the same time. The present study bridges this gap, by introducing a straightforward computational framework for comparing multiple objects at the same time and providing extended formulas for as many similarity metrics as possible. In the binary case (i.e. when comparing two molecules pairwise) these are naturally reduced to their well-known formulas. We provide a detailed analysis on the effects of various parameters on the similarity values calculated by the extended formulas. The extended similarity indices are entirely general and do not depend on the fingerprints used. Two types of variance analysis (ANOVA) help to understand the main features of the indices: (i) ANOVA of mean similarity indices; (ii) ANOVA of sum of ranking differences (SRD). Practical aspects and applications of the extended similarity indices are detailed in the accompanying paper: Miranda-Quintana et al. J Cheminform. 2021. https://doi.org/10.1186/s13321-021-00504-4. 
Python code for calculating the extended similarity metrics is freely available at: https://github.com/ramirandaq/MultipleComparisons. Key Words: ANOVA; Comparisons; Consistency; Extended similarity indices; Molecular fingerprints; Rankings; Sum of ranking differences
21	LINGO-DL: a text-based approach for molecular similarity searching. J Comput Aided Mol Des 2021;35:657-665. [PMID: 33797669 DOI: 10.1007/s10822-021-00383-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2020] [Accepted: 03/26/2021] [Indexed: 11/24/2022] Abstract The line notations of chemical structures are more compact than those of graphs and connection tables, so they can be useful for storing and transferring a large number of molecular structures. The simplified molecular input line entry system (SMILES) representation is the most extensively used, as it is much easier to utilise and comprehend than others, and it can be generated automatically from connection tables. A SMILES string represents and encodes the molecular structure. It has been used by an existing method, LINGO, to calculate molecular similarities and predict structure-related properties. The LINGO method decomposes a canonical SMILES into a set of substrings of four characters referred to as LINGOs. The purpose of the LINGO method is to measure the similarity between a pair of molecules by comparing the LINGOs that occur in each molecule. This paper aims to introduce an alternative version of the LINGO method using LINGOs of different lengths, called LINGO-DL. LINGO-DL is based on the fragmentation of canonical SMILES into substrings of three different lengths rather than the single length used in the LINGO method. Retrospective virtual screening experiments with MDDR, DUD, and MUV datasets show that LINGO-DL outperforms the LINGO method, especially when the active molecules being sought have a high degree of structural heterogeneity. Key Words: Drug discovery; LINGO; Ligand-based virtual screening; Molecular fingerprints; SMILES
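The fragmentation step described in the LINGO-DL abstract above can be illustrated with a minimal sketch. It splits a SMILES string into overlapping substrings of a fixed length and compares two molecules through the substrings they share. The function names, the multiset Tanimoto scoring, and the absence of any SMILES canonicalisation or ring-number normalisation are assumptions made for illustration only; the published LINGO and LINGO-DL methods may preprocess and score differently.

```python
from collections import Counter


def lingos(smiles: str, q: int = 4) -> Counter:
    """Count every overlapping substring of length q in a SMILES string.

    Hypothetical helper: the input is assumed to be canonical already.
    """
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def lingo_tanimoto(smiles_a: str, smiles_b: str, q: int = 4) -> float:
    """Multiset Tanimoto over substrings (an assumed scoring choice)."""
    a, b = lingos(smiles_a, q), lingos(smiles_b, q)
    shared = sum((a & b).values())  # multiset intersection
    total = sum((a | b).values())   # multiset union
    return shared / total if total else 0.0


if __name__ == "__main__":
    # 1-butanol vs. 1-butanamine, written as plain SMILES strings
    print(lingo_tanimoto("CCCCO", "CCCCN"))
```

A multi-length variant in the spirit of LINGO-DL could call `lingos` with several values of `q` and merge the resulting counters, although the abstract does not state which three lengths are used.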
22	Structure-activity relationship-based chemical classification of highly imbalanced Tox21 datasets. J Cheminform 2020;12:66. [PMID: 33372637 PMCID: PMC7592558 DOI: 10.1186/s13321-020-00468-x] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 12/13/2019] [Accepted: 10/13/2020] [Indexed: 12/14/2022] Open Abstract The specificity of toxicant-target biomolecule interactions lends to the very imbalanced nature of many toxicity datasets, causing poor performance in Structure–Activity Relationship (SAR)-based chemical classification. Undersampling and oversampling are representative techniques for handling such an imbalance challenge. However, removing inactive chemical compound instances from the majority class using an undersampling technique can result in information loss, whereas increasing active toxicant instances in the minority class by interpolation tends to introduce artificial minority instances that often cross into the majority class space, giving rise to class overlapping and a higher false prediction rate. In this study, in order to improve the prediction accuracy of imbalanced learning, we employed SMOTEENN, a combination of Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, to oversample the minority class by creating synthetic samples, followed by cleaning the mislabeled instances. We chose the highly imbalanced Tox21 dataset, which consisted of 12 in vitro bioassays for > 10,000 chemicals that were distributed unevenly between binary classes. With Random Forest (RF) as the base classifier and bagging as the ensemble strategy, we applied four hybrid learning methods, i.e., RF without imbalance handling (RF), RF with Random Undersampling (RUS), RF with SMOTE (SMO), and RF with SMOTEENN (SMN). 
The performance of the four learning methods was compared using nine evaluation metrics, among which F₁ score, Matthews correlation coefficient and Brier score provided a more consistent assessment of the overall performance across the 12 datasets. The Friedman’s aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that SMN significantly outperformed the other three methods. We also found that a strong negative correlation existed between the prediction accuracy and the imbalance ratio (IR), which is defined as the number of inactive compounds divided by the number of active compounds. SMN became less effective when IR exceeded a certain threshold (e.g., > 28). The ability to separate the few active compounds from the vast amounts of inactive ones is of great importance in computational toxicology. This work demonstrates that the performance of SAR-based, imbalanced chemical toxicity classification can be significantly improved through the use of data rebalancing. Key Words: Bootstrap aggregation (bagging); Chemical classification; Class distribution imbalance; Edited nearest neighbor (ENN); Ensemble learning; Molecular fingerprints; Random forest (RF); Random undersampling (RUS); Resampling; Structure–activity relationship (SAR); Synthetic minority over-sampling technique (SMOTE)
23	Convolutional architectures for virtual screening. BMC Bioinformatics 2020;21:310. [PMID: 32938359 PMCID: PMC7493874 DOI: 10.1186/s12859-020-03645-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 07/02/2020] [Accepted: 07/06/2020] [Indexed: 11/21/2022] Open Abstract Background A Virtual Screening algorithm has to adapt to the different stages of this process. Early screening needs to ensure that all bioactive compounds are ranked in the first positions regardless of the number of false positives, while a second screening round is aimed at increasing the prediction accuracy. Results A novel CNN architecture is presented to this end, which predicts bioactivity of candidate compounds on CDK1 using a combination of molecular fingerprints as their vector representation, and has been trained suitably to achieve good results as regards both enrichment factor and accuracy in different screening modes (98.55% accuracy in active-only selection, and 98.88% in high precision discrimination). Conclusion The proposed architecture outperforms state-of-the-art ML approaches, and some interesting insights on molecular fingerprints are devised. Key Words: Bioactivity prediction; Deep learning; Drug design; Molecular fingerprints; Virtual screening
24	Monomer structure fingerprints: an extension of the monomer composition version for peptide databases. J Comput Aided Mol Des 2020;34:1147-1156. [PMID: 32812076 DOI: 10.1007/s10822-020-00336-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2020] [Accepted: 08/12/2020] [Indexed: 10/23/2022] Abstract Previously, a fingerprint based on the monomer composition (MCFP) of nonribosomal peptides (NRPs) was introduced. MCFP is a novel method for obtaining a representative description of NRP structures from their monomer composition in a fingerprint form. Effective screening and prediction of biological activities has been obtained from the Norine NRP database. In this paper, we present an extension of the MCFP fingerprint. This extension is based on adding a few columns to the fingerprint, representing monomer clusters, 2D structures, peptide categories, and peptide diversity. All these data have been extracted from the NRP structure. Experiments with the Norine NRP database showed that the extended MCFP, which can be called Monomer Structure FingerPrint (MSFP), produced high prediction accuracy (> 95%) together with a high recall rate (86%) when MSFP was used for prediction and similarity searching. From this study it appeared that MSFP, mainly built from monomer composition, can be substantially improved by adding more columns representing useful information about monomer composition and 2D structure of NRPs. Key Words: Drug discovery; Ligand-based virtual screening; Molecular fingerprints; Natural products; Nonribosomal peptides; Target prediction
25	A cheminformatic study on chemical space characterization and diversity analysis of 5-LOX inhibitors. J Mol Graph Model 2020;100:107699. [PMID: 32799052 DOI: 10.1016/j.jmgm.2020.107699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2020] [Revised: 06/19/2020] [Accepted: 07/10/2020] [Indexed: 10/23/2022] Abstract The process of blocking 5-lipoxygenase (5-LOX) catalyzed leukotriene biosynthesis has been recognized for the past few decades as a promising therapeutic strategy for acute inflammatory, allergic, and respiratory diseases. Due to the toxicity of the FDA-approved 5-LOX inhibitor zileuton, novel 5-LOX inhibitors have been sought by the scientific community. As a result, a significant and relevant amount of information on the structure-activity of 5-LOX inhibitors has been released and stored in public databases. In this study, we aimed at the comprehensive cheminformatic characterization of the diversity and complexity of the chemical space of 5-LOX inhibitors and its activating protein FLAP inhibitors by comparing it with the Approved drug space and virtual LOX library. The visual representation of the property space indicates that some compounds in the 5-LOX inhibitors space broaden the traditional medicinal space. The structural diversity of the databases is computed using complementary approaches, including Physicochemical Property (PCP) descriptors, molecular fingerprints, and molecular scaffold. With the apparent exception of approved drugs, the 5-LOX dataset shows more diversity compared to the FLAP and LOX virtual library sets. This study was able to identify the underlying patterns in the chemical and pharmacological properties space that were decisive for the drug discovery and development of 5-LOX inhibitors. 
Key Words: 5-Lipoxygenase; Chemical space; Cheminformatics; FLAP; Leukotrienes; Molecular fingerprints; Molecular scaffold; PCP descriptors
26	One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J Cheminform 2020;12:43. [PMID: 33431010 PMCID: PMC7291580 DOI: 10.1186/s13321-020-00445-4] [Citation(s) in RCA: 105] [Impact Index Per Article: 26.3] [Received: 03/18/2020] [Accepted: 06/04/2020] [Indexed: 02/08/2023] Open Abstract Background Molecular fingerprints are essential cheminformatics tools for virtual screening and mapping chemical space. Among the different types of fingerprints, substructure fingerprints perform best for small molecules such as drugs, while atom-pair fingerprints are preferable for large molecules such as peptides. However, no available fingerprint achieves good performance on both classes of molecules. Results Here we set out to design a new fingerprint suitable for both small and large molecules by combining substructure and atom-pair concepts. Our quest resulted in a new fingerprint called MinHashed atom-pair fingerprint up to a diameter of four bonds (MAP4). In this fingerprint the circular substructures with radii of r = 1 and r = 2 bonds around each atom in an atom-pair are written as two pairs of SMILES, each pair being combined with the topological distance separating the two central atoms. These so-called atom-pair molecular shingles are hashed, and the resulting set of hashes is MinHashed to form the MAP4 fingerprint. MAP4 significantly outperforms all other fingerprints on an extended benchmark that combines the Riniker and Landrum small molecule benchmark with a peptide benchmark recovering BLAST analogs from either scrambled or point mutation analogs. 
MAP4 furthermore produces well-organized chemical space tree-maps (TMAPs) for databases as diverse as DrugBank, ChEMBL, SwissProt and the Human Metabolome Database (HMDB), and differentiates between all metabolites in HMDB, over 70% of which are indistinguishable from their nearest neighbor using substructure fingerprints. Conclusion MAP4 is a new molecular fingerprint suitable for drugs, biomolecules, and the metabolome and can be adopted as a universal fingerprint to describe and search chemical space. The source code is available at https://github.com/reymond-group/map4 and interactive MAP4 similarity search tools and TMAPs for various databases are accessible at http://map-search.gdb.tools/ and http://tm.gdb.tools/map4/. Key Words: Chemical space; Databases; Locality sensitive hashing; Molecular fingerprints; Virtual screening
27	A deep neural network combined with molecular fingerprints (DNN-MF) to develop predictive models for hydroxyl radical rate constants of water contaminants. JOURNAL OF HAZARDOUS MATERIALS 2020;383:121141. [PMID: 31610411 DOI: 10.1016/j.jhazmat.2019.121141] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Received: 07/31/2019] [Revised: 08/29/2019] [Accepted: 09/02/2019] [Indexed: 05/24/2023] Abstract This work combined a Deep Neural Network (DNN) with molecular fingerprints (MF) to develop models to predict the OH radical rate constants of 593 organic contaminants. Molecular descriptors, most often used in establishing quantitative structural-activity relationships (QSARs), were not used here because of their complicated generation processes that rely on advanced physicochemical and computational knowledge. Instead, we only fed the most basic information of the contaminant structures, i.e., MF encoding the types of atoms and how they are connected, to the DNN, which then developed predictive models automatically. Here, a dataset containing 457 contaminants and their OH rate constants was first used to develop predictive models by DNN-MF. The resulting models showed comparable accuracy to the traditional QSARs. The root mean square error (RMSE) values of the test sets were 0.358-0.384. A length of 2048 bits for the MF and 3 hidden layers (each with 1024 neurons) were found to be the optimal parameters for the DNN. The model containing an additional 89 micropollutants in the training set was then successfully applied to predict the OH rate constants of 17 organophosphorus flame retardants and 29 additional micropollutants, with comparable accuracy to the reported molecular descriptors-based QSARs. 
Key Words: Advanced oxidation processes; Deep neural network; Hydroxyl radical; Molecular fingerprints; QSAR; Water treatment
28	The chemfp project. J Cheminform 2019;11:76. [PMID: 33430977 PMCID: PMC6896769 DOI: 10.1186/s13321-019-0398-8] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 06/10/2019] [Accepted: 11/25/2019] [Indexed: 11/30/2022] Open Abstract The chemfp project has had four main goals: (1) promote the FPS format as a text-based exchange format for dense binary cheminformatics fingerprints, (2) develop a high-performance implementation of the BitBound algorithm that could be used as an effective baseline to benchmark new similarity search implementations, (3) experiment with funding a pure open source software project through commercial sales, and (4) publish the results and lessons learned as a guide for future implementors. The FPS format has had only minor success, though it did influence development of the FPB binary format, which is faster to load but more complex. Both are summarized. The chemfp benchmark and the no-cost/open source version of chemfp are proposed as a reference baseline to evaluate the effectiveness of other similarity search tools. They are used to evaluate the faster commercial version of chemfp, which can test 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k = 1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query. The same search of 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest CPU-based similarity search implementations. Modern CPUs are fast enough that memory bandwidth and latency are now important factors. Single-threaded search uses most of the available memory bandwidth. 
Sorting the fingerprints by popcount improves memory coherency, which when combined with 4 OpenMP threads makes it possible to construct an N × N similarity matrix for 1 million fingerprints in about 30 min. These observations may affect the interpretation of previous publications which assumed that search was strongly CPU bound. The chemfp project funding came from selling a purely open-source software product. Several product business models were tried, but none proved sustainable. Some of the experiences are discussed, in order to contribute to the ongoing conversation on the role of open source software in cheminformatics. Key Words: FOSS; Format; High-performance; Molecular fingerprints; Open source; Performance benchmark; Similarity searching; Tanimoto
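The role of popcounts in Tanimoto searching, mentioned in the chemfp abstract above, can be sketched in a few lines. For two bit fingerprints with popcounts a and b, the Tanimoto score can never exceed min(a, b) / max(a, b), so a threshold search can skip any target whose popcount alone rules it out. This is only an illustration of that bound using Python integers as fingerprints; it is not chemfp's implementation or API, and all function names here are hypothetical.

```python
def popcount(fp: int) -> int:
    """Number of set bits in a fingerprint stored as a Python int."""
    return bin(fp).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two bit fingerprints (0.0 if both are empty)."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def threshold_search(query, targets, threshold):
    """Return (index, score) pairs with Tanimoto >= threshold.

    Targets are skipped without computing the score when the popcount
    bound min(q, t) / max(q, t) is already below the threshold.
    """
    q = popcount(query)
    hits = []
    for i, fp in enumerate(targets):
        t = popcount(fp)
        bound = min(q, t) / max(q, t) if max(q, t) else 0.0
        if bound < threshold:
            continue  # cannot reach the threshold, skip the comparison
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((i, score))
    return hits


if __name__ == "__main__":
    targets = [0b1111, 0b0011, 0b11110000, 0b0001]
    print(threshold_search(0b1111, targets, 0.5))
```

If the targets are stored sorted by popcount, the same bound limits the search to one contiguous range of the array instead of a per-target check, which fits the abstract's remark that popcount ordering also helps memory coherency.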
29	Prediction of the skin sensitising potential and potency of compounds via mechanism-based binary and ternary classification models. Toxicol In Vitro 2019;59:204-214. [PMID: 31028860 DOI: 10.1016/j.tiv.2019.01.004] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2018] [Revised: 12/28/2018] [Accepted: 01/10/2019] [Indexed: 10/26/2022] Abstract Skin sensitisation, one of the most frequent forms of human immune toxicity, is authenticated to be a significant endpoint in the field of drug discovery and cosmetics. Due to the drawbacks of traditional animal testing methods, in silico methods have advanced to study skin sensitisation. In this study, mechanism-based binary and ternary classification models were constructed with a comprehensive data set. 1007 compounds were collected to develop five series of local and global models based on mechanisms. In each series, compounds were classified into five groups according to EC3 values, and applied as training sets, test sets and external validation sets. For each of the five series, 81 binary classification models and 81 ternary classification models were acquired via 9 molecular fingerprints and 9 machine learning methods using a novel KNIME workflow. Meanwhile, the applicability domains for the best 10 models were figured out to certify the rationality of prediction effect. In addition, 8 toxic substructures probably causing skin sensitisation were identified to speculate whether a compound is a skin sensitiser. The mechanism-based prediction models and the toxic substructures can be applied to predict the skin sensitising potential and potency of compounds. Collapse Key Words Computational toxicology Machine learning Molecular fingerprints Prediction models Skin sensitisation Collapse MESH Headings Collapse Grants Collapse
30	Web-Based Tools for Polypharmacology Prediction. Methods Mol Biol 2019;1888:255-272. [PMID: 30519952 DOI: 10.1007/978-1-4939-8891-4_15] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Indexed: 01/05/2023] Abstract Drug promiscuity or polypharmacology is the ability of small molecules to interact with multiple protein targets simultaneously. In drug discovery, understanding the polypharmacology of potential drug molecules is crucial to improve their efficacy and safety, and to discover the new therapeutic potentials of existing drugs. Over the past decade, several computational methods have been developed to study the polypharmacology of small molecules, many of which are available as Web services. In this chapter, we review some of these Web tools focusing on ligand-based approaches. We highlight in particular our recently developed polypharmacology browser (PPB) and its application for finding the side targets of a new inhibitor of the TRPV6 calcium channel. Key Words: Drug–target interactions; Molecular fingerprints; Polypharmacology; Similarity searching; Target prediction
31	Statistical-based database fingerprint: chemical space dependent representation of compound databases. J Cheminform 2018;10:55. [PMID: 30467740 PMCID: PMC6755589 DOI: 10.1186/s13321-018-0311-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 09/08/2018] [Accepted: 11/14/2018] [Indexed: 11/30/2022] Open Abstract Background Simplified representation of compound databases has several applications in cheminformatics. Herein, we introduce an alternative and general method to build single fingerprint representations of compound databases. The approach is inspired by the previously published modal fingerprints, which aim to capture the most significant bits of a fingerprint representation for a compound data set. The novelty of the proposed statistical-based database fingerprint (SB-DFP) is that it is generated from binomial proportion comparisons, taking as reference the distribution of “1” bits in a large representative set of the chemical space. Results To illustrate the method, SB-DFPs were constructed for 28 epigenetic target data sets retrieved from a recently published epigenomics database of interest in probe and drug discovery. For each target data set, the SB-DFPs were built based on two representative fingerprints of different design using as reference a data set with more than 15 million compounds from ZINC. The application of SB-DFP was illustrated and compared to other methods through association relationships of the 28 epigenetic data sets and similarity searching. It was found that SB-DFPs captured, overall, the common features between data sets and the distinct features of each set. In similarity searching, SB-DFP equaled or outperformed other approaches for at least 20 out of the 28 sets. 
Conclusions SB-DFP is a general approach based on binomial proportion comparisons to represent a compound data set with a single fingerprint. SB-DFP can be developed, at least in principle, based on any fingerprint and reference data set. SB-DFP is a good alternative for exploration of relationships between targets through its associated compound data sets and performing similarity searching. Electronic supplementary material: The online version of this article (10.1186/s13321-018-0311-x) contains supplementary material, which is available to authorized users. Key Words: Chemical space; Epi-informatics; Molecular fingerprints; Representation; Similarity searching
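The abstract says only that SB-DFP bits are derived from binomial proportion comparisons against a reference set. One way such a comparison could look is sketched below: a bit is set when its frequency in the data set is significantly higher than in the reference, using a pooled one-sided two-proportion z-test. The choice of test, the critical value, and all names here are assumptions made for illustration and may differ from the published procedure.

```python
from math import sqrt


def bit_frequencies(fps: list[list[int]]) -> list[float]:
    """Fraction of fingerprints with a 1 at each bit position."""
    if not fps:
        raise ValueError("need at least one fingerprint")
    n = len(fps)
    return [sum(column) / n for column in zip(*fps)]


def sb_dfp_sketch(data_fps: list[list[int]],
                  ref_freqs: list[float],
                  ref_size: int,
                  z_crit: float = 1.645) -> list[int]:
    """Build one summary fingerprint for a data set (hypothetical sketch).

    Bit i is set to 1 when the data-set proportion of 1s at position i
    exceeds the reference proportion with z > z_crit (1.645 is the
    one-sided 5% critical value of the standard normal distribution).
    """
    n_data = len(data_fps)
    summary = []
    for p_data, p_ref in zip(bit_frequencies(data_fps), ref_freqs):
        # Pooled proportion under the null hypothesis of equal rates.
        pooled = (p_data * n_data + p_ref * ref_size) / (n_data + ref_size)
        std_err = sqrt(pooled * (1 - pooled) * (1 / n_data + 1 / ref_size))
        z = (p_data - p_ref) / std_err if std_err > 0 else 0.0
        summary.append(1 if z > z_crit else 0)
    return summary
```

With a reference set much larger than the data set, the standard error is dominated by the data-set size, so small data sets need a large frequency difference before a bit is set.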
32	Utilizing knowledge base of amino acids structural neighborhoods to predict protein-protein interaction sites. BMC Bioinformatics 2017;18:492. [PMID: 29244012 PMCID: PMC5731498 DOI: 10.1186/s12859-017-1921-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open Abstract BACKGROUND Protein-protein interactions (PPI) play a key role in an investigation of various biochemical processes, and their identification is thus of great importance. Although computational prediction of which amino acids take part in a PPI has been an active field of research for some time, the quality of in-silico methods is still far from perfect. RESULTS We have developed a novel prediction method called INSPiRE which benefits from a knowledge base built from data available in Protein Data Bank. All proteins involved in PPIs were converted into labeled graphs with nodes corresponding to amino acids and edges to pairs of neighboring amino acids. A structural neighborhood of each node was then encoded into a bit string and stored in the knowledge base. When predicting PPIs, INSPiRE labels amino acids of unknown proteins as interface or non-interface based on how often their structural neighborhood appears as interface or non-interface in the knowledge base. We evaluated INSPiRE's behavior with respect to different types and sizes of the structural neighborhood. Furthermore, we examined the suitability of several different features for labeling the nodes. Our evaluations showed that INSPiRE clearly outperforms existing methods with respect to Matthews correlation coefficient. CONCLUSION In this paper we introduce a new knowledge-based method for identification of protein-protein interaction sites called INSPiRE. 
Its knowledge base utilizes structural patterns of known interaction sites in the Protein Data Bank which are then used for PPI prediction. Extensive experiments on several well-established datasets show that INSPiRE significantly surpasses existing PPI approaches. Key Words: Data mining; Molecular fingerprints; Prediction; Protein-protein interaction. MESH Headings: Amino Acids/chemistry; Amino Acids/metabolism; Computational Biology; Databases, Protein; Knowledge Bases; Models, Statistical; Protein Interaction Mapping/methods; Proteins/chemistry; Proteins/metabolism; Software
33	ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics. J Cheminform 2017;9:17. [PMID: 28316655 PMCID: PMC5340785 DOI: 10.1186/s13321-017-0203-5] [Citation(s) in RCA: 75] [Impact Index Per Article: 10.7] [Received: 12/05/2016] [Accepted: 02/24/2017] [Indexed: 12/02/2022] Open Abstract Chemogenomics data generally refers to the activity data of chemical compounds on an array of protein targets and represents an important source of information for building in silico target prediction models. The increasing volume of chemogenomics data offers exciting opportunities to build models based on Big Data. Preparing a high quality data set is a vital step in realizing this goal and this work aims to compile such a comprehensive chemogenomics dataset. This dataset comprises over 70 million SAR data points from publicly available databases (PubChem and ChEMBL) including structure, target information and activity annotations. Our aspiration is to create a useful chemogenomics resource reflecting industry-scale data not only for building predictive models of in silico polypharmacology and off-target effects but also for the validation of cheminformatics approaches in general. Key Words: Big Data; Bioactivity; Chemical structure; Chemogenomics; Molecular fingerprints; QSAR; Search engine
34	The polypharmacology browser: a web-based multi-fingerprint target prediction tool using ChEMBL bioactivity data. J Cheminform 2017;9:11. [PMID: 28270862 PMCID: PMC5319934 DOI: 10.1186/s13321-017-0199-x] [Citation(s) in RCA: 64] [Impact Index Per Article: 9.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2016] [Accepted: 02/10/2017] [Indexed: 12/31/2022] Open Abstract Background Several web-based tools have been reported recently which predict the possible targets of a small molecule by similarity to compounds of known bioactivity using molecular fingerprints (fps), however predictions in each case rely on similarities computed from only one or two fps. Considering that structural similarity and therefore the predicted targets strongly depend on the method used for comparison, it would be highly desirable to predict targets using a broader set of fps simultaneously. Results Herein, we present the polypharmacology browser (PPB), a web-based platform which predicts possible targets for small molecules by searching for nearest neighbors using ten different fps describing composition, substructures, molecular shape and pharmacophores. PPB searches through 4613 groups of at least 10 same target annotated bioactive molecules from ChEMBL and returns a list of predicted targets ranked by consensus voting scheme and p value. A validation study across 670 drugs with up to 20 targets showed that combining the predictions from all 10 fps gives the best results, with on average 50% of the known targets of a drug being correctly predicted with a hit rate of 25%. Furthermore, when profiling a new inhibitor of the calcium channel TRPV6 against 24 targets taken from a safety screen panel, we observed inhibition in 5 out of 5 targets predicted by PPB and in 7 out of 18 targets not predicted by PPB. 
The rate of correct (5/12) and incorrect (0/12) predictions for this compound by PPB was comparable to that of other web-based prediction tools. Conclusion PPB offers a versatile platform for target prediction based on multi-fingerprint comparisons, and is freely accessible at www.gdb.unibe.ch as a valuable support for drug discovery.
35	Database fingerprint (DFP): an approach to represent molecular databases. J Cheminform 2017;9:9. [PMID: 28224019 PMCID: PMC5293704 DOI: 10.1186/s13321-017-0195-1] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 09/30/2016] [Accepted: 01/23/2017] [Indexed: 01/19/2023] Open Abstract BACKGROUND Molecular fingerprints are widely used in several areas of chemoinformatics including diversity analysis and similarity searching. The fingerprint-based analysis of chemical libraries, in particular of large collections, usually requires the molecular representation of each compound in the library, which may lead to issues of storage space and redundant calculations. In fact, information redundancy is inherent to the data, resulting in binary digit positions in the fingerprint without significant information. RESULTS Herein is proposed a general approach to represent an entire compound library with a single binary fingerprint. The development of the database fingerprint (DFP) is illustrated first using a short fingerprint (MACCS keys) for 10 data sets of general interest in chemistry. The application of the DFP is further shown with PubChem fingerprints for the data sets used in the primary example but with a larger number of compounds, up to 25,000 molecules. The performance of DFP was studied through differential Shannon entropy, k-means clustering, and DFP/Tanimoto similarity. CONCLUSIONS The DFP is designed to capture key information of the compound collection and can be used to compare and assess the diversity of molecular libraries. This Preliminary Communication shows the potential of the novel fingerprint to analyze inter-library relationships. 
A major future goal is to apply the DFP to virtual screening and to develop DFPs for other data sets based on several different types of fingerprints. Graphical Abstract: Database fingerprint captures the key information of molecular databases to perform chemical space characterization and virtual screening. Key Words: Diversity; Information content; Molecular fingerprints; Shannon entropy; Similarity
36	Consensus Diversity Plots: a global diversity analysis of chemical libraries. J Cheminform 2016;8:63. [PMID: 27895718 PMCID: PMC5105260 DOI: 10.1186/s13321-016-0176-9] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.0] [Received: 08/31/2016] [Accepted: 10/27/2016] [Indexed: 01/14/2023] Open Abstract Background Measuring the structural diversity of compound databases is relevant in drug discovery and many other areas of chemistry. Since molecular diversity depends on molecular representation, comprehensive chemoinformatic analysis of the diversity of libraries uses multiple criteria. For instance, the diversity of the molecular libraries is typically evaluated employing molecular scaffolds, structural fingerprints, and physicochemical properties. However, the assessment with each criterion is analyzed independently and it is not straightforward to provide an evaluation of the “global diversity”. Results Herein the Consensus Diversity Plot (CDP) is proposed as a novel method to represent in low dimensions the diversity of chemical libraries considering simultaneously multiple molecular representations. We illustrate the application of CDPs to classify eight compound data sets and two subsets with different sizes and compositions using molecular scaffolds, structural fingerprints, and physicochemical properties. Conclusions CDPs are general data mining tools that represent in two dimensions the global diversity of compound data sets using multiple metrics. These plots can be constructed using single or combined measures of diversity. An online version of the CDPs is freely available at: https://consensusdiversityplots-difacquim-unam.shinyapps.io/RscriptsCDPlots/.
37	Computational methods for prediction of in vitro effects of new chemical structures. J Cheminform 2016;8:51. [PMID: 28316649 PMCID: PMC5043617 DOI: 10.1186/s13321-016-0162-2] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2015] [Accepted: 09/05/2016] [Indexed: 12/22/2022] Open Abstract BACKGROUND With a constant increase in the number of new chemicals synthesized every year, it becomes important to employ the most reliable and fast in silico screening methods to predict their safety and activity profiles. In recent years, in silico prediction methods received great attention in an attempt to reduce animal experiments for the evaluation of various toxicological endpoints, complementing the theme of replace, reduce and refine. Various computational approaches have been proposed for the prediction of compound toxicity ranging from quantitative structure activity relationship modeling to molecular similarity-based methods and machine learning. Within the "Toxicology in the 21st Century" screening initiative, a crowd-sourcing platform was established for the development and validation of computational models to predict the interference of chemical compounds with nuclear receptor and stress response pathways based on a training set containing more than 10,000 compounds tested in high-throughput screening assays. RESULTS Here, we present the results of various molecular similarity-based and machine-learning based methods over an independent evaluation set containing 647 compounds as provided by the Tox21 Data Challenge 2014. 
It was observed that the Random Forest approach based on MACCS molecular fingerprints and a subset of 13 molecular descriptors selected based on statistical and literature analysis performed best in terms of the area under the receiver operating characteristic curve values. Further, we compared the individual and combined performance of different methods. In retrospect, we also discuss the reasons behind the superior performance of an ensemble approach, combining a similarity search method with the Random Forest algorithm, compared to individual methods while explaining the intrinsic limitations of the latter. CONCLUSIONS Our results suggest that, although prediction methods were optimized individually for each modelled target, an ensemble of similarity and machine-learning approaches provides promising performance indicating its broad applicability in toxicity prediction. Key Words: Machine learning; Molecular fingerprints; Similarity searching; Tox21 challenge; Toxicity prediction. Grants: Bundesministerium für Bildung und Forschung; Deutsche Forschungsgemeinschaft
38	Comparing structural fingerprints using a literature-based similarity benchmark. J Cheminform 2016;8:36. [PMID: 27382417 PMCID: PMC4932683 DOI: 10.1186/s13321-016-0148-0] [Citation(s) in RCA: 101] [Impact Index Per Article: 12.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2016] [Accepted: 06/26/2016] [Indexed: 01/17/2023] Open Abstract Background The concept of molecular similarity is one of the central ideas in cheminformatics, despite the fact that it is ill-defined and rather difficult to assess objectively. Here we propose a practical definition of molecular similarity in the context of drug discovery: molecules A and B are similar if a medicinal chemist would be likely to synthesise and test them around the same time as part of the same medicinal chemistry program. The attraction of such a definition is that it matches one of the key uses of similarity measures in early-stage drug discovery. If we make the assumption that molecules in the same compound activity table in a medicinal chemistry paper were considered similar by the authors of the paper, we can create a dataset of similar molecules from the medicinal chemistry literature. Furthermore, molecules with decreasing levels of similarity to a reference can be found by either ordering molecules in an activity table by their activity, or by considering activity tables in different papers which have at least one molecule in common. Results Using this procedure with activity data from ChEMBL, we have created two benchmark datasets for structural similarity that can be used to guide the development of improved measures. Compared to similar results from a virtual screen, these benchmarks are an order of magnitude more sensitive to differences between fingerprints both because of their size and because they avoid loss of statistical power due to the use of mean scores or ranks. 
We measure the performance of 28 different fingerprints on the benchmark sets and compare the results to those from the Riniker and Landrum (J Cheminf 5:26, 2013. doi:10.1186/1758-2946-5-26) ligand-based virtual screening benchmark. Conclusions Extended-connectivity fingerprints of diameter 4 and 6 are among the best performing fingerprints when ranking diverse structures by similarity, as is the topological torsion fingerprint. However, when ranking very close analogues, the atom pair fingerprint outperforms the others tested. When ranking diverse structures or carrying out a virtual screen, we find that the performance of the ECFP fingerprints significantly improves if the bit-vector length is increased from 1024 to 16,384.
39	ChemDes: an integrated web-based platform for molecular descriptor and fingerprint computation. J Cheminform 2015;7:60. [PMID: 26664458 PMCID: PMC4674923 DOI: 10.1186/s13321-015-0109-z] [Citation(s) in RCA: 162] [Impact Index Per Article: 18.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2015] [Accepted: 11/26/2015] [Indexed: 12/13/2022] Open Abstract BACKGROUND Molecular descriptors and fingerprints have been routinely used in QSAR/SAR analysis, virtual drug screening, compound search/ranking, drug ADME/T prediction and other drug discovery processes. Since the calculation of such quantitative representations of molecules may require substantial computational skills and efforts, several tools have been previously developed to make an attempt to ease the process. However, there are still several hurdles for users to overcome to fully harness the power of these tools. First, most of the tools are distributed as standalone software or packages that require necessary configuration or programming efforts of users. Second, many of the tools can only calculate a subset of molecular descriptors, and the results from multiple tools need to be manually merged to generate a comprehensive set of descriptors. Third, some packages only provide application programming interfaces and are implemented in different computer languages, which pose additional challenges to the integration of these tools. RESULTS A freely available web-based platform, named ChemDes, is developed in this study. It integrates multiple state-of-the-art packages (i.e., Pybel, CDK, RDKit, BlueDesc, Chemopy, PaDEL and jCompoundMapper) for computing molecular descriptors and fingerprints. 
ChemDes not only provides friendly web interfaces to relieve users from burdensome programming work, but also offers three useful and convenient auxiliary tools for format converting, MOPAC optimization and fingerprint similarity calculation. Currently, ChemDes has the capability of computing 3679 molecular descriptors and 59 types of molecular fingerprints. CONCLUSION ChemDes provides users an integrated and friendly tool to calculate various molecular descriptors and fingerprints. It is freely available at http://www.scbdd.com/chemdes. The source code of the project is also available as a supplementary file. Graphical abstract: An overview of ChemDes. A platform for computing various molecular descriptors and fingerprints. Key Words: Chemoinformatics; Molecular descriptors; Molecular fingerprints; Molecular representation; Online descriptor calculation; QSAR/QSPR
40	Cytotoxicity of thiazolidinedione-, oxazolidinedione- and pyrrolidinedione-ring containing compounds in HepG2 cells. Toxicol In Vitro 2015;29:1887-96. [PMID: 26193171 DOI: 10.1016/j.tiv.2015.07.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2015] [Revised: 06/24/2015] [Accepted: 07/16/2015] [Indexed: 11/23/2022] Abstract Liver damage occurred in some patients who took troglitazone (TGZ) for type II diabetes. The 2,4-thiazolidinedione (TZD) ring in TGZ's structure has been implicated in its hepatotoxicity. To further examine the potential role of a TZD ring in toxicity we used HepG2 cells to evaluate two series of compounds containing different cyclic imides. N-phenyl analogues comprised 3-(3,5-dichlorophenyl)-2,4-thiazolidinedione (DCPT); 3-(3,5-dichlorophenyl)-2,4-oxazolidinedione (DCPO) and N-(3,5-dichlorophenyl)succinimide (NDPS). Benzylic compounds, which closely resemble TGZ, included 5-(3,5-dichlorophenylmethyl)-2,4-thiazolidinedione (DCPMT); 5-(4-methoxyphenylmethyl)-2,4-thiazolidinedione (MPMT); 5-(4-methoxyphenylmethylene)-2,4-thiazolidinedione (MPMT-I); 5-(4-methoxyphenylmethyl)-2,4-oxazolidinedione (MPMO); 3-(4-methoxyphenylmethyl)succinimide (MPMS) and 3-(4-methoxyphenylmethylene)succinimide (MPMS-I). Cytotoxicity was assessed using the MTS assay after incubating the compounds (0-250μM) with HepG2 cells for 24h. Only certain TZD derivatives (TGZ, DCPT, DCPMT and MPMT-I) markedly decreased cell viability, whereas MPMT had low toxicity. In contrast, analogues without a TZD ring (DCPO, NDPS, MPMO, MPMS and MPMS-I) were not cytotoxic. These findings suggest that a TZD ring may be an important determinant of toxicity, although different structural features, chemical stability, cellular uptake or metabolism, etc., may also be involved. 
A simple clustering approach, using chemical fingerprints, assigned each compound to one of three classes (each containing one active compound and close homologues), and provided a framework for rationalizing the activity in terms of structure. Key Words: Clustering; Cytotoxicity; HepG2 cells; Molecular fingerprints; Oxazolidinedione; Pyrrolidinedione; Succinimide; Thiazolidinedione; Troglitazone
41	In Silico target fishing: addressing a "Big Data" problem by ligand-based similarity rankings with data fusion. J Cheminform 2014;6:33. [PMID: 24976868 PMCID: PMC4068908 DOI: 10.1186/1758-2946-6-33] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2014] [Accepted: 06/10/2014] [Indexed: 11/16/2022] Open Abstract Background Ligand-based in silico target fishing can be used to identify the potential interacting target of bioactive ligands, which is useful for understanding the polypharmacology and safety profile of existing drugs. The underlying principle of the approach is that known bioactive ligands can be used as reference to predict the targets for a new compound. Results We tested a pipeline enabling large-scale target fishing and drug repositioning, based on simple fingerprint similarity rankings with data fusion. A large library containing 533 drug relevant targets with 179,807 active ligands was compiled, where each target was defined by its ligand set. For a given query molecule, its target profile is generated by similarity searching against the ligand sets assigned to each target, for which individual searches utilizing multiple reference structures are then fused into a single ranking list representing the potential target interaction profile of the query compound. The proposed approach was validated by 10-fold cross validation and two external tests using data from DrugBank and Therapeutic Target Database (TTD). The use of the approach was further demonstrated with some examples concerning the drug repositioning and drug side-effects prediction. The promising results suggest that the proposed method is useful for not only finding promiscuous drugs for their new usages, but also predicting some important toxic liabilities. 
Conclusions With the rapidly increasing volume and diversity of data concerning drug-related targets and their ligands, the simple ligand-based target fishing approach would play an important role in assisting future drug design and discovery. Key Words: Big data; Data fusion; Molecular fingerprints; Similarity searching; Target fishing
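The pipeline described in this abstract (similarity search against each target's ligand set, then fusion into a single ranked target list) can be sketched as follows. The abstract does not state which fusion rule was used, so this sketch assumes the mean of the top-k similarities per target (k = 1 gives MAX fusion). Fingerprints are represented as sets of on-bit indices, and all names and data are hypothetical.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_targets(query: set[int],
                 target_ligands: dict[str, list[set[int]]],
                 k: int = 1) -> list[tuple[str, float]]:
    """Rank targets by fused similarity of the query to their ligand sets.

    Fusion rule (an assumption for illustration): the mean of the k
    highest query-to-ligand similarities for each target.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    fused = {}
    for target, ligands in target_ligands.items():
        sims = sorted((tanimoto(query, lig) for lig in ligands), reverse=True)
        top = sims[:k]
        fused[target] = sum(top) / len(top) if top else 0.0
    # Best score first; ties broken alphabetically by target name.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

For example, `rank_targets({1, 2, 3, 4}, {"T1": [{1, 2, 3, 4}, {1, 2}], "T2": [{1, 2, 3}, {5, 6}]})` places T1 ahead of T2, because T1 contains an exact match to the query.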
42	iCDI-PseFpt: identify the channel-drug interaction in cellular networking with PseAAC and molecular fingerprints. J Theor Biol 2013;337:71-9. [PMID: 23988798 DOI: 10.1016/j.jtbi.2013.08.013] [Citation(s) in RCA: 104] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2013] [Revised: 07/26/2013] [Accepted: 08/14/2013] [Indexed: 12/29/2022] Abstract Many crucial functions in life, such as heartbeat, sensory transduction and central nervous system response, are controlled by cell signalings via various ion channels. Therefore, ion channels have become an excellent drug target, and study of ion channel-drug interaction networks is an important topic for drug development. However, it is both time-consuming and costly to determine whether a drug and a protein ion channel are interacting with each other in a cellular network by means of experimental techniques. Although some computational methods were developed in this regard based on the knowledge of the 3D (three-dimensional) structure of protein, unfortunately their usage is quite limited because the 3D structures for most protein ion channels are still unknown. With the avalanche of protein sequences generated in the post-genomic age, it is highly desirable to develop the sequence-based computational method to address this problem. To take up the challenge, we developed a new predictor called iCDI-PseFpt, in which the protein ion-channel sample is formulated by the PseAAC (pseudo amino acid composition) generated with the gray model theory, the drug compound by the 2D molecular fingerprint, and the operation engine is the fuzzy K-nearest neighbor algorithm. The overall success rate achieved by iCDI-PseFpt via the jackknife cross-validation was 87.27%, which is remarkably higher than that by any of the existing predictors in this area. 
As a user-friendly web-server, iCDI-PseFpt is freely accessible to the public at the website http://www.jci-bioinfo.cn/iCDI-PseFpt/. Furthermore, for the convenience of most experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results without the need to follow the complicated math equations presented in the paper just for its integrity. It has not escaped our notice that the current approach can also be used to study other drug-target interaction networks. Key Words: Fuzzy K-nearest neighbor algorithm; Gray model; Ion channels; Molecular fingerprints; Pseudo amino acid composition
43	Virtual Activity Profiling of Bioactive Molecules by 1D Fingerprinting. Mol Inform 2010;29:773-9. [PMID: 27464267 DOI: 10.1002/minf.201000075] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 07/09/2010] [Accepted: 09/30/2010] [Indexed: 11/11/2022] Key Words: 1D; Molecular representation; Molecular fingerprints; Pharmacophore; Potential pharmacophoric points; Similarity searching; Virtual screening