1
Martínez‐Mauricio KL, García‐Jacas CR, Cordoves‐Delgado G. Examining evolutionary scale modeling-derived different-dimensional embeddings in the antimicrobial peptide classification through a KNIME workflow. Protein Sci 2024; 33:e4928. [PMID: 38501511 PMCID: PMC10949403 DOI: 10.1002/pro.4928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 01/28/2024] [Accepted: 01/30/2024] [Indexed: 03/20/2024]
Abstract
Molecular features play an important role in different bio-chem-informatics tasks, such as Quantitative Structure-Activity Relationship (QSAR) modeling. Several pre-trained models have recently been created for use in downstream tasks, either by fine-tuning a specific model or by extracting features to feed traditional classifiers. In this regard, a new family of Evolutionary Scale Modeling models (termed ESM-2 models) was recently introduced, demonstrating outstanding results in protein structure prediction benchmarks. Herein, we studied the usefulness of the different-dimensional embeddings derived from the ESM-2 models to classify antimicrobial peptides (AMPs). To this end, we built a KNIME workflow to use the same modeling methodology across experiments in order to guarantee fair analyses. As a result, the 640- and 1280-dimensional embeddings derived from the 30- and 33-layer ESM-2 models, respectively, are the most valuable since statistically better performances were achieved by the QSAR models built from them. We also fused features of the different ESM-2 models, and it was concluded that the fusion contributes to getting better QSAR models than using features of a single ESM-2 model. Frequency studies revealed that only a portion of the ESM-2 embeddings is valuable for modeling tasks since between 43% and 66% of the features were never used. Comparisons with state-of-the-art deep learning (DL) models confirm that when performing methodologically principled studies in the prediction of AMPs, non-DL based QSAR models yield comparable-to-superior performances to DL-based QSAR models. The developed KNIME workflow is freely available at https://github.com/cicese-biocom/classification-QSAR-bioKom. This workflow can be valuable to avoid unfair comparisons regarding new computational methods, as well as to propose new non-DL based QSAR models.
Affiliation(s)
- Karla L. Martínez‐Mauricio
  - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Ensenada, Mexico
- César R. García‐Jacas
  - Cátedras CONAHCYT – Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Ensenada, Mexico
- Greneter Cordoves‐Delgado
  - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Ensenada, Mexico

2
He D, Liu Q, Mi Y, Meng Q, Xu L, Hou C, Wang J, Li N, Liu Y, Chai H, Yang Y, Liu J, Wang L, Hou Y. De Novo Generation and Identification of Novel Compounds with Drug Efficacy Based on Machine Learning. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2024; 11:e2307245. [PMID: 38204214 PMCID: PMC10962488 DOI: 10.1002/advs.202307245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Revised: 12/05/2023] [Indexed: 01/12/2024]
Abstract
One of the main challenges in small molecule drug discovery is finding novel chemical compounds with desirable activity. Traditional drug development typically begins with target selection, but the correlation between targets and disease remains to be further investigated, and drugs designed based on targets may not always have the desired drug efficacy. The emergence of machine learning provides a powerful tool to overcome this challenge. Herein, a machine learning-based strategy termed DTLS (Deep Transfer Learning-based Strategy) is developed for de novo generation of novel compounds with drug efficacy, using a dataset of disease-direct-related activity as input. DTLS is applied to two kinds of disease: colorectal cancer (CRC) and Alzheimer's disease (AD). In each case, a novel compound is discovered and identified in in vitro and in vivo disease models. Their mechanism of action is further explored. The experimental results reveal that DTLS can not only realize the generation and identification of novel compounds with drug efficacy but also has the advantage of identifying compounds by focusing on protein targets to facilitate the mechanism study. This work highlights the significant impact of machine learning on the design of novel compounds with drug efficacy, which provides a powerful new approach to drug discovery.
Affiliation(s)
- Dakuo He
  - College of Information Science and Engineering, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110819, China
- Qing Liu
  - College of Information Science and Engineering, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110819, China
- Yan Mi
  - Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110169, China
  - Key Laboratory of Data Analytics and Optimization for Smart Industry, Ministry of Education, Northeastern University, Shenyang 110169, China
- Qingqi Meng
  - Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110169, China
  - Key Laboratory of Data Analytics and Optimization for Smart Industry, Ministry of Education, Northeastern University, Shenyang 110169, China
- Libin Xu
  - Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110169, China
  - Key Laboratory of Data Analytics and Optimization for Smart Industry, Ministry of Education, Northeastern University, Shenyang 110169, China
- Chunyu Hou
  - College of Information Science and Engineering, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110819, China
- Jinpeng Wang
  - College of Information Science and Engineering, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110819, China
- Ning Li
  - School of Traditional Chinese Materia Medica, Key Laboratory for TCM Material Basis Study and Innovative Drug Development of Shenyang City, Shenyang Pharmaceutical University, Shenyang 110016, China
- Yang Liu
  - Key Laboratory of Structure‐Based Drug Design & Discovery of Ministry of Education, Shenyang Pharmaceutical University, Shenyang 110016, China
- Huifang Chai
  - School of Pharmacy, Guizhou University of Traditional Chinese Medicine, Guiyang 550025, China
- Yanqiu Yang
  - Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110169, China
  - Key Laboratory of Data Analytics and Optimization for Smart Industry, Ministry of Education, Northeastern University, Shenyang 110169, China
- Jingyu Liu
  - Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110169, China
  - Key Laboratory of Data Analytics and Optimization for Smart Industry, Ministry of Education, Northeastern University, Shenyang 110169, China
- Lihui Wang
  - Department of Pharmacology, Shenyang Pharmaceutical University, Shenyang 110016, China
- Yue Hou
  - Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110169, China
  - Key Laboratory of Data Analytics and Optimization for Smart Industry, Ministry of Education, Northeastern University, Shenyang 110169, China

3
Bohn L, Drouin SM, McFall GP, Rolfson DB, Andrew MK, Dixon RA. Machine learning analyses identify multi-modal frailty factors that selectively discriminate four cohorts in the Alzheimer's disease spectrum: a COMPASS-ND study. BMC Geriatr 2023; 23:837. [PMID: 38082372 PMCID: PMC10714519 DOI: 10.1186/s12877-023-04546-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Accepted: 11/30/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND: Frailty indicators can operate in dynamic amalgamations of disease conditions, clinical symptoms, biomarkers, medical signals, cognitive characteristics, and even health beliefs and practices. This study is the first to evaluate which, among these multiple frailty-related indicators, are important and differential predictors of clinical cohorts that represent progression along an Alzheimer's disease (AD) spectrum. We applied machine-learning technology to such indicators in order to identify the leading predictors of three AD spectrum cohorts; viz., subjective cognitive impairment (SCI), mild cognitive impairment (MCI), and AD. The common benchmark was a cohort of cognitively unimpaired (CU) older adults.
METHODS: The four cohorts were from the cross-sectional Comprehensive Assessment of Neurodegeneration and Dementia dataset. We used random forest analysis (Python 3.7) to simultaneously test the relative importance of 83 multi-modal frailty indicators in discriminating the cohorts. We performed an explainable artificial intelligence method (Tree Shapley Additive exPlanation values) for deep interpretation of prediction effects.
RESULTS: We observed strong concurrent prediction results, with clusters varying across cohorts. The SCI model demonstrated excellent prediction accuracy (AUC = 0.89). Three leading predictors were poorer quality of life ([QoL]; memory), abnormal lymphocyte count, and abnormal neutrophil count. The MCI model demonstrated a similarly high AUC (0.88). Five leading predictors were poorer QoL (memory, leisure), male sex, abnormal lymphocyte count, and poorer self-rated eyesight. The AD model demonstrated outstanding prediction accuracy (AUC = 0.98). Ten leading predictors were poorer QoL (memory), reduced olfaction, male sex, increased dependence in activities of daily living (n = 6), and poorer visual contrast.
CONCLUSIONS: Both convergent and cohort-specific frailty factors discriminated the AD spectrum cohorts. Convergence was observed as all cohorts were marked by lower quality of life (memory), supporting recent research and clinical attention to subjective experiences of memory aging and their potentially broad ramifications. Diversity was displayed in that, of the 14 leading predictors extracted across models, 11 were selectively sensitive to one cohort. A morbidity intensity trend was indicated by an increasing number and diversity of predictors corresponding to clinical severity, especially in AD. Knowledge of differential deficit predictors across AD clinical cohorts may promote precision interventions.
Affiliation(s)
- Linzy Bohn
  - Department of Psychology, University of Alberta, P217 Biological Sciences Building, Edmonton, AB, T6G 2E9, Canada
  - Neuroscience and Mental Health Institute, University of Alberta, 2-132 Li Ka Shing Center for Health Research Innovation, Edmonton, AB, T6G 2E1, Canada
- Shannon M Drouin
  - Department of Psychology, University of Alberta, P217 Biological Sciences Building, Edmonton, AB, T6G 2E9, Canada
  - Neuroscience and Mental Health Institute, University of Alberta, 2-132 Li Ka Shing Center for Health Research Innovation, Edmonton, AB, T6G 2E1, Canada
- G Peggy McFall
  - Department of Psychology, University of Alberta, P217 Biological Sciences Building, Edmonton, AB, T6G 2E9, Canada
  - Neuroscience and Mental Health Institute, University of Alberta, 2-132 Li Ka Shing Center for Health Research Innovation, Edmonton, AB, T6G 2E1, Canada
- Darryl B Rolfson
  - Department of Medicine, Division of Geriatric Medicine, University of Alberta, 13-135 Clinical Sciences Building, Edmonton, AB, T6G 2G3, Canada
- Melissa K Andrew
  - Department of Medicine, Division of Geriatric Medicine, Dalhousie University, 5955 Veterans' Memorial Lane, Halifax, NS, B3H 2E1, Canada
- Roger A Dixon
  - Department of Psychology, University of Alberta, P217 Biological Sciences Building, Edmonton, AB, T6G 2E9, Canada
  - Neuroscience and Mental Health Institute, University of Alberta, 2-132 Li Ka Shing Center for Health Research Innovation, Edmonton, AB, T6G 2E1, Canada

4
Cerruela-García G, Cuevas-Muñoz JM, García-Pedrajas N. Graph-Based Feature Selection Approach for Molecular Activity Prediction. J Chem Inf Model 2022; 62:1618-1632. [PMID: 35315648 PMCID: PMC9006223 DOI: 10.1021/acs.jcim.1c01578] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 12/02/2022]
Abstract
In the construction of QSAR models for the prediction of molecular activity, feature selection is a common task aimed at improving the results and understanding of the problem. The selection of features allows elimination of irrelevant and redundant features, reduces the effect of dimensionality problems, and improves the generalization and interpretability of the models. In many feature selection applications, such as those based on ensembles of feature selectors, it is necessary to combine different selection processes. In this work, we evaluate the application of a new feature selection approach to the prediction of molecular activity, based on the construction of an undirected graph to combine base feature selectors. The experimental results demonstrate the efficiency of the graph-based method in terms of the classification performance, reduction, and redundancy compared to the standard voting method. The graph-based method can be extended to different feature selection algorithms and applied to other cheminformatics problems.
Affiliation(s)
- Gonzalo Cerruela-García
  - Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
- José Manuel Cuevas-Muñoz
  - Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
- Nicolás García-Pedrajas
  - Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain

5
Mater AC, Coote ML. Explainable Molecular Sets: Using Information Theory to Generate Meaningful Descriptions of Groups of Molecules. J Chem Inf Model 2021; 61:4877-4889. [PMID: 34636543 DOI: 10.1021/acs.jcim.1c00519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Abstract
Algorithmically identifying the meaningful similarities between an assortment of molecules is a critical chemical problem, and one which is only gaining in relevance as data-driven chemistry continues to progress. Effectively addressing this challenge can be achieved through a reformulation of the problem into information theory, cluster-based supervised classification, and the implementation of key concepts, particularly information entropy and mutual information. These concepts are combined with unsupervised learning atop learned chemical spaces to generate meaningful labels for arbitrary collections of molecules. An open-source and highly extensible codebase is provided to undertake these experiments, demonstrate the viability of the approach on known clusters, and glean insights into the learned representations of chemical space within message-passing neural networks, an architecture not readily permitting interpretability. This approach facilitates the interoperability between human chemical knowledge and the algorithmically derived insights, which will continue to become more prevalent in the coming years.
Affiliation(s)
- Adam C Mater
  - Research School of Chemistry, Australian National University, Canberra, Australian Capital Territory 2601, Australia
- Michelle L Coote
  - Research School of Chemistry, Australian National University, Canberra, Australian Capital Territory 2601, Australia

6
Algamal ZY, Qasim MK, Lee MH, Ali HTM. QSAR model for predicting neuraminidase inhibitors of influenza A viruses (H1N1) based on adaptive grasshopper optimization algorithm. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2020; 31:803-814. [PMID: 32938208 DOI: 10.1080/1062936x.2020.1818616] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/07/2020] [Accepted: 08/31/2020] [Indexed: 06/11/2023]
Abstract
High dimensionality is one of the major problems affecting the quality of quantitative structure-activity relationship (QSAR) modelling. Obtaining a reliable QSAR model with few descriptors is an essential procedure in chemometrics. The binary grasshopper optimization algorithm (BGOA) is a new meta-heuristic optimization algorithm, which has been used successfully to perform feature selection. In this paper, four new transfer functions were adapted to improve the exploration and exploitation capability of the BGOA in QSAR modelling of influenza A viruses (H1N1). The QSAR model with these new quadratic transfer functions was internally and externally validated based on MSEtrain, Y-randomization test, MSEtest, and the applicability domain (AD). The validation results indicate that the model is robust and not due to chance correlation. In addition, the results indicate that the descriptor selection and prediction performance of the QSAR model for the training dataset outperform those of the other S-shaped and V-shaped transfer functions. The QSAR model using the quadratic transfer function shows the lowest MSEtrain. For the test dataset, the proposed QSAR model shows a lower value of MSEtest compared with the other methods, indicating its higher predictive ability. In conclusion, the results reveal that the proposed QSAR model is an efficient approach for modelling high-dimensional QSAR models and it is useful for the estimation of IC50 values of neuraminidase inhibitors that have not been experimentally tested.
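As a rough illustration of what a transfer function does in binary metaheuristics of this kind, the sketch below maps a continuous search position to a 0/1 descriptor mask. It assumes the textbook S-shaped (logistic) and V-shaped (|tanh|) forms; the quadratic functions proposed in the paper are not given in the abstract and are not reproduced here. The names `s_shaped`, `v_shaped`, `binarize`, and the sample `position` are hypothetical, and using the V-shaped value directly as a selection probability is a simplification (V-shaped schemes often flip the current bit instead).

```python
import math
import random


def s_shaped(x: float) -> float:
    """Logistic (S-shaped) transfer function: maps any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def v_shaped(x: float) -> float:
    """|tanh| (V-shaped) transfer function: 0 at x = 0, approaching 1 as |x| grows."""
    return abs(math.tanh(x))


def binarize(position, transfer, rng):
    """Turn a continuous position vector into a 0/1 descriptor mask.

    Each component is selected (1) with probability transfer(x).
    This is an illustrative simplification, not the paper's update rule.
    """
    return [1 if rng.random() < transfer(x) else 0 for x in position]


# Hypothetical continuous position for five candidate descriptors.
rng = random.Random(0)
position = [-2.0, -0.5, 0.0, 0.5, 2.0]
mask_s = binarize(position, s_shaped, rng)
mask_v = binarize(position, v_shaped, rng)
print(mask_s, mask_v)
```

The resulting mask would then pick the descriptor columns used to fit and score a candidate QSAR model, for example by MSE on the training set.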
Affiliation(s)
- Z Y Algamal
  - Department of Statistics and Informatics, University of Mosul, Mosul, Iraq
- M K Qasim
  - Department of General Science, University of Mosul, Mosul, Iraq
- M H Lee
  - Department of Mathematical Sciences, Faculty of Science, Universiti Teknologi Malaysia, Johor, Malaysia
- H T M Ali
  - College of Computers and Information Technology, Nawroz University, Dahuk, Iraq

7
Tinkov O, Polishchuk P, Matveieva M, Grigorev V, Grigoreva L, Porozov Y. The Influence of Structural Patterns on Acute Aquatic Toxicity of Organic Compounds. Mol Inform 2020; 40:e2000209. [PMID: 33029954 DOI: 10.1002/minf.202000209] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/14/2020] [Accepted: 10/01/2020] [Indexed: 12/28/2022]
Abstract
The influence of the molecular structure of different organic compounds on acute toxicity towards Fathead minnow, Daphnia magna, and Tetrahymena pyriformis was investigated using the 2D simplex representation of molecular structure and two modelling methods: Random Forest (RF) and Gradient Boosting Machine (GBM). Suitable QSAR (Quantitative Structure-Activity Relationship) models were obtained. The study focused on the interpretation of the QSAR models. The aim of the study was to develop a set of structural fragments that simultaneously and consistently increase toxicity toward Fathead minnow, Daphnia magna, and Tetrahymena pyriformis. The interpretation provided more detail about known toxicophores and allowed new fragments to be proposed. The results obtained made it possible to rank the contributions of molecular fragments to various types of toxicity to aquatic organisms. This information can be used for molecular optimization of chemicals. According to the results of structural interpretation, the most significant common mechanisms of the toxic effect of organic compounds on Fathead minnow, Daphnia magna and Tetrahymena pyriformis are reactions of nucleophilic substitution and inhibition of oxidative phosphorylation in mitochondria. In addition, acetylcholinesterase and the voltage-gated ion channel of Fathead minnow and Daphnia magna are important targets for toxicants. The on-line version of the OCHEM expert system (https://ochem.eu) was used for a comparative QSAR investigation. The proposed QSAR models comply with the OECD principles and can be used to reliably predict acute toxicity of organic compounds towards Fathead minnow, Daphnia magna and Tetrahymena pyriformis with allowance for applicability domain estimation.
Affiliation(s)
- Oleg Tinkov
  - Department of Computer Science, Military Institute of the Ministry of Defense, 3300, Gogol str. 2"B", Tiraspol, Transdniestria, Moldova
  - Department of Pharmacology and Pharmaceutical Chemistry, Medical Faculty, Transnistrian State University, 3300, October 25 str. 128, Tiraspol, Transdniestria, Moldova
- Pavel Polishchuk
  - Institute of Molecular and Translational Medicine Faculty of Medicine and Dentistry Palacký University and University Hospital in Olomouc, Hnevotinska 5, 77900, Olomouc, Czech Republic
- Mariia Matveieva
  - Institute of Molecular and Translational Medicine Faculty of Medicine and Dentistry Palacký University and University Hospital in Olomouc, Hnevotinska 5, 77900, Olomouc, Czech Republic
- Veniamin Grigorev
  - Institute of Physiologically Active Compounds, Russian Academy of Sciences, 142432, Severniy proezd 1, Chernogolovka, Moscow region, Russia
- Ludmila Grigoreva
  - Department of Fundamental Physical and Chemical Engineering, Moscow State University, 119991, Leninskiye Gory 1/51, Moscow, Russia
- Yuri Porozov
  - World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow, Russia
  - Department of Computational Biology, Sirius University of Science and Technology, 354340, Olympic Ave 1, Sochi, Russia

8
Tarekegn A, Ricceri F, Costa G, Ferracin E, Giacobini M. Predictive Modeling for Frailty Conditions in Elderly People: Machine Learning Approaches. JMIR Med Inform 2020; 8:e16678. [PMID: 32442149 PMCID: PMC7303829 DOI: 10.2196/16678] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 10/15/2019] [Revised: 01/07/2020] [Accepted: 02/16/2020] [Indexed: 12/15/2022] Open
Abstract
Background: Frailty is one of the most critical age-related conditions in older adults. It is often recognized as a syndrome of physiological decline in late life, characterized by a marked vulnerability to adverse health outcomes. A clear operational definition of frailty, however, has not been agreed upon so far. There is a wide range of studies on the detection of frailty and its association with mortality. Several of these studies have focused on the possible risk factors associated with frailty in the elderly population, while predicting who will be at increased risk of frailty is still overlooked in clinical settings.
Objective: The objective of our study was to develop predictive models for frailty conditions in older people using different machine learning methods based on a database of clinical characteristics and socioeconomic factors.
Methods: An administrative health database containing 1,095,612 elderly people aged 65 or older with 58 input variables and 6 output variables was used. We first identified and defined six problems/outputs as surrogates of frailty. We then resolved the imbalanced nature of the data through a resampling process, and a comparative study between the different machine learning (ML) algorithms – Artificial neural network (ANN), Genetic programming (GP), Support vector machines (SVM), Random Forest (RF), Logistic regression (LR) and Decision tree (DT) – was carried out. The performance of each model was evaluated using a separate unseen dataset.
Results: Predicting mortality outcome has shown higher performance with ANN (TPR 0.81, TNR 0.76, accuracy 0.78, F1-score 0.79) and SVM (TPR 0.77, TNR 0.80, accuracy 0.79, F1-score 0.78) than predicting the other outcomes. On average, over the six problems, the DT classifier has shown the lowest accuracy, while other models (GP, LR, RF, ANN, and SVM) performed better. All models have shown lower accuracy in predicting an event of an emergency admission with red code than predicting fracture and disability. In predicting urgent hospitalization, only SVM achieved better performance (TPR 0.75, TNR 0.77, accuracy 0.73, F1-score 0.76) with the 10-fold cross validation compared with other models in all evaluation metrics.
Conclusions: We developed machine learning models for predicting frailty conditions (mortality, urgent hospitalization, disability, fracture, and emergency admission). The results show that the prediction performance of machine learning models significantly varies from problem to problem in terms of different evaluation metrics. Through further improvement, the model that performs better can be used as a base for developing decision-support tools to improve early identification and prediction of frail older adults.
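For readers less familiar with the metrics quoted above, the following minimal sketch shows how TPR, TNR, accuracy, and F1-score are conventionally derived from a binary confusion matrix. The function name `binary_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)                      # true positive rate (sensitivity, recall)
    tnr = tn / (tn + fp)                      # true negative rate (specificity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)  # harmonic mean of precision and recall
    return {"tpr": tpr, "tnr": tnr, "accuracy": accuracy, "f1": f1}


# Hypothetical counts for a balanced test set of 200 cases (not from the paper).
m = binary_metrics(tp=80, fp=25, tn=75, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

Reporting TPR and TNR alongside accuracy, as the abstract does, is helpful when the classes were originally imbalanced, since accuracy alone can hide poor performance on the minority class.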
Affiliation(s)
- Adane Tarekegn
  - Modeling and Data Science, Department of Mathematics, University of Turin, Turin, Italy
- Fulvio Ricceri
  - Department of Clinical and Biological Sciences, University of Turin, Turin, Italy
  - Unit of Epidemiology, Regional Health Service, Local Health Unit Torino 3, Turin, Italy
- Giuseppe Costa
  - Department of Clinical and Biological Sciences, University of Turin, Turin, Italy
  - Unit of Epidemiology, Regional Health Service, Local Health Unit Torino 3, Turin, Italy
- Elisa Ferracin
  - Unit of Epidemiology, Regional Health Service, Local Health Unit Torino 3, Turin, Italy
- Mario Giacobini
  - Data Analysis and Modeling Unit, Department of Veterinary Sciences, University of Turin, Turin, Italy

9
Pogodin PV, Lagunin AA, Filimonov DA, Nicklaus MC, Poroikov VV. Improving (Q)SAR predictions by examining bias in the selection of compounds for experimental testing. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2019; 30:759-773. [PMID: 31547686 DOI: 10.1080/1062936x.2019.1665580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2019] [Accepted: 09/05/2019] [Indexed: 06/10/2023]
Abstract
Existing data on structures and biological activities are limited and distributed unevenly across distinct molecular targets and chemical compounds. The question arises whether these data represent an unbiased sample of the general population of chemical-biological interactions. To answer this question, we analyzed ChEMBL data for 87,583 molecules tested against 919 protein targets using supervised and unsupervised approaches. Hierarchical clustering of the Murcko frameworks generated using the Chemistry Development Toolkit showed that the available data form a big diffuse cloud without apparent structure. In contrast, PASS-based classifiers allowed prediction of whether a compound had been tested against a particular molecular target, regardless of whether it was active or not. Thus, one may conclude that the selection of chemical compounds for testing against specific targets is biased, probably due to the influence of prior knowledge. We assessed the possibility of improving (Q)SAR predictions using this fact: PASS prediction of the interaction with a particular target for compounds predicted as tested against the target has significantly higher accuracy than for those predicted as untested (average ROC AUC values are about 0.87 and 0.75, respectively). Thus, considering the existing bias in the data of the training set may increase the performance of virtual screening.
Affiliation(s)
- P V Pogodin
  - Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- A A Lagunin
  - Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
  - Department of Bioinformatics, Medical-Biological Department, Pirogov Russian National Research Medical University, Moscow, Russia
- D A Filimonov
  - Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- M C Nicklaus
  - Computer-Aided Drug Design Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, NIH, NCI-Frederick, Frederick, MD, USA
- V V Poroikov
  - Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia