1
Ancajas CMF, Oyedele AS, Butt CM, Walker AS. Advances, opportunities, and challenges in methods for interrogating the structure activity relationships of natural products. Nat Prod Rep 2024; 41:1543-1578. [PMID: 38912779 PMCID: PMC11484176 DOI: 10.1039/d4np00009a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2024] [Indexed: 06/25/2024]
Abstract
Time span in literature: 1985-early 2024. Natural products play a key role in drug discovery, both as a direct source of drugs and as a starting point for the development of synthetic compounds. Most natural products are not suitable to be used as drugs without further modification due to insufficient activity or poor pharmacokinetic properties. Choosing what modifications to make requires an understanding of the compound's structure-activity relationships. Use of structure-activity relationships is commonplace and essential in medicinal chemistry campaigns applied to human-designed synthetic compounds. Structure-activity relationships have also been used to improve the properties of natural products, but several challenges still limit these efforts. Here, we review methods for studying the structure-activity relationships of natural products and their limitations. Specifically, we will discuss how synthesis, including total synthesis, late-stage derivatization, chemoenzymatic synthetic pathways, and engineering and genome mining of biosynthetic pathways can be used to produce natural product analogs and discuss the challenges of each of these approaches. Finally, we will discuss computational methods including machine learning methods for analyzing the relationship between biosynthetic genes and product activity, computer-aided drug design techniques, and interpretable artificial intelligence approaches towards elucidating structure-activity relationships from models trained to predict bioactivity from chemical structure. Our focus will be on these latter topics as their applications for natural products have not been extensively reviewed. We suggest that these methods are all complementary to each other, and that only collaborative efforts using a combination of these techniques will result in a full understanding of the structure-activity relationships of natural products.
Affiliation(s)
- Caitlin M Butt
- Department of Chemistry, Vanderbilt University, Nashville, TN, USA.
- Allison S Walker
- Department of Chemistry, Vanderbilt University, Nashville, TN, USA.
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
- Department of Pathology, Microbiology, and Immunology, Vanderbilt University Medical Center, Nashville, TN, USA
2
Fronk AD, Manzanares MA, Zheng P, Geier A, Anderson K, Stanton S, Zumrut H, Gera S, Munch R, Frederick V, Dhingra P, Arun G, Akerman M. Development and validation of AI/ML derived splice-switching oligonucleotides. Mol Syst Biol 2024; 20:676-701. [PMID: 38664594 PMCID: PMC11148135 DOI: 10.1038/s44320-024-00034-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Revised: 04/03/2024] [Accepted: 04/09/2024] [Indexed: 06/05/2024] Open
Abstract
Splice-switching oligonucleotides (SSOs) are antisense compounds that act directly on pre-mRNA to modulate alternative splicing (AS). This study demonstrates the value that artificial intelligence/machine learning (AI/ML) provides for the identification of functional, verifiable, and therapeutic SSOs. We trained XGBoost tree models using splicing factor (SF) pre-mRNA binding profiles and spliceosome assembly information to identify modulatory SSO binding sites on pre-mRNA. Using Shapley and out-of-bag analyses we also predicted the identity of specific SFs whose binding to pre-mRNA is blocked by SSOs. This step adds considerable transparency to AI/ML-driven drug discovery and informs biological insights useful in further validation steps. We applied this approach to previously established functional SSOs to retrospectively identify the SFs likely to regulate those events. We then took a prospective validation approach using a novel target in triple negative breast cancer (TNBC), NEDD4L exon 13 (NEDD4Le13). Targeting NEDD4Le13 with an AI/ML-designed SSO decreased the proliferative and migratory behavior of TNBC cells via downregulation of the TGFβ pathway. Overall, this study illustrates the ability of AI/ML to extract actionable insights from RNA-seq data.
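Illustrative note: the Shapley analysis mentioned in the abstract above attributes a model's output to its individual input features. The sketch below is not the authors' pipeline; it computes exact Shapley values by enumerating feature subsets for a hypothetical three-feature scoring function (`toy_score` and its weights are invented for illustration). Practical tools for tree models rely on faster algorithms rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float.

    Enumerates every subset, so it is only practical for a handful of features.
    """
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula.
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_score(present):
    """Hypothetical model output given the set of 'present' features."""
    score = 0.0
    if 0 in present:
        score += 2.0
    if 1 in present:
        score += 1.0
    if 0 in present and 2 in present:
        score += 1.0  # interaction term, split evenly between features 0 and 2
    return score


phi = shapley_values(toy_score, 3)
print(phi)
```

The attributions sum to the difference between the score with all features present and the score with none, which is the efficiency property that makes Shapley values convenient for explaining individual predictions.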
Affiliation(s)
- Paulina Zheng
- Envisagenics, Inc., Long Island City, NY, 11101, USA
- Adam Geier
- Envisagenics, Inc., Long Island City, NY, 11101, USA
- Hasan Zumrut
- Envisagenics, Inc., Long Island City, NY, 11101, USA
- Sakshi Gera
- Envisagenics, Inc., Long Island City, NY, 11101, USA
- Robin Munch
- Envisagenics, Inc., Long Island City, NY, 11101, USA
- Gayatri Arun
- Envisagenics, Inc., Long Island City, NY, 11101, USA
3
Srithanyarat T, Taoma K, Sutthibutpong T, Ruengjitchatchawalya M, Liangruksa M, Laomettachit T. Interpreting drug synergy in breast cancer with deep learning using target-protein inhibition profiles. BioData Min 2024; 17:8. [PMID: 38424554 PMCID: PMC10905801 DOI: 10.1186/s13040-024-00359-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Accepted: 02/23/2024] [Indexed: 03/02/2024] Open
Abstract
BACKGROUND Breast cancer is the most common malignancy among women worldwide. Despite advances in treating breast cancer over the past decades, drug resistance and adverse effects remain challenging. Recent therapeutic progress has shifted toward using drug combinations for better treatment efficiency. However, with a growing number of potential small-molecule cancer inhibitors, in silico strategies to predict pharmacological synergy before experimental trials are required to compensate for time and cost restrictions. Many deep learning models have been previously proposed to predict the synergistic effects of drug combinations with high performance. However, these models heavily relied on a large number of drug chemical structural fingerprints as their main features, which made model interpretation a challenge. RESULTS This study developed a deep neural network model that predicts synergy between small-molecule pairs based on their inhibitory activities against 13 selected key proteins. The synergy prediction model achieved a Pearson correlation coefficient between model predictions and experimental data of 0.63 across five breast cancer cell lines. BT-549 and MCF-7 achieved the highest correlation of 0.67 when considering individual cell lines. Despite achieving a moderate correlation compared to previous deep learning models, our model offers a distinctive advantage in terms of interpretability. Using the inhibitory activities against key protein targets as the main features allowed a straightforward interpretation of the model since the individual features had direct biological meaning. By tracing the synergistic interactions of compounds through their target proteins, we gained insights into the patterns our model recognized as indicative of synergistic effects. CONCLUSIONS The framework employed in the present study lays the groundwork for future advancements, especially in model interpretation. 
By combining deep learning techniques and target-specific models, this study shed light on potential patterns of target-protein inhibition profiles that could be exploited in breast cancer treatment.
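Illustrative note: the abstract above reports model quality as a Pearson correlation coefficient between predicted and experimental synergy. A minimal sketch of that metric follows; the function name `pearson_r` and the example synergy scores are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered cross-products and squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical measured and predicted synergy scores for five drug pairs.
measured = [12.0, -3.5, 7.1, 0.4, 20.3]
predicted = [9.8, -1.0, 4.0, 2.2, 15.5]
r = pearson_r(measured, predicted)
print(round(r, 3))
```

A value near 1 means predictions rise and fall with the measurements; the coefficient says nothing about whether the predictions are on the same scale as the data.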
Affiliation(s)
- Thanyawee Srithanyarat
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi, Bangkok, 10150, Thailand
- School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, 10140, Thailand
- Kittisak Taoma
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi, Bangkok, 10150, Thailand
- School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, 10140, Thailand
- Thana Sutthibutpong
- Department of Physics, Faculty of Science, King Mongkut's University of Technology Thonburi, Bangkok, 10140, Thailand
- Theoretical and Computational Physics Group, Center of Excellence in Theoretical and Computational Science, King Mongkut's University of Technology Thonburi, Bangkok, 10140, Thailand
- Marasri Ruengjitchatchawalya
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi, Bangkok, 10150, Thailand
- Biotechnology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi, Bangkok, 10150, Thailand
- Monrudee Liangruksa
- National Nanotechnology Center (NANOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani, 12120, Thailand.
- Teeraphan Laomettachit
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi, Bangkok, 10150, Thailand.
- Theoretical and Computational Physics Group, Center of Excellence in Theoretical and Computational Science, King Mongkut's University of Technology Thonburi, Bangkok, 10140, Thailand.
4
Jaotombo F, Adorni L, Ghattas B, Boyer L. Finding the best trade-off between performance and interpretability in predicting hospital length of stay using structured and unstructured data. PLoS One 2023; 18:e0289795. [PMID: 38032876 PMCID: PMC10688642 DOI: 10.1371/journal.pone.0289795] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2023] [Accepted: 07/25/2023] [Indexed: 12/02/2023] Open
Abstract
OBJECTIVE This study aims to develop high-performing Machine Learning and Deep Learning models in predicting hospital length of stay (LOS) while enhancing interpretability. We compare performance and interpretability of models trained only on structured tabular data with models trained only on unstructured clinical text data, and on mixed data. METHODS The structured data was used to train fourteen classical Machine Learning models including advanced ensemble trees, neural networks and k-nearest neighbors. The unstructured data was used to fine-tune a pre-trained Bio Clinical BERT Transformer Deep Learning model. The structured and unstructured data were then merged into a tabular dataset after vectorization of the clinical text and a dimensional reduction through Latent Dirichlet Allocation. The study used the free and publicly available Medical Information Mart for Intensive Care (MIMIC) III database, on the open AutoML Library AutoGluon. Performance is evaluated with respect to two types of random classifiers, used as baselines. RESULTS The best model from structured data demonstrates high performance (ROC AUC = 0.944, PRC AUC = 0.655) with limited interpretability, where the most important predictors of prolonged LOS are the level of blood urea nitrogen and of platelets. The Transformer model displays a good but lower performance (ROC AUC = 0.842, PRC AUC = 0.375) with a richer array of interpretability by providing more specific in-hospital factors including procedures, conditions, and medical history. The best model trained on mixed data satisfies both a high level of performance (ROC AUC = 0.963, PRC AUC = 0.746) and a much larger scope in interpretability including pathologies of the intestine, the colon, and the blood; infectious diseases, respiratory problems, procedures involving sedation and intubation, and vascular surgery. 
CONCLUSIONS Our results outperform most of the state-of-the-art models in LOS prediction both in terms of performance and of interpretability. Data fusion between structured and unstructured text data may significantly improve performance and interpretability.
Affiliation(s)
- Franck Jaotombo
- EMLYON Business School, Ecully, France
- Research Centre on Health Services and Quality of Life, Aix Marseille University, Marseille, France
- Luca Adorni
- Becker Friedman Institute, Chicago, IL, United States of America
- Badih Ghattas
- Aix Marseille University, CNRS, AMSE, Marseille, France
- Laurent Boyer
- Research Centre on Health Services and Quality of Life, Aix Marseille University, Marseille, France
- Department of Public Health, Assistance Publique–Hopitaux de Marseille, Marseille, France
5
Amara K, Rodríguez-Pérez R, Jiménez-Luna J. Explaining compound activity predictions with a substructure-aware loss for graph neural networks. J Cheminform 2023; 15:67. [PMID: 37491407 PMCID: PMC10369817 DOI: 10.1186/s13321-023-00733-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/04/2023] [Accepted: 07/08/2023] [Indexed: 07/27/2023] Open
Abstract
Explainable machine learning is increasingly used in drug discovery to help rationalize compound property predictions. Feature attribution techniques are popular choices to identify which molecular substructures are responsible for a predicted property change. However, established molecular feature attribution methods have so far displayed low performance for popular deep learning algorithms such as graph neural networks (GNNs), especially when compared with simpler modeling alternatives such as random forests coupled with atom masking. To mitigate this problem, a modification of the regression objective for GNNs is proposed to specifically account for common core structures between pairs of molecules. The presented approach shows higher accuracy on a recently-proposed explainability benchmark. This methodology has the potential to assist with model explainability in drug discovery pipelines, particularly in lead optimization efforts where specific chemical series are investigated.
Affiliation(s)
- Kenza Amara
- Microsoft Research AI4Science, 21 Station Rd., Cambridge, CB1 2FB UK
- Department of Computer Science, ETH Zurich, Andreasstrasse 5, 8050 Zurich, Switzerland
- José Jiménez-Luna
- Microsoft Research AI4Science, 21 Station Rd., Cambridge, CB1 2FB UK
6
Han JH, Lee S, Lee B, Baek OK, Washington SL, Herlemann A, Lonergan PE, Carroll PR, Jeong CW, Cooperberg MR. Explainable ML models for a deeper insight on treatment decision for localized prostate cancer. Sci Rep 2023; 13:11532. [PMID: 37460568 DOI: 10.1038/s41598-023-38162-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2022] [Accepted: 07/04/2023] [Indexed: 07/20/2023] Open
Abstract
Although there are several decision aids for the treatment of localized prostate cancer (PCa), there are limitations in the consistency and certainty of the information provided. We aimed to better understand the treatment decision process and develop a decision-predicting model considering oncologic, demographic, socioeconomic, and geographic factors. Men newly diagnosed with localized PCa between 2010 and 2015 from the Surveillance, Epidemiology, and End Results Prostate with Watchful Waiting database were included (n = 255,837). We designed two prediction models: (1) Active surveillance/watchful waiting (AS/WW), radical prostatectomy (RP), and radiation therapy (RT) decision prediction in the entire cohort. (2) Prediction of AS/WW decisions in the low-risk cohort. The discrimination of the model was evaluated using the multiclass area under the curve (AUC). A plausible Shapley additive explanations value was used to explain the model's prediction results. Oncological variables affected the RP decisions most, whereas RT was highly affected by geographic factors. The dependence plot depicted the feature interactions in reaching a treatment decision. The decision predicting model achieved an overall multiclass AUC of 0.77, whereas 0.74 was confirmed for the low-risk model. Using a large population-based real-world database, we unraveled the complex decision-making process and visualized nonlinear feature interactions in localized PCa.
Affiliation(s)
- Jang Hee Han
- Department of Urology, Seoul National University Hospital, Seoul, Republic of Korea
- Sungyup Lee
- Electronics and Telecommunications Research Institute (ETRI), Daejeon, Republic of Korea
- Byounghwa Lee
- Electronics and Telecommunications Research Institute (ETRI), Daejeon, Republic of Korea
- Ock-Kee Baek
- Electronics and Telecommunications Research Institute (ETRI), Daejeon, Republic of Korea
- Samuel L Washington
- Department of Urology, Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, CA, USA
- Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA
- Annika Herlemann
- Department of Urology, Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, CA, USA
- Department of Urology, Ludwig-Maximilians-University of Munich, Munich, Germany
- Peter E Lonergan
- Department of Urology, Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, CA, USA
- Department of Urology, St. James's Hospital, Dublin, Ireland
- Department of Surgery, Trinity College, Dublin, Ireland
- Peter R Carroll
- Department of Urology, Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, CA, USA
- Chang Wook Jeong
- Department of Urology, Seoul National University Hospital, Seoul, Republic of Korea.
- Department of Urology, Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, CA, USA.
- Department of Urology, Seoul National University College of Medicine, Seoul, Republic of Korea.
- Matthew R Cooperberg
- Department of Urology, Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, CA, USA
- Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA
7
Young MJ, Fefferman NH. A 'Portfolio of Model Approximations' approach to understanding invasion success with vector-borne disease. Math Biosci 2023; 358:108994. [PMID: 36914154 DOI: 10.1016/j.mbs.2023.108994] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2022] [Revised: 03/01/2023] [Accepted: 03/03/2023] [Indexed: 03/16/2023]
Abstract
The central challenge of mathematical modeling of real-world systems is to strike an appropriate balance between insightful abstraction and detailed accuracy. Models in mathematical epidemiology frequently tend to either extreme, focusing on analytically provable boundaries in simplified, mass-action approximations, or else relying on calculated numerical solutions and computational simulation experiments to capture nuance and details specific to a particular host-disease system. We propose that there is value in an approach striking a slightly different compromise in which a detailed but analytically difficult system is modeled with careful detail, but then abstraction is applied to the results of numerical solutions to that system, rather than to the biological system itself. In this 'Portfolio of Model Approximations' approach, multiple levels of approximation are used to analyze the model at different scales of complexity. While this method has the potential to introduce error in the translation from model to model, it also has the potential to produce generalizable insight for the set of all similar systems, rather than isolated, tailored results that must be started anew for each next question. In this paper, we demonstrate this process and its value with a case study from evolutionary epidemiology. We consider a modified Susceptible-Infected-Recovered model for a vector-borne pathogen affecting two annually reproducing hosts. From observing patterns in simulations of the system and exploiting basic epidemiological properties, we construct two approximations of the model at different levels of complexity that can be treated as hypotheses about the behavior of the model. We compare the predictions of the approximations to the simulated results and discuss the trade-offs between accuracy and abstraction. We discuss the implications for this particular model, and in the context of mathematical biology in general.
Affiliation(s)
- Matthew J Young
- National Institute for Mathematical and Biological Synthesis (NIMBioS), University of Tennessee, Knoxville, TN, USA; Department of Ecology and Evolutionary Biology, University of Tennessee, Knoxville, TN, USA.
- Nina H Fefferman
- National Institute for Mathematical and Biological Synthesis (NIMBioS), University of Tennessee, Knoxville, TN, USA; Department of Ecology and Evolutionary Biology, University of Tennessee, Knoxville, TN, USA
8
Fieggen J, Smith E, Arora L, Segal B. The role of machine learning in HIV risk prediction. Frontiers in Reproductive Health 2022; 4:1062387. [PMID: 36619681 PMCID: PMC9815547 DOI: 10.3389/frph.2022.1062387] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/05/2022] [Accepted: 12/05/2022] [Indexed: 12/24/2022] Open
Abstract
Despite advances in reducing HIV-related mortality, persistently high HIV incidence rates are undermining global efforts to end the epidemic by 2030. The UNAIDS Fast-track targets as well as other preventative strategies, such as pre-exposure prophylaxis, have been identified as priority areas to reduce the ongoing transmission threatening to undermine recent progress. Accurate and granular risk prediction is critical for these campaigns but is often lacking in regions where the burden is highest. Owing to their ability to capture complex interactions between data, machine learning and artificial intelligence algorithms have proven effective at predicting the risk of HIV infection in both high resource and low resource settings. However, interpretability of these algorithms presents a challenge to the understanding and adoption of these algorithms. In this perspectives article, we provide an introduction to machine learning and discuss some of the important considerations when choosing the variables used in model development and when evaluating the performance of different machine learning algorithms, as well as the role emerging tools such as Shapley Additive Explanations may play in helping understand and decompose these models in the context of HIV. Finally, we discuss some of the potential public health and clinical use cases for such decomposed risk assessment models in directing testing and preventative interventions including pre-exposure prophylaxis, as well as highlight the potential integration synergies with algorithms that predict the risk of sexually transmitted infections and tuberculosis.
Affiliation(s)
- Joshua Fieggen
- School of Public Health and Family Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa; Phithos Technologies, Johannesburg, South Africa; Correspondence: Joshua Fieggen
- Eli Smith
- Phithos Technologies, Johannesburg, South Africa
- Bradley Segal
- Phithos Technologies, Johannesburg, South Africa; Department of Biomedical Engineering, University of the Witwatersrand, Johannesburg, South Africa
9
Mahmood U, Fu Z, Ghosh S, Calhoun V, Plis S. Through the looking glass: Deep interpretable dynamic directed connectivity in resting fMRI. Neuroimage 2022; 264:119737. [PMID: 36356823 PMCID: PMC9844250 DOI: 10.1016/j.neuroimage.2022.119737] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2022] [Revised: 11/01/2022] [Accepted: 11/06/2022] [Indexed: 11/09/2022] Open
Abstract
Brain network interactions are commonly assessed via functional (network) connectivity, captured as an undirected matrix of Pearson correlation coefficients. Functional connectivity can represent static and dynamic relations, but often these are modeled using a fixed choice for the data window. Alternatively, deep learning models may flexibly learn various representations from the same data based on the model architecture and the training task. However, the representations produced by deep learning models are often difficult to interpret and require additional post hoc methods, e.g., saliency maps. In this work, we integrate the strengths of deep learning and functional connectivity methods while also mitigating their weaknesses. With interpretability in mind, we present a deep learning architecture that exposes a directed graph layer that represents what the model has learned about relevant brain connectivity. A surprising benefit of this architectural interpretability is significantly improved accuracy in discriminating controls and patients with schizophrenia, autism, and dementia, as well as age and gender prediction from functional MRI data. We also resolve the window size selection problem for dynamic directed connectivity estimation as we estimate windowing functions from the data, capturing what is needed to estimate the graph at each time-point. We demonstrate efficacy of our method in comparison with multiple existing models that focus on classification accuracy, unlike our interpretability-focused architecture. Using the same data but training different models on their own discriminative tasks we are able to estimate task-specific directed connectivity matrices for each subject. Results show that the proposed approach is also more robust to confounding factors compared to standard dynamic functional connectivity models. 
The dynamic patterns captured by our model are naturally interpretable since they highlight the intervals in the signal that are most important for the prediction. The proposed approach reveals that differences in connectivity among sensorimotor networks relative to default-mode networks are an important indicator of dementia and gender. Dysconnectivity between networks, especially sensorimotor and visual, is linked with schizophrenic patients; however, schizophrenic patients show increased intra-network default-mode connectivity compared to healthy controls. Sensorimotor connectivity was important for both dementia and schizophrenia prediction, but schizophrenia is more related to dysconnectivity between networks, whereas dementia biomarkers were mostly intra-network connectivity.
Affiliation(s)
- Usman Mahmood
- Tri-institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA, USA; Georgia State University, Department of Computer Science, Atlanta, GA, USA.
- Zening Fu
- Tri-institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA, USA; Georgia State University, Department of Computer Science, Atlanta, GA, USA
- Satrajit Ghosh
- McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA USA; Department of Otolaryngology - Head and Neck Surgery, Harvard Medical School, Boston, MA USA
- Vince Calhoun
- Tri-institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA, USA; Georgia State University, Department of Computer Science, Atlanta, GA, USA; Georgia Institute of Technology, Department of Electrical and Computer Engineering, Atlanta, GA, USA
- Sergey Plis
- Tri-institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA, USA; Georgia State University, Department of Computer Science, Atlanta, GA, USA
10
Camel (Camelus spp.) Urine Bioactivity and Metabolome: A Systematic Review of Knowledge Gaps, Advances, and Directions for Future Research. Int J Mol Sci 2022; 23:ijms232315024. [PMID: 36499353 PMCID: PMC9740287 DOI: 10.3390/ijms232315024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2022] [Revised: 11/04/2022] [Accepted: 11/14/2022] [Indexed: 12/02/2022] Open
Abstract
Up to the present day, studies on the therapeutic properties of camel (Camelus spp.) urine and the detailed characterization of its metabolomic profile are scarce and often unrelated. Information on inter individual variability is noticeably limited, and there is a wide divergence across studies regarding the methods for sample storage, pre-processing, and extract derivatization for metabolomic analysis. Additionally, medium osmolarity is not experimentally adjusted prior to bioactivity assays. In this scenario, the methodological standardization and interdisciplinary approach of such processes will strengthen the interpretation, repeatability, and replicability of the empirical results on the compounds with bioactive properties present in camel urine. Furthermore, sample enlargement would also permit the evaluation of camel urine's intra- and interindividual variability in terms of chemical composition, bioactive effects, and efficacy, while it may also permit researchers to discriminate potential animal-intrinsic and extrinsic conditioning factors. Altogether, the results would help to evaluate the role of camel urine as a natural source for the identification and extraction of specific novel bioactive substances that may deserve isolated chemical and pharmacognostic investigations through preclinical tests to determine their biological activity and the suitability of their safety profile for their potential inclusion in therapeutic formulas for improving human and animal health.
11
de Hond AAH, Kant IMJ, Honkoop PJ, Smith AD, Steyerberg EW, Sont JK. Machine learning did not beat logistic regression in time series prediction for severe asthma exacerbations. Sci Rep 2022; 12:20363. [PMID: 36437306 PMCID: PMC9701686 DOI: 10.1038/s41598-022-24909-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/11/2022] [Accepted: 11/22/2022] [Indexed: 11/28/2022] Open
Abstract
Early detection of severe asthma exacerbations through home monitoring data in patients with stable mild-to-moderate chronic asthma could help to timely adjust medication. We evaluated the potential of machine learning methods compared to a clinical rule and logistic regression to predict severe exacerbations. We used daily home monitoring data from two studies in asthma patients (development: n = 165 and validation: n = 101 patients). Two ML models (XGBoost, one class SVM) and a logistic regression model provided predictions based on peak expiratory flow and asthma symptoms. These models were compared with an asthma action plan rule. Severe exacerbations occurred in 0.2% of all daily measurements in the development (154/92,787 days) and validation cohorts (94/40,185 days). The AUC of the best performing XGBoost was 0.85 (0.82-0.87) and 0.88 (0.86-0.90) for logistic regression in the validation cohort. The XGBoost model provided overly extreme risk estimates, whereas the logistic regression underestimated predicted risks. Sensitivity and specificity were better overall for XGBoost and logistic regression compared to one class SVM and the clinical rule. We conclude that ML models did not beat logistic regression in predicting short-term severe asthma exacerbations based on home monitoring data. Clinical application remains challenging in settings with low event incidence and high false alarm rates with high sensitivity.
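Illustrative note: the abstract above compares models by AUC, sensitivity, and specificity. The sketch below shows how these quantities can be computed from labels and predicted risks; it is not the study's code, and the function names, example labels, scores, and the 0.5 threshold are assumptions chosen for illustration. The AUC is computed as the fraction of positive-negative pairs in which the positive case receives the higher score, with ties counted as half.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when predicting positive for score >= threshold."""
    tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= threshold)
    fn = sum(1 for l, s in zip(labels, scores) if l == 1 and s < threshold)
    tn = sum(1 for l, s in zip(labels, scores) if l == 0 and s < threshold)
    fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical daily labels (1 = exacerbation) and predicted risks.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
print(auc, sens, spec)
```

With events as rare as the 0.2% of days reported above, a threshold that yields high sensitivity can still produce many false alarms in absolute terms, which is the practical difficulty the abstract points to.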
Affiliation(s)
- Anne A. H. de Hond
- Department of Information Technology and Digital Innovation, Leiden University Medical Centre, Albinusdreef 2, 2300 RC Leiden, The Netherlands; Clinical AI Implementation and Research Lab, Leiden University Medical Centre, Albinusdreef 2, 2300 RC Leiden, The Netherlands; Department of Biomedical Data Sciences, Leiden University Medical Centre, Albinusdreef 2, 2300 RC Leiden, The Netherlands
| | - Ilse M. J. Kant
- Department of Information Technology and Digital Innovation, Leiden University Medical Centre, Albinusdreef 2, 2300 RC Leiden, The Netherlands; Clinical AI Implementation and Research Lab, Leiden University Medical Centre, Albinusdreef 2, 2300 RC Leiden, The Netherlands; Department of Biomedical Data Sciences, Leiden University Medical Centre, Albinusdreef 2, 2300 RC Leiden, The Netherlands
| | - Persijn J. Honkoop
- Department of Biomedical Data Sciences, Leiden University Medical Centre, Albinusdreef 2, 2300 RC Leiden, The Netherlands
| | - Andrew D. Smith
- Department of Respiratory Medicine, University Hospital Wishaw, 50 Netherton Street, Wishaw, ML2 0DP, UK
| | - Ewout W. Steyerberg
- Clinical AI Implementation and Research Lab, Leiden University Medical Centre, Albinusdreef 2, 2300 RC Leiden, The Netherlands; Department of Biomedical Data Sciences, Leiden University Medical Centre, Albinusdreef 2, 2300 RC Leiden, The Netherlands
| | - Jacob K. Sont
- Department of Biomedical Data Sciences, Leiden University Medical Centre, Albinusdreef 2, 2300 RC Leiden, The Netherlands
|
12
|
Hu H, Lai T, Farid F. Feasibility Study of Constructing a Screening Tool for Adolescent Diabetes Detection Applying Machine Learning Methods. Sensors (Basel, Switzerland) 2022; 22:6155. [PMID: 36015915 PMCID: PMC9416136 DOI: 10.3390/s22166155] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/05/2022] [Revised: 08/02/2022] [Accepted: 08/15/2022] [Indexed: 06/02/2023]
Abstract
Prediabetes and diabetes have become alarmingly prevalent among adolescents over the past decade. However, effective screening tools that can assess diabetes risk easily are still in their infancy. To help close this gap, this research proposes a machine learning-based predictive model to detect adolescent diabetes. The model applies supervised machine learning and a novel feature selection method to the National Health and Nutrition Examination Survey datasets after an exhaustive search to select reliable and accurate data. The best model achieved an area under the curve (AUC) score of 71%. This research shows that a screening tool based on supervised machine learning models can assist in the automated detection of youth diabetes. It also identifies some critical predictors for such detection using Lasso Regression, Random Forest Importance and Gradient Boosted Tree Importance feature selection methods. The features contributing most to youth diabetes detection are physical characteristics (e.g., waist, leg length, gender), dietary information (e.g., water, protein, sodium) and demographics. These predictors can be further utilised in other areas of medical research, such as electronic medical history.
Affiliation(s)
- Hansel Hu
- Atlas Advisors, Australia Pty Ltd., Sydney, NSW 2000, Australia
| | - Tin Lai
- School of Computer Science, Faculty of Engineering, The University of Sydney, Camperdown, NSW 2006, Australia
| | - Farnaz Farid
- Cybersecurity and Behavioural Science, School of Social Sciences, Western Sydney University, Penrith, NSW 2751, Australia
|
13
|
Cheng L, Qiu Y, Schmidt BJ, Wei GW. Review of applications and challenges of quantitative systems pharmacology modeling and machine learning for heart failure. J Pharmacokinet Pharmacodyn 2022; 49:39-50. [PMID: 34637069 PMCID: PMC8837528 DOI: 10.1007/s10928-021-09785-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/24/2021] [Accepted: 09/22/2021] [Indexed: 12/24/2022]
Abstract
Quantitative systems pharmacology (QSP) is an important approach in pharmaceutical research and development that facilitates in silico generation of quantitative mechanistic hypotheses and enables in silico trials. As demonstrated by applications from numerous industry groups and interest from regulatory authorities, QSP is becoming an increasingly critical component in clinical drug development. With rapidly evolving computational tools and methods, QSP modeling has achieved important progress in pharmaceutical research and development, including for heart failure (HF). However, various challenges exist in the QSP modeling and clinical characterization of HF. Machine/deep learning (ML/DL) methods have had success in a wide variety of fields and disciplines. They provide data-driven approaches in HF diagnosis and modeling, and offer a novel strategy to inform QSP model development and calibration. The combination of ML/DL and QSP modeling is an emerging direction in the understanding of HF and the clinical development of new therapies. In this work, we review the current status and achievements in QSP and ML/DL for HF, and discuss remaining challenges and future perspectives in the field.
Affiliation(s)
- Limei Cheng
- Quantitative Systems Pharmacology and Physiologically Based Pharmacokinetics, Bristol Myers Squibb, Princeton, NJ, 08536, USA.
| | - Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, MI, 48824, USA
| | - Brian J Schmidt
- Quantitative Systems Pharmacology and Physiologically Based Pharmacokinetics, Bristol Myers Squibb, Princeton, NJ, 08536, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI, 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI, 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, 48824, USA
|
14
|
Ye Z, Yang W, Yang Y, Ouyang D. Interpretable machine learning methods for in vitro pharmaceutical formulation development. Food Frontiers 2021. [DOI: 10.1002/fft2.78] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/21/2022] Open
Affiliation(s)
- Zhuyifan Ye
- State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences (ICMS), University of Macau, Macau, China
| | - Wenmian Yang
- State Key Laboratory of Internet of Things for Smart City, University of Macau, Macau, China
| | - Yilong Yang
- School of Software, Beihang University, Beijing, China
| | - Defang Ouyang
- State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences (ICMS), University of Macau, Macau, China
|
15
|
Zhang Z, Genc Y, Wang D, Ahsen ME, Fan X. Effect of AI Explanations on Human Perceptions of Patient-Facing AI-Powered Healthcare Systems. J Med Syst 2021; 45:64. [PMID: 33948743 DOI: 10.1007/s10916-021-01743-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 02/15/2021] [Accepted: 04/28/2021] [Indexed: 10/21/2022]
Abstract
Ongoing research efforts have been examining how to utilize artificial intelligence technology to help healthcare consumers make sense of their clinical data, such as diagnostic radiology reports. How to promote the acceptance of such novel technology is a heated research topic. Recent studies highlight the importance of providing local explanations about AI prediction and model performance to help users determine whether to trust AI's predictions. Despite some efforts, limited empirical research has been conducted to quantitatively measure how AI explanations impact healthcare consumers' perceptions of using patient-facing, AI-powered healthcare systems. The aim of this study is to evaluate the effects of different AI explanations on people's perceptions of AI-powered healthcare systems. In this work, we designed and deployed a large-scale experiment (N = 3,423) on Amazon Mechanical Turk (MTurk) to evaluate the effects of AI explanations on people's perceptions in the context of comprehending radiology reports. We created four groups based on two factors, the extent of explanations for the prediction (High vs. Low Transparency) and the model performance (Good vs. Weak AI Model), and randomly assigned participants to one of the four conditions. Participants were instructed to classify a radiology report as describing a normal or abnormal finding, followed by completing a post-study survey to indicate their perceptions of the AI tool. We found that revealing model performance information can promote people's trust and perceived usefulness of system outputs, while providing local explanations for the rationale of a prediction can promote understandability but not necessarily trust. We also found that when model performance is low, the more information the AI system discloses, the less people would trust the system. Lastly, whether the human agrees with the AI prediction and whether the AI prediction is correct could also influence the effect of AI explanations. We conclude this paper by discussing implications for designing AI systems that help healthcare consumers interpret diagnostic reports.
Affiliation(s)
- Zhan Zhang
- School of Computer Science and Information Systems, Pace University, New York, USA.
| | - Yegin Genc
- School of Computer Science and Information Systems, Pace University, New York, USA
| | | | - Mehmet Eren Ahsen
- College of Business, University of Illinois At Urbana-Champaign, Champaign, USA
| | - Xiangmin Fan
- The Institute of Software, Chinese Academy of Sciences, Beijing, China
|
16
|
Jiang D, Wu Z, Hsieh CY, Chen G, Liao B, Wang Z, Shen C, Cao D, Wu J, Hou T. Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J Cheminform 2021; 13:12. [PMID: 33597034 PMCID: PMC7888189 DOI: 10.1186/s13321-020-00479-8] [Citation(s) in RCA: 178] [Impact Index Per Article: 59.3] [Received: 09/21/2020] [Accepted: 11/26/2020] [Indexed: 12/31/2022] Open
Abstract
Graph neural networks (GNNs) have been considered an attractive modelling method for molecular property prediction, and numerous studies have shown that GNNs could yield more promising results than traditional descriptor-based methods. In this study, based on 11 public datasets covering various property endpoints, the predictive capacity and computational efficiency of the prediction models developed by eight machine learning (ML) algorithms, including four descriptor-based models (SVM, XGBoost, RF and DNN) and four graph-based models (GCN, GAT, MPNN and Attentive FP), were extensively tested and compared. The results demonstrate that on average the descriptor-based models outperform the graph-based models in terms of prediction accuracy and computational efficiency. SVM generally achieves the best predictions for the regression tasks. Both RF and XGBoost can achieve reliable predictions for the classification tasks, and some of the graph-based models, such as Attentive FP and GCN, can yield outstanding performance for a fraction of larger or multi-task datasets. In terms of computational cost, XGBoost and RF are the two most efficient algorithms and only need a few seconds to train a model even for a large dataset. The model interpretations by the SHAP method can effectively explore the established domain knowledge for the descriptor-based models. Finally, we explored use of these models for virtual screening (VS) towards HIV and demonstrated that different ML algorithms offer diverse VS profiles. All in all, we believe that the off-the-shelf descriptor-based models still can be directly employed to accurately predict various chemical endpoints with excellent computability and interpretability.
Affiliation(s)
- Dejun Jiang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China; State Key Lab of CAD & CG, Zhejiang University, Hangzhou, 310058, Zhejiang, China; College of Computer Science and Technology, Zhejiang University, Hangzhou, China
| | - Zhenxing Wu
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Chang-Yu Hsieh
- Tencent Quantum Laboratory Tencent, Shenzhen, 518057, Guangdong, China
| | - Guangyong Chen
- Shenzhen Institutes of Advanced Technology, Shenzhen, 518055, Guangdong, China
| | - Ben Liao
- Tencent Quantum Laboratory Tencent, Shenzhen, 518057, Guangdong, China
| | - Zhe Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Chao Shen
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Dongsheng Cao
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410004, Hunan, China.
| | - Jian Wu
- College of Computer Science and Technology, Zhejiang University, Hangzhou, China.
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China; State Key Lab of CAD & CG, Zhejiang University, Hangzhou, 310058, Zhejiang, China.
|
17
|
Mitchell EG, Tabak EG, Levine ME, Mamykina L, Albers DJ. Enabling personalized decision support with patient-generated data and attributable components. J Biomed Inform 2020; 113:103639. [PMID: 33316422 DOI: 10.1016/j.jbi.2020.103639] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2019] [Revised: 08/03/2020] [Accepted: 11/26/2020] [Indexed: 10/22/2022]
Abstract
Decision-making related to health is complex. Machine learning (ML) and patient generated data can identify patterns and insights at the individual level, where human cognition falls short, but not all ML-generated information is of equal utility for making health-related decisions. We develop and apply attributable components analysis (ACA), a method inspired by optimal transport theory, to type 2 diabetes self-monitoring data to identify patterns of association between nutrition and blood glucose control. In comparison with linear regression, we found that ACA offers a number of characteristics that make it promising for use in decision support applications. For example, ACA was able to identify non-linear relationships, was more robust to outliers, and offered broader and more expressive uncertainty estimates. In addition, our results highlight a tradeoff between model accuracy and interpretability, and we discuss implications for ML-driven decision support systems.
Affiliation(s)
- Elliot G Mitchell
- Department of Biomedical Informatics, Columbia University, New York, NY, USA.
| | - Esteban G Tabak
- Courant Institute of Mathematical Sciences, New York, NY, USA.
| | | | - Lena Mamykina
- Department of Biomedical Informatics, Columbia University, New York, NY, USA.
| | - David J Albers
- Department of Biomedical Informatics, Columbia University, New York, NY, USA; Department of Pediatrics, Division of Informatics, University of Colorado, Aurora, CO, USA.
|
18
|
Le DH. Machine learning-based approaches for disease gene prediction. Brief Funct Genomics 2020; 19:350-363. [PMID: 32567652 DOI: 10.1093/bfgp/elaa013] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 02/25/2020] [Revised: 04/30/2020] [Accepted: 05/09/2020] [Indexed: 12/20/2022] Open
Abstract
Disease gene prediction is an essential issue in biomedical research. In the early days, annotation-based approaches were proposed for this problem. With the development of high-throughput technologies, interaction data between genes/proteins have grown quickly and now cover almost the entire genome and proteome; thus, network-based methods for the problem have become prominent. In parallel, machine learning techniques, which formulate the problem as a classification, have also been proposed. Here, we first present a roadmap of the machine learning-based methods for disease gene prediction. In the beginning, the problem was usually approached as a binary classification, where the positive and negative training sample sets comprise disease genes and non-disease genes, respectively. The disease genes are those known to be associated with diseases, whereas non-disease genes were randomly selected from those not yet known to be associated with diseases. However, the latter may contain unknown disease genes. To overcome this uncertainty in defining the non-disease genes, more realistic approaches have been proposed for the problem, such as unary and semi-supervised classification. Recently, more advanced methods, including ensemble learning, matrix factorization and deep learning, have been proposed for the problem. Second, 12 representative machine learning-based methods for disease gene prediction were examined and compared in terms of prediction performance and running time. Finally, their advantages, disadvantages, interpretability and trust were also analyzed and discussed.
Affiliation(s)
- Duc-Hau Le
- Department of Computational Biomedicine, Vingroup Big Data Institute, Hanoi, Vietnam
|
19
|
Diaz-Quijano FA, Calixto FM, da Silva JMN. How feasible is it to abandon statistical significance? A reflection based on a short survey. BMC Med Res Methodol 2020; 20:140. [PMID: 32493293 PMCID: PMC7271502 DOI: 10.1186/s12874-020-01030-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2019] [Accepted: 05/24/2020] [Indexed: 11/30/2022] Open
Abstract
Background: There is a growing trend in using the term “statistically significant” in the scientific literature. However, harsh criticism of this concept motivated the recommendation to withdraw its use from scientific publications. We aimed to validate the support for, and the feasibility of adherence to, this recommendation among researchers who had declared themselves in favor of retiring statistical significance. Methods: We surveyed signatories of a published article that defended this recommendation, to validate their opinion and ask them how likely they are to retire the concept of statistical significance. Results: We obtained 151 responses, which confirmed the support for the mentioned publication in aspects such as the adequate interpretation of the p-value, the degree of agreement, and the motivations to sign it. However, there was a wide distribution of answers about how likely they are to use the concept of “statistical significance” in future publications. About 42% declared being neutral or that they would likely use it again. We described arguments raised by several signatories and discussed aspects to be considered in the interpretation of research results. Conclusions: The responses obtained from a proportion of signatories validated their declared position against the use of statistical significance. However, even in this group, the full application of this recommendation does not seem feasible. The arguments related to the inappropriate use of statistical tests should promote more education among researchers and users of scientific evidence.
Affiliation(s)
- Fredi Alexander Diaz-Quijano
- Department of Epidemiology, School of Public Health, University of São Paulo, Av. Dr. Arnaldo, 715, Cerqueira César, CEP 01246-904, São Paulo, SP, Brazil; Laboratório de Inferência Causal em Epidemiologia da Universidade de São Paulo (LINCE-USP), São Paulo, Brazil.
| | - Fernando Morelli Calixto
- Laboratório de Inferência Causal em Epidemiologia da Universidade de São Paulo (LINCE-USP), São Paulo, Brazil; Public Health, School of Public Health, University of São Paulo, São Paulo, Brazil
| | - José Mário Nunes da Silva
- Laboratório de Inferência Causal em Epidemiologia da Universidade de São Paulo (LINCE-USP), São Paulo, Brazil; Epidemiology, School of Public Health, University of São Paulo, São Paulo, Brazil
|
20
|
Chuang KV, Gunsalus LM, Keiser MJ. Learning Molecular Representations for Medicinal Chemistry. J Med Chem 2020; 63:8705-8722. [PMID: 32366098 DOI: 10.1021/acs.jmedchem.0c00385] [Citation(s) in RCA: 78] [Impact Index Per Article: 19.5] [Indexed: 12/21/2022]
Abstract
The accurate modeling and prediction of small molecule properties and bioactivities depend on the critical choice of molecular representation. Decades of informatics-driven research have relied on expert-designed molecular descriptors to establish quantitative structure-activity and structure-property relationships for drug discovery. Now, advances in deep learning make it possible to efficiently and compactly learn molecular representations directly from data. In this review, we discuss how active research in molecular deep learning can address limitations of current descriptors and fingerprints while creating new opportunities in cheminformatics and virtual screening. We provide a concise overview of the role of representations in cheminformatics, key concepts in deep learning, and argue that learning representations provides a way forward to improve the predictive modeling of small molecule bioactivities and properties.
Affiliation(s)
- Kangway V Chuang
- Department of Pharmaceutical Chemistry, Department of Bioengineering & Therapeutic Sciences, Institute for Neurodegenerative Diseases, Kavli Institute for Fundamental Neuroscience, Bakar Computational Health Sciences Institute, University of California, San Francisco, 675 Nelson Rising Lane, San Francisco, California 94143, United States
| | - Laura M Gunsalus
- Department of Pharmaceutical Chemistry, Department of Bioengineering & Therapeutic Sciences, Institute for Neurodegenerative Diseases, Kavli Institute for Fundamental Neuroscience, Bakar Computational Health Sciences Institute, University of California, San Francisco, 675 Nelson Rising Lane, San Francisco, California 94143, United States
| | - Michael J Keiser
- Department of Pharmaceutical Chemistry, Department of Bioengineering & Therapeutic Sciences, Institute for Neurodegenerative Diseases, Kavli Institute for Fundamental Neuroscience, Bakar Computational Health Sciences Institute, University of California, San Francisco, 675 Nelson Rising Lane, San Francisco, California 94143, United States
|
21
|
Interpretation of machine learning models using shapley values: application to compound potency and multi-target activity predictions. J Comput Aided Mol Des 2020; 34:1013-1026. [PMID: 32361862 PMCID: PMC7449951 DOI: 10.1007/s10822-020-00314-0] [Citation(s) in RCA: 155] [Impact Index Per Article: 38.8] [Received: 03/06/2020] [Accepted: 04/24/2020] [Indexed: 02/07/2023]
Abstract
Difficulties in interpreting machine learning (ML) models and their predictions limit the practical applicability of and confidence in ML in pharmaceutical research. There is a need for model-agnostic approaches that aid in the interpretation of ML models regardless of their complexity and that are also applicable to deep neural network (DNN) architectures and model ensembles. To these ends, the SHapley Additive exPlanations (SHAP) methodology has recently been introduced. The SHAP approach enables the identification and prioritization of features that determine compound classification and activity prediction using any ML model. Herein, we further extend the evaluation of the SHAP methodology by investigating a variant for exact calculation of Shapley values for decision tree methods and systematically compare this variant in compound activity and potency value predictions with the model-independent SHAP method. Moreover, new applications of the SHAP analysis approach are presented, including interpretation of DNN models for the generation of multi-target activity profiles and ensemble regression models for potency prediction.
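To make the notion of an exact Shapley value concrete, the following minimal sketch computes Shapley values by brute-force enumeration of feature subsets. It is not the tree-specific variant the abstract investigates, nor the SHAP library; the toy scoring function and the feature names ("ring", "amide", "halogen") are hypothetical and stand in for a model evaluated on subsets of structural features.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values by enumerating all subsets (exponential cost)."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Weight of a coalition of size k that does not contain f.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal contribution of f when added to the coalition.
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_model(present):
    """Hypothetical activity score for a set of present features."""
    score = 0.0
    if "ring" in present:
        score += 2.0
    if "amide" in present:
        score += 1.0
    if "ring" in present and "halogen" in present:
        score += 1.0  # interaction term shared between the two features
    return score


phi = shapley_values(["ring", "amide", "halogen"], toy_model)
for name, contribution in phi.items():
    print(name, round(contribution, 3))
```

The contributions sum to the difference between the score with all features present and the score with none, which is the additivity property that SHAP-style explanations rely on. Enumeration scales exponentially with the number of features, which is why approximate or model-specific algorithms are used in practice.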
|
22
|
Orrù G, Monaro M, Conversano C, Gemignani A, Sartori G. Machine Learning in Psychometrics and Psychological Research. Front Psychol 2020; 10:2970. [PMID: 31998200 PMCID: PMC6966768 DOI: 10.3389/fpsyg.2019.02970] [Citation(s) in RCA: 54] [Impact Index Per Article: 13.5] [Received: 08/20/2019] [Accepted: 12/16/2019] [Indexed: 11/28/2022] Open
Abstract
Recent controversies about the level of replicability of behavioral research analyzed using statistical inference have sparked interest in developing more efficient techniques for analyzing the results of psychological experiments. Here we claim that complementing the analytical workflow of psychological experiments with Machine Learning-based analysis will both maximize accuracy and minimize replicability issues. As compared to statistical inference, ML analysis of experimental data is model agnostic and primarily focused on prediction rather than inference. We also highlight some potential pitfalls resulting from the adoption of Machine Learning-based experiment analysis. If not properly used, it can lead to over-optimistic accuracy estimates similar to those observed using statistical inference. Remedies to such pitfalls are also presented, such as building models based on cross validation and using ensemble models. ML models are typically regarded as black boxes, and we discuss strategies aimed at rendering their predictions more transparent.
Affiliation(s)
- Graziella Orrù
- Department of Surgical, Medical, Molecular and Critical Area Pathology, University of Pisa, Pisa, Italy
| | - Merylin Monaro
- Department of General Psychology, University of Padua, Padua, Italy
| | - Ciro Conversano
- Department of Surgical, Medical, Molecular and Critical Area Pathology, University of Pisa, Pisa, Italy
| | - Angelo Gemignani
- Department of Surgical, Medical, Molecular and Critical Area Pathology, University of Pisa, Pisa, Italy
| | - Giuseppe Sartori
- Department of General Psychology, University of Padua, Padua, Italy
|
23
|
Rodríguez-Pérez R, Bajorath J. Interpretation of Compound Activity Predictions from Complex Machine Learning Models Using Local Approximations and Shapley Values. J Med Chem 2019; 63:8761-8777. [PMID: 31512867 DOI: 10.1021/acs.jmedchem.9b01101] [Citation(s) in RCA: 147] [Impact Index Per Article: 29.4] [Indexed: 01/27/2023]
Abstract
In qualitative or quantitative studies of structure-activity relationships (SARs), machine learning (ML) models are trained to recognize structural patterns that differentiate between active and inactive compounds. Understanding model decisions is challenging but of critical importance to guide compound design. Moreover, the interpretation of ML results provides an additional level of model validation based on expert knowledge. A number of complex ML approaches, especially deep learning (DL) architectures, have distinctive black-box character. Herein, a locally interpretable explanatory method termed Shapley additive explanations (SHAP) is introduced for rationalizing activity predictions of any ML algorithm, regardless of its complexity. Models resulting from random forest (RF), nonlinear support vector machine (SVM), and deep neural network (DNN) learning are interpreted, and structural patterns determining the predicted probability of activity are identified and mapped onto test compounds. The results indicate that SHAP has high potential for rationalizing predictions of complex ML models.
Affiliation(s)
- Raquel Rodríguez-Pérez
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19c, D-53115 Bonn, Germany; Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Straße 65, 88397 Biberach an der Riß, Germany
| | - Jürgen Bajorath
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19c, D-53115 Bonn, Germany
|
24
|
Polishchuk P. Interpretation of Quantitative Structure–Activity Relationship Models: Past, Present, and Future. J Chem Inf Model 2017; 57:2618-2639. [DOI: 10.1021/acs.jcim.7b00274] [Citation(s) in RCA: 120] [Impact Index Per Article: 17.1] [Indexed: 02/01/2023]
Affiliation(s)
- Pavel Polishchuk
- Institute of Molecular and Translational Medicine, Faculty of Medicine and Dentistry, Palacký University and University Hospital in Olomouc, Hněvotínská 1333/5, 779 00 Olomouc, Czech Republic
|
25
|
Cumming JG, Davis AM, Muresan S, Haeberlein M, Chen H. Chemical predictive modelling to improve compound quality. Nat Rev Drug Discov 2014; 12:948-62. [PMID: 24287782 DOI: 10.1038/nrd4128] [Citation(s) in RCA: 167] [Impact Index Per Article: 16.7] [Indexed: 11/09/2022]
Abstract
The 'quality' of small-molecule drug candidates, encompassing aspects including their potency, selectivity and ADMET (absorption, distribution, metabolism, excretion and toxicity) characteristics, is a key factor influencing the chances of success in clinical trials. Importantly, such characteristics are under the control of chemists during the identification and optimization of lead compounds. Here, we discuss the application of computational methods, particularly quantitative structure-activity relationships (QSARs), in guiding the selection of higher-quality drug candidates, as well as cultural factors that may have affected their use and impact.
Affiliation(s)
- John G Cumming
- Chemistry Innovation Centre, Discovery Sciences, AstraZeneca R&D, Alderley Park, Macclesfield SK10 4TG, UK
|
26
|
Prediction of Drug Exposure in the Brain from the Chemical Structure. Drug Delivery to the Brain 2014. [DOI: 10.1007/978-1-4614-9105-7_11] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/29/2022]
|
27
|
Norinder U, Boström H. Representing descriptors derived from multiple conformations as uncertain features for machine learning. J Mol Model 2013; 19:2679-85. [DOI: 10.1007/s00894-013-1806-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2012] [Accepted: 02/11/2013] [Indexed: 10/27/2022]
|
28
|
QSAR investigation of NaV1.7 active compounds using the SVM/Signature approach and the Bioclipse Modeling platform. Bioorg Med Chem Lett 2013. [DOI: 10.1016/j.bmcl.2012.10.102] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Indexed: 11/20/2022]
|
29
|
Norinder U, Boström H. Introducing Uncertainty in Predictive Modeling—Friend or Foe? J Chem Inf Model 2012; 52:2815-22. [DOI: 10.1021/ci3003446] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Affiliation(s)
- Ulf Norinder
- AstraZeneca R&D Södertälje, Sweden
- Department of Pharmacy, Uppsala University, Sweden
| | - Henrik Boström
- Department of Computer and Systems Sciences, Stockholm University, Sweden
|
30
|
|