1
Gómez-Sacristán P, Simeon S, Tran-Nguyen VK, Patil S, Ballester PJ. Inactive-enriched machine-learning models exploiting patent data improve structure-based virtual screening for PDL1 dimerizers. J Adv Res 2025; 67:185-196. [PMID: 38280715 DOI: 10.1016/j.jare.2024.01.024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Revised: 12/01/2023] [Accepted: 01/21/2024] [Indexed: 01/29/2024] Open
Abstract
INTRODUCTION: Small-molecule Programmed Cell Death Protein 1/Programmed Death-Ligand 1 (PD1/PDL1) inhibition via PDL1 dimerization has the potential to lead to inexpensive drugs with better cancer patient outcomes and milder side effects. However, this therapeutic approach has proven challenging, with only one PDL1 dimerizer reaching early clinical trials so far. There is hence a need for fast and accurate methods to develop alternative PDL1 dimerizers. OBJECTIVES: We aim to show that structure-based virtual screening (SBVS) based on PDL1-specific machine-learning (ML) scoring functions (SFs) is a powerful drug design tool for detecting PD1/PDL1 inhibitors via PDL1 dimerization. METHODS: By incorporating the latest MLSF advances, we generated and evaluated PDL1-specific MLSFs (classifiers and inactive-enriched regressors) on two demanding test sets. RESULTS: 60 PDL1-specific MLSFs (30 classifiers and 30 regressors) were generated. Our large-scale analysis provides highly predictive PDL1-specific MLSFs that benefitted from training with large volumes of docked inactives and enabling inactive-enriched regression. CONCLUSION: PDL1-specific MLSFs strongly outperformed generic SFs of various types on this target and are released here without restrictions.
Affiliation(s)
- Saw Simeon
- Centre de Recherche en Cancérologie de Marseille, Marseille 13009, France
- Sachin Patil
- NanoBio Laboratory, Widener University, Chester, PA 19013, USA
- Pedro J Ballester
- Department of Bioengineering, Imperial College London, London SW7 2AZ, UK.
2
Lee HJ, Emani PS, Gerstein MB. Improved Prediction of Ligand-Protein Binding Affinities by Meta-modeling. J Chem Inf Model 2024; 64:8684-8704. [PMID: 39576762 PMCID: PMC11632770 DOI: 10.1021/acs.jcim.4c01116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2024] [Revised: 10/21/2024] [Accepted: 10/28/2024] [Indexed: 11/24/2024]
Abstract
The accurate screening of candidate drug ligands against target proteins through computational approaches is of prime interest to drug development efforts. Such virtual screening depends in part on methods to predict the binding affinity between ligands and proteins. Many computational models for binding affinity prediction have been developed, but with varying results across targets. Given that ensembling or meta-modeling approaches have shown great promise in reducing model-specific biases, we develop a framework to integrate published force-field-based empirical docking and sequence-based deep learning models. In building this framework, we evaluate many combinations of individual base models, training databases, and several meta-modeling approaches. We show that many of our meta-models significantly improve affinity predictions over base models. Our best meta-models achieve comparable performance to state-of-the-art deep learning tools exclusively based on 3D structures while allowing for improved database scalability and flexibility through the explicit inclusion of features such as physicochemical properties or molecular descriptors. We further demonstrate improved generalization capability by our models using a large-scale benchmark of affinity prediction as well as a virtual screening application benchmark. Overall, we demonstrate that diverse modeling approaches can be ensembled together to gain meaningful improvement in binding affinity prediction.
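As a rough illustration of the ensembling idea described in this abstract, the sketch below stacks two base predictors with least-squares weights. The prediction values, the names `fit_meta_weights` and `mse`, and the choice of a linear meta-model are assumptions made for illustration only; the paper evaluates several meta-modeling approaches that this toy does not reproduce.

```python
# Toy stacking sketch: combine two base-model predictions with least-squares weights.
# All numbers and names are illustrative assumptions, not data or code from the paper.

def fit_meta_weights(base_a, base_b, targets):
    """Solve the 2x2 normal equations for weights (w_a, w_b) that minimize
    sum((w_a * a + w_b * b - y) ** 2) over the training examples."""
    saa = sum(a * a for a in base_a)
    sbb = sum(b * b for b in base_b)
    sab = sum(a * b for a, b in zip(base_a, base_b))
    say = sum(a * y for a, y in zip(base_a, targets))
    sby = sum(b * y for b, y in zip(base_b, targets))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("base predictions are collinear; weights are not unique")
    w_a = (say * sbb - sby * sab) / det
    w_b = (sby * saa - say * sab) / det
    return w_a, w_b


def mse(predictions, targets):
    """Mean squared error between two equally long sequences."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)


# Hypothetical affinities (e.g. pK units) and two hypothetical base-model outputs.
targets = [5.0, 6.2, 7.1, 8.3, 4.4]
docking_like = [5.6, 6.0, 7.9, 8.0, 5.1]
sequence_like = [4.5, 6.6, 6.8, 8.9, 4.0]

w_a, w_b = fit_meta_weights(docking_like, sequence_like, targets)
meta = [w_a * a + w_b * b for a, b in zip(docking_like, sequence_like)]
mse_meta = mse(meta, targets)
```

On the training data the fitted combination cannot be worse than either base model, because using one model alone is a special case of the weights. A fair comparison would fit the weights on one split and evaluate on held-out data.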
Affiliation(s)
- Ho-Joon Lee
- Department of Genetics and Yale Center for Genome Analysis, Yale University, New Haven, Connecticut 06510, United States
- Prashant S. Emani
- Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, Connecticut 06520, United States
- Mark B. Gerstein
- Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, Connecticut 06520, United States
- Program in Computational Biology & Bioinformatics, Department of Computer Science, Department of Statistics & Data Science, and Department of Biomedical Informatics & Data Science, Yale University, New Haven, Connecticut 06520, United States
3
Suwayyid F, Wei GW. Persistent Mayer Dirac. J Phys Complex 2024; 5:045005. [PMID: 39429974 PMCID: PMC11488505 DOI: 10.1088/2632-072x/ad83a5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2024] [Revised: 09/11/2024] [Accepted: 10/04/2024] [Indexed: 10/22/2024]
Abstract
Topological data analysis (TDA) has made significant progress in developing a new class of fundamental operators known as the Dirac operator, particularly in topological signals and molecular representations. However, the current approaches being used are based on the classical case of chain complexes. The present study establishes Mayer Dirac operators based on N-chain complexes. These operators interconnect an alternating sequence of Mayer Laplacian operators, providing a generalization of the classical result $D^2 = L$. Furthermore, the research presents an explicit formulation of the Laplacian for N-chain complexes induced by vertex sequences on a finite set. Weighted versions of Mayer Laplacian and Dirac operators are introduced to expand the scope and improve applicability, showcasing their effectiveness in capturing physical attributes in various practical scenarios. The study presents a generalized version for factorizing Laplacian operators as an operator's product and its 'adjoint'. Additionally, the proposed persistent Mayer Dirac operators and extensions are applied to biological and chemical domains, particularly in the analysis of molecular structures. The study also highlights the potential applications of persistent Mayer Dirac operators in data science.
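The classical result that the square of the Dirac operator equals the Laplacian can be checked on a very small example. The sketch below covers only the ordinary chain-complex case on a path graph with three vertices and two edges; it does not implement the Mayer (N-chain) generalization, and the helper names are chosen here for illustration.

```python
# Classical case only: Dirac operator of the path graph 0 - 1 - 2.
# Helper names are illustrative; the Mayer N-chain construction is not shown.

def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def dirac_from_boundary(b1):
    """Build D = [[0, B1], [B1^T, 0]] acting on (vertex, edge) signals."""
    n_v, n_e = len(b1), len(b1[0])
    b1_t = transpose(b1)
    top = [[0] * n_v + list(b1[i]) for i in range(n_v)]
    bottom = [list(b1_t[j]) + [0] * n_e for j in range(n_e)]
    return top + bottom


def block_diag(a, b):
    na, nb = len(a[0]), len(b[0])
    return [list(row) + [0] * nb for row in a] + [[0] * na + list(row) for row in b]


# Boundary matrix B1: rows are vertices, columns are oriented edges (0->1, 1->2).
B1 = [[-1, 0],
      [1, -1],
      [0, 1]]

D = dirac_from_boundary(B1)
L0 = matmul(B1, transpose(B1))   # graph (0-th order) Laplacian
L1 = matmul(transpose(B1), B1)   # 1st-order Laplacian (no triangles in this complex)
D_squared = matmul(D, D)
```

Here `D_squared` coincides with the block-diagonal matrix built from `L0` and `L1`, which is the sense in which the Dirac operator acts as a square root of the Laplacian.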
Affiliation(s)
- Faisal Suwayyid
- Department of Mathematics, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, United States of America
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, United States of America
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, United States of America
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, United States of America
4
Zhang Y, Shen C, Xia K. Multi-Cover Persistence (MCP)-based machine learning for polymer property prediction. Brief Bioinform 2024; 25:bbae465. [PMID: 39323091 PMCID: PMC11424509 DOI: 10.1093/bib/bbae465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Revised: 08/07/2024] [Accepted: 09/05/2024] [Indexed: 09/27/2024] Open
Abstract
Accurate and efficient prediction of polymer properties is crucial for polymer design. Recently, data-driven artificial intelligence (AI) models have demonstrated great promise in polymer property analysis. Despite this progress, a pivotal challenge for all AI-driven models remains the effective representation of molecules. Here we introduce Multi-Cover Persistence (MCP)-based molecular representation and featurization for the first time. Our MCP-based polymer descriptors are combined with machine learning models, in particular Gradient Boosting Tree (GBT) models, for polymer property prediction. Unlike previous molecular representations, polymer molecular structure and interactions are represented as MCP, which utilizes Delaunay slices at different dimensions and Rhomboid tiling to characterize the complicated geometric and topological information within the data. Statistical features from the generated persistent barcodes are used as polymer descriptors and combined with the GBT model. Our model has been extensively validated on polymer benchmark datasets. We find that our models can outperform traditional fingerprint-based models and have accuracy similar to that of geometric deep learning models. In particular, our model tends to be more effective on large-sized monomer structures, demonstrating the great potential of MCP in characterizing more complicated polymer data. This work underscores the potential of MCP in polymer informatics, presenting a novel perspective on molecular representation and its application in polymer science.
Affiliation(s)
- Yipeng Zhang
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
- Cong Shen
- Department of Mathematics, National University of Singapore, Singapore 119076, Singapore
- Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
5
Ai H, Wu D, Zhou H, Xu J, Gu Q. dMXP: A De Novo Small-Molecule 3D Structure Predictor with Graph Attention Networks. J Chem Inf Model 2024; 64:3744-3755. [PMID: 38662925 DOI: 10.1021/acs.jcim.4c00391] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/14/2024]
Abstract
Generating the three-dimensional (3D) structure of small molecules is crucial in both structure- and ligand-based drug design. Structure-based drug design needs bioactive conformations of compounds for lead identification and optimization. Ligand-based drug design techniques, such as 3D shape similarity search, 3D pharmacophore modeling, and 3D-QSAR, all require high-quality small-molecule ligand conformations to obtain reliable results. Although predicting a small molecule's bioactive conformer requires information from the receptor, a crystal structure of the molecule is a proper approximation to its bioactive conformer in a specific receptor because the binding pose of a small molecule in its receptor's binding pocket should be energetically close to the crystal structure. This study presents a de novo small-molecule structure predictor (dMXP) with graph attention networks based on crystal data derived from the Cambridge Structural Database (CSD) combined with molecular electrostatic information calculated by density-functional theory (DFT). Two featurization strategies (topological and atomic partial charge features) were employed to explore the relation between these features and the 3D crystal structure of a small molecule. These features were then assembled to construct the holistic 3D crystal structure of a molecule. Molecular graphs were encoded using a graph attention mechanism to deal with inconsistencies in how local substructures contribute to the entire molecular structure. For approximately 80% of the dMXP-predicted structures, the root-mean-square deviation (RMSD) from the native binding pose within the receptor is less than 2.0 Å.
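The RMSD criterion quoted at the end of this abstract can be made concrete with a short sketch. It assumes both conformers list the same atoms in the same order and already share a coordinate frame, so no superposition step is performed; the coordinates, function names, and the strict use of the 2.0 Å cutoff are illustrative assumptions, not details taken from the paper.

```python
import math

# Minimal RMSD sketch. Assumes identical atom ordering and a shared coordinate
# frame (no alignment). Coordinates below are made up for illustration.

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def fraction_within(rmsd_values, cutoff=2.0):
    """Fraction of structures whose RMSD is strictly below the cutoff (in Å)."""
    return sum(1 for value in rmsd_values if value < cutoff) / len(rmsd_values)


predicted = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
# Hypothetical reference pose: every atom displaced by 1 Å along x.
reference = [(x + 1.0, y, z) for x, y, z in predicted]

pose_rmsd = rmsd(predicted, reference)
success_rate = fraction_within([0.5, 1.9, 2.0, 3.1])
```

In practice, a pose comparison would usually also account for molecular symmetry and, depending on the protocol, for alignment of the receptor frames.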
Affiliation(s)
- Haopeng Ai
- Research Center for Drug Discovery, School of Pharmaceutical Sciences, Sun Yat-Sen University, 132 East Circle at University City, Guangzhou 510006, China
- Deyin Wu
- Research Center for Drug Discovery, School of Pharmaceutical Sciences, Sun Yat-Sen University, 132 East Circle at University City, Guangzhou 510006, China
- Huihao Zhou
- Research Center for Drug Discovery, School of Pharmaceutical Sciences, Sun Yat-Sen University, 132 East Circle at University City, Guangzhou 510006, China
- Jun Xu
- Research Center for Drug Discovery, School of Pharmaceutical Sciences, Sun Yat-Sen University, 132 East Circle at University City, Guangzhou 510006, China
- Qiong Gu
- Research Center for Drug Discovery, School of Pharmaceutical Sciences, Sun Yat-Sen University, 132 East Circle at University City, Guangzhou 510006, China
6
Wee J, Chen J, Xia K, Wei GW. Integration of persistent Laplacian and pre-trained transformer for protein solubility changes upon mutation. Comput Biol Med 2024; 169:107918. [PMID: 38194782 PMCID: PMC10922365 DOI: 10.1016/j.compbiomed.2024.107918] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2023] [Revised: 12/21/2023] [Accepted: 01/01/2024] [Indexed: 01/11/2024]
Abstract
Protein mutations can significantly influence protein solubility, which results in altered protein functions and leads to various diseases. Despite tremendous effort, machine learning prediction of protein solubility changes upon mutation remains a challenging task, as indicated by the poor scores of normalized Correct Prediction Ratio (CPR). Part of the challenge stems from the fact that there are no three-dimensional (3D) structures for the wild-type and mutant proteins. This work integrates persistent Laplacians and a pre-trained Transformer for the task. The Transformer, pretrained with hundreds of millions of protein sequences, embeds wild-type and mutant sequences, while persistent Laplacians track the topological invariant change and homotopic shape evolution induced by mutations in 3D protein structures, which are rendered from AlphaFold2. The resulting machine learning model was trained on an extensive data set labeled with three solubility types. Our model outperforms all existing predictive methods and improves on the state of the art by up to 15%.
Affiliation(s)
- JunJie Wee
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Jiahui Chen
- Department of Mathematical Sciences, University of Arkansas, Fayetteville, AR 72701, USA
- Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore.
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA; Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA; Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA.
7
Rana MM, Nguyen DD. Geometric Graph Learning to Predict Changes in Binding Free Energy and Protein Thermodynamic Stability upon Mutation. J Phys Chem Lett 2023; 14:10870-10879. [PMID: 38032742 DOI: 10.1021/acs.jpclett.3c02679] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2023]
Abstract
Accurate prediction of binding free energy changes upon mutations is vital for optimizing drugs, designing proteins, understanding genetic diseases, and cost-effective virtual screening. While machine learning methods show promise in this domain, achieving accuracy and generalization across diverse data sets remains a challenge. This study introduces Geometric Graph Learning for Protein-Protein Interactions (GGL-PPI), a novel approach integrating geometric graph representation and machine learning to forecast mutation-induced binding free energy changes. GGL-PPI leverages atom-level graph coloring and multiscale weighted colored geometric subgraphs to capture structural features of biomolecules, demonstrating superior performance on three standard data sets, namely, AB-Bind, SKEMPI 1.0, and SKEMPI 2.0 data sets. The model's efficacy extends to predicting protein thermodynamic stability in a blind test set, providing unbiased predictions for both direct and reverse mutations and showcasing notable generalization. GGL-PPI's precision in predicting changes in binding free energy and stability due to mutations enhances our comprehension of protein complexes, offering valuable insights for drug design endeavors.
Affiliation(s)
- Md Masud Rana
- Department of Mathematics, University of Kentucky, Lexington, Kentucky 40506, United States
- Duc Duy Nguyen
- Department of Mathematics, University of Kentucky, Lexington, Kentucky 40506, United States
8
Xianjin X, Rui D, Xiaoqin Z. Template-guided method for protein-ligand complex structure prediction: Application to CASP15 protein-ligand studies. Proteins 2023; 91:1829-1836. [PMID: 37283068 PMCID: PMC10700664 DOI: 10.1002/prot.26535] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/16/2023] [Revised: 05/12/2023] [Accepted: 05/18/2023] [Indexed: 06/08/2023]
Abstract
Critical Assessment of Structure Prediction 15 (CASP15) added a new category of ligand prediction to promote the development of protein/RNA-ligand modeling methods, which have become important tools in modern drug discovery. A total of 22 targets were released, including 18 protein-ligand targets and 4 RNA-ligand targets. We applied our recently developed template-guided method to the protein-ligand complex structure predictions. The method combined a physicochemical molecular docking method with a bioinformatics-based ligand similarity method. The Protein Data Bank was scanned for template structures containing the target protein, homologous proteins, or proteins sharing a similar fold with the target protein. The binding modes of the co-bound ligands in the template structures were used to guide the complex structure prediction for the target. The CASP assessment results show that the overall performance of our method was ranked second when the top predicted model was considered for each target. Here, we analyze our predictions in detail and discuss the challenges, including protein conformational changes, large and flexible ligands, and multiple diverse ligands in a binding pocket.
Affiliation(s)
- Zou Xiaoqin
- Dalton Cardiovascular Research Center, Department of Physics, Department of Biochemistry, Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri 65211, United States
9
Wee J, Chen J, Xia K, Wei GW. Integration of persistent Laplacian and pre-trained transformer for protein solubility changes upon mutation. arXiv 2023; arXiv:2310.18760v2. [PMID: 37961732 PMCID: PMC10635294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
Protein mutations can significantly influence protein solubility, which results in altered protein functions and leads to various diseases. Despite tremendous effort, machine learning prediction of protein solubility changes upon mutation remains a challenging task, as indicated by the poor scores of normalized Correct Prediction Ratio (CPR). Part of the challenge stems from the fact that there are no three-dimensional (3D) structures for the wild-type and mutant proteins. This work integrates persistent Laplacians and a pre-trained Transformer for the task. The Transformer, pretrained with hundreds of millions of protein sequences, embeds wild-type and mutant sequences, while persistent Laplacians track the topological invariant change and homotopic shape evolution induced by mutations in 3D protein structures, which are rendered from AlphaFold2. The resulting machine learning model was trained on an extensive data set labeled with three solubility types. Our model outperforms all existing predictive methods and improves on the state of the art by up to 15%.
Affiliation(s)
- JunJie Wee
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Jiahui Chen
- Department of Mathematical Sciences, University of Arkansas, Fayetteville, AR 72701, USA
- Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA
10
Tran-Nguyen VK, Junaid M, Simeon S, Ballester PJ. A practical guide to machine-learning scoring for structure-based virtual screening. Nat Protoc 2023; 18:3460-3511. [PMID: 37845361 DOI: 10.1038/s41596-023-00885-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 02/08/2022] [Accepted: 07/03/2023] [Indexed: 10/18/2023]
Abstract
Structure-based virtual screening (SBVS) via docking has been used to discover active molecules for a range of therapeutic targets. Chemical and protein data sets that contain integrated bioactivity information have increased both in number and in size. Artificial intelligence and, more concretely, its machine-learning (ML) branch, including deep learning, have effectively exploited these data sets to build scoring functions (SFs) for SBVS against targets with an atomic-resolution 3D model (e.g., generated by X-ray crystallography or predicted by AlphaFold2). Often outperforming their generic and non-ML counterparts, target-specific ML-based SFs represent the state of the art for SBVS. Here, we present a comprehensive and user-friendly protocol to build and rigorously evaluate these new SFs for SBVS. This protocol is organized into four sections: (i) using a public benchmark of a given target to evaluate an existing generic SF; (ii) preparing experimental data for a target from public repositories; (iii) partitioning data into a training set and a test set for subsequent target-specific ML modeling; and (iv) generating and evaluating target-specific ML SFs by using the prepared training-test partitions. All necessary code and input/output data related to three example targets (acetylcholinesterase, HMG-CoA reductase, and peroxisome proliferator-activated receptor-α) are available at https://github.com/vktrannguyen/MLSF-protocol, can be run by using a single computer within 1 week, and make use of easily accessible software/programs (e.g., Smina, CNN-Score, RF-Score-VS and DeepCoy) and web resources. Our aim is to provide practical guidance on how to augment training data to enhance SBVS performance, how to identify the most suitable supervised learning algorithm for a data set, and how to build an SF with the highest likelihood of discovering target-active molecules within a given compound library.
Affiliation(s)
- Muhammad Junaid
- Centre de Recherche en Cancérologie de Marseille, Marseille, France
- Saw Simeon
- Centre de Recherche en Cancérologie de Marseille, Marseille, France
11
Rana MM, Nguyen DD. Geometric graph learning with extended atom-types features for protein-ligand binding affinity prediction. Comput Biol Med 2023; 164:107250. [PMID: 37515872 DOI: 10.1016/j.compbiomed.2023.107250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2023] [Revised: 06/12/2023] [Accepted: 07/07/2023] [Indexed: 07/31/2023]
Abstract
Understanding and accurately predicting protein-ligand binding affinity are essential in the drug design and discovery process. At present, machine learning-based methodologies are gaining popularity as a means of predicting binding affinity due to their efficiency and accuracy, as well as the increasing availability of structural and binding affinity data for protein-ligand complexes. In biomolecular studies, graph theory has been widely applied since graphs can be used to model molecules or molecular complexes in a natural manner. In the present work, we upgrade the graph-based learners for the study of protein-ligand interactions by integrating extensive atom types such as SYBYL and extended connectivity interactive features (ECIF) into multiscale weighted colored graphs (MWCG). By pairing with the gradient boosting decision tree (GBDT) machine learning algorithm, our approach results in two different methods, namely sybylGGL-Score and ecifGGL-Score. Both of our models are extensively validated in their scoring power using three commonly used benchmark datasets in the drug design area, namely CASF-2007, CASF-2013, and CASF-2016. The performance of our best model sybylGGL-Score is compared with other state-of-the-art models in the binding affinity prediction for each benchmark. While both of our models achieve state-of-the-art results, the SYBYL atom-type model sybylGGL-Score outperforms other methods by a wide margin in all benchmarks. Finally, the best-performing SYBYL atom-type model is evaluated on two test sets that are independent of CASF benchmarks.
Affiliation(s)
- Md Masud Rana
- Department of Mathematics, University of Kentucky, Lexington, 40506, KY, USA.
- Duc Duy Nguyen
- Department of Mathematics, University of Kentucky, Lexington, 40506, KY, USA.
12
Abstract
Drug development is a wide scientific field that faces many challenges these days. Among them are extremely high development costs, long development times, and a small number of new drugs that are approved each year. New and innovative technologies are needed to solve these problems that make the drug discovery process of small molecules more time and cost efficient, and that allow previously undruggable receptor classes to be targeted, such as protein-protein interactions. Structure-based virtual screenings (SBVSs) have become a leading contender in this context. In this review, we give an introduction to the foundations of SBVSs and survey their progress in the past few years with a focus on ultralarge virtual screenings (ULVSs). We outline key principles of SBVSs, recent success stories, new screening techniques, available deep learning-based docking methods, and promising future research directions. ULVSs have an enormous potential for the development of new small-molecule drugs and are already starting to transform early-stage drug discovery.
Affiliation(s)
- Christoph Gorgulla
- Harvard Medical School and Physics Department, Harvard University, Boston, Massachusetts, USA
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
- Current affiliation: Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, Tennessee, USA
13
Wee J, Bianconi G, Xia K. Persistent Dirac for molecular representation. Sci Rep 2023; 13:11183. [PMID: 37433870 DOI: 10.1038/s41598-023-37853-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/20/2023] [Accepted: 06/28/2023] [Indexed: 07/13/2023] Open
Abstract
Molecular representations are of fundamental importance for modeling and analysing molecular systems. Molecular representation models have contributed greatly to the successes in drug design and materials discovery. In this paper, we present a computational framework for molecular representation that is mathematically rigorous and based on the persistent Dirac operator. The properties of the discrete weighted and unweighted Dirac matrix are systematically discussed, and the biological meanings of both homological and non-homological eigenvectors are studied. We also evaluate the impact of various weighting schemes on the weighted Dirac matrix. Additionally, a set of physical persistent attributes that characterize the persistence and variation of spectrum properties of Dirac matrices during a filtration process is proposed as molecular fingerprints. Our persistent attributes are used to classify molecular configurations of nine different types of organic-inorganic halide perovskites. The combination of persistent attributes with a gradient boosting tree model has achieved great success in molecular solvation free energy prediction. The results show that our model is effective in characterizing the molecular structures, demonstrating the power of our molecular representation and featurization approach.
Affiliation(s)
- Junjie Wee
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, 637371, Singapore.
- Ginestra Bianconi
- School of Mathematical Sciences, Queen Mary University of London, London, E1 4NS, UK
- The Alan Turing Institute, London, NW1 2DB, UK
- Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, 637371, Singapore
14
Majumder R, Ghosh S, Singh MK, Das A, Roy Chowdhury S, Saha A, Saha RP. Revisiting the COVID-19 Pandemic: An Insight into Long-Term Post-COVID Complications and Repurposing of Drugs. COVID 2023; 3:494-519. [DOI: 10.3390/covid3040037] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/03/2025]
Abstract
SARS-CoV-2 is a highly contagious and dangerous coronavirus that has been spreading around the world since late December 2019. Severe COVID-19 has been observed to induce severe damage to the alveoli, and the slow loss of lung function led to the deaths of many patients. Scientists from all over the world are now saying that SARS-CoV-2 can spread through the air, which is a very frightening prospect for humans. Many scientists thought that this virus would evolve during the first wave of the pandemic and that the second wave of reinfection with the coronavirus would also be very dangerous. In late 2020 and early 2021, researchers found different genetic versions of the SARS-CoV-2 virus in many places around the world. Patients with different types of viruses had different symptoms. It is now evident from numerous case studies that many COVID-19 patients who are released from nursing homes or hospitals are more prone to developing multi-organ dysfunction than the general population. Understanding the pathophysiology of COVID-19 and its impact on various organ systems is crucial for developing effective treatment strategies and managing long-term health consequences. The case studies highlighted in this review provide valuable insights into the ongoing health concerns of individuals affected by COVID-19.
Affiliation(s)
- Rajib Majumder
- Department of Biotechnology, School of Life Science & Biotechnology, Adamas University, Kolkata 700126, India
- Sanmitra Ghosh
- Department of Biological Sciences, School of Life Science & Biotechnology, Adamas University, Kolkata 700126, India
- Manoj K. Singh
- Department of Biotechnology, School of Life Science & Biotechnology, Adamas University, Kolkata 700126, India
- Arpita Das
- Department of Biotechnology, School of Life Science & Biotechnology, Adamas University, Kolkata 700126, India
- Swagata Roy Chowdhury
- Department of Biotechnology, School of Life Science & Biotechnology, Adamas University, Kolkata 700126, India
- Abinit Saha
- Department of Biotechnology, School of Life Science & Biotechnology, Adamas University, Kolkata 700126, India
- Rudra P. Saha
- Department of Biotechnology, School of Life Science & Biotechnology, Adamas University, Kolkata 700126, India
15
Tran-Nguyen VK, Ballester PJ. Beware of Simple Methods for Structure-Based Virtual Screening: The Critical Importance of Broader Comparisons. J Chem Inf Model 2023; 63:1401-1405. [PMID: 36848585 PMCID: PMC10015451 DOI: 10.1021/acs.jcim.3c00218] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 03/01/2023]
Abstract
We discuss how data unbiasing and simple methods such as protein-ligand Interaction FingerPrint (IFP) can overestimate virtual screening performance. We also show that IFP is strongly outperformed by target-specific machine-learning scoring functions, which were not considered in a recent report concluding that simple methods were better than machine-learning scoring functions at virtual screening.
Affiliation(s)
- Pedro J Ballester
- Department of Bioengineering, Imperial College London, London SW7 2AZ, U.K.
16
Zhu H, Yang J, Huang N. Assessment of the Generalization Abilities of Machine-Learning Scoring Functions for Structure-Based Virtual Screening. J Chem Inf Model 2022; 62:5485-5502. [PMID: 36268980 DOI: 10.1021/acs.jcim.2c01149] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/13/2022]
Abstract
In structure-based virtual screening (SBVS), it is critical that scoring functions capture protein-ligand atomic interactions. By focusing on the local domains of ligand binding pockets, a standardized pocket Pfam-based clustering (Pfam-cluster) approach was developed to assess the cross-target generalization ability of machine-learning scoring functions (MLSFs). Subsequently, 12 typical MLSFs were evaluated using random cross-validation (Random-CV), protein sequence similarity-based cross-validation (Seq-CV), and pocket Pfam-based cross-validation (Pfam-CV) methods. Surprisingly, all of the tested models showed decreased performances from Random-CV to Seq-CV to Pfam-CV experiments, not showing satisfactory generalization capacity. Our interpretable analysis suggested that the predictions on novel targets by MLSFs were dependent on buried solvent-accessible surface area (SASA)-related features of complex structures, with greater predicted binding affinities on complexes owning larger protein-ligand interfaces. By combining buried SASA-related features with target-specific patterns that were only shared among structurally similar compounds in the same cluster, the random forest (RF)-Score attained a good performance in the Random-CV test. Based on these findings, we strongly advise assessing the generalization ability of MLSFs with the Pfam-cluster approach and being cautious with the features learned by MLSFs.
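The difference between Random-CV and cluster-based splits such as Seq-CV or Pfam-CV can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `group_k_fold`, the toy data, and the cluster labels are hypothetical, and cluster assignment (by sequence similarity or pocket Pfam family) is assumed to have been done beforehand. The sketch only shows that whole clusters are held out together, so no test complex shares a cluster with a training complex.

```python
from collections import defaultdict


def group_k_fold(groups, n_folds):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides.

    groups  -- one cluster label per sample (hypothetical labels, e.g. Pfam clusters)
    n_folds -- number of folds; must not exceed the number of distinct groups
    """
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)
    if n_folds > len(members):
        raise ValueError("n_folds cannot exceed the number of distinct groups")

    # Assign the largest groups first, always to the currently smallest fold,
    # so that fold sizes stay roughly balanced.
    folds = [[] for _ in range(n_folds)]
    for label in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[label])

    for f in range(n_folds):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(n_folds) if g != f for i in folds[g])
        yield train_idx, test_idx


# Toy example: six complexes in three hypothetical pocket clusters.
clusters = ["kinase", "kinase", "protease", "protease", "protease", "GPCR"]
for train_idx, test_idx in group_k_fold(clusters, n_folds=3):
    train_clusters = {clusters[i] for i in train_idx}
    test_clusters = {clusters[i] for i in test_idx}
    # A held-out cluster never appears on the training side.
    assert train_clusters.isdisjoint(test_clusters)
```

A random split would instead shuffle individual complexes, which lets close analogues of a test complex leak into training; that leakage is one plausible reason for the drop in performance the abstract reports when moving from Random-CV to the cluster-based schemes.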
Affiliation(s)
- Hui Zhu
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing 102206, China; National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Jincai Yang
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Niu Huang
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing 102206, China; National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
17
Limbu S, Dakshanamurthy S. A New Hybrid Neural Network Deep Learning Method for Protein-Ligand Binding Affinity Prediction and De Novo Drug Design. Int J Mol Sci 2022; 23:13912. [PMID: 36430386 PMCID: PMC9693376 DOI: 10.3390/ijms232213912] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/14/2022] [Revised: 10/25/2022] [Accepted: 11/09/2022] [Indexed: 11/16/2022] Open
Abstract
Accurately predicting ligand binding affinity in a virtual screening campaign is still challenging. Here, we developed hybrid neural network (HNN) deep learning methods, HNN-denovo and HNN-affinity, by combining the 3D-CNN (convolutional neural network) and the FFNN (feed-forward neural network) in a hybrid neural network framework. HNN-denovo uses protein pocket structure and protein-ligand interactions as input features. HNN-affinity uses protein sequences and ligand features as input features. The HNN method combines the CNN and FCNN architectures for the protein structure or protein sequence and ligand descriptors. To train the model, the HNN methods used thousands of known protein-ligand binding affinity data points retrieved from the PDBBind database. We also developed Random Forest (RF), Gradient Boosting (GB), and Decision Tree with AdaBoost (DT) models, and a consensus model. We compared the HNN results with models developed based on the RF, GB, and DT methods. We also independently compared the HNN method results with the literature-reported deep learning protein-ligand binding affinity predictions made by DLSCORE, KDEEP, and DeepAtom. The predictive performance of the HNN methods (max Pearson's R achieved was 0.86) was consistently better than or comparable to the DLSCORE, KDEEP, and DeepAtom deep learning methods for both balanced and unbalanced data sets. HNN-affinity can be applied to protein-ligand affinity prediction even in the absence of protein structure information, as it considers the protein sequence as a standalone feature in addition to the ligand descriptors. The HNN-denovo method can be efficiently applied in structure-based de novo drug design campaigns. The HNN-affinity method can be used in conjunction with deep learning molecular docking protocols as a standalone method. Further, it can be combined with conventional molecular docking methods as a multistep approach to rapidly screen billions of diverse compounds. The HNN methods are highly scalable on cloud ML platforms.
18
Liu J, Xia KL, Wu J, Yau SST, Wei GW. Biomolecular Topology: Modelling and Analysis. ACTA MATHEMATICA SINICA, ENGLISH SERIES 2022; 38:1901-1938. [PMID: 36407804 PMCID: PMC9640850 DOI: 10.1007/s10114-022-2326-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/27/2022] [Accepted: 07/12/2022] [Indexed: 05/25/2023]
Abstract
With the great advancement of experimental tools, a tremendous amount of biomolecular data has been generated and accumulated in various databases. The high dimensionality, structural complexity, nonlinearity, and entanglements of biomolecular data, ranging from DNA knots, RNA secondary structures, protein folding configurations, chromosomes, DNA origami, and molecular assembly to others at the macromolecular level, pose a severe challenge in their analysis and characterization. In the past few decades, mathematical concepts, models, algorithms, and tools from algebraic topology, combinatorial topology, computational topology, and topological data analysis have demonstrated great power and begun to play an essential role in tackling the biomolecular data challenge. In this work, we introduce biomolecular topology, which concerns the topological problems and models originating from biomolecular systems. More specifically, biomolecular topology encompasses topological structures, properties, and relations that emerge from biomolecular structures, dynamics, interactions, and functions. We discuss the various types of biomolecular topology arising from structures (of proteins, DNAs, and RNAs), protein folding, and protein assembly. A brief discussion of databanks (and databases), theoretical models, and computational algorithms is presented. Further, we systematically review related topological models, including graphs, simplicial complexes, persistent homology, persistent Laplacians, de Rham-Hodge theory, Yau-Hausdorff distance, and topology-based machine learning models.
Affiliation(s)
- Jian Liu
- School of Mathematical Sciences, Hebei Normal University, Shijiazhuang, 050024 P. R. China
- Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, 101408 P. R. China
- Ke-Lin Xia
- School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, 639798 Singapore
- Jie Wu
- Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, 101408 P. R. China
- Department of Mathematical Sciences, Tsinghua University, Beijing, 100084 P. R. China
- Stephen Shing-Toung Yau
- Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, 101408 P. R. China
- Department of Mathematical Sciences, Tsinghua University, Beijing, 100084 P. R. China
- Guo-Wei Wei
- Department of Mathematics & Department of Biochemistry and Molecular Biology & Department of Electrical and Computer Engineering, Michigan State University, Wells Hall 619 Red Cedar Road, East Lansing, MI 48824-1027 USA
19
Deep learning methods for molecular representation and property prediction. Drug Discov Today 2022; 27:103373. [PMID: 36167282 DOI: 10.1016/j.drudis.2022.103373] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 07/07/2022] [Revised: 08/22/2022] [Accepted: 09/21/2022] [Indexed: 01/11/2023]
Abstract
With advances in artificial intelligence (AI) methods, computer-aided drug design (CADD) has developed rapidly in recent years. Effective molecular representation and accurate property prediction are crucial tasks in CADD workflows. In this review, we summarize contemporary applications of deep learning (DL) methods for molecular representation and property prediction. We categorize DL methods according to the format of molecular data (1D, 2D, and 3D). In addition, we discuss some common DL models, such as ensemble learning and transfer learning, and analyze the interpretability methods for these models. We also highlight the challenges and opportunities of DL methods for molecular representation and property prediction.
20
Korlepara DB, Vasavi CS, Jeurkar S, Pal PK, Roy S, Mehta S, Sharma S, Kumar V, Muvva C, Sridharan B, Garg A, Modee R, Bhati AP, Nayar D, Priyakumar UD. PLAS-5k: Dataset of Protein-Ligand Affinities from Molecular Dynamics for Machine Learning Applications. Sci Data 2022; 9:548. [PMID: 36071074 PMCID: PMC9451116 DOI: 10.1038/s41597-022-01631-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/22/2022] [Accepted: 08/15/2022] [Indexed: 11/08/2022] Open
Abstract
Computational methods and, more recently, modern machine learning methods have played a key role in structure-based drug design. Though several benchmarking datasets are available for machine learning applications in virtual screening, accurate prediction of binding affinity for a protein-ligand complex remains a major challenge. New datasets that allow for the development of models that predict binding affinities better than state-of-the-art scoring functions are important. For the first time, we have developed a dataset, PLAS-5k, comprising 5000 protein-ligand complexes chosen from the PDB database. The dataset consists of binding affinities along with energy components such as electrostatic, van der Waals, and polar and non-polar solvation energies, calculated from molecular dynamics simulations using the MMPBSA (Molecular Mechanics Poisson-Boltzmann Surface Area) method. The calculated binding affinities outperformed docking scores and showed a good correlation with the available experimental values. The availability of energy components may enable optimization of desired components during machine learning-based drug design. Further, the OnionNet model has been retrained on the PLAS-5k dataset and is provided as a baseline for the prediction of binding affinities.
Affiliation(s)
- Divya B Korlepara
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- C S Vasavi
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Shruti Jeurkar
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Pradeep Kumar Pal
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Subhajit Roy
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- UM-DAE-Centre For Excellence In Basic Sciences, University of Mumbai, Vidyanagari, Mumbai, India
- Sarvesh Mehta
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Shubham Sharma
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Vishal Kumar
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Charuvaka Muvva
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Bhuvanesh Sridharan
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Akshit Garg
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Rohit Modee
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Agastya P Bhati
- Centre for Computational Science, Department of Chemistry, University College London, London, WC1H 0AJ, United Kingdom
- Divya Nayar
- Department of Materials Science and Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi, 110016, India
- U Deva Priyakumar
- Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
21
Avery C, Patterson J, Grear T, Frater T, Jacobs DJ. Protein Function Analysis through Machine Learning. Biomolecules 2022; 12:1246. [PMID: 36139085 PMCID: PMC9496392 DOI: 10.3390/biom12091246] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 07/16/2022] [Revised: 08/22/2022] [Accepted: 08/31/2022] [Indexed: 11/16/2022] Open
Abstract
Machine learning (ML) has been an important arsenal in computational biology used to elucidate protein function for decades. With the recent burgeoning of novel ML methods and applications, new ML approaches have been incorporated into many areas of computational biology dealing with protein function. We examine how ML has been integrated into a wide range of computational models to improve prediction accuracy and gain a better understanding of protein function. The applications discussed are protein structure prediction, protein engineering using sequence modifications to achieve stability and druggability characteristics, molecular docking in terms of protein-ligand binding, including allosteric effects, protein-protein interactions and protein-centric drug discovery. To quantify the mechanisms underlying protein function, a holistic approach that takes structure, flexibility, stability, and dynamics into account is required, as these aspects become inseparable through their interdependence. Another key component of protein function is conformational dynamics, which often manifest as protein kinetics. Computational methods that use ML to generate representative conformational ensembles and quantify differences in conformational ensembles important for function are included in this review. Future opportunities are highlighted for each of these topics.
Affiliation(s)
- Chris Avery
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- John Patterson
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Tyler Grear
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Department of Physics and Optical Science, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Theodore Frater
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Donald J. Jacobs
- Department of Physics and Optical Science, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
22
Liu X, Feng H, Wu J, Xia K. Hom-Complex-Based Machine Learning (HCML) for the Prediction of Protein-Protein Binding Affinity Changes upon Mutation. J Chem Inf Model 2022; 62:3961-3969. [PMID: 36040839 DOI: 10.1021/acs.jcim.2c00580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Abstract
Protein-protein interactions (PPIs) are involved in almost all biological processes in the cell. Understanding protein-protein interactions holds the key for the understanding of biological functions, diseases and the development of therapeutics. Recently, artificial intelligence (AI) models have demonstrated great power in PPIs. However, a key issue for all AI-based PPI models is efficient molecular representations and featurization. Here, we propose Hom-complex-based PPI representation, and Hom-complex-based machine learning models for the prediction of PPI binding affinity changes upon mutation, for the first time. In our model, various Hom complexes Hom(G1, G) can be generated for the graph representation G of protein-protein complex by using different graphs G1, which reveal G1-related inner connections within the graph representation G of protein-protein complex. Further, for a specific graph G1, a series of nested Hom complexes are generated to give a multiscale characterization of the PPIs. Its persistent homology and persistent Euler characteristic are used as molecular descriptors and further combined with the machine learning model, in particular, gradient boosting tree (GBT). We systematically test our model on the two most-commonly used data sets, that is, SKEMPI and AB-Bind. It has been found that our model outperforms all the existing models as far as we know, which demonstrates the great potential of our model for the analysis of PPIs. Our model can be used for the analysis and design of efficient antibodies for SARS-CoV-2.
Affiliation(s)
- Xiang Liu
- Chern Institute of Mathematics and LPMC, Nankai University, Tianjin 300071, China; Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371
- Huitao Feng
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371; Mathematical Science Research Center, Chongqing University of Technology, Chongqing 400054, China
- Jie Wu
- Yanqi Lake Beijing Institute of Mathematical Sciences and Applications (BIMSA), Beijing 101408, China
- Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371
23
Cong Y, Endo T. Multi-Omics and Artificial Intelligence-Guided Drug Repositioning: Prospects, Challenges, and Lessons Learned from COVID-19. OMICS : A JOURNAL OF INTEGRATIVE BIOLOGY 2022; 26:361-371. [PMID: 35759424 DOI: 10.1089/omi.2022.0068] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 06/15/2023]
Abstract
Drug repurposing is of interest for therapeutics innovation in many human diseases including coronavirus disease 2019 (COVID-19). Methodological innovations in drug repurposing are currently being empowered by convergence of omics systems science and digital transformation of life sciences. This expert review article offers a systematic summary of the application of artificial intelligence (AI), particularly machine learning (ML), to drug repurposing and classifies and introduces the common clustering, dimensionality reduction, and other methods. We highlight, as a present-day high-profile example, the involvement of AI/ML-based drug discovery in the COVID-19 pandemic and discuss the collection and sharing of diverse data types, and the possible futures awaiting drug repurposing in an era of AI/ML and digital technologies. The article provides new insights on convergence of multi-omics and AI-based drug repurposing. We conclude with reflections on the various pathways to expedite innovation in drug development through drug repurposing for prompt responses to the current COVID-19 pandemic and future ecological crises in the 21st century.
Affiliation(s)
- Yi Cong
- Laboratory of Information Biology, Information Science and Technology, Hokkaido University, Sapporo, Japan
- Toshinori Endo
- Laboratory of Information Biology, Information Science and Technology, Hokkaido University, Sapporo, Japan
24
Grbić J, Wu J, Xia K, Wei GW. ASPECTS OF TOPOLOGICAL APPROACHES FOR DATA SCIENCE. FOUNDATIONS OF DATA SCIENCE (SPRINGFIELD, MO.) 2022; 4:165-216. [PMID: 36712596 PMCID: PMC9881677 DOI: 10.3934/fods.2022002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
We establish a new theory which unifies various aspects of topological approaches for data science, by being applicable both to point cloud data and to graph data, including networks beyond pairwise interactions. We generalize simplicial complexes and hypergraphs to super-hypergraphs and establish super-hypergraph homology as an extension of simplicial homology. Driven by applications, we also introduce super-persistent homology.
Affiliation(s)
- Jelena Grbić
- School of Mathematical Sciences, University of Southampton, Southampton, UK
- Jie Wu
- School of Mathematical Sciences, Center of Topology and Geometry based Technology, Hebei Normal University, Yuhua District, Shijiazhuang, Hebei, 050024 China
- Yanqi Lake Beijing Institute of Mathematical Sciences, Yanqihu, Huairou District, Beijing, 101408 China
- Kelin Xia
- School of Physical and Mathematical Sciences, Nanyang Technological University, SPMS-MAS-05-18, 21 Nanyang Link, 1, Singapore 63737
- Guo-Wei Wei
- Department of Mathematics, Department of Computer Science and Engineering, Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
25
Yang C, Zhang Y. Delta Machine Learning to Improve Scoring-Ranking-Screening Performances of Protein-Ligand Scoring Functions. J Chem Inf Model 2022; 62:2696-2712. [PMID: 35579568 DOI: 10.1021/acs.jcim.2c00485] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Indexed: 01/01/2023]
Abstract
Protein-ligand scoring functions are widely used in structure-based drug design for fast evaluation of protein-ligand interactions, and it is of strong interest to develop scoring functions with machine-learning approaches. In this work, by expanding the training set, developing physically meaningful features, employing our recently developed linear empirical scoring function Lin_F9 (Yang, C. J. Chem. Inf. Model. 2021, 61, 4630-4644) as the baseline, and applying extreme gradient boosting (XGBoost) with Δ-machine learning, we have further improved the robustness and applicability of machine-learning scoring functions. Besides the top performances for scoring-ranking-screening power tests of the CASF-2016 benchmark, the new scoring function ΔLin_F9XGB also achieves superior scoring and ranking performances in different structure types that mimic real docking applications. The scoring powers of ΔLin_F9XGB for locally optimized poses, flexible redocked poses, and ensemble docked poses of the CASF-2016 core set achieve Pearson's correlation coefficient (R) values of 0.853, 0.839, and 0.813, respectively. In addition, the large-scale docking-based virtual screening test on the LIT-PCBA data set demonstrates the reliability and robustness of ΔLin_F9XGB in virtual screening application. The ΔLin_F9XGB scoring function and its code are freely available on the web at (https://yzhang.hpc.nyu.edu/Delta_LinF9_XGB).
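The Δ-machine-learning idea in this abstract, a baseline scoring function plus a learned correction, can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the ΔLin_F9XGB implementation: a one-feature least-squares fit stands in for XGBoost, the baseline scores stand in for Lin_F9 output, and all function names and numbers are hypothetical.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_delta_model(baseline_scores, features, measured):
    """Fit the correction model to the residuals (measured - baseline)."""
    residuals = [y - b for y, b in zip(measured, baseline_scores)]
    return fit_linear(features, residuals)


def predict_delta(baseline_score, feature, model):
    """Final prediction = baseline score + learned correction."""
    slope, intercept = model
    return baseline_score + slope * feature + intercept


# Hypothetical toy data: baseline scores, one descriptor per complex,
# and measured affinities in pKd-like units.
baseline = [5.0, 6.0, 7.0, 8.0]
descriptor = [0.0, 1.0, 2.0, 3.0]
measured = [5.5, 7.0, 8.5, 10.0]

model = fit_delta_model(baseline, descriptor, measured)
corrected = [predict_delta(b, x, model) for b, x in zip(baseline, descriptor)]
```

The design point is that the learner only has to model what the baseline gets wrong, so a prediction falls back to the physically motivated baseline whenever the learned correction is small.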
Affiliation(s)
- Chao Yang
- Department of Chemistry, New York University, New York, New York 10003, United States
- Yingkai Zhang
- Department of Chemistry, New York University, New York, New York 10003, United States; NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
26
Xie W, Wang F, Li Y, Lai L, Pei J. Advances and Challenges in De Novo Drug Design Using Three-Dimensional Deep Generative Models. J Chem Inf Model 2022; 62:2269-2279. [PMID: 35544331 DOI: 10.1021/acs.jcim.2c00042] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 11/30/2022]
Abstract
A persistent goal for de novo drug design is to generate novel chemical compounds with desirable properties in a labor-, time-, and cost-efficient manner. Deep generative models provide alternative routes to this goal. Numerous model architectures and optimization strategies have been explored in recent years, most of which have been developed to generate two-dimensional molecular structures. Some generative models aiming at three-dimensional (3D) molecule generation have also been proposed, gaining attention for their unique advantages and potential to directly design drug-like molecules in a target-conditioning manner. This review highlights current developments in 3D molecular generative models combined with deep learning and discusses future directions for de novo drug design.
Affiliation(s)
- Weixin Xie
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- Fanhao Wang
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- Yibo Li
- Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- Luhua Lai
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China; Peking-Tsinghua Center for Life Science at BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China
- Jianfeng Pei
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
27
Liu X, Feng H, Wu J, Xia K. Dowker complex based machine learning (DCML) models for protein-ligand binding affinity prediction. PLoS Comput Biol 2022; 18:e1009943. [PMID: 35385478 PMCID: PMC8985993 DOI: 10.1371/journal.pcbi.1009943] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 09/14/2021] [Accepted: 02/21/2022] [Indexed: 11/19/2022] Open
Abstract
With the great advancements in experimental data, computational power and learning algorithms, artificial intelligence (AI) based drug design has begun to gain momentum recently. AI-based drug design has great promise to revolutionize pharmaceutical industries by significantly reducing the time and cost in drug discovery processes. However, a major issue remains for all AI-based learning models: efficient molecular representation. Here we propose Dowker complex (DC) based molecular interaction representations and Riemann Zeta function based molecular featurization, for the first time. Molecular interactions between proteins and ligands (or others) are modeled as Dowker complexes. A multiscale representation is generated by using a filtration process, during which a series of DCs are generated at different scales. Combinatorial (Hodge) Laplacian matrices are constructed from these DCs, and the Riemann zeta functions from their spectral information can be used as molecular descriptors. To validate our models, we consider protein-ligand binding affinity prediction. Our DC-based machine learning (DCML) models, in particular, DC-based gradient boosting tree (DC-GBT), are tested on the three most commonly used datasets, i.e., PDBbind-2007, PDBbind-2013 and PDBbind-2016, and extensively compared with other existing state-of-the-art models. It has been found that our DC-based descriptors can achieve state-of-the-art results and have better performance than all machine learning models with traditional molecular descriptors. Our Dowker complex based machine learning models can be used in other tasks in AI-based drug design and molecular data analysis.
With the ever-increasing accumulation of chemical and biomolecular data, data-driven artificial intelligence (AI) models will usher in an era of faster, cheaper and more efficient drug design and drug discovery. However, unlike image, text, video, and audio data, molecular data from chemistry and biology have much more complicated three-dimensional structures, as well as physical and chemical properties. Efficient molecular representations and descriptors are key to the success of machine learning models in drug design. Here, we propose Dowker complex based molecular representation and Riemann Zeta function based molecular featurization, for the first time. To characterize the complicated molecular structures and interactions at the atomic level, Dowker complexes are constructed. Based on them, intrinsic mathematical invariants are derived and used as molecular descriptors, which can be further combined with machine learning and deep learning models. Our model has achieved state-of-the-art results in protein-ligand binding affinity prediction, demonstrating its great potential for other drug design and discovery problems.
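For readers unfamiliar with zeta functions built from a spectrum, one common definition is sketched below. It is an assumption about the general form and not a transcription of the paper's exact descriptor: the symbol $L_k^{(r)}$, introduced here only for illustration, denotes the $k$-th combinatorial Laplacian of the Dowker complex at filtration value $r$, and only its nonzero eigenvalues enter the sum.

```latex
\zeta_{k}^{(r)}(s) \;=\; \sum_{\lambda_j > 0} \lambda_j^{-s},
\qquad \lambda_j \in \operatorname{spec}\!\left(L_k^{(r)}\right)
```

Under this reading, evaluating the function at a few chosen values of $s$ and $r$ would yield a fixed-length feature vector that a model such as a gradient boosting tree could consume.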
Affiliation(s)
- Xiang Liu
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore
- Chern Institute of Mathematics and LPMC, Nankai University, Tianjin, China
- Center for Topology and Geometry Based Technology, Hebei Normal University, Hebei, China
| | - Huitao Feng
- Chern Institute of Mathematics and LPMC, Nankai University, Tianjin, China
- Mathematical Science Research Center, Chongqing University of Technology, Chongqing, China
| | - Jie Wu
- Center for Topology and Geometry Based Technology, Hebei Normal University, Hebei, China
- School of Mathematical Sciences, Hebei Normal University, Hebei, China
| | - Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore
| |
|
28
|
Abstract
The biological significance of proteins has led the scientific community to explore their characteristics. These studies shed light on the interaction patterns and functions of proteins in a living body. Reliable experimental techniques are difficult in practice, which has paved the way for computational methods in interaction prediction. Automated methods have reduced these difficulties but cannot yet replace experimental studies, as the field is still evolving. Interaction prediction is a critical problem that needs highly accurate results, but none of the existing methods yet offers reliable performance on par with experimental results. This article aims to assess the existing computational docking algorithms, their challenges, and future scope. Blind docking techniques are quite helpful when no information other than the individual structures is available. As more and more complex structures are added to different databases, information-driven approaches can be a good alternative. Artificial intelligence, ruling over the major fields, is expected to take over this domain very shortly.
|
29
|
Tayara H, Abdelbaky I, To Chong K. Recent omics-based computational methods for COVID-19 drug discovery and repurposing. Brief Bioinform 2021; 22:6355836. [PMID: 34423353 DOI: 10.1093/bib/bbab339] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/22/2021] [Revised: 07/09/2021] [Indexed: 12/22/2022] Open
Abstract
The coronavirus disease 2019 (COVID-19) pandemic, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is the main reason for the increasing number of deaths worldwide. Although strict quarantine measures were followed in many countries, the disease situation is still intractable. Thus, all possible means must be used to confront this pandemic, and researchers are in a race against time to produce potential treatments to cure or reduce the increasing infections of COVID-19. Computational methods are rapidly proving successful in biology-related problems, including the diagnosis and treatment of diseases. Many efforts in recent months have used artificial intelligence (AI) techniques in the context of fighting the spread of COVID-19. Providing periodic reviews and discussions of recent efforts saves researchers time and helps to link their endeavors for a faster and more efficient confrontation of the pandemic. In this review, we discuss recent promising studies that used omics-based data together with AI algorithms and other computational tools to achieve this goal. We review the established datasets and the developed methods, which were primarily directed at new or repurposed drugs, vaccinations and diagnosis. The tools and methods varied depending on the level of detail in the available information, such as structures, sequences or metabolic data.
Affiliation(s)
- Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Republic of Korea
| | - Ibrahim Abdelbaky
- Artificial Intelligence Department, Faculty of Computers and Artificial Intelligence, Benha University, Banha 13518, Egypt
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, Jeollabukdo 54896, Republic of Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Republic of Korea
| |
|
30
|
Chen J, Zhao R, Tong Y, Wei GW. Evolutionary de Rham-Hodge method. Discrete and Continuous Dynamical Systems. Series B 2021; 26:3785-3821. [PMID: 34675756 DOI: 10.3934/dcdsb.2020257] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 12/29/2022]
Abstract
The de Rham-Hodge theory is a landmark of 20th-century mathematics and has had a great impact on mathematics, physics, computer science, and engineering. This work introduces an evolutionary de Rham-Hodge method to provide a unified paradigm for the multiscale geometric and topological analysis of evolving manifolds constructed from a filtration, which induces a family of evolutionary de Rham complexes. While the present method can be easily applied to closed manifolds, the emphasis is on more challenging compact manifolds with 2-manifold boundaries, which require appropriate analysis and treatment of boundary conditions on differential forms to maintain proper topological properties. Three sets of unique evolutionary Hodge Laplacians are proposed to generate three sets of topology-preserving singular spectra, for which the multiplicities of zero eigenvalues correspond exactly to the persistent Betti numbers of dimensions 0, 1 and 2. Additionally, three sets of non-zero eigenvalues further reveal both topological persistence and geometric progression during the manifold evolution. Extensive numerical experiments are carried out via the discrete exterior calculus to demonstrate the potential of the proposed paradigm for data representation and shape analysis of both point cloud data and density maps. To demonstrate the utility of the proposed method, it is applied to protein B-factor prediction in a few challenging cases for which existing biophysical models break down.
Affiliation(s)
- Jiahui Chen
- Department of Mathematics, Michigan State University, MI 48824, USA
| | - Rundong Zhao
- Department of Computer Science and Engineering, Michigan State University, MI 48824, USA
| | - Yiying Tong
- Department of Computer Science and Engineering, Michigan State University, MI 48824, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, MI 48824, USA
| |
|
31
|
Xiong G, Shen C, Yang Z, Jiang D, Liu S, Lu A, Chen X, Hou T, Cao D. Featurization strategies for protein–ligand interactions and their applications in scoring function development. WIREs Computational Molecular Science 2021. [DOI: 10.1002/wcms.1567] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/18/2022]
Affiliation(s)
- Guoli Xiong
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, China
| | - Chao Shen
- Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Ziyi Yang
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, China
| | - Dejun Jiang
- Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- College of Computer Science and Technology, Zhejiang University, Hangzhou, China
| | - Shao Liu
- Department of Pharmacy, Xiangya Hospital, Central South University, Changsha, China
| | - Aiping Lu
- Institute for Advancing Translational Medicine in Bone & Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, China
| | - Xiang Chen
- Department of Dermatology, Hunan Engineering Research Center of Skin Health and Disease, Hunan Key Laboratory of Skin Cancer and Psoriasis, Xiangya Hospital, Central South University, Changsha, China
| | - Tingjun Hou
- Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Dongsheng Cao
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, China
- Institute for Advancing Translational Medicine in Bone & Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, China
| |
|
32
|
El-Rashidy N, Abdelrazik S, Abuhmed T, Amer E, Ali F, Hu JW, El-Sappagh S. Comprehensive Survey of Using Machine Learning in the COVID-19 Pandemic. Diagnostics (Basel) 2021; 11:1155. [PMID: 34202587 PMCID: PMC8303306 DOI: 10.3390/diagnostics11071155] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 05/12/2021] [Revised: 05/29/2021] [Accepted: 05/31/2021] [Indexed: 12/11/2022] Open
Abstract
Since December 2019, the global population has faced the rapid spread of coronavirus disease (COVID-19). With the accelerating number of infected cases, the World Health Organization (WHO) has reported COVID-19 as an epidemic that puts a heavy burden on healthcare sectors in almost every country. The potential of artificial intelligence (AI) in this context is difficult to ignore. AI companies have been racing to develop innovative tools that help arm the world against this pandemic and minimize the disruption that it may cause. The main objective of this study is to survey the decisive role of AI as a technology used to fight against the COVID-19 pandemic. Five significant applications of AI for COVID-19 were found, including (1) COVID-19 diagnosis using various data types (e.g., images, sound, and text); (2) estimation of the possible future spread of the disease based on the current confirmed cases; (3) association between COVID-19 infection and patient characteristics; (4) vaccine development and drug interaction; and (5) development of supporting applications. This study also introduces a comparison between current COVID-19 datasets. Based on the limitations of the current literature, this review highlights the open research challenges that could inspire the future application of AI in COVID-19.
Affiliation(s)
- Nora El-Rashidy
- Machine Learning and Information Retrieval Department, Faculty of Artificial Intelligence, Kafrelsheiksh University, Kafrelsheiksh 13518, Egypt
| | - Samir Abdelrazik
- Information System Department, Faculty of Computer Science and Information Systems, Mansoura University, Mansoura 13518, Egypt
| | - Tamer Abuhmed
- College of Computing and Informatics, Sungkyunkwan University, Seoul 03063, Korea
| | - Eslam Amer
- Faculty of Computer Science, Misr International University, Cairo 11828, Egypt
| | - Farman Ali
- Department of Software, Sejong University, Seoul 05006, Korea
| | - Jong-Wan Hu
- Department of Civil and Environmental Engineering, Incheon National University, Incheon 22012, Korea
| | - Shaker El-Sappagh
- Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, 15782 Santiago de Compostela, Spain
- Information Systems Department, Faculty of Computers and Artificial Intelligence, Benha University, Banha 13518, Egypt
| |
|
33
|
Chen D, Gao K, Nguyen DD, Chen X, Jiang Y, Wei GW, Pan F. Algebraic graph-assisted bidirectional transformers for molecular property prediction. Nat Commun 2021; 12:3521. [PMID: 34112777 PMCID: PMC8192505 DOI: 10.1038/s41467-021-23720-w] [Citation(s) in RCA: 68] [Impact Index Per Article: 17.0] [Received: 01/21/2021] [Accepted: 05/06/2021] [Indexed: 11/09/2022] Open
Abstract
The ability to predict molecular properties is of great significance to drug discovery, human health, and environmental protection. Despite considerable efforts, quantitative prediction of various molecular properties remains a challenge. Although some machine learning models, such as bidirectional encoder from transformer, can incorporate massive unlabeled molecular data into molecular representations via a self-supervised learning strategy, they neglect three-dimensional (3D) stereochemical information. Algebraic graphs, specifically element-specific multiscale weighted colored algebraic graphs, embed complementary 3D molecular information into graph invariants. We propose an algebraic graph-assisted bidirectional transformer (AGBT) framework by fusing representations generated by algebraic graph and bidirectional transformer, as well as a variety of machine learning algorithms, including decision trees, multitask learning, and deep neural networks. We validate the proposed AGBT framework on eight molecular datasets, involving quantitative toxicity, physical chemistry, and physiology datasets. Extensive numerical experiments have shown that AGBT is a state-of-the-art framework for molecular property prediction.
Affiliation(s)
- Dong Chen
- School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen, China
- Department of Mathematics, Michigan State University, East Lansing, MI, USA
| | - Kaifu Gao
- Department of Mathematics, Michigan State University, East Lansing, MI, USA
| | - Duc Duy Nguyen
- Department of Mathematics, University of Kentucky, Lexington, KY, USA
| | - Xin Chen
- School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen, China
| | - Yi Jiang
- School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen, China
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
| | - Feng Pan
- School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen, China
| |
|
34
|
McNutt AT, Francoeur P, Aggarwal R, Masuda T, Meli R, Ragoza M, Sunseri J, Koes DR. GNINA 1.0: molecular docking with deep learning. J Cheminform 2021; 13:43. [PMID: 34108002 PMCID: PMC8191141 DOI: 10.1186/s13321-021-00522-2] [Citation(s) in RCA: 204] [Impact Index Per Article: 51.0] [Received: 01/19/2021] [Accepted: 05/26/2021] [Indexed: 12/20/2022] Open
Abstract
Molecular docking computationally predicts the conformation of a small molecule when binding to a receptor. Scoring functions are a vital piece of any molecular docking pipeline as they determine the fitness of sampled poses. Here we describe and evaluate the 1.0 release of the Gnina docking software, which utilizes an ensemble of convolutional neural networks (CNNs) as a scoring function. We also explore an array of parameter values for Gnina 1.0 to optimize docking performance and computational cost. Docking performance, as evaluated by the percentage of targets where the top pose is better than 2 Å root mean square deviation (Top1), is compared to AutoDock Vina scoring when utilizing explicitly defined binding pockets or whole protein docking. GNINA, utilizing a CNN scoring function to rescore the output poses, outperforms AutoDock Vina scoring on redocking and cross-docking tasks when the binding pocket is defined (Top1 increases from 58% to 73% and from 27% to 37%, respectively) and when the whole protein defines the binding pocket (Top1 increases from 31% to 38% and from 12% to 16%, respectively). The derived ensemble of CNNs generalizes to unseen proteins and ligands and produces scores that correlate well with the root mean square deviation to the known binding pose. We provide the 1.0 version of GNINA under an open source license for use as a molecular docking tool at https://github.com/gnina/gnina.
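The Top1 metric used in this abstract (percentage of targets whose top-ranked pose is within 2 Å RMSD of the known pose) can be computed as in the minimal sketch below. The function name, the data layout and the example values are hypothetical and are not taken from the GNINA code base; the sketch assumes that a higher score means a better pose, so scores that are better when lower (such as Vina energies) would need to be negated first.

```python
def top1_success_rate(poses_by_target, rmsd_threshold=2.0):
    """Percentage of targets whose best-scored pose has RMSD below the threshold.

    poses_by_target maps a target name to a list of (score, rmsd) pairs,
    where a higher score is assumed to mean a better-ranked pose.
    """
    if not poses_by_target:
        raise ValueError("at least one target is required")
    hits = 0
    for poses in poses_by_target.values():
        _, top_rmsd = max(poses, key=lambda pose: pose[0])  # top-ranked pose
        if top_rmsd < rmsd_threshold:
            hits += 1
    return 100.0 * hits / len(poses_by_target)


# Hypothetical values, for illustration only.
example = {
    "target_a": [(0.91, 1.2), (0.40, 6.5)],   # top pose is near-native
    "target_b": [(0.85, 4.8), (0.80, 0.9)],   # near-native pose ranked second
}
print(top1_success_rate(example))
```

In this example only the first target counts as a success, which illustrates why rescoring can change Top1 without any change in the sampled poses.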
Affiliation(s)
- Andrew T McNutt
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
| | - Paul Francoeur
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
| | - Rishal Aggarwal
- Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500 032, India
| | - Tomohide Masuda
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
| | - Rocco Meli
- Department of Biochemistry, University of Oxford, Oxford, United Kingdom
| | - Matthew Ragoza
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
| | - Jocelyn Sunseri
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
| | - David Ryan Koes
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
| |
|
35
|
Meng Z, Xia K. Persistent spectral-based machine learning (PerSpect ML) for protein-ligand binding affinity prediction. Science Advances 2021; 7:eabc5329. [PMID: 33962954 PMCID: PMC8104863 DOI: 10.1126/sciadv.abc5329] [Citation(s) in RCA: 83] [Impact Index Per Article: 20.8] [Received: 04/29/2020] [Accepted: 03/18/2021] [Indexed: 05/11/2023]
Abstract
Molecular descriptors are essential to not only quantitative structure-activity relationship (QSAR) models but also machine learning-based material, chemical, and biological data analysis. Here, we propose persistent spectral-based machine learning (PerSpect ML) models for drug design. Different from all previous spectral models, a filtration process is introduced to generate a sequence of spectral models at various different scales. PerSpect attributes are defined as the function of spectral variables over the filtration value. Molecular descriptors obtained from PerSpect attributes are combined with machine learning models for protein-ligand binding affinity prediction. Our results, for the three most commonly used databases including PDBbind-2007, PDBbind-2013, and PDBbind-2016, are better than all existing models, as far as we know. The proposed PerSpect theory provides a powerful feature engineering framework. PerSpect ML models demonstrate great potential to significantly improve the performance of learning models in molecular data analysis.
Affiliation(s)
- Zhenyu Meng
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
| | - Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
| |
|
36
|
Kimber TB, Chen Y, Volkamer A. Deep Learning in Virtual Screening: Recent Applications and Developments. Int J Mol Sci 2021; 22:4435. [PMID: 33922714 PMCID: PMC8123040 DOI: 10.3390/ijms22094435] [Citation(s) in RCA: 71] [Impact Index Per Article: 17.8] [Received: 03/18/2021] [Revised: 04/13/2021] [Accepted: 04/14/2021] [Indexed: 01/03/2023] Open
Abstract
Drug discovery is a cost- and time-intensive process that is often assisted by computational methods, such as virtual screening, to speed up and guide the design of new compounds. For many years, machine learning methods have been successfully applied in the context of computer-aided drug discovery. Recently, thanks to the rise of novel technologies as well as the increasing amount of available chemical and bioactivity data, deep learning has had a tremendous impact on rational active compound discovery. Herein, recent applications and developments of machine learning, with a focus on deep learning, in virtual screening for active compound design are reviewed. This includes introducing different compound and protein encodings, deep learning techniques as well as frequently used bioactivity and benchmark data sets for model training and testing. Finally, the present state of the art, including the current challenges and emerging problems, is examined and discussed.
Affiliation(s)
| | | | - Andrea Volkamer
- In Silico Toxicology and Structural Bioinformatics, Institute of Physiology, Charité-Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany; (T.B.K.); (Y.C.)
| |
|
37
|
Liu X, Feng H, Wu J, Xia K. Persistent spectral hypergraph based machine learning (PSH-ML) for protein-ligand binding affinity prediction. Brief Bioinform 2021; 22:6219114. [PMID: 33837771 DOI: 10.1093/bib/bbab127] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 01/12/2021] [Revised: 03/14/2021] [Accepted: 03/16/2021] [Indexed: 12/21/2022] Open
Abstract
Molecular descriptors are essential to not only quantitative structure activity/property relationship (QSAR/QSPR) models, but also machine learning based chemical and biological data analysis. In this paper, we propose persistent spectral hypergraph (PSH) based molecular descriptors or fingerprints for the first time. Our PSH-based molecular descriptors are used in the characterization of molecular structures and interactions, and further combined with machine learning models, in particular gradient boosting tree (GBT), for protein-ligand binding affinity prediction. Different from traditional molecular descriptors, which are usually based on molecular graph models, a hypergraph-based topological representation is proposed for protein-ligand interaction characterization. Moreover, a filtration process is introduced to generate a series of nested hypergraphs at different scales. For each of these hypergraphs, its eigenspectrum information can be obtained from the corresponding (Hodge) Laplacian matrix. PSH studies the persistence and variation of the eigenspectrum of the nested hypergraphs during the filtration process. Molecular descriptors or fingerprints can be generated from persistent attributes, which are statistical or combinatorial functions of PSH, and combined with machine learning models, in particular, GBT. We test our PSH-GBT model on the three most commonly used datasets, including PDBbind-2007, PDBbind-2013 and PDBbind-2016. Our results, for all these databases, are better than all existing machine learning models with traditional molecular descriptors, as far as we know.
Affiliation(s)
- Xiang Liu
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371; Chern Institute of Mathematics and LPMC, Nankai University, Tianjin, China, 300071; Center for Topology and Geometry Based Technology, Hebei Normal University, Hebei, China, 050024
| | - Huitao Feng
- Chern Institute of Mathematics and LPMC, Nankai University, Tianjin, China, 300071; Mathematical Science Research Center, Chongqing University of Technology, Chongqing, China, 400054
| | - Jie Wu
- Center for Topology and Geometry Based Technology, Hebei Normal University, Hebei, China, 050024; School of Mathematical Sciences, Hebei Normal University, Hebei, China, 050024
| | - Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371
| |
|
38
|
Lim S, Lu Y, Cho CY, Sung I, Kim J, Kim Y, Park S, Kim S. A review on compound-protein interaction prediction methods: Data, format, representation and model. Comput Struct Biotechnol J 2021; 19:1541-1556. [PMID: 33841755 PMCID: PMC8008185 DOI: 10.1016/j.csbj.2021.03.004] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Received: 08/31/2020] [Revised: 02/28/2021] [Accepted: 03/01/2021] [Indexed: 01/27/2023] Open
Abstract
There has recently been rapid progress in computational methods for determining protein targets of small molecule drugs, a task termed compound-protein interaction (CPI) prediction. In this review, we comprehensively review topics related to computational prediction of CPI. Data for CPI have been accumulated and curated significantly, both in quantity and quality. Computational methods have become more powerful than ever for analyzing such complex data. Thus, recent improvements in the quality of CPI prediction are due to the use of both sophisticated computational techniques and higher quality information in the databases. The goal of this article is to review topics related to CPI, from data, format and representation to computational models, so that researchers can take full advantage of these resources to develop novel prediction methods. Chemical compound and protein data from various resources are discussed in terms of data formats and encoding schemes. For the CPI methods, we group prediction methods into five categories, from traditional machine learning techniques to state-of-the-art deep learning techniques. In closing, we discuss emerging machine learning topics to help both experimental and computational scientists leverage current knowledge and strategies to develop more powerful and accurate CPI prediction methods.
Affiliation(s)
- Sangsoo Lim
- Bioinformatics Institute, Seoul National University, Seoul, Republic of Korea
| | - Yijingxiu Lu
- Department of Computer Science and Engineering, College of Engineering, Seoul National University, Seoul, Republic of Korea
| | - Chang Yun Cho
- Institute of Engineering Research, Seoul National University, Seoul, Republic of Korea
| | - Inyoung Sung
- Institute of Engineering Research, Seoul National University, Seoul, Republic of Korea
| | - Jungwoo Kim
- Department of Computer Science and Engineering, College of Engineering, Seoul National University, Seoul, Republic of Korea
| | - Youngkuk Kim
- Department of Computer Science and Engineering, College of Engineering, Seoul National University, Seoul, Republic of Korea
| | - Sungjoon Park
- Department of Computer Science and Engineering, College of Engineering, Seoul National University, Seoul, Republic of Korea
| | - Sun Kim
- Bioinformatics Institute, Seoul National University, Seoul, Republic of Korea
- Department of Computer Science and Engineering, College of Engineering, Seoul National University, Seoul, Republic of Korea
- Institute of Engineering Research, Seoul National University, Seoul, Republic of Korea
- Interdisciplinary Program in Bioinformatics, College of Natural Sciences, Seoul National University, Seoul, Republic of Korea
| |
|
39
|
Galindez G, Matschinske J, Rose TD, Sadegh S, Salgado-Albarrán M, Späth J, Baumbach J, Pauling JK. Lessons from the COVID-19 pandemic for advancing computational drug repurposing strategies. Nature Computational Science 2021; 1:33-41. [PMID: 38217166 DOI: 10.1038/s43588-020-00007-6] [Citation(s) in RCA: 68] [Impact Index Per Article: 17.0] [Received: 09/16/2020] [Accepted: 12/01/2020] [Indexed: 12/15/2022]
Abstract
Responding quickly to unknown pathogens is crucial to stop uncontrolled spread of diseases that lead to epidemics, such as the novel coronavirus, and to keep protective measures at a level that causes as little social and economic harm as possible. This can be achieved through computational approaches that significantly speed up drug discovery. A powerful approach is to restrict the search to existing drugs through drug repurposing, which can vastly accelerate the usually long approval process. In this Review, we examine a representative set of currently used computational approaches to identify repurposable drugs for COVID-19, as well as their underlying data resources. Furthermore, we compare drug candidates predicted by computational methods to drugs being assessed by clinical trials. Finally, we discuss lessons learned from the reviewed research efforts, including how to successfully connect computational approaches with experimental studies, and propose a unified drug repurposing strategy for better preparedness in the case of future outbreaks.
Affiliation(s)
- Gihanna Galindez
- Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Munich, Germany
| | - Julian Matschinske
- Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Munich, Germany
| | - Tim Daniel Rose
- LipiTUM, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Munich, Germany
| | - Sepideh Sadegh
- Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Munich, Germany
| | - Marisol Salgado-Albarrán
- Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Munich, Germany
- Natural Sciences Department, Universidad Autónoma Metropolitana-Cuajimalpa (UAM-C), Mexico City, Mexico
| | - Julian Späth
- Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Munich, Germany
| | - Jan Baumbach
- Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Munich, Germany
- Computational Biomedicine Lab, Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
| | - Josch Konstantin Pauling
- LipiTUM, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Munich, Germany
| |
|
40
|
Nguyen DD, Gao K, Chen J, Wang R, Wei GW. Unveiling the molecular mechanism of SARS-CoV-2 main protease inhibition from 137 crystal structures using algebraic topology and deep learning. Chem Sci 2020; 11:12036-12046. [PMID: 34123218 PMCID: PMC8162568 DOI: 10.1039/d0sc04641h] [Citation(s) in RCA: 57] [Impact Index Per Article: 11.4] [Received: 08/21/2020] [Accepted: 09/30/2020] [Indexed: 12/27/2022] Open
Abstract
Currently, there are neither effective antiviral drugs nor vaccines for coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Due to its high conservation and low similarity with human genes, SARS-CoV-2 main protease (Mpro) is one of the most favorable drug targets. However, the current understanding of the molecular mechanism of Mpro inhibition is limited by the lack of reliable binding affinity ranking and prediction for existing structures of Mpro-inhibitor complexes. This work integrates mathematics (i.e., algebraic topology) and deep learning (MathDL) to provide a reliable ranking of the binding affinities of 137 SARS-CoV-2 Mpro inhibitor structures. We reveal that the Gly143 residue in Mpro is the most attractive site for forming hydrogen bonds, followed by Glu166, Cys145, and His163. We also identify 71 targeted covalent bonding inhibitors. MathDL was validated on the PDBbind v2016 core set benchmark and a carefully curated SARS-CoV-2 inhibitor dataset to ensure the reliability of the present binding affinity prediction. The present binding affinity ranking, interaction analysis, and fragment decomposition offer a foundation for future drug discovery efforts.
Affiliation(s)
- Duc Duy Nguyen
- Department of Mathematics, University of Kentucky, KY 40506, USA
- Kaifu Gao
- Department of Mathematics, Michigan State University, MI 48824, USA
- Jiahui Chen
- Department of Mathematics, Michigan State University, MI 48824, USA
- Rui Wang
- Department of Mathematics, Michigan State University, MI 48824, USA
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
|
41
|
Estrada E. COVID-19 and SARS-CoV-2. Modeling the present, looking at the future. Physics Reports 2020; 869:1-51. [PMID: 32834430 PMCID: PMC7386394 DOI: 10.1016/j.physrep.2020.07.005] [Citation(s) in RCA: 65] [Impact Index Per Article: 13.0] [Received: 07/27/2020] [Accepted: 07/27/2020] [Indexed: 05/21/2023]
Abstract
Since December 2019, the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has produced an outbreak of pulmonary disease which soon became a global pandemic, known as COronaVIrus Disease-19 (COVID-19). The new coronavirus shares about 82% of its genome with the one which produced the 2003 outbreak (SARS-CoV-1). Both coronaviruses also share the same cellular receptor, angiotensin-converting enzyme 2 (ACE2). In spite of these similarities, the new coronavirus has spread more widely, faster and more lethally than the previous one. Many researchers across disciplines have used diverse modeling tools to analyze the impact of this pandemic at global and local scales. These include a wide range of approaches - deterministic, data-driven, stochastic, agent-based, and their combinations - to forecast the progression of the epidemic as well as the effects of non-pharmaceutical interventions to stop or mitigate its impact on the world population. The physical complexities of modern society need to be captured by these models. These include the many forms of social contact - (multiplex) social contact networks, (multilayer) transport systems, metapopulations, etc. - that may act as a framework for virus propagation. Modeling not only plays a fundamental role in analyzing and forecasting epidemiological variables; it also plays an important role in helping to find cures for the disease and in preventing contagion by means of new vaccines. Answering swiftly and effectively the questions of whether existing drugs could work against SARS-CoV-2 and whether new vaccines can be developed in time demands the use of physical modeling of proteins, protein-inhibitor interactions, virtual screening of drugs against virus targets, predicting the immunogenicity of small peptides, modeling vaccinomics and vaccine design, to mention just a few.
Here, we review three main areas of modeling research against SARS-CoV-2 and COVID-19: (1) epidemiology; (2) drug repurposing; and (3) vaccine design. We compile the most relevant existing literature about modeling strategies against the virus to help modelers navigate this fast-growing literature. We also keep an eye on future outbreaks, for which modelers can find here the most relevant strategies used in an emergency situation such as the current one to help in fighting future pandemics.
Affiliation(s)
- Ernesto Estrada
- Instituto Universitario de Matemáticas y Aplicaciones, Universidad de Zaragoza, 50009 Zaragoza, Spain
- ARAID Foundation, Government of Aragón, 50018 Zaragoza, Spain
|
42
|
Gao K, Nguyen DD, Chen J, Wang R, Wei GW. Repositioning of 8565 Existing Drugs for COVID-19. J Phys Chem Lett 2020; 11:5373-5382. [PMID: 32543196 PMCID: PMC7313673 DOI: 10.1021/acs.jpclett.0c01579] [Citation(s) in RCA: 67] [Impact Index Per Article: 13.4] [Received: 05/21/2020] [Accepted: 06/16/2020] [Indexed: 05/06/2023]
Abstract
The coronavirus disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has infected over 7.1 million people and led to over 0.4 million deaths. Currently, there is no specific anti-SARS-CoV-2 medication. New drug discovery typically takes more than 10 years. Drug repositioning has therefore become one of the most feasible approaches for combating COVID-19. This work curates the largest available experimental data set for SARS-CoV-2 or SARS-CoV 3CL (main) protease inhibitors. On the basis of this data set, we develop validated machine learning models with relatively low root-mean-square error to screen 1553 FDA-approved drugs as well as another 7012 investigational or off-market drugs in DrugBank. We found that many existing drugs might be potent against SARS-CoV-2. The druggability of many potent SARS-CoV-2 3CL protease inhibitors is analyzed. This work offers a foundation for further experimental studies of COVID-19 drug repositioning.
Affiliation(s)
- Kaifu Gao
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Duc Duy Nguyen
- Department of Mathematics, University of Kentucky, Lexington, Kentucky 40506, United States
- Jiahui Chen
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Rui Wang
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
|
43
|
Fresnais L, Ballester PJ. The impact of compound library size on the performance of scoring functions for structure-based virtual screening. Brief Bioinform 2020; 22:5855396. [PMID: 32568385 DOI: 10.1093/bib/bbaa095] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 03/20/2020] [Revised: 04/20/2020] [Accepted: 04/28/2020] [Indexed: 12/20/2022]
Abstract
Larger training datasets have been shown to improve the accuracy of machine learning (ML)-based scoring functions (SFs) for structure-based virtual screening (SBVS). In addition, massive test sets for SBVS, known as ultra-large compound libraries, have been demonstrated to enable the fast discovery of selective drug leads with low-nanomolar potency. This proof-of-concept was carried out on two targets using a single docking tool along with its SF. It is thus unclear whether this high level of performance would generalise to other targets, docking tools and SFs. We found that screening a larger compound library results in more potent actives being identified in all six additional targets using a different docking tool along with its classical SF. Furthermore, we established that a way to improve the potency of the retrieved molecules further is to rank them with more accurate ML-based SFs (we found this to be true in four of the six targets; the difference was not significant in the remaining two targets). A 3-fold increase in average hit rate across targets was also achieved by the ML-based SFs. Lastly, we observed that classical and ML-based SFs often find different actives, which supports using both types of SFs on those targets.
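The hit-rate comparison mentioned in this abstract can be illustrated with a minimal sketch. It assumes that a higher score means a more favourable prediction and that the hit rate is the fraction of true actives among the top-ranked compounds. The function name `hit_rate_at_k`, the scores and the activity labels are all hypothetical and contrived for illustration; none of them come from the paper.

```python
from typing import Sequence


def hit_rate_at_k(scores: Sequence[float], is_active: Sequence[bool], k: int) -> float:
    """Fraction of true actives among the k top-scored compounds.

    Assumes a higher score means a more favourable prediction.
    """
    if len(scores) != len(is_active):
        raise ValueError("scores and is_active must have the same length")
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the library size")
    # Rank compound indices from best to worst score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(is_active[i] for i in ranked[:k]) / k


# Toy library of 8 compounds (hypothetical labels and scores).
labels = [True, False, False, True, False, True, False, False]
classical_sf = [0.2, 0.9, 0.8, 0.7, 0.6, 0.1, 0.5, 0.4]
ml_sf = [0.9, 0.3, 0.2, 0.8, 0.4, 0.7, 0.1, 0.5]

classical_rate = hit_rate_at_k(classical_sf, labels, k=3)
ml_rate = hit_rate_at_k(ml_sf, labels, k=3)
# Ratio of the two hit rates on this toy data.
fold_change = ml_rate / classical_rate
```

In a real screen the labels would come from experimental assays and the comparison would be averaged across targets; the toy numbers here only show how the ratio is formed.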
|
44
|
Wang K, Lyu N, Diao H, Jin S, Zeng T, Zhou Y, Wu R. GM-DockZn: a geometry matching-based docking algorithm for zinc proteins. Bioinformatics 2020; 36:4004-4011. [DOI: 10.1093/bioinformatics/btaa292] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 02/13/2020] [Revised: 04/06/2020] [Accepted: 04/27/2020] [Indexed: 12/23/2022]
Abstract
Motivation
Molecular docking is a widely used technique for large-scale virtual screening of the interactions between small-molecule ligands and their target proteins. However, docking methods often perform poorly for metalloproteins because of the additional complexity of the three-way interactions among amino-acid residues, metal ions and ligands. This is a significant problem because zinc proteins alone comprise about 10% of all available protein structures in the Protein Data Bank. Here, we developed GM-DockZn, which is dedicated to ligand docking to zinc proteins. Unlike the existing docking methods developed specifically for zinc proteins, GM-DockZn samples ligand conformations directly using a geometric grid around the ideal zinc-coordination positions of seven discovered coordination motifs, which were found from a survey of known zinc proteins complexed with a single ligand.
Results
GM-DockZn has the best performance in sampling near-native poses with correct coordination atoms and numbers within the top 50 and top 10 predictions when compared to several state-of-the-art techniques. This is true not only for a non-redundant dataset of zinc proteins but also for a homolog set of different ligand and zinc-coordination systems for the same zinc proteins. Similar superior performance of GM-DockZn for near-native-pose sampling was also observed for docking to apo-structures and cross-docking between different ligand complex structures of the same protein. The highest success rate for sampling nearest near-native poses within top 5 and top 1 was achieved by combining GM-DockZn for conformational sampling with GOLD for ranking. The proposed geometry-based sampling technique will be useful for ligand docking to other metalloproteins.
Availability and implementation
GM-DockZn is freely available at www.qmclab.com/ for academic users.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kai Wang
- Guangdong Provincial Key Laboratory of New Drug Design and Evaluation, School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou 510006
- School of Agriculture and Biology, Zhongkai University of Agriculture and Engineering, Guangzhou 510000
- Nan Lyu
- Guangdong Provincial Key Laboratory of New Drug Design and Evaluation, School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou 510006
- Hongjuan Diao
- Guangdong Provincial Key Laboratory of New Drug Design and Evaluation, School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou 510006
- Shujuan Jin
- Peking University Shenzhen Graduate School, Shenzhen 518055
- Shenzhen Bay Laboratory, Shenzhen 518055, China
- Tao Zeng
- Guangdong Provincial Key Laboratory of New Drug Design and Evaluation, School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou 510006
- Yaoqi Zhou
- Peking University Shenzhen Graduate School, Shenzhen 518055
- Shenzhen Bay Laboratory, Shenzhen 518055, China
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Southport, QLD 4222, Australia
- Ruibo Wu
- Guangdong Provincial Key Laboratory of New Drug Design and Evaluation, School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou 510006
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Southport, QLD 4222, Australia
|
45
|
Abstract
Recently, machine learning (ML) has established itself in various worldwide benchmarking competitions in computational biology, including Critical Assessment of Structure Prediction (CASP) and Drug Design Data Resource (D3R) Grand Challenges. However, the intricate structural complexity and high ML dimensionality of biomolecular datasets obstruct the efficient application of ML algorithms in the field. In addition to data and algorithm, an efficient ML machinery for biomolecular predictions must include structural representation as an indispensable component. Mathematical representations that simplify the biomolecular structural complexity and reduce ML dimensionality have emerged as a prime winner in D3R Grand Challenges. This review is devoted to the recent advances in developing low-dimensional and scalable mathematical representations of biomolecules in our laboratory. We discuss three classes of mathematical approaches, including algebraic topology, differential geometry, and graph theory. We elucidate how the physical and biological challenges have guided the evolution and development of these mathematical apparatuses for massive and diverse biomolecular data. We focus the performance analysis on protein-ligand binding predictions in this review although these methods have had tremendous success in many other applications, such as protein classification, virtual screening, and the predictions of solubility, solvation free energies, toxicity, partition coefficients, protein folding stability changes upon mutation, etc.
Affiliation(s)
- Duc Duy Nguyen
- Department of Mathematics, Michigan State University, MI 48824, USA
- Zixuan Cang
- Department of Mathematics, Michigan State University, MI 48824, USA
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
|
46
|
Nguyen DD, Gao K, Chen J, Wang R, Wei GW. Potentially highly potent drugs for 2019-nCoV. bioRxiv 2020:2020.02.05.936013. [PMID: 32511344 PMCID: PMC7255774 DOI: 10.1101/2020.02.05.936013] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Indexed: 12/17/2022]
Abstract
The World Health Organization (WHO) has declared the 2019 novel coronavirus (2019-nCoV) infection outbreak a global health emergency. Currently, there is no effective anti-2019-nCoV medication. The sequence identity of the 3CL proteases of 2019-nCoV and SARS is 96%, which provides a sound foundation for structure-based drug repositioning (SBDR). Based on a SARS 3CL protease X-ray crystal structure, we construct a 3D homology structure of the 2019-nCoV 3CL protease. Based on this structure and existing experimental datasets for SARS 3CL protease inhibitors, we develop an SBDR model based on machine learning and mathematics to screen 1465 drugs in DrugBank that have been approved by the U.S. Food and Drug Administration (FDA). We found that many FDA-approved drugs are potentially highly potent against 2019-nCoV.
|
47
|
Gao K, Nguyen DD, Wang R, Wei GW. Machine intelligence design of 2019-nCoV drugs. bioRxiv 2020:2020.01.30.927889. [PMID: 32511308 PMCID: PMC7217289 DOI: 10.1101/2020.01.30.927889] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Indexed: 12/22/2022]
Abstract
Wuhan coronavirus, called 2019-nCoV, is a newly emerged virus that had infected more than 9692 people and led to more than 213 fatalities by January 30, 2020. Currently, there is no effective treatment for this epidemic. However, the viral protease of a coronavirus is well known to be essential for its replication and thus is an effective drug target. Fortunately, the sequence identity between the 2019-nCoV protease and that of severe acute respiratory syndrome coronavirus (SARS-CoV) is as high as 96.1%. We show that the protease inhibitor binding sites of 2019-nCoV and SARS-CoV are almost identical, which means all potential anti-SARS-CoV chemotherapies are also potential 2019-nCoV drugs. Here, we report a family of potential 2019-nCoV drugs generated by a machine intelligence-based generative network complex (GNC). The potential effectiveness of treating 2019-nCoV by using some existing HIV drugs is also analyzed.
Affiliation(s)
- Kaifu Gao
- Department of Mathematics, Michigan State University, MI 48824, USA
- Duc Duy Nguyen
- Department of Mathematics, Michigan State University, MI 48824, USA
- Rui Wang
- Department of Mathematics, Michigan State University, MI 48824, USA
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
|
48
|
Alekseenko A, Kotelnikov S, Ignatov M, Egbert M, Kholodov Y, Vajda S, Kozakov D. ClusPro LigTBM: Automated Template-based Small Molecule Docking. J Mol Biol 2019; 432:3404-3410. [PMID: 31863748 DOI: 10.1016/j.jmb.2019.12.011] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 10/14/2019] [Revised: 12/03/2019] [Accepted: 12/04/2019] [Indexed: 12/31/2022]
Abstract
The template-based approach has been essential for achieving high-quality models in the recent rounds of blind protein-protein docking competition CAPRI (Critical Assessment of Predicted Interactions). However, few such automated methods exist for protein-small molecule docking. In this paper, we present an algorithm for template-based docking of small molecules. It searches for known complexes with ligands that have partial coverage of the target ligand, performs conformational sampling and template-guided energy refinement to produce a variety of possible poses, and then scores the refined poses. The algorithm is available as the automated ClusPro LigTBM server. It allows the user to specify the target protein as a PDB file and the ligand as a SMILES string. The server then searches for templates and uses them for docking, presenting the user with top-scoring poses and their confidence scores. The method is tested on the Astex Diverse benchmark, as well as on the targets from the last round of the D3R (Drug Design Data Resource) Grand Challenge. The server is publicly available as part of the ClusPro docking server suite at https://ligtbm.cluspro.org/.
Affiliation(s)
- Andrey Alekseenko
- Department of Applied Mathematics and Statistics, Stony Brook University, 11794 Stony Brook, NY, USA; Laufer Center for Physical and Quantitative Biology, Stony Brook University, 11794 Stony Brook, NY, USA
- Sergei Kotelnikov
- Department of Applied Mathematics and Statistics, Stony Brook University, 11794 Stony Brook, NY, USA; Laufer Center for Physical and Quantitative Biology, Stony Brook University, 11794 Stony Brook, NY, USA; Innopolis University, 420500, Innopolis, Russia
- Mikhail Ignatov
- Department of Applied Mathematics and Statistics, Stony Brook University, 11794 Stony Brook, NY, USA; Laufer Center for Physical and Quantitative Biology, Stony Brook University, 11794 Stony Brook, NY, USA; Institute for Advanced Computational Sciences, Stony Brook University, 11794, Stony Brook, NY, USA
- Megan Egbert
- Department of Biomedical Engineering, Boston University, 02215, Boston, MA, USA
- Sandor Vajda
- Department of Biomedical Engineering, Boston University, 02215, Boston, MA, USA; Department of Chemistry, Boston University, 02215, Boston, MA, USA
- Dima Kozakov
- Department of Applied Mathematics and Statistics, Stony Brook University, 11794 Stony Brook, NY, USA; Laufer Center for Physical and Quantitative Biology, Stony Brook University, 11794 Stony Brook, NY, USA; Institute for Advanced Computational Sciences, Stony Brook University, 11794, Stony Brook, NY, USA
|
49
|
Grow C, Gao K, Nguyen DD, Wei GW. Generative network complex (GNC) for drug discovery. Communications in Information and Systems 2019; 19:241-277. [PMID: 34257523 PMCID: PMC8274326 DOI: 10.4310/cis.2019.v19.n3.a2] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/22/2022]
Abstract
It remains a challenging task to generate a vast variety of novel compounds with desirable pharmacological properties. In this work, a generative network complex (GNC) is proposed as a new platform for designing novel compounds, predicting their physical and chemical properties, and selecting potential drug candidates that fulfill various druggable criteria such as binding affinity, solubility, partition coefficient, etc. We combine a SMILES string generator, which consists of an encoder, a drug-property controlled or regulated latent space, and a decoder, with verification deep neural networks, a target-specific three-dimensional (3D) pose generator, and mathematical deep learning networks to generate new compounds, predict their drug properties, construct 3D poses associated with target proteins, and reevaluate druggability, respectively. New compounds were generated in the latent space by either randomized output, controlled output, or optimized output. In our demonstration, 2.08 million and 2.8 million novel compounds are generated respectively for Cathepsin S and BACE targets. These new compounds are very different from the seeds and cover a larger chemical space. For potentially active compounds, their 3D poses are generated using a state-of-the-art method. The resulting 3D complexes are further evaluated for druggability by a championing deep learning algorithm based on algebraic topology, differential geometry, and algebraic graph theories. Performed on supercomputers, the whole process took less than one week. Therefore, our GNC is an efficient new paradigm for discovering new drug candidates.
Affiliation(s)
- Christopher Grow
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Kaifu Gao
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Duc Duy Nguyen
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
|