1
|
Mao J, Akhtar J, Zhang X, Sun L, Guan S, Li X, Chen G, Liu J, Jeon HN, Kim MS, No KT, Wang G. Comprehensive strategies of machine-learning-based quantitative structure-activity relationship models. iScience 2021; 24:103052. [PMID: 34553136 PMCID: PMC8441174 DOI: 10.1016/j.isci.2021.103052] [Citation(s) in RCA: 49] [Impact Index Per Article: 12.3] [Indexed: 11/21/2022] Open
Abstract
Early quantitative structure-activity relationship (QSAR) technologies have unsatisfactory versatility and accuracy in fields such as drug discovery because they are based on traditional machine learning and interpretive expert features. The development of Big Data and deep learning technologies has significantly improved the processing of unstructured data and unleashed the great potential of QSAR. Here we discuss the integration of wet experiments (which provide experimental data and reliable verification), molecular dynamics simulation (which provides mechanistic interpretation at the atomic/molecular levels), and machine learning (including deep learning) techniques to improve QSAR models. We first review the history of traditional QSAR and point out its problems. We then propose a better QSAR model characterized by a new iterative framework to integrate machine learning with disparate data input. Finally, we discuss the application of QSAR and machine learning to many practical research fields, including drug development and clinical trials.
Affiliation(s)
- Jiashun Mao
- The Interdisciplinary Graduate Program in Integrative Biotechnology and Translational Medicine, Yonsei University, Incheon 21983, Republic of Korea
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Computational Science and Material Design, Shenzhen, Guangdong 518055, China
| | - Javed Akhtar
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Cell Microenvironment and Disease Research, Shenzhen, Guangdong 518055, China
| | - Xiao Zhang
- Shanghai Rural Commercial Bank Co., Ltd, Shanghai 200002, China
| | - Liang Sun
- Department of Physics, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
| | - Shenghui Guan
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Computational Science and Material Design, Shenzhen, Guangdong 518055, China
| | - Xinyu Li
- School of Life and Health Sciences and Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen 518172, China
| | - Guangming Chen
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Cell Microenvironment and Disease Research, Shenzhen, Guangdong 518055, China
| | - Jiaxin Liu
- Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul 03722, Republic of Korea
| | - Hyeon-Nae Jeon
- Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul 03722, Republic of Korea
| | - Min Sung Kim
- Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul 03722, Republic of Korea
| | - Kyoung Tai No
- The Interdisciplinary Graduate Program in Integrative Biotechnology and Translational Medicine, Yonsei University, Incheon 21983, Republic of Korea
| | - Guanyu Wang
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Computational Science and Material Design, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Cell Microenvironment and Disease Research, Shenzhen, Guangdong 518055, China
| |
|
2
|
Abstract
Atom pairwise potential functions make up an essential part of many scoring functions for protein decoy detection. With the development of machine learning (ML) tools, there are multiple ways to combine potential functions to create novel ML models and methods. Potential function parameters can be easily extracted; however, it is usually hard to directly obtain the calculated atom pairwise energies from scoring functions. Amber, as one of the most popular suites of modeling programs, has an extensive history and library of force field potential functions. In this work, we directly used the force field parameters in ff94 and ff14SB from Amber and encoded them to calculate atom pairwise energies for different interactions. Two sets of structures (a single amino acid set and a dipeptide set) were used to evaluate the performance of our encoded Amber potentials. Comparing the energy terms obtained from our encoding with those from Amber, we find energy differences within ±0.06 kcal/mol for all tested structures. Previously we have shown that the Random Forest (RF) model can help to emphasize more important atom pairwise interactions and ignore insignificant ones [Pei, J.; Zheng, Z.; Merz, K. M. J. Chem. Inf. Model. 2019, 59, 1919-1929]. Here, as an example of combining ML methods with traditional potential functions, we followed the same workflow to combine the RF models with force field potential functions from Amber. To determine the performance of our RF models with force field potential functions, 224 different protein native-decoy systems were used as our training and testing sets. We find that the RF models with ff94 and ff14SB force field parameters outperformed all other scoring functions (RF models with KECSA2, RWplus, DFIRE, dDFIRE, and GOAP) considered in this work for native structure detection, and they performed similarly in detecting the best decoy.
Through inclusion of best decoy to decoy comparisons in building our RF models, we were able to generate models that outperformed the score functions tested herein on both accuracy and best decoy detection, again showing the performance and flexibility of our RF models to tackle this problem. Finally, the importance of the RF algorithm and of the force field parameters was also tested, and the comparison results suggest that both are important, with the ML scoring function achieving its best performance only when they are combined. All code and data used in this work are available at https://github.com/JunPei000/FFENCODER_for_Protein_Folding_Pose_Selection.
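To make the idea of "atom pairwise energies" concrete, the following is a minimal sketch of the nonbonded terms in the Amber functional form: a 12-6 Lennard-Jones term with Lorentz-Berthelot combining rules and a Coulomb term. It is not the authors' encoder. The atom parameters below are hypothetical placeholders rather than real ff94 or ff14SB values, and bonded exclusions and 1-4 scaling are deliberately ignored.

```python
import math
from itertools import combinations

# Converts e^2/angstrom to kcal/mol; Amber topology files store charges
# pre-multiplied by 18.2223, so the squared factor appears here instead.
COULOMB_KCAL = 18.2223 ** 2

def lj_energy(r, rstar_i, eps_i, rstar_j, eps_j):
    """12-6 Lennard-Jones energy (kcal/mol) for two atoms at distance r (angstrom).

    rstar_* are per-atom van der Waals radii (R*), eps_* are well depths.
    """
    rmin_ij = rstar_i + rstar_j            # Lorentz rule: radii add
    eps_ij = math.sqrt(eps_i * eps_j)      # Berthelot rule: geometric mean
    ratio6 = (rmin_ij / r) ** 6
    return eps_ij * (ratio6 ** 2 - 2.0 * ratio6)

def coulomb_energy(r, q_i, q_j):
    """Coulomb energy (kcal/mol) for charges in units of e, distance in angstrom."""
    return COULOMB_KCAL * q_i * q_j / r

def pairwise_energies(atoms):
    """Return {(i, j): (E_lj, E_coulomb)} for every atom pair.

    Assumption: no bonded exclusions or 1-4 scaling are applied.
    """
    energies = {}
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        r = math.dist(a["xyz"], b["xyz"])
        energies[(i, j)] = (
            lj_energy(r, a["rstar"], a["eps"], b["rstar"], b["eps"]),
            coulomb_energy(r, a["q"], b["q"]),
        )
    return energies

# Hypothetical two-atom system (illustrative parameters only).
atoms = [
    {"xyz": (0.0, 0.0, 0.0), "q": 0.4, "rstar": 1.9, "eps": 0.1},
    {"xyz": (3.8, 0.0, 0.0), "q": -0.4, "rstar": 1.9, "eps": 0.1},
]
pair_table = pairwise_energies(atoms)
```

Each entry of `pair_table` could then serve as one feature for a downstream ML model, which is the general pattern the abstract describes.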
Affiliation(s)
- Jun Pei
- Department of Chemistry and the Department of Biochemistry and Molecular Biology, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
| | - Lin Frank Song
- Department of Chemistry and the Department of Biochemistry and Molecular Biology, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
| | - Kenneth M Merz
- Department of Chemistry and the Department of Biochemistry and Molecular Biology, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
| |
|
3
|
Tanemura KA, Pei J, Merz KM. Refinement of pairwise potentials via logistic regression to score protein-protein interactions. Proteins 2020; 88:1559-1568. [PMID: 32729132 DOI: 10.1002/prot.25973] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/25/2020] [Revised: 05/17/2020] [Accepted: 06/14/2020] [Indexed: 12/20/2022]
Abstract
Protein-protein interactions (PPIs) are ubiquitous and functionally of great importance in biological systems. Hence, the accurate prediction of PPIs by protein-protein docking and scoring tools is highly desirable in order to characterize their structure and biological function. Ab initio docking protocols consist of two stages: sampling docking poses to produce at least one near-native structure, and then scoring the large set of candidate structures. Concurrent development in both sampling and scoring is crucial for the deployment of protein-protein docking software. In the present work, we apply a machine learning model to pairwise potentials to improve the detection of native protein quaternary structures among decoys. A decoy set was featurized using the Knowledge and Empirical Combined Scoring Algorithm 2 (KECSA2) pairwise potential. The highly unbalanced decoy set was then balanced using a comparison concept between native and decoy structures. The resultant comparison descriptors were used to train a logistic regression (LR) classifier. The LR model yielded the best performance for native detection among decoys compared with conventional scoring functions, while exhibiting weaker performance for the detection of low root mean square deviation decoy structures. Its deployment on an independent benchmark set confirms that the scoring function performs competitively relative to other scoring functions. The scripts used are available at https://github.com/TanemuraKiyoto/PPI-native-detection-via-LR.
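One plausible reading of the "comparison concept" is sketched below, under stated assumptions: each native-decoy pair contributes two difference vectors, (native minus decoy) labeled 1 and (decoy minus native) labeled 0, which yields a perfectly balanced training set regardless of how many decoys exist per native. The feature vectors are synthetic, the helper names are hypothetical, and the small gradient-descent logistic regression stands in for whatever LR implementation the authors used.

```python
import math
import random

def comparison_descriptors(native, decoys):
    """Turn one native and many decoys into a balanced two-class set."""
    X, y = [], []
    for decoy in decoys:
        diff = [n - d for n, d in zip(native, decoy)]
        X.append(diff)                 # native minus decoy -> class 1
        y.append(1)
        X.append([-v for v in diff])   # decoy minus native -> class 0
        y.append(0)
    return X, y

def sigmoid(z):
    z = max(-30.0, min(30.0, z))       # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=200):
    """Stochastic gradient ascent on the log-likelihood, without a bias term.

    A bias is unnecessary here because the data are antisymmetric:
    every sample x with label 1 has a partner -x with label 0.
    """
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xj in enumerate(xi):
                w[j] += lr * (yi - p) * xj
    return w

def predict(w, x):
    """Return 1 if the first structure of the compared pair looks more native."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x))) > 0.5 else 0

# Synthetic data: a native with lower pairwise-potential features than 20 decoys.
rng = random.Random(0)
native = [-5.0, -3.0, -4.0]
decoys = [[v + rng.uniform(0.5, 2.0) for v in native] for _ in range(20)]

X, y = comparison_descriptors(native, decoys)
weights = train_logistic(X, y)
accuracy = sum(predict(weights, xi) == yi for xi, yi in zip(X, y)) / len(y)
```

With 20 decoys and a single native, a direct native-versus-decoy classifier would see a 1:20 class ratio, whereas the comparison set contains 20 samples of each class.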
Affiliation(s)
- Kiyoto A Tanemura
- Department of Chemistry, Michigan State University, East Lansing, Michigan, USA
| | - Jun Pei
- Department of Chemistry, Michigan State University, East Lansing, Michigan, USA
| | - Kenneth M Merz
- Department of Chemistry, Michigan State University, East Lansing, Michigan, USA
| |
|
4
|
Lee MH. Identification of host-guest systems in green TADF-based OLEDs with energy level matching based on a machine-learning study. Phys Chem Chem Phys 2020; 22:16378-16386. [PMID: 32657298 DOI: 10.1039/d0cp02871a] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/21/2022]
Abstract
Rapid progress has been made in both the molecular design concept and the fundamental electroluminescence (EL) mechanism of thermally activated delayed fluorescence (TADF)-based organic light-emitting diodes (OLEDs) in recent years. One of the requirements for TADF-based OLEDs having high external quantum efficiency (EQE) is a favorable energy level alignment between the host and the guest to promote energy transfer and improve the carrier balance. However, strategies to optimize TADF-based OLED performance by selecting suitable host-guest systems in the light-emitting layer remain insufficient. In this work, we investigated host-guest systems using two machine-learning approaches (feature-based and similarity-based algorithms) from our recent effort to optimize TADF-based OLEDs. The Random Forest (RF) algorithm based on the features of electronic and photo-physical properties can accurately predict the EQE of green TADF-based OLEDs with average correlation coefficients of R² = 0.85 for the training set and R² = 0.74 for the testing set. Also, the Support Vector Regression (SVR) algorithm based on similarity metrics between pairs of materials (e.g., host and guest) in terms of electronic parameters can provide reasonable device performance prediction (R² = 0.72) after optimization of its parameters. These results show that the predictive capability and model applicability of both machine-learning models can be used to identify suitable host-guest systems and explore complex relationships in green TADF-based OLEDs.
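For readers comparing the training and testing scores quoted above, the sketch below computes the coefficient of determination, one common definition of R². The abstract does not say whether this definition or the squared Pearson correlation was used, so this choice is an assumption, and the EQE values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical measured vs. predicted EQE values (percent).
eqe_measured = [12.0, 18.5, 20.1, 15.3, 22.4]
eqe_predicted = [13.1, 17.9, 19.0, 16.0, 21.5]
score = r_squared(eqe_measured, eqe_predicted)
```

A model that always predicts the mean of the measured values scores 0 under this definition, and a perfect model scores 1, which gives a reference scale for the gap between training and testing scores.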
Affiliation(s)
- Min-Hsuan Lee
- Rm. 1006, Bldg. 51, No. 195, Sec. 4, Chung Hsing Road, Chutung, Hsinchu 31057, Taiwan.
| |
|
5
|
Serafimova K, Mihaylov I, Vassilev D, Avdjieva I, Zielenkiewicz P, Kaczanowski S. Using Machine Learning in Accuracy Assessment of Knowledge-Based Energy and Frequency Base Likelihood in Protein Structures. Lecture Notes in Computer Science 2020. [PMCID: PMC7304015 DOI: 10.1007/978-3-030-50420-5_43] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2022]
Abstract
Many aspects of the study of protein folding and dynamics have been affected by the accumulation of data about native protein structures and recent advances in machine learning. Computational methods for predicting protein structures from their sequences are now heavily based on machine learning tools and on approaches that extract knowledge and rules from data using probabilistic models. Many of these methods use scoring functions to determine which structure best fits a native protein sequence. Using computational approaches, we obtained two scoring functions: knowledge-based energy and likelihood of base frequency, and we compared their accuracy in measuring the sequence structure fit. We compared the machine learning models’ accuracy of predictions for knowledge-based energy and likelihood values to validate our results, showing that likelihood is a more accurate scoring function than knowledge-based energy.
|
6
|
Moman E, Grishina MA, Potemkin VA. Nonparametric chemical descriptors for the calculation of ligand-biopolymer affinities with machine-learning scoring functions. J Comput Aided Mol Des 2019; 33:943-953. [PMID: 31728812 DOI: 10.1007/s10822-019-00248-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 04/30/2019] [Accepted: 11/04/2019] [Indexed: 12/20/2022]
Abstract
The computational prediction of ligand-biopolymer affinities is a crucial endeavor in modern drug discovery and one that still poses major challenges. The choice of the appropriate computational method is often a trade-off between accuracy and speed, with mathematical devices referred to as scoring functions being the fastest. Among the many shortcomings of scoring functions is the lack of universal applicability to every molecular system, largely due to their reliance on atom type perception and/or parametrization. This article proposes the use of nonparametric Model of Effective Radii of Atoms descriptors, which can be readily computed for the entire Periodic Table, and demonstrates that, in combination with machine learning algorithms, they can yield competitive performance and chemically meaningful insights.
Affiliation(s)
- Edelmiro Moman
- South Ural State University, 20A Tchaikovsky Street, Chelyabinsk, Russian Federation, 454080
| | - Maria A Grishina
- South Ural State University, 20A Tchaikovsky Street, Chelyabinsk, Russian Federation, 454080
| | - Vladimir A Potemkin
- South Ural State University, 20A Tchaikovsky Street, Chelyabinsk, Russian Federation, 454080
| |
|
7
|
Long S, Tian P. A simple neural network implementation of generalized solvation free energy for assessment of protein structural models. RSC Adv 2019; 9:36227-36233. [PMID: 35540566 PMCID: PMC9074945 DOI: 10.1039/c9ra05168f] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 07/08/2019] [Accepted: 10/14/2019] [Indexed: 11/21/2022] Open
Abstract
Rapid and accurate assessment of protein structural models is essential for protein structure prediction and design. Great progress has been made in this regard, especially by recent application of "knowledge-based" potentials. Various machine-learning-based protein structural model quality assessment methods are also quite successful. However, traditional "physics-based" models have not been as effective. Based on our analysis of the fundamental computational limitation behind the unsatisfactory performance of "physics-based" models, we propose a generalized solvation free energy (GSFE) framework, which is intrinsically flexible for multi-scale treatments and is amenable to machine learning implementation. Finally, we implemented a simple example of backbone-based residue-level GSFE with a neural network, which was found to have competitive performance when compared with the latest highly complex "knowledge-based" atomic potentials in distinguishing native structures from decoys.
Affiliation(s)
- Shiyang Long
- School of Chemistry, Jilin University, Changchun, China
| | - Pu Tian
- School of Life Science and School of Artificial Intelligence, Jilin University, 2699 Qianjin Street, Changchun 130012, China
| |
|
8
|
Pei J, Zheng Z, Kim H, Song LF, Walworth S, Merz MR, Merz KM. Random Forest Refinement of Pairwise Potentials for Protein–Ligand Decoy Detection. J Chem Inf Model 2019; 59:3305-3315. [DOI: 10.1021/acs.jcim.9b00356] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 12/27/2022]
Affiliation(s)
- Jun Pei
- Department of Chemistry, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
| | - Zheng Zheng
- Department of Chemistry, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
| | - Hyunji Kim
- Department of Chemistry, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
| | - Lin Frank Song
- Department of Chemistry, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
| | - Sarah Walworth
- Department of Chemistry, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
| | - Margaux R. Merz
- Department of Chemistry, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
| | - Kenneth M. Merz
- Department of Chemistry, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
- Institute for Cyber Enabled Research, Michigan State University, 567 Wilson Road, East Lansing, Michigan 48824, United States
| |
|
9
|
Wahab HA, Amaro RE, Cournia Z. A Celebration of Women in Computational Chemistry. J Chem Inf Model 2019; 59:1683-1692. [DOI: 10.1021/acs.jcim.9b00368] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 12/16/2022]
Affiliation(s)
| | - Rommie E. Amaro
- Department of Chemistry and Biochemistry, University of California, San Diego, 3234 Urey Hall, #0340, 9500 Gilman Drive, La Jolla, California 92093-0340, United States
| | - Zoe Cournia
- Biomedical Research Foundation, Academy of Athens, 11527 Athens, Greece
| |
|
10
|
Chen HY, Chen JQ, Li JY, Huang HJ, Chen X, Zhang HY, Chen CYC. Deep Learning and Random Forest Approach for Finding the Optimal Traditional Chinese Medicine Formula for Treatment of Alzheimer's Disease. J Chem Inf Model 2019; 59:1605-1623. [PMID: 30888812 DOI: 10.1021/acs.jcim.9b00041] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 12/21/2022]
Abstract
It has been demonstrated that glycogen synthase kinase 3β (GSK3β) is related to Alzheimer's disease (AD). On the basis of the world's largest traditional Chinese medicine (TCM) database, a network-pharmacology-based approach was utilized to investigate TCM candidates that can dock well with multiple targets. Support vector machine (SVM) and multiple linear regression (MLR) methods were utilized to obtain predictive models. In particular, the deep learning method and the random forest (RF) algorithm were adopted. We achieved R² values of 0.927 on the training set and 0.862 on the test set with deep learning and 0.869 on the training set and 0.890 on the test set with RF. In addition, comparative molecular similarity indices analysis (CoMSIA) was performed to obtain a predictive model. All of the trained models achieved good results on the test set. The stability of GSK3β protein-ligand complexes was evaluated using 100 ns of MD simulation. Methyl 3-O-feruloylquinate and cynanogenin A induced both greater compactness in the GSK3β complex and stable conditions at all simulation times, and the GSK3β complex also had no substantial fluctuations after a simulation time of 5 ns. For TCM molecules, we used the trained models to calculate predicted bioactivity values, and the optimum TCM candidates were obtained by ranking the predicted values. The results showed that methyl 3-O-feruloylquinate contained in Phellodendron amurense and cynanogenin A contained in Cynanchum atratum are capable of forming stable interactions with GSK3β.
Affiliation(s)
- Hsin-Yi Chen
- School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen 510275, China
| | - Jian-Qiang Chen
- School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen 510275, China
| | - Jun-Yan Li
- School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen 510275, China
| | - Hung-Jin Huang
- School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen 510275, China
| | - Xi Chen
- School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen 510275, China
| | - Hao-Ying Zhang
- School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen 510275, China
| | - Calvin Yu-Chian Chen
- School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen 510275, China
- Department of Medical Research, China Medical University Hospital, Taichung 40447, Taiwan
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung 41354, Taiwan
| |
|