1
Dewaker V, Morya VK, Kim YH, Park ST, Kim HS, Koh YH. Revolutionizing oncology: the role of Artificial Intelligence (AI) as an antibody design, and optimization tools. Biomark Res 2025; 13:52. [PMID: 40155973 PMCID: PMC11954232 DOI: 10.1186/s40364-025-00764-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2025] [Accepted: 03/13/2025] [Indexed: 04/01/2025] Open
Abstract
Antibodies play a crucial role in defending the human body against diseases, including life-threatening conditions like cancer. They mediate immune responses against foreign antigens and, in some cases, self-antigens. Over time, antibody-based technologies have evolved from monoclonal antibodies (mAbs) to chimeric antigen receptor T cells (CAR-T cells), significantly impacting biotechnology, diagnostics, and therapeutics. Although these advancements have enhanced therapeutic interventions, the integration of artificial intelligence (AI) is revolutionizing antibody design and optimization. This review explores recent AI advancements, including large language models (LLMs), diffusion models, and generative AI-based applications, which have transformed antibody discovery by accelerating de novo generation, enhancing immune response precision, and optimizing therapeutic efficacy. Through advanced data analysis, AI enables the prediction and design of antibody sequences, 3D structures, complementarity-determining regions (CDRs), paratopes, epitopes, and antigen-antibody interactions. These AI-powered innovations address longstanding challenges in antibody development, significantly improving speed, specificity, and accuracy in therapeutic design. By integrating computational advancements with biomedical applications, AI is driving next-generation cancer therapies, transforming precision medicine, and enhancing patient outcomes.
Affiliation(s)
- Varun Dewaker
  - Institute of New Frontier Research Team, Hallym University, Chuncheon-Si, Gangwon-Do, 24252, Republic of Korea
- Vivek Kumar Morya
  - Department of Orthopedic Surgery, Hallym University Dongtan Sacred Hospital, Hwaseong-Si, 18450, Republic of Korea
- Yoo Hee Kim
  - Department of Biomedical Gerontology, Ilsong Institute of Life Science, Hallym University, Seoul, 07247, Republic of Korea
- Sung Taek Park
  - Institute of New Frontier Research Team, Hallym University, Chuncheon-Si, Gangwon-Do, 24252, Republic of Korea
  - Department of Obstetrics and Gynecology, Kangnam Sacred-Heart Hospital, Hallym University Medical Center, Hallym University College of Medicine, Seoul, 07441, Republic of Korea
  - EIONCELL Inc, Chuncheon-Si, 24252, Republic of Korea
- Hyeong Su Kim
  - Institute of New Frontier Research Team, Hallym University, Chuncheon-Si, Gangwon-Do, 24252, Republic of Korea
  - Department of Internal Medicine, Division of Hemato-Oncology, Kangnam Sacred-Heart Hospital, Hallym University Medical Center, Hallym University College of Medicine, Seoul, 07441, Republic of Korea
  - EIONCELL Inc, Chuncheon-Si, 24252, Republic of Korea
- Young Ho Koh
  - Department of Biomedical Gerontology, Ilsong Institute of Life Science, Hallym University, Seoul, 07247, Republic of Korea
2
Li G, Zhang N, Fan L. ProG-SOL: Predicting Protein Solubility Using Protein Embeddings and Dual-Graph Convolutional Networks. ACS Omega 2025; 10:3910-3916. [PMID: 39926503 PMCID: PMC11800053 DOI: 10.1021/acsomega.4c09688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2024] [Revised: 01/06/2025] [Accepted: 01/13/2025] [Indexed: 02/11/2025]
Abstract
Solubility is a key biophysical property of proteins and is essential for evaluating the effectiveness of proteins in biochemical engineering. In recent years, protein solubility prediction has received extensive attention in the protein engineering research community. Many methods have been developed to predict protein solubility, but the generalization performance of existing prediction methods on independent test sets must be improved. In addition, solubility prediction methods do not work well when they are used for regression tasks. To address these issues, we developed a new method, ProG-SOL, an innovative sequence-based dual-graph convolutional network that simultaneously exploits the protein pretrained graph and the protein evolutionary graph for assessing solubility. Compared with other methods, ProG-SOL achieves better classification and regression results for different independent test sets at the same time. The model framework of our method may also be used to predict other properties of proteins such as protein function, protein-protein interaction, protein folding, and drug design, offering broad application prospects in protein engineering.
Affiliation(s)
- Gen Li
  - Production and R&D Center I of LSS, GenScript (Shanghai) Biotech Co., Ltd., Shanghai 200131, China
- Ning Zhang
  - Production and R&D Center I of LSS, GenScript Biotech Corporation, Nanjing 211122, China
- Long Fan
  - Production and R&D Center I of LSS, GenScript (Shanghai) Biotech Co., Ltd., Shanghai 200131, China
3
Mall R, Kaushik R, Martinez ZA, Thomson MW, Castiglione F. Benchmarking protein language models for protein crystallization. Sci Rep 2025; 15:2381. [PMID: 39827171 PMCID: PMC11743144 DOI: 10.1038/s41598-025-86519-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2024] [Accepted: 01/13/2025] [Indexed: 01/22/2025] Open
Abstract
The problem of protein structure determination is usually solved by X-ray crystallography. Several in silico deep learning methods have been developed to overcome the high attrition rate, cost of experiments and extensive trial-and-error settings, for predicting the crystallization propensities of proteins based on their sequences. In this work, we benchmark the power of open protein language models (PLMs) through the TRILL platform, a bespoke framework democratizing the usage of PLMs for the task of predicting crystallization propensities of proteins. By comparing LightGBM / XGBoost classifiers built on the average embedding representations of proteins learned by different PLMs, such as ESM2, Ankh, ProtT5-XL, ProstT5, xTrimoPGLM, SaProt with the performance of state-of-the-art sequence-based methods like DeepCrystal, ATTCrys and CLPred, we identify the most effective methods for predicting crystallization outcomes. The LightGBM classifiers utilizing embeddings from ESM2 models with 30 and 36 transformer layers and 150 and 3000 million parameters respectively show performance gains of 3-[Formula: see text] over all compared models for various evaluation metrics, including AUPR (Area Under Precision-Recall Curve), AUC (Area Under the Receiver Operating Characteristic Curve), and F1 on independent test sets. Furthermore, we fine-tune the ProtGPT2 model available via TRILL to generate crystallizable proteins. Starting with 3000 generated proteins and through a series of filtration steps including consensus of all open PLM-based classifiers, sequence identity through CD-HIT, secondary structure compatibility, aggregation screening, homology search and foldability evaluation, we identified a set of 5 novel proteins as potentially crystallizable.
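The abstract above describes classifiers built on the average embedding of each protein. As a minimal sketch of that pooling step only, assuming per-residue embeddings are already available as plain lists of floats: the function name `mean_pool` and the toy vectors are hypothetical and are not part of TRILL or of any PLM library.

```python
from typing import List, Sequence


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-length protein vector."""
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    if any(len(vec) != dim for vec in residue_embeddings):
        raise ValueError("all residue embeddings must share one dimension")
    n = len(residue_embeddings)
    # Column-wise mean: one value per embedding dimension.
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]


# Toy 3-residue protein with 2-dimensional embeddings (made-up numbers).
toy_protein = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(toy_protein)
```

Pooling in this way gives every protein a vector of the same length regardless of sequence length, which is what lets a tabular classifier such as a gradient-boosted tree model consume it.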
Affiliation(s)
- Raghvendra Mall
  - Biotechnology Research Center, Technology Innovation Institute, P.O. Box 9639, Abu Dhabi, United Arab Emirates
- Rahul Kaushik
  - Biotechnology Research Center, Technology Innovation Institute, P.O. Box 9639, Abu Dhabi, United Arab Emirates
- Zachary A Martinez
  - Division of Biology and Bioengineering, California Institute of Technology, Pasadena, 91125, CA, USA
- Matt W Thomson
  - Division of Biology and Bioengineering, California Institute of Technology, Pasadena, 91125, CA, USA
- Filippo Castiglione
  - Biotechnology Research Center, Technology Innovation Institute, P.O. Box 9639, Abu Dhabi, United Arab Emirates
  - Institute for Applied Computing, National Research Council of Italy, 00185, Rome, Italy
4
Pimtawong T, Ren J, Lee J, Lee HM, Na D. A review on computational models for predicting protein solubility. J Microbiol 2025; 63:e.2408001. [PMID: 39895070 DOI: 10.71150/jm.2408001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2024] [Accepted: 10/29/2024] [Indexed: 02/04/2025]
Abstract
Protein solubility is a critical factor in the production of recombinant proteins, which are widely used in various industries, including pharmaceuticals, diagnostics, and biotechnology. Predicting protein solubility remains a challenging task due to the complexity of protein structures and the multitude of factors influencing solubility. Recent advances in computational methods, particularly those based on machine learning, have provided powerful tools for predicting protein solubility, thereby reducing the need for extensive experimental trials. This review provides an overview of current computational approaches to predict protein solubility. We discuss the datasets, features, and algorithms employed in these models. The review aims to bridge the gap between computational predictions and experimental validations, fostering the development of more accurate and reliable solubility prediction models that can significantly enhance recombinant protein production.
Affiliation(s)
- Teerapat Pimtawong
  - Department of Biomedical Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
- Jun Ren
  - Department of Biomedical Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
- Jingyu Lee
  - Department of Biomedical Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
- Hyang-Mi Lee
  - Department of Biomedical Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
- Dokyun Na
  - Department of Biomedical Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
5
Kwon H, Du Z, Li Y. AlphaFold 2-based stacking model for protein solubility prediction and its transferability on seed storage proteins. Int J Biol Macromol 2024; 278:134601. [PMID: 39137857 DOI: 10.1016/j.ijbiomac.2024.134601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Revised: 07/29/2024] [Accepted: 08/07/2024] [Indexed: 08/15/2024]
Abstract
Accurate protein solubility prediction is crucial in screening suitable candidates for food application. Existing models often rely only on sequences, overlooking important structural details. In this study, a regression model for protein solubility was developed using both the sequences and predicted structures of 2983 E. coli proteins. The sequence- and structure-level properties of the proteins were bioinformatically extracted and fed to a multilayer perceptron (MLP). Moreover, residue-level features and contact maps were utilized to construct a graph convolutional network (GCN). The out-of-fold predictions of the two models were combined and fed into multiple meta-regressors to create a stacking model. The stacking model with a support vector regressor (SVR) achieved R² of 0.502 and 0.468 on test and external validation datasets, respectively, displaying higher performance compared to existing regression models. Based on the improved performance compared to its base models, the stacking model effectively captured the strength of its base models as well as the significance of the different features used. Furthermore, the model's transferability was indirectly validated on a dataset of seed storage proteins using the Osborne definition as well as on a case study using molecular dynamics simulation, showing potential for application beyond microbial proteins to food and agriculture-related ones.
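The stacking step described above relies on out-of-fold predictions: each base model predicts a sample only when that sample was held out of its training fold, so the meta-regressor never sees predictions made on training data. The following is a minimal sketch of that idea under loud assumptions: the two base learners (`fit_mean`, `fit_line`) are toy stand-ins for the paper's MLP and GCN, the data are made up, and the meta-regressor itself (SVR in the abstract) is left out.

```python
from typing import Callable, List, Sequence

Model = Callable[[float], float]
Fitter = Callable[[Sequence[float], Sequence[float]], Model]


def fit_mean(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Toy base learner 1: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Toy base learner 2: one-dimensional least-squares line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def out_of_fold(xs: Sequence[float], ys: Sequence[float],
                fit: Fitter, k: int) -> List[float]:
    """Predict every sample with a model trained on the other k-1 folds."""
    n = len(xs)
    oof = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            oof[i] = model(xs[i])
    return oof


# Made-up data following y = 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
oof_mean = out_of_fold(xs, ys, fit_mean, k=3)
oof_line = out_of_fold(xs, ys, fit_line, k=3)
# Each row holds the meta-features one sample would contribute to a meta-regressor.
meta_features = list(zip(oof_mean, oof_line))
```

A meta-regressor would then be trained on `meta_features` against `ys`; which regressor to use is a separate design choice.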
Affiliation(s)
- Hyukjin Kwon
  - Department of Grain Science and Industry, Kansas State University, Manhattan, KS 66506, USA
- Zhenjiao Du
  - Department of Grain Science and Industry, Kansas State University, Manhattan, KS 66506, USA
- Yonghui Li
  - Department of Grain Science and Industry, Kansas State University, Manhattan, KS 66506, USA
6
Ghafoor H, Asim MN, Ibrahim MA, Dengel A. ProSol-multi: Protein solubility prediction via amino acids multi-level correlation and discriminative distribution. Heliyon 2024; 10:e36041. [PMID: 39281576 PMCID: PMC11401092 DOI: 10.1016/j.heliyon.2024.e36041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2024] [Revised: 08/01/2024] [Accepted: 08/08/2024] [Indexed: 09/18/2024] Open
Abstract
Protein solubility prediction is useful for the careful selection of highly effective candidate proteins for drug development. In recombinant protein synthesis, solubility prediction is valuable for optimizing key protein characteristics, including stability, functionality, and ease of purification. It contains valuable information about potential biomarkers or therapeutic targets and helps in early forecasting of neurodegenerative diseases, cancer, and cardiovascular disorders. Traditional wet-lab experimental protein solubility prediction approaches are error-prone, time-consuming, and costly. Researchers have harnessed Artificial Intelligence approaches to replace experimental approaches with computational predictors. These predictors infer the solubility of proteins by analyzing amino acid distributions in raw protein sequences. There is still a lot of room for the development of robust computational predictors because existing predictors still fail to extract the comprehensive discriminative distribution of amino acids. To more precisely discriminate soluble proteins from insoluble proteins, this paper presents the ProSol-Multi predictor, which makes use of a novel MLCDE encoder and a Random Forest classifier. The MLCDE encoder transforms protein sequences into informative statistical vectors by capturing amino acid multi-level correlation and discriminative distribution within raw protein sequences. The performance of the proposed encoder is evaluated against 56 existing protein sequence encoding methods on a widely used protein solubility prediction benchmark dataset under two different experimental settings, namely intrinsic and extrinsic. Intrinsic evaluation reveals that, among all sequence encoders, the proposed MLCDE encoder manages to generate non-overlapping clusters of soluble and insoluble classes. In extrinsic evaluation, 10 machine learning classifiers achieve better performance with the proposed MLCDE encoder than with 56 existing protein sequence encoders. Moreover, across 4 public benchmark datasets, the proposed ProSol-Multi predictor outperforms 20 existing predictors by an average of 3% in accuracy and 2% in MCC and AU-ROC. The ProSol-Multi interactive web application is available at https://sds_genetic_analysis.opendfki.de/ProSol-Multi.
Affiliation(s)
- Hina Ghafoor
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Muhammad Ali Ibrahim
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Andreas Dengel
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
7
Li W, Lin H, Huang Z, Xie S, Zhou Y, Gong R, Jiang Q, Xiang C, Huang J. DOTAD: A Database of Therapeutic Antibody Developability. Interdiscip Sci 2024; 16:623-634. [PMID: 38530613 DOI: 10.1007/s12539-024-00613-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2023] [Revised: 01/25/2024] [Accepted: 01/27/2024] [Indexed: 03/28/2024]
Abstract
The development of therapeutic antibodies is an important aspect of new drug discovery pipelines. The assessment of an antibody's developability (its suitability for large-scale production and therapeutic use) is a particularly important step in this process. Given that experimental assays to assess antibody developability at large scale are expensive and time-consuming, computational methods have been a more efficient alternative. However, the antibody research community faces significant challenges due to the scarcity of readily accessible data on antibody developability, which is essential for training and validating computational models. To address this gap, DOTAD (Database Of Therapeutic Antibody Developability) has been built as the first database dedicated exclusively to the curation of therapeutic antibody developability information. DOTAD aggregates all available therapeutic antibody sequence data along with various developability metrics from the scientific literature, offering researchers a robust platform for data storage, retrieval, exploration, and downloading. In addition to serving as a comprehensive repository, DOTAD enhances its utility by integrating a web-based interface that features state-of-the-art tools for the assessment of antibody developability. This ensures that users not only have access to critical data but also have the convenience of analyzing and interpreting this information. The DOTAD database represents a valuable resource for the scientific community, facilitating the advancement of therapeutic antibody research. It is freely accessible at http://i.uestc.edu.cn/DOTAD/, providing an open data platform that supports the continuous growth and evolution of computational methods in the field of antibody development.
Affiliation(s)
- Wenzhen Li
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Hongyan Lin
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Ziru Huang
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Shiyang Xie
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Yuwei Zhou
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Rong Gong
  - School of Computer Science and Technology, Aba Teachers University, Aba, 623002, China
- Qianhu Jiang
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- ChangCheng Xiang
  - School of Computer Science and Technology, Aba Teachers University, Aba, 623002, China
- Jian Huang
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
  - School of Healthcare Technology, Chengdu Neusoft University, Chengdu, 611844, China
8
Bucataru C, Ciobanasu C. Antimicrobial peptides: Opportunities and challenges in overcoming resistance. Microbiol Res 2024; 286:127822. [PMID: 38986182 DOI: 10.1016/j.micres.2024.127822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2024] [Revised: 06/20/2024] [Accepted: 06/25/2024] [Indexed: 07/12/2024]
Abstract
Antibiotic resistance represents a global health threat, challenging the efficacy of traditional antimicrobial agents and necessitating innovative approaches to combat infectious diseases. Among these alternatives, antimicrobial peptides have emerged as promising candidates against resistant pathogens. Unlike traditional antibiotics with only one target, these peptides can use different mechanisms to destroy bacteria, with low toxicity to mammalian cells compared to many conventional antibiotics. Antimicrobial peptides (AMPs) have encouraging antibacterial properties and are currently employed in the clinical treatment of pathogen infection, cancer, wound healing, cosmetics, or biotechnology. This review summarizes the mechanisms of antimicrobial peptides against bacteria, discusses the mechanisms of drug resistance, the limitations and challenges of AMPs in peptide drug applications for combating drug-resistant bacterial infections, and strategies to enhance their capabilities.
Affiliation(s)
- Cezara Bucataru
  - Alexandru I. Cuza University, Institute of Interdisciplinary Research, Department of Exact and Natural Sciences, Bulevardul Carol I, Nr.11, Iasi 700506, Romania
- Corina Ciobanasu
  - Alexandru I. Cuza University, Institute of Interdisciplinary Research, Department of Exact and Natural Sciences, Bulevardul Carol I, Nr.11, Iasi 700506, Romania
9
Zhang X, Hu X, Zhang T, Yang L, Liu C, Xu N, Wang H, Sun W. PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset. Brief Bioinform 2024; 25:bbae404. [PMID: 39179250 PMCID: PMC11343611 DOI: 10.1093/bib/bbae404] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Revised: 07/19/2024] [Accepted: 08/07/2024] [Indexed: 08/26/2024] Open
Abstract
Protein solubility plays a crucial role in various biotechnological, industrial, and biomedical applications. With the reduction in sequencing and gene synthesis costs, the adoption of high-throughput experimental screening coupled with tailored bioinformatic prediction has witnessed a rapidly growing trend for the development of novel functional enzymes of interest (EOI). High protein solubility rates are essential in this process and accurate prediction of solubility is a challenging task. As deep learning technology continues to evolve, attention-based protein language models (PLMs) can extract intrinsic information from protein sequences to a greater extent. Leveraging these models along with the increasing availability of protein solubility data inferred from structural databases like the Protein Data Bank holds great potential to enhance the prediction of protein solubility. In this study, we curated an Updated Escherichia coli protein Solubility DataSet (UESolDS) and employed a combination of multiple PLMs and classification layers to predict protein solubility. The resulting best-performing model, named Protein Language Model-based protein Solubility prediction model (PLM_Sol), demonstrated significant improvements over previously reported models, achieving a notable 6.4% increase in accuracy, 9.0% increase in F1_score, and 11.1% increase in Matthews correlation coefficient score on the independent test set. Moreover, additional evaluation utilizing our in-house synthesized protein resource as test data, encompassing diverse types of enzymes, also showcased the good performance of PLM_Sol. Overall, PLM_Sol exhibited consistent and promising performance across both the independent test set and the experimental set, thereby making it well suited for facilitating large-scale EOI studies. PLM_Sol is available as a standalone program and as an easy-to-use model at https://zenodo.org/doi/10.5281/zenodo.10675340.
Affiliation(s)
- Xuechun Zhang
  - Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - University of Chinese Academy of Sciences, No. 1 Yanqihu East Rd, Huairou District, Beijing 101408, China
- Xiaoxuan Hu
  - Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - University of Chinese Academy of Sciences, No. 1 Yanqihu East Rd, Huairou District, Beijing 101408, China
- Tongtong Zhang
  - Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - University of Chinese Academy of Sciences, No. 1 Yanqihu East Rd, Huairou District, Beijing 101408, China
- Ling Yang
  - Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - University of Chinese Academy of Sciences, No. 1 Yanqihu East Rd, Huairou District, Beijing 101408, China
- Chunhong Liu
  - Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - University of Chinese Academy of Sciences, No. 1 Yanqihu East Rd, Huairou District, Beijing 101408, China
- Ning Xu
  - Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - University of Chinese Academy of Sciences, No. 1 Yanqihu East Rd, Huairou District, Beijing 101408, China
- Haoyi Wang
  - Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - University of Chinese Academy of Sciences, No. 1 Yanqihu East Rd, Huairou District, Beijing 101408, China
  - Beijing Institute for Stem Cell and Regenerative Medicine, A 3 Datun Road, Chaoyang District, Beijing 100100, China
- Wen Sun
  - Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China
  - Beijing Institute for Stem Cell and Regenerative Medicine, A 3 Datun Road, Chaoyang District, Beijing 100100, China
10
Mall R, Singh A, Patel CN, Guirimand G, Castiglione F. VISH-Pred: an ensemble of fine-tuned ESM models for protein toxicity prediction. Brief Bioinform 2024; 25:bbae270. [PMID: 38842509 PMCID: PMC11154842 DOI: 10.1093/bib/bbae270] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 05/06/2024] [Accepted: 05/23/2024] [Indexed: 06/07/2024] Open
Abstract
Peptide- and protein-based therapeutics are becoming a promising treatment regimen for myriad diseases. Toxicity of proteins is the primary hurdle for protein-based therapies. Thus, there is an urgent need for accurate in silico methods for determining toxic proteins to filter the pool of potential candidates. At the same time, it is imperative to precisely identify non-toxic proteins to expand the possibilities for protein-based biologics. To address this challenge, we proposed an ensemble framework, called VISH-Pred, comprising models built by fine-tuning ESM2 transformer models on a large, experimentally validated, curated dataset of protein and peptide toxicities. The primary steps in the VISH-Pred framework are to efficiently estimate protein toxicities taking just the protein sequence as input, employing an undersampling technique to handle the large class imbalance in the data and learning representations from fine-tuned ESM2 protein language models, which are then fed to machine learning techniques such as LightGBM and XGBoost. The VISH-Pred framework is able to correctly identify both peptides/proteins with potential toxicity and non-toxic proteins, achieving a Matthews correlation coefficient of 0.737, 0.716 and 0.322 and F1-score of 0.759, 0.696 and 0.713 on three non-redundant blind tests, respectively, outperforming other methods by over 10% on these quality metrics. Moreover, VISH-Pred achieved the best accuracy and area under receiver operating curve scores on these independent test sets, highlighting the robustness and generalization capability of the framework. By making VISH-Pred available as an easy-to-use web server, we expect it to serve as a valuable asset for future endeavors aimed at discerning the toxicity of peptides and enabling efficient protein-based therapeutics.
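The abstract reports both the Matthews correlation coefficient (MCC) and the F1-score. F1 ignores true negatives, while MCC uses all four cells of the confusion matrix, which is why the two can diverge on imbalanced test sets. A minimal sketch of both standard formulas follows; the function names and the example counts are illustrative and are not taken from the paper.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); true negatives play no role."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Illustrative imbalanced confusion matrix (made-up counts).
tp, tn, fp, fn = 40, 900, 10, 50
f1 = f1_score(tp, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
```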
Affiliation(s)
- Raghvendra Mall
  - Biotechnology Research Center, Technology Innovation Institute, P.O. Box 9639, Abu Dhabi, United Arab Emirates
- Ankita Singh
  - Biotechnology Research Center, Technology Innovation Institute, P.O. Box 9639, Abu Dhabi, United Arab Emirates
- Chirag N Patel
  - Biotechnology Research Center, Technology Innovation Institute, P.O. Box 9639, Abu Dhabi, United Arab Emirates
- Gregory Guirimand
  - Biotechnology Research Center, Technology Innovation Institute, P.O. Box 9639, Abu Dhabi, United Arab Emirates
  - Graduate School of Science, Technology and Innovation, Kobe University, 1-1 Rokkodai-cho, Nada-ku, Kobe, 657-8501, Japan
- Filippo Castiglione
  - Biotechnology Research Center, Technology Innovation Institute, P.O. Box 9639, Abu Dhabi, United Arab Emirates
  - Institute for Applied Computing, National Research Council of Italy, Via dei Taurini, 19, 00185, Rome, Italy
11
|
Nielsen H, Teufel F, Brunak S, von Heijne G. SignalP: The Evolution of a Web Server. Methods Mol Biol 2024; 2836:331-367. [PMID: 38995548 DOI: 10.1007/978-1-0716-4007-4_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2024]
Abstract
SignalP ( https://services.healthtech.dtu.dk/services/SignalP-6.0/ ) is a very popular prediction method for signal peptides, the intrinsic signals that make proteins secretory. The SignalP web server has existed since 1995 and is now in its sixth major version. In this historical account, we (three authors who have taken part in the entire journey plus the first author of the latest version) describe the differences between the versions and discuss the various decisions taken along the way.
Affiliation(s)
- Henrik Nielsen
  - Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Felix Teufel
  - Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
  - Digital Science & Innovation, Novo Nordisk A/S, Malov, Denmark
- Søren Brunak
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Gunnar von Heijne
  - Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
  - Science for Life Laboratory, Stockholm University, Solna, Sweden
12
Zhou Y, Huang Z, Li W, Wei J, Jiang Q, Yang W, Huang J. Deep learning in preclinical antibody drug discovery and development. Methods 2023; 218:57-71. [PMID: 37454742 DOI: 10.1016/j.ymeth.2023.07.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 11/22/2022] [Revised: 03/20/2023] [Accepted: 07/10/2023] [Indexed: 07/18/2023]
Abstract
Antibody drugs have become a key part of biotherapeutics. Patients suffering from various diseases have benefited from antibody therapies. However, their development process is rather long, expensive and risky. To speed up the process, reduce cost and improve the success rate, artificial intelligence, especially deep learning methods, has been widely used in all aspects of preclinical antibody drug development, from library generation to hit identification, developability screening, lead selection and optimization. In this review, we systematically summarize antibody encodings, deep learning architectures and models used in preclinical antibody drug discovery and development. We also critically discuss challenges and opportunities, problems and possible solutions, current applications and future directions of deep learning in antibody drug development.
Affiliation(s)
- Yuwei Zhou
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Ziru Huang
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Wenzhen Li
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Jinyi Wei
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Qianhu Jiang
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Wei Yang
  - School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
- Jian Huang
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
13
Emonts J, Buyel J. An overview of descriptors to capture protein properties - Tools and perspectives in the context of QSAR modeling. Comput Struct Biotechnol J 2023; 21:3234-3247. [PMID: 38213891 PMCID: PMC10781719 DOI: 10.1016/j.csbj.2023.05.022] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/13/2023] [Revised: 05/23/2023] [Accepted: 05/23/2023] [Indexed: 01/13/2024]
Abstract
Proteins are important ingredients in food and feed, they are the active components of many pharmaceutical products, and they are necessary, in the form of enzymes, for the success of many technical processes. However, production can be challenging, especially when using heterologous host cells such as bacteria to express and assemble recombinant mammalian proteins. The manufacturability of proteins can be hindered by low solubility, a tendency to aggregate, or inefficient purification. Tools such as in silico protein engineering and models that predict separation criteria can overcome these issues but usually require the complex shape and surface properties of proteins to be represented by a small number of quantitative numeric values known as descriptors, as similarly used to capture the features of small molecules. Here, we review the current status of protein descriptors, especially for application in quantitative structure activity relationship (QSAR) models. First, we describe the complexity of proteins and the properties that descriptors must accommodate. Then we introduce descriptors of shape and surface properties that quantify the global and local features of proteins. Finally, we highlight the current limitations of protein descriptors and propose strategies for the derivation of novel protein descriptors that are more informative.
Affiliation(s)
- J. Emonts
  - Fraunhofer Institute for Molecular Biology and Applied Ecology IME, Germany
- J.F. Buyel
  - University of Natural Resources and Life Sciences, Vienna (BOKU), Department of Biotechnology (DBT), Institute of Bioprocess Science and Engineering (IBSE), Muthgasse 18, 1190 Vienna, Austria
  - Institute for Molecular Biotechnology, Worringerweg 1, RWTH Aachen University, 52074 Aachen, Germany
14
Chen Z, Wang X, Chen X, Huang J, Wang C, Wang J, Wang Z. Accelerating therapeutic protein design with computational approaches toward the clinical stage. Comput Struct Biotechnol J 2023; 21:2909-2926. [PMID: 38213894 PMCID: PMC10781723 DOI: 10.1016/j.csbj.2023.04.027] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 02/15/2023] [Revised: 04/11/2023] [Accepted: 04/27/2023] [Indexed: 01/13/2024]
Abstract
Therapeutic proteins, represented by antibodies, are of increasing interest in human medicine. However, their clinical translation is still largely hindered by different aspects of developability, including affinity and selectivity, stability and aggregation prevention, solubility and viscosity reduction, and deimmunization. Conventional optimization of developability with widely used methods, such as display technologies and library screening approaches, is a time- and cost-intensive endeavor, and its efficiency in finding suitable solutions is still not enough to meet clinical needs. In recent years, the accelerated advancement of computational methodologies has ushered in a transformative era in the field of therapeutic protein design. Owing to their remarkable capabilities in feature extraction and modeling, cutting-edge computational strategies, integrated with conventional techniques, present a promising avenue to accelerate the progression of therapeutic protein design and optimization toward clinical implementation. Here, we compared therapeutic proteins and small molecules in terms of developability and provided an overview of the computational approaches applicable to the design or optimization of therapeutic proteins for several developability issues.
Affiliation(s)
- Zhidong Chen
  - Department of Pathology, The Eighth Affiliated Hospital, Sun Yat-sen University, Shenzhen 518033, China
  - School of Pharmaceutical Sciences, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
- Xinpei Wang
  - School of Pharmaceutical Sciences, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
- Xu Chen
  - School of Pharmaceutical Sciences, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
- Juyang Huang
  - School of Pharmaceutical Sciences, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
- Chenglin Wang
  - Shenzhen Qiyu Biotechnology Co., Ltd, Shenzhen 518107, China
- Junqing Wang
  - School of Pharmaceutical Sciences, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
- Zhe Wang
  - Department of Pathology, The Eighth Affiliated Hospital, Sun Yat-sen University, Shenzhen 518033, China
15
Patel CN, Mall R, Bensmail H. AI-driven drug repurposing and binding pose meta dynamics identifies novel targets for Monkeypox virus. J Infect Public Health 2023; 16:799-807. [PMID: 36966703 PMCID: PMC10014505 DOI: 10.1016/j.jiph.2023.03.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2022] [Revised: 02/28/2023] [Accepted: 03/05/2023] [Indexed: 03/17/2023]
Abstract
Monkeypox virus (MPXV) was confirmed in May 2022 and designated a global health emergency by WHO in July 2022. MPX virions are large, enveloped, brick-shaped, and contain a linear, double-stranded DNA genome as well as enzymes. MPXV particles bind to the host cell membrane via a variety of viral-host protein interactions. As a result, the wrapped structure is a potential therapeutic target. DeepRepurpose, an artificial intelligence-based compound-viral protein interaction framework, was used in a transfer learning setting to prioritize a set of FDA-approved and investigational drugs that can potentially inhibit MPXV viral proteins. To filter and narrow down the lead compounds from curated collections of pharmaceutical compounds, we used a rigorous computational framework that included homology modeling, molecular docking, dynamic simulations, binding free energy calculations, and binding pose metadynamics. We identified Elvitegravir as a potential inhibitor of MPXV using our comprehensive pipeline.
Affiliation(s)
- Chirag N. Patel
  - Department of Botany, Bioinformatics and Climate Change Impacts Management, School of Science, Gujarat University, Ahmedabad-380009, India
  - Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institute of Health, Frederick, MD-21702, USA
- Raghvendra Mall
  - Department of Immunology, St. Jude Children’s Research Hospital, 262 Danny Thomas Place, Memphis, Tennessee-38105, USA (corresponding author)
  - Biotechnology Research Center, Technology Innovation Institute, Abu Dhabi-9639, United Arab Emirates
- Halima Bensmail
  - Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha-34110, Qatar (corresponding author)
16
Luo X, Tong F, Zhao W, Zheng X, Li J, Li J, Zhao D. BERT2DAb: a pre-trained model for antibody representation based on amino acid sequences and 2D-structure. MAbs 2023; 15:2285904. [PMID: 38010801 DOI: 10.1080/19420862.2023.2285904] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2023] [Accepted: 11/16/2023] [Indexed: 11/29/2023]
Abstract
Prior research has generated a vast amount of antibody sequences, which has allowed the pre-training of language models on amino acid sequences to improve the efficiency of antibody screening and optimization. However, compared to those for proteins, there are fewer pre-trained language models available for antibody sequences. Additionally, existing pre-trained models solely rely on embedding representations using amino acids or k-mers, which do not explicitly take into account the role of secondary structure features. Here, we present a new pre-trained model called BERT2DAb. This model incorporates secondary structure information based on self-attention to learn representations of antibody sequences. Our model achieves state-of-the-art performance on three downstream tasks, including two antigen-antibody binding classification tasks (precision: 85.15%/94.86%; recall:87.41%/86.15%) and one antigen-antibody complex mutation binding free energy prediction task (Pearson correlation coefficient: 0.77). Moreover, we propose a novel method to analyze the relationship between attention weights and contact states of pairs of subsequences in tertiary structures. This enhances the interpretability of BERT2DAb. Overall, our model demonstrates strong potential for improving antibody screening and design through downstream applications.
Affiliation(s)
- Xiaowei Luo
  - Information Center, Academy of Military Medical Sciences, Beijing, China
- Fan Tong
  - Information Center, Academy of Military Medical Sciences, Beijing, China
- Wenbin Zhao
  - Information Center, Academy of Military Medical Sciences, Beijing, China
- Xiangwen Zheng
  - Information Center, Academy of Military Medical Sciences, Beijing, China
- Jiangyu Li
  - Information Center, Academy of Military Medical Sciences, Beijing, China
- Jing Li
  - State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing, China
- Dongsheng Zhao
  - Information Center, Academy of Military Medical Sciences, Beijing, China
17
Susithra Priyadarshni M, Isaac Kirubakaran S, Harish MC. In silico approach to design a multi-epitopic vaccine candidate targeting the non-mutational immunogenic regions in envelope protein and surface glycoprotein of SARS-CoV-2. J Biomol Struct Dyn 2022; 40:12948-12963. [PMID: 34528491 PMCID: PMC8477437 DOI: 10.1080/07391102.2021.1977702] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 01/06/2023]
Abstract
The novel coronavirus SARS-CoV-2 is the causative agent of coronavirus disease 2019 (COVID-19) and is responsible for the current human pandemic, which has caused global social and economic disruption. The currently available vaccines use whole viruses, whereas there is scope for peptide-based vaccines. The alarming global rise in infections therefore prompted us to identify a novel and effective vaccine candidate against SARS-CoV-2. To find potential vaccine candidate targets, immunoinformatics approaches were used to analyze the mutations in the envelope protein and surface glycoprotein and to determine the conserved regions; further, the specific T-cell epitopes VSLVKPSFY, SLVKPSFYV, RVKNLNSSR, SEETGTLIV, LVKPSFYVY, LTDEMIAQY, YLQPRTFLL, RLFRKSNLK, SPRRARSVA, AEIRASANL, TLLALHRSY, YSRVKNLNS and FELLHAPAT and the B-cell epitopes TLAILTALRLCAYCCN and AGTITSGWTFGAGAAL were identified. The 3D structure of the epitope was predicted, refined and validated. Molecular docking analysis of the multi-epitope vaccine candidates with TLR receptors predicted effective binding. Overall, using a bioinformatics approach, this multi-epitopic target provides a proof of concept for SARS-CoV-2 conserved epitopic vaccine design. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- S. Isaac Kirubakaran
  - Division of Pulmonary, Critical Care and Sleep Medicine, Department of Internal Medicine, University of Kansas Medical Center, KS, USA
- M. C. Harish
  - Department of Biotechnology, Thiruvalluvar University, Vellore, Tamil Nadu, India
  - Contact: M. C. Harish, Department of Biotechnology, Thiruvalluvar University, Serkkadu, Vellore 632115, India
18
Qing R, Hao S, Smorodina E, Jin D, Zalevsky A, Zhang S. Protein Design: From the Aspect of Water Solubility and Stability. Chem Rev 2022; 122:14085-14179. [PMID: 35921495 PMCID: PMC9523718 DOI: 10.1021/acs.chemrev.1c00757] [Citation(s) in RCA: 89] [Impact Index Per Article: 29.7] [Received: 08/31/2021] [Indexed: 12/13/2022]
Abstract
Water solubility and structural stability are key merits for proteins defined by the primary sequence and 3D-conformation. Their manipulation represents important aspects of the protein design field that relies on the accurate placement of amino acids and molecular interactions, guided by underlying physicochemical principles. Emulated designer proteins with well-defined properties both fuel the knowledge-base for more precise computational design models and are used in various biomedical and nanotechnological applications. The continuous developments in protein science, increasing computing power, new algorithms, and characterization techniques provide sophisticated toolkits for solubility design beyond guesswork. In this review, we summarize recent advances in the protein design field with respect to water solubility and structural stability. After introducing fundamental design rules, we discuss transmembrane protein solubilization and de novo transmembrane protein design. Traditional strategies to enhance protein solubility and structural stability are introduced. The designs of stable protein complexes and high-order assemblies are covered. Computational methodologies behind these endeavors, including structure prediction programs, machine learning algorithms, and specialty software dedicated to the evaluation of protein solubility and aggregation, are discussed. The findings and opportunities for Cryo-EM are presented. This review provides an overview of significant progress and prospects in accurate protein design for solubility and stability.
Affiliation(s)
- Rui Qing
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
  - Media Lab, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
  - The David H. Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Shilei Hao
  - Media Lab, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
  - Key Laboratory of Biorheological Science and Technology, Ministry of Education, College of Bioengineering, Chongqing University, Chongqing 400030, China
- Eva Smorodina
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo 0424, Norway
- David Jin
  - Avalon GloboCare Corp., Freehold, New Jersey 07728, United States
- Arthur Zalevsky
  - Laboratory of Bioinformatics Approaches in Combinatorial Chemistry and Biology, Shemyakin−Ovchinnikov Institute of Bioorganic Chemistry RAS, Moscow 117997, Russia
- Shuguang Zhang
  - Media Lab, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
19
Abuei H, Pirouzfar M, Mojiri A, Behzad-Behbahani A, Kalantari T, Bemani P, Farhadi A. Maximizing the recovery of the native p28 bacterial peptide with improved activity and maintained solubility and stability in Escherichia coli BL21 (DE3). METHODS IN MICROBIOLOGY 2022; 200:106560. [PMID: 36031157 DOI: 10.1016/j.mimet.2022.106560] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/19/2022] [Revised: 08/10/2022] [Accepted: 08/20/2022] [Indexed: 02/06/2023]
Abstract
p28 is a natural bacterial product, which recently has attracted much attention as an efficient cell penetrating peptide (CPP) and a promising anticancer agent. Considering the interesting biological qualities of p28, maximizing its expression appears to be a prominent priority. The optimization of such bioprocesses might be facilitated by utilizing statistical approaches such as Design of Experiment (DoE). In this study, we aimed to maximize the expression of "biologically active" p28 in Escherichia coli BL21 (DE3) host by harnessing statistical tools and experimental methods. Using Minitab, Plackett-Burman and Box-Behnken Response Surface Methodology (RSM) designs were generated to optimize the conditions for the expression of p28. Each condition was experimentally investigated by assessing the biological activity of the purified p28 in the MCF-7 breast cancer cell line. Seven independent variables were investigated, and three of them including ethanol concentration, OD600 of the culture at the time of induction, and the post-induction temperature were demonstrated to significantly affect the p28 expression in E. coli. The cytotoxicity, penetration efficiency, and total process time were measured as dependent variables. The optimized expression conditions were validated experimentally, and the final products were investigated in terms of expression yield, solubility, and stability in vitro. Following the optimization, an 8-fold increase of the concentration of p28 expression was observed. In this study, we suggest an optimized combination of effective factors to produce soluble p28 in the E. coli host, a protocol that results in the production of a significantly high amount of the biologically active peptide with retained solubility and stability.
Affiliation(s)
- Haniyeh Abuei
  - Division of Medical Biotechnology, Department of Medical Laboratory Sciences, School of Paramedical Sciences, Shiraz University of Medical Sciences, Shiraz, Iran
- Mohammad Pirouzfar
  - Human and Animal Cell Bank, Iranian Biological Resource Center (IBRC), ACECR, Tehran, Iran
- Anahita Mojiri
  - Center for Cardiovascular Regeneration, Department of Cardiovascular Sciences, Houston Methodist Research Institute, Houston 77030, TX, USA
- Abbas Behzad-Behbahani
  - Diagnostic Laboratory Sciences and Technology Research Center, School of Paramedical Sciences, Shiraz University of Medical Sciences, Shiraz, Iran
- Tahereh Kalantari
  - Division of Medical Biotechnology, Department of Medical Laboratory Sciences, School of Paramedical Sciences, Shiraz University of Medical Sciences, Shiraz, Iran
  - Diagnostic Laboratory Sciences and Technology Research Center, School of Paramedical Sciences, Shiraz University of Medical Sciences, Shiraz, Iran
- Peyman Bemani
  - Department of Immunology, School of Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
- Ali Farhadi
  - Division of Medical Biotechnology, Department of Medical Laboratory Sciences, School of Paramedical Sciences, Shiraz University of Medical Sciences, Shiraz, Iran
  - Diagnostic Laboratory Sciences and Technology Research Center, School of Paramedical Sciences, Shiraz University of Medical Sciences, Shiraz, Iran
20
Wang H, Kwong CF, Liu Q, Liu Z, Chen Z. A Novel Artificial Intelligence System in Formulation Dissolution Prediction. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:8640115. [PMID: 35978897 PMCID: PMC9377879 DOI: 10.1155/2022/8640115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2022] [Revised: 06/20/2022] [Accepted: 06/24/2022] [Indexed: 11/29/2022]
Abstract
Artificial neural network (ANN) techniques are widely used to screen data and predict experimental results in pharmaceutical studies. In this study, a novel system for predicting and screening dissolution results was modeled with a backpropagation network and regression methods. For this purpose, 21 groups of dissolution data were used to train and verify the ANN model. Owing to the design of the input data, the related data remained usable for training the ANN model when the formulation composition was changed. Two regression methods, the effective data regression method (EDRM) and the reference line regression method (RLRM), allow this system to predict dissolution results with a high accuracy rate while using less data than the orthogonal experiment. A data-screening function based on a decision tree is also implemented in this system. This ANN model provides a novel drug prediction system that reduces time and cost and facilitates the design of new formulations.
Affiliation(s)
- Haoyu Wang
  - Department of Electrical and Electronic Engineering, University of Nottingham Ningbo China, Ningbo, China
- Chiew Foong Kwong
  - Department of Electrical and Electronic Engineering, University of Nottingham Ningbo China, Ningbo, China
- Qianyu Liu
  - International Doctoral Innovation Centre, NingboTech University, Ningbo, China
- Zhixin Liu
  - Department of Outpatient, Liaoning Thrombus Treatment Center of Integrated Chinese and Western Medicine, Shenyang, China
- Zhiyuan Chen
  - Department of Mechanical, Materials and Manufacture, University of Nottingham Ningbo China, Ningbo, China
21
Karaiyan P, Chang CCH, Chan ES, Tey BT, Ramanan RN, Ooi CW. In silico screening and heterologous expression of soluble dimethyl sulfide monooxygenases of microbial origin in Escherichia coli. Appl Microbiol Biotechnol 2022; 106:4523-4537. [PMID: 35713659 PMCID: PMC9259527 DOI: 10.1007/s00253-022-12008-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2022] [Revised: 05/30/2022] [Accepted: 06/01/2022] [Indexed: 11/28/2022]
Abstract
Sequence-based screening has been widely applied in the discovery of novel microbial enzymes. However, the majority of the sequences in the genomic databases were annotated using computational approaches and lack experimental characterization. Hence, success in obtaining functional biocatalysts with improved characteristics requires an efficient screening method that considers a wide array of factors. Recombinant expression of microbial enzymes is often hampered by the undesirable formation of inclusion bodies. Here, we present a systematic in silico screening method to identify proteins that are expressible in soluble form and have the desired biological properties. The screening approach was adopted in the recombinant expression of dimethyl sulfide (DMS) monooxygenase in Escherichia coli. DMS monooxygenase, a two-component enzyme consisting of DmoA and DmoB subunits, was used as a model protein. The success rate of producing soluble and active DmoA was 71% (5 out of 7 genes). Interestingly, the soluble recombinant DmoA enzymes exhibited NADH:FMN oxidoreductase activity in the absence of DmoB (the second subunit) and the cofactor FMN, suggesting that DmoA is also an oxidoreductase. DmoA originating from Janthinobacterium sp. AD80 showed the maximum NADH oxidation activity (maximum reaction rate: 6.6 µM/min; specific activity: 133 µM/min/mg). This novel finding may allow DmoA to be used as an oxidoreductase biocatalyst for various industrial applications. The in silico gene screening methodology established in this study can increase the success rate of producing soluble and functional enzymes while avoiding the laborious trial and error involved in screening a large pool of available genes.
Key points
• A systematic gene screening method was demonstrated.
• DmoA is also an oxidoreductase capable of oxidizing NADH and reducing FMN.
• DmoA oxidizes NADH in the absence of external FMN.
Supplementary Information The online version contains supplementary material available at 10.1007/s00253-022-12008-8.
Affiliation(s)
- Prasanth Karaiyan
  - Chemical Engineering Discipline, School of Engineering, Monash University Malaysia, Jalan Lagoon Selatan, 47500, Bandar Sunway, Selangor, Malaysia
- Catherine Ching Han Chang
  - Arkema Thiochemicals Sdn. Bhd., Jalan PJU 1A/7A OASIS Ara Damansara, 47301, Petaling Jaya, Selangor Darul Ehsan, Malaysia
- Eng-Seng Chan
  - Chemical Engineering Discipline, School of Engineering, Monash University Malaysia, Jalan Lagoon Selatan, 47500, Bandar Sunway, Selangor, Malaysia
- Beng Ti Tey
  - Chemical Engineering Discipline, School of Engineering, Monash University Malaysia, Jalan Lagoon Selatan, 47500, Bandar Sunway, Selangor, Malaysia
  - Advanced Engineering Platform, Monash University Malaysia, Jalan Lagoon Selatan, 47500, Bandar Sunway, Selangor, Malaysia
- Ramakrishnan Nagasundara Ramanan
  - Chemical Engineering Discipline, School of Engineering, Monash University Malaysia, Jalan Lagoon Selatan, 47500, Bandar Sunway, Selangor, Malaysia
  - Arkema Thiochemicals Sdn. Bhd., Jalan PJU 1A/7A OASIS Ara Damansara, 47301, Petaling Jaya, Selangor Darul Ehsan, Malaysia
- Chien Wei Ooi
  - Chemical Engineering Discipline, School of Engineering, Monash University Malaysia, Jalan Lagoon Selatan, 47500, Bandar Sunway, Selangor, Malaysia
  - Advanced Engineering Platform, Monash University Malaysia, Jalan Lagoon Selatan, 47500, Bandar Sunway, Selangor, Malaysia
22
Packiam KAR, Ooi CW, Li F, Mei S, Tey BT, Fang Ong H, Song J, Ramanan RN. PERISCOPE-Opt: Machine learning-based prediction of optimal fermentation conditions and yields of recombinant periplasmic protein expressed in Escherichia coli. Comput Struct Biotechnol J 2022; 20:2909-2920. [PMID: 35765650 PMCID: PMC9201004 DOI: 10.1016/j.csbj.2022.06.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2022] [Revised: 06/01/2022] [Accepted: 06/01/2022] [Indexed: 11/26/2022]
Abstract
The ensemble model considered both fermentation conditions and protein properties. Optimal fermentation conditions and periplasmic recombinant protein yield can be predicted. Predictor’s accuracy and Pearson correlation coefficient are 75% and 0.91, respectively.
Optimization of the fermentation process for recombinant protein production (RPP) is often resource-intensive. Machine learning (ML) approaches help to minimize experimentation and find wide application in RPP. However, these ML-based tools primarily focus on features derived from the amino acid sequence, ignoring the influence of fermentation process conditions. The present study combines features derived from fermentation process conditions with those from the amino acid sequence to construct an ML-based model that predicts the maximal protein yields and the corresponding fermentation conditions for the expression of a target recombinant protein in the Escherichia coli periplasm. Two sets of XGBoost classifiers were employed in the first stage to classify the expression level of the target protein as high (>50 mg/L), medium (between 0.5 and 50 mg/L), or low (<0.5 mg/L). The second-stage framework consisted of three regression models involving support vector machines and random forest to predict the expression yield corresponding to each expression-level class. Independent tests showed that the predictor achieved an overall average accuracy of 75% and a Pearson correlation coefficient of 0.91 for the correctly classified instances. Therefore, our model offers a reliable substitute for numerous trial-and-error experiments to identify the optimal fermentation conditions and yield for RPP. It is also implemented as an open-access webserver, PERISCOPE-Opt (http://periscope-opt.erc.monash.edu).
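The two-stage design described in this abstract (classify the expression level first, then route to a class-specific regressor) can be sketched as follows. Only the yield thresholds come from the abstract; the function names, the feature dictionary, the stand-in callables, and the handling of values that fall exactly on a boundary are assumptions for illustration, not the PERISCOPE-Opt implementation.

```python
from typing import Callable, Dict, Tuple

Features = Dict[str, float]


def expression_class(yield_mg_per_l: float) -> str:
    """Map a yield in mg/L to the expression-level classes named in the abstract."""
    if yield_mg_per_l > 50.0:
        return "high"
    if yield_mg_per_l >= 0.5:  # boundary inclusivity is an assumption
        return "medium"
    return "low"


def two_stage_predict(
    features: Features,
    classify: Callable[[Features], str],
    regressors: Dict[str, Callable[[Features], float]],
) -> Tuple[str, float]:
    """Stage 1 picks the class; stage 2 applies that class's regressor."""
    level = classify(features)
    return level, regressors[level](features)


# Hypothetical stand-ins for trained models (the paper uses XGBoost
# classifiers and SVM / random forest regressors instead).
def toy_classifier(features: Features) -> str:
    return expression_class(features["pilot_yield_mg_per_l"])


toy_regressors: Dict[str, Callable[[Features], float]] = {
    "high": lambda f: 80.0,
    "medium": lambda f: 10.0,
    "low": lambda f: 0.1,
}

if __name__ == "__main__":
    print(two_stage_predict({"pilot_yield_mg_per_l": 12.0}, toy_classifier, toy_regressors))
```

Routing by class lets each regressor be fitted on a narrower yield range, which is one plausible reason the reported correlation is restricted to correctly classified instances.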
|
23
|
The R Language: An Engine for Bioinformatics and Data Science. Life (Basel) 2022; 12:life12050648. [PMID: 35629316 PMCID: PMC9148156 DOI: 10.3390/life12050648] [Citation(s) in RCA: 60] [Impact Index Per Article: 20.0] [Received: 04/15/2022] [Revised: 04/21/2022] [Accepted: 04/23/2022] [Indexed: 12/14/2022] Open
Abstract
The R programming language is approaching its 30th birthday, and in the last three decades it has achieved a prominent role in statistics, bioinformatics, and data science in general. It currently ranks among the top 10 most popular languages worldwide, and its community has produced tens of thousands of extensions and packages, with scopes ranging from machine learning to transcriptome data analysis. In this review, we provide an historical chronicle of how R became what it is today, describing all its current features and capabilities. We also illustrate the major tools of R, such as the current R editors and integrated development environments (IDEs), the R Shiny web server, the R methods for machine learning, and its relationship with other programming languages. We also discuss the role of R in science in general as a driver for reproducibility. Overall, we hope to provide both a complete snapshot of R today and a practical compendium of the major features and applications of this programming language.
|
24
|
Machine Learning Based Analysis of Relations between Antigen Expression and Genetic Aberrations in Childhood B-Cell Precursor Acute Lymphoblastic Leukaemia. J Clin Med 2022; 11:jcm11092281. [PMID: 35566407 PMCID: PMC9100578 DOI: 10.3390/jcm11092281] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/15/2022] [Revised: 04/11/2022] [Accepted: 04/14/2022] [Indexed: 12/23/2022] Open
Abstract
Flow cytometry (FC) is a standard diagnostic tool for B-cell precursor acute lymphoblastic leukemia (BCP-ALL), used to assess the immunophenotype of blast cells. BCP-ALL is often associated with underlying genetic aberrations that have demonstrated prognostic significance and can impact the disease outcome. Since the determination of patient prognosis is already important at the initial phase of BCP-ALL diagnostics, we aimed to reveal specific genetic aberrations by finding specific multiple antigen expression patterns with FC immunophenotyping. The FC immunophenotype data were analysed using machine learning methods (gradient boosting, decision trees, classification rules). The obtained results were verified with the use of repeated cross-validation. The t(12;21)/ETV6-RUNX1 aberration occurs more often when blasts present high expression of CD10, CD38, low CD34, CD45 and specific low expression of CD81. The t(v;11q23)/KMT2A is associated with positive NG2 expression and low CD10, CD34, TdT and CD24. Hyperdiploidy is associated with CD123, CD66c and CD34 expression on blast cells. In turn, high expression of CD81, low expression of CD45, CD22 and lack of CD123 and NG2 indicates that none of the studied aberrations is present. Detecting aberrations in pediatric BCP-ALL, based on the expression of multiple markers, can be done with decent efficiency.
|
25
|
A Comprehensive Review of Computation-Based Metal-Binding Prediction Approaches at the Residue Level. Biomed Res Int 2022; 2022:8965712. [PMID: 35402609 PMCID: PMC8989566 DOI: 10.1155/2022/8965712] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/02/2022] [Accepted: 03/04/2022] [Indexed: 12/29/2022]
Abstract
Clear evidence has shown that metal ions strongly connect and delicately tune the dynamic homeostasis in living bodies. They have been proved to be associated with protein structure, stability, regulation, and function. Even small changes in the concentration of metal ions can shift their effects from natural beneficial functions to harmful. This leads to degenerative diseases, malignant tumors, and cancers. Accurate characterizations and predictions of metalloproteins at the residue level promise informative clues to the investigation of intrinsic mechanisms of protein-metal ion interactions. Compared to biophysical or biochemical wet-lab technologies, computational methods provide open web interfaces of high-resolution databases and high-throughput predictors for efficient investigation of metal-binding residues. This review surveys and details 18 public databases of metal-protein binding. We collect a comprehensive set of 44 computation-based methods and classify them into four categories, namely, learning-, docking-, template-, and meta-based methods. We analyze the benchmark datasets, assessment criteria, feature construction, and algorithms. We also compare several methods on two benchmark testing datasets and include a discussion about currently publicly available predictive tools. Finally, we summarize the challenges and underlying limitations of the current studies and propose several prospective directions concerning the future development of the related databases and methods.
|
26
|
Thumuluri V, Martiny HM, Almagro Armenteros JJ, Salomon J, Nielsen H, Johansen AR. NetSolP: predicting protein solubility in Escherichia coli using language models. Bioinformatics 2022; 38:941-946. [PMID: 35088833 DOI: 10.1093/bioinformatics/btab801] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 07/14/2021] [Revised: 10/13/2021] [Accepted: 11/23/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. RESULTS In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. AVAILABILITY AND IMPLEMENTATION The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hannah-Marie Martiny
- Research Group for Genomic Epidemiology, National Food Institute, Technical University of Denmark, Lyngby 2800, Denmark
- Jose J Almagro Armenteros
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen, Denmark
- Henrik Nielsen
- Department of Health Technology, Technical University of Denmark, Lyngby 2800, Denmark
- Alexander Rosenberg Johansen
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
|
27
|
Xu S, Hu X, Feng Z, Pang J, Sun K, You X, Wang Z. Recognition of Metal Ion Ligand-Binding Residues by Adding Correlation Features and Propensity Factors. Front Genet 2022; 12:793800. [PMID: 35058970 PMCID: PMC8764267 DOI: 10.3389/fgene.2021.793800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2021] [Accepted: 11/30/2021] [Indexed: 11/13/2022] Open
Abstract
The realization of many protein functions is inseparable from the interaction with ligands; in particular, the combination of protein and metal ion ligands performs an important biological function. Currently, accurately identifying metal ion ligand-binding residues by computational approaches remains challenging. In this study, we proposed an improved method to predict the binding residues of 10 metal ion ligands (Zn2+, Cu2+, Fe2+, Fe3+, Co2+, Mn2+, Ca2+, Mg2+, Na+, and K+). Based on the basic feature parameters of amino acids, and physicochemical and predicted structural information, we added two further features: amino acid correlation information and binding residue propensity factors. With the optimized parameters, we used the GBM algorithm to predict metal ion ligand-binding residues. In the obtained results, the Sn and MCC values were over 10.17% and 0.297, respectively. Besides, the Sn and MCC values of transition metals were higher than 34.46% and 0.564, respectively. In order to test the validity of our model, another method (Random Forest) was also used in comparison. The better results of this work indicated that the proposed method would be a valuable tool to predict metal ion ligand-binding residues.
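For readers unfamiliar with the two metrics quoted in this abstract, the sketch below computes sensitivity (Sn) and the Matthews correlation coefficient (MCC) from confusion-matrix counts using their standard definitions. The counts in the example are invented for illustration and are not taken from the paper; the function names are likewise assumptions.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Sn = TP / (TP + FN): the fraction of true binding residues recovered."""
    total = tp + fn
    return tp / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Invented counts for an imbalanced residue-level task (illustration only):
# 100 binding residues among 1000, of which 30 are found, with 10 false alarms.
sn_value = sensitivity(tp=30, fn=70)
mcc_value = mcc(tp=30, tn=890, fp=10, fn=70)
```

Because binding residues are a small minority of all residues, MCC is usually the more informative of the two: it uses all four cells of the confusion matrix, whereas Sn ignores false positives.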
Affiliation(s)
- Shuang Xu
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China; Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, China
- Xiuzhen Hu
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China; Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, China
- Zhenxing Feng
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China; Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, China
- Jing Pang
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China; Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, China
- Kai Sun
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China; Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, China
- Xiaoxiao You
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China; Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, China
- Ziyang Wang
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China; Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, China
|
28
|
Cadet XF, Gelly JC, van Noord A, Cadet F, Acevedo-Rocha CG. Learning Strategies in Protein Directed Evolution. Methods Mol Biol 2022; 2461:225-275. [PMID: 35727454 DOI: 10.1007/978-1-0716-2152-3_15] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/13/2022]
Abstract
Synthetic biology is a fast-evolving research field that combines biology and engineering principles to develop new biological systems for medical, pharmacological, and industrial applications. Synthetic biologists use iterative "design, build, test, and learn" cycles to efficiently engineer genetic systems that are reliable, reproducible, and predictable. Protein engineering by directed evolution can benefit from such a systematic engineering approach for various reasons. Learning can be carried out before starting, throughout or after finalizing a directed evolution project. Computational tools, bioinformatics, and scanning mutagenesis methods can be excellent starting points, while molecular dynamics simulations and other strategies can guide engineering efforts. Similarly, studying protein intermediates along evolutionary pathways offers fascinating insights into the molecular mechanisms shaped by evolution. The learning step of the cycle is not only crucial for proteins or enzymes that are not suitable for high-throughput screening or selection systems, but it is also valuable for any platform that can generate a large amount of data that can be aided by machine learning algorithms. The main challenge in protein engineering is to predict the effect of a single mutation on one functional parameter-to say nothing of several mutations on multiple parameters. This is largely due to nonadditive mutational interactions, known as epistatic effects-beneficial mutations present in a genetic background may not be beneficial in another genetic background. In this work, we provide an overview of experimental and computational strategies that can guide the user to learn protein function at different stages in a directed evolution project. We also discuss how epistatic effects can influence the success of directed evolution projects. 
Since machine learning is gaining momentum in protein engineering and the field is becoming more interdisciplinary thanks to collaboration between mathematicians, computational scientists, engineers, molecular biologists, and chemists, we provide a general workflow that familiarizes nonexperts with the basic concepts, dataset requirements, learning approaches, model capabilities and performance metrics of this intriguing area. Finally, we also provide some practical recommendations on how machine learning can harness epistatic effects for engineering proteins in an "outside-the-box" way.
Affiliation(s)
- Xavier F Cadet
- PEACCEL, Artificial Intelligence Department, Paris, France
- Jean Christophe Gelly
- Laboratoire d'Excellence GR-Ex, Paris, France
- BIGR, DSIMB, UMR_S1134, INSERM, University of Paris & University of Reunion, Paris, France
- Frédéric Cadet
- Laboratoire d'Excellence GR-Ex, Paris, France
- BIGR, DSIMB, UMR_S1134, INSERM, University of Paris & University of Reunion, Paris, France
|
29
|
Madani M, Lin K, Tarakanova A. DSResSol: A Sequence-Based Solubility Predictor Created with Dilated Squeeze Excitation Residual Networks. Int J Mol Sci 2021; 22:13555. [PMID: 34948354 PMCID: PMC8704505 DOI: 10.3390/ijms222413555] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/29/2021] [Revised: 12/13/2021] [Accepted: 12/14/2021] [Indexed: 11/16/2022] Open
Abstract
Protein solubility is an important thermodynamic parameter that is critical for the characterization of a protein's function, and a key determinant for the production yield of a protein in both the research setting and within industrial (e.g., pharmaceutical) applications. Experimental approaches to predict protein solubility are costly, time-consuming, and frequently offer only low success rates. To reduce cost and expedite the development of therapeutic and industrially relevant proteins, a highly accurate computational tool for predicting protein solubility from protein sequence is sought. While a number of in silico prediction tools exist, they suffer from relatively low prediction accuracy, bias toward the soluble proteins, and limited applicability for various classes of proteins. In this study, we developed a novel deep learning sequence-based solubility predictor, DSResSol, that takes advantage of the integration of squeeze excitation residual networks with dilated convolutional neural networks and outperforms all existing protein solubility prediction models. This model captures the frequently occurring amino acid k-mers and their local and global interactions and highlights the importance of identifying long-range interaction information between amino acid k-mers to achieve improved accuracy, using only protein sequence as input. DSResSol outperforms all available sequence-based solubility predictors by at least 5% in terms of accuracy when evaluated by two different independent test sets. Compared to existing predictors, DSResSol not only reduces prediction bias for insoluble proteins but also predicts soluble proteins within the test sets with an accuracy that is at least 13% higher than existing models. We derive the key amino acids, dipeptides, and tripeptides contributing to protein solubility, identifying glutamic acid and serine as critical amino acids for protein solubility prediction. 
Overall, DSResSol can be used for the fast, reliable, and inexpensive prediction of a protein's solubility to guide experimental design.
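The abstract refers to amino acid k-mers (single residues, dipeptides, tripeptides) as the units whose occurrence the model relates to solubility. As a rough illustration of what counting such k-mers in a sequence looks like, here is a minimal sketch; it is not part of DSResSol, which learns from sequences with a neural network, and both the function name `kmer_counts` and the toy sequence are assumptions.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count overlapping k-mers in a protein sequence (one-letter codes)."""
    seq = sequence.upper()
    # A sequence shorter than k yields an empty range and an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy sequence (invented): glutamic acid (E) and serine (S) each occur twice.
toy_sequence = "MKESSEK"
residue_counts = kmer_counts(toy_sequence, 1)
dipeptide_counts = kmer_counts(toy_sequence, 2)
```

Counts like these are the kind of handcrafted feature that earlier sequence-based predictors relied on; the abstract's point is that modelling long-range interactions between k-mers improves on them.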
Affiliation(s)
- Mohammad Madani
- Department of Mechanical Engineering, University of Connecticut, Storrs, CT 06269, USA
- Department of Computer Science & Engineering, University of Connecticut, Storrs, CT 06269, USA
- Kaixiang Lin
- Department of Computer Science & Engineering, University of Connecticut, Storrs, CT 06269, USA
- Anna Tarakanova
- Department of Mechanical Engineering, University of Connecticut, Storrs, CT 06269, USA
- Department of Biomedical Engineering, University of Connecticut, Storrs, CT 06269, USA
|
30
|
Muller C, Rabal O, Diaz Gonzalez C. Artificial Intelligence, Machine Learning, and Deep Learning in Real-Life Drug Design Cases. Methods Mol Biol 2021; 2390:383-407. [PMID: 34731478 DOI: 10.1007/978-1-0716-1787-8_16] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 02/06/2023]
Abstract
The discovery and development of drugs is a long and expensive process with a high attrition rate. Computational drug discovery contributes to ligand discovery and optimization, by using models that describe the properties of ligands and their interactions with biological targets. In recent years, artificial intelligence (AI) has made remarkable modeling progress, driven by new algorithms and by the increase in computing power and storage capacities, which allow the processing of large amounts of data in a short time. This review provides the current state of the art of AI methods applied to drug discovery, with a focus on structure- and ligand-based virtual screening, library design and high-throughput analysis, drug repurposing and drug sensitivity, de novo design, chemical reactions and synthetic accessibility, ADMET, and quantum mechanics.
Affiliation(s)
- Christophe Muller
- Evotec (France) SAS, Computational Drug Discovery, Integrated Drug Discovery, Toulouse, France
- Obdulia Rabal
- Evotec (France) SAS, Computational Drug Discovery, Integrated Drug Discovery, Toulouse, France
|
31
|
Martiny HM, Armenteros JJA, Johansen AR, Salomon J, Nielsen H. Deep protein representations enable recombinant protein expression prediction. Comput Biol Chem 2021; 95:107596. [PMID: 34775287 DOI: 10.1016/j.compbiolchem.2021.107596] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/11/2021] [Revised: 10/21/2021] [Accepted: 10/21/2021] [Indexed: 11/19/2022]
Abstract
A crucial process in the production of industrial enzymes is recombinant gene expression, which aims to induce enzyme overexpression of the genes in a host microbe. Current approaches for securing overexpression rely on molecular tools such as adjusting the recombinant expression vector, adjusting cultivation conditions, or performing codon optimizations. However, such strategies are time-consuming, and an alternative strategy would be to select genes for better compatibility with the recombinant host. Several methods for predicting soluble expression are available; however, they are all optimized for the expression host Escherichia coli and do not consider the possibility of an expressed protein not being soluble. We show that these tools are not suited for predicting expression potential in the industrially important host Bacillus subtilis. Instead, we build a B. subtilis-specific machine learning model for expressibility prediction. Given millions of unlabelled proteins and a small labeled dataset, we can successfully train such a predictive model. The unlabeled proteins provide a performance boost relative to using amino acid frequencies of the labeled proteins as input. On average, we obtain a modest performance of 0.64 area-under-the-curve (AUC) and 0.2 Matthews correlation coefficient (MCC). However, we find that this is sufficient for the prioritization of expression candidates for high-throughput studies. Moreover, the predicted class probabilities are correlated with expression levels. A number of features related to protein expression, including base frequencies and solubility, are captured by the model.
Affiliation(s)
- Hannah-Marie Martiny
- Research Group for Genomic Epidemiology, National Food Institute, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark.
- Jose Juan Almagro Armenteros
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen, Denmark
- Jesper Salomon
- Enzyme Research, Novozymes A/S, Krogshøjvej 36, 2880 Bagsværd, Denmark
- Henrik Nielsen
- Department of Health Technology, Section for Bioinformatics, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
|
32
|
Xuan P, Chen B, Zhang T, Yang Y. Prediction of Drug-Target Interactions Based on Network Representation Learning and Ensemble Learning. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:2671-2681. [PMID: 32340959 DOI: 10.1109/tcbb.2020.2989765] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 06/11/2023]
Abstract
Identifying interactions between drugs and target proteins is a critical step in the drug development process, as it helps identify new targets for drugs and accelerate drug development. The number of known drug-protein interactions (positive samples) is much lower than that of the unknown ones (negative samples), which forms a class imbalance. Most previous methods only utilised part of the negative samples to train the prediction model, so most of the information on negative samples was neglected. Therefore, a new method must be developed to predict candidate drug-related proteins and fully utilise negative samples to improve prediction performance. We present a method based on non-negative matrix factorisation and gradient boosting decision tree (GBDT), named NGDTP, to identify the candidate drug-protein interactions. NGDTP integrates multiple kinds of protein similarities, drugs-proteins interactions, and multiple kinds of drugs similarities at different levels, including target proteins of drugs, drug-related diseases, and side effects of drugs. We propose a network representation learning method based on matrix factorisation to learn low-dimensional vector representations of drug and protein nodes. On the basis of these low-dimensional node representations, a GBDT-based prediction model was constructed and it obtains the association scores through establishing multiple decision trees for a drug-protein pairs. NGDTP is an ensemble learning model that fully utilises all the negative samples to effectively alleviate the problem of class imbalance. NGDTP achieves superior prediction performance when it is compared with several state-of-the-art methods. The experimental results indicate that NGDTP also retrieves more actual drug-protein interactions in the top part of prediction result, which drew significant attention from the biologists. 
In addition, case studies on 10 drugs further confirmed the ability of the NGDTP to identify potential candidate proteins for drugs.
|
33
|
Hu X, Feng C, Zhou Y, Harrison A, Chen M. DeepTrio: a ternary prediction system for protein-protein interaction using mask multiple parallel convolutional neural networks. Bioinformatics 2021; 38:694-702. [PMID: 34694333 PMCID: PMC8756175 DOI: 10.1093/bioinformatics/btab737] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 04/09/2021] [Revised: 10/05/2021] [Accepted: 10/20/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Protein-protein interaction (PPI), as a relative property, is determined by two binding proteins, which brings a great challenge to design an expert model with an unbiased learning architecture and a superior generalization performance. Additionally, few efforts have been made to allow PPI predictors to discriminate between relative properties and intrinsic properties. RESULTS We present a sequence-based approach, DeepTrio, for PPI prediction using mask multiple parallel convolutional neural networks. Experimental evaluations show that DeepTrio achieves a better performance over several state-of-the-art methods in terms of various quality metrics. Besides, DeepTrio is extended to provide additional insights into the contribution of each input neuron to the prediction results. AVAILABILITY AND IMPLEMENTATION We provide an online application at http://bis.zju.edu.cn/deeptrio. The DeepTrio models and training data are deposited at https://github.com/huxiaoti/deeptrio.git. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiaotian Hu
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou 310058, China
- Cong Feng
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou 310058, China
- Yincong Zhou
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou 310058, China
- Andrew Harrison
- Department of Mathematical Sciences, University of Essex, Colchester CO4 3SQ, UK
- Ming Chen
- To whom correspondence should be addressed.
|
34
|
Agrawal S, Sisodia DS, Nagwani NK. Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences. Med Biol Eng Comput 2021; 59:2297-2310. [PMID: 34545514 DOI: 10.1007/s11517-021-02436-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/08/2020] [Accepted: 08/29/2021] [Indexed: 11/24/2022]
Abstract
Advances in high-throughput techniques have produced a large number of unknown protein sequences (UPS). Functional characterization of UPS is significant for the investigation of disease symptoms and drug repositioning. Protein subcellular localization is imperative for the functional characterization of protein sequences. Diverse techniques are used on protein sequences for feature extraction. However, a single feature extraction technique often leads to poor prediction performance. In this paper, two feature augmentations are described, built from sequence-induced, physicochemical, and evolutionary information of the amino acid residues. The augmented features preserve sequence-order information and protein-residue properties. Two bacterial protein datasets, Gram-Positive (G+) and Gram-Negative (G-), are utilized for the experimental work. After performing essential preprocessing on the protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train different individual and ensemble classifiers, such as decision tree (C4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF), with fivefold cross-validation. Prediction results demonstrate that the overall accuracy reported by C4.5 is highest, at 99.57% on the G+ and 97.47% on the G- datasets with known protein sequences. For the UPS, overall accuracy is 85.17% on the G+ dataset with SVM and 82.45% on the G- dataset with MLP.
Affiliation(s)
- Saurabh Agrawal
- Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India.
- Dilip Singh Sisodia
- Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India
- Naresh Kumar Nagwani
- Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India
|
35
|
Melo MCR, Maasch JRMA, de la Fuente-Nunez C. Accelerating antibiotic discovery through artificial intelligence. Commun Biol 2021; 4:1050. [PMID: 34504303 PMCID: PMC8429579 DOI: 10.1038/s42003-021-02586-0] [Citation(s) in RCA: 87] [Impact Index Per Article: 21.8] [Received: 02/18/2021] [Accepted: 07/16/2021] [Indexed: 02/07/2023] Open
Abstract
By targeting invasive organisms, antibiotics insert themselves into the ancient struggle of the host-pathogen evolutionary arms race. As pathogens evolve tactics for evading antibiotics, therapies decline in efficacy and must be replaced, distinguishing antibiotics from most other forms of drug development. Together with a slow and expensive antibiotic development pipeline, the proliferation of drug-resistant pathogens drives urgent interest in computational methods that promise to expedite candidate discovery. Strides in artificial intelligence (AI) have encouraged its application to multiple dimensions of computer-aided drug design, with increasing application to antibiotic discovery. This review describes AI-facilitated advances in the discovery of both small molecule antibiotics and antimicrobial peptides. Beyond the essential prediction of antimicrobial activity, emphasis is also given to antimicrobial compound representation, determination of drug-likeness traits, antimicrobial resistance, and de novo molecular design. Given the urgency of the antimicrobial resistance crisis, we analyze uptake of open science best practices in AI-driven antibiotic discovery and argue for openness and reproducibility as a means of accelerating preclinical research. Finally, trends in the literature and areas for future inquiry are discussed, as artificially intelligent enhancements to drug discovery at large offer many opportunities for future applications in antibiotic development.
Affiliation(s)
- Marcelo C R Melo
- Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Departments of Bioengineering and Chemical and Biomolecular Engineering, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA
- Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, PA, USA
| | - Jacqueline R M A Maasch
- Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Departments of Bioengineering and Chemical and Biomolecular Engineering, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA
- Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, PA, USA
- Department of Computer and Information Science, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA
- Cesar de la Fuente-Nunez
- Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
- Departments of Bioengineering and Chemical and Biomolecular Engineering, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA.
- Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, PA, USA.
36
De Paoli-Iseppi R, Gleeson J, Clark MB. Isoform Age - Splice Isoform Profiling Using Long-Read Technologies. Front Mol Biosci 2021; 8:711733. [PMID: 34409069 PMCID: PMC8364947 DOI: 10.3389/fmolb.2021.711733] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 05/19/2021] [Accepted: 07/19/2021] [Indexed: 01/12/2023] Open
Abstract
Alternative splicing (AS) of RNA is a key mechanism that results in the expression of multiple transcript isoforms from single genes and leads to an increase in the complexity of both the transcriptome and proteome. Regulation of AS is critical for the correct functioning of many biological pathways, while disruption of AS can be directly pathogenic in diseases such as cancer or cause risk for complex disorders. Current short-read sequencing technologies achieve high read depth but are limited in their ability to resolve complex isoforms. In this review we examine how long-read sequencing (LRS) technologies can address this challenge by covering the entire RNA sequence in a single read and thereby distinguish isoform changes that could impact RNA regulation or protein function. Coupling LRS with technologies such as single cell sequencing, targeted sequencing and spatial transcriptomics is producing a rapidly expanding suite of technological approaches to profile alternative splicing at the isoform level with unprecedented detail. In addition, integrating LRS with genotype now allows the impact of genetic variation on isoform expression to be determined. Recent results demonstrate the potential of these techniques to elucidate the landscape of splicing, including in tissues such as the brain where AS is particularly prevalent. Finally, we also discuss how AS can impact protein function, potentially leading to novel therapeutic targets for a range of diseases.
Affiliation(s)
- Michael B. Clark
- Centre for Stem Cell Systems, Department of Anatomy and Physiology, The University of Melbourne, Parkville, VIC, Australia
37
Prediction of Protein Solubility Based on Sequence Feature Fusion and DDcCNN. Interdiscip Sci 2021; 13:703-716. [PMID: 34236625 DOI: 10.1007/s12539-021-00456-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2021] [Revised: 06/21/2021] [Accepted: 06/23/2021] [Indexed: 10/20/2022]
Abstract
BACKGROUND Prediction of protein solubility is an indispensable prerequisite for pharmaceutical research and production. The objective of this work is to design a new model for predicting protein solubility that uses protein sequence feature fusion and deep dual-channel convolutional neural networks (DDcCNN) to improve on the performance of existing prediction models. METHODS The redundancy of the raw protein sequences is reduced with CD-HIT. Four subsequences are built from each protein sequence: one global and three local. The global subsequence is the entire protein sequence, and the local subsequences are obtained by moving a sliding window according to a set of rules. G-gap features are extracted from these four subsequences to construct a mixed matrix that serves as the input of one channel, which is composed of three convolutional layers. Additional features are extracted with the SCRATCH tool as the input of the other channel, which consists of a single convolution, in order to find hidden relationships and improve the accuracy of the predictor. The outputs of the two parallel channels are concatenated as the input of the hidden layer, and the solubility prediction is obtained in the output layer. The best protein solubility prediction model is selected through comparative experiments on different frameworks. RESULTS The performance indicators of the DDcCNN model are as follows: accuracy of 77.82%, Matthews correlation coefficient of 0.57, sensitivity of 76.13% and specificity of 79.32%. Comparative experiments show that the overall performance of the DDcCNN model is better than that of existing models (GCNN, LCNN and PCNN). The related models and data are publicly deposited at http://www.ddccnn.wang.
CONCLUSION The satisfactory performance of the DDcCNN model shows that these features and flexible computational methodologies can strengthen existing models for protein solubility prediction. The model could be applied in several settings, such as preselecting initial targets that are soluble or altering the solubility of target proteins, and can thus help to reduce production cost.
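The g-gap feature step mentioned in the METHODS can be illustrated with a minimal sketch. It assumes the common definition of g-gap dipeptide composition (the normalized frequency of residue pairs separated by g positions); the abstract does not give the exact definition or the sliding-window rules, so the function name `g_gap_features` and the normalization are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order ("AA", "AC", ..., "YY").
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def g_gap_features(seq, g):
    """Return 400 normalized frequencies of residue pairs separated by g residues.

    Assumed definition (not taken from the paper): for each position i, the
    pair (seq[i], seq[i + g + 1]) is counted, and counts are divided by the
    number of valid pairs.
    """
    counts = dict.fromkeys(PAIRS, 0)
    span = g + 1
    total = 0
    for i in range(len(seq) - span):
        pair = seq[i] + seq[i + span]
        if pair in counts:  # skip non-standard residues such as 'X'
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]


# Example: features for a short hypothetical sequence with a gap of 1.
features = g_gap_features("ACAC", 1)
```

Stacking such vectors for several values of g and for the global and local subsequences would give a matrix input of the kind the abstract describes, although the exact layout used by DDcCNN is not stated there.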
38
Sequence-Based Prediction of Transmembrane Protein Crystallization Propensity. Interdiscip Sci 2021; 13:693-702. [PMID: 34143353 DOI: 10.1007/s12539-021-00448-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2020] [Revised: 05/31/2021] [Accepted: 06/04/2021] [Indexed: 10/21/2022]
Abstract
Transmembrane proteins play a vital role in cellular activities. Several techniques can determine transmembrane protein structures, and X-ray crystallography is the primary one. However, owing to the special properties of transmembrane proteins, it is still hard to determine their structures by X-ray crystallography. To reduce experimental cost and improve experimental efficiency, it is of great significance to develop computational methods for predicting the crystallization propensity of transmembrane proteins. In this work, we propose a sequence-based machine learning method, Prediction of TransMembrane protein Crystallization propensity (PTMC). First, we obtained several general sequence features and specific encoded features of relative solvent accessibility and hydrophobicity. Second, feature selection was employed to filter out redundant and irrelevant features; the optimal feature subset is composed of hydrophobicity, amino acid composition and relative solvent accessibility. Finally, we chose extreme gradient boosting after comparing it with several other machine learning methods. Comparative results on the independent test set indicate that PTMC outperforms state-of-the-art sequence-based methods in terms of sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC) and area under the receiver operating characteristic curve (AUC). Compared with two competitors, BCrystal and TMCrys, PTMC improves sensitivity by 0.132 and 0.179, specificity by 0.014 and 0.127, accuracy by 0.037 and 0.192, MCC by 0.128 and 0.362, and AUC by 0.027 and 0.125, respectively.
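For readers comparing the reported figures, the threshold-based metrics named in this abstract can be computed from a binary confusion matrix as sketched below. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper; AUC is omitted because it needs ranked scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = binary_metrics(tp=40, fp=10, tn=40, fn=10)
```

Because MCC uses all four cells of the confusion matrix, it can move much more than accuracy when the classes are imbalanced, which is one reason abstracts in this area tend to report both.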
39
Wu X, Yu L. EPSOL: sequence-based protein solubility prediction using multidimensional embedding. Bioinformatics 2021; 37:4314-4320. [PMID: 34145885 DOI: 10.1093/bioinformatics/btab463] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 03/16/2021] [Revised: 05/18/2021] [Accepted: 06/17/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The heterologous expression of recombinant protein requires host cells, such as Escherichia coli, and the solubility of protein greatly affects the protein yield. A novel and highly accurate solubility predictor that concurrently improves the production yield and minimizes production cost, and that forecasts protein solubility in an E. coli expression system before the actual experimental work is highly sought. RESULTS In this paper, EPSOL, a novel deep learning architecture for the prediction of protein solubility in an E. coli expression system, which automatically obtains comprehensive protein feature representations using multidimensional embedding, is presented. EPSOL outperformed all existing sequence-based solubility predictors and achieved 0.79 in accuracy and 0.58 in Matthew's correlation coefficient. The higher performance of EPSOL permits large-scale screening for sequence variants with enhanced manufacturability and predicts the solubility of new recombinant proteins in an E. coli expression system with greater reliability. AVAILABILITY AND IMPLEMENTATION EPSOL's best model and results can be downloaded from GitHub (https://github.com/LiangYu-Xidian/EPSOL). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiang Wu
- School of Computer Science and Technology, Xidian University, Xi'an 710071, Shaanxi, China
- Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an 710071, Shaanxi, China
40
Ptak-Kaczor M, Banach M, Stapor K, Fabian P, Konieczny L, Roterman I. Solubility and Aggregation of Selected Proteins Interpreted on the Basis of Hydrophobicity Distribution. Int J Mol Sci 2021; 22:ijms22095002. [PMID: 34066830 PMCID: PMC8125953 DOI: 10.3390/ijms22095002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/19/2021] [Revised: 05/03/2021] [Accepted: 05/06/2021] [Indexed: 11/30/2022] Open
Abstract
Protein solubility depends on the compatibility of the protein surface with the polar aqueous environment. The exposure of polar residues on the protein surface promotes the protein's solubility in the polar environment. The aqueous environment also influences the folding process by favoring the concentration of hydrophobic residues in the center of the molecule with the simultaneous exposure of polar residues. The degree to which the residue distribution matches this model, with hydrophobic residues concentrated in the center of the molecule and polar residues exposed, is determined by the amino acid sequence of the chain. The fuzzy oil drop model quantifies how closely the hydrophobicity distribution observed in a protein matches the 3D Gaussian function, which expresses an idealized distribution that meets the preferences of the polar water environment. The varied degrees of agreement between the observed and idealized distributions allow prediction of a protein's preference for interacting with molecules of different polarity, water molecules in particular. This paper analyzes a set of proteins with different levels of hydrophobicity distribution in the context of the solubility of a given protein and the possibility of complex formation.
Affiliation(s)
- Magdalena Ptak-Kaczor
- Department of Bioinformatics and Telemedicine, Jagiellonian University—Medical College, Medyczna 7, 30-688 Kraków, Poland; (M.P.-K.); (M.B.)
- Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, Łojasiewicza 11, 30-348 Kraków, Poland
- Mateusz Banach
- Department of Bioinformatics and Telemedicine, Jagiellonian University—Medical College, Medyczna 7, 30-688 Kraków, Poland; (M.P.-K.); (M.B.)
- Katarzyna Stapor
- Institute of Computer Science, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland; (K.S.); (P.F.)
- Piotr Fabian
- Institute of Computer Science, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland; (K.S.); (P.F.)
- Leszek Konieczny
- Chair of Medical Biochemistry—Jagiellonian University—Medical College, Kopernika 7, 31-034 Kraków, Poland;
- Irena Roterman
- Department of Bioinformatics and Telemedicine, Jagiellonian University—Medical College, Medyczna 7, 30-688 Kraków, Poland; (M.P.-K.); (M.B.)
- Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, Łojasiewicza 11, 30-348 Kraków, Poland
41
Bhandari BK, Gardner PP, Lim CS. Solubility-Weighted Index: fast and accurate prediction of protein solubility. Bioinformatics 2021; 36:4691-4698. [PMID: 32559287 PMCID: PMC7750957 DOI: 10.1093/bioinformatics/btaa578] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 12/16/2019] [Revised: 05/05/2020] [Accepted: 06/12/2020] [Indexed: 12/14/2022] Open
Abstract
Motivation Recombinant protein production is a widely used technique in the biotechnology and biomedical industries, yet only a quarter of target proteins are soluble and can therefore be purified. Results We have discovered that global structural flexibility, which can be modeled by normalized B-factors, accurately predicts the solubility of 12 216 recombinant proteins expressed in Escherichia coli. We have optimized these B-factors, and derived a new set of values for solubility scoring that further improves prediction accuracy. We call this new predictor the ‘Solubility-Weighted Index’ (SWI). Importantly, SWI outperforms many existing protein solubility prediction tools. Furthermore, we have developed ‘SoDoPE’ (Soluble Domain for Protein Expression), a web interface that allows users to choose a protein region of interest for predicting and maximizing both protein expression and solubility. Availability and implementation The SoDoPE web server and source code are freely available at https://tisigner.com/sodope and https://github.com/Gardner-BinfLab/TISIGNER-ReactJS, respectively. The code and data for reproducing our analysis can be found at https://github.com/Gardner-BinfLab/SoDoPE_paper_2020. Supplementary information Supplementary data are available at Bioinformatics online.
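A per-residue scoring scheme of the kind this abstract describes can be sketched as follows. The sketch assumes the index is a simple average of per-residue weights, which the abstract does not spell out, and the weight values below are made-up placeholders rather than the published SWI weights; `weighted_index` and `best_window` are hypothetical helper names.

```python
# Placeholder weights for illustration only; NOT the published SWI values.
HYPOTHETICAL_WEIGHTS = {
    "A": 0.80, "D": 0.95, "E": 1.00, "G": 0.75, "K": 0.90,
    "L": 0.65, "S": 0.70, "W": 0.50,
}


def weighted_index(seq, weights):
    """Average the per-residue weights over a sequence (assumed scoring rule)."""
    values = [weights[aa] for aa in seq if aa in weights]
    return sum(values) / len(values) if values else 0.0


def best_window(seq, weights, width):
    """Return (start, score) of the fixed-width region with the highest index."""
    scored = [
        (start, weighted_index(seq[start:start + width], weights))
        for start in range(len(seq) - width + 1)
    ]
    return max(scored, key=lambda item: item[1])


# Scan a hypothetical sequence for its highest-scoring 3-residue region.
start, score = best_window("WLLEKD", HYPOTHETICAL_WEIGHTS, 3)
```

The window scan only loosely mirrors the idea of choosing a protein region of interest; the actual SoDoPE interface and its optimization procedure are documented at the URLs given in the abstract.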
Affiliation(s)
- Bikash K Bhandari
- Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand
- Paul P Gardner
- Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand; Biomolecular Interaction Centre, University of Canterbury, Christchurch, New Zealand
- Chun Shen Lim
- Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand
42
Narayanan H, Dingfelder F, Butté A, Lorenzen N, Sokolov M, Arosio P. Machine Learning for Biologics: Opportunities for Protein Engineering, Developability, and Formulation. Trends Pharmacol Sci 2021; 42:151-165. [DOI: 10.1016/j.tips.2020.12.004] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 08/19/2020] [Revised: 12/10/2020] [Accepted: 12/16/2020] [Indexed: 12/19/2022]
43
Mall R, Elbasir A, Almeer H, Islam Z, Kolatkar PR, Chawla S, Ullah E. A Modelling Framework for Embedding-based Predictions for Compound-Viral Protein Activity. Bioinformatics 2021; 37:2544-2555. [PMID: 33638345 PMCID: PMC8163000 DOI: 10.1093/bioinformatics/btab130] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/09/2020] [Revised: 02/16/2021] [Accepted: 02/24/2021] [Indexed: 11/14/2022] Open
Abstract
Motivation A global effort is underway to identify compounds for the treatment of COVID-19. Since de novo compound design is an extremely long, time-consuming, and expensive process, efforts are underway to discover existing compounds that can be repurposed for COVID-19 and new viral diseases. Model We propose a machine learning representation framework that uses deep learning induced vector embeddings of compounds and viral proteins as features to predict compound-viral protein activity. The prediction model in turn uses a consensus framework to rank approved compounds against viral proteins of interest. Results Our consensus framework achieves a high mean Pearson correlation of 0.916, a mean R² of 0.840 and a low mean squared error of 0.313 for the task of compound-viral protein activity prediction on an independent test set. As a use case, we identify a ranked list of 47 compounds common to three main proteins of the SARS-COV-2 virus (PL-PRO, 3CL-PRO and Spike protein) as potential targets, including 21 antivirals, 15 anticancer, 5 antibiotics and 6 other investigational human compounds. We perform additional molecular docking simulations to demonstrate that the majority of these compounds have low binding energies and thus high binding affinity, with the potential to be effective against the SARS-COV-2 virus. Availability All the source code and data are available at: https://github.com/raghvendra5688/Drug-Repurposing and https://dx.doi.org/10.17632/8rrwnbcgmx.3. We also implemented a web-server at: https://machinelearning-protein.qcri.org/index.html. Supplementary information Supplementary data are available at Bioinformatics online.
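The three regression metrics quoted in the Results (Pearson correlation, R² and mean squared error) follow standard definitions, sketched below in plain Python. The helper names and the toy values are illustrative assumptions; the paper's own evaluation code lives in the repository linked in the abstract and may differ in detail.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    mt = sum(y_true) / len(y_true)
    mp = sum(y_pred) / len(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Toy values: predictions shifted by a constant keep perfect correlation
# but are penalized by both MSE and R².
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 3.0, 4.0]
```

The toy case shows why the abstract reports all three numbers: correlation measures only whether predictions move with the observations, while R² and MSE also penalize systematic offsets.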
Affiliation(s)
- Raghvendra Mall
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
- Abdurrahman Elbasir
- ICT Division, College of Science and Engineering, Hamad Bin Khalifa University, Doha, 34110, Qatar
- Hossam Almeer
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
- Zeyaul Islam
- Qatar Biomedical Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
- Prasanna R Kolatkar
- Qatar Biomedical Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
- Sanjay Chawla
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
- Ehsan Ullah
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
44
Chen J, Zheng S, Zhao H, Yang Y. Structure-aware protein solubility prediction from sequence through graph convolutional network and predicted contact map. J Cheminform 2021; 13:7. [PMID: 33557952 PMCID: PMC7869490 DOI: 10.1186/s13321-021-00488-1] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 10/21/2020] [Accepted: 01/20/2021] [Indexed: 11/26/2022] Open
Abstract
Protein solubility is significant in producing new soluble proteins that can reduce the cost of biocatalysts or therapeutic agents. Therefore, a computational model that accurately predicts protein solubility from the amino acid sequence is highly desired. Many methods have been developed, but they are mostly based on one-dimensional embeddings of amino acids, which are limited in capturing spatial structural information. In this study, we have developed a new structure-aware method, GraphSol, to predict protein solubility with an attentive graph convolutional network (GCN), where the protein topology attribute graph is constructed from contact maps predicted from the sequence alone. GraphSol was shown to substantially outperform other sequence-based methods. The model was proven to be stable, with a consistent R² of 0.48 in both the cross-validation and the independent test on the eSOL dataset. To the best of our knowledge, this is the first study to utilize the GCN for sequence-based protein solubility prediction. More importantly, this architecture could be easily extended to other protein prediction tasks requiring a raw protein sequence.
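The idea of building a residue graph from a predicted contact map and passing features over it can be sketched in a few lines. This is a generic graph-convolution propagation step, not GraphSol's attentive architecture: the 0.5 contact threshold, the symmetric normalization, the assumption of a symmetric contact map, and the function names are all illustrative choices, and learned weights and nonlinearities are left out.

```python
import math


def normalized_adjacency(contact_probs, threshold=0.5):
    """Turn a symmetric contact-probability matrix into D^-1/2 (A + I) D^-1/2.

    The threshold is an assumed cut-off; self-loops are added so each residue
    keeps its own features during propagation.
    """
    n = len(contact_probs)
    adj = [
        [1.0 if (i == j or contact_probs[i][j] > threshold) else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in adj]
    return [
        [adj[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def propagate(norm_adj, features):
    """One message-passing step: each residue aggregates its neighbors' features."""
    n = len(norm_adj)
    dim = len(features[0])
    return [
        [sum(norm_adj[i][k] * features[k][f] for k in range(n)) for f in range(dim)]
        for i in range(n)
    ]


# Hypothetical 3-residue example: residues 0 and 1 are in contact, 2 is isolated.
probs = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.2], [0.1, 0.2, 0.0]]
node_features = [[1.0], [3.0], [5.0]]
smoothed = propagate(normalized_adjacency(probs), node_features)
```

In a full GCN layer the aggregated features would then be multiplied by a learned weight matrix and passed through a nonlinearity; stacking such layers lets information travel between residues that are distant in sequence but close in the predicted structure.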
Affiliation(s)
- Jianwen Chen
- School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China
- Shuangjia Zheng
- School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China
- Huiying Zhao
- Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, Guangzhou, China
- Yuedong Yang
- School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China.
- Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-Sen University), Guangzhou, 510000, China.
45
Makowski EK, Wu L, Gupta P, Tessier PM. Discovery-stage identification of drug-like antibodies using emerging experimental and computational methods. MAbs 2021; 13:1895540. [PMID: 34313532 PMCID: PMC8346245 DOI: 10.1080/19420862.2021.1895540] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 11/10/2020] [Revised: 02/05/2021] [Accepted: 02/22/2021] [Indexed: 11/30/2022] Open
Abstract
There is intense and widespread interest in developing monoclonal antibodies as therapeutic agents to treat diverse human disorders. During early-stage antibody discovery, hundreds to thousands of lead candidates are identified, and those that lack optimal physical and chemical properties must be deselected as early as possible to avoid problems later in drug development. It is particularly challenging to characterize such properties for large numbers of candidates with the low antibody quantities, concentrations, and purities that are available at the discovery stage, and to predict concentrated antibody properties (e.g., solubility, viscosity) required for efficient formulation, delivery, and efficacy. Here we review key recent advances in developing and implementing high-throughput methods for identifying antibodies with desirable in vitro and in vivo properties, including favorable antibody stability, specificity, solubility, pharmacokinetics, and immunogenicity profiles, that together encompass overall drug developability. In particular, we highlight impressive recent progress in developing computational methods for improving rational antibody design and prediction of drug-like behaviors that hold great promise for reducing the amount of required experimentation. We also discuss outstanding challenges that will need to be addressed in the future to fully realize the great potential of using such analysis for minimizing development times and improving the success rate of antibody candidates in the clinic.
Affiliation(s)
- Emily K. Makowski
- Department of Pharmaceutical Sciences, University of Michigan, Ann Arbor, MI, USA
- Biointerfaces Institute, University of Michigan, Ann Arbor, MI, USA
- Lina Wu
- Biointerfaces Institute, University of Michigan, Ann Arbor, MI, USA
- Department of Chemical Engineering
- Priyanka Gupta
- Department of Biochemistry and Biophysics, Rensselaer Polytechnic Institute, Troy, NY, USA
- Biotherapeutics Discovery Department, Boehringer Ingelheim, Ridgefield, CT, USA
- Peter M. Tessier
- Department of Pharmaceutical Sciences, University of Michigan, Ann Arbor, MI, USA
- Biointerfaces Institute, University of Michigan, Ann Arbor, MI, USA
- Department of Chemical Engineering
- Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI, USA
46
Esmaili N, Buchlak QD, Piccardi M, Kruger B, Girosi F. Multichannel mixture models for time-series analysis and classification of engagement with multiple health services: An application to psychology and physiotherapy utilization patterns after traffic accidents. Artif Intell Med 2020; 111:101997. [PMID: 33461690 DOI: 10.1016/j.artmed.2020.101997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2020] [Revised: 10/02/2020] [Accepted: 11/23/2020] [Indexed: 10/22/2022]
Abstract
BACKGROUND Motor vehicle accidents (MVA) represent a significant burden on health systems globally. Tens of thousands of people are injured in Australia every year and may experience significant disability. Associated economic costs are substantial. There is little literature on the health service utilization patterns of MVA patients. To fill this gap, this study has been designed to investigate temporal patterns of psychology and physiotherapy service utilization following transport-related injuries. METHOD De-identified compensation data was provided by the Australian Transport Accident Commission. Utilization of physiotherapy and psychology services was analysed. The datasets contained 788 psychology and 3115 physiotherapy claimants and 22,522 and 118,453 episodes of service utilization, respectively. 582 claimants used both services, and their data were preprocessed to generate multidimensional time series. Time series clustering was applied using a mixture of hidden Markov models to identify the main distinct patterns of service utilization. Combinations of hidden states and clusters were evaluated and optimized using the Bayesian information criterion and interpretability. Cluster membership was further investigated using static covariates and multinomial logistic regression, and classified using high-performing classifiers (extreme gradient boosting machine, random forest and support vector machine) with 5-fold cross-validation. RESULTS Four clusters of claimants were obtained from the clustering of the time series of service utilization. Service volumes and costs increased progressively from clusters 1 to 4. Membership of cluster 1 was positively associated with nerve damage and negatively associated with severe ABI and spinal injuries. Cluster 3 was positively associated with severe ABI, brain/head injury and psychiatric injury. Cluster 4 was positively associated with internal injuries. 
The classifiers were capable of classifying cluster membership with moderate to strong performance (AUC: 0.62-0.96). CONCLUSION The available time series of post-accident psychology and physiotherapy service utilization were coalesced into four clusters that were clearly distinct in terms of patterns of utilization. In addition, pre-treatment covariates allowed prediction of a claimant's post-accident service utilization with reasonable accuracy. Such results can be useful for a range of decision-making processes, including the design of interventions aimed at improving claimant care and recovery.
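The model-selection step described in the METHOD (choosing the number of clusters and hidden states with the Bayesian information criterion) can be illustrated with the standard BIC formula, k·ln(n) − 2·ln(L), where lower is better. The log-likelihoods and parameter counts below are invented for illustration; only the figure of 582 claimants comes from the abstract, and treating it as the observation count is itself an assumption. The study also weighed interpretability, which this sketch ignores.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: n_params * ln(n_obs) - 2 * log-likelihood."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def select_by_bic(candidates, n_obs):
    """Return the name of the candidate model with the lowest BIC.

    `candidates` maps a model name to (log_likelihood, n_params).
    """
    return min(candidates, key=lambda name: bic(*candidates[name], n_obs))


# Hypothetical fits of mixture models with different numbers of clusters.
hypothetical_fits = {
    "2 clusters": (-520.0, 10),
    "4 clusters": (-480.0, 22),
    "6 clusters": (-475.0, 34),
}
chosen = select_by_bic(hypothetical_fits, n_obs=582)
```

In this made-up example the six-cluster model fits slightly better but is penalized for its extra parameters, so the four-cluster model has the lowest BIC.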
Affiliation(s)
- Nazanin Esmaili
- Faculty of Engineering and IT, University of Technology Sydney, NSW, Australia; School of Medicine, University of Notre Dame Australia, Sydney, NSW, Australia.
- Quinlan D Buchlak
- School of Medicine, University of Notre Dame Australia, Sydney, NSW, Australia
- Massimo Piccardi
- Faculty of Engineering and IT, University of Technology Sydney, NSW, Australia
- Bernie Kruger
- Transport Accident Commission, Geelong, VIC, Australia
- Federico Girosi
- Capital Markets Cooperative Research Centre (CMCRC), Sydney, NSW, Australia; Translational Health Research Institute, Western Sydney University, Penrith, NSW, Australia
47
Safavi A, Kefayat A, Mahdevar E, Abiri A, Ghahremani F. Exploring the out of sight antigens of SARS-CoV-2 to design a candidate multi-epitope vaccine by utilizing immunoinformatics approaches. Vaccine 2020; 38:7612-7628. [PMID: 33082015 PMCID: PMC7546226 DOI: 10.1016/j.vaccine.2020.10.016] [Citation(s) in RCA: 85] [Impact Index Per Article: 17.0] [Received: 05/05/2020] [Revised: 08/25/2020] [Accepted: 10/06/2020] [Indexed: 12/12/2022]
Abstract
SARS-CoV-2 causes a severe respiratory disease called COVID-19, whose devastating outbreak is currently a global health emergency. However, no vaccine against this virus is available so far. In this study, a novel multi-epitope vaccine against SARS-CoV-2 was designed to provoke both innate and adaptive immune responses. The immunodominant regions of six non-structural proteins (nsp7, nsp8, nsp9, nsp10, nsp12 and nsp14) of SARS-CoV-2 were selected with multiple immunoinformatic tools to provoke a T cell immune response. In addition, an immunodominant fragment of the functional region of the SARS-CoV-2 spike protein (residues 400-510) was selected to induce the production of neutralizing antibodies. The sequences of the selected regions were connected to each other by a furin-sensitive linker (RVRR). Moreover, the functional region of β-defensin, a well-known agonist of the TLR-4/MD complex, was added at the N-terminus of the vaccine using an (EAAAK)3 linker. A CD4+ T-helper epitope, PADRE, was attached at the C-terminus of the vaccine by GPGPG and A(EAAAK)2A linkers to form the final vaccine construct. The physicochemical properties, allergenicity, antigenicity, functionality and population coverage of the final vaccine construct were analyzed. The final vaccine construct was an immunogenic, non-allergenic and non-functional protein that contained multiple overlapping CD8+ and CD4+ epitopes, IFN-γ inducing epitopes, and linear and conformational B cell epitopes. It could form stable and significant interactions with TLR-4/MD according to molecular docking and dynamics simulations. Global population coverage of the vaccine for HLA-I and HLA-II was estimated at 96.2% and 97.1%, respectively. Finally, the vaccine construct was reverse translated to design the DNA vaccine. Although the designed vaccine exhibited high efficacy in silico, further experimental validation is necessary.
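The construct layout described in this abstract amounts to joining sequence fragments with fixed linkers, which can be sketched as simple string assembly. Only the linker sequences RVRR and (EAAAK)3 are taken from the abstract; the fragment strings are short made-up placeholders, the function name is hypothetical, and the C-terminal PADRE segment is deliberately left out because the abstract does not specify how the GPGPG and A(EAAAK)2A linkers are arranged around it.

```python
FURIN_LINKER = "RVRR"            # furin-sensitive linker named in the abstract
ADJUVANT_LINKER = "EAAAK" * 3    # the (EAAAK)3 linker named in the abstract


def assemble_construct(adjuvant, fragments):
    """Join an N-terminal adjuvant and antigen fragments with the stated linkers.

    Layout assumed here: adjuvant + (EAAAK)3 + fragment1 + RVRR + fragment2 ...
    """
    if not fragments:
        raise ValueError("at least one antigen fragment is required")
    return adjuvant + ADJUVANT_LINKER + FURIN_LINKER.join(fragments)


# Placeholder sequences for illustration only; not the real vaccine fragments.
placeholder_adjuvant = "MKK"
placeholder_fragments = ["AAA", "CCC"]
construct = assemble_construct(placeholder_adjuvant, placeholder_fragments)
```

A real design would follow this step with the property checks listed in the abstract (allergenicity, antigenicity, population coverage), which rely on dedicated immunoinformatic tools rather than string operations.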
Affiliation(s)
- Ashkan Safavi
- Department of Biology, Science and Research Branch, Islamic Azad University, Tehran, Iran
- Amirhosein Kefayat
- Department of Oncology, Cancer Prevention Research Center, Isfahan University of Medical Sciences, Isfahan, Iran
- Elham Mahdevar
- Department of Biology, Faculty of Science and Engineering, Science and Arts University, Yazd, Iran
- Ardavan Abiri
- Department of Medicinal Chemistry, Faculty of Pharmacy, Kerman University of Medical Sciences, Kerman, Iran
- Fatemeh Ghahremani
- Department of Medical Physics and Radiotherapy, Arak School of Paramedicine, Arak University of Medical Sciences, Arak, Iran.
48
Elbasir A, Mall R, Kunji K, Rawi R, Islam Z, Chuang GY, Kolatkar PR, Bensmail H. BCrystal: an interpretable sequence-based protein crystallization predictor. Bioinformatics 2020; 36:1429-1438. [PMID: 31603511 DOI: 10.1093/bioinformatics/btz762] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 04/01/2019] [Revised: 09/19/2019] [Accepted: 10/08/2019] [Indexed: 02/01/2023] Open
Abstract
MOTIVATION X-ray crystallography has facilitated the majority of protein structures determined to date. Sequence-based predictors that can accurately estimate protein crystallization propensities would be highly beneficial to overcome the high expenditure and large attrition rate and to reduce the trial-and-error settings required for crystallization. RESULTS In this study, we present a novel model, BCrystal, which uses an optimized gradient boosting machine (XGBoost) on sequence, structural and physicochemical features extracted from the proteins of interest. BCrystal also provides explanations, highlighting the most important features for the predicted crystallization propensity of an individual protein using the SHAP algorithm. On three independent test sets, BCrystal outperforms state-of-the-art sequence-based methods by more than 12.5% in accuracy, 18% in recall and 0.253 in Matthews correlation coefficient, with an average accuracy of 93.7%, recall of 96.63% and Matthews correlation coefficient of 0.868. We observed that higher relative solvent accessibility of exposed residues associates positively with protein crystallizability, whereas the number of disordered regions, the fraction of coils and tripeptide stretches that contain multiple histidines associate negatively with crystallizability. The higher accuracy of BCrystal enables it to accurately screen for sequence variants with enhanced crystallizability. AVAILABILITY AND IMPLEMENTATION Our BCrystal webserver is at https://machinelearning-protein.qcri.org/ and source code is available at https://github.com/raghvendra5688/BCrystal. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Abdurrahman Elbasir: ICT Division, College of Science and Engineering, Hamad Bin Khalifa University
- Raghvendra Mall: Data Analytics, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha 34110, Qatar
- Khalid Kunji: Data Analytics, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha 34110, Qatar
- Reda Rawi: Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Zeyaul Islam: Diabetes Research Center, Qatar Biomedical Research Institute, Hamad Bin Khalifa University, Doha 34100, Qatar
- Gwo-Yu Chuang: Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Prasanna R Kolatkar: Diabetes Research Center, Qatar Biomedical Research Institute, Hamad Bin Khalifa University, Doha 34100, Qatar
- Halima Bensmail: Data Analytics, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha 34110, Qatar
|
49
|
Hou Q, Kwasigroch JM, Rooman M, Pucci F. SOLart: a structure-based method to predict protein solubility and aggregation. Bioinformatics 2020; 36:1445-1452. [PMID: 31603466 DOI: 10.1093/bioinformatics/btz773] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 06/03/2019] [Revised: 08/31/2019] [Accepted: 10/08/2019] [Indexed: 12/12/2022]
Abstract
MOTIVATION: The solubility of a protein is often decisive for its proper functioning. Lack of solubility is a major bottleneck in high-throughput structural genomic studies and in high-concentration protein production, and the formation of protein aggregates causes a wide variety of diseases. Since solubility measurements are time-consuming and expensive, there is a strong need for solubility prediction tools. RESULTS: We have recently introduced solubility-dependent distance potentials that are able to unravel the role of residue-residue interactions in promoting or decreasing protein solubility. Here, we extended their construction by defining solubility-dependent potentials based on backbone torsion angles and solvent accessibility, and integrated them, together with other structure- and sequence-based features, into a random forest model trained on a set of Escherichia coli proteins with experimental structures and solubility values. We thus obtained the SOLart protein solubility predictor, whose most informative features turned out to be folding free energy differences computed from our solubility-dependent statistical potentials. SOLart performs very well, with a Pearson correlation coefficient between experimental and predicted solubility values of almost 0.7, both in cross-validation on the training dataset and on an independent set of Saccharomyces cerevisiae proteins. On test sets of modeled structures, only a limited drop in performance is observed. SOLart can thus be used with both high-resolution and low-resolution structures, and clearly outperforms state-of-the-art solubility predictors. It is available through a user-friendly webserver that non-expert scientists can easily use. AVAILABILITY AND IMPLEMENTATION: The SOLart webserver is freely available at http://babylone.ulb.ac.be/SOLART/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Qingzhen Hou: Computational Biology and Bioinformatics, Université Libre de Bruxelles, Avenue Roosevelt 50, 1050 Brussels, Belgium; Interuniversity Institute of Bioinformatics in Brussels, Boulevard du Triomphe, 1050 Brussels, Belgium
- Jean Marc Kwasigroch: Computational Biology and Bioinformatics, Université Libre de Bruxelles, Avenue Roosevelt 50, 1050 Brussels, Belgium; Interuniversity Institute of Bioinformatics in Brussels, Boulevard du Triomphe, 1050 Brussels, Belgium
- Marianne Rooman: Computational Biology and Bioinformatics, Université Libre de Bruxelles, Avenue Roosevelt 50, 1050 Brussels, Belgium; Interuniversity Institute of Bioinformatics in Brussels, Boulevard du Triomphe, 1050 Brussels, Belgium
- Fabrizio Pucci: Computational Biology and Bioinformatics, Université Libre de Bruxelles, Avenue Roosevelt 50, 1050 Brussels, Belgium; Interuniversity Institute of Bioinformatics in Brussels, Boulevard du Triomphe, 1050 Brussels, Belgium; John von Neumann Institute for Computing, Jülich Supercomputer Centre, Forschungszentrum Jülich, 52428 Jülich, Germany
|
50
|
Liu ZL, Hu JH, Jiang F, Wu YD. CRiSP: accurate structure prediction of disulfide-rich peptides with cystine-specific sequence alignment and machine learning. Bioinformatics 2020; 36:3385-3392. [PMID: 32215567 DOI: 10.1093/bioinformatics/btaa193] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/16/2019] [Revised: 02/06/2020] [Accepted: 03/22/2020] [Indexed: 12/19/2022]
Abstract
MOTIVATION: High-throughput sequencing has discovered many naturally occurring disulfide-rich peptides, or cystine-rich peptides (CRPs), with diverse bioactivities. However, their structural information, which is very important for peptide drug discovery, is still very limited. RESULTS: We have developed a CRP-specific structure prediction method called Cystine-Rich peptide Structure Prediction (CRiSP), based on a customized template database with cystine-specific sequence alignment and three machine-learning predictors. The modeling accuracy is significantly better than that of several popular general-purpose structure modeling methods, and CRiSP can provide useful model quality estimates. AVAILABILITY AND IMPLEMENTATION: The CRiSP server is freely available at http://wulab.com.cn/CRISP. CONTACT: wuyd@pkusz.edu.cn or jiangfan@pku.edu.cn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zi-Lin Liu: Laboratory of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
- Jing-Hao Hu: Laboratory of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China; College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China
- Fan Jiang: Laboratory of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China; NanoAI Biotech Co., Ltd, Shenzhen 518118, China
- Yun-Dong Wu: Laboratory of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China; College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China; Shenzhen Bay Laboratory, Shenzhen 518055, China
|