1
Li D, Zhu Y, Zhang W, Liu J, Yang X, Liu Z, Wei D. AI Prediction of Structural Stability of Nanoproteins Based on Structures and Residue Properties by Mean Pooled Dual Graph Convolutional Network. Interdiscip Sci 2024:10.1007/s12539-024-00662-7. [PMID: 39367992 DOI: 10.1007/s12539-024-00662-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2024] [Revised: 09/18/2024] [Accepted: 09/22/2024] [Indexed: 10/07/2024]
Abstract
The structural stability of proteins is an important topic in various fields such as biotechnology, pharmaceuticals, and enzymology. Specifically, understanding the structural stability of proteins is crucial for protein design. Artificial design, while pursuing high thermodynamic stability and rigidity of proteins, inevitably sacrifices biological functions closely related to protein flexibility. The highest thermodynamic stability is not always optimal for proteins to perform their biological functions. Extensive theoretical and experimental screening is often required to obtain stable protein structures. Thus, it becomes critically important to develop a stability prediction model based on the balance between protein stability and bioactivity. To design protein drugs with better functionality in a broader structural space, a novel protein structural stability predictor called PSSP has been developed in this study. PSSP is a mean pooled dual graph convolutional network (GCN) model based on sequence characteristics and secondary structure, distance matrix, graph, and residue properties of a nanoprotein to provide rapid prediction and judgment. This model exhibits excellent robustness in predicting the structural stability of nanoproteins. Compared with previous artificial intelligence algorithms, the results indicate that this model can provide a rapid and accurate assessment of the structural stability of artificially designed proteins, which shows great promise for promoting the robust development of protein design.
Affiliation(s)
- Daixi Li
  - Institute of Biothermal Engineering, University of Shanghai for Science and Technology, Shanghai, 20093, China
  - Pengcheng Laboratory, Shenzhen, 518055, China
- Yuqi Zhu
  - Institute of Biothermal Engineering, University of Shanghai for Science and Technology, Shanghai, 20093, China
- Wujie Zhang
  - Chemical and Biomolecular Engineering Program, Physics and Chemistry Department, Milwaukee School of Engineering, Milwaukee, 53202, USA
- Jing Liu
  - Institute of Biothermal Engineering, University of Shanghai for Science and Technology, Shanghai, 20093, China
- Xiaochen Yang
  - Institute of Biothermal Engineering, University of Shanghai for Science and Technology, Shanghai, 20093, China
- Zhihong Liu
  - Pingshan Translational Medicine Center, Shenzhen Bay Laboratory, Shenzhen, 518118, China
- Dongqing Wei
  - Pengcheng Laboratory, Shenzhen, 518055, China
  - State Key Laboratory of Microbial Metabolism, Shanghai-Islamabad-Belgrade Joint Innovation Center on Antibacterial Resistances, Joint International Research Laboratory of Metabolic and Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
2
Parthiban S, Vijeesh T, Gayathri T, Shanmugaraj B, Sharma A, Sathishkumar R. Artificial intelligence-driven systems engineering for next-generation plant-derived biopharmaceuticals. Front Plant Sci 2023; 14:1252166. [PMID: 38034587 PMCID: PMC10684705 DOI: 10.3389/fpls.2023.1252166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Accepted: 10/17/2023] [Indexed: 12/02/2023]
Abstract
Recombinant biopharmaceuticals including antigens, antibodies, hormones, cytokines, single-chain variable fragments, and peptides have been used as vaccines, diagnostics and therapeutics. Plant molecular pharming is a robust platform that uses plants as an expression system to produce simple and complex recombinant biopharmaceuticals on a large scale. Plant systems have several advantages over other host systems, such as humanized expression, glycosylation, scalability, reduced risk of human or animal pathogenic contaminants, and rapid and cost-effective production. Despite these advantages, the expression of recombinant proteins in plant systems is hindered by factors such as non-human post-translational modifications, protein misfolding, conformation changes and instability. Artificial intelligence (AI) plays a vital role in various fields of biotechnology. In plant molecular pharming, a significant increase in yield and stability can be achieved with AI-based multi-approach interventions that overcome these hindrance factors. Current limitations of plant-based recombinant biopharmaceutical production can be circumvented with the aid of synthetic biology tools and AI algorithms in plant-based glycan engineering for protein folding, stability, viability, catalytic activity and organelle targeting. The AI models, including but not limited to neural networks, support vector machines, linear regression, Gaussian processes and regressor ensembles, work by predicting the training and experimental data sets to design and validate the protein structures, thereby optimizing properties such as thermostability, catalytic activity, antibody affinity, and protein folding. This review focuses on integrating systems engineering approaches and AI-based machine learning and deep learning algorithms in protein engineering and host engineering to augment protein production in plant systems to meet the ever-expanding therapeutics market.
Affiliation(s)
- Subramanian Parthiban
  - Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
- Thandarvalli Vijeesh
  - Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
- Thashanamoorthi Gayathri
  - Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
- Balamurugan Shanmugaraj
  - Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
- Ashutosh Sharma
  - Tecnologico de Monterrey, School of Engineering and Sciences, Centre of Bioengineering, Queretaro, Mexico
- Ramalingam Sathishkumar
  - Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
3
Bijak V, Szczygiel M, Lenkiewicz J, Gucwa M, Cooper DR, Murzyn K, Minor W. The current role and evolution of X-ray crystallography in drug discovery and development. Expert Opin Drug Discov 2023; 18:1221-1230. [PMID: 37592849 PMCID: PMC10620067 DOI: 10.1080/17460441.2023.2246881] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 05/31/2023] [Accepted: 08/08/2023] [Indexed: 08/19/2023]
Abstract
INTRODUCTION: Macromolecular X-ray crystallography and cryo-EM are currently the primary techniques used to determine the three-dimensional structures of proteins, nucleic acids, and viruses. Structural information has been critical to drug discovery and structural bioinformatics. The integration of artificial intelligence (AI) into X-ray crystallography has shown great promise in automating and accelerating the analysis of complex structural data, further improving the efficiency and accuracy of structure determination. AREAS COVERED: This review explores the relationship between X-ray crystallography and other modern structural determination methods. It examines the integration of data acquired from diverse biochemical and biophysical techniques with those derived from structural biology. Additionally, the paper offers insights into the influence of AI on X-ray crystallography, emphasizing how integrating AI with experimental approaches can revolutionize our comprehension of biological processes and interactions. EXPERT OPINION: Investing in science is crucially emphasized due to its significant role in drug discovery and advancements in healthcare. X-ray crystallography remains an essential source of structural biology data for drug discovery. Recent advances in biochemical, spectroscopic, and bioinformatic methods, along with the integration of AI techniques, hold the potential to revolutionize drug discovery when effectively combined with robust data management practices.
Affiliation(s)
- Vanessa Bijak
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville 22908
- Michal Szczygiel
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville 22908
  - Department of Computational Biophysics and Bioinformatics, Jagiellonian University, Krakow, Poland
- Joanna Lenkiewicz
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville 22908
- Michal Gucwa
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville 22908
  - Doctoral School of Exact and Natural Sciences, Jagiellonian University, Krakow, Poland
- David R. Cooper
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville 22908
- Krzysztof Murzyn
  - Department of Computational Biophysics and Bioinformatics, Jagiellonian University, Krakow, Poland
- Wladek Minor
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville 22908
4
Ouellet S, Ferguson L, Lau AZ, Lim TKY. CysPresso: a classification model utilizing deep learning protein representations to predict recombinant expression of cysteine-dense peptides. BMC Bioinformatics 2023; 24:200. [PMID: 37193950 PMCID: PMC10189939 DOI: 10.1186/s12859-023-05327-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2023] [Accepted: 05/08/2023] [Indexed: 05/18/2023]
Abstract
BACKGROUND: Cysteine-dense peptides (CDPs) are an attractive pharmaceutical scaffold that display extreme biochemical properties, low immunogenicity, and the ability to bind targets with high affinity and selectivity. While many CDPs have potential and confirmed therapeutic uses, synthesis of CDPs is a challenge. Recent advances have made the recombinant expression of CDPs a viable alternative to chemical synthesis. Moreover, identifying CDPs that can be expressed in mammalian cells is crucial in predicting their compatibility with gene therapy and mRNA therapy. Currently, we lack the ability to identify CDPs that will express recombinantly in mammalian cells without labour intensive experimentation. To address this, we developed CysPresso, a novel machine learning model that predicts recombinant expression of CDPs based on primary sequence. RESULTS: We tested various protein representations generated by deep learning algorithms (SeqVec, proteInfer, AlphaFold2) for their suitability in predicting CDP expression and found that AlphaFold2 representations possessed the best predictive features. We then optimized the model by concatenation of AlphaFold2 representations, time series transformation with random convolutional kernels, and dataset partitioning. CONCLUSION: Our novel model, CysPresso, is the first to successfully predict recombinant CDP expression in mammalian cells and is particularly well suited for predicting recombinant expression of knottin peptides. When preprocessing the deep learning protein representation for supervised machine learning, we found that random convolutional kernel transformation preserves more pertinent information relevant for predicting expressibility than embedding averaging. Our study showcases the applicability of deep learning-based protein representations, such as those provided by AlphaFold2, in tasks beyond structure prediction.
Affiliation(s)
- Larissa Ferguson
  - Neurobiology Division, MRC Laboratory of Molecular Biology, Cambridge, UK
- Angus Z Lau
  - Medical Biophysics, University of Toronto, Toronto, ON, Canada
  - Physical Sciences Platform, Sunnybrook Research Institute, Toronto, ON, Canada
- Tony K Y Lim
  - , Vancouver, Canada.
  - Department of Pharmacology, University of Cambridge, Cambridge, UK.
5
Thumuluri V, Almagro Armenteros JJ, Johansen A, Nielsen H, Winther O. DeepLoc 2.0: multi-label subcellular localization prediction using protein language models. Nucleic Acids Res 2022; 50:W228-W234. [PMID: 35489069 PMCID: PMC9252801 DOI: 10.1093/nar/gkac278] [Citation(s) in RCA: 183] [Impact Index Per Article: 91.5] [Received: 02/05/2022] [Revised: 04/07/2022] [Accepted: 04/19/2022] [Indexed: 12/19/2022]
Abstract
The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.
Affiliation(s)
- José Juan Almagro Armenteros
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark
  - Department of Genetics, Stanford University School of Medicine, Stanford 94305, CA, USA
- Alexander Rosenberg Johansen
  - Department of Computer Science, Stanford University, Stanford 94305, CA, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford 94305, CA, USA
- Henrik Nielsen
  - Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Kongens Lyngby 2800, Denmark
- Ole Winther
  - Center for Genomic Medicine, Rigshospitalet (Copenhagen University Hospital), Copenhagen 2100, Denmark
  - Department of Biology, Bioinformatics Centre, University of Copenhagen, Copenhagen 2200, Denmark
  - Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby 2800, Denmark
6
Thumuluri V, Martiny HM, Almagro Armenteros JJ, Salomon J, Nielsen H, Johansen AR. NetSolP: predicting protein solubility in Escherichia coli using language models. Bioinformatics 2022; 38:941-946. [PMID: 35088833 DOI: 10.1093/bioinformatics/btab801] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 07/14/2021] [Revised: 10/13/2021] [Accepted: 11/23/2021] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. RESULTS: In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. AVAILABILITY AND IMPLEMENTATION: The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hannah-Marie Martiny
  - Research Group for Genomic Epidemiology, National Food Institute, Technical University of Denmark, Lyngby 2800, Denmark
- Jose J Almagro Armenteros
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen, Denmark
- Henrik Nielsen
  - Department of Health Technology, Technical University of Denmark, Lyngby 2800, Denmark
- Alexander Rosenberg Johansen
  - Department of Computer Science, Stanford University, Stanford, CA 94305, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
7
Mardikoraem M, Woldring D. Machine Learning-driven Protein Library Design: A Path Toward Smarter Libraries. Methods Mol Biol 2022; 2491:87-104. [PMID: 35482186 DOI: 10.1007/978-1-0716-2285-8_5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Proteins are small yet valuable biomolecules that play a versatile role in therapeutics and diagnostics. The intricate sequence-structure-function paradigm in the realm of proteins opens the possibility for directly mapping amino acid sequence to function. However, the rugged nature of the protein fitness landscape and an astronomical number of possible mutations even for small proteins make navigating this system a daunting task. Moreover, the scarcity of functional proteins and the ease with which deleterious mutations are introduced, due to complex epistatic relationships, compound the existing challenges. This highlights the need for auxiliary tools in current techniques such as rational design and directed evolution. To that end, state-of-the-art machine learning can offer time and cost efficiency in finding high-fitness proteins, circumventing unnecessary wet-lab experiments. In the context of improving library design, machine learning provides valuable insights via its unique features such as high adaptation to complex systems, multi-tasking, parallelism, and the ability to capture hidden trends in input data. Finally, both the advancements in computational resources and the rapidly increasing number of sequences in protein databases will allow machine learning to deliver more promising and detailed insights to protein library design. In this chapter, fundamental concepts and a method for machine learning-driven library design leveraging deep sequencing datasets will be discussed. We elaborate on (1) basic knowledge about machine learning algorithms, (2) the benefit of machine learning in library design, and (3) methodology for implementing machine learning in library design.
Affiliation(s)
- Mehrsa Mardikoraem
  - Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI, USA
  - Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI, USA
- Daniel Woldring
  - Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI, USA
  - Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI, USA