1
Murali H, Wang P, Liao EC, Wang K. Genetic variant classification by predicted protein structure: A case study on IRF6. Comput Struct Biotechnol J 2024; 23:892-904. [PMID: 38370976 PMCID: PMC10869248 DOI: 10.1016/j.csbj.2024.01.019] [Received: 12/02/2023] [Revised: 01/24/2024] [Accepted: 01/25/2024] [Indexed: 02/20/2024] Open
Abstract
Next-generation genome sequencing has revolutionized genetic testing, identifying numerous rare disease-associated gene variants. However, computational approaches remain inadequate for imputing pathogenicity, and functional testing of gene variants is required to provide the highest level of evidence. The emergence of AlphaFold2 has transformed the field of protein structure determination, and here we outline a strategy that leverages predicted protein structure to enhance genetic variant classification. We used the gene IRF6 as a case study due to its clinical relevance, its critical role in cleft lip/palate malformation, and the availability of experimental data on the pathogenicity of IRF6 gene variants through phenotype rescue experiments in irf6-/- zebrafish. We compared results from over 30 pathogenicity prediction tools on 37 IRF6 missense variants. IRF6 lacks an experimentally derived structure, so we used predicted structures to explore associations between mutational clustering and pathogenicity. We found that among these variants, 19 of 37 were unanimously predicted as deleterious by computational tools. Comparing in silico predictions with experimental findings, 12 variants predicted as pathogenic were experimentally determined as benign. Even with the recently published AlphaMissense model, 15/18 (83%) of the predicted pathogenic variants were experimentally determined as benign. In comparison, mapping variants to the protein revealed deleterious mutation clusters around the protein binding domain, whereas N-terminal variants tend to be benign, suggesting the importance of structural information in determining pathogenicity of mutations in this gene. In conclusion, incorporating gene-specific structural features of known pathogenic/benign mutations may provide meaningful insights into pathogenicity predictions in a gene-specific manner and facilitate the interpretation of variant pathogenicity.
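The proportions quoted in this abstract can be checked with simple arithmetic. The sketch below is illustrative only: it uses just the counts stated in the abstract (15 of 18 AlphaMissense calls, 19 of 37 unanimous calls), and the function name `share` is a hypothetical helper, not anything from the paper.

```python
def share(part: int, total: int) -> float:
    """Return part/total as a fraction, validating the counts first."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= part <= total:
        raise ValueError("part must lie between 0 and total")
    return part / total


# Counts taken directly from the abstract above.
# AlphaMissense: 18 variants predicted pathogenic, 15 of them benign in zebrafish rescue.
alphamissense_benign_share = share(15, 18)

# 19 of the 37 missense variants were unanimously predicted deleterious.
unanimous_deleterious_share = share(19, 37)

print(f"Predicted pathogenic but experimentally benign: {alphamissense_benign_share:.0%}")
print(f"Unanimously predicted deleterious: {unanimous_deleterious_share:.0%}")
```

The first value reproduces the 83% reported in the abstract; the second is simply the same calculation applied to the 19-of-37 count.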
Affiliation(s)
- Hemma Murali
  - Graduate Program in Biochemistry and Molecular Biophysics, University of Pennsylvania, Philadelphia, PA 19104, United States
  - Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, United States
- Peng Wang
  - Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, United States
  - Master of Biotechnology Program, University of Pennsylvania, Philadelphia, PA 19104, United States
- Eric C. Liao
  - Department of Surgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States
  - Center for Craniofacial Innovation, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, United States
- Kai Wang
  - Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, United States
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States
2
Kovacs AS, Portelli S, Silk M, Rodrigues CHM, Ascher DB. MTR3D-AF2: Expanding the coverage of spatially derived missense tolerance scores across the human proteome using AlphaFold2. Protein Sci 2024; 33:e5112. [PMID: 39031445 DOI: 10.1002/pro.5112] [Received: 12/17/2023] [Revised: 06/24/2024] [Accepted: 06/26/2024] [Indexed: 07/22/2024]
Abstract
The missense tolerance ratio (MTR) was developed as a novel approach to assess the deleteriousness of variants. Its three-dimensional successor, MTR3D, was shown to be powerful at discriminating pathogenic from benign variants. However, its reliance on experimental structures and homologs limited its coverage of the proteome. We have now utilized AlphaFold2 models to develop MTR3D-AF2, which covers 89.31% of proteins and 85.39% of residues across the human proteome. This work has improved MTR3D's ability to distinguish clinically established pathogenic from benign variants. MTR3D-AF2 is freely available as an interactive web server at https://biosig.lab.uq.edu.au/mtr3daf2/.
Affiliation(s)
- Aaron S Kovacs
  - The Australian Center for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia
- Stephanie Portelli
  - The Australian Center for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Australia
- Michael Silk
  - Centre for Population Genomics, Murdoch Children's Research Institute, Melbourne, Australia
  - Systems and Computational Biology, Bio21 Institute, The University of Melbourne, Melbourne, Australia
- Carlos H M Rodrigues
  - The Australian Center for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Australia
- David B Ascher
  - The Australian Center for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Australia
  - Systems and Computational Biology, Bio21 Institute, The University of Melbourne, Melbourne, Australia
3
Dijkema FM, Escarpizo‐Lorenzana MI, Nordentoft MK, Rabe HC, Sahin C, Landreh M, Branca RM, Sørensen ES, Christensen B, Prestel A, Teilum K, Winther JR. A suicidal and extensively disordered luciferase with a bright luminescence. Protein Sci 2024; 33:e5115. [PMID: 39023083 PMCID: PMC11255867 DOI: 10.1002/pro.5115] [Received: 02/28/2024] [Revised: 06/07/2024] [Accepted: 07/01/2024] [Indexed: 07/20/2024]
Abstract
Gaussia luciferase (GLuc) is one of the most luminescent luciferases known and is widely used as a reporter in biochemistry and cell biology. During catalysis, GLuc undergoes inactivation by irreversible covalent modification. However, the mechanism by which GLuc generates luminescence and how it becomes inactivated are not known. Here, we show that GLuc, unlike other enzymes, has an extensively disordered structure with a minimal hydrophobic core and no apparent binding pocket for the main substrate, coelenterazine. From an alanine scan, we identified two Arg residues required for light production. These residues are separated by an average of about 22 Å, and a major structural rearrangement is required if they are to interact with the substrate simultaneously. We furthermore show that, in addition to coelenterazine, GLuc can also oxidize furimazine, although in this case without production of light. Both substrates result in the formation of adducts with the enzyme, which eventually leads to enzyme inactivation. Our results demonstrate that a rigid protein structure and substrate-binding site are not prerequisites for high enzymatic activity and specificity. In addition to the increased understanding of enzymes in general, the findings will facilitate future improvement of GLuc as a reporter luciferase.
Affiliation(s)
- Fenne Marjolein Dijkema
  - The Linderstrøm‐Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Hanna Christin Rabe
  - The Linderstrøm‐Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Cagla Sahin
  - The Linderstrøm‐Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
  - Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Stockholm, Sweden
- Michael Landreh
  - Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Stockholm, Sweden
- Rui Mamede Branca
  - Science for Life Laboratory, Department of Oncology‐Pathology, Karolinska Institutet, Stockholm, Sweden
- Esben Skipper Sørensen
  - Department of Molecular Biology and Genetics, Section for Cellular Health, Intervention and Nutrition, Aarhus University, Aarhus Centrum, Denmark
- Brian Christensen
  - Department of Molecular Biology and Genetics, Section for Cellular Health, Intervention and Nutrition, Aarhus University, Aarhus Centrum, Denmark
- Andreas Prestel
  - The Linderstrøm‐Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Kaare Teilum
  - The Linderstrøm‐Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Jakob Rahr Winther
  - The Linderstrøm‐Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
4
Zhou J, Huang M. Navigating the landscape of enzyme design: from molecular simulations to machine learning. Chem Soc Rev 2024. [PMID: 38990263 DOI: 10.1039/d4cs00196f] [Indexed: 07/12/2024]
Abstract
Global environmental issues and sustainable development call for new technologies for fine chemical synthesis and waste valorization. Biocatalysis has attracted great attention as an alternative to traditional organic synthesis. However, it is challenging to navigate the vast sequence space to identify those proteins with admirable biocatalytic functions. The recent development of deep-learning-based structure prediction methods such as AlphaFold2, reinforced by different computational simulations or multiscale calculations, has largely expanded the 3D structure databases and enabled structure-based design. While structure-based approaches shed light on site-specific enzyme engineering, they are not suitable for large-scale screening of potential biocatalysts. Effective utilization of big data using machine learning techniques opens up a new era for accelerated predictions. Here, we review the approaches and applications of structure-based and machine-learning-guided enzyme design. We also provide our view on the challenges and perspectives on effectively employing enzyme design approaches integrating traditional molecular simulations and machine learning, and the importance of database construction and algorithm development in attaining predictive ML models to explore the sequence fitness landscape for the design of admirable biocatalysts.
Affiliation(s)
- Jiahui Zhou
  - School of Chemistry and Chemical Engineering, Queen's University, David Keir Building, Stranmillis Road, Belfast BT9 5AG, Northern Ireland, UK.
- Meilan Huang
  - School of Chemistry and Chemical Engineering, Queen's University, David Keir Building, Stranmillis Road, Belfast BT9 5AG, Northern Ireland, UK.
5
Rahimzadeh F, Mohammad Khanli L, Salehpoor P, Golabi F, PourBahrami S. Unveiling the evolution of policies for enhancing protein structure predictions: A comprehensive analysis. Comput Biol Med 2024; 179:108815. [PMID: 38986287 DOI: 10.1016/j.compbiomed.2024.108815] [Received: 03/04/2024] [Revised: 06/09/2024] [Accepted: 06/24/2024] [Indexed: 07/12/2024]
Abstract
Predicting protein structure is both fascinating and formidable, playing a crucial role in structure-based drug discovery and unraveling diseases with elusive origins. The Critical Assessment of Protein Structure Prediction (CASP) serves as a biennial battleground where global scientists converge to untangle the intricate relationships within amino acid chains. Two primary methods, Template-Based Modeling (TBM) and Template-Free (TF) strategies, dominate protein structure prediction. The trend has shifted towards Template-Free predictions due to their broader sequence coverage with fewer templates. The predictive process can be broadly classified into contact map, binned-distance, and real-valued distance predictions, each with distinctive strengths and limitations manifested through tailored loss functions. We also introduce revolutionary end-to-end and all-atom diffusion-based techniques that have transformed protein structure predictions. Recent advancements in deep learning techniques have significantly improved prediction accuracy, although the effectiveness is contingent upon the quality of input features derived from natural bio-physicochemical attributes and Multiple Sequence Alignments (MSA). Hence, the generation of high-quality MSA data holds paramount importance in harnessing informative input features for enhanced prediction outcomes. Remarkable successes have been achieved in protein structure prediction accuracy; however, they are not yet sufficient for the purposes structural knowledge is intended to serve, which implies a need for development in other aspects of prediction. In this regard, scientists have opened other frontiers for protein structural prediction. The utilization of subsampling in multiple sequence alignment (MSA) and protein language modeling appears to be particularly promising in enhancing the accuracy and efficiency of predictions, ultimately aiding in drug discovery efforts. The exploration of predicting protein complex structure also opens up exciting opportunities to deepen our knowledge of molecular interactions and design more effective therapeutics. In this article, we have discussed the vicissitudes that scientists have gone through to improve prediction accuracy, and examined the effective policies in prediction from different aspects, including the construction of high-quality MSA, providing informative input features, and progress in deep learning approaches. We have also briefly touched upon transitioning from predicting single-chain protein structures to predicting protein complex structures. Our findings point towards promoting open research environments to support the objectives of protein structure prediction.
Affiliation(s)
- Faezeh Rahimzadeh
  - Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran
- Pedram Salehpoor
  - Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran
- Faegheh Golabi
  - Department of Biomedical Engineering, Faculty of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran
- Shahin PourBahrami
  - Department of Computer Engineering, Technical and Vocational University (TVU), Tehran, Iran
6
Waterhouse AM, Studer G, Robin X, Bienert S, Tauriello G, Schwede T. The structure assessment web server: for proteins, complexes and more. Nucleic Acids Res 2024; 52:W318-W323. [PMID: 38634802 PMCID: PMC11223858 DOI: 10.1093/nar/gkae270] [Received: 02/02/2024] [Revised: 03/21/2024] [Accepted: 04/02/2024] [Indexed: 04/19/2024] Open
Abstract
The 'structure assessment' web server is a one-stop shop for interactive evaluation and benchmarking of structural models of macromolecular complexes including proteins and nucleic acids. A user-friendly web dashboard links sequence with structure information and results from a variety of state-of-the-art tools, which facilitates the visual exploration and evaluation of structure models. The dashboard integrates stereochemistry information, secondary structure information, global and local model quality assessment of the tertiary structure of comparative protein models, as well as prediction of membrane location. In addition, a benchmarking mode is available where a model can be compared to a reference structure, providing easy access to scores that have been used in recent CASP experiments and CAMEO. The structure assessment web server is available at https://swissmodel.expasy.org/assess.
Affiliation(s)
- Andrew M Waterhouse
  - Biozentrum, University of Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, Computational Structural Biology, Basel, Switzerland
- Gabriel Studer
  - Biozentrum, University of Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, Computational Structural Biology, Basel, Switzerland
- Xavier Robin
  - Biozentrum, University of Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, Computational Structural Biology, Basel, Switzerland
- Stefan Bienert
  - Biozentrum, University of Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, Computational Structural Biology, Basel, Switzerland
- Gerardo Tauriello
  - Biozentrum, University of Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, Computational Structural Biology, Basel, Switzerland
- Torsten Schwede
  - Biozentrum, University of Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, Computational Structural Biology, Basel, Switzerland
7
Martin J, Lequerica Mateos M, Onuchic JN, Coluzza I, Morcos F. Machine learning in biological physics: From biomolecular prediction to design. Proc Natl Acad Sci U S A 2024; 121:e2311807121. [PMID: 38913893 PMCID: PMC11228481 DOI: 10.1073/pnas.2311807121] [Indexed: 06/26/2024] Open
Abstract
Machine learning has been proposed as an alternative to theoretical modeling when dealing with complex problems in biological physics. However, in this perspective, we argue that a more successful approach is a proper combination of these two methodologies. We discuss how ideas coming from physical modeling of neuronal processing led to early formulations of computational neural networks, e.g., Hopfield networks. We then show how modern learning approaches like Potts models, Boltzmann machines, and the transformer architecture are related to each other, specifically, through a shared energy representation. We summarize recent efforts to establish these connections and provide examples of how each of these formulations, integrating physical modeling and machine learning, has been successful in tackling recent problems in biomolecular structure, dynamics, function, evolution, and design. Instances include protein structure prediction; improvement in computational complexity and accuracy of molecular dynamics simulations; better inference of the effects of mutations in proteins, leading to improved evolutionary modeling; and, finally, how machine learning is revolutionizing protein engineering and design. Going beyond naturally existing protein sequences, a connection to protein design is discussed where synthetic sequences are able to fold to naturally occurring motifs driven by a model rooted in physical principles. We show that this model is "learnable" and propose its future use in the generation of unique sequences that can fold into a target structure.
Affiliation(s)
- Jonathan Martin
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Marcos Lequerica Mateos
  - BCMaterials, Basque Center for Materials, Applications and Nanostructures, Universidad del País Vasco/Euskal Herriko Unibertsitatea Science Park, Leioa 48940, Spain
- José N. Onuchic
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77005
  - Department of Physics and Astronomy, Rice University, Houston, TX 77005
  - Department of Chemistry, Rice University, Houston, TX 77005
  - Department of BioSciences, Rice University, Houston, TX 77005
- Ivan Coluzza
  - BCMaterials, Basque Center for Materials, Applications and Nanostructures, Universidad del País Vasco/Euskal Herriko Unibertsitatea Science Park, Leioa 48940, Spain
  - Basque Foundation for Science, Ikerbasque, Bilbao 48940, Spain
- Faruck Morcos
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
  - Department of Bioengineering, Center for Systems Biology, University of Texas at Dallas, Richardson, TX 75080
8
Wossnig L, Furtmann N, Buchanan A, Kumar S, Greiff V. Best practices for machine learning in antibody discovery and development. Drug Discov Today 2024; 29:104025. [PMID: 38762089 DOI: 10.1016/j.drudis.2024.104025] [Received: 12/14/2023] [Revised: 04/25/2024] [Accepted: 05/13/2024] [Indexed: 05/20/2024]
Abstract
In the past 40 years, therapeutic antibody discovery and development have advanced considerably, with machine learning (ML) offering a promising way to speed up the process by reducing costs and the number of experiments required. Recent progress in ML-guided antibody design and development (D&D) has been hindered by the diversity of data sets and evaluation methods, which makes it difficult to conduct comparisons and assess utility. Establishing standards and guidelines will be crucial for the wider adoption of ML and the advancement of the field. This perspective critically reviews current practices, highlights common pitfalls and proposes method development and evaluation guidelines for various ML-based techniques in therapeutic antibody D&D. Addressing challenges across the ML process, best practices are recommended for each stage to enhance reproducibility and progress.
Affiliation(s)
- Leonard Wossnig
  - LabGenius Ltd, The Biscuit Factory, 100 Drummond Road, London SE16 4DG, UK; Department of Computer Science, University College London, 66-72 Gower St, London WC1E 6EA, UK.
- Norbert Furtmann
  - R&D Large Molecules Research Platform, Sanofi Deutschland GmbH, Industriepark Höchst, Frankfurt Am Main, Germany
- Andrew Buchanan
  - Biologics Engineering, R&D, AstraZeneca, Cambridge CB2 0AA, UK
- Sandeep Kumar
  - Computational Protein Design and Modeling Group, Computational Science, Moderna Therapeutics, 200 Technology Square, Cambridge, MA 02139, USA
- Victor Greiff
  - Department of Immunology and Oslo University Hospital, University of Oslo, Oslo, Norway
9
Huang YJ, Montelione GT. Hidden Structural States of Proteins Revealed by Conformer Selection with AlphaFold-NMR. bioRxiv: The Preprint Server for Biology 2024:2024.06.26.600902. [PMID: 38979209 PMCID: PMC11230435 DOI: 10.1101/2024.06.26.600902] [Indexed: 07/10/2024]
Abstract
Recent advances in molecular modeling using deep learning can revolutionize our understanding of dynamic protein structures. NMR is particularly well-suited for determining dynamic features of biomolecular structures. The conventional process for determining biomolecular structures from experimental NMR data involves its representation as conformation-dependent restraints, followed by generation of structural models guided by these spatial restraints. Here we describe an alternative approach: generating a distribution of realistic protein conformational models using artificial intelligence (AI)-based methods and then selecting the sets of conformers that best explain the experimental data. We applied this conformational selection approach to redetermine the solution NMR structure of the enzyme Gaussia luciferase. First, we generated a diverse set of conformer models using AlphaFold2 (AF2) with an enhanced sampling protocol. The models that best fit NOESY and chemical shift data were then selected with a Bayesian scoring metric. The resulting models include features of both the published NMR structure and the standard AF2 model generated without enhanced sampling. This "AlphaFold-NMR" protocol also generated an alternative "open" conformational state that fits nearly as well to the overall NMR data but accounts for some NOESY data that is not consistent with the first "closed" conformational state; while other NOESY data consistent with this second state are not consistent with the first conformational state. The structure of this "open" structural state differs from that of the "closed" state primarily by the position of a thumb-shaped loop between α-helices H5 and H6, revealing a cryptic surface pocket. These alternative conformational states of Gluc are supported by "double recall" analysis of NOESY data and AF2 models. Additional structural states are also indicated by backbone chemical shift data indicating partially-disordered conformations for the C-terminal segment. Considered as a multistate ensemble, these multiple states of Gluc together fit the NOESY and chemical shift data better than the "restraint-based" NMR structure and provide novel insights into its structure-dynamic-function relationships. This study demonstrates the potential of AI-based modeling with enhanced sampling to generate conformational ensembles followed by conformer selection with experimental data as an alternative to conventional restraint satisfaction protocols for protein NMR structure determination.
Affiliation(s)
- Yuanpeng J. Huang
  - Dept of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, New York, 12180 USA
- Gaetano T. Montelione
  - Dept of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, New York, 12180 USA
10
Sanejouand YH. Are Most Human-Specific Proteins Encoded by Long Noncoding RNAs? J Mol Evol 2024:10.1007/s00239-024-10174-z. [PMID: 38916610 DOI: 10.1007/s00239-024-10174-z] [Received: 11/17/2023] [Accepted: 05/03/2024] [Indexed: 06/26/2024]
Abstract
By looking for a lack of homologs in a reference database of 27 well-annotated proteomes of primates and 52 well-annotated proteomes of other mammals, 170 putative human-specific proteins were identified. While most of them are deemed uncertain, 2 are known at the protein level and 23 at the transcript level, according to UniProt. Interestingly, 23 of these 25 proteins are found to be encoded or to have close homologs in an open reading frame of a long noncoding human RNA. However, half of them are predicted to be at least 80% globular, with a single structural domain, according to IUPred, and with at least 80% of ordered residues, according to flDPnn. Strikingly, there is a near-complete lack of structural knowledge about these proteins, with no tertiary structure presently available in the Protein Data Bank and a fair prediction for one of them in the AlphaFold Protein Structure Database. Moreover, knowledge about the function of these possibly key proteins remains scarce.
Affiliation(s)
- Yves-Henri Sanejouand
  - US2B, UMR 6286 of CNRS, Nantes University, 2 rue de la Houssinière, Nantes, 44322, Pays de la Loire, France.
11
Jain S, Trinidad M, Nguyen TB, Jones K, Neto SD, Ge F, Glagovsky A, Jones C, Moran G, Wang B, Rahimi K, Çalıcı SZ, Cedillo LR, Berardelli S, Özden B, Chen K, Katsonis P, Williams A, Lichtarge O, Rana S, Pradhan S, Srinivasan R, Sajeed R, Joshi D, Faraggi E, Jernigan R, Kloczkowski A, Xu J, Song Z, Özkan S, Padilla N, de la Cruz X, Acuna-Hidalgo R, Grafmüller A, Jiménez Barrón LT, Manfredi M, Savojardo C, Babbi G, Martelli PL, Casadio R, Sun Y, Zhu S, Shen Y, Pucci F, Rooman M, Cia G, Raimondi D, Hermans P, Kwee S, Chen E, Astore C, Kamandula A, Pejaver V, Ramola R, Velyunskiy M, Zeiberg D, Mishra R, Sterling T, Goldstein JL, Lugo-Martinez J, Kazi S, Li S, Long K, Brenner SE, Bakolitsa C, Radivojac P, Suhr D, Suhr T, Clark WT. Evaluation of enzyme activity predictions for variants of unknown significance in Arylsulfatase A. bioRxiv: The Preprint Server for Biology 2024:2024.05.16.594558. [PMID: 38798479 PMCID: PMC11118473 DOI: 10.1101/2024.05.16.594558] [Indexed: 05/29/2024]
Abstract
Continued advances in variant effect prediction are necessary to demonstrate the ability of machine learning methods to accurately determine the clinical impact of variants of unknown significance (VUS). Towards this goal, the ARSA Critical Assessment of Genome Interpretation (CAGI) challenge was designed to characterize progress by utilizing 219 experimentally assayed missense VUS in the Arylsulfatase A (ARSA) gene to assess the performance of community-submitted predictions of variant functional effects. The challenge involved 15 teams, and evaluated additional predictions from established and recently released models. Notably, a model developed by participants of a genetics and coding bootcamp, trained with standard machine-learning tools in Python, demonstrated superior performance among submissions. Furthermore, the study observed that state-of-the-art deep learning methods provided small but statistically significant improvement in predictive performance compared to less elaborate techniques. These findings underscore the utility of variant effect prediction, and the potential for models trained with modest resources to accurately classify VUS in genetic and clinical research.
Affiliation(s)
- Shantanu Jain
  - The Institute for Experiential AI, Northeastern University, Boston, MA, USA
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
- Marena Trinidad
  - Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA, USA
  - Howard Hughes Medical Institute, University of California, Berkeley, Berkeley, CA, USA
- Thanh Binh Nguyen
  - School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, Australia
- Fang Ge
  - State Key Laboratory of Organic Electronics and Information Displays & Institute of Advanced Materials (IAM), Nanjing University of Posts & Telecommunications, Nanjing, China
- Boqi Wang
  - Department of Bioinformatics and System Biology, University of California, San Diego, La Jolla, CA, USA
- Kobra Rahimi
  - Department of Computational Biology, School of Life Sciences, Ochanomizu University, Tokyo, Japan
- Sümeyra Zeynep Çalıcı
  - Department of Genomics, Faculty of Aquatic Science, Istanbul University, Istanbul, Türkiye
- Silvia Berardelli
  - Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
  - enGenome srl, Pavia, Italy
- Buse Özden
  - Program of Molecular Biotechnology and Genetics, Institute of Science, Istanbul University, Istanbul, Türkiye
- Ken Chen
  - University of California, Berkeley, Berkeley, CA, USA
- Panagiotis Katsonis
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Amanda Williams
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Olivier Lichtarge
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Eshel Faraggi
  - Research and Information Systems LLC, Indianapolis, IN, USA
  - Physics Department, Indiana University-Purdue University, Indianapolis, IN, USA
- Robert Jernigan
  - Roy J. Carver Department of Biochemistry, Iowa State University, Ames, IA, USA
- Andrzej Kloczkowski
  - Institute for Genomic Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, OH, USA
- Jierui Xu
  - University of California, Berkeley, Berkeley, CA, USA
- Selen Özkan
  - Vall d'Hebron Institute of Research (VHIR), Barcelona, Spain
  - Universitat Autònoma de Barcelona, Barcelona, Spain
- Natàlia Padilla
  - Vall d'Hebron Institute of Research (VHIR), Barcelona, Spain
  - Universitat Autònoma de Barcelona, Barcelona, Spain
- Xavier de la Cruz
  - Vall d'Hebron Institute of Research (VHIR), Barcelona, Spain
  - Universitat Autònoma de Barcelona, Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
- Giulia Babbi
  - Biocomputing Group, University of Bologna, Bologna, Italy
- Rita Casadio
  - Biocomputing Group, University of Bologna, Bologna, Italy
- Yuanfei Sun
  - Department of Electrical & Computer Engineering, Texas A&M University, College Station, TX, USA
- Shaowen Zhu
  - Department of Electrical & Computer Engineering, Texas A&M University, College Station, TX, USA
- Yang Shen
  - Department of Electrical & Computer Engineering, Texas A&M University, College Station, TX, USA
- Fabrizio Pucci
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, Bruxelles, Belgium
- Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Bruxelles, Belgium
| | - Gabriel Cia
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Bruxelles, Belgium
| | | | - Pauline Hermans
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Bruxelles, Belgium
| | - Sofia Kwee
- University of California, Berkeley, Berkeley, CA, USA
| | - Ella Chen
- University of California, Berkeley, Berkeley, CA, USA
| | | | - Akash Kamandula
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
| | - Vikas Pejaver
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Rashika Ramola
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
| | - Michelle Velyunskiy
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
| | - Daniel Zeiberg
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
| | - Reet Mishra
- Department of Bioengineering, University of California, Berkeley, CA, USA
- Department of Bioengineering, University of California, San Francisco, CA, USA
| | | | - Jennifer L Goldstein
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Jose Lugo-Martinez
- Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
| | | | - Sindy Li
- University of California, Berkeley, Berkeley, CA, USA
| | - Kinsey Long
- University of California, Berkeley, Berkeley, CA, USA
| | | | | | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
|
12
|
Kellogg GE. Three-Dimensional Interaction Homology: Deconstructing Residue-Residue and Residue-Lipid Interactions in Membrane Proteins. Molecules 2024; 29:2838. [PMID: 38930903 PMCID: PMC11207109 DOI: 10.3390/molecules29122838] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2024] [Revised: 06/09/2024] [Accepted: 06/10/2024] [Indexed: 06/28/2024] Open
Abstract
A method is described to deconstruct the network of hydropathic interactions within and between a protein's sidechain and its environment into residue-based three-dimensional maps. These maps encode favorable and unfavorable hydrophobic and polar interactions, in terms of spatial positions for optimal interactions, relative interaction strength, as well as character. In addition, these maps are backbone angle-dependent. After map calculation and clustering, a finite number of unique residue sidechain interaction maps exist for each backbone conformation, with the number related to the residue's size and interaction complexity. Structures for soluble proteins (~749,000 residues) and membrane proteins (~387,000 residues) were analyzed, with the latter group being subdivided into three subsets related to the residue's position in the membrane protein: soluble domain, core-facing transmembrane domain, and lipid-facing transmembrane domain. This work suggests that maps representing residue types and their backbone conformation can be reassembled to optimize the medium-to-high resolution details of a protein structure. In particular, the information encoded in maps constructed from the lipid-facing transmembrane residues appears to paint a clear picture of the protein-lipid interactions that are difficult to obtain experimentally.
Affiliation(s)
- Glen E Kellogg
- Department of Medicinal Chemistry, Virginia Commonwealth University, Richmond, VA 23298-0540, USA
|
13
|
He S, Huang R, Townley J, Kretsch RC, Karagianes TG, Cox DBT, Blair H, Penzar D, Vyaltsev V, Aristova E, Zinkevich A, Bakulin A, Sohn H, Krstevski D, Fukui T, Tatematsu F, Uchida Y, Jang D, Lee JS, Shieh R, Ma T, Martynov E, Shugaev MV, Bukhari HST, Fujikawa K, Onodera K, Henkel C, Ron S, Romano J, Nicol JJ, Nye GP, Wu Y, Choe C, Reade W, Das R. Ribonanza: deep learning of RNA structure through dual crowdsourcing. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.24.581671. [PMID: 38464325 PMCID: PMC10925082 DOI: 10.1101/2024.02.24.581671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/12/2024]
Abstract
Prediction of RNA structure from sequence remains an unsolved problem, and progress has been slowed by a paucity of experimental data. Here, we present Ribonanza, a dataset of chemical mapping measurements on two million diverse RNA sequences collected through Eterna and other crowdsourced initiatives. Ribonanza measurements enabled solicitation, training, and prospective evaluation of diverse deep neural networks through a Kaggle challenge, followed by distillation into a single, self-contained model called RibonanzaNet. When fine tuned on auxiliary datasets, RibonanzaNet achieves state-of-the-art performance in modeling experimental sequence dropout, RNA hydrolytic degradation, and RNA secondary structure, with implications for modeling RNA tertiary structure.
Affiliation(s)
- Shujun He
- Department of Chemical Engineering, Texas A&M University, TX, USA
| | - Rui Huang
- Department of Biochemistry, Stanford CA, USA
| | | | | | | | - David B T Cox
- Department of Biochemistry, Stanford CA, USA
- Department of Medicine, Division of Hematology, and Department of Biochemistry, Stanford CA, USA
| | | | - Dmitry Penzar
- AIRI, Moscow, Russia
- Vavilov Institute of General Genetics, Moscow 119991, Russia
- Institute of Translational Medicine, Pirogov Russian National Research Medical University, Moscow 117997, Russia
| | - Valeriy Vyaltsev
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Russian Federation
| | - Elizaveta Aristova
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Russian Federation
| | - Arsenii Zinkevich
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Russian Federation
| | - Artemy Bakulin
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Russian Federation
| | - Hoyeol Sohn
- Department of Chemical Engineering, Texas A&M University, TX, USA
- Department of Biochemistry, Stanford CA, USA
- Eterna Massive Open Laboratory
- Biophysics Program, Stanford CA, USA
- Department of Medicine, Division of Hematology, and Department of Biochemistry, Stanford CA, USA
- Department of Mathematics, Stanford CA, USA
- AIRI, Moscow, Russia
- Vavilov Institute of General Genetics, Moscow 119991, Russia
- Institute of Translational Medicine, Pirogov Russian National Research Medical University, Moscow 117997, Russia
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Russian Federation
- GO Inc., Tokyo, Japan
- Department of Electrical and Computer Engineering, Inha University, Incheon, Republic of Korea
- DeltaX, Seoul, Republic of Korea
- Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, Russian Federation
- Department of Materials Science and Engineering, University of Virginia, Charlottesville, VA 22904-4745, USA
- Vergesense, CA
- DeNA, Tokyo, Japan
- NVIDIA, Tokyo, Japan
- NVIDIA, Munich
- Howard Hughes Medical Institute
- Department of Bioengineering, Stanford CA, USA
- Kaggle, San Francisco CA, USA
| | - Daniel Krstevski
- Department of Chemical Engineering, Texas A&M University, TX, USA
- Department of Biochemistry, Stanford CA, USA
- Eterna Massive Open Laboratory
- Biophysics Program, Stanford CA, USA
- Department of Medicine, Division of Hematology, and Department of Biochemistry, Stanford CA, USA
- Department of Mathematics, Stanford CA, USA
- AIRI, Moscow, Russia
- Vavilov Institute of General Genetics, Moscow 119991, Russia
- Institute of Translational Medicine, Pirogov Russian National Research Medical University, Moscow 117997, Russia
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Russian Federation
- GO Inc., Tokyo, Japan
- Department of Electrical and Computer Engineering, Inha University, Incheon, Republic of Korea
- DeltaX, Seoul, Republic of Korea
- Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, Russian Federation
- Department of Materials Science and Engineering, University of Virginia, Charlottesville, VA 22904-4745, USA
- Vergesense, CA
- DeNA, Tokyo, Japan
- NVIDIA, Tokyo, Japan
- NVIDIA, Munich
- Howard Hughes Medical Institute
- Department of Bioengineering, Stanford CA, USA
- Kaggle, San Francisco CA, USA
| | | | | | | | - Donghoon Jang
- Department of Electrical and Computer Engineering, Inha University, Incheon, Republic of Korea
| | | | - Roger Shieh
- Department of Chemical Engineering, Texas A&M University, TX, USA
- Department of Biochemistry, Stanford CA, USA
- Eterna Massive Open Laboratory
- Biophysics Program, Stanford CA, USA
- Department of Medicine, Division of Hematology, and Department of Biochemistry, Stanford CA, USA
- Department of Mathematics, Stanford CA, USA
- AIRI, Moscow, Russia
- Vavilov Institute of General Genetics, Moscow 119991, Russia
- Institute of Translational Medicine, Pirogov Russian National Research Medical University, Moscow 117997, Russia
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Russian Federation
- GO Inc., Tokyo, Japan
- Department of Electrical and Computer Engineering, Inha University, Incheon, Republic of Korea
- DeltaX, Seoul, Republic of Korea
- Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, Russian Federation
- Department of Materials Science and Engineering, University of Virginia, Charlottesville, VA 22904-4745, USA
- Vergesense, CA
- DeNA, Tokyo, Japan
- NVIDIA, Tokyo, Japan
- NVIDIA, Munich
- Howard Hughes Medical Institute
- Department of Bioengineering, Stanford CA, USA
- Kaggle, San Francisco CA, USA
| | - Tom Ma
- Department of Chemical Engineering, Texas A&M University, TX, USA
- Department of Biochemistry, Stanford CA, USA
- Eterna Massive Open Laboratory
- Biophysics Program, Stanford CA, USA
- Department of Medicine, Division of Hematology, and Department of Biochemistry, Stanford CA, USA
- Department of Mathematics, Stanford CA, USA
- AIRI, Moscow, Russia
- Vavilov Institute of General Genetics, Moscow 119991, Russia
- Institute of Translational Medicine, Pirogov Russian National Research Medical University, Moscow 117997, Russia
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Russian Federation
- GO Inc., Tokyo, Japan
- Department of Electrical and Computer Engineering, Inha University, Incheon, Republic of Korea
- DeltaX, Seoul, Republic of Korea
- Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, Russian Federation
- Department of Materials Science and Engineering, University of Virginia, Charlottesville, VA 22904-4745, USA
- Vergesense, CA
- DeNA, Tokyo, Japan
- NVIDIA, Tokyo, Japan
- NVIDIA, Munich
- Howard Hughes Medical Institute
- Department of Bioengineering, Stanford CA, USA
- Kaggle, San Francisco CA, USA
| | - Eduard Martynov
- Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, Russian Federation
| | - Maxim V Shugaev
- Department of Materials Science and Engineering, University of Virginia, Charlottesville, VA 22904-4745, USA
| | | | | | | | | | - Shlomo Ron
- Department of Chemical Engineering, Texas A&M University, TX, USA
- Department of Biochemistry, Stanford CA, USA
- Eterna Massive Open Laboratory
- Biophysics Program, Stanford CA, USA
- Department of Medicine, Division of Hematology, and Department of Biochemistry, Stanford CA, USA
- Department of Mathematics, Stanford CA, USA
- AIRI, Moscow, Russia
- Vavilov Institute of General Genetics, Moscow 119991, Russia
- Institute of Translational Medicine, Pirogov Russian National Research Medical University, Moscow 117997, Russia
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Russian Federation
- GO Inc., Tokyo, Japan
- Department of Electrical and Computer Engineering, Inha University, Incheon, Republic of Korea
- DeltaX, Seoul, Republic of Korea
- Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, Russian Federation
- Department of Materials Science and Engineering, University of Virginia, Charlottesville, VA 22904-4745, USA
- Vergesense, CA
- DeNA, Tokyo, Japan
- NVIDIA, Tokyo, Japan
- NVIDIA, Munich
- Howard Hughes Medical Institute
- Department of Bioengineering, Stanford CA, USA
- Kaggle, San Francisco CA, USA
| | - Jonathan Romano
- Eterna Massive Open Laboratory
- Howard Hughes Medical Institute
| | | | - Grace P Nye
- Department of Biochemistry, Stanford CA, USA
| | - Yuan Wu
- Department of Biochemistry, Stanford CA, USA
- Howard Hughes Medical Institute
| | | | | | - Rhiju Das
- Department of Biochemistry, Stanford CA, USA
- Biophysics Program, Stanford CA, USA
- Howard Hughes Medical Institute
|
14
|
Dahlström KM, Salminen TA. Apprehensions and emerging solutions in ML-based protein structure prediction. Curr Opin Struct Biol 2024; 86:102819. [PMID: 38631107 DOI: 10.1016/j.sbi.2024.102819] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Revised: 03/05/2024] [Accepted: 03/31/2024] [Indexed: 04/19/2024]
Abstract
The three-dimensional structure of proteins determines their function in vital biological processes. Thus, when the structure is known, the molecular mechanism of protein function can be understood in more detail and the obtained information can be utilized in biotechnological, diagnostic, and therapeutic applications. Over the past five years, machine learning (ML)-based modeling has pushed protein structure prediction to the next level with AlphaFold in the front line, predicting the structure for hundreds of millions of proteins. Recent advances report promising ML-based approaches for solving remaining challenges by incorporating functionally important metals, co-factors, post-translational modifications, structural dynamics, and interdomain and multimer interactions in the structure prediction process.
Affiliation(s)
- Käthe M Dahlström
- Structural Bioinformatics Laboratory, Biochemistry, Faculty of Science and Engineering, Åbo Akademi University, Tykistökatu 6A, 20520 Turku, Finland; InFLAMES Research Flagship Center, Åbo Akademi University, 20520 Turku, Finland
| | - Tiina A Salminen
- Structural Bioinformatics Laboratory, Biochemistry, Faculty of Science and Engineering, Åbo Akademi University, Tykistökatu 6A, 20520 Turku, Finland; InFLAMES Research Flagship Center, Åbo Akademi University, 20520 Turku, Finland.
|
15
|
Randolph NZ, Kuhlman B. Invariant point message passing for protein side chain packing. Proteins 2024. [PMID: 38790143 DOI: 10.1002/prot.26705] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2023] [Revised: 04/19/2024] [Accepted: 05/13/2024] [Indexed: 05/26/2024]
Abstract
Protein side chain packing (PSCP) is a fundamental problem in the field of protein engineering, as high-confidence and low-energy conformations of amino acid side chains are crucial for understanding (and designing) protein folding, protein-protein interactions, and protein-ligand interactions. Traditional PSCP methods (such as the Rosetta Packer) often rely on a library of discrete side chain conformations, or rotamers, and a forcefield to guide the structure to low-energy conformations. Recently, deep learning (DL) based methods (such as DLPacker, AttnPacker, and DiffPack) have demonstrated state-of-the-art predictions and speed in the PSCP task. Building off the success of geometric graph neural networks for protein modeling, we present the Protein Invariant Point Packer (PIPPack) which effectively processes local structural and sequence information to produce realistic, idealized side chain coordinates using χ-angle distribution predictions and geometry-aware invariant point message passing (IPMP). On a test set of ∼1400 high-quality protein chains, PIPPack is highly competitive with other state-of-the-art PSCP methods in rotamer recovery and per-residue RMSD but is significantly faster.
Affiliation(s)
- Nicholas Z Randolph
- Department of Bioinformatics and Computational Biology, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
- Department of Biochemistry and Biophysics, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
| | - Brian Kuhlman
- Department of Bioinformatics and Computational Biology, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
- Department of Biochemistry and Biophysics, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
|
16
|
Tang X, Dai H, Knight E, Wu F, Li Y, Li T, Gerstein M. A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation. Brief Bioinform 2024; 25:bbae338. [PMID: 39007594 PMCID: PMC11247410 DOI: 10.1093/bib/bbae338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2024] [Revised: 05/21/2024] [Accepted: 06/27/2024] [Indexed: 07/16/2024] Open
Abstract
Artificial intelligence (AI)-driven methods can vastly improve the historically costly drug design process, with various generative models already in widespread use. Generative models for de novo drug design, in particular, focus on the creation of novel biological compounds entirely from scratch, representing a promising future direction. Rapid development in the field, combined with the inherent complexity of the drug design process, creates a difficult landscape for new researchers to enter. In this survey, we organize de novo drug design into two overarching themes: small molecule and protein generation. Within each theme, we identify a variety of subtasks and applications, highlighting important datasets, benchmarks, and model architectures and comparing the performance of top models. We take a broad approach to AI-driven drug design, allowing for both micro-level comparisons of various methods within each subtask and macro-level observations across different fields. We discuss parallel challenges and approaches between the two applications and highlight future directions for AI-driven de novo drug design as a whole. An organized repository of all covered sources is available at https://github.com/gersteinlab/GenAI4Drug.
Affiliation(s)
- Xiangru Tang
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
| | - Howard Dai
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
| | - Elizabeth Knight
- School of Medicine, Yale University, New Haven, CT 06520, United States
| | - Fang Wu
- Computer Science Department, Stanford University, CA 94305, United States
| | - Yunyang Li
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
| | - Tianxiao Li
- Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States
| | - Mark Gerstein
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
- Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States
- Department of Statistics & Data Science, Yale University, New Haven, CT 06520, United States
- Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, United States
- Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, United States
|
17
|
Doga H, Raubenolt B, Cumbo F, Joshi J, DiFilippo FP, Qin J, Blankenberg D, Shehab O. A Perspective on Protein Structure Prediction Using Quantum Computers. J Chem Theory Comput 2024; 20:3359-3378. [PMID: 38703105 PMCID: PMC11099973 DOI: 10.1021/acs.jctc.4c00067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2024] [Revised: 04/19/2024] [Accepted: 04/22/2024] [Indexed: 05/06/2024]
Abstract
Despite the recent advancements by deep learning methods such as AlphaFold2, in silico protein structure prediction remains a challenging problem in biomedical research. With the rapid evolution of quantum computing, it is natural to ask whether quantum computers can offer some meaningful benefits for approaching this problem. Yet, identifying specific problem instances amenable to quantum advantage and estimating the quantum resources required are equally challenging tasks. Here, we share our perspective on how to create a framework for systematically selecting protein structure prediction problems that are amenable for quantum advantage, and estimate quantum resources for such problems on a utility-scale quantum computer. As a proof-of-concept, we validate our problem selection framework by accurately predicting the structure of a catalytic loop of the Zika Virus NS3 Helicase, on quantum hardware.
Affiliation(s)
- Hakan Doga
- IBM Quantum, Almaden Research Center, San Jose, California 95120, United States
| | - Bryan Raubenolt
- Center for Computational Life Sciences, Lerner Research Institute, The Cleveland Clinic, Cleveland, Ohio 44106, United States
| | - Fabio Cumbo
- Center for Computational Life Sciences, Lerner Research Institute, The Cleveland Clinic, Cleveland, Ohio 44106, United States
| | - Jayadev Joshi
- Center for Computational Life Sciences, Lerner Research Institute, The Cleveland Clinic, Cleveland, Ohio 44106, United States
| | - Frank P. DiFilippo
- Center for Computational Life Sciences, Lerner Research Institute, The Cleveland Clinic, Cleveland, Ohio 44106, United States
| | - Jun Qin
- Center for Computational Life Sciences, Lerner Research Institute, The Cleveland Clinic, Cleveland, Ohio 44106, United States
| | - Daniel Blankenberg
- Center for Computational Life Sciences, Lerner Research Institute, The Cleveland Clinic, Cleveland, Ohio 44106, United States
| | - Omar Shehab
- IBM Quantum, IBM Thomas J Watson Research Center, Yorktown Heights, New York 10598, United States
|
18
|
Fazekas Z, K Menyhárd D, Perczel A. LoCoHD: a metric for comparing local environments of proteins. Nat Commun 2024; 15:4029. [PMID: 38740745 DOI: 10.1038/s41467-024-48225-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2023] [Accepted: 04/22/2024] [Indexed: 05/16/2024] Open
Abstract
Protein folds and the local environments they create can be compared using a variety of differently designed measures, such as the root mean squared deviation, the global distance test, the template modeling score or the local distance difference test. Although these measures have proven to be useful for a variety of tasks, each fails to fully incorporate the valuable chemical information inherent to atoms and residues, and considers these only partially and indirectly. Here, we develop the highly flexible local composition Hellinger distance (LoCoHD) metric, which is based on the chemical composition of local residue environments. Using LoCoHD, we analyze the chemical heterogeneity of amino acid environments and identify valines as having the most conserved, and arginines as having the most variable, chemical environments. We use LoCoHD to investigate structural ensembles, to evaluate critical assessment of structure prediction (CASP) competitors, to compare the results with the local distance difference test (lDDT) scoring system, and to evaluate a molecular dynamics simulation. We show that LoCoHD measurements provide unique information about protein structures that is distinct from, for example, those derived using the alignment-based RMSD metric, or the similarly distance matrix-based but alignment-free lDDT metric.
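As a rough illustration of the quantity the metric's name refers to, the sketch below computes the standard Hellinger distance between two discrete composition distributions. It is not the authors' LoCoHD implementation, which according to the abstract works on local residue environments. The function names `composition` and `hellinger`, the toy three-letter alphabet, and the example environments are all assumptions made for illustration.

```python
import math
from collections import Counter


def composition(labels, alphabet):
    """Turn a list of categorical labels into a probability vector over `alphabet`."""
    counts = Counter(labels)
    total = sum(counts[a] for a in alphabet)
    if total == 0:
        raise ValueError("no labels from the alphabet were found")
    return [counts[a] / total for a in alphabet]


def hellinger(p, q):
    """Standard Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


# Hypothetical chemical classes of atoms found around two residues.
alphabet = ["hydrophobic", "polar", "charged"]
env_a = composition(["hydrophobic", "hydrophobic", "polar"], alphabet)
env_b = composition(["charged", "charged", "polar"], alphabet)
distance = hellinger(env_a, env_b)
```

The distance is bounded between 0 and 1, which makes values comparable across residues regardless of how many atoms each environment contains.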
Affiliation(s)
- Zsolt Fazekas
- Laboratory of Structural Chemistry and Biology, Institute of Chemistry, ELTE Eötvös Loránd University, Budapest, Hungary
- ELTE Hevesy György PhD School of Chemistry, ELTE Eötvös Loránd University, Budapest, Hungary
| | - Dóra K Menyhárd
- Laboratory of Structural Chemistry and Biology, Institute of Chemistry, ELTE Eötvös Loránd University, Budapest, Hungary
- HUN-REN-ELTE Protein Modeling Research Group, ELTE Eötvös Loránd University, Budapest, Hungary
| | - András Perczel
- Laboratory of Structural Chemistry and Biology, Institute of Chemistry, ELTE Eötvös Loránd University, Budapest, Hungary
- HUN-REN-ELTE Protein Modeling Research Group, ELTE Eötvös Loránd University, Budapest, Hungary
|
19
|
Ille AM, Markosian C, Burley SK, Mathews MB, Pasqualini R, Arap W. Generative artificial intelligence performs rudimentary structural biology modeling. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.10.575113. [PMID: 38293060 PMCID: PMC10827103 DOI: 10.1101/2024.01.10.575113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/01/2024]
Abstract
Natural language-based generative artificial intelligence (AI) has become increasingly prevalent in scientific research. Intriguingly, capabilities of generative pre-trained transformer (GPT) language models beyond the scope of natural language tasks have recently been identified. Here we explored how GPT-4 might be able to perform rudimentary structural biology modeling. We prompted GPT-4 to model 3D structures for the 20 standard amino acids and an α-helical polypeptide chain, with the latter incorporating Wolfram mathematical computation. We also used GPT-4 to perform structural interaction analysis between nirmatrelvir and its target, the SARS-CoV-2 main protease. Geometric parameters of the generated structures typically came close to experimental references. However, modeling was sporadically error-prone and molecular complexity was not well tolerated. Interaction analysis further revealed the ability of GPT-4 to identify specific amino acid residues involved in ligand binding along with corresponding bond distances. Despite current limitations, we show the capacity of natural language generative AI to perform basic structural biology modeling and interaction analysis with atomic-scale accuracy.
Affiliation(s)
- Alexander M. Ille
- School of Graduate Studies, Rutgers, The State University of New Jersey, Newark, New Jersey, USA
- Rutgers Cancer Institute of New Jersey, Newark, New Jersey, USA
- Division of Cancer Biology, Department of Radiation Oncology, Rutgers New Jersey Medical School, Newark, New Jersey, USA
| | - Christopher Markosian
- School of Graduate Studies, Rutgers, The State University of New Jersey, Newark, New Jersey, USA
- Rutgers Cancer Institute of New Jersey, Newark, New Jersey, USA
- Division of Cancer Biology, Department of Radiation Oncology, Rutgers New Jersey Medical School, Newark, New Jersey, USA
| | - Stephen K. Burley
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, New Jersey, USA
- Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, New Jersey, USA
- Rutgers Cancer Institute of New Jersey, Robert Wood Johnson Medical School, New Brunswick, New Jersey, USA
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, California, USA
| | - Michael B. Mathews
- School of Graduate Studies, Rutgers, The State University of New Jersey, Newark, New Jersey, USA
- Division of Infectious Disease, Department of Medicine, Rutgers New Jersey Medical School, Newark, New Jersey, USA
| | - Renata Pasqualini
- Rutgers Cancer Institute of New Jersey, Newark, New Jersey, USA
- Division of Cancer Biology, Department of Radiation Oncology, Rutgers New Jersey Medical School, Newark, New Jersey, USA
| | - Wadih Arap
- Rutgers Cancer Institute of New Jersey, Newark, New Jersey, USA
- Division of Hematology/Oncology, Department of Medicine, Rutgers New Jersey Medical School, Newark, New Jersey, USA
|
20
|
Lee S, Kim G, Karin EL, Mirdita M, Park S, Chikhi R, Babaian A, Kryshtafovych A, Steinegger M. Petabase-Scale Homology Search for Structure Prediction. Cold Spring Harb Perspect Biol 2024; 16:a041465. [PMID: 38316555 PMCID: PMC11065157 DOI: 10.1101/cshperspect.a041465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2024]
Abstract
The recent CASP15 competition highlighted the critical role of multiple sequence alignments (MSAs) in protein structure prediction, as demonstrated by the success of the top AlphaFold2-based prediction methods. To push the boundaries of MSA utilization, we conducted a petabase-scale search of the Sequence Read Archive (SRA), resulting in gigabytes of aligned homologs for CASP15 targets. These were merged with default MSAs produced by ColabFold-search and provided to ColabFold-predict. By using SRA data, we achieved highly accurate predictions (GDT_TS > 70) for 66% of the non-easy targets, whereas using ColabFold-search default MSAs scored highly in only 52%. Next, we tested the effect of deep homology search and ColabFold's advanced features, such as more recycles, on prediction accuracy. While SRA homologs were most significant for improving ColabFold's CASP15 ranking from 11th to 3rd place, other strategies contributed too. We analyze these in the context of existing strategies to improve prediction.
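For readers unfamiliar with the GDT_TS threshold quoted above, the following sketch shows how the score is commonly defined: the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of residues whose Cα atoms lie within that cutoff of the reference structure. This is a simplified illustration under stated assumptions. It takes per-residue distances from a single, already-computed superposition, whereas the full score searches over superpositions to maximize each percentage. The function name `gdt_ts` and the example distances are hypothetical and are not taken from the paper.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS from per-residue C-alpha distances (in angstroms).

    Assumes the model has already been superimposed on the reference;
    the full score would optimize the superposition for each cutoff.
    """
    if not distances:
        raise ValueError("at least one residue distance is required")
    n = len(distances)
    # Fraction of residues within each cutoff.
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    # Average the fractions and express the result on a 0-100 scale.
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical distances for a four-residue model.
score = gdt_ts([0.5, 1.5, 3.0, 9.0])
is_highly_accurate = score > 70  # threshold used in the abstract
```

Because the single fixed superposition is not optimized, this sketch can only underestimate the score that a full implementation would report for the same pair of structures.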
Affiliation(s)
- Sewon Lee
- School of Biological Sciences, Seoul National University, Gwanak-gu, Seoul 08826, South Korea
| | - Gyuri Kim
- School of Biological Sciences, Seoul National University, Gwanak-gu, Seoul 08826, South Korea
| | | | - Milot Mirdita
- School of Biological Sciences, Seoul National University, Gwanak-gu, Seoul 08826, South Korea
| | - Sukhwan Park
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, South Korea
| | - Rayan Chikhi
- Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, 75015 Paris, France
| | - Artem Babaian
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario M5S 1A8, Canada
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario M5S 3E1, Canada
| | | | - Martin Steinegger
- School of Biological Sciences, Seoul National University, Gwanak-gu, Seoul 08826, South Korea
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, South Korea
- Artificial Intelligence Institute, Seoul National University, Seoul 08826, South Korea
- Institute of Molecular Biology and Genetics, Seoul National University, Seoul 08826, South Korea
21
|
Mallick B, Dutta A, Mondal P, Dutta M. Proteomic analysis and protein structure prediction of Shigella phage Sfk20 based on a comparative study using structure prediction approaches. Proteins 2024; 92:637-648. [PMID: 38146101 DOI: 10.1002/prot.26653] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2023] [Revised: 11/21/2023] [Accepted: 12/01/2023] [Indexed: 12/27/2023]
Abstract
Bacteriophages are the natural predators of bacteria and are abundant everywhere in nature. Lytic phages can specifically infect their bacterial host (through attachment to the receptor) and use the host's replication machinery to replicate rapidly, a feature that enables them to kill disease-causing bacteria. Hence, phage attachment to the host bacterium is the first important step of the infection process. This study reports that the receptor responsible for attachment of the Sfk20 phage to its host (Shigella flexneri 2a) could be an LPS. Phage Sfk20 bacteriolytic activity was examined for preliminary optimization of phage titer, and the viability of phage Sfk20 under different saline conditions was assessed. The LC-MS/MS technique used here to detect and identify 40 Sfk20 phage proteins gave an initial understanding of the structural landscape of phage Sfk20. From the identified proteins, six structurally significant proteins were selected for structure prediction using two neural network systems, AlphaFold2 and ESMFold, and one homology modeling program, Phyre2. The performance of these modeling systems was then compared using various metrics. We conclude from the available and generated information that AlphaFold2 and Phyre2 perform better than ESMFold for predicting Sfk20 phage protein structures.
Affiliation(s)
- Bani Mallick
- Division of Electron Microscopy, ICMR-National Institute of Cholera & Enteric Diseases, Kolkata, West Bengal, India
| | - Aninda Dutta
- Division of Electron Microscopy, ICMR-National Institute of Cholera & Enteric Diseases, Kolkata, West Bengal, India
| | - Payel Mondal
- Division of Electron Microscopy, ICMR-National Institute of Cholera & Enteric Diseases, Kolkata, West Bengal, India
| | - Moumita Dutta
- Division of Electron Microscopy, ICMR-National Institute of Cholera & Enteric Diseases, Kolkata, West Bengal, India
22
|
Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Affiliation(s)
- Thomas G Brooks
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Nicholas F Lahens
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Antonijo Mrčela
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Gregory R Grant
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA.
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA.
23
|
Cowan DA, Albers SV, Antranikian G, Atomi H, Averhoff B, Basen M, Driessen AJM, Jebbar M, Kelman Z, Kerou M, Littlechild J, Müller V, Schönheit P, Siebers B, Vorgias K. Extremophiles in a changing world. Extremophiles 2024; 28:26. [PMID: 38683238 PMCID: PMC11058618 DOI: 10.1007/s00792-024-01341-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Accepted: 04/02/2024] [Indexed: 05/01/2024]
Abstract
Extremophiles and their products have been a major focus of research interest for over 40 years. Through this period, studies of these organisms have contributed hugely to many aspects of the fundamental and applied sciences, and to wider and more philosophical issues such as the origins of life and astrobiology. Our understanding of the cellular adaptations to extreme conditions (such as acid, temperature, pressure and more), of the mechanisms underpinning the stability of macromolecules, and of the subtleties, complexities and limits of fundamental biochemical processes has been informed by research on extremophiles. Extremophiles have also contributed numerous products and processes to the many fields of biotechnology, from diagnostics to bioremediation. Yet, after 40 years of dedicated research, there remains much to be discovered in this field. Fortunately, extremophiles remain an active and vibrant area of research. In the third decade of the twenty-first century, with decreasing global resources and a steadily increasing human population, the world's attention has turned with increasing urgency to issues of sustainability. These global concerns were encapsulated and formalized by the United Nations with the adoption of the 2030 Agenda for Sustainable Development and the presentation of the seventeen Sustainable Development Goals (SDGs) in 2015. In the run-up to 2030, we consider the contributions that extremophiles have made, and will in the future make, to the SDGs.
Affiliation(s)
- D A Cowan
- Centre for Microbial Ecology and Genomics, Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, 0002, South Africa.
| | - S V Albers
- Faculty of Biology, University of Freiburg, Freiburg, Germany
| | - G Antranikian
- Institute of Technical Biocatalysis, Hamburg University of Technology, 21073, Hamburg, Germany
| | - H Atomi
- Graduate School of Engineering, Kyoto University, Kyoto, Japan
| | - B Averhoff
- Department of Molecular Microbiology and Bioenergetics, Institute of Molecular Biosciences, Goethe University Frankfurt, Frankfurt Am Main, Germany
| | - M Basen
- Department of Microbiology, Institute of Biological Sciences, University of Rostock, Rostock, Germany
| | - A J M Driessen
- Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Nijenborgh 7, 9747 AG, Groningen, The Netherlands
| | - M Jebbar
- Univ. Brest, CNRS, Ifremer, Laboratoire de Biologie Et d'Écologie Des Écosystèmes Marins Profonds (BEEP), IUEM, Rue Dumont d'Urville, 29280, Plouzané, France
| | - Z Kelman
- Institute for Bioscience and Biotechnology Research and the National Institute of Standards and Technology, Rockville, MD, USA
| | - M Kerou
- Department of Functional and Evolutionary Ecology, Faculty of Life Sciences, University of Vienna, Vienna, Austria
| | - J Littlechild
- Henry Wellcome Building for Biocatalysis, Faculty of Health and Life Sciences, University of Exeter, Exeter, UK
| | - V Müller
- Department of Molecular Microbiology and Bioenergetics, Institute of Molecular Biosciences, Goethe University Frankfurt, Frankfurt Am Main, Germany
| | - P Schönheit
- Institute of General Microbiology, Christian Albrechts University, Kiel, Germany
| | - B Siebers
- Molecular Enzyme Technology and Biochemistry (MEB), Environmental Microbiology and Biotechnology (EMB), Centre for Water and Environmental Research (CWE), University of Duisburg-Essen, 45117, Essen, Germany
| | - K Vorgias
- Biology Department and RI-Bio3, National and Kapodistrian University of Athens, Athens, Greece
24
|
Ertelt M, Meiler J, Schoeder CT. Combining Rosetta Sequence Design with Protein Language Model Predictions Using Evolutionary Scale Modeling (ESM) as Restraint. ACS Synth Biol 2024; 13:1085-1092. [PMID: 38568188 PMCID: PMC11036486 DOI: 10.1021/acssynbio.3c00753] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2023] [Revised: 02/16/2024] [Accepted: 03/20/2024] [Indexed: 04/20/2024]
Abstract
Computational protein sequence design has the ambitious goal of modifying existing or creating new proteins; however, designing stable and functional proteins is challenging without predictability of protein dynamics and allostery. Informing protein design methods with evolutionary information limits the mutational space to more native-like sequences and results in increased stability while maintaining functions. Recently, language models, trained on millions of protein sequences, have shown impressive performance in predicting the effects of mutations. Assessing Rosetta-designed sequences with a language model showed scores that were worse than those of their original sequence. To inform Rosetta design protocols with language model predictions, we added a new metric to restrain the energy function during design using the Evolutionary Scale Modeling (ESM) model. The resulting sequences have better language model scores and similar sequence recovery, with only a minor decrease in the fitness as assessed by Rosetta energy. In conclusion, our work combines the strength of recent machine learning approaches with the Rosetta protein design toolbox.
Affiliation(s)
- Moritz Ertelt
- Institute for Drug Discovery, University Leipzig Medicine Faculty, Liebigstr. 19, D-04103 Leipzig, Germany
- Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, D-04105 Leipzig, Germany
| | - Jens Meiler
- Institute for Drug Discovery, University Leipzig Medicine Faculty, Liebigstr. 19, D-04103 Leipzig, Germany
- Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, D-04105 Leipzig, Germany
- Department of Chemistry, Vanderbilt University, Nashville, Tennessee 37235, United States
- Center for Structural Biology, Vanderbilt University, Nashville, Tennessee 37235, United States
| | - Clara T. Schoeder
- Institute for Drug Discovery, University Leipzig Medicine Faculty, Liebigstr. 19, D-04103 Leipzig, Germany
- Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, D-04105 Leipzig, Germany
25
|
Tripp A, Braun M, Wieser F, Oberdorfer G, Lechner H. Click, Compute, Create: A Review of Web-based Tools for Enzyme Engineering. Chembiochem 2024:e202400092. [PMID: 38634409 DOI: 10.1002/cbic.202400092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 04/14/2024] [Accepted: 04/15/2024] [Indexed: 04/19/2024]
Abstract
Enzyme engineering, though pivotal across various biotechnological domains, is often plagued by its time-consuming and labor-intensive nature. This review aims to offer an overview of supportive in silico methodologies for this demanding endeavor. Starting from methods to predict protein structures, to classify their activity, and even to discover new enzymes, we continue by describing tools used to increase thermostability and production yields of selected targets. Subsequently, we discuss computational methods to modulate both the activity and the selectivity of enzymes. Last, we present recent approaches based on cutting-edge machine learning methods to redesign enzymes. With the exception of the last chapter, there is a strong focus on methods easily accessible via web interfaces or simple Python scripts, and therefore readily usable by a diverse and broad community.
Affiliation(s)
- Adrian Tripp
- Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
| | - Markus Braun
- Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
| | - Florian Wieser
- Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
| | - Gustav Oberdorfer
- Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
- BioTechMed, Graz, Austria
| | - Horst Lechner
- Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
- BioTechMed, Graz, Austria
26
|
Grassmann G, Miotto M, Desantis F, Di Rienzo L, Tartaglia GG, Pastore A, Ruocco G, Monti M, Milanetti E. Computational Approaches to Predict Protein-Protein Interactions in Crowded Cellular Environments. Chem Rev 2024; 124:3932-3977. [PMID: 38535831 PMCID: PMC11009965 DOI: 10.1021/acs.chemrev.3c00550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Revised: 02/20/2024] [Accepted: 02/21/2024] [Indexed: 04/11/2024]
Abstract
Investigating protein-protein interactions is crucial for understanding cellular biological processes because proteins often function within molecular complexes rather than in isolation. While experimental and computational methods have provided valuable insights into these interactions, they often overlook a critical factor: the crowded cellular environment. This environment significantly impacts protein behavior, including structural stability, diffusion, and ultimately the nature of binding. In this review, we discuss theoretical and computational approaches that allow the modeling of biological systems to guide and complement experiments and can thus significantly advance the investigation, and possibly the predictions, of protein-protein interactions in the crowded environment of cell cytoplasm. We explore topics such as statistical mechanics for lattice simulations, hydrodynamic interactions, diffusion processes in high-viscosity environments, and several methods based on molecular dynamics simulations. By synergistically leveraging methods from biophysics and computational biology, we review the state of the art of computational methods to study the impact of molecular crowding on protein-protein interactions and discuss its potential revolutionizing effects on the characterization of the human interactome.
Affiliation(s)
- Greta Grassmann
- Department of Biochemical Sciences “Alessandro Rossi Fanelli”, Sapienza University of Rome, Rome 00185, Italy
- Center for Life Nano & Neuro Science, Istituto Italiano di Tecnologia, Rome 00161, Italy
| | - Mattia Miotto
- Center for Life Nano & Neuro Science, Istituto Italiano di Tecnologia, Rome 00161, Italy
| | - Fausta Desantis
- Center for Life Nano & Neuro Science, Istituto Italiano di Tecnologia, Rome 00161, Italy
- The Open University Affiliated Research Centre at Istituto Italiano di Tecnologia, Genoa 16163, Italy
| | - Lorenzo Di Rienzo
- Center for Life Nano & Neuro Science, Istituto Italiano di Tecnologia, Rome 00161, Italy
| | - Gian Gaetano Tartaglia
- Center for Life Nano & Neuro Science, Istituto Italiano di Tecnologia, Rome 00161, Italy
- Department of Neuroscience and Brain Technologies, Istituto Italiano di Tecnologia, Genoa 16163, Italy
- Center for Human Technologies, Genoa 16152, Italy
| | - Annalisa Pastore
- Experiment Division, European Synchrotron Radiation Facility, Grenoble 38043, France
| | - Giancarlo Ruocco
- Center for Life Nano & Neuro Science, Istituto Italiano di Tecnologia, Rome 00161, Italy
- Department of Physics, Sapienza University, Rome 00185, Italy
| | - Michele Monti
- RNA System Biology Lab, Department of Neuroscience and Brain Technologies, Istituto Italiano di Tecnologia, Genoa 16163, Italy
| | - Edoardo Milanetti
- Center for Life Nano & Neuro Science, Istituto Italiano di Tecnologia, Rome 00161, Italy
- Department of Physics, Sapienza University, Rome 00185, Italy
27
|
Li X, Shen C, Zhu H, Yang Y, Wang Q, Yang J, Huang N. A High-Quality Data Set of Protein-Ligand Binding Interactions Via Comparative Complex Structure Modeling. J Chem Inf Model 2024; 64:2454-2466. [PMID: 38181418 DOI: 10.1021/acs.jcim.3c01170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/07/2024]
Abstract
High-quality protein-ligand complex structures provide the basis for understanding the nature of noncovalent binding interactions at the atomic level and enable structure-based drug design. However, experimentally determined complex structures are scarce compared with the vast chemical space. In this study, we addressed this issue by constructing the BindingNet data set via comparative complex structure modeling, which contains 69,816 modeled high-quality protein-ligand complex structures with experimental binding affinity data. BindingNet provides valuable insights into investigating protein-ligand interactions, allowing visual inspection and interpretation of structural analogues' structure-activity relationships. It can also be used for evaluating machine-learning-based scoring functions. Our results indicate that machine learning models trained on BindingNet could reduce the bias caused by buried solvent-accessible surface area, as we previously found for models trained on the PDBbind data set. We also discussed strategies to improve BindingNet and its potential utilization for benchmarking the molecular docking methods and ligand binding free energy calculation approaches. The BindingNet complements PDBbind in constructing a sufficient and unbiased protein-ligand binding data set and is freely available at http://bindingnet.huanglab.org.cn.
Affiliation(s)
- Xuelian Li
- National Institute of Biological Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100730, China
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
| | - Cheng Shen
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
| | - Hui Zhu
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing 102206, China
| | - Yujian Yang
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
| | - Qing Wang
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
| | - Jincai Yang
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
| | - Niu Huang
- National Institute of Biological Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100730, China
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing 102206, China
28
|
Capponi S, Wang S. AI in cellular engineering and reprogramming. Biophys J 2024:S0006-3495(24)00245-5. [PMID: 38576162 DOI: 10.1016/j.bpj.2024.04.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Revised: 03/19/2024] [Accepted: 04/01/2024] [Indexed: 04/06/2024] Open
Abstract
During the last decade, artificial intelligence (AI) has increasingly been applied in biophysics and related fields, including cellular engineering and reprogramming, offering novel approaches to understand, manipulate, and control cellular function. The potential of AI lies in its ability to analyze complex datasets and generate predictive models. AI algorithms can process large amounts of data from single-cell genomics and multiomic technologies, allowing researchers to gain mechanistic insights into the control of cell identity and function. By integrating and interpreting these complex datasets, AI can help identify key molecular events and regulatory pathways involved in cellular reprogramming. This knowledge can inform the design of precision engineering strategies, such as the development of new transcription factor and signaling molecule cocktails, to manipulate cell identity and drive authentic cell fate across lineage boundaries. Furthermore, when used in combination with computational methods, AI can accelerate and improve the analysis and understanding of the intricate relationships between genes, proteins, and cellular processes. In this review article, we explore the current state of AI applications in biophysics with a specific focus on cellular engineering and reprogramming. Then, we showcase a couple of recent applications where we combined machine learning with experimental and computational techniques. Finally, we briefly discuss the challenges and prospects of AI in cellular engineering and reprogramming, emphasizing the potential of these technologies to revolutionize our ability to engineer cells for a variety of applications, from disease modeling and drug discovery to regenerative medicine and biomanufacturing.
Affiliation(s)
- Sara Capponi
- IBM Almaden Research Center, San Jose, California; Center for Cellular Construction, San Francisco, California.
| | - Shangying Wang
- Bay Area Institute of Science, Altos Labs, Redwood City, California.
29
|
Zhang J, Durham J, Cong Q. Revolutionizing protein-protein interaction prediction with deep learning. Curr Opin Struct Biol 2024; 85:102775. [PMID: 38330793 DOI: 10.1016/j.sbi.2024.102775] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 12/31/2023] [Accepted: 01/05/2024] [Indexed: 02/10/2024]
Abstract
Protein-protein interactions (PPIs) are pivotal for driving diverse biological processes, and any disturbance in these interactions can lead to disease. Thus, the study of PPIs has been a central focus in biology. Recent developments in deep learning methods, coupled with the vast genomic sequence data, have significantly boosted the accuracy of predicting protein structures and modeling protein complexes, approaching levels comparable to experimental techniques. Herein, we review the latest advances in the computational methods for modeling 3D protein complexes and the prediction of protein interaction partners, emphasizing the application of deep learning methods deriving from coevolution analysis. The review also highlights biomedical applications of PPI prediction and outlines challenges in the field.
Affiliation(s)
- Jing Zhang
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA. https://twitter.com/jzhang_genome
| | - Jesse Durham
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Qian Cong
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA.
30
|
Waman VP, Bordin N, Alcraft R, Vickerstaff R, Rauer C, Chan Q, Sillitoe I, Yamamori H, Orengo C. CATH 2024: CATH-AlphaFlow Doubles the Number of Structures in CATH and Reveals Nearly 200 New Folds. J Mol Biol 2024:168551. [PMID: 38548261 DOI: 10.1016/j.jmb.2024.168551] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 03/20/2024] [Accepted: 03/22/2024] [Indexed: 04/07/2024]
Abstract
CATH (https://www.cathdb.info) classifies domain structures from experimental protein structures in the PDB and predicted structures in the AlphaFold Database (AFDB). To cope with the scale of the predicted data, a new NextFlow workflow (CATH-AlphaFlow) has been developed to classify high-quality domains into CATH superfamilies and identify novel fold groups and superfamilies. CATH-AlphaFlow uses a novel state-of-the-art structure-based domain boundary prediction method (ChainSaw) for identifying domains in multi-domain proteins. We applied CATH-AlphaFlow to process PDB structures not classified in CATH and AFDB structures from 21 model organisms, expanding CATH by over 100%. Domains not classified in existing CATH superfamilies or fold groups were used to seed novel folds, giving 253 new folds from PDB structures (September 2023 release) and 96 from AFDB structures of proteomes of 21 model organisms. Where possible, functional annotations were obtained using (i) predictions from publicly available methods and (ii) annotations from structural relatives in AFDB/UniProt50. We also predicted functional sites and highly conserved residues. Some folds are associated with important functions such as photosynthetic acclimation (in flowering plants), iron permease activity (in fungi) and post-natal spermatogenesis (in mice). CATH-AlphaFlow will allow us to identify many more CATH relatives in the AFDB, further characterising the protein structure landscape.
Affiliation(s)
- Vaishali P Waman
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Rachel Alcraft
- Advanced Research Computing Centre, University College London, London, United Kingdom
| | - Robert Vickerstaff
- Advanced Research Computing Centre, University College London, London, United Kingdom
| | - Clemens Rauer
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Qian Chan
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Hazuki Yamamori
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom.
31
|
Wu X, Lin H, Bai R, Duan H. Deep learning for advancing peptide drug development: Tools and methods in structure prediction and design. Eur J Med Chem 2024; 268:116262. [PMID: 38387334 DOI: 10.1016/j.ejmech.2024.116262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 02/06/2024] [Accepted: 02/17/2024] [Indexed: 02/24/2024]
Abstract
Peptides can bind challenging disease targets with high affinity and specificity, offering enormous opportunities for addressing unmet medical needs. However, peptides' unique features, including smaller size, increased structural flexibility, and limited data availability, pose additional challenges to the design process compared to proteins. This review explores the dynamic field of peptide therapeutics, leveraging deep learning to enhance structure prediction and design. Our exploration encompasses various facets of peptide research, ranging from dataset curation and handling to model development. As deep learning technologies become more refined, we channel our efforts into peptide structure prediction and design, aligning with the fundamental principles of structure-activity relationships in drug development. To guide researchers in harnessing the potential of deep learning to advance peptide drug development, we comprehensively explore current challenges and future directions of peptide therapeutics.
Affiliation(s)
- Xinyi Wu
- College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, 310014, PR China
| | - Huitian Lin
- College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, 310014, PR China
| | - Renren Bai
- School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, PR China.
| | - Hongliang Duan
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, 999078, PR China.
32
|
Carbery A, Buttenschoen M, Skyner R, von Delft F, Deane CM. Learnt representations of proteins can be used for accurate prediction of small molecule binding sites on experimentally determined and predicted protein structures. J Cheminform 2024; 16:32. [PMID: 38486231 PMCID: PMC10941399 DOI: 10.1186/s13321-024-00821-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Accepted: 03/01/2024] [Indexed: 03/17/2024] Open
Abstract
Protein-ligand binding site prediction is a useful tool for understanding the functional behaviour and potential drug-target interactions of a novel protein of interest. However, most binding site prediction methods are tested by providing crystallised ligand-bound (holo) structures as input. This testing regime is insufficient to understand the performance on novel protein targets where experimental structures are not available. An alternative option is to provide computationally predicted protein structures, but this is not commonly tested. However, due to the training data used, computationally-predicted protein structures tend to be extremely accurate, and are often biased toward a holo conformation. In this study we describe and benchmark IF-SitePred, a protein-ligand binding site prediction method which is based on the labelling of ESM-IF1 protein language model embeddings combined with point cloud annotation and clustering. We show that not only is IF-SitePred competitive with state-of-the-art methods when predicting binding sites on experimental structures, but it performs better on proxies for novel proteins where low accuracy has been simulated by molecular dynamics. Finally, IF-SitePred outperforms other methods if ensembles of predicted protein structures are generated.
Affiliation(s)
- Anna Carbery
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford, OX1 3LB, UK
- Diamond Light Source, Harwell Science and Innovation Campus, Didcot, OX11 0DE, UK
| | - Martin Buttenschoen
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford, OX1 3LB, UK
| | - Rachael Skyner
- OMass Therapeutics, Building 4000, Chancellor Court, John Smith Drive, ARC Oxford, OX4 2GX, UK
| | - Frank von Delft
- Diamond Light Source, Harwell Science and Innovation Campus, Didcot, OX11 0DE, UK
- Centre for Medicines Discovery, University of Oxford, Oxford, OX3 7DQ, UK
- Research Complex at Harwell, Harwell Science and Innovation Campus, Didcot, OX11 0FA, United Kingdom
- Department of Biochemistry, University of Johannesburg, Johannesburg, 2006, South Africa
| | - Charlotte M Deane
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford, OX1 3LB, UK.
| |
|
33
|
Wenzel M, Grüner E, Strodthoff N. Insights into the inner workings of transformer models for protein function prediction. Bioinformatics 2024; 40:btae031. [PMID: 38244570 PMCID: PMC10950482 DOI: 10.1093/bioinformatics/btae031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 12/14/2023] [Accepted: 01/16/2024] [Indexed: 01/22/2024] Open
Abstract
MOTIVATION We explored how explainable artificial intelligence (XAI) can help to shed light on the inner workings of neural networks for protein function prediction, by extending the widely used XAI method of integrated gradients such that latent representations inside transformer models, which were finetuned to Gene Ontology term and Enzyme Commission number prediction, can be inspected too. RESULTS The approach enabled us to identify amino acids in the sequences that the transformers pay particular attention to, and to show that these relevant sequence parts reflect expectations from biology and chemistry, both in the embedding layer and inside the model, where we identified transformer heads with a statistically significant correspondence of attribution maps with ground truth sequence annotations (e.g. transmembrane regions, active sites) across many proteins. AVAILABILITY AND IMPLEMENTATION Source code can be accessed at https://github.com/markuswenzel/xai-proteins.
Affiliation(s)
- Markus Wenzel
- Department of Artificial Intelligence, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, HHI, Einsteinufer 37, 10587 Berlin, Germany
| | - Erik Grüner
- Department of Artificial Intelligence, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, HHI, Einsteinufer 37, 10587 Berlin, Germany
| | - Nils Strodthoff
- School VI - Medicine and Health Services, Carl von Ossietzky University of Oldenburg, Ammerländer Heerstr. 114-118, 26129 Oldenburg, Germany
| |
|
34
|
Li Z, Fan H, Ding W. Solving protein structures by combining structure prediction, molecular replacement and direct-methods-aided model completion. IUCrJ 2024; 11:152-167. [PMID: 38214490 PMCID: PMC10916285 DOI: 10.1107/s2052252523010291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Accepted: 11/29/2023] [Indexed: 01/13/2024]
Abstract
Highly accurate protein structure prediction can generate accurate models of protein and protein-protein complexes in X-ray crystallography. However, the question of how to make more effective use of predicted models for completing structure analysis, and which strategies should be employed for the more challenging cases such as multi-helical structures, multimeric structures and extremely large structures, both in the model preparation and in the completion steps, remains open for discussion. In this paper, a new strategy is proposed based on the framework of direct methods and dual-space iteration, which can greatly simplify the pre-processing steps of predicted models both in normal and in challenging cases. Following this strategy, full-length models or the conservative structural domains could be used directly as the starting model, and the phase error and the model bias between the starting model and the real structure would be modified in the direct-methods-based dual-space iteration. Many challenging cases (from CASP14) have been tested for the general applicability of this constructive strategy, and almost complete models have been generated with reasonable statistics. The hybrid strategy therefore provides a meaningful scheme for X-ray structure determination using a predicted model as the starting point.
Affiliation(s)
- Zengru Li
- Beijing National Laboratory for Condensed Matter Physics, Institute of Physics, Chinese Academy of Sciences, Beijing 100190, People’s Republic of China
- School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, People’s Republic of China
| | - Haifu Fan
- Beijing National Laboratory for Condensed Matter Physics, Institute of Physics, Chinese Academy of Sciences, Beijing 100190, People’s Republic of China
| | - Wei Ding
- Beijing National Laboratory for Condensed Matter Physics, Institute of Physics, Chinese Academy of Sciences, Beijing 100190, People’s Republic of China
| |
|
35
|
Jain S, Bakolitsa C, Brenner SE, Radivojac P, Moult J, Repo S, Hoskins RA, Andreoletti G, Barsky D, Chellapan A, Chu H, Dabbiru N, Kollipara NK, Ly M, Neumann AJ, Pal LR, Odell E, Pandey G, Peters-Petrulewicz RC, Srinivasan R, Yee SF, Yeleswarapu SJ, Zuhl M, Adebali O, Patra A, Beer MA, Hosur R, Peng J, Bernard BM, Berry M, Dong S, Boyle AP, Adhikari A, Chen J, Hu Z, Wang R, Wang Y, Miller M, Wang Y, Bromberg Y, Turina P, Capriotti E, Han JJ, Ozturk K, Carter H, Babbi G, Bovo S, Di Lena P, Martelli PL, Savojardo C, Casadio R, Cline MS, De Baets G, Bonache S, Díez O, Gutiérrez-Enríquez S, Fernández A, Montalban G, Ootes L, Özkan S, Padilla N, Riera C, De la Cruz X, Diekhans M, Huwe PJ, Wei Q, Xu Q, Dunbrack RL, Gotea V, Elnitski L, Margolin G, Fariselli P, Kulakovskiy IV, Makeev VJ, Penzar DD, Vorontsov IE, Favorov AV, Forman JR, Hasenahuer M, Fornasari MS, Parisi G, Avsec Z, Çelik MH, Nguyen TYD, Gagneur J, Shi FY, Edwards MD, Guo Y, Tian K, Zeng H, Gifford DK, Göke J, Zaucha J, Gough J, Ritchie GRS, Frankish A, Mudge JM, Harrow J, Young EL, Yu Y, Huff CD, Murakami K, Nagai Y, Imanishi T, Mungall CJ, Jacobsen JOB, Kim D, Jeong CS, Jones DT, Li MJ, Guthrie VB, Bhattacharya R, Chen YC, Douville C, Fan J, Kim D, Masica D, Niknafs N, Sengupta S, Tokheim C, Turner TN, Yeo HTG, Karchin R, Shin S, Welch R, Keles S, Li Y, Kellis M, Corbi-Verge C, Strokach AV, Kim PM, Klein TE, Mohan R, Sinnott-Armstrong NA, Wainberg M, Kundaje A, Gonzaludo N, Mak ACY, Chhibber A, Lam HYK, Dahary D, Fishilevich S, Lancet D, Lee I, Bachman B, Katsonis P, Lua RC, Wilson SJ, Lichtarge O, Bhat RR, Sundaram L, Viswanath V, Bellazzi R, Nicora G, Rizzo E, Limongelli I, Mezlini AM, Chang R, Kim S, Lai C, O’Connor R, Topper S, van den Akker J, Zhou AY, Zimmer AD, Mishne G, Bergquist TR, Breese MR, Guerrero RF, Jiang Y, Kiga N, Li B, Mort M, Pagel KA, Pejaver V, Stamboulian MH, Thusberg J, Mooney SD, Teerakulkittipong N, Cao C, Kundu K, Yin Y, Yu CH, Kleyman M, Lin CF, Stackpole M, Mount SM, Eraslan 
G, Mueller NS, Naito T, Rao AR, Azaria JR, Brodie A, Ofran Y, Garg A, Pal D, Hawkins-Hooker A, Kenlay H, Reid J, Mucaki EJ, Rogan PK, Schwarz JM, Searls DB, Lee GR, Seok C, Krämer A, Shah S, Huang CV, Kirsch JF, Shatsky M, Cao Y, Chen H, Karimi M, Moronfoye O, Sun Y, Shen Y, Shigeta R, Ford CT, Nodzak C, Uppal A, Shi X, Joseph T, Kotte S, Rana S, Rao A, Saipradeep VG, Sivadasan N, Sunderam U, Stanke M, Su A, Adzhubey I, Jordan DM, Sunyaev S, Rousseau F, Schymkowitz J, Van Durme J, Tavtigian SV, Carraro M, Giollo M, Tosatto SCE, Adato O, Carmel L, Cohen NE, Fenesh T, Holtzer T, Juven-Gershon T, Unger R, Niroula A, Olatubosun A, Väliaho J, Yang Y, Vihinen M, Wahl ME, Chang B, Chong KC, Hu I, Sun R, Wu WKK, Xia X, Zee BC, Wang MH, Wang M, Wu C, Lu Y, Chen K, Yang Y, Yates CM, Kreimer A, Yan Z, Yosef N, Zhao H, Wei Z, Yao Z, Zhou F, Folkman L, Zhou Y, Daneshjou R, Altman RB, Inoue F, Ahituv N, Arkin AP, Lovisa F, Bonvini P, Bowdin S, Gianni S, Mantuano E, Minicozzi V, Novak L, Pasquo A, Pastore A, Petrosino M, Puglisi R, Toto A, Veneziano L, Chiaraluce R, Ball MP, Bobe JR, Church GM, Consalvi V, Cooper DN, Buckley BA, Sheridan MB, Cutting GR, Scaini MC, Cygan KJ, Fredericks AM, Glidden DT, Neil C, Rhine CL, Fairbrother WG, Alontaga AY, Fenton AW, Matreyek KA, Starita LM, Fowler DM, Löscher BS, Franke A, Adamson SI, Graveley BR, Gray JW, Malloy MJ, Kane JP, Kousi M, Katsanis N, Schubach M, Kircher M, Mak ACY, Tang PLF, Kwok PY, Lathrop RH, Clark WT, Yu GK, LeBowitz JH, Benedicenti F, Bettella E, Bigoni S, Cesca F, Mammi I, Marino-Buslje C, Milani D, Peron A, Polli R, Sartori S, Stanzial F, Toldo I, Turolla L, Aspromonte MC, Bellini M, Leonardi E, Liu X, Marshall C, McCombie WR, Elefanti L, Menin C, Meyn MS, Murgia A, Nadeau KCY, Neuhausen SL, Nussbaum RL, Pirooznia M, Potash JB, Dimster-Denk DF, Rine JD, Sanford JR, Snyder M, Cote AG, Sun S, Verby MW, Weile J, Roth FP, Tewhey R, Sabeti PC, Campagna J, Refaat MM, Wojciak J, Grubb S, Schmitt N, Shendure J, Spurdle AB, 
Stavropoulos DJ, Walton NA, Zandi PP, Ziv E, Burke W, Chen F, Carr LR, Martinez S, Paik J, Harris-Wai J, Yarborough M, Fullerton SM, Koenig BA, McInnes G, Shigaki D, Chandonia JM, Furutsuki M, Kasak L, Yu C, Chen R, Friedberg I, Getz GA, Cong Q, Kinch LN, Zhang J, Grishin NV, Voskanian A, Kann MG, Tran E, Ioannidis NM, Hunter JM, Udani R, Cai B, Morgan AA, Sokolov A, Stuart JM, Minervini G, Monzon AM, Batzoglou S, Butte AJ, Greenblatt MS, Hart RK, Hernandez R, Hubbard TJP, Kahn S, O’Donnell-Luria A, Ng PC, Shon J, Veltman J, Zook JM. CAGI, the Critical Assessment of Genome Interpretation, establishes progress and prospects for computational genetic variant interpretation methods. Genome Biol 2024; 25:53. [PMID: 38389099 PMCID: PMC10882881 DOI: 10.1186/s13059-023-03113-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2023] [Accepted: 11/17/2023] [Indexed: 02/24/2024] Open
Abstract
BACKGROUND The Critical Assessment of Genome Interpretation (CAGI) aims to advance the state-of-the-art for computational prediction of genetic variant impact, particularly where relevant to disease. The five complete editions of the CAGI community experiment comprised 50 challenges, in which participants made blind predictions of phenotypes from genetic data, and these were evaluated by independent assessors. RESULTS Performance was particularly strong for clinical pathogenic variants, including some difficult-to-diagnose cases, and extends to interpretation of cancer-related variants. Missense variant interpretation methods were able to estimate biochemical effects with increasing accuracy. Assessment of methods for regulatory variants and complex trait disease risk was less definitive and indicates performance potentially suitable for auxiliary use in the clinic. CONCLUSIONS Results show that while current methods are imperfect, they have major utility for research and clinical applications. Emerging methods and increasingly large, robust datasets for training and assessment promise further progress ahead.
|
36
|
Corum MR, Venkannagari H, Hryc CF, Baker ML. Predictive modeling and cryo-EM: A synergistic approach to modeling macromolecular structure. Biophys J 2024; 123:435-450. [PMID: 38268190 PMCID: PMC10912932 DOI: 10.1016/j.bpj.2024.01.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2023] [Revised: 01/09/2024] [Accepted: 01/18/2024] [Indexed: 01/26/2024] Open
Abstract
Over the last 15 years, structural biology has seen unprecedented development and improvement in two areas: electron cryo-microscopy (cryo-EM) and predictive modeling. Once relegated to low resolutions, single-particle cryo-EM is now capable of achieving near-atomic resolutions of a wide variety of macromolecular complexes. Ushered in by AlphaFold, machine learning has powered the current generation of predictive modeling tools, which can accurately and reliably predict models for proteins and some complexes directly from the sequence alone. Although they offer new opportunities individually, there is an inherent synergy between these techniques, allowing for the construction of large, complex macromolecular models. Here, we give a brief overview of these approaches in addition to illustrating works that combine these techniques for model building. These examples provide insight into model building, assessment, and limitations when integrating predictive modeling with cryo-EM density maps. Together, these approaches offer the potential to greatly accelerate the generation of macromolecular structural insights, particularly when coupled with experimental data.
Affiliation(s)
- Michael R Corum
- Department of Biochemistry and Molecular Biology, McGovern Medical School at the University of Texas Health Science Center, Houston, Texas
| | - Harikanth Venkannagari
- Department of Biochemistry and Molecular Biology, McGovern Medical School at the University of Texas Health Science Center, Houston, Texas
| | - Corey F Hryc
- Department of Biochemistry and Molecular Biology, McGovern Medical School at the University of Texas Health Science Center, Houston, Texas
| | - Matthew L Baker
- Department of Biochemistry and Molecular Biology, McGovern Medical School at the University of Texas Health Science Center, Houston, Texas.
| |
|
37
|
Moreland RT, Zhang S, Barreira SN, Ryan JF, Baxevanis AD. An AI-generated proteome-scale dataset of predicted protein structures for the ctenophore Mnemiopsis leidyi. Proteomics 2024:e2300397. [PMID: 38329168 DOI: 10.1002/pmic.202300397] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2023] [Revised: 01/11/2024] [Accepted: 01/12/2024] [Indexed: 02/09/2024]
Abstract
This Dataset Brief describes the computational prediction of protein structures for the ctenophore Mnemiopsis leidyi. Here, we report the proteome-scale generation of 15,333 protein structure predictions using AlphaFold, as well as an updated implementation of publicly available search, manipulation, and visualization tools for these protein structure predictions through the Mnemiopsis Genome Project Portal (https://research.nhgri.nih.gov/mnemiopsis). The utility of these predictions is demonstrated by highlighting comparisons to experimentally determined structures for the light-sensitive protein mnemiopsin 1 and the ionotropic glutamate receptor (iGluR). The application of these novel protein structure prediction methods will serve to further position non-bilaterian species such as Mnemiopsis as powerful model systems for the study of early animal evolution and human health.
Affiliation(s)
- R Travis Moreland
- Center for Genomics and Data Science Research, Division of Intramural Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
| | - Suiyuan Zhang
- Center for Genomics and Data Science Research, Division of Intramural Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
| | - Sofia N Barreira
- Center for Genomics and Data Science Research, Division of Intramural Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
| | - Joseph F Ryan
- Whitney Laboratory for Marine Bioscience, University of Florida, St. Augustine, Florida, USA
| | - Andreas D Baxevanis
- Center for Genomics and Data Science Research, Division of Intramural Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
| |
|
38
|
Sulea T, Kumar S, Kuroda D. Editorial: Progress and challenges in computational structure-based design and development of biologic drugs. Front Mol Biosci 2024; 11:1360267. [PMID: 38389897 PMCID: PMC10883042 DOI: 10.3389/fmolb.2024.1360267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Accepted: 02/01/2024] [Indexed: 02/24/2024] Open
Affiliation(s)
- Traian Sulea
- Human Health Therapeutics Research Centre, National Research Council Canada, Montreal, QC, Canada
| | - Sandeep Kumar
- Computational Protein Design and Modeling, Computational Science, Moderna Therapeutics, Cambridge, MA, United States
| | - Daisuke Kuroda
- Research Center of Drug and Vaccine Development, National Institute of Infectious Diseases, Tokyo, Japan
| |
|
39
|
Zheng W, Wuyun Q, Li Y, Zhang C, Freddolino PL, Zhang Y. Improving deep learning protein monomer and complex structure prediction using DeepMSA2 with huge metagenomics data. Nat Methods 2024; 21:279-289. [PMID: 38167654 PMCID: PMC10864179 DOI: 10.1038/s41592-023-02130-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2023] [Accepted: 11/13/2023] [Indexed: 01/05/2024]
Abstract
Leveraging iterative alignment search through genomic and metagenome sequence databases, we report the DeepMSA2 pipeline for uniform protein single- and multichain multiple-sequence alignment (MSA) construction. Large-scale benchmarks show that DeepMSA2 MSAs can remarkably increase the accuracy of protein tertiary and quaternary structure predictions compared with current state-of-the-art methods. An integrated pipeline with DeepMSA2 participated in the most recent CASP15 experiment and created complex structural models with considerably higher quality than the AlphaFold2-Multimer server (v.2.2.0). Detailed data analyses show that the major advantage of DeepMSA2 lies in its balanced alignment search and effective model selection, and in the power of integrating huge metagenomics databases. These results demonstrate a new avenue to improve deep learning protein structure prediction through advanced MSA construction and provide additional evidence that optimization of input information to deep learning-based structure prediction methods must be considered with as much care as the design of the predictor itself.
Affiliation(s)
- Wei Zheng
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Qiqige Wuyun
- Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
| | - Yang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Cancer Science Institute of Singapore, National University of Singapore, Singapore, Singapore
| | - Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - P Lydia Freddolino
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA.
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
- Cancer Science Institute of Singapore, National University of Singapore, Singapore, Singapore.
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA.
- Department of Computer Science, School of Computing, National University of Singapore, Singapore, Singapore.
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore.
| |
|
40
|
Du Z, Peng Z, Yang J. RNA threading with secondary structure and sequence profile. Bioinformatics 2024; 40:btae080. [PMID: 38341662 PMCID: PMC10893584 DOI: 10.1093/bioinformatics/btae080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2022] [Revised: 01/05/2024] [Accepted: 02/09/2024] [Indexed: 02/12/2024] Open
Abstract
MOTIVATION RNA threading aims to identify remote homologies for template-based modeling of RNA 3D structure. Existing RNA alignment methods primarily rely on secondary structure alignment. They are often time- and memory-consuming, limiting large-scale applications. In addition, the accuracy is far from satisfactory. RESULTS Using RNA secondary structure and sequence profile, we developed a novel RNA threading algorithm, named RNAthreader. To enhance the alignment process and minimize memory usage, a novel approach has been introduced to simplify RNA secondary structures into compact diagrams. RNAthreader employs a two-step methodology. Initially, integer programming and dynamic programming are combined to create an initial alignment for the simplified diagram. Subsequently, the final alignment is obtained using dynamic programming, taking into account the initial alignment derived from the previous step. The benchmark test on 80 RNAs illustrates that RNAthreader generates more accurate alignments than other methods, especially for RNAs with pseudoknots. Another benchmark, involving 30 RNAs from the RNA-Puzzles experiments, shows that the models constructed using RNAthreader templates have a lower average RMSD than those created by alternative methods. Remarkably, RNAthreader takes less than two hours to complete alignments with ∼5000 RNAs, which is 3-40 times faster than other methods. These compelling results suggest that RNAthreader is a promising algorithm for RNA template detection. AVAILABILITY AND IMPLEMENTATION https://yanglab.qd.sdu.edu.cn/RNAthreader.
Affiliation(s)
- Zongyang Du
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
| | - Zhenling Peng
- MOE Frontiers Science Center for Nonlinear Expectations, Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
| | - Jianyi Yang
- MOE Frontiers Science Center for Nonlinear Expectations, Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
| |
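The RNAthreader abstract above says the final alignment is obtained by dynamic programming. As a rough illustration of that general technique only, the sketch below computes a plain Needleman-Wunsch global alignment score between two sequences. It is not RNAthreader's algorithm: the function name and the match, mismatch and gap values are assumptions chosen for the example, and it ignores secondary structure, sequence profiles and the integer-programming step entirely.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch global alignment score (hypothetical scoring values)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GCAU", "GAU"))
```

Keeping only two rows of the table holds memory linear in the length of the second sequence, which is enough when only the score is needed rather than the traceback.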
|
41
|
Morehead A, Cheng J. Geometry-complete perceptron networks for 3D molecular graphs. Bioinformatics 2024; 40:btae087. [PMID: 38373819 PMCID: PMC10904142 DOI: 10.1093/bioinformatics/btae087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2023] [Revised: 12/30/2023] [Accepted: 02/16/2024] [Indexed: 02/21/2024] Open
Abstract
MOTIVATION The field of geometric deep learning has recently had a profound impact on several scientific domains such as protein structure prediction and design, leading to methodological advancements within and outside of the realm of traditional machine learning. Within this spirit, in this work, we introduce GCPNet, a new chirality-aware SE(3)-equivariant graph neural network designed for representation learning of 3D biomolecular graphs. We show that GCPNet, unlike previous representation learning methods for 3D biomolecules, is widely applicable to a variety of invariant or equivariant node-level, edge-level, and graph-level tasks on biomolecular structures while being able to (1) learn important chiral properties of 3D molecules and (2) detect external force fields. RESULTS Across four distinct molecular-geometric tasks, we demonstrate that GCPNet's predictions (1) for protein-ligand binding affinity achieve a statistically significant correlation of 0.608, more than 5% greater than current state-of-the-art methods; (2) for protein structure ranking achieve statistically significant target-local and dataset-global correlations of 0.616 and 0.871, respectively; (3) for Newtonian many-body systems modeling achieve a task-averaged mean squared error less than 0.01, more than 15% better than current methods; and (4) for molecular chirality recognition achieve a state-of-the-art prediction accuracy of 98.7%, better than any other machine learning method to date. AVAILABILITY AND IMPLEMENTATION The source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/GCPNet.
Affiliation(s)
- Alex Morehead
- Electrical Engineering & Computer Science, University of Missouri-Columbia, Columbia, MO 65211, United States
| | - Jianlin Cheng
- Electrical Engineering & Computer Science, University of Missouri-Columbia, Columbia, MO 65211, United States
| |
|
42
|
Ieremie I, Ewing RM, Niranjan M. Protein language models meet reduced amino acid alphabets. Bioinformatics 2024; 40:btae061. [PMID: 38310333 PMCID: PMC10872054 DOI: 10.1093/bioinformatics/btae061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2023] [Revised: 12/14/2023] [Accepted: 01/30/2024] [Indexed: 02/05/2024] Open
Abstract
MOTIVATION Protein language models (PLMs), which borrowed ideas for modelling and inference from natural language processing, have demonstrated the ability to extract meaningful representations in an unsupervised way. This led to significant performance improvement in several downstream tasks. Clustering amino acids based on their physical-chemical properties to achieve reduced alphabets has been of interest in past research, but their application to PLMs or folding models is unexplored. RESULTS Here, we investigate the efficacy of PLMs trained on reduced amino acid alphabets in capturing evolutionary information, and we explore how the loss of protein sequence information impacts learned representations and downstream task performance. Our empirical work shows that PLMs trained on the full alphabet and a large number of sequences capture fine details that are lost in alphabet reduction methods. We further show the ability of a structure prediction model (ESMFold) to fold CASP14 protein sequences translated using a reduced alphabet. For 10 proteins out of the 50 targets, reduced alphabets improve structural predictions with LDDT-Cα differences of up to 19%. AVAILABILITY AND IMPLEMENTATION Trained models and code are available at github.com/Ieremie/reduced-alph-PLM.
Affiliation(s)
- Ioan Ieremie
- Vision, Learning & Control Group, University of Southampton, Southampton SO17 1BJ, United Kingdom
| | - Rob M Ewing
- Biological Sciences, University of Southampton, Southampton SO17 1BJ, United Kingdom
| | - Mahesan Niranjan
- Vision, Learning & Control Group, University of Southampton, Southampton SO17 1BJ, United Kingdom
| |
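To make the idea of a reduced amino acid alphabet in the Ieremie et al. abstract above concrete, the sketch below translates a protein sequence by replacing each residue with a single representative letter for its group. The six-group clustering, the `GROUPS` table and the `reduce_sequence` function are hypothetical and chosen only for illustration; they are not one of the alphabets evaluated in the paper.

```python
# Hypothetical clustering of the 20 standard amino acids into six coarse
# physico-chemical groups; the key is the letter that represents each group.
GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # conformationally special
}

# Invert the table: residue -> group representative.
RESIDUE_TO_GROUP = {aa: rep for rep, members in GROUPS.items() for aa in members}


def reduce_sequence(seq: str, unknown: str = "X") -> str:
    """Translate a sequence into the reduced alphabet; unmapped symbols become `unknown`."""
    return "".join(RESIDUE_TO_GROUP.get(aa, unknown) for aa in seq.upper())


if __name__ == "__main__":
    print(reduce_sequence("MKTAYIAKQR"))  # prints AKSAFAAKSK
```

Because the mapping is many-to-one, the translation discards information: distinct sequences can collapse to the same reduced string, which is the kind of information loss the abstract discusses.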
|
43
|
Stein RA, Mchaourab HS. Rosetta Energy Analysis of AlphaFold2 models: Point Mutations and Conformational Ensembles. bioRxiv 2024:2023.09.05.556364. [PMID: 37732281 PMCID: PMC10508732 DOI: 10.1101/2023.09.05.556364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/22/2023]
Abstract
There has been an explosive growth in the applications of AlphaFold2, and other structure prediction platforms, to accurately predict protein structures from a multiple sequence alignment (MSA) for downstream structural analysis. However, two outstanding questions persist in the field regarding the robustness of AlphaFold2 predictions of the consequences of point mutations and the completeness of its prediction of protein conformational ensembles. We combined our previously developed method SPEACH_AF with model relaxation and energetic analysis with Rosetta to address these questions. SPEACH_AF introduces residue substitutions across the MSA and not just within the input sequence. With respect to conformational ensembles, we combined SPEACH_AF and a new MSA subsampling method, AF_cluster, and for a benchmarked set of proteins, we found that the energetics of the conformational ensembles generated by AlphaFold2 correspond to those of experimental structures and those explored by standard molecular dynamics methods. With respect to point mutations, we compared the structural and energetic consequences of having the mutation(s) in the input sequence versus in the whole MSA (SPEACH_AF). Both methods yielded models different from the wild-type sequence, with more robust changes when the mutation(s) were in the whole MSA. While our findings demonstrate the robustness of AlphaFold2 in analyzing point mutations and exploring conformational ensembles, they highlight the need for multiparameter structural and energetic analyses of these models to generate experimentally testable hypotheses.
Affiliation(s)
- Richard A Stein
- Department of Molecular Physiology and Biophysics and Center for Applied AI in Protein Dynamics Vanderbilt University
| | - Hassane S Mchaourab
- Department of Molecular Physiology and Biophysics and Center for Applied AI in Protein Dynamics Vanderbilt University
| |
|
44
|
Beton JG, Mulvaney T, Cragnolini T, Topf M. Cryo-EM structure and B-factor refinement with ensemble representation. Nat Commun 2024; 15:444. [PMID: 38200043 PMCID: PMC10781738 DOI: 10.1038/s41467-023-44593-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2022] [Accepted: 12/20/2023] [Indexed: 01/12/2024] Open
Abstract
Cryo-EM experiments produce images of macromolecular assemblies that are combined to produce three-dimensional density maps. Typically, atomic models of the constituent molecules are fitted into these maps, followed by a density-guided refinement. We introduce TEMPy-ReFF, a method for atomic structure refinement in cryo-EM density maps. Our method represents atomic positions as components of a Gaussian mixture model, utilising their variances as B-factors, which are used to derive an ensemble description. Extensively tested on a substantial dataset of 229 cryo-EM maps from EMDB ranging in resolution from 2.1-4.9 Å with corresponding PDB and CERES atomic models, our results demonstrate that TEMPy-ReFF ensembles provide a superior representation of cryo-EM maps. On a single-model basis, it performs similarly to the CERES re-refinement protocol, although there are cases where it provides a better fit to the map. Furthermore, our method enables the creation of composite maps free of boundary artefacts. TEMPy-ReFF is useful for better interpretation of flexible structures, such as those involving RNA, DNA or ligands.
Affiliation(s)
- Joseph G Beton
  - Leibniz Institute of Virology (LIV) and Universitätsklinikum Hamburg Eppendorf (UKE), Centre for Structural Systems Biology (CSSB), 22607, Hamburg, Germany
- Thomas Mulvaney
  - Leibniz Institute of Virology (LIV) and Universitätsklinikum Hamburg Eppendorf (UKE), Centre for Structural Systems Biology (CSSB), 22607, Hamburg, Germany
- Tristan Cragnolini
  - Leibniz Institute of Virology (LIV) and Universitätsklinikum Hamburg Eppendorf (UKE), Centre for Structural Systems Biology (CSSB), 22607, Hamburg, Germany
  - Institute of Structural and Molecular Biology, Birkbeck, University of London, London, UK
- Maya Topf
  - Leibniz Institute of Virology (LIV) and Universitätsklinikum Hamburg Eppendorf (UKE), Centre for Structural Systems Biology (CSSB), 22607, Hamburg, Germany
|
45
|
Varadi M, Bertoni D, Magana P, Paramval U, Pidruchna I, Radhakrishnan M, Tsenkov M, Nair S, Mirdita M, Yeo J, Kovalevskiy O, Tunyasuvunakool K, Laydon A, Žídek A, Tomlinson H, Hariharan D, Abrahamson J, Green T, Jumper J, Birney E, Steinegger M, Hassabis D, Velankar S. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Res 2024; 52:D368-D375. [PMID: 37933859 PMCID: PMC10767828 DOI: 10.1093/nar/gkad1011] [Citation(s) in RCA: 26] [Impact Index Per Article: 26.0] [Received: 09/29/2023] [Revised: 10/13/2023] [Accepted: 10/18/2023] [Indexed: 11/08/2023] Open
Abstract
The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) has significantly impacted structural biology by amassing over 214 million predicted protein structures, expanding from the initial 300k structures released in 2021. Enabled by the groundbreaking AlphaFold2 artificial intelligence (AI) system, the predictions archived in AlphaFold DB have been integrated into primary data resources such as PDB, UniProt, Ensembl, InterPro and MobiDB. Our manuscript details subsequent enhancements in data archiving, covering successive releases encompassing model organisms, global health proteomes, Swiss-Prot integration, and a host of curated protein datasets. We detail the data access mechanisms of AlphaFold DB, from direct file access via FTP to advanced queries using Google Cloud Public Datasets and the programmatic access endpoints of the database. We also discuss the improvements and services added since its initial release, including enhancements to the Predicted Aligned Error viewer, customisation options for the 3D viewer, and improvements in the search engine of AlphaFold DB.
Affiliation(s)
- Mihaly Varadi
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Damian Bertoni
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Paulyna Magana
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Urmila Paramval
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Ivanna Pidruchna
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Maxim Tsenkov
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Sreenath Nair
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Milot Mirdita
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
- Jingi Yeo
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
- Ewan Birney
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Martin Steinegger
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
- Sameer Velankar
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
|
46
|
Zhang Y, Wang X, Zhang Z, Huang Y, Kihara D. Assessment of Protein-Protein Docking Models Using Deep Learning. Methods Mol Biol 2024; 2780:149-162. [PMID: 38987469 DOI: 10.1007/978-1-0716-3985-6_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/12/2024]
Abstract
Protein-protein interactions are involved in almost all processes in a living cell and determine the biological functions of proteins. To obtain a mechanistic understanding of protein-protein interactions, the tertiary structures of protein complexes have been determined by biophysical experimental methods, such as X-ray crystallography and cryogenic electron microscopy. However, as experimental methods are costly in resources, many computational methods have been developed that model protein complex structures. One of the difficulties in computational protein complex modeling (protein docking) is selecting the most accurate models from the many that a docking method usually generates. This article reviews advances in protein docking model assessment methods, focusing on recent developments that apply deep learning with several network architectures.
Affiliation(s)
- Yuanyuan Zhang
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Xiao Wang
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Zicong Zhang
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Yunhan Huang
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Daisuke Kihara
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
|
47
|
Terwilliger TC, Liebschner D, Croll TI, Williams CJ, McCoy AJ, Poon BK, Afonine PV, Oeffner RD, Richardson JS, Read RJ, Adams PD. AlphaFold predictions are valuable hypotheses and accelerate but do not replace experimental structure determination. Nat Methods 2024; 21:110-116. [PMID: 38036854 PMCID: PMC10776388 DOI: 10.1038/s41592-023-02087-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 01/30/2023] [Accepted: 10/11/2023] [Indexed: 12/02/2023]
Abstract
Artificial intelligence-based protein structure prediction methods such as AlphaFold have revolutionized structural biology. The accuracies of these predictions vary, however, and they do not take into account ligands, covalent modifications or other environmental factors. Here, we evaluate how well AlphaFold predictions can be expected to describe the structure of a protein by comparing predictions directly with experimental crystallographic maps. In many cases, AlphaFold predictions matched experimental maps remarkably closely. In other cases, even very high-confidence predictions differed from experimental maps on a global scale through distortion and domain orientation, and on a local scale in backbone and side-chain conformation. We suggest considering AlphaFold predictions as exceptionally useful hypotheses. We further suggest that it is important to consider the confidence in prediction when interpreting AlphaFold predictions and to carry out experimental structure determination to verify structural details, particularly those that involve interactions not included in the prediction.
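The advice to consider prediction confidence can be acted on directly, because AlphaFold model files store the per-residue pLDDT score in the B-factor field of each ATOM record. The sketch below reads that field from PDB-format text and lists residues under a cutoff. It is an illustration only: the function names, the synthetic three-residue input, and the cutoff of 70 (a commonly used boundary for lower-confidence regions) are assumptions, not part of the cited study.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z, b):
    """Build a fixed-column PDB ATOM record (helper for the demo input only)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:>8.3f}{y:>8.3f}{z:>8.3f}{1.0:>6.2f}{b:>6.2f}"
    )


def plddt_per_residue(pdb_text):
    """Return {residue number: pLDDT}, read from the B-factor field of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Atom name is in columns 13-16, residue number in 23-26,
        # and the B-factor (pLDDT in AlphaFold models) in 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence_residues(scores, cutoff=70.0):
    """Residue numbers whose pLDDT falls below the (assumed) cutoff."""
    return sorted(res for res, score in scores.items() if score < cutoff)


# Synthetic three-residue example; the coordinates and scores are made up.
pdb_text = "\n".join([
    make_atom_line(1, " CA", "MET", "A", 1, 1.0, 2.0, 3.0, 95.2),
    make_atom_line(2, " CA", "LYS", "A", 2, 4.0, 5.0, 6.0, 62.1),
    make_atom_line(3, " CA", "THR", "A", 3, 7.0, 8.0, 9.0, 41.0),
])

scores = plddt_per_residue(pdb_text)
print(scores)
print(low_confidence_residues(scores))
```

Residues flagged this way are natural candidates for the experimental verification the authors recommend, although the abstract notes that even very high-confidence regions can differ from experimental maps.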
Affiliation(s)
- Thomas C Terwilliger
  - New Mexico Consortium, Los Alamos, NM, USA
  - Los Alamos National Laboratory, Los Alamos, NM, USA
- Dorothee Liebschner
  - Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Tristan I Croll
  - Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK
- Airlie J McCoy
  - Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK
- Billy K Poon
  - Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Pavel V Afonine
  - Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Robert D Oeffner
  - Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK
- Randy J Read
  - Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK
- Paul D Adams
  - Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - Department of Bioengineering, University of California, Berkeley, CA, USA
|
48
|
Radjasandirane R, de Brevern AG. AlphaFold2 for Protein Structure Prediction: Best Practices and Critical Analyses. Methods Mol Biol 2024; 2836:235-252. [PMID: 38995544 DOI: 10.1007/978-1-0716-4007-4_13] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2024]
Abstract
AlphaFold2 (AF2) has emerged in recent years as a groundbreaking innovation that has revolutionized several scientific fields, in particular structural biology, drug design, and the elucidation of disease mechanisms. Many scientists now use AF2 on a daily basis, including non-specialist users. This chapter is aimed at the latter. Tips and tricks for getting the most out of AF2 to produce a high-quality biological model are discussed here. We show non-specialist users how to maintain a critical perspective when working with AF2 models and provide guidelines on how to properly evaluate them. After showing how to run a structure prediction with ColabFold, we list several ways to improve AF2 models by adding information that is missing from the original AF2 model. By using software such as AlphaFill to add cofactors and ligands to the models, or MODELLER to add disulfide bridges between cysteines, we guide users to build a high-quality biological model suitable for applications such as drug design, protein interaction, or molecular dynamics studies.
Affiliation(s)
- Ragousandirane Radjasandirane
  - Université Paris Cité and Université des Antilles and Université de la Réunion, BIGR, UMR_S1134, DSIMB Team, Inserm, Paris, France
- Alexandre G de Brevern
  - Université Paris Cité and Université des Antilles and Université de la Réunion, BIGR, UMR_S1134, DSIMB Team, Inserm, Paris, France
|
49
|
Novikova PV, Bhanu Busi S, Probst AJ, May P, Wilmes P. Functional prediction of proteins from the human gut archaeome. ISME COMMUNICATIONS 2024; 4:ycad014. [PMID: 38486809 PMCID: PMC10939349 DOI: 10.1093/ismeco/ycad014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 12/16/2023] [Accepted: 12/19/2023] [Indexed: 03/17/2024]
Abstract
The human gastrointestinal tract contains diverse microbial communities, including archaea. Among them, Methanobrevibacter smithii represents a highly active and clinically relevant methanogenic archaeon, being involved in gastrointestinal disorders, such as inflammatory bowel disease and obesity. Herein, we present an integrated approach using sequence and structure information to improve the annotation of M. smithii proteins using advanced protein structure prediction and annotation tools, such as AlphaFold2, trRosetta, ProFunc, and DeepFRI. Of an initial set of 873 481 archaeal proteins, we found 707 754 proteins exclusively present in the human gut. Having analysed archaeal proteins together with 87 282 994 bacterial proteins, we identified unique archaeal proteins and archaeal-bacterial homologs. We then predicted and characterized functional domains and structures of 73 unique and homologous archaeal protein clusters linked to the human gut and M. smithii. We refined annotations based on the predicted structures, extending existing sequence similarity-based annotations. We identified gut-specific archaeal proteins that may be involved in defense mechanisms, virulence, adhesion, and the degradation of toxic substances. Interestingly, we identified potential glycosyltransferases that could be associated with N-linked and O-glycosylation. Additionally, we found preliminary evidence for interdomain horizontal gene transfer between Clostridia species and M. smithii, which includes sporulation Stage V proteins AE and AD. Our study broadens the understanding of archaeal biology, particularly M. smithii, and highlights the importance of considering both sequence and structure for the prediction of protein function.
Affiliation(s)
- Polina V Novikova
  - Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette L-4362, Luxembourg
- Susheel Bhanu Busi
  - Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette L-4362, Luxembourg
  - UK Centre for Ecology and Hydrology, Wallingford, OX10 8BB, United Kingdom
- Alexander J Probst
  - Environmental Metagenomics, Department of Chemistry, Research Center One Health Ruhr of the University Alliance Ruhr, for Environmental Microbiology and Biotechnology, University Duisburg-Essen, Duisburg 47057, Germany
- Patrick May
  - Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette L-4362, Luxembourg
- Paul Wilmes
  - Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette L-4362, Luxembourg
  - Department of Life Sciences and Medicine, Faculty of Science, Technology and Medicine, University of Luxembourg, Esch-sur-Alzette L-4362, Luxembourg
|
50
|
Kotev M, Diaz Gonzalez C. Molecular Dynamics and Other HPC Simulations for Drug Discovery. Methods Mol Biol 2024; 2716:265-291. [PMID: 37702944 DOI: 10.1007/978-1-0716-3449-3_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/14/2023]
Abstract
High performance computing (HPC) is taking an increasingly important place in drug discovery. It makes possible the simulation of complex biochemical systems with high precision in a short time, thanks to the use of sophisticated algorithms. It promotes the advancement of knowledge in fields that are inaccessible or difficult to access through experimentation and it contributes to accelerating the discovery of drugs for unmet medical needs while reducing costs. Herein, we report how computational performance has evolved over the past years, and then we detail three domains where HPC is essential. Molecular dynamics (MD) is commonly used to explore the flexibility of proteins, thus generating a better understanding of different possible approaches to modulate their activity. Modeling and simulation of biopolymer complexes enables the study of protein-protein interactions (PPI) in healthy and disease states, thus helping the identification of targets of pharmacological interest. Virtual screening (VS) also benefits from HPC to predict in a short time, among millions or billions of virtual chemical compounds, the best potential ligands that will be tested in relevant assays to start a rational drug design process.
Affiliation(s)
- Martin Kotev
  - Evotec SE, Integrated Drug Discovery, Molecular Architects, Campus Curie, Toulouse, France
|