1. Harding-Larsen D, Funk J, Madsen NG, Gharabli H, Acevedo-Rocha CG, Mazurenko S, Welner DH. Protein representations: Encoding biological information for machine learning in biocatalysis. Biotechnol Adv 2024; 77:108459. [PMID: 39366493 DOI: 10.1016/j.biotechadv.2024.108459] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2024] [Revised: 09/19/2024] [Accepted: 09/29/2024] [Indexed: 10/06/2024]
Abstract
Enzymes offer a more environmentally friendly and low-impact solution to conventional chemistry, but they often require additional engineering for their application in industrial settings, an endeavour that is challenging and laborious. To address this issue, the power of machine learning can be harnessed to produce predictive models that enable the in silico study and engineering of improved enzymatic properties. Such machine learning models, however, require the conversion of the complex biological information to a numerical input, also called protein representations. These inputs demand special attention to ensure the training of accurate and precise models, and, in this review, we therefore examine the critical step of encoding protein information to numeric representations for use in machine learning. We selected the most important approaches for encoding the three distinct biological protein representations - primary sequence, 3D structure, and dynamics - to explore their requirements for employment and inductive biases. Combined representations of proteins and substrates are also introduced as emergent tools in biocatalysis. We propose the division of fixed representations, a collection of rule-based encoding strategies, and learned representations extracted from the latent spaces of large neural networks. To select the most suitable protein representation, we propose two main factors to consider. The first one is the model setup, which is influenced by the size of the training dataset and the choice of architecture. The second factor is the model objectives such as consideration about the assayed property, the difference between wild-type models and mutant predictors, and requirements for explainability. This review is aimed at serving as a source of information and guidance for properly representing enzymes in future machine learning models for biocatalysis.
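As an illustration of what a fixed, rule-based representation of primary sequence can look like, the sketch below one-hot encodes a short peptide in Python. One-hot encoding is a widely used rule-based scheme, but the abstract does not name it; the alphabet ordering, the function name `one_hot_encode`, and the example peptide `MKT` are assumptions made for illustration only.

```python
# Minimal sketch of a fixed (rule-based) sequence representation: one-hot encoding.
# Assumption: the 20 canonical amino acids, ordered alphabetically by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a len(sequence) x 20 matrix (list of lists) of 0/1 integers."""
    matrix = []
    for residue in sequence.upper():
        if residue not in AA_INDEX:
            raise ValueError(f"non-canonical residue: {residue!r}")
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1  # exactly one position is set per residue
        matrix.append(row)
    return matrix


# Hypothetical three-residue peptide used only to show the output shape.
encoded = one_hot_encode("MKT")
print(len(encoded), "residues x", len(encoded[0]), "features")
```

A representation like this carries no learned information, which is one reason it may suit small training sets; learned representations would instead be taken from the latent space of a pretrained network.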
Affiliation(s)
- David Harding-Larsen
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Jonathan Funk
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Niklas Gesmar Madsen
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Hani Gharabli
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Carlos G Acevedo-Rocha
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Stanislav Mazurenko
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
  - International Clinical Research Center, St. Anne's University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
- Ditte Hededam Welner
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
2. Faure AJ, Martí-Aranda A, Hidalgo-Carcedo C, Beltran A, Schmiedel JM, Lehner B. The genetic architecture of protein stability. Nature 2024:10.1038/s41586-024-07966-0. [PMID: 39322666 DOI: 10.1038/s41586-024-07966-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Accepted: 08/20/2024] [Indexed: 09/27/2024]
Abstract
There are more ways to synthesize a 100-amino acid (aa) protein (20^100) than there are atoms in the universe. Only a very small fraction of such a vast sequence space can ever be experimentally or computationally surveyed. Deep neural networks are increasingly being used to navigate high-dimensional sequence spaces [1]. However, these models are extremely complicated. Here, by experimentally sampling from sequence spaces larger than 10^10, we show that the genetic architecture of at least some proteins is remarkably simple, allowing accurate genetic prediction in high-dimensional sequence spaces with fully interpretable energy models. These models capture the nonlinear relationships between free energies and phenotypes but otherwise consist of additive free energy changes with a small contribution from pairwise energetic couplings. These energetic couplings are sparse and associated with structural contacts and backbone proximity. Our results indicate that protein genetics is actually both rather simple and intelligible.
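The two quantitative ideas in this abstract can be sketched in a few lines of Python: the size of the sequence space, and an energy model built from additive free energy changes plus sparse pairwise couplings passed through a nonlinear map. Everything specific here is an assumption for illustration and is not taken from the paper: the roughly 10^80 estimate for atoms in the observable universe, the two-state Boltzmann form of the nonlinearity, the function names, the mutation labels, and all energy values.

```python
import math

# Sequence space for a 100-residue protein over 20 amino acids.
n_sequences = 20 ** 100
# Assumption: ~10^80 atoms in the observable universe (a commonly quoted rough estimate).
ATOMS_IN_UNIVERSE = 10 ** 80
print(f"20^100 is roughly 10^{math.log10(n_sequences):.0f}")
print("larger than the atom estimate:", n_sequences > ATOMS_IN_UNIVERSE)

RT = 0.593  # kcal/mol near 298 K


def folding_energy(variant, dg_wt, ddg, couplings):
    """Folding free energy of a variant: wild-type term + additive terms + pairwise couplings.

    variant   : set of mutation labels present in the variant
    ddg       : dict mapping mutation label -> additive free energy change
    couplings : dict mapping frozenset({a, b}) -> pairwise coupling energy (sparse)
    """
    dg = dg_wt + sum(ddg[m] for m in variant)
    for pair, energy in couplings.items():
        if pair <= variant:  # coupling applies only if both mutations are present
            dg += energy
    return dg


def fraction_folded(dg):
    """Assumed two-state nonlinearity linking folding energy to a phenotype."""
    return 1.0 / (1.0 + math.exp(dg / RT))


# Hypothetical mutations and energies, chosen only to exercise the functions.
ddg = {"A10G": 0.5, "L25V": 1.0}
couplings = {frozenset({"A10G", "L25V"}): 0.3}
dg_double = folding_energy({"A10G", "L25V"}, dg_wt=-2.0, ddg=ddg, couplings=couplings)
print(f"double mutant: dG = {dg_double:.2f} kcal/mol, folded = {fraction_folded(dg_double):.2f}")
```

The point of the sketch is the structure: the energies combine linearly, and only the final energy-to-phenotype step is nonlinear, which is what keeps such a model interpretable.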
Affiliation(s)
- Andre J Faure
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - ALLOX, Barcelona, Spain
- Aina Martí-Aranda
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Cristina Hidalgo-Carcedo
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Antoni Beltran
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Jörn M Schmiedel
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - factorize.bio, Berlin, Germany
- Ben Lehner
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
3. Underwood M, Bidlack C, Desch KC. Venous thromboembolic disease genetics: from variants to function. J Thromb Haemost 2024; 22:2393-2403. [PMID: 38908832 DOI: 10.1016/j.jtha.2024.06.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2024] [Revised: 06/05/2024] [Accepted: 06/06/2024] [Indexed: 06/24/2024]
Abstract
Venous thromboembolic disease (VTE) is a prevalent and potentially life-threatening vascular disease, including both deep vein thrombosis and pulmonary embolism. This review will focus on recent insights into the heritable factors that influence an individual's risk for VTE. Here, we will explore not only the discovery of new genetic risk variants but also the importance of functional characterization of these variants. These genome-wide studies should lead to a better understanding of the biological role of genes inside and outside of the canonical coagulation system in thrombus formation and lead to an improved ability to predict an individual's risk of VTE. Further understanding of the molecular mechanisms altered by genetic variation in VTE risk will be accelerated by further human genome sequencing efforts and the use of functional genetic screens.
Affiliation(s)
- Mary Underwood
  - Department of Pediatrics, University of Michigan, Ann Arbor, Michigan, USA
- Christopher Bidlack
  - Cellular and Molecular Biology Program, University of Michigan, Ann Arbor, Michigan, USA
- Karl C Desch
  - Department of Pediatrics, University of Michigan, Ann Arbor, Michigan, USA
  - Cellular and Molecular Biology Program, University of Michigan, Ann Arbor, Michigan, USA
4. Badonyi M, Marsh JA. Proteome-scale prediction of molecular mechanisms underlying dominant genetic diseases. PLoS One 2024; 19:e0307312. [PMID: 39172982 PMCID: PMC11341024 DOI: 10.1371/journal.pone.0307312] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Accepted: 06/26/2024] [Indexed: 08/24/2024]
Abstract
Many dominant genetic disorders result from protein-altering mutations, acting primarily through dominant-negative (DN), gain-of-function (GOF), and loss-of-function (LOF) mechanisms. Deciphering the mechanisms by which dominant diseases exert their effects is often experimentally challenging and resource intensive, but is essential for developing appropriate therapeutic approaches. Diseases that arise via a LOF mechanism are more amenable to be treated by conventional gene therapy, whereas DN and GOF mechanisms may require gene editing or targeting by small molecules. Moreover, pathogenic missense mutations that act via DN and GOF mechanisms are more difficult to identify than those that act via LOF using nearly all currently available variant effect predictors. Here, we introduce a tripartite statistical model made up of support vector machine binary classifiers trained to predict whether human protein coding genes are likely to be associated with DN, GOF, or LOF molecular disease mechanisms. We test the utility of the predictions by examining biologically and clinically meaningful properties known to be associated with the mechanisms. Our results strongly support that the models are able to generalise on unseen data and offer insight into the functional attributes of proteins associated with different mechanisms. We hope that our predictions will serve as a springboard for researchers studying novel variants and those of uncertain clinical significance, guiding variant interpretation strategies and experimental characterisation. Predictions for the human UniProt reference proteome are available at https://osf.io/z4dcp/.
Affiliation(s)
- Mihaly Badonyi
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, United Kingdom
- Joseph A. Marsh
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, United Kingdom
5. Guclu TF, Atilgan AR, Atilgan C. Deciphering GB1's Single Mutational Landscape: Insights from MuMi Analysis. J Phys Chem B 2024; 128:7987-7996. [PMID: 39115184 DOI: 10.1021/acs.jpcb.4c04916] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/23/2024]
Abstract
Mutational changes that affect the binding of the C2 fragment of Streptococcal protein G (GB1) to the Fc domain of human IgG (IgG-Fc) have been extensively studied using deep mutational scanning (DMS), and the binding affinity of all single mutations has been measured experimentally in the literature. To investigate the underlying molecular basis, we perform in silico mutational scanning for all possible single mutations, along with 2 μs-long molecular dynamics (WT-MD) of the wild-type (WT) GB1 in both unbound and IgG-Fc bound forms. We compute the hydrogen bonds between GB1 and IgG-Fc in WT-MD to identify the dominant hydrogen bonds for binding, which we then assess in conformations produced by Mutation and Minimization (MuMi) to explain the fitness landscape of GB1 and IgG-Fc binding. Furthermore, we analyze MuMi and WT-MD to investigate the dynamics of binding, focusing on the relative solvent accessibility of residues and the probability of residues being located at the binding interface. With these analyses, we explain the interactions between GB1 and IgG-Fc and display the structural features of binding. In sum, our findings highlight the potential of MuMi as a reliable and computationally efficient tool for predicting protein fitness landscapes, offering significant advantages over traditional methods. The methodologies and results presented in this study pave the way for improved predictive accuracy in protein stability and interaction studies, which are crucial for advancements in drug design and synthetic biology.
Affiliation(s)
- Tandac F Guclu
  - Faculty of Natural Sciences and Engineering, Sabanci University, Tuzla, Istanbul 34956, Turkey
- Ali Rana Atilgan
  - Faculty of Natural Sciences and Engineering, Sabanci University, Tuzla, Istanbul 34956, Turkey
- Canan Atilgan
  - Faculty of Natural Sciences and Engineering, Sabanci University, Tuzla, Istanbul 34956, Turkey
6. David C, Arango-Franco CA, Badonyi M, Fouchet J, Rice GI, Didry-Barca B, Maisonneuve L, Seabra L, Kechiche R, Masson C, Cobat A, Abel L, Talouarn E, Béziat V, Deswarte C, Livingstone K, Paul C, Malik G, Ross A, Adam J, Walsh J, Kumar S, Bonnet D, Bodemer C, Bader-Meunier B, Marsh JA, Casanova JL, Crow YJ, Manoury B, Frémond ML, Bohlen J, Lepelley A. Gain-of-function human UNC93B1 variants cause systemic lupus erythematosus and chilblain lupus. J Exp Med 2024; 221:e20232066. [PMID: 38869500 PMCID: PMC11176256 DOI: 10.1084/jem.20232066] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/10/2023] [Revised: 03/29/2024] [Accepted: 05/15/2024] [Indexed: 06/14/2024]
Abstract
UNC93B1 is a transmembrane domain protein mediating the signaling of endosomal Toll-like receptors (TLRs). We report five families harboring rare missense substitutions (I317M, G325C, L330R, R466S, and R525P) in UNC93B1 causing systemic lupus erythematosus (SLE) or chilblain lupus (CBL) as either autosomal dominant or autosomal recessive traits. As for a D34A mutation causing murine lupus, we recorded a gain of TLR7 and, to a lesser extent, TLR8 activity with the I317M (in vitro) and G325C (in vitro and ex vivo) variants in the context of SLE. Contrastingly, in three families segregating CBL, the L330R, R466S, and R525P variants were isomorphic with respect to TLR7 activity in vitro and, for R525P, ex vivo. Rather, these variants demonstrated a gain of TLR8 activity. We observed enhanced interaction of the G325C, L330R, and R466S variants with TLR8, but not the R525P substitution, indicating different disease mechanisms. Overall, these observations suggest that UNC93B1 mutations cause monogenic SLE or CBL due to differentially enhanced TLR7 and TLR8 signaling.
Affiliation(s)
- Clémence David
  - Laboratory of Neurogenetics and Neuroinflammation, Imagine Institute, INSERM UMR1163, Paris, France
- Carlos A. Arango-Franco
  - Laboratory of Human Genetics of Infectious Diseases, INSERM UMR1163, Necker Hospital for Sick Children, Paris, France
  - Department of Microbiology and Parasitology, Group of Primary Immunodeficiencies, School of Medicine, University of Antioquia, Medellín, Colombia
- Mihaly Badonyi
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Julien Fouchet
  - Faculté de Médecine Necker, Institut Necker Enfants Malades, INSERM U1151-CNRS UMR 8253, Université Paris Cité, Paris, France
- Gillian I. Rice
  - Faculty of Biology, Medicine and Health, Division of Evolution and Genomic Sciences, School of Biological Sciences, Manchester Academic Health Science Centre, University of Manchester, Manchester, UK
- Blaise Didry-Barca
  - Laboratory of Neurogenetics and Neuroinflammation, Imagine Institute, INSERM UMR1163, Paris, France
- Lucie Maisonneuve
  - Faculté de Médecine Necker, Institut Necker Enfants Malades, INSERM U1151-CNRS UMR 8253, Université Paris Cité, Paris, France
- Luis Seabra
  - Laboratory of Neurogenetics and Neuroinflammation, Imagine Institute, INSERM UMR1163, Paris, France
- Robin Kechiche
  - Laboratory of Neurogenetics and Neuroinflammation, Imagine Institute, INSERM UMR1163, Paris, France
  - Department of Paediatric Hematology-Immunology and Rheumatology, Necker-Enfants Malades Hospital, Assistance publique–hôpitaux de Paris (AP-HP), Paris, France
- Cécile Masson
  - Bioinformatics Core Facility, Université Paris Cité-Structure Fédérative de Recherche Necker, INSERM US24/CNRS UMS3633, Paris, France
- Aurélie Cobat
  - Laboratory of Human Genetics of Infectious Diseases, INSERM UMR1163, Necker Hospital for Sick Children, Paris, France
  - St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, NY, USA
  - Imagine Institute, Université Paris Cité, Paris, France
- Laurent Abel
  - Laboratory of Human Genetics of Infectious Diseases, INSERM UMR1163, Necker Hospital for Sick Children, Paris, France
  - St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, NY, USA
  - Imagine Institute, Université Paris Cité, Paris, France
- Estelle Talouarn
  - Laboratory of Human Genetics of Infectious Diseases, INSERM UMR1163, Necker Hospital for Sick Children, Paris, France
  - Imagine Institute, Université Paris Cité, Paris, France
- Vivien Béziat
  - Laboratory of Human Genetics of Infectious Diseases, INSERM UMR1163, Necker Hospital for Sick Children, Paris, France
  - St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, NY, USA
  - Imagine Institute, Université Paris Cité, Paris, France
- Caroline Deswarte
  - Laboratory of Human Genetics of Infectious Diseases, INSERM UMR1163, Necker Hospital for Sick Children, Paris, France
  - Imagine Institute, Université Paris Cité, Paris, France
- Katie Livingstone
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Carle Paul
  - Université Toulouse Paul Sabatier, Toulouse, France
- Gulshan Malik
  - Paediatric Rheumatology, Royal Aberdeen Children’s Hospital, Aberdeen, UK
- Alison Ross
  - Paediatric Rheumatology, Royal Aberdeen Children’s Hospital, Aberdeen, UK
- Jane Adam
  - Paediatric Rheumatology, Royal Aberdeen Children’s Hospital, Aberdeen, UK
- Jo Walsh
  - Department of Paediatric Rheumatology, Royal Hospital for Children, Glasgow, UK
- Sathish Kumar
  - Department of Pediatrics, Pediatric Rheumatology, Christian Medical College, Vellore, India
- Damien Bonnet
  - Medical and Surgical Unit of Congenital and Paediatric Cardiology, Reference Centre for Complex Congenital Heart Defects—M3C, University Hospital Necker-Enfants Malades, Paris, France
  - Université Paris Cité, Paris, France
- Christine Bodemer
  - Department of Dermatology, Hospital Necker-Enfants Malades, AP-HP. Université Paris Cité, Paris, France
- Brigitte Bader-Meunier
  - Department of Paediatric Hematology-Immunology and Rheumatology, Necker-Enfants Malades Hospital, Assistance publique–hôpitaux de Paris (AP-HP), Paris, France
  - Centre for Inflammatory Rheumatism, AutoImmune Diseases and Systemic Interferonopathies in Children (RAISE), Paris, France
- Joseph A. Marsh
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Jean-Laurent Casanova
  - Laboratory of Human Genetics of Infectious Diseases, INSERM UMR1163, Necker Hospital for Sick Children, Paris, France
  - St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, NY, USA
  - Imagine Institute, Université Paris Cité, Paris, France
  - Howard Hughes Medical Institute, New York, NY, USA
  - Department of Pediatrics, Necker Hospital for Sick Children, Paris, France
- Yanick J. Crow
  - Laboratory of Neurogenetics and Neuroinflammation, Imagine Institute, INSERM UMR1163, Paris, France
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
  - Université Paris Cité, Paris, France
- Bénédicte Manoury
  - Faculté de Médecine Necker, Institut Necker Enfants Malades, INSERM U1151-CNRS UMR 8253, Université Paris Cité, Paris, France
- Marie-Louise Frémond
  - Laboratory of Neurogenetics and Neuroinflammation, Imagine Institute, INSERM UMR1163, Paris, France
  - Department of Paediatric Hematology-Immunology and Rheumatology, Necker-Enfants Malades Hospital, Assistance publique–hôpitaux de Paris (AP-HP), Paris, France
  - Centre for Inflammatory Rheumatism, AutoImmune Diseases and Systemic Interferonopathies in Children (RAISE), Paris, France
- Jonathan Bohlen
  - Laboratory of Human Genetics of Infectious Diseases, INSERM UMR1163, Necker Hospital for Sick Children, Paris, France
  - Imagine Institute, Université Paris Cité, Paris, France
- Alice Lepelley
  - Laboratory of Neurogenetics and Neuroinflammation, Imagine Institute, INSERM UMR1163, Paris, France
7. McCarthy-Leo CE, Brush GS, Pique-Regi R, Luca F, Tainsky MA, Finley RL. Comprehensive analysis of the functional impact of single nucleotide variants of human CHEK2. PLoS Genet 2024; 20:e1011375. [PMID: 39146382 PMCID: PMC11349238 DOI: 10.1371/journal.pgen.1011375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2024] [Revised: 08/27/2024] [Accepted: 07/25/2024] [Indexed: 08/17/2024]
Abstract
Loss of function mutations in the checkpoint kinase gene CHEK2 are associated with increased risk of breast and other cancers. Most of the 3,188 unique amino acid changes that can result from non-synonymous single nucleotide variants (SNVs) of CHEK2, however, have not been tested for their impact on the function of the CHEK2-encoded protein (CHK2). One successful approach to testing the function of variants has been to test for their ability to complement mutations in the yeast ortholog of CHEK2, RAD53. This approach has been used to provide functional information on over 100 CHEK2 SNVs and the results align with functional assays in human cells and known pathogenicity. Here we tested all but two of the 4,887 possible SNVs in the CHEK2 open reading frame for their ability to complement RAD53 mutants using a high throughput technique of deep mutational scanning (DMS). Among the non-synonymous changes, 770 were damaging to protein function while 2,417 were tolerated. The results correlate well with previous structure and function data and provide a first or additional functional assay for all the variants of uncertain significance identified in clinical databases. Combined, this approach can be used to help predict the pathogenicity of CHEK2 variants of uncertain significance that are found in susceptibility screening and could be applied to other cancer risk genes.
Affiliation(s)
- Claire E. McCarthy-Leo
  - Center for Molecular Medicine and Genetics, Wayne State University School of Medicine, Detroit, Michigan, United States of America
- George S. Brush
  - Department of Oncology, Molecular Therapeutics Program, Barbara Ann Karmanos Cancer Institute, Wayne State University School of Medicine, Detroit, Michigan, United States of America
- Roger Pique-Regi
  - Center for Molecular Medicine and Genetics, Wayne State University School of Medicine, Detroit, Michigan, United States of America
  - Department of Obstetrics and Gynecology, Wayne State University School of Medicine, Detroit, Michigan, United States of America
- Francesca Luca
  - Center for Molecular Medicine and Genetics, Wayne State University School of Medicine, Detroit, Michigan, United States of America
  - Department of Obstetrics and Gynecology, Wayne State University School of Medicine, Detroit, Michigan, United States of America
- Michael A. Tainsky
  - Center for Molecular Medicine and Genetics, Wayne State University School of Medicine, Detroit, Michigan, United States of America
  - Department of Oncology, Molecular Therapeutics Program, Barbara Ann Karmanos Cancer Institute, Wayne State University School of Medicine, Detroit, Michigan, United States of America
- Russell L. Finley
  - Center for Molecular Medicine and Genetics, Wayne State University School of Medicine, Detroit, Michigan, United States of America
8. Correa Marrero M, Jänes J, Baptista D, Beltrao P. Integrating Large-Scale Protein Structure Prediction into Human Genetics Research. Annu Rev Genomics Hum Genet 2024; 25:123-140. [PMID: 38621234 DOI: 10.1146/annurev-genom-120622-020615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/17/2024]
Abstract
The last five years have seen impressive progress in deep learning models applied to protein research. Most notably, sequence-based structure predictions have seen transformative gains in the form of AlphaFold2 and related approaches. Millions of missense protein variants in the human population lack annotations, and these computational methods are a valuable means to prioritize variants for further analysis. Here, we review the recent progress in deep learning models applied to the prediction of protein structure and protein variants, with particular emphasis on their implications for human genetics and health. Improved prediction of protein structures facilitates annotations of the impact of variants on protein stability, protein-protein interaction interfaces, and small-molecule binding pockets. Moreover, it contributes to the study of host-pathogen interactions and the characterization of protein function. As genome sequencing in large cohorts becomes increasingly prevalent, we believe that better integration of state-of-the-art protein informatics technologies into human genetics research is of paramount importance.
Affiliation(s)
- Miguel Correa Marrero
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - Institute of Molecular Systems Biology, Department of Biology, ETH Zurich, Zurich, Switzerland
- Jürgen Jänes
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - Institute of Molecular Systems Biology, Department of Biology, ETH Zurich, Zurich, Switzerland
- Pedro Beltrao
  - Instituto Gulbenkian de Ciência, Oeiras, Portugal
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - Institute of Molecular Systems Biology, Department of Biology, ETH Zurich, Zurich, Switzerland
9. Ozkan S, Padilla N, de la Cruz X. QAFI: a novel method for quantitative estimation of missense variant impact using protein-specific predictors and ensemble learning. Hum Genet 2024:10.1007/s00439-024-02692-z. [PMID: 39048855 DOI: 10.1007/s00439-024-02692-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Accepted: 07/14/2024] [Indexed: 07/27/2024]
Abstract
Next-generation sequencing (NGS) has revolutionized genetic diagnostics, yet its application in precision medicine remains incomplete, despite significant advances in computational tools for variant annotation. Many variants remain unannotated, and existing tools often fail to accurately predict the range of impacts that variants have on protein function. This limitation restricts their utility in relevant applications such as predicting disease severity and onset age. In response to these challenges, a new generation of computational models is emerging, aimed at producing quantitative predictions of genetic variant impacts. However, the field is still in its early stages, and several issues need to be addressed, including improved performance and better interpretability. This study introduces QAFI, a novel methodology that integrates protein-specific regression models within an ensemble learning framework, utilizing conservation-based and structure-related features derived from AlphaFold models. Our findings indicate that QAFI significantly enhances the accuracy of quantitative predictions across various proteins. The approach has been rigorously validated through its application in the CAGI6 contest, focusing on ARSA protein variants, and further tested on a comprehensive set of clinically labeled variants, demonstrating its generalizability and robust predictive power. The straightforward nature of our models may also contribute to better interpretability of the results.
Affiliation(s)
- Selen Ozkan
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Natàlia Padilla
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Xavier de la Cruz
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
10. McDonnell AF, Plech M, Livesey BJ, Gerasimavicius L, Owen LJ, Hall HN, FitzPatrick DR, Marsh JA, Kudla G. Deep mutational scanning quantifies DNA binding and predicts clinical outcomes of PAX6 variants. Mol Syst Biol 2024; 20:825-844. [PMID: 38849565 PMCID: PMC11219921 DOI: 10.1038/s44320-024-00043-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/04/2023] [Revised: 04/05/2024] [Accepted: 05/14/2024] [Indexed: 06/09/2024]
Abstract
Nonsense and missense mutations in the transcription factor PAX6 cause a wide range of eye development defects, including aniridia, microphthalmia and coloboma. To understand how changes of PAX6:DNA binding cause these phenotypes, we combined saturation mutagenesis of the paired domain of PAX6 with a yeast one-hybrid (Y1H) assay in which expression of a PAX6-GAL4 fusion gene drives antibiotic resistance. We quantified binding of more than 2700 single amino-acid variants to two DNA sequence elements. Mutations in DNA-facing residues of the N-terminal subdomain and linker region were most detrimental, as were mutations to prolines and to negatively charged residues. Many variants caused sequence-specific molecular gain-of-function effects, including variants in position 71 that increased binding to the LE9 enhancer but decreased binding to a SELEX-derived binding site. In the absence of antibiotic selection, variants that retained DNA binding slowed yeast growth, likely because such variants perturbed the yeast transcriptome. Benchmarking against known patient variants and applying ACMG/AMP guidelines to variant classification, we obtained supporting-to-moderate evidence that 977 variants are likely pathogenic and 1306 are likely benign. Our analysis shows that most pathogenic mutations in the paired domain of PAX6 can be explained simply by the effects of these mutations on PAX6:DNA association, and establishes Y1H as a generalisable assay for the interpretation of variant effects in transcription factors.
Affiliation(s)
- Alexander F McDonnell
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, EH4 2XU, UK
| | - Marcin Plech
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, EH4 2XU, UK
| | - Benjamin J Livesey
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, EH4 2XU, UK
| | - Lukas Gerasimavicius
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, EH4 2XU, UK
| | - Liusaidh J Owen
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, EH4 2XU, UK
| | - Hildegard Nikki Hall
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, EH4 2XU, UK
| | - David R FitzPatrick
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, EH4 2XU, UK
| | - Joseph A Marsh
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, EH4 2XU, UK
| | - Grzegorz Kudla
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, EH4 2XU, UK.
|
11
|
Rubin AF. A new way of looking at transcription factor assays. Mol Syst Biol 2024; 20:741-743. [PMID: 38849564 PMCID: PMC11219719 DOI: 10.1038/s44320-024-00044-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2024] [Accepted: 05/14/2024] [Indexed: 06/09/2024] Open
Abstract
AF Rubin discusses a new high-throughput functional assay for transcription factors, applied in a deep mutational scanning study of the transcription factor PAX6 by Kudla and colleagues (McDonnell et al, 2023) in this issue of Molecular Systems Biology.
Affiliation(s)
- Alan F Rubin
- Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia.
- Department of Medical Biology, University of Melbourne, Parkville, Victoria, Australia.
|
12
|
Tabet DR, Kuang D, Lancaster MC, Li R, Liu K, Weile J, Coté AG, Wu Y, Hegele RA, Roden DM, Roth FP. Benchmarking computational variant effect predictors by their ability to infer human traits. Genome Biol 2024; 25:172. [PMID: 38951922 PMCID: PMC11218265 DOI: 10.1186/s13059-024-03314-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2022] [Accepted: 06/17/2024] [Indexed: 07/03/2024] Open
Abstract
BACKGROUND Computational variant effect predictors offer a scalable and increasingly reliable means of interpreting human genetic variation, but concerns of circularity and bias have limited previous methods for evaluating and comparing predictors. Population-level cohorts of genotyped and phenotyped participants that have not been used in predictor training can facilitate an unbiased benchmarking of available methods. Using a curated set of human gene-trait associations with a reported rare-variant burden association, we evaluate the correlations of 24 computational variant effect predictors with associated human traits in the UK Biobank and All of Us cohorts. RESULTS AlphaMissense outperformed all other predictors in inferring human traits based on rare missense variants in UK Biobank and All of Us participants. The overall rankings of computational variant effect predictors in these two cohorts showed a significant positive correlation. CONCLUSION We describe a method to assess computational variant effect predictors that sidesteps the limitations of previous evaluations. This approach is generalizable to future predictors and could continue to inform predictor choice for personal and clinical genetics.
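The abstract reports that predictor rankings in the two cohorts were positively correlated but does not name the statistic. As a minimal sketch, assuming a Spearman rank correlation is an acceptable way to compare two rankings, the comparison could look like this; the function names `ranks`, `pearson` and `spearman` are illustrative and not taken from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))
```

Here `x` and `y` would hold one performance value per predictor in each cohort, listed in the same predictor order. A value near 1 means the two cohorts rank the predictors similarly.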
Affiliation(s)
- Daniel R Tabet
- Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
| | - Da Kuang
- Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
| | - Megan C Lancaster
- Division of Cardiovascular Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Roujia Li
- Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
| | - Karen Liu
- Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
| | - Jochen Weile
- Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
| | - Atina G Coté
- Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
| | - Yingzhou Wu
- Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
| | - Robert A Hegele
- Department of Medicine, Department of Biochemistry, Schulich School of Medicine and Dentistry, Robarts Research Institute, Western University, London, ON, Canada
| | - Dan M Roden
- Division of Cardiovascular Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Pharmacology, Vanderbilt University Medical Centre, Nashville, TN, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Frederick P Roth
- Donnelly Centre, University of Toronto, Toronto, ON, Canada.
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada.
- Department of Computer Science, University of Toronto, Toronto, ON, Canada.
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada.
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA.
|
13
|
Jänes J, Müller M, Selvaraj S, Manoel D, Stephenson J, Gonçalves C, Lafita A, Polacco B, Obernier K, Alasoo K, Lemos MC, Krogan N, Martin M, Saraiva LR, Burke D, Beltrao P. Predicted mechanistic impacts of human protein missense variants. bioRxiv 2024:2024.05.29.596373. [PMID: 38854010 PMCID: PMC11160786 DOI: 10.1101/2024.05.29.596373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Genome sequencing efforts have led to the discovery of tens of millions of protein missense variants found in the human population with the majority of these having no annotated role and some likely contributing to trait variation and disease. Sequence-based artificial intelligence approaches have become highly accurate at predicting variants that are detrimental to the function of proteins but they do not inform on mechanisms of disruption. Here we combined sequence and structure-based methods to perform proteome-wide prediction of deleterious variants with information on their impact on protein stability, protein-protein interactions and small-molecule binding pockets. AlphaFold2 structures were used to predict approximately 100,000 small-molecule binding pockets and stability changes for over 200 million variants. To inform on protein-protein interfaces we used AlphaFold2 to predict structures for nearly 500,000 protein complexes. We illustrate the value of mechanism-aware variant effect predictions to study the relation between protein stability and abundance and the structural properties of interfaces underlying trans protein quantitative trait loci (pQTLs). We characterised the distribution of mechanistic impacts of protein variants found in patients and experimentally studied example disease linked variants in FGFR1.
Affiliation(s)
- Jürgen Jänes
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Marc Müller
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
| | - Senthil Selvaraj
- Sidra Medicine, Doha, Qatar
- College of Health and Life Sciences, Hamad Bin Khalifa University, Doha, Qatar
| | - Diogo Manoel
- Sidra Medicine, Doha, Qatar
- College of Health and Life Sciences, Hamad Bin Khalifa University, Doha, Qatar
| | - James Stephenson
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge CB10 1SD, UK
- Open Targets, Wellcome Genome Campus, Cambridge, CB10 1SA, UK
| | - Catarina Gonçalves
- Sidra Medicine, Doha, Qatar
- College of Health and Life Sciences, Hamad Bin Khalifa University, Doha, Qatar
| | | | - Benjamin Polacco
- Quantitative Biosciences Institute (QBI), University of California, San Francisco, CA, USA
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, CA, USA
| | - Kirsten Obernier
- Quantitative Biosciences Institute (QBI), University of California, San Francisco, CA, USA
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, CA, USA
| | - Kaur Alasoo
- Institute of Computer Science, University of Tartu, Tartu, Estonia
| | - Manuel C. Lemos
- CICS-UBI, Health Sciences Research Centre, University of Beira Interior, 6200-506, Covilhã, Portugal
| | - Nevan Krogan
- Quantitative Biosciences Institute (QBI), University of California, San Francisco, CA, USA
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, CA, USA
- J. David Gladstone Institutes, San Francisco, CA, USA
| | - Maria Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge CB10 1SD, UK
- Open Targets, Wellcome Genome Campus, Cambridge, CB10 1SA, UK
| | - Luis R. Saraiva
- Sidra Medicine, Doha, Qatar
- College of Health and Life Sciences, Hamad Bin Khalifa University, Doha, Qatar
| | - David Burke
- Faculty of Life Sciences and Medicine, King’s College, London, UK
| | - Pedro Beltrao
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge CB10 1SD, UK
- Open Targets, Wellcome Genome Campus, Cambridge, CB10 1SA, UK
|
14
|
Brock DC, Wang M, Hussain HMJ, Rauch DE, Marra M, Pennesi ME, Yang P, Everett L, Ajlan RS, Colbert J, Porto FBO, Matynia A, Gorin MB, Koenekoop RK, Lopez I, Sui R, Zou G, Li Y, Chen R. Comparative analysis of in-silico tools in identifying pathogenic variants in dominant inherited retinal diseases. Hum Mol Genet 2024; 33:945-957. [PMID: 38453143 PMCID: PMC11102593 DOI: 10.1093/hmg/ddae028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 02/16/2024] [Accepted: 02/19/2024] [Indexed: 03/09/2024] Open
Abstract
Inherited retinal diseases (IRDs) are a group of rare genetic eye conditions that cause blindness. Despite progress in identifying genes associated with IRDs, improvements are necessary for classifying rare autosomal dominant (AD) disorders. AD diseases are highly heterogeneous, with causal variants being restricted to specific amino acid changes within certain protein domains, making AD conditions difficult to classify. Here, we aim to determine the top-performing in-silico tools for predicting the pathogenicity of AD IRD variants. We annotated variants from ClinVar and benchmarked 39 variant classifier tools on IRD genes, split by inheritance pattern. Using area-under-the-curve (AUC) analysis, we determined the top-performing tools and defined thresholds for variant pathogenicity. Top-performing tools were assessed using genome sequencing on a cohort of participants with IRDs of unknown etiology. MutScore achieved the highest accuracy within AD genes, yielding an AUC of 0.969. When filtering for AD gain-of-function and dominant negative variants, BayesDel had the highest accuracy with an AUC of 0.997. Five participants with variants in NR2E3, RHO, GUCA1A, and GUCY2D were confirmed to have dominantly inherited disease based on pedigree, phenotype, and segregation analysis. We identified two uncharacterized variants in GUCA1A (c.428T>A, p.Ile143Thr) and RHO (c.631C>G, p.His211Asp) in three participants. Our findings support using a multi-classifier approach comprised of new missense classifier tools to identify pathogenic variants in participants with AD IRDs. Our results provide a foundation for improved genetic diagnosis for people with IRDs.
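The AUC values quoted above summarize how well a tool's scores separate pathogenic from benign variants. As a minimal sketch of what such a number means (not the authors' pipeline), the ROC AUC equals the fraction of pathogenic/benign pairs in which the pathogenic variant receives the higher score, with ties counted as half. The function name `roc_auc` and the 1/0 label encoding are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """ROC AUC from pairwise comparisons.

    labels: 1 for pathogenic, 0 for benign (assumed encoding).
    scores: predictor output, higher meaning more likely pathogenic.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one variant of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # pathogenic variant ranked above benign
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

This quadratic version is only meant to make the definition concrete. For large benchmark sets a rank-based formulation or an established statistics library would be the usual choice.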
Affiliation(s)
- Daniel C Brock
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, United States
- Medical Scientist Training Program, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, United States
| | - Meng Wang
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, United States
| | - Hafiz Muhammad Jafar Hussain
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, United States
| | - David E Rauch
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, United States
| | - Molly Marra
- Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, 515 SW Campus Drive, Portland, OR 97239, United States
| | - Mark E Pennesi
- Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, 515 SW Campus Drive, Portland, OR 97239, United States
| | - Paul Yang
- Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, 515 SW Campus Drive, Portland, OR 97239, United States
| | - Lesley Everett
- Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, 515 SW Campus Drive, Portland, OR 97239, United States
| | - Radwan S Ajlan
- Department of Ophthalmology, University of Kansas School of Medicine, 3901 Rainbow Blvd, Kansas City, KS 66160, United States
| | - Jason Colbert
- Department of Ophthalmology, University of Kansas School of Medicine, 3901 Rainbow Blvd, Kansas City, KS 66160, United States
| | - Fernanda Belga Ottoni Porto
- INRET Clínica e Centro de Pesquisa, Rua dos Otoni, 735/507 - Santa Efigênia, Belo Horizonte, MG 30150270, Brazil
- Department of Ophthalmology, Santa Casa de Misericórdia de Belo Horizonte, Av. Francisco Sales, 1111 - Santa Efigênia, Belo Horizonte, MG 30150221, Brazil
- Centro Oftalmológico de Minas Gerais, R. Santa Catarina, 941 - Lourdes, Belo Horizonte, MG 30180070, Brazil
| | - Anna Matynia
- College of Optometry, University of Houston, 4401 Martin Luther King Boulevard, Houston, TX 77004, United States
| | - Michael B Gorin
- Jules Stein Eye Institute, University of California Los Angeles, 100 Stein Plaza, Los Angeles, CA 90095, United States
- Department of Ophthalmology, University of California Los Angeles David Geffen School of Medicine, 10833 Le Conte Ave, Los Angeles, CA 90095, United States
| | - Robert K Koenekoop
- McGill Ocular Genetics Laboratory and Centre, Department of Paediatric Surgery, Human Genetics, and Ophthalmology, McGill University Health Centre, 5252 Boul de Maisonneuve ouest, Montreal, QC H4A 3S5, Canada
| | - Irma Lopez
- McGill Ocular Genetics Laboratory and Centre, Department of Paediatric Surgery, Human Genetics, and Ophthalmology, McGill University Health Centre, 5252 Boul de Maisonneuve ouest, Montreal, QC H4A 3S5, Canada
| | - Ruifang Sui
- Department of Ophthalmology, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, WC67+HW Dongcheng, Beijing 100005, China
| | - Gang Zou
- Department of Ophthalmology, Ningxia Eye Hospital, People's Hospital of Ningxia Hui Autonomous Region, First Affiliated Hospital of Northwest University for Nationalities, Ningxia Clinical Research Center on Diseases of Blindness in Eye, F4RJ+43 Xixia District, Yinchuan, Ningxia, China
| | - Yumei Li
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, United States
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, United States
| | - Rui Chen
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, United States
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, United States
|
15
|
Riccio C, Jansen ML, Guo L, Ziegler A. Variant effect predictors: a systematic review and practical guide. Hum Genet 2024; 143:625-634. [PMID: 38573379 PMCID: PMC11098935 DOI: 10.1007/s00439-024-02670-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Accepted: 03/11/2024] [Indexed: 04/05/2024]
Abstract
Large-scale association analyses using whole-genome sequence data have become feasible, but understanding the functional impacts of these associations remains challenging. Although many tools are available to predict the functional impacts of genetic variants, it is unclear which tool should be used in practice. This work provides a practical guide to assist in selecting appropriate tools for variant annotation. We conducted a MEDLINE search up to November 10, 2023, and included tools that are applicable to a broad range of phenotypes, can be used locally, and have been recently updated. Tools were categorized based on the types of variants they accept and the functional impacts they predict. Sequence Ontology terms were used for standardization. We identified 118 databases and software packages, encompassing 36 variant types and 161 functional impacts. Combining only three tools, namely SnpEff, FAVOR, and SparkINFERNO, allows predicting 99 (61%) distinct functional impacts. Thirty-seven tools predict 89 functional impacts that are not supported by any other tool, while 75 tools predict pathogenicity and can be used within the ACMG/AMP guidelines in a clinical context. We launched a website allowing researchers to select tools based on desired variants and impacts. In summary, more than 100 tools are already available to predict approximately 160 functional impacts. About 60% of the functional impacts can be predicted by the combination of three tools. Unexpectedly, recent tools do not predict more impacts than older ones. Future research should allow predicting the functionality of so far unsupported variant types, such as gene fusions. URL: https://cardio-care.shinyapps.io/VEP_Finder/. Registration: OSF Registries on November 10, 2023, https://osf.io/s2gct.
Affiliation(s)
- Cristian Riccio
- Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 1, Davos Wolfgang, 7265, Davos, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Max L Jansen
- Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 1, Davos Wolfgang, 7265, Davos, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Linlin Guo
- Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- University Center of Cardiovascular Science & Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
| | - Andreas Ziegler
- Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 1, Davos Wolfgang, 7265, Davos, Switzerland.
- Swiss Institute of Bioinformatics, Lausanne, Switzerland.
- Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany.
- University Center of Cardiovascular Science & Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany.
- School of Mathematics, Statistics, and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa.
|
16
|
Livesey BJ, Badonyi M, Dias M, Frazer J, Kumar S, Lindorff-Larsen K, McCandlish DM, Orenbuch R, Shearer CA, Muffley L, Foreman J, Glazer AM, Lehner B, Marks DS, Roth FP, Rubin AF, Starita LM, Marsh JA. Guidelines for releasing a variant effect predictor. arXiv 2024:arXiv:2404.10807v1. [PMID: 38699161 PMCID: PMC11065047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/05/2024]
Abstract
Computational methods for assessing the likely impacts of mutations, known as variant effect predictors (VEPs), are widely used in the assessment and interpretation of human genetic variation, as well as in other applications like protein engineering. Many different VEPs have been released to date, and there is tremendous variability in their underlying algorithms and outputs, and in the ways in which the methodologies and predictions are shared. This leads to considerable challenges for end users in knowing which VEPs to use and how to use them. Here, to address these issues, we provide guidelines and recommendations for the release of novel VEPs. Emphasising open-source availability, transparent methodologies, clear variant effect score interpretations, standardised scales, accessible predictions, and rigorous training data disclosure, we aim to improve the usability and interpretability of VEPs, and promote their integration into analysis and evaluation pipelines. We also provide a large, categorised list of currently available VEPs, aiming to facilitate the discovery and encourage the usage of novel methods within the scientific community.
Affiliation(s)
- Benjamin J. Livesey
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
| | - Mihaly Badonyi
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
| | - Mafalda Dias
- Centre for Genomic Regulation (CRG),The Barcelona Institute of Science and Technology, Barcelona, Spain
| | - Jonathan Frazer
- Centre for Genomic Regulation (CRG),The Barcelona Institute of Science and Technology, Barcelona, Spain
| | - Sushant Kumar
- Department of Medical Biophysics, University of Toronto; Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
| | - Kresten Lindorff-Larsen
- Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - David M. McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Rose Orenbuch
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | | | - Lara Muffley
- Department of Genome Sciences, University of Washington and the Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
| | - Julia Foreman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | | | - Ben Lehner
- Wellcome Sanger Institute, Cambridge, UK; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
| | - Debora S. Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Broad Institute of MIT and Harvard, Boston, MA, USA
| | - Frederick P. Roth
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
| | - Alan F. Rubin
- Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research; Department of Medical Biology, University of Melbourne, Parkville, Australia
| | - Lea M. Starita
- Department of Genome Sciences, University of Washington and the Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
| | - Joseph A. Marsh
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
|
17
|
Saez-Matia A, Ibarluzea MG, M-Alicante S, Muguruza-Montero A, Nuñez E, Ramis R, Ballesteros OR, Lasa-Goicuria D, Fons C, Gallego M, Casis O, Leonardo A, Bergara A, Villarroel A. MLe-KCNQ2: An Artificial Intelligence Model for the Prognosis of Missense KCNQ2 Gene Variants. Int J Mol Sci 2024; 25:2910. [PMID: 38474157 DOI: 10.3390/ijms25052910] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 02/27/2024] [Accepted: 02/29/2024] [Indexed: 03/14/2024] Open
Abstract
Despite the increasing availability of genomic data and enhanced data analysis procedures, predicting the severity of associated diseases remains elusive in the absence of clinical descriptors. To address this challenge, we have focused on the KV7.2 voltage-gated potassium channel gene (KCNQ2), known for its link to developmental delays and various epilepsies, including self-limited benign familial neonatal epilepsy and epileptic encephalopathy. Genome-wide tools often exhibit a tendency to overestimate deleterious mutations, frequently overlooking tolerated variants, and lack the capacity to discriminate variant severity. This study introduces a novel approach by evaluating multiple machine learning (ML) protocols and descriptors. The combination of genomic information with a novel Variant Frequency Index (VFI) builds a robust foundation for constructing reliable gene-specific ML models. The ensemble model, MLe-KCNQ2, formed through logistic regression, support vector machine, random forest and gradient boosting algorithms, achieves specificity and sensitivity values surpassing 0.95 (AUC-ROC > 0.98). The ensemble MLe-KCNQ2 model also categorizes pathogenic mutations as benign or severe, with an area under the receiver operating characteristic curve (AUC-ROC) above 0.67. This study not only presents a transferable methodology for accurately classifying KCNQ2 missense variants, but also provides valuable insights for clinical counseling and aids in the determination of variant severity. The research context emphasizes the necessity of precise variant classification, especially for genes like KCNQ2, contributing to the broader understanding of gene-specific challenges in the field of genomic research. The MLe-KCNQ2 model stands as a promising tool for enhancing clinical decision making and prognosis in the realm of KCNQ2-related pathologies.
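The abstract names the four base learners and reports sensitivity and specificity, but it does not say how the base predictions are combined. As a minimal sketch, assuming the ensemble simply averages the per-variant probabilities of the base models and thresholds the mean at 0.5 (both assumptions, as are the function names), the reported metrics could be computed like this:

```python
def ensemble_average(model_probabilities):
    """Average per-variant probabilities across models.

    model_probabilities: one list per model, each with one
    probability of pathogenicity per variant (hypothetical inputs).
    """
    return [sum(ps) / len(ps) for ps in zip(*model_probabilities)]


def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for binary labels and calls.

    labels and predictions use 1 for pathogenic and 0 for benign.
    """
    tp = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 1)
    fn = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 0)
    tn = sum(1 for l, p in zip(labels, predictions) if l == 0 and p == 0)
    fp = sum(1 for l, p in zip(labels, predictions) if l == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Illustrative probabilities from four hypothetical base models.
probabilities = [
    [0.9, 0.4, 0.2, 0.6],
    [0.8, 0.6, 0.1, 0.4],
    [0.7, 0.7, 0.3, 0.3],
    [0.6, 0.5, 0.2, 0.3],
]
mean_probability = ensemble_average(probabilities)
calls = [1 if p >= 0.5 else 0 for p in mean_probability]
```

Sensitivity is the share of pathogenic variants that are called pathogenic, and specificity is the share of benign variants that are called benign, so both need at least one variant of the corresponding class in the evaluation set.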
Affiliation(s)
| | - Markel G Ibarluzea
- Physics Department, Universidad del País Vasco, UPV/EHU, 48940 Leioa, Spain
- Donostia International Physics Center, 20018 Donostia, Spain
| | - Sara M-Alicante
- Instituto Biofisika, CSIC-UPV/EHU, 48940 Leioa, Spain
- Physics Department, Universidad del País Vasco, UPV/EHU, 48940 Leioa, Spain
| | | | - Eider Nuñez
- Instituto Biofisika, CSIC-UPV/EHU, 48940 Leioa, Spain
- Physics Department, Universidad del País Vasco, UPV/EHU, 48940 Leioa, Spain
| | - Rafael Ramis
- Physics Department, Universidad del País Vasco, UPV/EHU, 48940 Leioa, Spain
- Donostia International Physics Center, 20018 Donostia, Spain
| | - Oscar R Ballesteros
- Physics Department, Universidad del País Vasco, UPV/EHU, 48940 Leioa, Spain
- Centro de Física de Materiales CFM, CSIC-UPV/EHU, 20018 Donostia, Spain
| | | | - Carmen Fons
- Pediatric Neurology Department, Sant Joan de Déu Hospital, Institut de Recerca Sant Joan de Déu, Barcelona University, 08950 Barcelona, Spain
| | - Mónica Gallego
- Departamento de Fisiología, Universidad del País Vasco, UPV/EHU, 01006 Vitoria-Gasteiz, Spain
| | - Oscar Casis
- Departamento de Fisiología, Universidad del País Vasco, UPV/EHU, 01006 Vitoria-Gasteiz, Spain
| | - Aritz Leonardo
- Physics Department, Universidad del País Vasco, UPV/EHU, 48940 Leioa, Spain
- Donostia International Physics Center, 20018 Donostia, Spain
| | - Aitor Bergara
- Physics Department, Universidad del País Vasco, UPV/EHU, 48940 Leioa, Spain
- Donostia International Physics Center, 20018 Donostia, Spain
- Centro de Física de Materiales CFM, CSIC-UPV/EHU, 20018 Donostia, Spain
|
18
|
Schubach M, Maass T, Nazaretyan L, Röner S, Kircher M. CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Res 2024; 52:D1143-D1154. [PMID: 38183205 PMCID: PMC10767851 DOI: 10.1093/nar/gkad989] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/14/2023] [Revised: 10/14/2023] [Accepted: 10/17/2023] [Indexed: 01/07/2024] Open
Abstract
Machine Learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are readily provided via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/ to the community.
Affiliation(s)
- Max Schubach
- Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
| | - Thorben Maass
- Institute of Human Genetics, University Hospital Schleswig-Holstein, University of Lübeck, Lübeck, Germany
| | - Lusiné Nazaretyan
- Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
| | - Sebastian Röner
- Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
| | - Martin Kircher
- Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
- Institute of Human Genetics, University Hospital Schleswig-Holstein, University of Lübeck, Lübeck, Germany
| |
Collapse
|
19
|
Weissenow K, Rost B. Rendering protein mutation movies with MutAmore. BMC Bioinformatics 2023; 24:469. [PMID: 38087198 PMCID: PMC10714560 DOI: 10.1186/s12859-023-05610-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2023] [Accepted: 12/08/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND The success of AlphaFold2 in reliable protein three-dimensional (3D) structure prediction assists the move of structural biology toward studies of protein dynamics and mutational impact on structure and function. This transition needs tools that qualitatively assess alternative 3D conformations. RESULTS We introduce MutAmore, a bioinformatics tool that renders individual images of protein 3D structures for, e.g., sequence mutations into a visually intuitive movie format. MutAmore streamlines a pipeline casting single amino-acid variations (SAVs) into a dynamic 3D mutation movie providing a qualitative perspective on the mutational landscape of a protein. By default, the tool first generates all possible variants of the sequence reachable through SAVs (L*19 for proteins with L residues). Next, it predicts the structural conformation for all L*19 variants using state-of-the-art models. Finally, it visualizes the mutation matrix and produces a color-coded 3D animation. Alternatively, users can input other types of variants, e.g., from experimental structures. CONCLUSION MutAmore samples alternative protein configurations to study the dynamical space accessible from SAVs in the post-AlphaFold2 era of structural biology. As the field shifts towards the exploration of alternative conformations of proteins, MutAmore aids in the understanding of the structural impact of mutations by providing a flexible pipeline for the generation of protein mutation movies using current and future structure prediction models.
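The enumeration step mentioned in this abstract (all L*19 single amino-acid variants of a sequence with L residues) can be sketched in a few lines of Python. This is an illustrative sketch only, not MutAmore's code: the function name `enumerate_savs`, the label format, and the toy sequence are assumptions, and the count of 19 per position holds only when the input contains the 20 standard residues.

```python
# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def enumerate_savs(sequence):
    """Yield (label, mutant_sequence) for every single amino-acid variant.

    Hypothetical helper for illustration; labels use 1-based positions,
    e.g. "M1A" means wild-type M at position 1 replaced by A.
    """
    for pos, wild_type in enumerate(sequence):
        for substitute in AMINO_ACIDS:
            if substitute == wild_type:
                continue  # skip the wild-type residue itself
            label = f"{wild_type}{pos + 1}{substitute}"
            yield label, sequence[:pos] + substitute + sequence[pos + 1:]


# Toy 3-residue sequence (assumption, chosen only to keep the output small).
variants = dict(enumerate_savs("MKT"))
print(len(variants))  # 3 residues * 19 substitutions = 57
```

Each mutant sequence would then be passed to whichever structure predictor is in use; that step is omitted here because it depends on the chosen model.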
Affiliation(s)
- Konstantin Weissenow
  - Department of Informatics, Bioinformatics and Computational Biology i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Burkhard Rost
  - Department of Informatics, Bioinformatics and Computational Biology i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching, Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching, Munich, Germany
  - TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
|
20
|
James JK, Norland K, Johar AS, Kullo IJ. Deep generative models of LDLR protein structure to predict variant pathogenicity. J Lipid Res 2023; 64:100455. [PMID: 37821076 PMCID: PMC10696256 DOI: 10.1016/j.jlr.2023.100455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2023] [Revised: 09/16/2023] [Accepted: 10/05/2023] [Indexed: 10/13/2023] Open
Abstract
The complex structure and function of low density lipoprotein receptor (LDLR) makes classification of protein-coding missense variants challenging. Deep generative models, including Evolutionary model of Variant Effect (EVE), Evolutionary Scale Modeling (ESM), and AlphaFold 2 (AF2), have enabled significant progress in the prediction of protein structure and function. ESM and EVE directly estimate the likelihood of a variant sequence but are purely data-driven and challenging to interpret. AF2 predicts LDLR structures, but variant effects are explicitly modeled by estimating changes in stability. We tested the effectiveness of these models for predicting variant pathogenicity compared to established methods. AF2 produced two distinct conformations based on a novel hinge mechanism. Within ESM's hidden space, benign and pathogenic variants had different distributions. In EVE, these distributions were similar. EVE and ESM were comparable to Polyphen-2, SIFT, REVEL, and Primate AI for predicting binary classifications in ClinVar. However, they were more strongly correlated with experimental measures of LDL uptake. AF2 poorly performed in these tasks. Using the UK Biobank to compare association with clinical phenotypes, ESM and EVE were more strongly associated with serum LDL-C than Polyphen-2. ESM was able to identify variants with more extreme LDL-C levels than EVE and had a significantly stronger association with atherosclerotic cardiovascular disease. In conclusion, AF2 predicted LDLR structures do not accurately model variant pathogenicity. ESM and EVE are competitive with prior scoring methods for prediction based on binary classifications in ClinVar but are superior based on correlations with experimental assays and clinical phenotypes.
Affiliation(s)
- Jose K James
  - Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
- Kristjan Norland
  - Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
- Angad S Johar
  - Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
- Iftikhar J Kullo
  - Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA; Gonda Vascular Center, Mayo Clinic, Rochester, MN, USA
|
21
|
Cheng J, Novati G, Pan J, Bycroft C, Žemgulytė A, Applebaum T, Pritzel A, Wong LH, Zielinski M, Sargeant T, Schneider RG, Senior AW, Jumper J, Hassabis D, Kohli P, Avsec Ž. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 2023; 381:eadg7492. [PMID: 37733863 DOI: 10.1126/science.adg7492] [Citation(s) in RCA: 317] [Impact Index Per Article: 317.0] [Received: 01/19/2023] [Accepted: 08/23/2023] [Indexed: 09/23/2023]
Abstract
The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.
|
22
|
Abstract
Machine-learning algorithm uses structure prediction to spot disease-causing mutations.
Affiliation(s)
- Joseph A Marsh
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Sarah A Teichmann
  - Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK
  - Theory of Condensed Matter, Cavendish Laboratory, University of Cambridge, Cambridge, UK
|
23
|
Livesey BJ, Marsh JA. Advancing variant effect prediction using protein language models. Nat Genet 2023; 55:1426-1427. [PMID: 37563330 DOI: 10.1038/s41588-023-01470-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/12/2023]
Affiliation(s)
- Benjamin J Livesey
  - MRC Human Genetics Unit, Institute of Genetics & Cancer, University of Edinburgh, Edinburgh, UK
- Joseph A Marsh
  - MRC Human Genetics Unit, Institute of Genetics & Cancer, University of Edinburgh, Edinburgh, UK
|
24
|
Brandes N, Goldman G, Wang CH, Ye CJ, Ntranos V. Genome-wide prediction of disease variant effects with a deep protein language model. Nat Genet 2023; 55:1512-1522. [PMID: 37563329 PMCID: PMC10484790 DOI: 10.1038/s41588-023-01465-0] [Citation(s) in RCA: 69] [Impact Index Per Article: 69.0] [Received: 08/08/2022] [Accepted: 07/05/2023] [Indexed: 08/12/2023]
Abstract
Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in specific protein isoforms, demonstrating the importance of considering all isoforms when predicting variant effects. Our approach also generalizes to more complex coding variants such as in-frame indels and stop-gains. Together, these results establish protein language models as an effective, accurate and general approach to predicting variant effects.
Affiliation(s)
- Nadav Brandes
  - Division of Rheumatology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
- Grant Goldman
  - Biological and Medical Informatics Graduate Program, University of California, San Francisco, San Francisco, CA, USA
- Charlotte H Wang
  - Biomedical Sciences Graduate Program, University of California, San Francisco, San Francisco, CA, USA
- Chun Jimmie Ye
  - Division of Rheumatology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Parker Institute for Cancer Immunotherapy, University of California, San Francisco, San Francisco, CA, USA
  - Gladstone-UCSF Institute of Genomic Immunology, San Francisco, CA, USA
  - Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
  - Department of Epidemiology & Biostatistics, University of California, San Francisco, San Francisco, CA, USA
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- Vasilis Ntranos
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Department of Epidemiology & Biostatistics, University of California, San Francisco, San Francisco, CA, USA
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
  - Diabetes Center, University of California, San Francisco, San Francisco, CA, USA
|
25
|
Livesey BJ, Marsh JA. Updated benchmarking of variant effect predictors using deep mutational scanning. Mol Syst Biol 2023; 19:e11474. [PMID: 37310135 PMCID: PMC10407742 DOI: 10.15252/msb.202211474] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 11/23/2022] [Revised: 05/30/2023] [Accepted: 06/02/2023] [Indexed: 06/14/2023] Open
Abstract
The assessment of variant effect predictor (VEP) performance is fraught with biases introduced by benchmarking against clinical observations. In this study, building on our previous work, we use independently generated measurements of protein function from deep mutational scanning (DMS) experiments for 26 human proteins to benchmark 55 different VEPs, while introducing minimal data circularity. Many top-performing VEPs are unsupervised methods including EVE, DeepSequence and ESM-1v, a protein language model that ranked first overall. However, the strong performance of recent supervised VEPs, in particular VARITY, shows that developers are taking data circularity and bias issues seriously. We also assess the performance of DMS and unsupervised VEPs for discriminating between known pathogenic and putatively benign missense variants. Our findings are mixed, demonstrating that some DMS datasets perform exceptionally at variant classification, while others are poor. Notably, we observe a striking correlation between VEP agreement with DMS data and performance in identifying clinically relevant variants, strongly supporting the validity of our rankings and the utility of DMS for independent benchmarking.
Affiliation(s)
- Benjamin J Livesey
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Joseph A Marsh
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
|
26
|
Jagota M, Ye C, Albors C, Rastogi R, Koehl A, Ioannidis N, Song YS. Cross-protein transfer learning substantially improves disease variant prediction. Genome Biol 2023; 24:182. [PMID: 37550700 PMCID: PMC10408151 DOI: 10.1186/s13059-023-03024-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Received: 12/06/2022] [Accepted: 07/27/2023] [Indexed: 08/09/2023] Open
Abstract
BACKGROUND Genetic variation in the human genome is a major determinant of individual disease risk, but the vast majority of missense variants have unknown etiological effects. Here, we present a robust learning framework for leveraging saturation mutagenesis experiments to construct accurate computational predictors of proteome-wide missense variant pathogenicity. RESULTS We train cross-protein transfer (CPT) models using deep mutational scanning (DMS) data from only five proteins and achieve state-of-the-art performance on clinical variant interpretation for unseen proteins across the human proteome. We also improve predictive accuracy on DMS data from held-out proteins. High sensitivity is crucial for clinical applications and our model CPT-1 particularly excels in this regime. For instance, at 95% sensitivity of detecting human disease variants annotated in ClinVar, CPT-1 improves specificity to 68%, from 27% for ESM-1v and 55% for EVE. Furthermore, for genes not used to train REVEL, a supervised method widely used by clinicians, we show that CPT-1 compares favorably with REVEL. Our framework combines predictive features derived from general protein sequence models, vertebrate sequence alignments, and AlphaFold structures, and it is adaptable to the future inclusion of other sources of information. We find that vertebrate alignments, albeit rather shallow with only 100 genomes, provide a strong signal for variant pathogenicity prediction that is complementary to recent deep learning-based models trained on massive amounts of protein sequence data. We release predictions for all possible missense variants in 90% of human genes. CONCLUSIONS Our results demonstrate the utility of mutational scanning data for learning properties of variants that transfer to unseen proteins.
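The operating-point comparison quoted in this abstract (specificity at 95% sensitivity) can be made concrete with a small Python sketch. It is not the CPT authors' evaluation code: the function name, the convention that higher scores mean "more likely pathogenic", and the toy data are all assumptions made for illustration.

```python
import math


def specificity_at_sensitivity(scores, labels, target_sensitivity=0.95):
    """Specificity at the strictest threshold that still reaches the target sensitivity.

    scores: predictor outputs, assumed higher = more likely pathogenic.
    labels: 1 for pathogenic, 0 for benign.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one pathogenic and one benign variant")
    # How many pathogenic variants must be called pathogenic to hit the target.
    needed = max(1, math.ceil(target_sensitivity * len(positives)))
    threshold = positives[needed - 1]  # call pathogenic when score >= threshold
    true_negatives = sum(1 for s in negatives if s < threshold)
    return true_negatives / len(negatives)


# Toy data (hypothetical): four pathogenic and four benign variants.
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.6, 0.75]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(specificity_at_sensitivity(toy_scores, toy_labels, target_sensitivity=0.75))  # 0.75
```

Fixing sensitivity first and then reading off specificity mirrors the clinical framing in the abstract, where missing a disease variant is treated as the costlier error.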
Affiliation(s)
- Milind Jagota
  - Computer Science Division, University of California, Berkeley, 94720, CA, USA
- Chengzhong Ye
  - Department of Statistics, University of California, Berkeley, 94720, CA, USA
- Carlos Albors
  - Computer Science Division, University of California, Berkeley, 94720, CA, USA
- Ruchir Rastogi
  - Computer Science Division, University of California, Berkeley, 94720, CA, USA
- Antoine Koehl
  - Department of Statistics, University of California, Berkeley, 94720, CA, USA
- Nilah Ioannidis
  - Computer Science Division, University of California, Berkeley, 94720, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, 94158, CA, USA
  - Center for Computational Biology, University of California, Berkeley, 94720, CA, USA
- Yun S Song
  - Computer Science Division, University of California, Berkeley, 94720, CA, USA
  - Department of Statistics, University of California, Berkeley, 94720, CA, USA
  - Center for Computational Biology, University of California, Berkeley, 94720, CA, USA
|
27
|
Gerasimavicius L, Livesey BJ, Marsh JA. Correspondence between functional scores from deep mutational scans and predicted effects on protein stability. Protein Sci 2023; 32:e4688. [PMID: 37243972 PMCID: PMC10273344 DOI: 10.1002/pro.4688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2023] [Revised: 04/19/2023] [Accepted: 05/24/2023] [Indexed: 05/29/2023]
Abstract
Many methodologically diverse computational methods have been applied to the growing challenge of predicting and interpreting the effects of protein variants. As many pathogenic mutations have a perturbing effect on protein stability or intermolecular interactions, one highly interpretable approach is to use protein structural information to model the physical impacts of variants and predict their likely effects on protein stability and interactions. Previous efforts have assessed the accuracy of stability predictors in reproducing thermodynamically accurate values and evaluated their ability to distinguish between known pathogenic and benign mutations. Here, we take an alternate approach, and explore how well stability predictor scores correlate with functional impacts derived from deep mutational scanning (DMS) experiments. In this work, we compare the predictions of 9 protein stability-based tools against mutant protein fitness values from 49 independent DMS datasets, covering 170,940 unique single amino acid variants. We find that FoldX and Rosetta show the strongest correlations with DMS-based functional scores, similar to their previous top performance in distinguishing between pathogenic and benign variants. For both methods, performance is considerably improved when considering intermolecular interactions from protein complex structures, when available. Furthermore, using these two predictors, we derive a "Foldetta" consensus score, which improves upon the performance of both, and manages to match dedicated variant effect predictors in reflecting variant functional impacts. Finally, we also highlight that predicted stability effects show consistently higher correlations with certain DMS experimental phenotypes, particularly those based upon protein abundance, and, in certain cases, can significantly outcompete sequence-based variant effect prediction methodologies for predicting functional scores from DMS experiments.
Affiliation(s)
- Lukas Gerasimavicius
  - MRC Human Genetics Unit, Institute of Genetics & Cancer, University of Edinburgh, Edinburgh, UK
- Benjamin J. Livesey
  - MRC Human Genetics Unit, Institute of Genetics & Cancer, University of Edinburgh, Edinburgh, UK
- Joseph A. Marsh
  - MRC Human Genetics Unit, Institute of Genetics & Cancer, University of Edinburgh, Edinburgh, UK
|
28
|
Fu Y, Bedő J, Papenfuss AT, Rubin AF. Integrating deep mutational scanning and low-throughput mutagenesis data to predict the impact of amino acid variants. Gigascience 2022; 12:giad073. [PMID: 37721410 PMCID: PMC10506130 DOI: 10.1093/gigascience/giad073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Revised: 07/02/2023] [Accepted: 08/23/2023] [Indexed: 09/19/2023] Open
Abstract
BACKGROUND Evaluating the impact of amino acid variants has been a critical challenge for studying protein function and interpreting genomic data. High-throughput experimental methods like deep mutational scanning (DMS) can measure the effect of large numbers of variants in a target protein, but because DMS studies have not been performed on all proteins, researchers also model DMS data computationally to estimate variant impacts by predictors. RESULTS In this study, we extended a linear regression-based predictor to explore whether incorporating data from alanine scanning (AS), a widely used low-throughput mutagenesis method, would improve prediction results. To evaluate our model, we collected 146 AS datasets, mapping to 54 DMS datasets across 22 distinct proteins. CONCLUSIONS We show that improved model performance depends on the compatibility of the DMS and AS assays, and the scale of improvement is closely related to the correlation between DMS and AS results.
Affiliation(s)
- Yunfan Fu
  - The Walter and Eliza Hall Institute of Medical Research, Bioinformatics Division, 1G Royal Pde, Parkville, Victoria 3052, Australia
  - The University of Melbourne, Department of Medical Biology, Parkville, Victoria 3010, Australia
- Justin Bedő
  - The Walter and Eliza Hall Institute of Medical Research, Bioinformatics Division, 1G Royal Pde, Parkville, Victoria 3052, Australia
  - The University of Melbourne, Department of Medical Biology, Parkville, Victoria 3010, Australia
- Anthony T Papenfuss
  - The Walter and Eliza Hall Institute of Medical Research, Bioinformatics Division, 1G Royal Pde, Parkville, Victoria 3052, Australia
  - The University of Melbourne, Department of Medical Biology, Parkville, Victoria 3010, Australia
  - Peter MacCallum Cancer Centre, Melbourne, Victoria 3000, Australia
- Alan F Rubin
  - The Walter and Eliza Hall Institute of Medical Research, Bioinformatics Division, 1G Royal Pde, Parkville, Victoria 3052, Australia
  - The University of Melbourne, Department of Medical Biology, Parkville, Victoria 3010, Australia
|