1. Marquet C, Heinzinger M, Olenyi T, Dallago C, Erckert K, Bernhofer M, Nechaev D, Rost B. Embeddings from protein language models predict conservation and variant effects. Hum Genet 2022; 141:1629-1647. [PMID: 34967936 PMCID: PMC8716573 DOI: 10.1007/s00439-021-02411-y] [Citation(s) in RCA: 37] [Impact Index Per Article: 18.5] [Received: 06/01/2021] [Accepted: 12/06/2021] [Indexed: 12/13/2022]
Abstract
The emergence of SARS-CoV-2 variants stressed the demand for tools that allow interpreting the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient, MCC, of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, either consistently or statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy.
Our method predicted SAV effects for the entire human proteome (~ 20 k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA , and PredictProtein.
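For readers unfamiliar with the two-state MCC quoted in this abstract, the following minimal sketch shows how the coefficient is derived from the four confusion-matrix counts. It is an illustration only: the function name `matthews_corrcoef_binary` and the toy labels are assumptions made here and are not taken from VESPA or ConSeq.

```python
import math


def matthews_corrcoef_binary(y_true, y_pred):
    """Two-state MCC for 0/1 labels (hypothetical helper, not VESPA code).

    Returns 0.0 when the denominator vanishes, a common convention.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy example: 1 = conserved residue, 0 = variable residue (made-up labels).
observed = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 1]
mcc = matthews_corrcoef_binary(observed, predicted)
print(f"MCC = {mcc:.3f}")
```

MCC ranges from -1 (total disagreement) through 0 (no better than random) to 1 (perfect prediction), which is why it is often preferred over plain accuracy when the two classes are imbalanced.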
Affiliation(s)
- Céline Marquet
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Michael Heinzinger
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Tobias Olenyi
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Christian Dallago
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Kyra Erckert
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Michael Bernhofer
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Dmitrii Nechaev
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Burkhard Rost
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, Garching, 85748, Munich, Germany
  - TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
2. Savojardo C, Babbi G, Martelli PL, Casadio R. Mapping OMIM Disease-Related Variations on Protein Domains Reveals an Association Among Variation Type, Pfam Models, and Disease Classes. Front Mol Biosci 2021; 8:617016. [PMID: 34026820 PMCID: PMC8138129 DOI: 10.3389/fmolb.2021.617016] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/13/2020] [Accepted: 04/09/2021] [Indexed: 12/23/2022] Open
Abstract
Human genome resequencing projects provide an unprecedented amount of data about single-nucleotide variations occurring in protein-coding regions and often leading to observable changes in the covalent structure of gene products. For many of these variations, links to Online Mendelian Inheritance in Man (OMIM) genetic diseases are available and are reported in many databases that collect human variation data, such as Humsavar. However, the current knowledge of the molecular mechanisms that lead to disease is, in many cases, still limited. For understanding the complex mechanisms behind disease onset, the identification of putative models, when considering the protein structure and chemico-physical features of the variations, can be useful in many contexts, including early diagnosis and prognosis. In this study, we investigate the occurrence and distribution of human disease-related variations in the context of Pfam domains. The aim of this study is the identification and characterization of Pfam domains that are statistically more likely to be associated with disease-related variations. The study takes into consideration 2,513 human protein sequences with 22,763 disease-related variations. We describe patterns of disease-related variation types in biunivocal relation with Pfam domains, which are likely to be possible markers for linking Pfam domains to OMIM diseases. Furthermore, we take advantage of the specific association between disease-related variation types and Pfam domains for clustering diseases according to the Human Disease Ontology, and we establish a relation among variation types, Pfam domains, and disease classes. We find that Pfam models are specific markers of patterns of variation types and that they can serve to bridge genes, diseases, and disease classes. Data are available as Supplementary Material for 1,670 Pfam models, including 22,763 disease-related variations associated with 3,257 OMIM diseases.
Affiliation(s)
- Castrense Savojardo
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Giulia Babbi
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Pier Luigi Martelli
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Rita Casadio
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council, Bari, Italy
3. Qiu J, Nechaev D, Rost B. Protein-protein and protein-nucleic acid binding residues important for common and rare sequence variants in human. BMC Bioinformatics 2020; 21:452. [PMID: 33050876 PMCID: PMC7557062 DOI: 10.1186/s12859-020-03759-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/01/2020] [Accepted: 09/16/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Any two unrelated people differ by about 20,000 missense mutations (also referred to as SAVs: Single Amino acid Variants, or missense SNVs). Many SAVs have been predicted to strongly affect molecular protein function. Common SAVs (> 5% of population) were predicted to have, on average, more effect on molecular protein function than rare SAVs (< 1% of population). We hypothesized that the prevalence of effect in common over rare SAVs might partially be caused by common SAVs more often occurring at interfaces of proteins with other proteins, DNA, or RNA, thereby creating subgroup-specific phenotypes. We analyzed SAVs from 60,706 people through the lens of two prediction methods, one (SNAP2) predicting the effects of SAVs on molecular protein function, the other (ProNA2020) predicting residues in DNA-, RNA- and protein-binding interfaces.
RESULTS: Three results stood out. Firstly, SAVs predicted to occur at binding interfaces were predicted to more likely affect molecular function than those predicted as not binding (p value < 2.2 × 10⁻¹⁶). Secondly, for SAVs predicted to occur at binding interfaces, common SAVs were predicted more strongly with effect on protein function than rare SAVs (p value < 2.2 × 10⁻¹⁶). Restriction to SAVs with experimental annotations confirmed all results, although the resulting subsets were too small to establish statistical significance for any result. Thirdly, the fraction of SAVs predicted at binding interfaces differed significantly between tissues, e.g. urinary bladder tissue was found abundant in SAVs predicted at protein-binding interfaces, and reproductive tissues (ovary, testis, vagina, seminal vesicle and endometrium) in SAVs predicted at DNA-binding interfaces.
CONCLUSIONS: Overall, the results suggested that residues at protein-, DNA-, and RNA-binding interfaces contributed toward predicting that common SAVs more likely affect molecular function than rare SAVs.
Affiliation(s)
- Jiajun Qiu
  - Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), 85748, Garching, Germany
  - Biobank of Ninth People's Hospital, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200125, China
- Dmitrii Nechaev
  - Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), 85748, Garching, Germany
- Burkhard Rost
  - Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany
  - Institute of Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching, Munich, Germany
  - Institute for Food and Plant Sciences (WZW) Weihenstephan, Alte Akademie 8, 85354, Freising, Germany
4. Capriotti E, Montanucci L, Profiti G, Rossi I, Giannuzzi D, Aresu L, Fariselli P. Fido-SNP: the first webserver for scoring the impact of single nucleotide variants in the dog genome. Nucleic Acids Res 2020; 47:W136-W141. [PMID: 31114899 PMCID: PMC6602425 DOI: 10.1093/nar/gkz420] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/24/2019] [Revised: 04/19/2019] [Accepted: 05/06/2019] [Indexed: 12/22/2022] Open
Abstract
As the amount of genomic variation data increases, tools that are able to score the functional impact of single nucleotide variants become more and more necessary. While there are several prediction servers available for interpreting the effects of variants in the human genome, only a few have been developed for other species, and none were specifically designed for species of veterinary interest such as the dog. Here, we present Fido-SNP, the first predictor able to discriminate between Pathogenic and Benign single-nucleotide variants in the dog genome. Fido-SNP is a binary classifier based on the Gradient Boosting algorithm. It is able to classify and score the impact of variants in both coding and non-coding regions based on sequence features within seconds. When validated on a previously unseen set of annotated variants from the OMIA database, Fido-SNP reaches 88% overall accuracy, a Matthews correlation coefficient of 0.77, and an Area Under the ROC Curve of 0.91.
Affiliation(s)
- Emidio Capriotti
  - BioFolD Unit, Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Via F. Selmi 3, 40126 Bologna, Italy
- Ludovica Montanucci
  - Department of Comparative Biomedicine and Food Science, University of Padova, Viale dell'Università, 16, 35020 Legnaro (Padova), Italy
- Giuseppe Profiti
  - BioDec srl, Via Calzavecchio 20, 40033 Casalecchio di Reno (Bologna), Italy
- Ivan Rossi
  - BioDec srl, Via Calzavecchio 20, 40033 Casalecchio di Reno (Bologna), Italy
- Diana Giannuzzi
  - Department of Comparative Biomedicine and Food Science, University of Padova, Viale dell'Università, 16, 35020 Legnaro (Padova), Italy
- Luca Aresu
  - Department of Veterinary Sciences, University of Torino, Largo P. Braccini 2, 10095 Grugliasco (Torino), Italy
- Piero Fariselli
  - Department of Comparative Biomedicine and Food Science, University of Padova, Viale dell'Università, 16, 35020 Legnaro (Padova), Italy
  - Department of Medical Sciences, University of Torino, Via Santena 19, 10126 Torino, Italy
5. Reeb J, Wirth T, Rost B. Variant effect predictions capture some aspects of deep mutational scanning experiments. BMC Bioinformatics 2020; 21:107. [PMID: 32183714 PMCID: PMC7077003 DOI: 10.1186/s12859-020-3439-4] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 12/12/2019] [Accepted: 03/03/2020] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND: Deep mutational scanning (DMS) studies exploit the mutational landscape of sequence variation by systematically and comprehensively assaying the effect of single amino acid variants (SAVs; also referred to as missense mutations, or non-synonymous Single Nucleotide Variants - missense SNVs or nsSNVs) for particular proteins. We assembled SAV annotations from 22 different DMS experiments and normalized the effect scores to evaluate variant effect prediction methods: three trained on traditional variant effect data (PolyPhen-2, SIFT, SNAP2), a regression method optimized on DMS data (Envision), and a naïve prediction using conservation information from homologs.
RESULTS: On a set of 32,981 SAVs, all methods captured some aspects of the experimental effect scores, albeit not the same ones. Traditional methods such as SNAP2 correlated slightly more with measurements and better classified binary states (effect or neutral). Envision appeared to better estimate the precise degree of effect. Most surprising was that the simple naïve conservation approach using PSI-BLAST in many cases outperformed other methods. All methods captured beneficial effects (gain-of-function) significantly worse than deleterious ones (loss-of-function). For the few proteins with multiple independent experimental measurements, experiments differed substantially, but agreed more with each other than with predictions.
CONCLUSIONS: DMS provides a new powerful experimental means of understanding the dynamics of the protein sequence space. As always, promising new beginnings have to overcome challenges. While our results demonstrated that DMS will be crucial to improve variant effect prediction methods, data diversity hindered simplification and generalization.
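The abstract reports that predictions "correlated" with DMS measurements without naming the measure in this excerpt. The sketch below assumes Spearman rank correlation, a common choice for such comparisons; the helper names (`average_ranks`, `pearson`, `spearman`) and the toy scores are hypothetical and not taken from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up normalized DMS effect scores and predictor outputs for four SAVs.
experimental = [0.10, 0.40, 0.35, 0.80]
predicted = [0.20, 0.50, 0.60, 0.90]
rho = spearman(experimental, predicted)
print(f"Spearman rho = {rho:.2f}")
```

Because it compares ranks only, this measure is insensitive to how each experiment scales its effect scores, which matters when pooling heterogeneous DMS data sets.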
Affiliation(s)
- Jonas Reeb
  - Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr 3, 85748, Garching/Munich, Germany
- Theresa Wirth
  - Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr 3, 85748, Garching/Munich, Germany
- Burkhard Rost
  - Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr 3, 85748, Garching/Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr 2a, 85748, Garching/Munich, Germany
  - TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
  - Department of Biochemistry and Molecular Biophysics, Columbia University, 701 West, 168th Street, New York, NY, 10032, USA
6. Qiu J, Bernhofer M, Heinzinger M, Kemper S, Norambuena T, Melo F, Rost B. ProNA2020 predicts protein-DNA, protein-RNA, and protein-protein binding proteins and residues from sequence. J Mol Biol 2020; 432:2428-2443. [PMID: 32142788 DOI: 10.1016/j.jmb.2020.02.026] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 11/28/2019] [Revised: 02/17/2020] [Accepted: 02/23/2020] [Indexed: 11/29/2022]
Abstract
The intricate details of how proteins bind to proteins, DNA, and RNA are crucial for the understanding of almost all biological processes. Disease-causing sequence variants often affect binding residues. Here, we described a new, comprehensive system of in silico methods that take only protein sequence as input to predict binding of protein to DNA, RNA, and other proteins. Firstly, we needed to develop several new methods to predict whether or not proteins bind (per-protein prediction). Secondly, we developed independent methods that predict which residues bind (per-residue). Not requiring three-dimensional information, the system can predict the actual binding residue. The system combined homology-based inference with machine learning and motif-based profile-kernel approaches with word-based (ProtVec) solutions to machine learning protein level predictions. This achieved an overall non-exclusive three-state accuracy of 77% ± 1% (± one standard error) corresponding to a 1.8-fold improvement over random (best classification for protein-protein with F1 = 91 ± 0.8%). Standard neural networks for per-residue binding residue predictions appeared best for DNA-binding (Q2 = 81 ± 0.9%) followed by RNA-binding (Q2 = 80 ± 1%) and worst for protein-protein binding (Q2 = 69 ± 0.8%). The new method, dubbed ProNA2020, is available as code through GitHub (https://github.com/Rostlab/ProNA2020.git) and through PredictProtein (www.predictprotein.org).
Affiliation(s)
- Jiajun Qiu
  - Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Garching, 85748, Germany
- Michael Bernhofer
  - Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Garching, 85748, Germany
- Michael Heinzinger
  - Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Garching, 85748, Germany
- Sofie Kemper
  - Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany
- Tomas Norambuena
  - Molecular Bioinformatics Laboratory, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Santiago, Chile
- Francisco Melo
  - Molecular Bioinformatics Laboratory, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Santiago, Chile
  - Institute of Biological and Medical Engineering, Pontificia Universidad Católica de Chile, Santiago, Chile
- Burkhard Rost
  - Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany
  - Columbia University, Department of Biochemistry and Molecular Biophysics, 701 West, 168th Street, New York, NY, 10032, USA
  - Institute of Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany
  - Institute for Food and Plant Sciences (WZW) Weihenstephan, Alte Akademie 8, 85354 Freising, Germany
7. Cao Y, Sun Y, Karimi M, Chen H, Moronfoye O, Shen Y. Predicting pathogenicity of missense variants with weakly supervised regression. Hum Mutat 2019; 40:1579-1592. [PMID: 31144781 PMCID: PMC6744350 DOI: 10.1002/humu.23826] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/11/2019] [Revised: 05/23/2019] [Accepted: 05/27/2019] [Indexed: 12/27/2022]
Abstract
Quickly growing genetic variation data of unknown clinical significance demand computational methods that can reliably predict clinical phenotypes and deeply unravel molecular mechanisms. On the platform enabled by the Critical Assessment of Genome Interpretation (CAGI), we develop a novel "weakly supervised" regression (WSR) model that not only predicts precise clinical significance (probability of pathogenicity) from inexact training annotations (class of pathogenicity) but also infers underlying molecular mechanisms in a variant-specific manner. Compared to multiclass logistic regression, a representative multiclass classifier, our kernelized WSR improves the performance for the ENIGMA Challenge set from 0.72 to 0.97 in binary area under the receiver operating characteristic curve (AUC) and from 0.64 to 0.80 in ordinal multiclass AUC. WSR model interpretation and protein structural interpretation reach consensus in corroborating the most probable molecular mechanisms by which some pathogenic BRCA1 variants confer clinical significance, namely metal-binding disruption for p.C44F and p.C47Y, protein-binding disruption for p.M18T, and structure destabilization for p.S1715N.
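As a reminder of what the binary AUC figures above measure, the following sketch computes the area under the ROC curve through its pairwise interpretation: the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half. The function name `binary_auc` and the toy labels and scores are assumptions for illustration; this is not the evaluation code of the WSR study.

```python
def binary_auc(labels, scores):
    """ROC AUC via pairwise comparison of positives and negatives.

    labels: 1 for pathogenic, 0 for benign (hypothetical encoding).
    scores: predicted probability of pathogenicity.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up variant labels and predicted probabilities.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
auc = binary_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The double loop is quadratic in the number of variants, which is fine for a small challenge set; a rank-based formulation would be preferable for large data.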
Affiliation(s)
- Yue Cao
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas, 77843-3128, United States
- Yuanfei Sun
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas, 77843-3128, United States
- Mostafa Karimi
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas, 77843-3128, United States
- Haoran Chen
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas, 77843-3128, United States
- Oluwaseyi Moronfoye
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas, 77843-3128, United States
- Yang Shen
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas, 77843-3128, United States
8. Priyadarshini P, Mishra C, Sabat SS, Mandal M, Jyotiranjan T, Swain L, Sahoo M. Computational analysis of non-synonymous SNPs in bovine Mx1 gene. Gene Reports 2018. [DOI: 10.1016/j.genrep.2018.05.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
9. Pejaver V, Mooney SD, Radivojac P. Missense variant pathogenicity predictors generalize well across a range of function-specific prediction challenges. Hum Mutat 2017; 38:1092-1108. [PMID: 28508593 PMCID: PMC5561458 DOI: 10.1002/humu.23258] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 10/13/2016] [Revised: 03/16/2017] [Accepted: 03/26/2017] [Indexed: 11/08/2022]
Abstract
The steady advances in machine learning and accumulation of biomedical data have contributed to the development of numerous computational models that assess the impact of missense variants. Different methods, however, operationalize impact differently. Two common tasks in this context are the prediction of the pathogenicity of variants and the prediction of their effects on a protein's function. These are related but distinct problems, and it is unclear whether methods developed for one are optimized for the other. The Critical Assessment of Genome Interpretation (CAGI) experiment provides a means to address this question empirically. To this end, we participated in various protein-specific challenges in CAGI with two objectives in mind. First, to compare the performance of methods in the MutPred family with the state-of-the-art. Second and more importantly, to investigate the applicability of general-purpose pathogenicity predictors to the classification of specific function-altering variants without additional training or calibration. We find that our pathogenicity predictors performed competitively with other methods, outputting score distributions in agreement with experimental outcomes. Overall, we conclude that binary classifiers learned from disease-causing mutations are capable of modeling important aspects of the underlying biology and the alteration of protein function resulting from mutations.
Affiliation(s)
- Vikas Pejaver
  - Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana 47405
- Sean D. Mooney
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington 98109
- Predrag Radivojac
  - Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana 47405
10. Common sequence variants affect molecular function more than rare variants? Sci Rep 2017; 7:1608. [PMID: 28487536 PMCID: PMC5431670 DOI: 10.1038/s41598-017-01054-2] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 06/14/2016] [Accepted: 02/28/2017] [Indexed: 12/29/2022] Open
Abstract
Any two unrelated individuals differ by about 10,000 single amino acid variants (SAVs). Do these impact molecular function? Experiments cannot answer this comprehensively, while state-of-the-art prediction methods can. We predicted the functional impacts of SAVs within human and for variants between human and other species. Several surprising results stood out. Firstly, four methods (CADD, PolyPhen-2, SIFT, and SNAP2) agreed within 10 percentage points on the percentage of rare SAVs predicted with effect. However, they differed substantially for the common SAVs: SNAP2 predicted, on average, more effect for common than for rare SAVs. Given the large ExAC data sets sampling 60,706 individuals, the differences were extremely significant (p-value < 2.2e-16). We provided evidence that SNAP2 might be closer to reality for common SAVs than the other methods, due to its different focus in development. Secondly, we predicted significantly higher fractions of SAVs with effect between healthy individuals than between species; the difference increased for more distantly related species. The same trends were maintained for subsets of only housekeeping proteins and when moving from exomes of 1,000 to 60,000 individuals. SAVs frozen at speciation might maintain protein function, while many variants within a species might bring about crucial changes, for better or worse.
11. Mishra C, Kumar S, Yathish H. Predicting the effect of non synonymous SNPs in bovine TLR4 gene. Gene Reports 2017. [DOI: 10.1016/j.genrep.2016.11.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/25/2022]
12. Rost B, Radivojac P, Bromberg Y. Protein function in precision medicine: deep understanding with machine learning. FEBS Lett 2016; 590:2327-41. [PMID: 27423136 PMCID: PMC5937700 DOI: 10.1002/1873-3468.12307] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 06/27/2016] [Revised: 07/12/2016] [Accepted: 07/12/2016] [Indexed: 12/21/2022]
Abstract
Precision medicine and personalized health efforts propose leveraging complex molecular, medical and family history, along with other types of personal data toward better life. We argue that this ambitious objective will require advanced and specialized machine learning solutions. Simply skimming some low-hanging results off the data wealth might have limited potential. Instead, we need to better understand all parts of the system to define medically relevant causes and effects: how do particular sequence variants affect particular proteins and pathways? How do these effects, in turn, cause the health or disease-related phenotype? Toward this end, deeper understanding will not simply diffuse from deeper machine learning, but from more explicit focus on understanding protein function, context-specific protein interaction networks, and impact of variation on both.
Affiliation(s)
- Burkhard Rost
  - Department of Informatics and Bioinformatics, Institute for Advanced Studies, Technical University of Munich, Garching, Germany
- Predrag Radivojac
  - School of Informatics and Computing, Indiana University, Bloomington, IN, USA
- Yana Bromberg
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, USA