1. Yazar M, Ozbek P. Assessment of 13 in silico pathogenicity methods on cancer-related variants. Comput Biol Med 2022; 145:105434. [DOI: 10.1016/j.compbiomed.2022.105434] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2022] [Revised: 03/04/2022] [Accepted: 03/20/2022] [Indexed: 11/03/2022]

2. Computational studies of anaplastic lymphoma kinase mutations reveal common mechanisms of oncogenic activation. Proc Natl Acad Sci U S A 2021; 118:2019132118. [PMID: 33674381 PMCID: PMC7958353 DOI: 10.1073/pnas.2019132118] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/11/2022]
Abstract
High-risk tumors are genomically heterogeneous, harboring gene amplifications and mutations. The activation status of mutated proteins in cancer can profoundly impact disease progression, patient response, and drug sensitivity. Yet, outside of a few hotspot mutations, functional studies of clinically observed mutations are not commonly pursued. We report a combined experimental profiling and computational analysis of the effects of clinically observed and “test” mutations in the kinase domain of anaplastic lymphoma kinase (ALK), a known oncogenic driver in pediatric neuroblastoma. We find that the activation status of the mutated protein is a good indicator of the transforming ability in NIH 3T3 cells. We also report biophysical as well as data-driven models with predictive power to profile these mutant kinases in silico. Kinases play important roles in diverse cellular processes, including signaling, differentiation, proliferation, and metabolism. They are frequently mutated in cancer and are the targets of a large number of specific inhibitors. Surveys of cancer genome atlases reveal that kinase domains, which consist of 300 amino acids, can harbor numerous (150 to 200) single-point mutations across different patients in the same disease. This preponderance of mutations (some activating, some silent) in a known target protein makes clinical decisions for enrolling patients in drug trials challenging, since the relevance of the target and its drug sensitivity often depend on the mutational status in a given patient. We show through computational studies using molecular dynamics (MD) as well as enhanced sampling simulations that the experimentally determined activation status of a mutated kinase can be predicted effectively by identifying a hydrogen bonding fingerprint in the activation loop and the αC-helix regions, despite the fact that mutations in cancer patients occur throughout the kinase domain.
In our study, we find that the predictive power of MD is superior to a purely data-driven machine learning model involving biochemical features that we implemented, even though MD utilized far fewer features (in fact, just one) in an unsupervised setting. Moreover, the MD results provide key insights into convergent mechanisms of activation, primarily involving differential stabilization of a hydrogen bond network that engages residues of the activation loop and αC-helix in the active-like conformation (in >70% of the mutations studied, regardless of the location of the mutation).

3. Zhou JB, Xiong Y, An K, Ye ZQ, Wu YD. IDRMutPred: predicting disease-associated germline nonsynonymous single nucleotide variants (nsSNVs) in intrinsically disordered regions. Bioinformatics 2021; 36:4977-4983. [PMID: 32756939 PMCID: PMC7755418 DOI: 10.1093/bioinformatics/btaa618] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/18/2019] [Revised: 06/28/2020] [Accepted: 07/01/2020] [Indexed: 01/09/2023]
Abstract
Motivation: Despite lacking folded structure, intrinsically disordered regions (IDRs) of proteins play versatile roles in various biological processes, and many nonsynonymous single nucleotide variants (nsSNVs) in IDRs are associated with human diseases. The continuous accumulation of nsSNVs resulting from the wide application of NGS has driven the development of disease-association prediction methods for decades. However, their performance on nsSNVs in IDRs remains inferior, possibly because nsSNVs from structured regions dominate the training data. A disease-association predictor built specifically for nsSNVs in IDRs, with better performance, is therefore in high demand. Results: We present IDRMutPred, a machine learning-based tool specifically for predicting disease-associated germline nsSNVs in IDRs. Based on 17 selected optimal features extracted from sequence alignments, protein annotations, hydrophobicity indices and disorder scores, IDRMutPred was trained using three ensemble learning algorithms on a training dataset containing only IDR nsSNVs. Evaluation on the two testing datasets shows that all three prediction models significantly outperform 17 other popular general predictors, achieving accuracy (ACC) between 0.856 and 0.868 and Matthews correlation coefficient (MCC) between 0.713 and 0.737. IDRMutPred will prioritize disease-associated IDR germline nsSNVs more reliably than general predictors. Availability and implementation: The software is freely available at http://www.wdspdb.com/IDRMutPred. Supplementary information: Supplementary data are available at Bioinformatics online.
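The ACC and MCC figures quoted in this abstract are standard summaries of a binary confusion matrix. The sketch below shows how the two are computed; the function names and the counts are illustrative assumptions, not values or code from IDRMutPred.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal sum is zero (the usual convention),
    because the coefficient is undefined in that case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts chosen only to illustrate the arithmetic.
tp, tn, fp, fn = 90, 85, 15, 10
print(f"ACC = {accuracy(tp, tn, fp, fn):.3f}")
print(f"MCC = {mcc(tp, tn, fp, fn):.3f}")
```

MCC uses all four cells of the confusion matrix, which is why it is often reported next to accuracy when the disease-associated and neutral classes are unbalanced.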
Affiliation(s)
- Jing-Bo Zhou
  - Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
- Yao Xiong
  - Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
- Ke An
  - Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
- Zhi-Qiang Ye
  - Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
  - Shenzhen Bay Laboratory, Shenzhen 518055, China
- Yun-Dong Wu
  - Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
  - Shenzhen Bay Laboratory, Shenzhen 518055, China
  - College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China

4. Sarkar A, Yang Y, Vihinen M. Variation benchmark datasets: update, criteria, quality and applications. Database: The Journal of Biological Databases and Curation 2020; 2020:5710862. [PMID: 32016318 PMCID: PMC6997940 DOI: 10.1093/database/baz117] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 04/12/2019] [Revised: 06/03/2019] [Accepted: 07/01/2019] [Indexed: 02/07/2023]
Abstract
Development of new computational methods and testing their performance has to be carried out using experimental data. Only in comparison to existing knowledge can method performance be assessed. For that purpose, benchmark datasets with known and verified outcome are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time consuming to generate. VariBench and VariSNP are the two existing databases for sharing variation benchmark datasets used mainly for variation interpretation. They have been used for training and benchmarking predictors for various types of variations and their effects. VariBench was updated with 419 new datasets from 109 papers containing altogether 329 014 152 variants; however, there is plenty of redundancy between the datasets. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. The contents of the datasets vary depending on information in the original source. The available datasets have been categorized into 20 groups and subgroups. There are datasets for insertions and deletions, substitutions in coding and non-coding region, structure mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein property for aggregation, binding free energy, disorder and stability. Then there are several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. Variants are often described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level including relevant cross references and variant descriptions. The updated VariBench facilitates development and testing of new methods and comparison of obtained performances to previously published methods. 
We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies and show that such comparisons are feasible and useful; however, there may be limitations due to the lack of provided details and shared data. Database URL: http://structure.bmc.lu.se/VariBench
Affiliation(s)
- Anasua Sarkar
  - Department of Experimental Medical Science, BMC B13, Lund University, SE-22 184 Lund, Sweden
- Yang Yang
  - School of Computer Science and Technology, Soochow University, No1. Shizi Street, Suzhou, 215006 Jiangsu, China
  - Provincial Key Laboratory for Computer Information Processing Technology, No1. Shizi Street, Soochow University, Suzhou, 215006 Jiangsu, China
- Mauno Vihinen
  - Department of Experimental Medical Science, BMC B13, Lund University, SE-22 184 Lund, Sweden

5. Zhu C, Miller M, Zeng Z, Wang Y, Mahlich Y, Aptekmann A, Bromberg Y. Computational Approaches for Unraveling the Effects of Variation in the Human Genome and Microbiome. Annu Rev Biomed Data Sci 2020. [DOI: 10.1146/annurev-biodatasci-030320-041014] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 02/01/2023]
Abstract
The past two decades of analytical efforts have highlighted how much more remains to be learned about the human genome and, particularly, its complex involvement in promoting disease development and progression. While numerous computational tools exist for the assessment of the functional and pathogenic effects of genome variants, their precision is far from satisfactory, particularly for clinical use. Accumulating evidence also suggests that the human microbiome's interaction with the human genome plays a critical role in determining health and disease states. While numerous microbial taxonomic groups and molecular functions of the human microbiome have been associated with disease, the reproducibility of these findings is lacking. The human microbiome–genome interaction in healthy individuals is even less well understood. This review summarizes the available computational methods built to analyze the effect of variation in the human genome and microbiome. We address the applicability and precision of these methods across their possible uses. We also briefly discuss the exciting, necessary, and now possible integration of the two types of data to improve the understanding of pathogenicity mechanisms.
Affiliation(s)
- Chengsheng Zhu
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey 08873, USA
- Maximilian Miller
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey 08873, USA
- Zishuo Zeng
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey 08873, USA
- Yanran Wang
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey 08873, USA
- Yannick Mahlich
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey 08873, USA
- Ariel Aptekmann
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey 08873, USA
- Yana Bromberg
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey 08873, USA
  - Department of Genetics, Rutgers University, Piscataway, New Jersey 08854, USA

6. Jordan EJ, Patil K, Suresh K, Park JH, Mosse YP, Lemmon MA, Radhakrishnan R. Computational algorithms for in silico profiling of activating mutations in cancer. Cell Mol Life Sci 2019; 76:2663-2679. [PMID: 30982079 PMCID: PMC6589134 DOI: 10.1007/s00018-019-03097-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 01/10/2019] [Revised: 04/01/2019] [Accepted: 04/08/2019] [Indexed: 12/17/2022]
Abstract
Methods to catalog and computationally assess the mutational landscape of proteins in human cancers are desirable. One approach is to adapt evolutionary or data-driven methods developed for predicting whether a single-nucleotide polymorphism (SNP) is deleterious to protein structure and function. In cases where understanding the mechanism of protein activation and regulation is desired, an alternative approach is to employ structure-based computational approaches to predict the effects of point mutations. Through a case study of mutations in kinase domains of three proteins, namely, the anaplastic lymphoma kinase (ALK) in pediatric neuroblastoma patients, serine/threonine-protein kinase B-Raf (BRAF) in melanoma patients, and erythroblastic oncogene B 2 (ErbB2 or HER2) in breast cancer patients, we compare the two approaches above. We find that the structure-based method is most appropriate for developing a binary classification of several different mutations, especially infrequently occurring ones, concerning the activation status of the given target protein. This approach is especially useful if the effects of mutations on the interactions of inhibitors with the target proteins are being sought. However, many patients will present with mutations spread across different target proteins, making structure-based models computationally demanding to implement and execute. In this situation, data-driven methods, including those based on machine learning techniques and evolutionary methods, are most appropriate for recognizing and illuminating mutational patterns. We show, however, that, in the present state of the field, the two methods have very different accuracies and confidence values, and hence the optimal choice of their deployment is context-dependent.
Affiliation(s)
- E Joseph Jordan
  - Graduate Group in Biochemistry and Molecular Biophysics, University of Pennsylvania, Philadelphia, PA, USA
- Keshav Patil
  - Department of Chemical and Biomolecular Engineering, University of Pennsylvania, Philadelphia, PA, USA
- Krishna Suresh
  - Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, USA
- Jin H Park
  - Department of Pharmacology, Yale University, New Haven, CT, USA
  - Cancer Biology Institute, Yale University, West Haven, CT, USA
- Yael P Mosse
  - Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Mark A Lemmon
  - Department of Pharmacology, Yale University, New Haven, CT, USA
  - Cancer Biology Institute, Yale University, West Haven, CT, USA
- Ravi Radhakrishnan
  - Graduate Group in Biochemistry and Molecular Biophysics, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Chemical and Biomolecular Engineering, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, USA

7. Huang S, Cai N, Pacheco PP, Narrandes S, Wang Y, Xu W. Applications of Support Vector Machine (SVM) Learning in Cancer Genomics. Cancer Genomics Proteomics 2018; 15:41-51. [PMID: 29275361 PMCID: PMC5822181 DOI: 10.21873/cgp.20063] [Citation(s) in RCA: 327] [Impact Index Per Article: 54.5] [Received: 09/07/2017] [Revised: 10/03/2017] [Accepted: 10/23/2017] [Indexed: 12/23/2022]
Abstract
Machine learning with maximization (support) of separating margin (vector), called support vector machine (SVM) learning, is a powerful classification tool that has been used for cancer genomic classification or subtyping. Today, as advancements in high-throughput technologies lead to the production of large amounts of genomic and epigenomic data, the classification feature of SVMs is expanding its use in cancer genomics, leading to the discovery of new biomarkers, new drug targets, and a better understanding of cancer driver genes. Herein we review the recent progress of SVMs in cancer genomic studies. We aim to understand the strengths of SVM learning and its future prospects in cancer genomic applications.
Affiliation(s)
- Shujun Huang
  - College of Pharmacy, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, Canada
  - Research Institute of Oncology and Hematology, CancerCare Manitoba, Winnipeg, Canada
- Nianguang Cai
  - Research Institute of Oncology and Hematology, CancerCare Manitoba, Winnipeg, Canada
- Pedro Penzuti Pacheco
  - Research Institute of Oncology and Hematology, CancerCare Manitoba, Winnipeg, Canada
- Shavira Narrandes
  - Research Institute of Oncology and Hematology, CancerCare Manitoba, Winnipeg, Canada
  - Departments of Biochemistry and Medical Genetics, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, Canada
- Yang Wang
  - Department of Computer Science, Faculty of Sciences, University of Manitoba, Winnipeg, Canada
- Wayne Xu
  - Research Institute of Oncology and Hematology, CancerCare Manitoba, Winnipeg, Canada
  - Departments of Biochemistry and Medical Genetics, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, Canada
  - College of Pharmacy, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, Canada

8. Riera C, Padilla N, de la Cruz X. The Complementarity Between Protein-Specific and General Pathogenicity Predictors for Amino Acid Substitutions. Hum Mutat 2016; 37:1013-24. [DOI: 10.1002/humu.23048] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Received: 03/30/2016] [Revised: 06/30/2016] [Accepted: 07/06/2016] [Indexed: 11/06/2022]
Affiliation(s)
- Casandra Riera
  - Research Unit in Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Natàlia Padilla
  - Research Unit in Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Xavier de la Cruz
  - Research Unit in Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
  - ICREA, Barcelona, Spain

9. Pons T, Vazquez M, Matey-Hernandez ML, Brunak S, Valencia A, Izarzugaza JM. KinMutRF: a random forest classifier of sequence variants in the human protein kinase superfamily. BMC Genomics 2016; 17 Suppl 2:396. [PMID: 27357839 PMCID: PMC4928150 DOI: 10.1186/s12864-016-2723-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 01/28/2023]
Abstract
Background: The association between aberrant signal processing by protein kinases and human diseases such as cancer was established a long time ago. However, understanding the link between sequence variants in the protein kinase superfamily and the mechanistic complex traits at the molecular level remains challenging: cells tolerate most genomic alterations, and only a minor fraction disrupt molecular function sufficiently to drive disease. Results: KinMutRF is a novel random-forest method to automatically identify pathogenic variants in human kinases. Twenty-six decision trees implemented as a random forest weigh a battery of features that characterize the variants: a) at the gene level, including membership in a Kinbase group and Gene Ontology terms; b) at the PFAM domain level; and c) at the residue level, the types of amino acids involved, changes in biochemical properties, and functional annotations from UniProt, Phospho.ELM and FireDB. KinMutRF identifies disease-associated variants satisfactorily (Acc: 0.88, Prec: 0.82, Rec: 0.75, F-score: 0.78, MCC: 0.68) when trained and cross-validated with the 3689 human kinase variants from UniProt that have been annotated as neutral or pathogenic. All unclassified variants were excluded from the training set. Furthermore, KinMutRF is discussed with respect to two independent kinase-specific sets of mutations not included in the training and testing, Kin-Driver (643 variants) and Pon-BTK (1495 variants). Moreover, we provide predictions for the 848 protein kinase variants in UniProt that remained unclassified. A public implementation of KinMutRF, including documentation and examples, is available online (http://kinmut2.bioinfo.cnio.es). The source code for local installation is released under a GPL version 3 license and can be downloaded from https://github.com/Rbbt-Workflows/KinMut2. Conclusions: KinMutRF is capable of classifying kinase variation with good performance.
Predictions by KinMutRF compare favorably in a benchmark with other state-of-the-art methods (i.e. SIFT, PolyPhen-2, MutationAssessor, MutationTaster, LRT, CADD, FATHMM, and VEST). Kinase-specific features rank as the most elucidatory in terms of information gain and are likely responsible for the improvement in prediction performance. This advocates for the development of family-specific classifiers able to exploit the discriminatory power of features unique to individual protein families. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2723-1) contains supplementary material, which is available to authorized users.
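The F-score quoted in this abstract can be cross-checked against the quoted precision and recall, assuming it is the usual balanced F1 (the harmonic mean of the two); the abstract does not state the variant used, so that is an assumption. A minimal sketch, with a hypothetical helper name:

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention when both inputs are zero
    return 2 * precision * recall / (precision + recall)

# Values as quoted in the KinMutRF abstract.
reported = {"precision": 0.82, "recall": 0.75, "f_score": 0.78}

computed = f_score(reported["precision"], reported["recall"])
print(f"computed F1 = {computed:.2f}")  # prints "computed F1 = 0.78"
```

The computed value matches the reported F-score to two decimals, which is consistent with the F1 reading.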
Affiliation(s)
- Tirso Pons
  - Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029, Madrid, Spain
- Miguel Vazquez
  - Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029, Madrid, Spain
- María Luisa Matey-Hernandez
  - Center for Biological Sequence Analysis (CBS), Systems Biology Department, Technical University of Denmark (DTU), Kemitorvet, Building 208, 2800 Kgs., Lyngby, Denmark
- Søren Brunak
  - Center for Biological Sequence Analysis (CBS), Systems Biology Department, Technical University of Denmark (DTU), Kemitorvet, Building 208, 2800 Kgs., Lyngby, Denmark
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Blegdamsvej 3A, 2200, Copenhagen, Denmark
- Alfonso Valencia
  - Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029, Madrid, Spain
- Jose M G Izarzugaza
  - Center for Biological Sequence Analysis (CBS), Systems Biology Department, Technical University of Denmark (DTU), Kemitorvet, Building 208, 2800 Kgs., Lyngby, Denmark

10. Vazquez M, Pons T, Brunak S, Valencia A, Izarzugaza JMG. wKinMut-2: Identification and Interpretation of Pathogenic Variants in Human Protein Kinases. Hum Mutat 2015; 37:36-42. [PMID: 26443060 DOI: 10.1002/humu.22914] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 04/02/2015] [Accepted: 09/22/2015] [Indexed: 12/31/2022]
Abstract
Most genomic alterations are tolerated, while only a minor fraction disrupts molecular function sufficiently to drive disease. Protein kinases play a central biological function, and the functional consequences of their variants are abundantly characterized. However, this heterogeneous information is often scattered across different sources, which makes integrative analysis complex and laborious. wKinMut-2 constitutes a solution to facilitate the interpretation of the consequences of human protein kinase variation. Nine methods predict their pathogenicity, including a kinase-specific random forest approach. To understand the biological mechanisms causative of human diseases and cancer, information from pertinent reference knowledge bases and the literature is automatically mined, digested, and homogenized. Variants are visualized in their structural contexts, and residues affecting catalysis and drug binding are identified. Known protein-protein interactions are reported. Altogether, this information is intended to assist the generation of new working hypotheses to be corroborated with subsequent experimental work. The wKinMut-2 system, along with a user manual and examples, is freely accessible at http://kinmut2.bioinfo.cnio.es; the code for local installations can be downloaded from https://github.com/Rbbt-Workflows/KinMut2.
Affiliation(s)
- Miguel Vazquez
  - Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
- Tirso Pons
  - Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
- Søren Brunak
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen 2200, Denmark
  - Center for Biological Sequence Analysis (CBS), Systems Biology Department, Technical University of Denmark (DTU), Kongens Lyngby 2800, Denmark
- Alfonso Valencia
  - Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
- Jose M G Izarzugaza
  - Center for Biological Sequence Analysis (CBS), Systems Biology Department, Technical University of Denmark (DTU), Kongens Lyngby 2800, Denmark

11. McSkimming DI, Dastgheib S, Talevich E, Narayanan A, Katiyar S, Taylor SS, Kochut K, Kannan N. ProKinO: a unified resource for mining the cancer kinome. Hum Mutat 2015; 36:175-86. [PMID: 25382819 PMCID: PMC4342772 DOI: 10.1002/humu.22726] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Received: 06/24/2014] [Accepted: 10/21/2014] [Indexed: 12/31/2022]
Abstract
Protein kinases represent a large and diverse family of evolutionarily related proteins that are abnormally regulated in human cancers. Although genome sequencing studies have revealed thousands of variants in protein kinases, translating "big" genomic data into biological knowledge remains a challenge. Here, we describe an ontological framework for integrating and conceptualizing diverse forms of information related to kinase activation and regulatory mechanisms in a machine readable, human understandable form. We demonstrate the utility of this framework in analyzing the cancer kinome, and in generating testable hypotheses for experimental studies. Through the iterative process of aggregate ontology querying, hypothesis generation and experimental validation, we identify a novel mutational hotspot in the αC-β4 loop of the kinase domain and demonstrate the functional impact of the identified variants in epidermal growth factor receptor (EGFR) constitutive activity and inhibitor sensitivity. We provide a unified resource for the kinase and cancer community, ProKinO, housed at http://vulcan.cs.uga.edu/prokino.

12. Riera C, Lois S, Domínguez C, Fernandez-Cadenas I, Montaner J, Rodríguez-Sureda V, de la Cruz X. Molecular damage in Fabry disease: characterization and prediction of alpha-galactosidase A pathological mutations. Proteins 2014; 83:91-104. [PMID: 25382311 DOI: 10.1002/prot.24708] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 06/06/2014] [Revised: 09/25/2014] [Accepted: 10/18/2014] [Indexed: 12/12/2022]
Abstract
Loss-of-function mutations of the enzyme alpha-galactosidase A (GLA) cause Fabry disease (FD), a rare and potentially fatal disease. Identification of these pathological mutations by sequencing is important because it allows early treatment of the disease. However, before making any treatment decision, if the mutation identified is unknown, we first need to establish whether it is pathological. General bioinformatic tools (PolyPhen-2, SIFT, Condel, etc.) can be used for this purpose, but their performance is still limited. Here we present a new tool, specifically derived for the assessment of GLA mutations. We first compared mutations of this enzyme known to cause FD with neutral sequence variants, using several structure and sequence properties. Then, we used these properties to develop a family of prediction methods adapted to different quality requirements. Trained and tested on a set of known Fabry mutations, our methods have a performance (Matthews correlation: 0.56-0.72) comparable to or better than that of the more complex method PolyPhen-2 (Matthews correlation: 0.61), and better than those of SIFT (Matthews correlation: 0.54) and Condel (Matthews correlation: 0.51). This result is validated in an independent set of 65 pathological mutations, for which our method displayed the best success rate (91.0%, 87.7%, and 73.8%, for our method, PolyPhen-2 and SIFT, respectively). These data confirmed that our specific approach can effectively contribute to the identification of pathological mutations in GLA, and therefore enhance the use of sequence information in the identification of undiagnosed Fabry patients.
Affiliation(s)
- Casandra Riera
  - Research Unit in Translational Bioinformatics, Institut de Recerca Hospital Vall d'Hebron (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain

13. Jordan EJ, Radhakrishnan R. Machine Learning Predictions of Cancer Driver Mutations. PROCEEDINGS OF THE 2014 6TH INTERNATIONAL ADVANCED RESEARCH WORKSHOP ON IN SILICO ONCOLOGY AND CANCER INVESTIGATION : THE CHIC PROJECT WORKSHOP (IARWISOCI) : ATHENS, GREECE, 3-4 NOVEMBER 2014. INTERNATIONAL ADVANCED RESEARCH WORKSHOP ON... 2014; 2014:10.1109/iarwisoci.2014.7034632. [PMID: 34622251 PMCID: PMC8494176 DOI: 10.1109/iarwisoci.2014.7034632] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/01/2023]
Abstract
A method to predict the activation status of kinase domain mutations in cancer is presented. This method, which makes use of the machine learning technique support vector machines (SVM), has applications to cancer treatment, as well as numerous other diseases that involve kinase misregulation.
Affiliation(s)
- E Joseph Jordan
  - The University of Pennsylvania Biochemistry and Molecular Biophysics Graduate Group, PA 19104 USA
- Ravi Radhakrishnan
  - The University of Pennsylvania Departments of Bioengineering, Chemical and Biomolecular Engineering, Biochemistry and Biophysics, PA 19104 USA
|
14
|
Prediction and prioritization of rare oncogenic mutations in the cancer Kinome using novel features and multiple classifiers. PLoS Comput Biol 2014; 10:e1003545. [PMID: 24743239 PMCID: PMC3990476 DOI: 10.1371/journal.pcbi.1003545] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 05/21/2013] [Accepted: 02/18/2014] [Indexed: 01/18/2023]
Abstract
Cancer is a genetic disease that develops through a series of somatic mutations, a subset of which drive cancer progression. Although cancer genome sequencing studies are beginning to reveal the mutational patterns of genes in various cancers, identifying the small subset of “causative” mutations from the large subset of “non-causative” mutations, which accumulate as a consequence of the disease, is a challenge. In this article, we present an effective machine learning approach for identifying cancer-associated mutations in human protein kinases, a class of signaling proteins known to be frequently mutated in human cancers. We evaluate the performance of 11 well known supervised learners and show that a multiple-classifier approach, which combines the performances of individual learners, significantly improves the classification of known cancer-associated mutations. We introduce several novel features related specifically to structural and functional characteristics of protein kinases and find that the level of conservation of the mutated residue at specific evolutionary depths is an important predictor of oncogenic effect. We consolidate the novel features and the multiple-classifier approach to prioritize and experimentally test a set of rare unconfirmed mutations in the epidermal growth factor receptor tyrosine kinase (EGFR). Our studies identify T725M and L861R as rare cancer-associated mutations inasmuch as these mutations increase EGFR activity in the absence of the activating EGF ligand in cell-based assays. Cancer progresses by accumulation of mutations in a subset of genes that confer growth advantage. The 518 protein kinase genes encoded in the human genome, collectively called the kinome, represent one of the largest families of oncogenes. Targeted sequencing studies of many different cancers have shown that the mutational landscape comprises both cancer-causing “driver” mutations and harmless “passenger” mutations. 
While the frequent recurrence of some driver mutations in human cancers helps distinguish them from the large number of passenger mutations, a significant challenge is to identify the rare “driver” mutations that are less frequently observed in patient samples and yet are causative. Here we combine computational and experimental approaches to identify rare cancer-associated mutations in Epidermal Growth Factor receptor kinase (EGFR), a signaling protein frequently mutated in cancers. Specifically, we evaluate a novel multiple-classifier approach and features specific to the protein kinase super-family in distinguishing known cancer-associated mutations from benign mutations. We then apply the multiple classifier to identify and test the functional impact of rare cancer-associated mutations in EGFR. We report, for the first time, that the EGFR mutations T725M and L861R, which are infrequently observed in cancers, constitutively activate EGFR in a manner analogous to the frequently observed driver mutations.
|
15
|
Izarzugaza JMG, Vazquez M, del Pozo A, Valencia A. wKinMut: an integrated tool for the analysis and interpretation of mutations in human protein kinases. BMC Bioinformatics 2013; 14:345. [PMID: 24289158 PMCID: PMC3879071 DOI: 10.1186/1471-2105-14-345] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 11/07/2012] [Accepted: 05/30/2013] [Indexed: 11/13/2022]
Abstract
Background: Protein kinases are involved in relevant physiological functions and a broad number of mutations in this superfamily have been reported in the literature to affect protein function and stability. Unfortunately, the exploration of the consequences on the phenotypes of each individual mutation remains a considerable challenge.
Results: The wKinMut web-server offers direct prediction of the potential pathogenicity of the mutations from a number of methods, including our recently developed prediction method based on the combination of information from a range of diverse sources, including physicochemical properties and functional annotations from FireDB and Swissprot and kinase-specific characteristics such as the membership to specific kinase groups, the annotation with disease-associated GO terms or the occurrence of the mutation in PFAM domains, and the relevance of the residues in determining kinase subfamily specificity from S3Det. This predictor yields interesting results that compare favourably with other methods in the field when applied to protein kinases. Together with the predictions, wKinMut offers a number of integrated services for the analysis of mutations. These include: the classification of the kinase, information about associations of the kinase with other proteins extracted from iHop, the mapping of the mutations onto PDB structures, pathogenicity records from a number of databases and the classification of mutations in large-scale cancer studies. Importantly, wKinMut is connected with the SNP2L system that extracts mentions of mutations directly from the literature, and therefore increases the possibilities of finding interesting functional information associated to the studied mutations.
Conclusions: wKinMut facilitates the exploration of the information available about individual mutations by integrating prediction approaches with the automatic extraction of information from the literature (text mining) and several state-of-the-art databases. wKinMut has been used during the last year for the analysis of the consequences of mutations in the context of a number of cancer genome projects, including the recent analysis of Chronic Lymphocytic Leukemia cases and is publicly available at http://wkinmut.bioinfo.cnio.es.
Affiliation(s)
- Jose M G Izarzugaza
- Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), C/Melchor Fernandez Almagro, 3, E-28029 Madrid, Spain.
|
16
|
Riera C, Lois S, de la Cruz X. Prediction of pathological mutations in proteins: the challenge of integrating sequence conservation and structure stability principles. Wiley Interdisciplinary Reviews: Computational Molecular Science 2013. [DOI: 10.1002/wcms.1170] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 12/13/2022]
Affiliation(s)
- Casandra Riera
- Laboratory of Translational Bioinformatics in Neuroscience; VHIR; Barcelona Spain
- Sergio Lois
- Laboratory of Translational Bioinformatics in Neuroscience; VHIR; Barcelona Spain
- Xavier de la Cruz
- Laboratory of Translational Bioinformatics in Neuroscience; VHIR; Barcelona Spain
- Institució Catalana per la Recerca i Estudis Avançats (ICREA); Barcelona Spain
|
17
|
Johansen MB, Izarzugaza JMG, Brunak S, Petersen TN, Gupta R. Prediction of disease causing non-synonymous SNPs by the Artificial Neural Network Predictor NetDiseaseSNP. PLoS One 2013; 8:e68370. [PMID: 23935863 PMCID: PMC3723835 DOI: 10.1371/journal.pone.0068370] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 02/06/2013] [Accepted: 05/29/2013] [Indexed: 01/10/2023]
Abstract
We have developed a sequence conservation-based artificial neural network predictor called NetDiseaseSNP which classifies nsSNPs as disease-causing or neutral. Our method uses the excellent alignment generation algorithm of SIFT to identify related sequences and a combination of 31 features assessing sequence conservation and the predicted surface accessibility to produce a single score which can be used to rank nsSNPs based on their potential to cause disease. NetDiseaseSNP successfully classifies disease-causing and neutral mutations. In addition, we show that NetDiseaseSNP discriminates cancer driver and passenger mutations satisfactorily. Our method outperforms other state-of-the-art methods on several disease/neutral datasets as well as on cancer driver/passenger mutation datasets and can thus be used to pinpoint and prioritize plausible disease candidates among nsSNPs for further investigation. NetDiseaseSNP is publicly available as an online tool as well as a web service: http://www.cbs.dtu.dk/services/NetDiseaseSNP.
Affiliation(s)
- Morten Bo Johansen
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kongens Lyngby, Denmark
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Hørsholm, Denmark
- Jose M. G. Izarzugaza
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kongens Lyngby, Denmark
- Søren Brunak
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kongens Lyngby, Denmark
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
- Thomas Nordahl Petersen
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Hørsholm, Denmark
- Ramneek Gupta
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kongens Lyngby, Denmark
|
18
|
Abstract
Co-evolution is a fundamental component of the theory of evolution and is essential for understanding the relationships between species in complex ecological networks. A wide range of co-evolution-inspired computational methods has been designed to predict molecular interactions, but it is only recently that important advances have been made. Breakthroughs in the handling of phylogenetic information and in disentangling indirect relationships have resulted in an improved capacity to predict interactions between proteins and contacts between different protein residues. Here, we review the main co-evolution-based computational approaches, their theoretical basis, potential applications and foreseeable developments.
Affiliation(s)
- David de Juan
- Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
|
19
|
Izarzugaza JMG, Krallinger M, Valencia A. Interpretation of the consequences of mutations in protein kinases: combined use of bioinformatics and text mining. Front Physiol 2012; 3:323. [PMID: 23055974 PMCID: PMC3449330 DOI: 10.3389/fphys.2012.00323] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 05/23/2012] [Accepted: 07/23/2012] [Indexed: 11/30/2022]
Abstract
Protein kinases play a crucial role in a plethora of significant physiological functions and a number of mutations in this superfamily have been reported in the literature to disrupt protein structure and/or function. Computational and experimental research aims to discover the mechanistic connection between mutations in protein kinases and disease with the final aim of predicting the consequences of mutations on protein function and the subsequent phenotypic alterations. In this article, we will review the possibilities and limitations of current computational methods for the prediction of the pathogenicity of mutations in the protein kinase superfamily. In particular, we will focus on the problem of benchmarking the predictions with independent gold standard datasets. We will propose a pipeline for the curation of mutations automatically extracted from the literature. Since many of these mutations are not included in the databases that are commonly used to train the computational methods that predict the pathogenicity of protein kinase mutations, we propose using them to build a valuable gold standard dataset for benchmarking a number of these predictors. Finally, we will discuss how text mining approaches constitute a powerful tool for the interpretation of the consequences of mutations in the context of disease genome analysis, with particular focus on cancer.
Affiliation(s)
- Jose M G Izarzugaza
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre Madrid, Spain
|
20
|
Valencia A, Hidalgo M. Getting personalized cancer genome analysis into the clinic: the challenges in bioinformatics. Genome Med 2012; 4:61. [PMID: 22839973 PMCID: PMC3580417 DOI: 10.1186/gm362] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Indexed: 12/18/2022]
Abstract
Progress in genomics has raised expectations in many fields, and particularly in personalized cancer research. The new technologies available make it possible to combine information about potential disease markers, altered function and accessible drug targets, which, coupled with pathological and medical information, will help produce more appropriate clinical decisions. The accessibility of such experimental techniques makes it all the more necessary to improve and adapt computational strategies to the new challenges. This review focuses on the critical issues associated with the standard pipeline, which includes: DNA sequencing analysis; analysis of mutations in coding regions; the study of genome rearrangements; extrapolating information on mutations to the functional and signaling level; and predicting the effects of therapies using mouse tumor models. We describe the possibilities, limitations and future challenges of current bioinformatics strategies for each of these issues. Furthermore, we emphasize the need for the collaboration between the bioinformaticians who implement the software and use the data resources, the computational biologists who develop the analytical methods, and the clinicians, the systems' end users and those ultimately responsible for taking medical decisions. Finally, the different steps in cancer genome analysis are illustrated through examples of applications in cancer genome analysis.
Affiliation(s)
- Alfonso Valencia
- Spanish National Cancer Research Centre (CNIO), Calle Melchor Fernández Almagro, 3, E-28029 Madrid, Spain
- Manuel Hidalgo
- Spanish National Cancer Research Centre (CNIO), Calle Melchor Fernández Almagro, 3, E-28029 Madrid, Spain
|
21
|
Bromberg Y, Capriotti E. SNP-SIG Meeting 2011: identification and annotation of SNPs in the context of structure, function, and disease. BMC Genomics 2012; 13 Suppl 4:S1. [PMID: 22759647 PMCID: PMC3395891 DOI: 10.1186/1471-2164-13-s4-s1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/10/2022]
Affiliation(s)
- Yana Bromberg
- Department of Biochemistry and Microbiology, School of Environmental and Biological Sciences, Rutgers University, 76 Lipman Drive, New Brunswick, NJ 08901, USA.
|