101
|
PON-tstab: Protein Variant Stability Predictor. Importance of Training Data Quality. Int J Mol Sci 2018; 19:ijms19041009. [PMID: 29597263 PMCID: PMC5979465 DOI: 10.3390/ijms19041009] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Received: 03/05/2018] [Revised: 03/21/2018] [Accepted: 03/24/2018] [Indexed: 12/24/2022] Open
Abstract
Several methods have been developed to predict effects of amino acid substitutions on protein stability. Benchmark datasets are essential for method training and testing and have numerous requirements including that the data is representative for the investigated phenomenon. Available machine learning algorithms for variant stability have all been trained with ProTherm data. We noticed a number of issues with the contents, quality and relevance of the database. There were errors, but also features that had not been clearly communicated. Consequently, all machine learning variant stability predictors have been trained on biased and incorrect data. We obtained a corrected dataset and trained a random forests-based tool, PON-tstab, applicable to variants in any organism. Our results highlight the importance of the benchmark quality, suitability and appropriateness. Predictions are provided for three categories: stability decreasing, increasing and those not affecting stability.
|
102
|
Dang CC, Peón A, Ballester PJ. Unearthing new genomic markers of drug response by improved measurement of discriminative power. BMC Med Genomics 2018; 11:10. [PMID: 29409485 PMCID: PMC5801688 DOI: 10.1186/s12920-018-0336-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 04/14/2016] [Accepted: 01/29/2018] [Indexed: 12/29/2022] Open
Abstract
Background: Oncology drugs are only effective in a small proportion of cancer patients. Our current ability to identify these responsive patients before treatment is still poor in most cases. Thus, there is a pressing need to discover response markers for marketed and research oncology drugs. Screening these drugs against a large panel of cancer cell lines has led to the discovery of new genomic markers of in vitro drug response. However, while the identification of such markers among thousands of candidate drug-gene associations in the data is error-prone, an appraisal of the effectiveness of this detection task is currently lacking. Methods: Here we present a new non-parametric method for measuring the discriminative power of a drug-gene association. Unlike parametric statistical tests, the adopted non-parametric test has the advantage of not making strong assumptions about the data, which can distort the identification of genomic markers. Furthermore, we introduce a new benchmark to further validate these markers in vitro using more recent data not used to identify the markers. Results: The application of this new methodology has led to the identification of 128 new genomic markers distributed across 61% of the analysed drugs, including 5 drugs without previously known markers, which were missed by the MANOVA test initially applied to analyse data from the Genomics of Drug Sensitivity in Cancer consortium. Conclusions: Discovering markers using more than one statistical test and testing them on independent data is unusual. We found this helpful for discarding statistically significant drug-gene associations that were actually spurious correlations. This approach also revealed new, independently validated, in vitro markers of drug response such as Temsirolimus-CDKN2A (resistance) and Gemcitabine-EWS_FLI1 (sensitivity).
Electronic supplementary material The online version of this article (10.1186/s12920-018-0336-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Cuong C Dang
  - Cancer Research Center of Marseille, INSERM U1068, F-13009, Marseille, France
  - Institut Paoli-Calmettes, F-13009, Marseille, France
  - Aix-Marseille Université, F-13284, Marseille, France
  - CNRS UMR7258, F-13009, Marseille, France
- Antonio Peón
  - Cancer Research Center of Marseille, INSERM U1068, F-13009, Marseille, France
  - Institut Paoli-Calmettes, F-13009, Marseille, France
  - Aix-Marseille Université, F-13284, Marseille, France
  - CNRS UMR7258, F-13009, Marseille, France
- Pedro J Ballester
  - Cancer Research Center of Marseille, INSERM U1068, F-13009, Marseille, France
  - Institut Paoli-Calmettes, F-13009, Marseille, France
  - Aix-Marseille Université, F-13284, Marseille, France
  - CNRS UMR7258, F-13009, Marseille, France
|
103
|
Lee PH, Lee C, Li X, Wee B, Dwivedi T, Daly M. Principles and methods of in-silico prioritization of non-coding regulatory variants. Hum Genet 2018; 137:15-30. [PMID: 29288389 PMCID: PMC5892192 DOI: 10.1007/s00439-017-1861-0] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 10/24/2017] [Accepted: 12/14/2017] [Indexed: 12/13/2022]
Abstract
Over a decade of genome-wide association studies have made great strides toward the detection of genes and genetic mechanisms underlying complex traits. However, the majority of associated loci reside in non-coding regions that are, in general, functionally uncharacterized. Now, the availability of large-scale tissue- and cell type-specific transcriptome and epigenome data enables us to elucidate how non-coding genetic variants can affect gene expression and are associated with phenotypic changes. Here, we provide an overview of this emerging field in human genomics, summarizing available data resources and state-of-the-art analytic methods to facilitate in-silico prioritization of non-coding regulatory mutations. We also highlight the limitations of current approaches and discuss the direction of much-needed future research.
Affiliation(s)
- Phil H Lee
  - Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA
  - Quantitative Genomics Program, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Christian Lee
  - Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA
  - Department of Life Sciences, Harvard University, Cambridge, MA, USA
- Xihao Li
  - Quantitative Genomics Program, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Brian Wee
  - Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA
- Tushar Dwivedi
  - Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA
  - John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
- Mark Daly
  - Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
|
104
|
Čalyševa J, Vihinen M. PON-SC - program for identifying steric clashes caused by amino acid substitutions. BMC Bioinformatics 2017; 18:531. [PMID: 29187139 PMCID: PMC5707825 DOI: 10.1186/s12859-017-1947-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 06/14/2017] [Accepted: 11/21/2017] [Indexed: 11/10/2022] Open
Abstract
Background: Amino acid substitutions due to DNA nucleotide replacements are frequently disease-causing because they affect functionally important sites. If the substituting amino acid does not fit into the protein, it causes structural alterations that are often harmful. Clashes of amino acids cause local or global structural changes. Testing the structural compatibility of variations has been difficult due to the lack of a dedicated method that could handle the vast amounts of variation data produced by next generation sequencing technologies. Results: We developed a method, PON-SC, for detecting protein structural clashes due to amino acid substitutions. The method uses a side chain rotamer library and tests whether any of the common rotamers can be fitted into the protein structure. The tool was tested both with variants that cause clashes and with variants that do not, and was found to have an accuracy of 0.71 over five test datasets. Conclusions: We developed a fast method for residue side chain clash detection. In addition to the prediction, the method provides a visualization of the variant in the three-dimensional structure. Electronic supplementary material: The online version of this article (10.1186/s12859-017-1947-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jelena Čalyševa
  - Protein Structure and Bioinformatics, Department of Experimental Medical Science, Lund University, BMC B13, SE-22 184, Lund, Sweden
  - Present address: EMBL Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
- Mauno Vihinen
  - Protein Structure and Bioinformatics, Department of Experimental Medical Science, Lund University, BMC B13, SE-22 184, Lund, Sweden
|
105
|
Alam M, Thapa D, Lim JI, Cao D, Yao X. Computer-aided classification of sickle cell retinopathy using quantitative features in optical coherence tomography angiography. BIOMEDICAL OPTICS EXPRESS 2017; 8:4206-4216. [PMID: 28966859 PMCID: PMC5611935 DOI: 10.1364/boe.8.004206] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 06/27/2017] [Revised: 08/10/2017] [Accepted: 08/21/2017] [Indexed: 05/04/2023]
Abstract
Because OCT angiography (OCTA) is a new optical coherence tomography (OCT) imaging modality, there is no standardized quantitative interpretation of the OCTA characteristics of sickle cell retinopathy (SCR). This study demonstrates computer-aided SCR classification using six quantitative OCTA features: blood vessel tortuosity (BVT), blood vessel diameter (BVD), vessel perimeter index (VPI), foveal avascular zone (FAZ) area, FAZ contour irregularity, and parafoveal avascular density (PAD). Combined features showed improved classification performance compared with any single feature. Three classifiers, namely support vector machine (SVM), k-nearest neighbor (KNN) algorithm, and discriminant analysis, were evaluated. Sensitivity, specificity, and accuracy were quantified to assess the performance of each classifier. For SCR vs. control classification, all three classifiers performed well, with an average accuracy of 95% using the six quantitative OCTA features. For mild vs. severe stage retinopathy classification, SVM performed best (97% accuracy), compared with the KNN algorithm (95% accuracy) and discriminant analysis (88% accuracy).
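The three performance measures named in this abstract follow directly from confusion-matrix counts. The sketch below uses the standard definitions; the function name and the counts are hypothetical and chosen only for illustration, not taken from the study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp/fn: diseased cases classified correctly / incorrectly
    tn/fp: control cases classified correctly / incorrectly
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / (tp + fn)                 # true positive rate
    specificity = tn / (tn + fp)                 # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # overall fraction correct
    return sensitivity, specificity, accuracy


# Hypothetical counts for illustration only (not data from the paper).
sens, spec, acc = classification_metrics(tp=18, fn=2, tn=19, fp=1)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

Reporting all three together matters when classes are imbalanced, because accuracy alone can look high even if one class is classified poorly.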
Affiliation(s)
- Minhaj Alam
  - Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA
- Damber Thapa
  - Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA
- Jennifer I Lim
  - Department of Ophthalmology and Visual Sciences, University of Illinois at Chicago, Chicago, IL 60612, USA
- Dingcai Cao
  - Department of Ophthalmology and Visual Sciences, University of Illinois at Chicago, Chicago, IL 60612, USA
- Xincheng Yao
  - Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA
  - Department of Ophthalmology and Visual Sciences, University of Illinois at Chicago, Chicago, IL 60612, USA
|
106
|
Capriotti E, Martelli PL, Fariselli P, Casadio R. Blind prediction of deleterious amino acid variations with SNPs&GO. Hum Mutat 2017; 38:1064-1071. [PMID: 28102005 PMCID: PMC5522651 DOI: 10.1002/humu.23179] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 09/24/2016] [Revised: 11/08/2016] [Accepted: 01/10/2017] [Indexed: 01/09/2023]
Abstract
SNPs&GO is a machine learning method for predicting the association of single amino acid variations (SAVs) with disease, considering protein functional annotation. The method is a binary classifier that implements a support vector machine algorithm to discriminate between disease-related and neutral SAVs. SNPs&GO combines information from protein sequence with functional annotation encoded by gene ontology (GO) terms. Tested in sequence mode on more than 38,000 SAVs from the SwissVar dataset, our method reached 81% overall accuracy and an area under the receiver operating characteristic curve of 0.88 with a low false-positive rate. In almost all the editions of the Critical Assessment of Genome Interpretation (CAGI) experiments, SNPs&GO ranked among the most accurate algorithms for predicting the effect of SAVs. In this paper, we summarize the best results obtained by SNPs&GO on disease-related variations of four CAGI challenges relative to the following genes: CHEK2 (CAGI 2010), RAD50 (CAGI 2011), p16-INK (CAGI 2013), and NAGLU (CAGI 2016). Result evaluation provides insights about the accuracy of our algorithm and the relevance of GO terms in annotating the effect of the variants. It also helps to define good practices for the detection of deleterious SAVs.
Affiliation(s)
- Emidio Capriotti
  - Biocomputing Group, BiGeA/Giorgio Prodi Interdepartmental Center for Cancer Research, University of Bologna, Via F. Selmi 3, Bologna, 40126, Italy
- Pier Luigi Martelli
  - Biocomputing Group, BiGeA/Giorgio Prodi Interdepartmental Center for Cancer Research, University of Bologna, Via F. Selmi 3, Bologna, 40126, Italy
- Piero Fariselli
  - Department of Comparative Biomedicine and Food Science, University of Padova, Viale dell’Università, 16, 35020 Legnaro (PD), Italy
- Rita Casadio
  - Biocomputing Group, BiGeA/Giorgio Prodi Interdepartmental Center for Cancer Research, University of Bologna, Via F. Selmi 3, Bologna, 40126, Italy
|
107
|
Cysewski P, Przybyłek M. Selection of effective cocrystals former for dissolution rate improvement of active pharmaceutical ingredients based on lipoaffinity index. Eur J Pharm Sci 2017; 107:87-96. [PMID: 28687528 DOI: 10.1016/j.ejps.2017.07.004] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 04/11/2017] [Revised: 06/06/2017] [Accepted: 07/03/2017] [Indexed: 10/19/2022]
Abstract
A new theoretical screening procedure is proposed for selecting potential cocrystal formers able to enhance the dissolution rates of drugs. The procedure relies on a training set comprising 102 positive and 17 negative cases of cocrystals found in the literature. Although the only available data were qualitative, statistical analysis using binary classification allowed quantitative criteria to be formulated. Among the 3679 molecular descriptors considered, the relative value of the lipoaffinity index, expressed as the difference between the values calculated for the active compound and the excipient, was found to be the most appropriate measure for discriminating positive from negative cases. Assuming 5% precision, the applied classification criterion led to inclusion of 70% of positive cases in the final prediction. Since the lipoaffinity index is a molecular descriptor computed using only 2D information about a chemical structure, its estimation is straightforward and computationally inexpensive. Including an additional criterion that quantifies the cocrystallization probability leads to the conjunction of criteria Hmix < -0.18 and ΔLA > 3.61, which allows dissolution rate enhancers to be identified. The screening procedure was applied to find the most promising coformers of drugs such as Iloperidone, Ritonavir, Carbamazepine and Ethenzamide.
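The conjunction of the two thresholds quoted in this abstract can be applied as a simple filter. The sketch below assumes that Hmix and ΔLA have already been computed for each drug-coformer pair in the same units the paper uses; the function name, coformer labels, and descriptor values are hypothetical placeholders.

```python
# Thresholds as quoted in the abstract; units follow the source paper.
H_MIX_MAX = -0.18      # cocrystallization criterion: Hmix must be below this
DELTA_LA_MIN = 3.61    # lipoaffinity criterion: delta LA must exceed this


def is_candidate_enhancer(h_mix, delta_la):
    """True when both criteria hold (the conjunction Hmix < -0.18 and dLA > 3.61)."""
    return h_mix < H_MIX_MAX and delta_la > DELTA_LA_MIN


# Hypothetical (Hmix, delta LA) pairs, for illustration only.
coformers = {
    "coformer_A": (-0.25, 4.10),   # passes both criteria
    "coformer_B": (-0.10, 5.00),   # fails the Hmix criterion
    "coformer_C": (-0.30, 2.00),   # fails the delta LA criterion
}

selected = [name for name, (h, d) in coformers.items()
            if is_candidate_enhancer(h, d)]
print(selected)
```

Both inequalities are strict, so a pair sitting exactly on a threshold is rejected in this sketch.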
Affiliation(s)
- Piotr Cysewski
  - Chair and Department of Physical Chemistry, Pharmacy Faculty, Collegium Medicum of Bydgoszcz, Nicolaus Copernicus University in Toruń, Kurpińskiego 5, 85-950 Bydgoszcz, Poland
- Maciej Przybyłek
  - Chair and Department of Physical Chemistry, Pharmacy Faculty, Collegium Medicum of Bydgoszcz, Nicolaus Copernicus University in Toruń, Kurpińskiego 5, 85-950 Bydgoszcz, Poland
|
108
|
Naslavsky MS, Yamamoto GL, Almeida TF, Ezquina SAM, Sunaga DY, Pho N, Bozoklian D, Sandberg TOM, Brito LA, Lazar M, Bernardo DV, Amaro E, Duarte YAO, Lebrão ML, Passos‐Bueno MR, Zatz M. Exomic variants of an elderly cohort of Brazilians in the ABraOM database. Hum Mutat 2017; 38:751-763. [DOI: 10.1002/humu.23220] [Citation(s) in RCA: 149] [Impact Index Per Article: 21.3] [Received: 12/16/2016] [Revised: 03/14/2017] [Accepted: 03/19/2017] [Indexed: 01/03/2023]
Affiliation(s)
- Michel Satya Naslavsky
  - Human Genome and Stem Cell Research Center Biosciences Institute, University of São Paulo São Paulo Brazil
  - Hospital Israelita Albert Einstein São Paulo Brazil
- Guilherme Lopes Yamamoto
  - Human Genome and Stem Cell Research Center Biosciences Institute, University of São Paulo São Paulo Brazil
  - Department of Clinical Genetics Children's Hospital Medical School University of São Paulo São Paulo Brazil
- Tatiana Ferreira Almeida
  - Human Genome and Stem Cell Research Center Biosciences Institute, University of São Paulo São Paulo Brazil
- Suzana A. M. Ezquina
  - Human Genome and Stem Cell Research Center Biosciences Institute, University of São Paulo São Paulo Brazil
- Daniele Yumi Sunaga
  - Human Genome and Stem Cell Research Center Biosciences Institute, University of São Paulo São Paulo Brazil
- Nam Pho
  - Department of Biomedical Informatics Harvard Medical School Boston Massachusetts
- Daniel Bozoklian
  - Human Genome and Stem Cell Research Center Biosciences Institute, University of São Paulo São Paulo Brazil
- Luciano Abreu Brito
  - Human Genome and Stem Cell Research Center Biosciences Institute, University of São Paulo São Paulo Brazil
- Monize Lazar
  - Human Genome and Stem Cell Research Center Biosciences Institute, University of São Paulo São Paulo Brazil
- Danilo Vicensotto Bernardo
  - Laboratório de Estudos em Antropologia Biológica Bioarqueologia e Evolução Humana, Instituto de Ciências Humanas e da Informação, Universidade Federal do Rio Grande Rio Grande Rio Grande de Sul Brazil
- Edson Amaro
  - Hospital Israelita Albert Einstein São Paulo Brazil
  - Radiology Institute Medical School, University of São Paulo São Paulo Brazil
- Yeda A. O. Duarte
  - Department of Epidemiology Public Health School University of São Paulo São Paulo Brazil
  - School of Nursing University of São Paulo São Paulo Brazil
- Maria Lúcia Lebrão
  - Department of Epidemiology Public Health School University of São Paulo São Paulo Brazil
- Maria Rita Passos‐Bueno
  - Human Genome and Stem Cell Research Center Biosciences Institute, University of São Paulo São Paulo Brazil
- Mayana Zatz
  - Human Genome and Stem Cell Research Center Biosciences Institute, University of São Paulo São Paulo Brazil
|
109
|
Niroula A, Vihinen M. PON-P and PON-P2 predictor performance in CAGI challenges: Lessons learned. Hum Mutat 2017; 38:1085-1091. [PMID: 28224672 DOI: 10.1002/humu.23199] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 08/24/2016] [Revised: 01/25/2017] [Accepted: 02/17/2017] [Indexed: 01/14/2023]
Abstract
Computational tools are widely used for ranking and prioritizing variants for characterizing their disease relevance. Since numerous tools have been developed, they have to be properly assessed before being applied. Critical Assessment of Genome Interpretation (CAGI) experiments have significantly contributed toward the assessment of prediction methods for various tasks. Within and outside the CAGI, we have addressed several questions that facilitate development and assessment of variation interpretation tools. These areas include collection and distribution of benchmark datasets, their use for systematic large-scale method assessment, and the development of guidelines for reporting methods and their performance. For us, CAGI has provided a chance to experiment with new ideas, test the application areas of our methods, and network with other prediction method developers. In this article, we discuss our experiences and lessons learned from the various CAGI challenges. We describe our approaches, their performance, and impact of CAGI on our research. Finally, we discuss some of the possibilities that CAGI experiments have opened up and make some suggestions for future experiments.
Affiliation(s)
- Abhishek Niroula
  - Protein Structure and Bioinformatics Group, Department of Experimental Medical Science, Lund University, Lund, Sweden
- Mauno Vihinen
  - Protein Structure and Bioinformatics Group, Department of Experimental Medical Science, Lund University, Lund, Sweden
|
110
|
Kotsampasakou E, Ecker GF. Predicting Drug-Induced Cholestasis with the Help of Hepatic Transporters-An in Silico Modeling Approach. J Chem Inf Model 2017; 57:608-615. [PMID: 28166633 PMCID: PMC5411109 DOI: 10.1021/acs.jcim.6b00518] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Indexed: 12/25/2022]
Abstract
Cholestasis is one of three types of drug-induced liver injury (DILI), which constitutes a major challenge in drug development. In this study we applied a two-class classification scheme based on k-nearest neighbors to predict cholestasis, using a set of 93 two-dimensional (2D) physicochemical descriptors and predictions of selected hepatic transporters' inhibition (BSEP, BCRP, P-gp, OATP1B1, and OATP1B3). To assess the potential contribution of transporter inhibition, we tested whether including the transporters' inhibition predictions significantly increases model performance compared with using the 93 2D physicochemical descriptors alone. Our findings were in agreement with the literature, indicating a contribution not only from BSEP inhibition but rather a synergistic effect deriving from the whole set of transporters. The final optimal model was validated via both 10-fold cross-validation and external validation. It performs quite satisfactorily, yielding 0.686 ± 0.013 for accuracy and 0.722 ± 0.014 for area under the receiver operating characteristic curve (AUC) in 10-fold cross-validation (mean ± standard deviation from 50 iterations).
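As a rough illustration of the two ingredients named in this abstract, a two-class k-nearest-neighbors classifier and 10-fold cross-validation, the sketch below implements both with the Python standard library. It is not the authors' model: the two-feature toy data, the choice k=3, and the deterministic fold assignment are all assumptions made for the example, whereas the study used 93 descriptors plus transporter predictions and averaged over 50 iterations.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_val_accuracy(X, y, n_folds=10, k=3):
    """Fraction of samples predicted correctly under n-fold cross-validation.

    Fold f holds out samples f, f + n_folds, f + 2*n_folds, ... so that
    every sample is used for testing exactly once.
    """
    correct = 0
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))
        train_X = [x for i, x in enumerate(X) if i not in test_idx]
        train_y = [t for i, t in enumerate(y) if i not in test_idx]
        for i in test_idx:
            correct += knn_predict(train_X, train_y, X[i], k) == y[i]
    return correct / len(X)


# Toy data (hypothetical): two well-separated clusters of 10 points each.
X = [(0.1 * i, 0.1 * i) for i in range(10)] + \
    [(5 + 0.1 * i, 5 + 0.1 * i) for i in range(10)]
y = [0] * 10 + [1] * 10

accuracy = cross_val_accuracy(X, y)
print(f"10-fold CV accuracy: {accuracy:.3f}")
```

To mimic the reported mean ± standard deviation, one would shuffle the samples before each of several repetitions and summarize the resulting accuracies, for example with `statistics.mean` and `statistics.stdev`.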
Affiliation(s)
- Eleni Kotsampasakou
  - University of Vienna, Department of Pharmaceutical Chemistry, Althanstrasse 14, 1090 Vienna, Austria
- Gerhard F Ecker
  - University of Vienna, Department of Pharmaceutical Chemistry, Althanstrasse 14, 1090 Vienna, Austria
|
111
|
Niroula A, Vihinen M. Predicting Severity of Disease-Causing Variants. Hum Mutat 2017; 38:357-364. [PMID: 28070986 DOI: 10.1002/humu.23173] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 10/04/2016] [Revised: 12/07/2016] [Accepted: 01/06/2017] [Indexed: 12/22/2022]
Abstract
Most diseases, including those of genetic origin, express a continuum of severity. Clinical interventions for numerous diseases are based on the severity of the phenotype. Predicting severity due to genetic variants could facilitate diagnosis and choice of therapy. Although computational predictions have been used as evidence for classifying the disease relevance of genetic variants, dedicated tools for predicting disease severity on a large scale are missing. Here, we manually curated a dataset containing variants leading to severe and less severe phenotypes and studied the abilities of variation impact predictors to distinguish between them. We found that these tools cannot separate the two groups of variants. Then, we developed a novel machine-learning-based method, PON-PS (http://structure.bmc.lu.se/PON-PS), for the classification of amino acid substitutions associated with benign, severe, and less severe phenotypes. We tested the method using an independent test dataset and variants in four additional proteins. For distinguishing severe and nonsevere variants, PON-PS showed an accuracy of 61% in the test dataset, which is higher than for existing tolerance prediction methods. PON-PS is the first generic tool developed for this task. The tool can be used together with other evidence for improving diagnosis and prognosis and for prioritization of preventive interventions, clinical monitoring, and molecular tests.
Affiliation(s)
- Abhishek Niroula
  - Department of Experimental Medical Science, Lund University, Lund, SE-22184, Sweden
- Mauno Vihinen
  - Department of Experimental Medical Science, Lund University, Lund, SE-22184, Sweden
|
112
|
Bai F, Morcos F, Cheng RR, Jiang H, Onuchic JN. Elucidating the druggable interface of protein-protein interactions using fragment docking and coevolutionary analysis. Proc Natl Acad Sci U S A 2016; 113:E8051-E8058. [PMID: 27911825 PMCID: PMC5167203 DOI: 10.1073/pnas.1615932113] [Citation(s) in RCA: 57] [Impact Index Per Article: 7.1] [Indexed: 12/28/2022] Open
Abstract
Protein-protein interactions play a central role in cellular function. Improving the understanding of complex formation has many practical applications, including the rational design of new therapeutic agents and the elucidation of the mechanisms governing signal transduction networks. The generally large, flat, and relatively featureless binding sites of protein complexes pose many challenges for drug design. Fragment docking and direct coupling analysis are used in an integrated computational method to estimate druggable protein-protein interfaces. (i) This method explores the binding of fragment-sized molecular probes on the protein surface using a molecular docking-based screen. (ii) The energetically favorable binding sites of the probes, called hot spots, are spatially clustered to map out candidate binding sites on the protein surface. (iii) A coevolution-based interface interaction score is used to discriminate between different candidate binding sites, yielding potential interfacial targets for therapeutic drug design. This approach is validated for important, well-studied disease-related proteins with known pharmaceutical targets, and also identifies targets that have yet to be studied. Moreover, therapeutic agents are proposed by chemically connecting the fragments that are strongly bound to the hot spots.
Affiliation(s)
- Fang Bai
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77005
- Faruck Morcos
  - Department of Biological Sciences, University of Texas at Dallas, Dallas, TX 75080
  - Department of Bioengineering, University of Texas at Dallas, Dallas, TX 75080
  - Center for Systems Biology, University of Texas at Dallas, Dallas, TX 75080
- Ryan R Cheng
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77005
- Hualiang Jiang
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- José N Onuchic
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77005
  - Department of Physics and Astronomy, Rice University, Houston, TX 77005
  - Department of Chemistry, Rice University, Houston, TX 77005
  - Department of Biosciences, Rice University, Houston, TX 77005
|
113
|
Ehsani R, Bahrami S, Drabløs F. Feature-based classification of human transcription factors into hypothetical sub-classes related to regulatory function. BMC Bioinformatics 2016; 17:459. [PMID: 27842491 PMCID: PMC5109715 DOI: 10.1186/s12859-016-1349-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 03/19/2016] [Accepted: 11/10/2016] [Indexed: 12/15/2022] Open
Abstract
Background: Transcription factors are key proteins in the regulation of gene transcription. An important step in this process is the opening of chromatin in order to make genomic regions available for transcription. Data on DNase I hypersensitivity have previously been used to label a subset of transcription factors as Pioneers, Settlers and Migrants to describe their potential role in this process. These labels represent an interesting hypothesis on gene regulation and possibly a useful approach for data analysis, and therefore we wanted to expand the set of labeled transcription factors to include as many known factors as possible. We have used a well-annotated dataset of 1175 transcription factors as input to supervised machine learning methods, using the subset with previously assigned labels as the training set. We then used the final classifier to label the additional transcription factors according to their potential role as Pioneers, Settlers and Migrants. The full set of labeled transcription factors was used to investigate associated properties and functions of each class, including an analysis of interaction data for transcription factors based on DNA co-binding and protein-protein interactions. We also used the assigned labels to analyze a previously published set of gene lists associated with a time course experiment on cell differentiation. Results: The analysis showed that the classification of transcription factors with respect to their potential role in chromatin opening was largely determined by how they bind to DNA. Each subclass of transcription factors was enriched for properties that seemed to characterize the subclass relative to its role in gene regulation, with very general functions for Pioneers, whereas Migrants to a larger extent were associated with specific processes. Further analysis showed that the expanded classification is a useful resource for analyzing other datasets on transcription factors with respect to their potential role in gene regulation. The analysis of transcription factor interaction data showed complementary differences between the subclasses, where transcription factors labeled as Pioneers often interact with other transcription factors through DNA co-binding, whereas Migrants to a larger extent use protein-protein interactions. The analysis of time course data on cell differentiation indicated a shift in the regulatory program associated with Pioneer-like transcription factors during differentiation. Conclusions: The expanded classification is an interesting resource for analyzing data on gene regulation, as illustrated here on transcription factor interaction data and data from a time course experiment. The potential regulatory function of transcription factors seems largely to be determined by how they bind DNA, but is also influenced by how they interact with each other through cooperativity and protein-protein interactions. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1349-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Rezvan Ehsani
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, PO Box 8905, NO-7491, Trondheim, Norway; Department of Mathematics, University of Zabol, Zabol, Iran
- Shahram Bahrami
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, PO Box 8905, NO-7491, Trondheim, Norway; St. Olavs Hospital, Trondheim University Hospital, NO-7006, Trondheim, Norway
- Finn Drabløs
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, PO Box 8905, NO-7491, Trondheim, Norway
|
114
|
Fiannaca A, Rosa ML, Paglia LL, Rizzo R, Urso A. MiRNATIP: a SOM-based miRNA-target interactions predictor. BMC Bioinformatics 2016; 17:321. [PMID: 28185545 PMCID: PMC5046196 DOI: 10.1186/s12859-016-1171-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/17/2022] Open
Abstract
Background MicroRNAs (miRNAs) are small non-coding RNA sequences with regulatory functions at the post-transcriptional level in several biological processes, such as cell disease progression and metastasis. MiRNAs interact with target messenger RNAs (mRNAs) by base pairing. Experimental identification of miRNA targets is one of the major challenges in cancer biology because miRNAs can act as tumour suppressors or oncogenes by targeting different types of targets. The use of machine learning methods for the prediction of target genes is considered a valid support to investigate miRNA functions and to guide related wet-lab experiments. In this paper we propose the miRNA Target Interaction Predictor (miRNATIP) algorithm, a Self-Organizing Map (SOM) based method for miRNA target prediction. The SOM is trained with the seed region of the miRNA sequences, and then the mRNA sequences are projected onto the SOM lattice in order to find putative interactions with miRNAs. These interactions are then filtered by considering the remaining part of the miRNA sequences and estimating the free energy necessary for duplex stability. Results We tested the proposed method by predicting the miRNA target interactions of both Homo sapiens and Caenorhabditis elegans; then, taking into account validated target (positive) and non-target (negative) interactions, we compared our results with other target predictors, namely miRanda, PITA, PicTar, mirSOM, TargetScan and DIANA-microT, in terms of the most commonly used statistical measures. We demonstrate that our method produces the greatest number of predictions among the compared methods, with good results for both species, reaching, for example, the highest sensitivity: 31 % for Homo sapiens and 30.5 % for C. elegans. All the predicted interactions are freely available at the following URL: http://tblab.pa.icar.cnr.it/public/miRNATIP/.
Conclusions The results indicate that miRNATIP outperforms or is comparable to the other six state-of-the-art methods in terms of validated target and non-target interactions, respectively.
Affiliation(s)
- Antonino Fiannaca
- National Research Council of Italy, ICAR-CNR, via Ugo La Malfa 153, Palermo, 90146, Italy
- Massimo La Rosa
- National Research Council of Italy, ICAR-CNR, via Ugo La Malfa 153, Palermo, 90146, Italy
- Laura La Paglia
- National Research Council of Italy, ICAR-CNR, via Ugo La Malfa 153, Palermo, 90146, Italy
- Riccardo Rizzo
- National Research Council of Italy, ICAR-CNR, via Ugo La Malfa 153, Palermo, 90146, Italy
- Alfonso Urso
- National Research Council of Italy, ICAR-CNR, via Ugo La Malfa 153, Palermo, 90146, Italy
|
115
|
Pons T, Vazquez M, Matey-Hernandez ML, Brunak S, Valencia A, Izarzugaza JM. KinMutRF: a random forest classifier of sequence variants in the human protein kinase superfamily. BMC Genomics 2016; 17 Suppl 2:396. [PMID: 27357839 PMCID: PMC4928150 DOI: 10.1186/s12864-016-2723-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 01/28/2023] Open
Abstract
Background The association between aberrant signal processing by protein kinases and human diseases such as cancer was established a long time ago. However, understanding the link between sequence variants in the protein kinase superfamily and the mechanistic complex traits at the molecular level remains challenging: cells tolerate most genomic alterations, and only a minor fraction disrupt molecular function sufficiently to drive disease. Results KinMutRF is a novel random-forest method to automatically identify pathogenic variants in human kinases. Twenty-six decision trees implemented as a random forest weigh a battery of features that characterize the variants: a) at the gene level, including membership in a Kinbase group and Gene Ontology terms; b) at the PFAM domain level; and c) at the residue level, the types of amino acids involved, changes in biochemical properties, and functional annotations from UniProt, Phospho.ELM and FireDB. KinMutRF identifies disease-associated variants satisfactorily (Acc: 0.88, Prec: 0.82, Rec: 0.75, F-score: 0.78, MCC: 0.68) when trained and cross-validated with the 3689 human kinase variants from UniProt that have been annotated as neutral or pathogenic. All unclassified variants were excluded from the training set. Furthermore, KinMutRF is discussed with respect to two independent kinase-specific sets of mutations not included in training or testing, Kin-Driver (643 variants) and Pon-BTK (1495 variants). Moreover, we provide predictions for the 848 protein kinase variants in UniProt that remained unclassified. A public implementation of KinMutRF, including documentation and examples, is available online (http://kinmut2.bioinfo.cnio.es). The source code for local installation is released under a GPL version 3 license and can be downloaded from https://github.com/Rbbt-Workflows/KinMut2. Conclusions KinMutRF is capable of classifying kinase variation with good performance.
Predictions by KinMutRF compare favorably in a benchmark with other state-of-the-art methods (i.e. SIFT, PolyPhen-2, MutationAssessor, MutationTaster, LRT, CADD, FATHMM, and VEST). Kinase-specific features rank as the most informative in terms of information gain and likely account for the improvement in prediction performance. This advocates for the development of family-specific classifiers able to exploit the discriminatory power of features unique to individual protein families. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2723-1) contains supplementary material, which is available to authorized users.
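The F-score quoted in this abstract can be checked against the reported precision and recall. The sketch below assumes the abstract's "F-score" is the balanced F1 (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` is illustrative and not taken from the KinMutRF code.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: weighted harmonic mean of precision and recall.

    beta = 1 gives the balanced F1 score (assumed here).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was predicted correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract above.
reported_precision = 0.82
reported_recall = 0.75

f1 = f_score(reported_precision, reported_recall)
print(round(f1, 2))  # 0.78, consistent with the reported F-score
```

Under the F1 assumption, the reported precision, recall and F-score are mutually consistent to two decimal places.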
Affiliation(s)
- Tirso Pons
- Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029, Madrid, Spain
- Miguel Vazquez
- Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029, Madrid, Spain
- María Luisa Matey-Hernandez
- Center for Biological Sequence Analysis (CBS), Systems Biology Department, Technical University of Denmark (DTU), Kemitorvet, Building 208, 2800 Kgs., Lyngby, Denmark
- Søren Brunak
- Center for Biological Sequence Analysis (CBS), Systems Biology Department, Technical University of Denmark (DTU), Kemitorvet, Building 208, 2800 Kgs., Lyngby, Denmark; Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Blegdamsvej 3A, 2200, Copenhagen, Denmark
- Alfonso Valencia
- Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029, Madrid, Spain
- Jose MG Izarzugaza
- Center for Biological Sequence Analysis (CBS), Systems Biology Department, Technical University of Denmark (DTU), Kemitorvet, Building 208, 2800 Kgs., Lyngby, Denmark
|
116
|
Niroula A, Vihinen M. Variation Interpretation Predictors: Principles, Types, Performance, and Choice. Hum Mutat 2016; 37:579-97. [DOI: 10.1002/humu.22987] [Citation(s) in RCA: 90] [Impact Index Per Article: 11.3] [Received: 11/16/2015] [Accepted: 03/07/2016] [Indexed: 12/18/2022]
Affiliation(s)
- Abhishek Niroula
- Department of Experimental Medical Science; Lund University; BMC B13 Lund SE-22184 Sweden
- Mauno Vihinen
- Department of Experimental Medical Science; Lund University; BMC B13 Lund SE-22184 Sweden
|
117
|
Chang CCH, Li C, Webb GI, Tey B, Song J, Ramanan RN. Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli. Sci Rep 2016; 6:21844. [PMID: 26931649 PMCID: PMC4773868 DOI: 10.1038/srep21844] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 11/02/2015] [Accepted: 01/28/2016] [Indexed: 12/20/2022] Open
Abstract
Periplasmic expression of soluble proteins in Escherichia coli not only offers a much-simplified downstream purification process, but also enhances the probability of obtaining correctly folded and biologically active proteins. Different combinations of signal peptides and target proteins lead to different soluble protein expression levels, ranging from negligible to several grams per litre. Accurate algorithms for rational selection of promising candidates can serve as a powerful tool to complement current trial-and-error approaches. Accordingly, proteomics studies can be conducted with greater efficiency and cost-effectiveness. Here, we developed a predictor with a two-stage architecture to predict the real-valued expression level of a target protein in the periplasm. The output of the first-stage support vector machine (SVM) classifier determines which second-stage support vector regression (SVR) model is used. When tested on an independent test dataset, the predictor achieved an overall prediction accuracy of 78% and a Pearson's correlation coefficient (PCC) of 0.77. We further illustrate the relative importance of various features with respect to different models. The results indicate that the occurrence of the dipeptide glutamine and aspartic acid is the most important feature for the classification model. Finally, we provide access to the implemented predictor through the Periscope webserver, freely accessible at http://lightning.med.monash.edu/periscope/.
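The two-stage idea described in this abstract, where a classifier's output selects which regression model produces the final value, can be sketched generically as follows. This is not the Periscope implementation: the class labels, the single toy feature, and the stand-in models are all hypothetical, and plain Python callables take the place of trained SVM and SVR models.

```python
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[float]


def two_stage_predict(
    x: Features,
    classifier: Callable[[Features], str],
    regressors: Dict[str, Callable[[Features], float]],
) -> Tuple[str, float]:
    """Stage 1 assigns a class; stage 2 applies that class's regressor."""
    label = classifier(x)
    return label, regressors[label](x)


# Hypothetical stand-ins for trained models (illustration only).
def toy_classifier(x: Features) -> str:
    return "high" if x[0] > 0.5 else "low"


toy_regressors = {
    "low": lambda x: 0.1 * x[0],         # regressor fitted on low-expression cases
    "high": lambda x: 1.0 + 2.0 * x[0],  # regressor fitted on high-expression cases
}

print(two_stage_predict([0.75], toy_classifier, toy_regressors))  # ('high', 2.5)
```

One motivation for this kind of design is that each regressor only has to model the range of values within its own class.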
Affiliation(s)
- Catherine Ching Han Chang
- Chemical Engineering Discipline, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
- Department of Biochemistry and Molecular Biology, Monash University, Melbourne VIC 3800, Australia
- Chen Li
- Department of Biochemistry and Molecular Biology, Monash University, Melbourne VIC 3800, Australia
- Geoffrey I. Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne VIC 3800, Australia
- BengTi Tey
- Chemical Engineering Discipline, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
- Advanced Engineering Platform, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
- Jiangning Song
- Department of Biochemistry and Molecular Biology, Monash University, Melbourne VIC 3800, Australia
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne VIC 3800, Australia
- National Engineering Laboratory for Industrial Enzymes, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
- Ramakrishnan Nagasundara Ramanan
- Chemical Engineering Discipline, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
- Advanced Engineering Platform, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
- School of Chemistry, Monash University, Melbourne VIC 3800, Australia
|
118
|
Yang Y, Niroula A, Shen B, Vihinen M. PON-Sol: prediction of effects of amino acid substitutions on protein solubility. Bioinformatics 2016; 32:2032-4. [PMID: 27153720 DOI: 10.1093/bioinformatics/btw066] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 08/25/2015] [Accepted: 01/30/2016] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Solubility is one of the fundamental protein properties. It is of great interest because of its relevance to protein expression. Reduced solubility and protein aggregation are also associated with many diseases. RESULTS We collected from the literature the largest experimentally verified dataset of solubility-affecting amino acid substitutions (AASs) and used it to train a predictor called PON-Sol. The predictor can distinguish both solubility-decreasing and solubility-increasing variants from those not affecting solubility. PON-Sol has a normalized correct prediction ratio of 0.491 on cross-validation and 0.432 on an independent test set. The performance of the method was compared to both solubility and aggregation predictors and found to be superior. PON-Sol can be used for the prediction of effects of disease-related substitutions, effects on heterologous recombinant protein expression and enhanced crystallizability. One application is to investigate effects of all possible AASs in a protein to aid protein engineering. AVAILABILITY AND IMPLEMENTATION PON-Sol is freely available at http://structure.bmc.lu.se/PON-Sol. The training and test data are available at http://structure.bmc.lu.se/VariBench/ponsol.php. CONTACT mauno.vihinen@med.lu.se SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yang Yang
- Center for Systems Biology, School of Computer Science and Technology, Soochow University, Suzhou 215006, China and Department of Experimental Medical Science, Lund University, Lund SE 221 84, Sweden
- Abhishek Niroula
- Department of Experimental Medical Science, Lund University, Lund SE 221 84, Sweden
- Mauno Vihinen
- Department of Experimental Medical Science, Lund University, Lund SE 221 84, Sweden
|
119
|
Niroula A, Vihinen M. PON-mt-tRNA: a multifactorial probability-based method for classification of mitochondrial tRNA variations. Nucleic Acids Res 2016; 44:2020-7. [PMID: 26843426 PMCID: PMC4797295 DOI: 10.1093/nar/gkw046] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 09/18/2015] [Accepted: 01/14/2016] [Indexed: 12/19/2022] Open
Abstract
Transfer RNAs (tRNAs) are essential for translating the transcribed genetic information from DNA into proteins. Variations in the human tRNAs are involved in diverse clinical phenotypes. Interestingly, all pathogenic variations in tRNAs are located in mitochondrial tRNAs (mt-tRNAs). Therefore, it is crucial to identify pathogenic variations in mt-tRNAs for disease diagnosis and proper treatment. We collected mt-tRNA variations using a classification based on evidence from several sources and used the data to develop a multifactorial probability-based prediction method, PON-mt-tRNA, for classification of mt-tRNA single nucleotide substitutions. We integrated a machine learning-based predictor and an evidence-based likelihood ratio for pathogenicity, using evidence of segregation, biochemistry and histochemistry, to predict the posterior probability of pathogenicity of variants. The accuracy and Matthews correlation coefficient (MCC) of PON-mt-tRNA are 1.00 and 0.99, respectively. In the absence of evidence from segregation, biochemistry and histochemistry, PON-mt-tRNA classifies variations based on the machine learning method with an accuracy and MCC of 0.69 and 0.39, respectively. We classified all possible single nucleotide substitutions in all human mt-tRNAs using PON-mt-tRNA. Variations in the loops are more often tolerated than variations in the stems. The anticodon loop contains comparatively more predicted pathogenic variations than the other loops. PON-mt-tRNA is available at http://structure.bmc.lu.se/PON-mt-tRNA/.
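Combining a predictor's probability with an evidence-based likelihood ratio to obtain a posterior probability is commonly done through Bayes' rule in odds form: posterior odds = prior odds × likelihood ratio. The sketch below shows that generic calculation only. Treating the machine learning output as the prior is an assumption made for illustration, the abstract does not spell out the exact formula used by PON-mt-tRNA, and the numbers are hypothetical.

```python
def posterior_probability(prior: float, likelihood_ratio: float) -> float:
    """Update a prior probability with a likelihood ratio (Bayes' rule, odds form)."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood ratio must be non-negative")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical example: a prior of 0.5 and evidence that is 9 times more
# likely for a pathogenic variant than for a neutral one.
print(round(posterior_probability(0.5, 9.0), 3))  # 0.9
```

A likelihood ratio of 1 leaves the prior unchanged, which corresponds to the case where no additional evidence is available and the classification rests on the predictor alone.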
Affiliation(s)
- Abhishek Niroula
- Department of Experimental Medical Science, Lund University, BMC B13, SE-22184 Lund, Sweden
- Mauno Vihinen
- Department of Experimental Medical Science, Lund University, BMC B13, SE-22184 Lund, Sweden
|
120
|
Peng Y, Alexov E. Investigating the linkage between disease-causing amino acid variants and their effect on protein stability and binding. Proteins 2016; 84:232-9. [PMID: 26650512 DOI: 10.1002/prot.24968] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Received: 10/28/2015] [Accepted: 11/30/2015] [Indexed: 12/12/2022]
Abstract
Single amino acid variations (SAVs) occurring in the human population result in natural differences between individuals or cause diseases. It is well understood that the molecular effect of an SAV can be manifested as changes in the wild-type characteristics of the corresponding protein, among which are protein stability and protein interactions. Typically, the effect of SAVs on protein stability and interactions has been assessed via the changes in the wild-type folding and binding free energies. However, in terms of SAVs affecting protein functionality and disease susceptibility, one wants to know to what extent the wild-type function is perturbed by the SAV. Here it is demonstrated that the relative, rather than the absolute, change of the folding and binding free energy serves as a good indicator of SAV association with disease. Using HumVar as a source of disease-causing SAVs and experimentally determined free energy changes from the ProTherm and SKEMPI databases, correlation coefficients (CCs) between the disease index (Pd) and the relative folding (Ppr,f) and binding (Ppr,b) probability indexes, respectively, were obtained. The obtained CCs demonstrate the applicability of the proposed approach as a good indicator of SAV association with disease.
Affiliation(s)
- Yunhui Peng
- Computational Biophysics and Bioinformatics, Department of Physics, Clemson University, Clemson, South Carolina, 29634
- Emil Alexov
- Computational Biophysics and Bioinformatics, Department of Physics, Clemson University, Clemson, South Carolina, 29634
|
121
|
Roels SP, Loeys T, Moerkerke B. Evaluation of Second-Level Inference in fMRI Analysis. Computational Intelligence and Neuroscience 2015; 2016:1068434. [PMID: 26819578 PMCID: PMC4706870 DOI: 10.1155/2016/1068434] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 07/09/2015] [Revised: 08/21/2015] [Accepted: 10/04/2015] [Indexed: 11/30/2022]
Abstract
We investigate the impact of decisions in the second-level (i.e., over subjects) inferential process in functional magnetic resonance imaging on (1) the balance between false positives and false negatives and on (2) the data-analytical stability, both proxies for the reproducibility of results. Second-level analysis based on a mass univariate approach typically consists of 3 phases. First, one proceeds via a general linear model for a test image that consists of pooled information from different subjects. We evaluate models that take into account first-level (within-subjects) variability and models that do not take into account this variability. Second, one proceeds via inference based on parametric assumptions or via permutation-based inference. Third, we evaluate 3 commonly used procedures to address the multiple testing problem: familywise error rate correction, False Discovery Rate (FDR) correction, and a two-step procedure with minimal cluster size. Based on a simulation study and real data, we find that the two-step procedure with minimal cluster size yields the most stable results, followed by the familywise error rate correction. FDR correction yields the most variable results, for both permutation-based inference and parametric inference. Modeling the subject-specific variability yields a better balance between false positives and false negatives when using parametric inference.
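The difference between familywise error rate control and FDR control mentioned in this abstract can be illustrated on a small list of p-values. The sketch below uses the Bonferroni correction and the Benjamini-Hochberg step-up procedure as common representatives of the two families. The abstract does not say which specific procedures the authors used, and the p-values are made up for illustration.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Familywise error rate control: reject p_i if p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """FDR control: reject the k smallest p-values, where k is the largest
    rank with p_(k) <= k * alpha / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            max_k = rank
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for ten tests (for example, ten voxels).
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni(p)), sum(benjamini_hochberg(p)))  # 1 2
```

On these values the FDR procedure rejects one more hypothesis than Bonferroni, reflecting its less conservative error criterion.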
Affiliation(s)
- Sanne P. Roels
- Department of Data Analysis, Ghent University, H. Dunantlaan 1, 9000 Ghent, Belgium
- Tom Loeys
- Department of Data Analysis, Ghent University, H. Dunantlaan 1, 9000 Ghent, Belgium
- Beatrijs Moerkerke
- Department of Data Analysis, Ghent University, H. Dunantlaan 1, 9000 Ghent, Belgium
|
122
|
Estrada J, Echenique P, Sancho J. Predicting stabilizing mutations in proteins using Poisson-Boltzmann based models: study of unfolded state ensemble models and development of a successful binary classifier based on residue interaction energies. Phys Chem Chem Phys 2015; 17:31044-54. [PMID: 26530878 DOI: 10.1039/c5cp04348d] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 02/02/2023]
Abstract
In many cases the stability of a protein has to be increased to permit its biotechnological use. Rational methods of protein stabilization based on optimizing electrostatic interactions have provided some fine successful predictions. However, the precise calculation of stabilization energies remains challenging, one reason being that the electrostatic effects on the unfolded state are often neglected. We have explored here the feasibility of incorporating Poisson-Boltzmann model electrostatic calculations performed on representations of the unfolded state as large ensembles of geometrically optimized conformations calculated using the ProtSA server. Using a data set of 80 electrostatic mutations experimentally tested in two-state proteins, the predictive performance of several such models has been compared to that of a simple one that considers an unfolded structure of non-interacting residues. The unfolded ensemble models, while showing correlation between the predicted stabilization values and the experimental ones, are worse than the simple model, suggesting that the ensembles do not capture well the energetics of the unfolded state. A more attainable goal is classifying potential mutations as either stabilizing or non-stabilizing, rather than accurately calculating their stabilization energies. To implement a fast classification method that can assist in selecting stabilizing mutations, we have used a much simpler electrostatic model based only on the native structure and have determined its precision using different stabilizing energy thresholds. The binary classifier developed finds 7 true stabilizing mutants out of every 10 proposed candidates and can be used as a robust tool to propose stabilizing mutations.
Affiliation(s)
- Jorge Estrada
- Departamento de Bioquímica y Biología Molecular y Celular, Facultad de Ciencias, Universidad de Zaragoza, Pedro Cerbuna 12, 50009 Zaragoza, Spain, and Biocomputation and Complex Systems Physics Institute (BIFI), Joint Unit BIFI-IQFR (CSIC), Mariano Esquillor s/n, Edificio I+D, 50018, Zaragoza, Spain
- Pablo Echenique
- Biocomputation and Complex Systems Physics Institute (BIFI), Joint Unit BIFI-IQFR (CSIC), Mariano Esquillor s/n, Edificio I+D, 50018, Zaragoza, Spain and Instituto de Química Física "Rocasolano", CSIC, Serrano 119, 28006, Madrid, Spain
- Javier Sancho
- Departamento de Bioquímica y Biología Molecular y Celular, Facultad de Ciencias, Universidad de Zaragoza, Pedro Cerbuna 12, 50009 Zaragoza, Spain, and Biocomputation and Complex Systems Physics Institute (BIFI), Joint Unit BIFI-IQFR (CSIC), Mariano Esquillor s/n, Edificio I+D, 50018, Zaragoza, Spain
|
123
|
Vazquez M, Pons T, Brunak S, Valencia A, Izarzugaza JMG. wKinMut-2: Identification and Interpretation of Pathogenic Variants in Human Protein Kinases. Hum Mutat 2015; 37:36-42. [PMID: 26443060 DOI: 10.1002/humu.22914] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 04/02/2015] [Accepted: 09/22/2015] [Indexed: 12/31/2022]
Abstract
Most genomic alterations are tolerated, while only a minor fraction disrupts molecular function sufficiently to drive disease. Protein kinases play a central biological function, and the functional consequences of their variants are abundantly characterized. However, this heterogeneous information is often scattered across different sources, which makes integrative analysis complex and laborious. wKinMut-2 constitutes a solution to facilitate the interpretation of the consequences of human protein kinase variation. Nine methods predict the pathogenicity of variants, including a kinase-specific random forest approach. To understand the biological mechanisms that cause human diseases and cancer, information from pertinent reference knowledge bases and the literature is automatically mined, digested, and homogenized. Variants are visualized in their structural contexts, and residues affecting catalysis and drug binding are identified. Known protein-protein interactions are reported. Altogether, this information is intended to assist the generation of new working hypotheses to be corroborated by subsequent experimental work. The wKinMut-2 system, along with a user manual and examples, is freely accessible at http://kinmut2.bioinfo.cnio.es; the code for local installations can be downloaded from https://github.com/Rbbt-Workflows/KinMut2.
Affiliation(s)
- Miguel Vazquez
- Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
- Tirso Pons
- Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
- Søren Brunak
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen 2200, Denmark; Center for Biological Sequence Analysis (CBS), Systems Biology Department, Technical University of Denmark (DTU), Kongens Lyngby 2800, Denmark
- Alfonso Valencia
- Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
- Jose M G Izarzugaza
- Center for Biological Sequence Analysis (CBS), Systems Biology Department, Technical University of Denmark (DTU), Kongens Lyngby 2800, Denmark
|
124
|
Rodrigues C, Santos-Silva A, Costa E, Bronze-da-Rocha E. Performance of In Silico Tools for the Evaluation of UGT1A1 Missense Variants. Hum Mutat 2015; 36:1215-25. [PMID: 26377032 DOI: 10.1002/humu.22903] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 11/03/2014] [Accepted: 08/31/2015] [Indexed: 01/17/2023]
Abstract
Variations in the gene encoding uridine diphosphate glucuronosyltransferase 1A1 (UGT1A1) are particularly important because they have been associated with hyperbilirubinemia in Gilbert's and Crigler-Najjar syndromes as well as with changes in drug metabolism. Several variants associated with these phenotypes are nonsynonymous single-nucleotide polymorphisms (nsSNPs). Bioinformatics approaches have gained increasing importance in predicting the functional significance of these variants. This study focused on the ability of bioinformatics approaches to predict the pathogenicity of human UGT1A1 nsSNPs that were previously characterized at the protein level by in vivo and in vitro studies. Using 16 Web algorithms, we evaluated 48 nsSNPs described in the literature and databases. Eight of these algorithms reached or exceeded 90% sensitivity and six presented a Matthews correlation coefficient above 0.46. The best-performing method was MutPred, followed by Sorting Intolerant from Tolerant (SIFT). The prediction measures varied significantly depending on whether predictors such as SIFT, PolyPhen-2, and Prediction of Pathological Mutations on Proteins were run with the native alignment generated by the tool or with an input alignment that was strictly built with UGT1A1 orthologs and manually curated. Our results showed that the prediction performance of some methods based on sequence conservation analysis can be negatively affected when nsSNPs are positioned at the hypervariable or constant regions of UGT1A1 ortholog sequences.
Affiliation(s)
- Carina Rodrigues
- UCIBIO/REQUIMTE, Laboratório de Bioquímica, Departamento de Ciências Biológicas, Faculdade de Farmácia, Universidade do Porto, Porto, Portugal; Escola Superior de Saúde, Instituto Politécnico de Bragança, Bragança, Portugal
- Alice Santos-Silva
- UCIBIO/REQUIMTE, Laboratório de Bioquímica, Departamento de Ciências Biológicas, Faculdade de Farmácia, Universidade do Porto, Porto, Portugal
- Elísio Costa
- UCIBIO/REQUIMTE, Laboratório de Bioquímica, Departamento de Ciências Biológicas, Faculdade de Farmácia, Universidade do Porto, Porto, Portugal
- Elsa Bronze-da-Rocha
- UCIBIO/REQUIMTE, Laboratório de Bioquímica, Departamento de Ciências Biológicas, Faculdade de Farmácia, Universidade do Porto, Porto, Portugal
|
125
|
Niroula A, Vihinen M. Classification of Amino Acid Substitutions in Mismatch Repair Proteins Using PON-MMR2. Hum Mutat 2015; 36:1128-34. [PMID: 26333163 DOI: 10.1002/humu.22900] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 04/27/2015] [Accepted: 08/24/2015] [Indexed: 12/21/2022]
Abstract
Variations in mismatch repair (MMR) system genes are causative of Lynch syndrome and other cancers. Thousands of variants have been identified in MMR genes, but the clinical relevance is known for only a small proportion. Recently, the InSiGHT group classified 2,360 MMR variants into five classes. One-third of the variants, the majority of which are nonsynonymous, remain of uncertain clinical relevance. Computational tools can be used to prioritize variants for disease relevance investigations. Previously, we classified 248 MMR variants as likely pathogenic or likely benign using PON-MMR. We have developed a novel tool, PON-MMR2, which is trained on a larger and more reliable dataset. In a performance comparison, PON-MMR2 outperforms both generic tolerance prediction methods and methods optimized for MMR variants. It achieves accuracy and MCC of 0.89 and 0.78, respectively, in cross-validation and 0.86 and 0.69, respectively, on an independent test dataset. We classified 354 class 3 variants in the InSiGHT database as well as all possible amino acid substitutions in four MMR proteins. Likely harmful variants mainly appear in the protein core, whereas likely benign variants are on the surface. PON-MMR2 is a highly reliable tool to prioritize variants for functional analysis. It is freely available at http://structure.bmc.lu.se/PON-MMR2/.
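Accuracy and the Matthews correlation coefficient (MCC), the two measures quoted in this abstract, are both computed from the four cells of a binary confusion matrix. The sketch below shows the standard formulas. The counts are hypothetical and were chosen only to give values of similar magnitude to those reported; they are not the confusion matrix from the PON-MMR2 study.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; ranges from -1 to 1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical confusion matrix (illustration only).
tp, tn, fp, fn = 45, 44, 6, 5
print(round(accuracy(tp, tn, fp, fn), 2), round(mcc(tp, tn, fp, fn), 2))  # 0.89 0.78
```

Unlike accuracy, MCC uses all four cells symmetrically, which is why it is often reported alongside accuracy for datasets with unequal class sizes.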
Affiliation(s)
- Abhishek Niroula
- Department of Experimental Medical Science, Lund University, BMC B13, Lund, SE, 22184, Sweden
| | - Mauno Vihinen
- Department of Experimental Medical Science, Lund University, BMC B13, Lund, SE, 22184, Sweden
| |
|
126
|
Ho SS, McLachlan AJ, Chen TF, Hibbs DE, Fois RA. Relationships Between Pharmacovigilance, Molecular, Structural, and Pathway Data: Revealing Mechanisms for Immune-Mediated Drug-Induced Liver Injury. CPT-PHARMACOMETRICS & SYSTEMS PHARMACOLOGY 2015; 4:426-41. [PMID: 26312166 PMCID: PMC4544056 DOI: 10.1002/psp4.56] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 03/23/2015] [Accepted: 05/08/2015] [Indexed: 11/18/2022]
Abstract
Immune-mediated drug-induced liver injury (IMDILI) can be devastating, irreversible, and fatal in the absence of successful transplantation surgery. We present a novel approach that combines the methods of pharmacoepidemiology with in silico molecular modeling to identify specific features in toxic ligands that are associated with clinical features of IMDILI. Specifically, from pharmacovigilance data multivariate logistic regression identified 18 drugs associated with IMDILI (P < 0.00015). Eleven of these drugs, along with their known and proposed metabolites, constituted a training set used to develop a four-point pharmacophore model (sensitivity 75%; specificity 85%). Subsequently, this information was combined with information from immune-pathway reviews and genetic-association studies and complemented with ligand-protein docking simulations to support a hypothesis implicating two putative targets within separate, possibly interacting, immune-system pathways: the major histocompatibility complex within the adaptive immune system and Toll-like receptors (TLRs), in particular TLR-7, which represent pattern recognition receptors of the innate immune system.
Affiliation(s)
- S S Ho
- Faculty of Pharmacy (A15), University of Sydney, Sydney, NSW, Australia
| | - A J McLachlan
- Faculty of Pharmacy (A15), University of Sydney, Sydney, NSW, Australia
| | - T F Chen
- Faculty of Pharmacy (A15), University of Sydney, Sydney, NSW, Australia
| | - D E Hibbs
- Faculty of Pharmacy (A15), University of Sydney, Sydney, NSW, Australia
| | - R A Fois
- Faculty of Pharmacy (A15), University of Sydney, Sydney, NSW, Australia
| |
|
127
|
Miner RM, Al Qabandi S, Rigali PH, Will LA. Cone-beam computed tomography transverse analyses. Part 2: Measures of performance. Am J Orthod Dentofacial Orthop 2015; 148:253-63. [DOI: 10.1016/j.ajodo.2015.03.027] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 03/01/2012] [Revised: 03/01/2015] [Accepted: 03/01/2015] [Indexed: 10/23/2022]
|
128
|
Predicting Metabolic Syndrome Using the Random Forest Method. ScientificWorldJournal 2015; 2015:581501. [PMID: 26290899 PMCID: PMC4531182 DOI: 10.1155/2015/581501] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Received: 02/25/2015] [Revised: 06/04/2015] [Accepted: 06/07/2015] [Indexed: 02/08/2023] Open
Abstract
Aims. This study proposes a computational method for determining the prevalence of metabolic syndrome (MS) and predicting its occurrence using the National Cholesterol Education Program Adult Treatment Panel III (NCEP ATP III) criteria. The Random Forest (RF) method is also applied to identify significant health parameters. Materials and Methods. We used data from 5,646 adults aged 18–78 years residing in Bangkok who had received an annual health check-up in 2008. MS was identified using the NCEP ATP III criteria. The RF method was applied to predict the occurrence of MS and to identify important health parameters associated with this disorder. Results. The overall prevalence of MS was 23.70% (34.32% for males and 17.74% for females). RF accuracy for predicting MS in an adult Thai population was 98.11%. Further, based on RF, triglyceride levels were the most important health parameter associated with MS. Conclusion. RF was shown to predict MS in an adult Thai population with an accuracy >98%, and triglyceride levels were identified as the most informative variable associated with MS. Therefore, using RF to predict MS may be beneficial in identifying MS status for preventing the development of diabetes mellitus and cardiovascular diseases.
|
129
|
Gagliano SA, Paterson AD, Weale ME, Knight J. Assessing models for genetic prediction of complex traits: a comparison of visualization and quantitative methods. BMC Genomics 2015; 16:405. [PMID: 25997848 PMCID: PMC4440290 DOI: 10.1186/s12864-015-1616-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2014] [Accepted: 05/05/2015] [Indexed: 11/13/2022] Open
Abstract
Background In silico models have recently been created in order to predict which genetic variants are more likely to contribute to the risk of a complex trait given their functional characteristics. However, there has been no comprehensive review as to which type of predictive accuracy measures and data visualization techniques are most useful for assessing these models. Methods We assessed the performance of the models for predicting risk using various methodologies, some of which include: receiver operating characteristic (ROC) curves, histograms of classification probability, and the novel use of the quantile-quantile plot. These measures have variable interpretability depending on factors such as whether the dataset is balanced in terms of numbers of genetic variants classified as risk variants versus those that are not. Results We conclude that the area under the curve (AUC) is a suitable starting place, and for models with similar AUCs, violin plots are particularly useful for examining the distribution of the risk scores. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1616-z) contains supplementary material, which is available to authorized users.
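As a rough illustration of the AUC named in this abstract, the following Python sketch computes it in its pairwise form: the probability that a randomly chosen risk variant receives a higher score than a randomly chosen non-risk variant, with ties counted as one half. The function name and the example scores are hypothetical and are not taken from the paper.

```python
def auc_from_scores(risk_scores, non_risk_scores):
    """Pairwise (Mann-Whitney) form of the area under the ROC curve.

    risk_scores: model scores for variants labelled as risk variants.
    non_risk_scores: model scores for variants labelled as non-risk.
    """
    if not risk_scores or not non_risk_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for r in risk_scores:
        for n in non_risk_scores:
            if r > n:
                wins += 1.0   # risk variant ranked above non-risk variant
            elif r == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(risk_scores) * len(non_risk_scores))


# Hypothetical scores, for illustration only.
example_auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

Because the pairwise form normalises by the number of risk/non-risk pairs, the value itself does not change with class proportions, but two models with similar AUCs can still have very different score distributions, which is where the distribution plots discussed in the abstract come in.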
Affiliation(s)
- Sarah A Gagliano
- Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, Ontario, Canada; Institute of Medical Science, University of Toronto, Toronto, Ontario, Canada; Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada.
| | - Andrew D Paterson
- Institute of Medical Science, University of Toronto, Toronto, Ontario, Canada; Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada; Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada; Biostatistics Division, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada; Epidemiology Division, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada.
| | - Michael E Weale
- Department of Medical & Molecular Genetics, King's College London, Guy's Hospital, London, UK.
| | - Jo Knight
- Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, Ontario, Canada; Institute of Medical Science, University of Toronto, Toronto, Ontario, Canada; Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada; Biostatistics Division, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada.
| |
|
130
|
Assessment of the predictive accuracy of five in silico prediction tools, alone or in combination, and two metaservers to classify long QT syndrome gene mutations. BMC MEDICAL GENETICS 2015; 16:34. [PMID: 25967940 PMCID: PMC4630850 DOI: 10.1186/s12881-015-0176-z] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.7] [Received: 11/03/2014] [Accepted: 04/22/2015] [Indexed: 11/27/2022]
Abstract
Background Long QT syndrome (LQTS) is an autosomal dominant condition predisposing to sudden death from malignant arrhythmia. Genetic testing identifies many missense single nucleotide variants of uncertain pathogenicity. Establishing genetic pathogenicity is an essential prerequisite to family cascade screening. Many laboratories use in silico prediction tools, either alone or in combination, or metaservers, in order to predict pathogenicity; however, their accuracy in the context of LQTS is unknown. We evaluated the accuracy of five in silico programs and two metaservers in the analysis of LQTS 1–3 gene variants. Methods The in silico tools SIFT, PolyPhen-2, PROVEAN, SNPs&GO and SNAP, either alone or in all possible combinations, and the metaservers Meta-SNP and PredictSNP, were tested on 312 KCNQ1, KCNH2 and SCN5A gene variants that have previously been characterised by either in vitro or co-segregation studies as either “pathogenic” (283) or “benign” (29). The accuracy, sensitivity, specificity and Matthews Correlation Coefficient (MCC) were calculated to determine the best combination of in silico tools for each LQTS gene, and when all genes are combined. Results The best combination of in silico tools for KCNQ1 is PROVEAN, SNPs&GO and SIFT (accuracy 92.7%, sensitivity 93.1%, specificity 100% and MCC 0.70). The best combination of in silico tools for KCNH2 is SIFT and PROVEAN or PROVEAN, SNPs&GO and SIFT. Both combinations have the same scores for accuracy (91.1%), sensitivity (91.5%), specificity (87.5%) and MCC (0.62). In the case of SCN5A, SNAP and PROVEAN provided the best combination (accuracy 81.4%, sensitivity 86.9%, specificity 50.0%, and MCC 0.32). When all three LQT genes are combined, SIFT, PROVEAN and SNAP is the combination with the best performance (accuracy 82.7%, sensitivity 83.0%, specificity 80.0%, and MCC 0.44). 
Both metaservers performed better than the single in silico tools; however, they did not perform better than the best performing combination of in silico tools. Conclusions The combination of in silico tools with the best performance is gene-dependent. The in silico tools reported here may have some value in assessing variants in the KCNQ1 and KCNH2 genes, but caution should be taken when the analysis is applied to SCN5A gene variants. Electronic supplementary material The online version of this article (doi:10.1186/s12881-015-0176-z) contains supplementary material, which is available to authorized users.
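The four measures reported in this abstract (accuracy, sensitivity, specificity and MCC) all derive from the same confusion matrix. The Python sketch below shows the standard formulas, treating "pathogenic" as the positive class; the function name and the counts in the example are hypothetical and do not reproduce any result from the study.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard measures from a 2x2 confusion matrix.

    tp/fn: pathogenic variants predicted pathogenic / benign.
    tn/fp: benign variants predicted benign / pathogenic.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    total = tp + fn + tn + fp
    # The MCC denominator is zero when every prediction falls in one class.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, fn=10, tn=80, fp=20)
```

MCC uses all four cells of the matrix, so on a dataset dominated by one class it can remain modest even when accuracy and sensitivity are high, which is one reason to report it alongside the other three.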
|
131
|
Leong IUS, Stuckey A, Lai D, Skinner JR, Love DR. Assessment of the predictive accuracy of five in silico prediction tools, alone or in combination, and two metaservers to classify long QT syndrome gene mutations. BMC MEDICAL GENETICS 2015. [PMID: 25967940 DOI: 10.1186/s12881-015-0176-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
Affiliation(s)
- Ivone U S Leong
- Diagnostic Genetics, LabPlus, Auckland City Hospital, Auckland, New Zealand.
| | - Alexander Stuckey
- Bioinformatics Institute, University of Auckland, Auckland, New Zealand.
| | - Daniel Lai
- Green Lane Paediatric and Congenital Cardiac Services, Starship Children's Hospital, Private Bag 92024, Auckland, 1142, New Zealand.
| | - Jonathan R Skinner
- Green Lane Paediatric and Congenital Cardiac Services, Starship Children's Hospital, Private Bag 92024, Auckland, 1142, New Zealand; Cardiac Inherited Disease Group, Auckland City Hospital, Auckland, New Zealand; Department of Child Health, University of Auckland, Auckland, New Zealand.
| | - Donald R Love
- Department of Child Health, University of Auckland, Auckland, New Zealand.
| |
|
132
|
Väliaho J, Faisal I, Ortutay C, Smith CIE, Vihinen M. Characterization of all possible single-nucleotide change caused amino acid substitutions in the kinase domain of Bruton tyrosine kinase. Hum Mutat 2015; 36:638-47. [PMID: 25777788 DOI: 10.1002/humu.22791] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Received: 09/30/2014] [Revised: 02/27/2015] [Accepted: 03/10/2015] [Indexed: 12/31/2022]
Abstract
Knowledge about features distinguishing deleterious and neutral variations is crucial for interpretation of novel variants. Bruton tyrosine kinase (BTK) contains the highest number of unique disease-causing variations among the human protein kinases, yet these represent just 10% of all the possible single-nucleotide substitution-caused amino acid variations (SNAVs). Altogether, 1,495 SNAVs can appear in the BTK kinase domain (BTK-KD). We investigated them all with bioinformatic and protein structure analysis methods. Most disease-causing variations affect conserved and buried residues, disturbing protein stability. A minority of exposed residues are conserved, but these are strongly tied to pathogenicity. Sixty-seven percent of variations are predicted to be harmful. In 39% of the residues, all the variants are likely harmful, whereas in 10% of sites, all the substitutions are tolerated. Results indicate the importance of the entire kinase domain, involvement in numerous interactions, and intricate functional regulation by conformational change. These results can be extended to other protein kinases and organisms.
Affiliation(s)
- Jouni Väliaho
- BioMediTech, University of Tampere, Tampere, Finland
| | - Imrul Faisal
- BioMediTech, University of Tampere, Tampere, Finland
| | - Csaba Ortutay
- BioMediTech, University of Tampere, Tampere, Finland; present address: HiDucator Ltd., Erämiehentie 2 E 22, Kangasala FI-36200, Finland
| | - C I Edvard Smith
- Clinical Research Center, Department of Laboratory Medicine, Karolinska Institutet, Karolinska University Hospital Huddinge, Huddinge, Sweden
| | - Mauno Vihinen
- BioMediTech, University of Tampere, Tampere, Finland; Department of Experimental Medical Science, Lund University, Lund, Sweden; Research Unit, Tampere University Hospital, Tampere, Finland
| |
|
133
|
Smusz S, Mordalski S, Witek J, Rataj K, Kafel R, Bojarski AJ. Multi-Step Protocol for Automatic Evaluation of Docking Results Based on Machine Learning Methods--A Case Study of Serotonin Receptors 5-HT(6) and 5-HT(7). J Chem Inf Model 2015; 55:823-32. [PMID: 25806997 DOI: 10.1021/ci500564b] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 12/24/2022]
Abstract
Molecular docking, despite its undeniable usefulness in computer-aided drug design protocols and the increasing sophistication of tools used in the prediction of ligand-protein interaction energies, is still connected with a problem of effective results analysis. In this study, a novel protocol for the automatic evaluation of numerous docking results is presented, being a combination of Structural Interaction Fingerprints and Spectrophores descriptors, machine-learning techniques, and multi-step results analysis. Such an approach takes into consideration the performance of a particular learning algorithm (five machine learning methods were applied), the performance of the docking algorithm itself, the variety of conformations returned from the docking experiment, and the receptor structure (homology models were constructed on five different templates). Evaluation using compounds active toward 5-HT6 and 5-HT7 receptors, as well as additional analysis carried out for beta-2 adrenergic receptor ligands, proved that the methodology is a viable tool for supporting virtual screening protocols, enabling proper discrimination between active and inactive compounds.
Affiliation(s)
- Sabina Smusz
- Department of Medicinal Chemistry, Institute of Pharmacology, Polish Academy of Sciences, 12 Smętna Street, 31-343 Kraków, Poland; Faculty of Chemistry, Jagiellonian University, 3 Ingardena Street, 30-060 Kraków, Poland
| | - Stefan Mordalski
- Department of Medicinal Chemistry, Institute of Pharmacology, Polish Academy of Sciences, 12 Smętna Street, 31-343 Kraków, Poland
| | - Jagna Witek
- Department of Medicinal Chemistry, Institute of Pharmacology, Polish Academy of Sciences, 12 Smętna Street, 31-343 Kraków, Poland
| | - Krzysztof Rataj
- Department of Medicinal Chemistry, Institute of Pharmacology, Polish Academy of Sciences, 12 Smętna Street, 31-343 Kraków, Poland
| | - Rafał Kafel
- Department of Medicinal Chemistry, Institute of Pharmacology, Polish Academy of Sciences, 12 Smętna Street, 31-343 Kraków, Poland
| | - Andrzej J Bojarski
- Department of Medicinal Chemistry, Institute of Pharmacology, Polish Academy of Sciences, 12 Smętna Street, 31-343 Kraków, Poland
| |
|
134
|
Grimm DG, Azencott C, Aicheler F, Gieraths U, MacArthur DG, Samocha KE, Cooper DN, Stenson PD, Daly MJ, Smoller JW, Duncan LE, Borgwardt KM. The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Hum Mutat 2015; 36:513-23. [PMID: 25684150 PMCID: PMC4409520 DOI: 10.1002/humu.22768] [Citation(s) in RCA: 212] [Impact Index Per Article: 23.6] [Received: 09/27/2014] [Accepted: 02/06/2015] [Indexed: 01/12/2023]
Abstract
Prioritizing missense variants for further experimental investigation is a key challenge in current sequencing studies for exploring complex and Mendelian diseases. A large number of in silico tools have been employed for the task of pathogenicity prediction, including PolyPhen-2, SIFT, FatHMM, MutationTaster-2, MutationAssessor, Combined Annotation Dependent Depletion, LRT, phyloP, and GERP++, as well as optimized methods of combining tool scores, such as Condel and Logit. Due to the wealth of these methods, an important practical question to answer is which of these tools generalize best, that is, correctly predict the pathogenic character of new variants. We here demonstrate in a study of 10 tools on five datasets that such a comparative evaluation of these tools is hindered by two types of circularity: they arise due to (1) the same variants or (2) different variants from the same protein occurring both in the datasets used for training and for evaluation of these tools, which may lead to overly optimistic results. We show that comparative evaluations of predictors that do not address these types of circularity may erroneously conclude that circularity confounded tools are most accurate among all tools, and may even outperform optimized combinations of tools.
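One common way to guard against the second type of circularity described here (different variants from the same protein appearing in both training and evaluation data) is to split a dataset at the protein level rather than the variant level. The Python sketch below illustrates that idea under simple assumptions; the function name, the `(protein_id, variant_id)` record layout and the example identifiers are hypothetical, and this is not presented as the procedure used in the paper.

```python
import random


def split_by_protein(variants, test_fraction=0.2, seed=0):
    """Split (protein_id, variant_id) records so no protein spans both sets."""
    proteins = sorted({protein for protein, _ in variants})
    rng = random.Random(seed)   # local generator, reproducible for a given seed
    rng.shuffle(proteins)
    n_test = max(1, round(len(proteins) * test_fraction))
    test_proteins = set(proteins[:n_test])
    # Every variant follows its protein, so the two sets share no protein.
    train = [v for v in variants if v[0] not in test_proteins]
    test = [v for v in variants if v[0] in test_proteins]
    return train, test


# Hypothetical records, for illustration only.
variants = [
    ("P1", "A10V"), ("P1", "G25D"), ("P2", "L5P"), ("P3", "R7W"),
    ("P3", "K9E"), ("P4", "S3T"), ("P5", "D8N"),
]
train, test = split_by_protein(variants, test_fraction=0.4, seed=1)
```

A split like this only controls overlap within one's own dataset. It does not reveal whether a third-party tool was itself trained on the evaluation variants, which is the first type of circularity and has to be checked against that tool's training data.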
Affiliation(s)
- Dominik G. Grimm
- Machine Learning and Computational Biology Research Group, Max Planck Institute for Intelligent Systems and Max Planck Institute for Developmental Biology, Tübingen, Germany
- Zentrum für Bioinformatik, Eberhard Karls Universität Tübingen, Tübingen, Germany
- Department for Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
| | - Chloé-Agathe Azencott
- Machine Learning and Computational Biology Research Group, Max Planck Institute for Intelligent Systems and Max Planck Institute for Developmental Biology, Tübingen, Germany
- MINES ParisTech, PSL Research University, CBIO – Centre for Computational Biology, Fontainebleau, France
- Institut Curie, Paris, France
- INSERM, Paris, France
| | - Fabian Aicheler
- Machine Learning and Computational Biology Research Group, Max Planck Institute for Intelligent Systems and Max Planck Institute for Developmental Biology, Tübingen, Germany
- Zentrum für Bioinformatik, Eberhard Karls Universität Tübingen, Tübingen, Germany
| | - Udo Gieraths
- Machine Learning and Computational Biology Research Group, Max Planck Institute for Intelligent Systems and Max Planck Institute for Developmental Biology, Tübingen, Germany
| | - Daniel G. MacArthur
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts
- Harvard Medical School, Department of Medicine, Boston, Massachusetts
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts
| | - Kaitlin E. Samocha
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts
- Harvard Medical School, Department of Medicine, Boston, Massachusetts
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts
| | - David N. Cooper
- Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff, UK
| | - Peter D. Stenson
- Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff, UK
| | - Mark J. Daly
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts
- Harvard Medical School, Department of Medicine, Boston, Massachusetts
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts
| | - Jordan W. Smoller
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts
- Psychiatric and Neurodevelopmental Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts
- Harvard Medical School, Department of Psychiatry, Boston, Massachusetts
| | - Laramie E. Duncan
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts
- Harvard Medical School, Department of Medicine, Boston, Massachusetts
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts
| | - Karsten M. Borgwardt
- Machine Learning and Computational Biology Research Group, Max Planck Institute for Intelligent Systems and Max Planck Institute for Developmental Biology, Tübingen, Germany
- Zentrum für Bioinformatik, Eberhard Karls Universität Tübingen, Tübingen, Germany
- Department for Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
| |
|
135
|
Shoombuatong W, Prachayasittikul V, Prachayasittikul V, Nantasenamat C. Prediction of aromatase inhibitory activity using the efficient linear method (ELM). EXCLI JOURNAL 2015; 14:452-64. [PMID: 26535037 PMCID: PMC4614109 DOI: 10.17179/excli2015-140] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 01/10/2015] [Accepted: 02/02/2015] [Indexed: 12/26/2022]
Abstract
Aromatase inhibition is an effective treatment strategy for breast cancer. Currently, several in silico methods have been developed for the prediction of aromatase inhibitors (AIs) using artificial neural network (ANN) or support vector machine (SVM). In spite of this, there are ample opportunities for further improvements by developing a simple and interpretable quantitative structure-activity relationship (QSAR) method. Herein, an efficient linear method (ELM) is proposed for constructing a highly predictive QSAR model containing a spontaneous feature importance estimator. Briefly, ELM is a linear-based model with optimal parameters derived from genetic algorithm. Results showed that the simple ELM method displayed robust performance with 10-fold cross-validation MCC values of 0.64 and 0.56 for steroidal and non-steroidal AIs, respectively. Comparative analyses with other machine learning methods (i.e. ANN, SVM and decision tree) were also performed. A thorough analysis of informative molecular descriptors for both steroidal and non-steroidal AIs provided insights into the mechanism of action of compounds. Our findings suggest that the shape and polarizability of compounds may govern the inhibitory activity of both steroidal and non-steroidal types whereas the terminal primary C(sp3) functional group and electronegativity may be required for non-steroidal AIs. The R code of the ELM method is available at http://dx.doi.org/10.6084/m9.figshare.1274030.
Affiliation(s)
- Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
| | - Veda Prachayasittikul
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand; Department of Clinical Microbiology and Applied Technology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
| | - Virapong Prachayasittikul
- Department of Clinical Microbiology and Applied Technology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
| | - Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand; Department of Clinical Microbiology and Applied Technology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
| |
|
136
|
Popovic D, Sifrim A, Davis J, Moreau Y, De Moor B. Problems with the nested granularity of feature domains in bioinformatics: the eXtasy case. BMC Bioinformatics 2015; 16 Suppl 4:S2. [PMID: 25734591 PMCID: PMC4347616 DOI: 10.1186/1471-2105-16-s4-s2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 01/16/2023] Open
Abstract
Background Data from biomedical domains often have an inherent hierarchical structure. As this structure is usually implicit, its existence can be overlooked by practitioners interested in constructing and evaluating predictive models from such data. Ignoring these constructs leads to potentially problematic and routinely unrecognized bias in the models and results. In this work, we discuss this bias in detail and propose a simple, sampling-based solution for it. Next, we explore its sources and extent on synthetic data. Finally, we demonstrate how the state-of-the-art variant prioritization framework, eXtasy, benefits from using the described approach in its Random forest-based core classification model. Results and conclusions The conducted simulations clearly indicate that the heterogeneous granularity of feature domains poses significant problems for both the standard Random forest classifier and a modification that relies on stratified bootstrapping. Conversely, using the proposed sampling scheme when training the classifier mitigates the described bias. Furthermore, when applied to the eXtasy data under a realistic class distribution scenario, a Random forest learned using the proposed sampling scheme displays much better precision than its standard version, without degrading recall. Moreover, the largest performance gains are achieved in the most important part of the operating range: the top of the prioritized gene list.
|
137
|
PON-P2: prediction method for fast and reliable identification of harmful variants. PLoS One 2015; 10:e0117380. [PMID: 25647319 PMCID: PMC4315405 DOI: 10.1371/journal.pone.0117380] [Citation(s) in RCA: 159] [Impact Index Per Article: 17.7] [Received: 09/24/2014] [Accepted: 12/17/2014] [Indexed: 01/04/2023] Open
Abstract
More reliable and faster prediction methods are needed to interpret enormous amounts of data generated by sequencing and genome projects. We have developed a new computational tool, PON-P2, for classification of amino acid substitutions in human proteins. The method is a machine learning-based classifier and groups the variants into pathogenic, neutral and unknown classes, on the basis of random forest probability score. PON-P2 is trained using pathogenic and neutral variants obtained from VariBench, a database for benchmark variation datasets. PON-P2 utilizes information about evolutionary conservation of sequences, physical and biochemical properties of amino acids, GO annotations and if available, functional annotations of variation sites. Extensive feature selection was performed to identify 8 informative features among altogether 622 features. PON-P2 consistently showed superior performance in comparison to existing state-of-the-art tools. In 10-fold cross-validation test, its accuracy and MCC are 0.90 and 0.80, respectively, and in the independent test, they are 0.86 and 0.71, respectively. The coverage of PON-P2 is 61.7% in the 10-fold cross-validation and 62.1% in the test dataset. PON-P2 is a powerful tool for screening harmful variants and for ranking and prioritizing experimental characterization. It is very fast making it capable of analyzing large variant datasets. PON-P2 is freely available at http://structure.bmc.lu.se/PON-P2/.
|
138
|
Zhou Y, Zhang N, Li BQ, Huang T, Cai YD, Kong XY. A method to distinguish between lysine acetylation and lysine ubiquitination with feature selection and analysis. J Biomol Struct Dyn 2015; 33:2479-90. [PMID: 25616595 DOI: 10.1080/07391102.2014.1001793] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Indexed: 12/16/2022]
Abstract
Lysine acetylation and ubiquitination are two primary post-translational modifications (PTMs) in most eukaryotic proteins. Lysine residues are targets for both types of PTMs, resulting in different cellular roles. With the increasing availability of protein sequences and PTM data, it is challenging to distinguish the two types of PTMs on lysine residues. Experimental approaches are often laborious and time consuming. There is an urgent need for computational tools to distinguish between lysine acetylation and ubiquitination. In this study, we developed a novel method, called DAUFSA (distinguish between lysine acetylation and lysine ubiquitination with feature selection and analysis), to discriminate ubiquitinated and acetylated lysine residues. The method incorporated several types of features: PSSM (position-specific scoring matrix) conservation scores, amino acid factors, secondary structures, solvent accessibilities, and disorder scores. By using the mRMR (maximum relevance minimum redundancy) method and the IFS (incremental feature selection) method, an optimal feature set containing 290 features was selected from all incorporated features. A dagging-based classifier constructed by the optimal features achieved a classification accuracy of 69.53%, with an MCC of .3853. An optimal feature set analysis showed that the PSSM conservation score features and the amino acid factor features were the most important attributes, suggesting differences between acetylation and ubiquitination. Our study results also supported previous findings that different motifs were employed by acetylation and ubiquitination. The feature differences between the two modifications revealed in this study are worthy of experimental validation and further investigation.
Affiliation(s)
- You Zhou
  - The Key Laboratory of Stem Cell Biology, Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences and Shanghai Jiao Tong University School of Medicine, Shanghai 200031, P.R. China
- Ning Zhang
  - Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin 300072, P.R. China
- Bi-Qing Li
  - Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, P.R. China
- Tao Huang
  - The Key Laboratory of Stem Cell Biology, Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences and Shanghai Jiao Tong University School of Medicine, Shanghai 200031, P.R. China
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Yu-Dong Cai
  - Institute of Systems Biology, Shanghai University, Shanghai 200444, P.R. China
- Xiang-Yin Kong
  - The Key Laboratory of Stem Cell Biology, Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences and Shanghai Jiao Tong University School of Medicine, Shanghai 200031, P.R. China
|
139
|
Schaafsma GC, Vihinen M. VariSNP, A Benchmark Database for Variations From dbSNP. Hum Mutat 2015; 36:161-6. [DOI: 10.1002/humu.22727] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Received: 09/01/2014] [Accepted: 10/28/2014] [Indexed: 11/12/2022]
Affiliation(s)
- Gerard C.P. Schaafsma
  - Protein Structure and Bioinformatics; Department of Experimental Medical Science, Lund University; Lund SE-221 84 Sweden
- Mauno Vihinen
  - Protein Structure and Bioinformatics; Department of Experimental Medical Science, Lund University; Lund SE-221 84 Sweden
|
140
|
Dong C, Wei P, Jian X, Gibbs R, Boerwinkle E, Wang K, Liu X. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum Mol Genet 2014; 24:2125-37. [PMID: 25552646 DOI: 10.1093/hmg/ddu733] [Citation(s) in RCA: 752] [Impact Index Per Article: 75.2] [Indexed: 12/19/2022] Open
Abstract
Accurate deleteriousness prediction for nonsynonymous variants is crucial for distinguishing pathogenic mutations from background polymorphisms in whole exome sequencing (WES) studies. Although many deleteriousness prediction methods have been developed, their prediction results are sometimes inconsistent with each other and their relative merits are still unclear in practical applications. To address these issues, we comprehensively evaluated the predictive performance of 18 current deleteriousness-scoring methods, including 11 function prediction scores (PolyPhen-2, SIFT, MutationTaster, Mutation Assessor, FATHMM, LRT, PANTHER, PhD-SNP, SNAP, SNPs&GO and MutPred), 3 conservation scores (GERP++, SiPhy and PhyloP) and 4 ensemble scores (CADD, PON-P, KGGSeq and CONDEL). We found that FATHMM and KGGSeq had the highest discriminative power among independent scores and ensemble scores, respectively. Moreover, to ensure unbiased performance evaluation of these prediction scores, we manually collected three distinct testing datasets, on which no current prediction scores were tuned. In addition, we developed two new ensemble scores that integrate nine independent scores and allele frequency. Our scores achieved the highest discriminative power compared with all the deleteriousness prediction scores tested and showed low false-positive prediction rate for benign yet rare nonsynonymous variants, which demonstrated the value of combining information from multiple orthologous approaches. Finally, to facilitate variant prioritization in WES studies, we have pre-computed our ensemble scores for 87 347 044 possible variants in the whole-exome and made them publicly available through the ANNOVAR software and the dbNSFP database.
Affiliation(s)
- Chengliang Dong
  - Zilkha Neurogenetic Institute, Biostatistics Division, Department of Preventive Medicine
- Peng Wei
  - Human Genetics Center, Division of Biostatistics, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Xueqiu Jian
  - Division of Epidemiology, Human Genetics and Environmental Sciences
- Richard Gibbs
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Eric Boerwinkle
  - Human Genetics Center, Division of Epidemiology, Human Genetics and Environmental Sciences and Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Kai Wang
  - Zilkha Neurogenetic Institute, Biostatistics Division, Department of Preventive Medicine and Department of Psychiatry, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
- Xiaoming Liu
  - Human Genetics Center, Division of Epidemiology, Human Genetics and Environmental Sciences
|
141
|
Riera C, Lois S, Domínguez C, Fernandez-Cadenas I, Montaner J, Rodríguez-Sureda V, de la Cruz X. Molecular damage in Fabry disease: characterization and prediction of alpha-galactosidase A pathological mutations. Proteins 2014; 83:91-104. [PMID: 25382311 DOI: 10.1002/prot.24708] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 06/06/2014] [Revised: 09/25/2014] [Accepted: 10/18/2014] [Indexed: 12/12/2022]
Abstract
Loss-of-function mutations of the enzyme alpha-galactosidase A (GLA) cause Fabry disease (FD), a rare and potentially fatal disease. Identification of these pathological mutations by sequencing is important because it allows early treatment of the disease. However, before making any treatment decision, if the identified mutation is unknown, we first need to establish whether it is pathological. General bioinformatic tools (PolyPhen-2, SIFT, Condel, etc.) can be used for this purpose, but their performance is still limited. Here we present a new tool, specifically derived for the assessment of GLA mutations. We first compared mutations of this enzyme known to cause FD with neutral sequence variants, using several structure and sequence properties. Then, we used these properties to develop a family of prediction methods adapted to different quality requirements. Trained and tested on a set of known Fabry mutations, our methods have a performance (Matthews correlation: 0.56-0.72) comparable to or better than that of the more complex method, PolyPhen-2 (Matthews correlation: 0.61), and better than those of SIFT (Matthews correlation: 0.54) and Condel (Matthews correlation: 0.51). This result is validated in an independent set of 65 pathological mutations, for which our method displayed the best success rate (91.0%, 87.7%, and 73.8% for our method, PolyPhen-2 and SIFT, respectively). These data confirmed that our specific approach can effectively contribute to the identification of pathological mutations in GLA, and therefore enhance the use of sequence information in the identification of undiagnosed Fabry patients.
Affiliation(s)
- Casandra Riera
  - Research Unit in Translational Bioinformatics, Institut de Recerca Hospital Vall d'Hebron (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
|
142
|
Ryan NM, Morris SW, Porteous DJ, Taylor MS, Evans KL. SuRFing the genomics wave: an R package for prioritising SNPs by functionality. Genome Med 2014; 6:79. [PMID: 25400697 PMCID: PMC4224693 DOI: 10.1186/s13073-014-0079-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 06/20/2014] [Accepted: 09/26/2014] [Indexed: 12/16/2022] Open
Abstract
Identifying functional non-coding variants is one of the greatest unmet challenges in genetics. To help address this, we introduce an R package, SuRFR, which integrates functional annotation and prior biological knowledge to prioritise candidate functional variants. SuRFR is publicly available, modular, flexible, fast, and simple to use. We demonstrate that SuRFR performs with high sensitivity and specificity and provide a widely applicable and scalable benchmarking dataset for model training and validation. Website: http://www.cgem.ed.ac.uk/resources/
Affiliation(s)
- Niamh M Ryan
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, The University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh, EH4 2XU, UK
- Stewart W Morris
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, The University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh, EH4 2XU, UK
- David J Porteous
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, The University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh, EH4 2XU, UK
  - Centre for Cognitive Ageing and Cognitive Epidemiology, The University of Edinburgh, 7 George Square, Edinburgh, EH8 9JZ, UK
- Martin S Taylor
  - MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, The University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh, EH4 2XU, UK
- Kathryn L Evans
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, The University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh, EH4 2XU, UK
  - Centre for Cognitive Ageing and Cognitive Epidemiology, The University of Edinburgh, 7 George Square, Edinburgh, EH8 9JZ, UK
|
143
|
|
144
|
Zhang N, Zhou Y, Huang T, Zhang YC, Li BQ, Chen L, Cai YD. Discriminating between lysine sumoylation and lysine acetylation using mRMR feature selection and analysis. PLoS One 2014; 9:e107464. [PMID: 25222670 PMCID: PMC4164654 DOI: 10.1371/journal.pone.0107464] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 04/26/2014] [Accepted: 08/10/2014] [Indexed: 11/18/2022] Open
Abstract
Post-translational modifications (PTMs) are crucial steps in protein synthesis and are important factors contributing to protein diversity. PTMs play important roles in the regulation of gene expression, protein stability and metabolism. Lysine residues in protein sequences have been found to be targeted for both types of PTMs: sumoylations and acetylations; however, each PTM has a different cellular role. As experimental approaches are often laborious and time consuming, it is challenging to distinguish the two types of PTMs on lysine residues using computational methods. In this study, we developed a method to discriminate between sumoylated lysine residues and acetylated residues. The method incorporated several features: PSSM conservation scores, amino acid factors, secondary structures, solvent accessibilities and disorder scores. By using the mRMR (Maximum Relevance Minimum Redundancy) method and the IFS (Incremental Feature Selection) method, an optimal feature set was selected from all of the incorporated features, with which the classifier achieved 92.14% accuracy with an MCC value of 0.7322. Analysis of the optimal feature set revealed some differences between acetylation and sumoylation. The results from our study also supported the previous finding that there exist different consensus motifs for the two types of PTMs. The results could suggest possible dominant factors governing the acetylation and sumoylation of lysine residues, shedding some light on the modification dynamics and molecular mechanisms of the two types of PTMs, and provide guidelines for experimental validations.
Affiliation(s)
- Ning Zhang
  - Department of Biomedical Engineering, Tianjin Key Lab of Biomedical Engineering Measurement, Tianjin University, Tianjin, P.R. China
- You Zhou
  - Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences and Shanghai Jiao Tong University School of Medicine, Shanghai, P.R. China
- Tao Huang
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Yu-Chao Zhang
  - Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences and Shanghai Jiao Tong University School of Medicine, Shanghai, P.R. China
- Bi-Qing Li
  - Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, P.R. China
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai, P.R. China
- Yu-Dong Cai
  - Institute of Systems Biology, Shanghai University, Shanghai, P.R. China
|
145
|
Vihinen M. Majority vote and other problems when using computational tools. Hum Mutat 2014; 35:912-4. [PMID: 24915749 DOI: 10.1002/humu.22600] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 03/05/2014] [Accepted: 05/28/2014] [Indexed: 11/06/2022]
Abstract
Computational tools are essential for most of our research. To use these tools, one needs to know how they work. Problems in the application of computational methods to variation analysis can appear at several stages and affect, for example, the interpretation of results. Such cases are discussed along with suggestions on how to avoid them. The cases include incomplete reporting of methods, especially about the use of prediction tools; method selection on unscientific grounds and without consulting independent method performance assessments; extending the application area of methods outside their intended purpose; use of the same data several times for obtaining a majority vote; and filtering of datasets so that variants of interest are excluded. All these issues can be avoided by discontinuing the use of software tools as black boxes.
Affiliation(s)
- Mauno Vihinen
  - Department of Experimental Medical Science, BMC D10, Lund University, Lund, Sweden
|
146
|
Quintáns B, Ordóñez-Ugalde A, Cacheiro P, Carracedo A, Sobrido MJ. Medical genomics: The intricate path from genetic variant identification to clinical interpretation. Appl Transl Genom 2014; 3:60-7. [PMID: 27284505 PMCID: PMC4887840 DOI: 10.1016/j.atg.2014.06.001] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 06/02/2014] [Accepted: 06/02/2014] [Indexed: 01/23/2023]
Abstract
The field of medical genomics involves translating high-throughput genetic methods to the clinic in order to improve diagnostic efficiency and treatment decision making. Technical questions related to sample enrichment, sequencing methodologies, and variant identification and calling algorithms still need careful investigation in order to validate the analytical step of next-generation sequencing techniques for clinical applications. However, the main foreseeable challenge will be interpreting the clinical significance of the variants observed in a given patient, as well as their significance for family members and for other patients. Every step in the variant interpretation process has limitations and difficulties, and contributes its share of false positive and false negative results. No single piece of evidence is enough on its own to support firm conclusions on the pathogenicity and disease causality of a given variant. A plethora of automated analysis software tools is being developed that will enhance efficiency and accuracy. However, a risk of misinterpretation could derive from biased biorepository content, facilitated by annotation of variant functional consequences using previous datasets stored in the same or linked repositories. In order to improve variant interpretation and avoid an exponential accumulation of confounding noise in the medical literature, the use of terms in a standard way should be sought and requested when reporting genetic variants and their consequences. Generally, stepwise and linear interpretation processes are likely to overrate some pieces of evidence while underrating others. Algorithms are needed that allow a multidimensional, parallel analysis of diverse lines of evidence to be carried out by expert teams for specific genes, cellular pathways or disorders.
Affiliation(s)
- B Quintáns
  - Fundación Pública Galega de Medicina Xenómica and Instituto de Investigación Sanitaria, SERGAS, Santiago de Compostela, Spain
  - Centro para Investigación Biomédica en red de Enfermedades Raras (CIBERER), Instituto de Salud Carlos III, Spain
- A Ordóñez-Ugalde
  - Fundación Pública Galega de Medicina Xenómica and Instituto de Investigación Sanitaria, SERGAS, Santiago de Compostela, Spain
  - Universidade de Santiago de Compostela, Spain
- P Cacheiro
  - Fundación Pública Galega de Medicina Xenómica and Instituto de Investigación Sanitaria, SERGAS, Santiago de Compostela, Spain
  - Universidade de Santiago de Compostela, Spain
- A Carracedo
  - Fundación Pública Galega de Medicina Xenómica and Instituto de Investigación Sanitaria, SERGAS, Santiago de Compostela, Spain
  - Centro para Investigación Biomédica en red de Enfermedades Raras (CIBERER), Instituto de Salud Carlos III, Spain
  - Universidade de Santiago de Compostela, Spain
- M J Sobrido
  - Fundación Pública Galega de Medicina Xenómica and Instituto de Investigación Sanitaria, SERGAS, Santiago de Compostela, Spain
  - Centro para Investigación Biomédica en red de Enfermedades Raras (CIBERER), Instituto de Salud Carlos III, Spain
|
147
|
Ali H, Urolagin S, Gurarslan Ö, Vihinen M. Performance of Protein Disorder Prediction Programs on Amino Acid Substitutions. Hum Mutat 2014; 35:794-804. [DOI: 10.1002/humu.22564] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 12/20/2013] [Accepted: 04/04/2014] [Indexed: 01/04/2023]
Affiliation(s)
- Heidi Ali
  - Institute of Biomedical Technology; FI-33014 University of Tampere; Tampere Finland
  - BioMediTech; Tampere Finland
- Siddhaling Urolagin
  - Department of Experimental Medical Science; Lund University; SE-22184 Lund Sweden
- Ömer Gurarslan
  - Institute of Biomedical Technology; FI-33014 University of Tampere; Tampere Finland
  - BioMediTech; Tampere Finland
- Mauno Vihinen
  - Institute of Biomedical Technology; FI-33014 University of Tampere; Tampere Finland
  - BioMediTech; Tampere Finland
  - Department of Experimental Medical Science; Lund University; SE-22184 Lund Sweden
  - Tampere University Hospital; Tampere Finland
|
148
|
DEFLATE compression algorithm corrects for overestimation of phylogenetic diversity by Grantham approach to single-nucleotide polymorphism classification. Int J Mol Sci 2014; 15:8491-508. [PMID: 24828207 PMCID: PMC4057744 DOI: 10.3390/ijms15058491] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/22/2014] [Revised: 03/28/2014] [Accepted: 05/04/2014] [Indexed: 11/17/2022] Open
Abstract
Improvements in speed and cost of genome sequencing are resulting in increasing numbers of novel non-synonymous single nucleotide polymorphisms (nsSNPs) in genes known to be associated with disease. The large number of nsSNPs makes laboratory-based classification infeasible and familial co-segregation with disease is not always possible. In-silico methods for classification or triage are thus utilised. A popular tool based on multiple-species sequence alignments (MSAs) and work by Grantham, Align-GVGD, has been shown to underestimate deleterious effects, particularly as sequence numbers increase. We utilised the DEFLATE compression algorithm to account for expected variation across a number of species. With the adjusted Grantham measure we derived a means of quantitatively clustering known neutral and deleterious nsSNPs from the same gene; this was then used to assign novel variants to the most appropriate cluster as a means of binary classification. Scaling of clusters allows for inter-gene comparison of variants through a single pathogenicity score. The approach improves upon the classification accuracy of Align-GVGD while correcting for sensitivity to large MSAs. Open-source code and a web server are made available at https://github.com/aschlosberg/CompressGV.
|
149
|
Song T, Qu XF, Zhang YT, Cao W, Han BH, Li Y, Piao JY, Yin LL, Da Cheng H. Usefulness of the heart-rate variability complex for predicting cardiac mortality after acute myocardial infarction. BMC Cardiovasc Disord 2014; 14:59. [PMID: 24886422 PMCID: PMC4023175 DOI: 10.1186/1471-2261-14-59] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 02/08/2014] [Accepted: 04/28/2014] [Indexed: 11/10/2022] Open
Abstract
Background: Previous studies indicate that decreased heart-rate variability (HRV) is related to the risk of death in patients after acute myocardial infarction (AMI). However, the conventional indices of HRV have poor predictive value for mortality. Our aim was to develop novel predictive models based on support vector machine (SVM) to study the integrated features of HRV for improving risk stratification after AMI.
Methods: A series of heart-rate dynamic parameters from 208 patients were analyzed after a mean follow-up time of 28 months. Patient electrocardiographic data were classified as either survivors or cardiac deaths. SVM models were established based on different combinations of heart-rate dynamic variables and compared to left ventricular ejection fraction (LVEF), standard deviation of normal-to-normal intervals (SDNN) and deceleration capacity (DC) of heart rate. We tested the accuracy of predictors by assessing the area under the receiver-operator characteristics curve (AUC).
Results: We evaluated an SVM algorithm that integrated various electrocardiographic features based on three models: (A) HRV complex; (B) 6 dimension vector; and (C) 8 dimension vector. Mean AUC was 0.8902 for the HRV complex, 0.8880 for the 6 dimension vector and 0.8579 for the 8 dimension vector, compared with 0.7424 for LVEF, 0.7932 for SDNN and 0.7399 for DC.
Conclusions: HRV complex yielded the largest AUC and is the best classifier for predicting cardiac death after AMI.
Affiliation(s)
- Xiu Fen Qu
  - Department of Cardiology, the First Affiliated Hospital of Harbin Medical University, No. 23 Youzheng Street, Nangang District, Harbin City 150001, Heilongjiang Province, China
|
150
|
Mort M, Sterne-Weiler T, Li B, Ball EV, Cooper DN, Radivojac P, Sanford JR, Mooney SD. MutPred Splice: machine learning-based prediction of exonic variants that disrupt splicing. Genome Biol 2014; 15:R19. [PMID: 24451234 PMCID: PMC4054890 DOI: 10.1186/gb-2014-15-1-r19] [Citation(s) in RCA: 114] [Impact Index Per Article: 11.4] [Received: 11/11/2013] [Accepted: 01/13/2014] [Indexed: 11/16/2022] Open
Abstract
We have developed a novel machine-learning approach, MutPred Splice, for the identification of coding region substitutions that disrupt pre-mRNA splicing. Applying MutPred Splice to human disease-causing exonic mutations suggests that 16% of mutations causing inherited disease and 10 to 14% of somatic mutations in cancer may disrupt pre-mRNA splicing. For inherited disease, the main mechanism responsible for the splicing defect is splice site loss, whereas for cancer the predominant mechanism of splicing disruption is predicted to be exon skipping via loss of exonic splicing enhancers or gain of exonic splicing silencer elements. MutPred Splice is available at http://mutdb.org/mutpredsplice.
|