1
Shirvanizadeh N, Vihinen M. VariBench, new variation benchmark categories and data sets. FRONTIERS IN BIOINFORMATICS 2023; 3:1248732. [PMID: 37795169 PMCID: PMC10546188 DOI: 10.3389/fbinf.2023.1248732] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 09/08/2023] [Indexed: 10/06/2023] Open
Affiliation(s)
- Mauno Vihinen
  - Department of Experimental Medical Science, Lund University, Lund, Sweden

2
Li MM, Awasthi S, Ghosh S, Bisht D, Coban Akdemir ZH, Sheynkman GM, Sahni N, Yi SS. Gain-of-Function Variomics and Multi-omics Network Biology for Precision Medicine. Methods Mol Biol 2023; 2660:357-372. [PMID: 37191809 PMCID: PMC10476052 DOI: 10.1007/978-1-0716-3163-8_24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/17/2023]
Abstract
Traditionally, disease causal mutations were thought to disrupt gene function. However, it is becoming clearer that many deleterious mutations could exhibit a "gain-of-function" (GOF) behavior. Systematic investigation of such mutations has been lacking and largely overlooked. Advances in next-generation sequencing have identified thousands of genomic variants that perturb the normal functions of proteins, further contributing to diverse phenotypic consequences in disease. Elucidating the functional pathways rewired by GOF mutations will be crucial for prioritizing disease-causing variants and their resultant therapeutic liabilities. In distinct cell types (with varying genotypes), precise signal transduction controls cell decision, including gene regulation and phenotypic output. When signal transduction goes awry due to GOF mutations, it gives rise to various disease types. Quantitative and molecular understanding of network perturbations by GOF mutations may provide explanations for "missing heritability" in previous genome-wide association studies. We envision that it will be instrumental to push the current paradigm toward a thorough functional and quantitative modeling of all GOF mutations and their mechanistic molecular events involved in disease development and progression. Many fundamental questions pertaining to genotype-phenotype relationships remain unresolved. For example, which GOF mutations are key for gene regulation and cellular decisions? What are the GOF mechanisms at various regulation levels? How do interaction networks undergo rewiring upon GOF mutations? Is it possible to leverage GOF mutations to reprogram signal transduction in cells, aiming to cure disease? To begin to address these questions, we will cover a wide range of topics regarding GOF disease mutations and their characterization by multi-omic networks. We highlight the fundamental function of GOF mutations and discuss the potential mechanistic effects in the context of signaling networks. We also discuss advances in bioinformatic and computational resources, which will dramatically help with studies on the functional and phenotypic consequences of GOF mutations.
Affiliation(s)
- Mark M Li
  - Livestrong Cancer Institutes, Department of Oncology, Dell Medical School, The University of Texas at Austin, Austin, TX, USA
- Sharad Awasthi
  - Department of Epigenetics and Molecular Carcinogenesis, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Sumanta Ghosh
  - Department of Epigenetics and Molecular Carcinogenesis, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Deepa Bisht
  - Department of Epigenetics and Molecular Carcinogenesis, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Zeynep H Coban Akdemir
  - Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Gloria M Sheynkman
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA, USA
  - Center for Public Health Genomics, and UVA Comprehensive Cancer Center, University of Virginia, Charlottesville, VA, USA
- Nidhi Sahni
  - Department of Epigenetics and Molecular Carcinogenesis, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
  - Quantitative and Computational Biosciences Program, Baylor College of Medicine, Houston, TX, USA
- S Stephen Yi
  - Livestrong Cancer Institutes, Department of Oncology, Dell Medical School, The University of Texas at Austin, Austin, TX, USA
  - Oden Institute for Computational Engineering and Sciences (ICES), The University of Texas at Austin, Austin, TX, USA
  - Department of Biomedical Engineering, Cockrell School of Engineering, The University of Texas at Austin, Austin, TX, USA
  - Interdisciplinary Life Sciences Graduate Programs (ILSGP), College of Natural Sciences, The University of Texas at Austin, Austin, TX, USA

3
Identification of 22 novel BTK gene variants in B cell deficiency with hypogammaglobulinemia. Clin Immunol 2021; 229:108788. [PMID: 34182127 DOI: 10.1016/j.clim.2021.108788] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/30/2021] [Revised: 06/15/2021] [Accepted: 06/20/2021] [Indexed: 11/21/2022]
Abstract
X-linked agammaglobulinemia (XLA) is an inborn error of immunity caused by pathogenic variants in the BTK gene, resulting in impaired B cell differentiation and maturation. Over 900 variants have already been described in this gene; however, new pathogenic variants continue to be identified. In this report, we describe 22 novel variants in BTK, associated with B cell deficiency with hypo- or agammaglobulinemia in male patients or in asymptomatic female carriers. Genetic data were correlated with BTK protein expression by flow cytometry, and with clinical and family history, to obtain a comprehensive assessment of the clinico-pathologic significance of these new variants in the BTK gene. For one novel missense variant, p.Cys502Tyr, site-directed mutagenesis was performed to determine the impact of the sequence change on protein expression and stability. Genetic data should be correlated with protein and/or clinical and immunological data, whenever possible, to determine the clinical significance of the gene sequence alteration.

4
Zhou JB, Xiong Y, An K, Ye ZQ, Wu YD. IDRMutPred: predicting disease-associated germline nonsynonymous single nucleotide variants (nsSNVs) in intrinsically disordered regions. Bioinformatics 2021; 36:4977-4983. [PMID: 32756939 PMCID: PMC7755418 DOI: 10.1093/bioinformatics/btaa618] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/18/2019] [Revised: 06/28/2020] [Accepted: 07/01/2020] [Indexed: 01/09/2023] Open
Abstract
Motivation: Despite the lack of folded structure, intrinsically disordered regions (IDRs) of proteins play versatile roles in various biological processes, and many nonsynonymous single nucleotide variants (nsSNVs) in IDRs are associated with human diseases. The continuous accumulation of nsSNVs resulting from the wide application of NGS has driven the development of disease-association prediction methods for decades. However, their performance on nsSNVs in IDRs remains inferior, possibly due to the domination of nsSNVs from structured regions in training data. Therefore, a disease-association predictor built specifically for nsSNVs in IDRs, with better performance, is highly desirable. Results: We present IDRMutPred, a machine learning-based tool specifically for predicting disease-associated germline nsSNVs in IDRs. Based on 17 selected optimal features extracted from sequence alignments, protein annotations, hydrophobicity indices and disorder scores, IDRMutPred was trained using three ensemble learning algorithms on a training dataset containing only IDR nsSNVs. Evaluation on the two testing datasets shows that all three prediction models significantly outperform 17 other popular general predictors, achieving ACC between 0.856 and 0.868 and MCC between 0.713 and 0.737. IDRMutPred will prioritize disease-associated IDR germline nsSNVs more reliably than general predictors. Availability and implementation: The software is freely available at http://www.wdspdb.com/IDRMutPred. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jing-Bo Zhou
  - Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
- Yao Xiong
  - Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
- Ke An
  - Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
- Zhi-Qiang Ye
  - Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
  - Shenzhen Bay Laboratory, Shenzhen 518055, China
- Yun-Dong Wu
  - Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
  - Shenzhen Bay Laboratory, Shenzhen 518055, China
  - College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China

5
Zhang Y, Qiao S, Lu R, Han N, Liu D, Zhou J. How to balance the bioinformatics data: pseudo-negative sampling. BMC Bioinformatics 2019; 20:695. [PMID: 31874622 PMCID: PMC6929457 DOI: 10.1186/s12859-019-3269-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/18/2022] Open
Abstract
Background: Imbalanced datasets are commonly encountered in bioinformatics classification problems; that is, the number of negative samples is much larger than that of positive samples. In particular, data imbalance leads us to underestimate the performance of the minority class of positive samples. Therefore, how to balance bioinformatics data becomes a very challenging problem. Results: In this study, we propose a new data sampling approach, called pseudo-negative sampling, which can be effectively applied to the case in which negative samples greatly dominate positive samples. Specifically, we design a supervised learning method based on a max-relevance min-redundancy criterion beyond Pearson correlation coefficient (MMPCC), which is used to choose pseudo-negative samples from the negative samples and view them as positive samples. In addition, MMPCC uses an incremental searching technique to select optimal pseudo-negative samples to reduce the computation cost. Consequently, the discovered pseudo-negative samples have strong relevance to positive samples and less redundancy to negative ones. Conclusions: To validate the performance of our method, we conduct experiments based on four UCI datasets and three real bioinformatics datasets. According to the experimental results, the performance of MMPCC is better than that of other sampling methods in terms of sensitivity, specificity, accuracy and the Matthews correlation coefficient. This reveals that the pseudo-negative samples are particularly helpful for solving the imbalanced dataset problem. Moreover, the gain in sensitivity from the minority samples with pseudo-negative samples grows with the improvement of prediction accuracy on all datasets.
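The four evaluation measures named in this abstract (sensitivity, specificity, accuracy and the Matthews correlation coefficient) all follow from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions and is not code from the paper; the function name and the example counts are hypothetical, chosen only to show how accuracy can look strong on an imbalanced test set while sensitivity and MCC stay low.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification measures from confusion-matrix counts."""
    total = tp + fp + tn + fn
    # Fraction of true positives that were recovered (recall on the minority class).
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Fraction of true negatives that were recovered.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # MCC is conventionally set to 0 when any marginal total is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical imbalanced test set: 50 positives, 950 negatives.
metrics = classification_metrics(tp=10, fp=5, tn=945, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is high because the majority class dominates, whereas sensitivity and MCC expose the poor recovery of positives, which is the situation the abstract describes.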
Affiliation(s)
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
  - School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Shaojie Qiao
  - School of Software Engineering, Chengdu University of Information Technology, Chengdu, 610225, China
  - Software Automatic Generation and Intelligent Service Key Laboratory of Sichuan Province, Chengdu University of Information Technology, Chengdu, 610225, China
- Rongzhao Lu
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Nan Han
  - School of Management, Chengdu University of Information Technology, Chengdu, 610103, China
- Dingxiang Liu
  - School of Cybersecurity, Chengdu University of Information Technology, Chengdu, 610225, China
- Jiliu Zhou
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China

6
Padilla N, Moles-Fernández A, Riera C, Montalban G, Özkan S, Ootes L, Bonache S, Díez O, Gutiérrez-Enríquez S, de la Cruz X. BRCA1- and BRCA2-specific in silico tools for variant interpretation in the CAGI 5 ENIGMA challenge. Hum Mutat 2019; 40:1593-1611. [PMID: 31112341 DOI: 10.1002/humu.23802] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 01/08/2019] [Revised: 05/15/2019] [Accepted: 05/17/2019] [Indexed: 11/09/2022]
Abstract
BRCA1 and BRCA2 (BRCA1/2) germline variants disrupting the DNA protective role of these genes increase the risk of hereditary breast and ovarian cancers. Correct identification of these variants is therefore clinically relevant, because it may increase the survival rates of the carriers. Unfortunately, we are still unable to systematically predict the impact of BRCA1/2 variants. In this article, we present a family of in silico predictors that address this problem, using a gene-specific approach. For each protein, we have developed two tools, aimed at predicting the impact of a variant at two different levels: functional and clinical. Testing their performance in different datasets shows that specific information compensates for the small number of predictive features and the reduced training sets employed to develop our models. When applied to the variants of the BRCA1/2 (ENIGMA) challenge in the fifth Critical Assessment of Genome Interpretation (CAGI 5), we find that these methods, particularly those predicting the functional impact of variants, perform well, identifying the large compositional bias towards neutral variants in the CAGI sample. This performance is further improved when estimates of the target variant's impact on splicing are incorporated into our prediction protocol.
Affiliation(s)
- Natàlia Padilla
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Casandra Riera
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Gemma Montalban
  - Oncogenetics Group, Vall d'Hebron Institute of Oncology (VHIO), Barcelona, Spain
- Selen Özkan
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Lars Ootes
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Sandra Bonache
  - Oncogenetics Group, Vall d'Hebron Institute of Oncology (VHIO), Barcelona, Spain
- Orland Díez
  - Oncogenetics Group, Vall d'Hebron Institute of Oncology (VHIO), Barcelona, Spain
  - Area of Clinical and Molecular Genetics, University Hospital of Vall d'Hebron, Barcelona, Spain
- Xavier de la Cruz
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain

7
de la Campa EÁ, Padilla N, de la Cruz X. Development of pathogenicity predictors specific for variants that do not comply with clinical guidelines for the use of computational evidence. BMC Genomics 2017; 18:569. [PMID: 28812538 PMCID: PMC5558188 DOI: 10.1186/s12864-017-3914-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 01/21/2023] Open
Abstract
Background: Strict guidelines delimit the use of computational information in the clinical setting, due to the still moderate accuracy of in silico tools. These guidelines indicate that several tools should always be used and that full coincidence between them is required if we want to consider their results as supporting evidence in medical decision processes. Application of this simple rule certainly decreases the error rate of in silico pathogenicity assignments. However, when predictors disagree, this rule results in the rejection of potentially valuable information for a number of variants. In this work, we focus on these variants of the protein sequence and develop specific predictors to help improve the success rate of their annotation. Results: We have used a set of 59,442 protein sequence variants (15,723 pathological and 43,719 neutral) from 228 proteins to identify those cases for which pathogenicity predictors disagree. We have repeated this process for all the possible combinations of five known methods (SIFT, PolyPhen-2, PON-P2, CADD and MutationTaster2). For each resulting subset we have trained a specific pathogenicity predictor. We find that these specific predictors are able to discriminate between neutral and pathogenic variants, with a success rate different from random. They tend to outperform the constitutive methods, but this trend decreases as the performance of the constitutive predictor improves (e.g. with PON-P2 and PolyPhen-2). We also find that specific methods outperform standard consensus methods (Condel and CAROL). Conclusion: By focusing development efforts on variants for which known methods disagree, we may obtain pathogenicity predictors with improved performance. Although we have not yet reached the success rate that allows the use of this computational evidence in a clinical setting, the simplicity of the approach indicates that more advanced methods may reach this goal in the near future.
Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-3914-0) contains supplementary material, which is available to authorized users.
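The subset-building step in this abstract (collecting, for each combination of the five methods, the variants on which those methods disagree) can be illustrated with a short sketch. It is not the authors' code. It assumes each tool's output has already been reduced to a binary pathogenic/neutral call, that combinations of two or more tools are meant, and that "disagree" means the tools in a combination do not all give the same call; the function name and toy variants are hypothetical.

```python
from itertools import combinations

# The five methods named in the abstract.
TOOLS = ["SIFT", "PolyPhen-2", "PON-P2", "CADD", "MutationTaster2"]


def disagreement_subsets(predictions, tools=TOOLS, min_size=2):
    """Map each tool combination to the variants on which its members disagree.

    predictions: dict of variant id -> dict of tool name -> bool
                 (True = predicted pathogenic; binarisation is an assumption).
    """
    subsets = {}
    for size in range(min_size, len(tools) + 1):
        for combo in combinations(tools, size):
            # A variant qualifies when the tools in this combination
            # produce more than one distinct call for it.
            subsets[combo] = [
                variant
                for variant, calls in predictions.items()
                if len({calls[tool] for tool in combo}) > 1
            ]
    return subsets


# Toy, made-up calls for three hypothetical variants.
toy_predictions = {
    "var1": {tool: True for tool in TOOLS},
    "var2": {**{tool: True for tool in TOOLS}, "PolyPhen-2": False},
    "var3": {tool: False for tool in TOOLS},
}

subsets = disagreement_subsets(toy_predictions)
print(len(subsets), "tool combinations")
print(subsets[("SIFT", "PolyPhen-2")])
```

Each resulting subset would then serve as the training set for its own specific predictor, as the abstract describes; the training step is omitted here because the abstract does not specify the model.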
Affiliation(s)
- Elena Álvarez de la Campa
  - Research Unit in Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
  - Department of Molecular Genomics, Instituto de Biología Molecular de Barcelona (IBMB), Consejo Superior de Investigaciones Científicas (CSIC), Barcelona, Spain
- Natàlia Padilla
  - Research Unit in Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Xavier de la Cruz
  - Research Unit in Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
  - ICREA, Barcelona, Spain

8
Analysis of somatic mutations across the kinome reveals loss-of-function mutations in multiple cancer types. Sci Rep 2017; 7:6418. [PMID: 28743916 PMCID: PMC5527104 DOI: 10.1038/s41598-017-06366-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 07/14/2016] [Accepted: 06/13/2017] [Indexed: 12/17/2022] Open
Abstract
In this study we use somatic cancer mutations to identify important functional residues within sets of related genes. We focus on protein kinases, a superfamily of phosphotransferases that share homologous sequences and structural motifs and have many connections to cancer. We develop several statistical tests for identifying Significantly Mutated Positions (SMPs), which are positions in an alignment with mutations that show signs of selection. We apply our methods to 21,917 mutations that map to the alignment of human kinases and identify 23 SMPs. SMPs occur throughout the alignment, with many in the important A-loop region, and others spread between the N and C lobes of the kinase domain. Since mutations are pooled across the superfamily, these positions may be important to many protein kinases. We select eleven mutations from these positions for functional validation. All eleven mutations cause a reduction or loss of function in the affected kinase. The tested mutations are from four genes, including two tumor suppressors (TGFBR1 and CHEK2) and two oncogenes (KDR and ERBB2). They also represent multiple cancer types, and include both recurrent and non-recurrent events. Many of these mutations warrant further investigation as potential cancer drivers.

9
Bromberg Y, Capriotti E, Carter H. VarI-SIG 2015: methods for personalized medicine - the role of variant interpretation in research and diagnostics. BMC Genomics 2016; 17 Suppl 2:425. [PMID: 27357578 PMCID: PMC4928159 DOI: 10.1186/s12864-016-2721-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Affiliation(s)
- Yana Bromberg
  - Department of Biochemistry and Microbiology, Rutgers University, Lipman Hall 218, 08901, New Brunswick, NJ, USA
  - Department of Genetics, Rutgers University, Lipman Hall 218, 08901, New Brunswick, NJ, USA
- Emidio Capriotti
  - Institute for Mathematical Modeling of Biological Systems, Department of Biology, Heinrich Heine University Düsseldorf, Universitaetsstr. 1, 40225, Düsseldorf, Germany
- Hannah Carter
  - Division of Medical Genetics, Department of Medicine, University of California, San Diego, 9500 Gilman Dr., 92093, La Jolla, CA, USA