1. Álvarez-Machancoses Ó, Faraggi E, deAndrés-Galiana EJ, Fernández-Martínez JL, Kloczkowski A. Prediction of Deleterious Single Amino Acid Polymorphisms with a Consensus Holdout Sampler. Curr Genomics 2024; 25:171-184. [PMID: 39086995 PMCID: PMC11288160 DOI: 10.2174/0113892029236347240308054538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2023] [Revised: 08/03/2023] [Accepted: 09/22/2023] [Indexed: 08/02/2024] Open
Abstract
Background: Single Amino Acid Polymorphisms (SAPs), or nonsynonymous Single Nucleotide Variants (nsSNVs), are the most common genetic variations. They result from missense mutations, in which a single base pair substitution changes the genetic code so that the triplet of bases (codon) at a given position codes for a different amino acid. Since genetic mutations sometimes cause genetic diseases, it is important to understand and predict which variations are harmful and which are neutral (not causing changes in the phenotype). This can be posed as a classification problem. Methods: Computational methods using machine intelligence are gradually replacing repetitive and exceedingly expensive mutagenesis tests. By and large, the uneven quality, deficiencies, and irregularities of nsSNV datasets degrade the usefulness of artificial intelligence-based methods. Consequently, more robust and accurate approaches are needed to address these problems. In the present paper, we show a consensus classifier built on the holdout sampler, which appears robust and accurate and outperforms all other popular methods. Results: We produced 100 holdouts to test the structures and diverse classification variables of diverse classifiers during the training phase. The best-performing holdouts were chosen to develop a consensus classifier and tested using a k-fold (1 ≤ k ≤ 5) cross-validation method. We also examined which protein properties have the biggest impact on the precise prediction of the effects of nsSNVs. Conclusion: Our Consensus Holdout Sampler outperforms other popular algorithms and gives excellent results, with high accuracy and low standard deviation. The advantage of our method emerges from using a tree of holdouts, where diverse ML/AI-based programs are sampled in diverse ways.
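The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of building a consensus from the best-performing random holdouts. The decision stump is a deliberately simple stand-in for the "diverse classifiers" mentioned above, and every name and parameter here (`fit_stump`, `consensus_holdout`, `keep`, `test_frac`) is hypothetical rather than taken from the paper.

```python
import random
from statistics import mean


def stump_accuracy(threshold, xs, ys):
    """Accuracy of the rule 'predict 1 when x >= threshold'."""
    return mean(1.0 if (x >= threshold) == bool(y) else 0.0 for x, y in zip(xs, ys))


def fit_stump(xs, ys):
    """Pick the threshold (among observed values) with the best training accuracy."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(xs)):
        acc = stump_accuracy(t, xs, ys)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def consensus_holdout(xs, ys, n_holdouts=100, test_frac=0.25, keep=10, seed=0):
    """Train one stump per random holdout, keep the best ones, and vote."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    n_test = max(1, int(len(idx) * test_frac))
    scored = []
    for _ in range(n_holdouts):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        t = fit_stump([xs[i] for i in train], [ys[i] for i in train])
        acc = stump_accuracy(t, [xs[i] for i in test], [ys[i] for i in test])
        scored.append((acc, t))
    # Keep the thresholds from the holdouts that scored best on their held-out part.
    scored.sort(reverse=True)
    best = [t for _, t in scored[:keep]]

    def predict(x):
        votes = sum(1 for t in best if x >= t)
        return 1 if 2 * votes > len(best) else 0  # strict majority vote

    return predict


# Toy 1-D data (illustrative only): label 1 stands for "deleterious".
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
predict = consensus_holdout(xs, ys)
print(predict(0.3), predict(0.9))
```

In a fuller version the consensus would then be evaluated on data not used by any holdout, for example with the cross-validation the abstract mentions.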
Affiliation(s)
- Óscar Álvarez-Machancoses
  - Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, C. Federico García Lorca, 18, 33007, Oviedo, Spain
- Eshel Faraggi
  - School of Science, Indiana University–Purdue University Indianapolis, IN, USA
- Enrique J. deAndrés-Galiana
  - Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, C. Federico García Lorca, 18, 33007, Oviedo, Spain
  - Department of Computer Science, University of Oviedo, C. Federico García Lorca, 18, 33007, Oviedo, Spain
- Juan L. Fernández-Martínez
  - Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, C. Federico García Lorca, 18, 33007, Oviedo, Spain
- Andrzej Kloczkowski
  - Institute for Genomic Medicine, Nationwide Children’s Hospital, Columbus, OH, USA
  - Department of Pediatrics, The Ohio State University, Columbus, OH, USA

2. Zhang M, Gong C, Ge F, Yu DJ. FCMSTrans: Accurate Prediction of Disease-Associated nsSNPs by Utilizing Multiscale Convolution and Deep Feature Combination within a Transformer Framework. J Chem Inf Model 2024; 64:1394-1406. [PMID: 38349747 DOI: 10.1021/acs.jcim.3c02025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
Nonsynonymous single-nucleotide polymorphisms (nsSNPs), implicated in over 6000 diseases, necessitate accurate prediction for expedited drug discovery and improved disease diagnosis. In this study, we propose FCMSTrans, a novel nsSNP predictor that innovatively combines the transformer framework and multiscale modules for comprehensive feature extraction. The distinctive attribute of FCMSTrans resides in a deep feature combination strategy. This strategy amalgamates evolutionary-scale modeling (ESM) and ProtTrans (PT) features, providing an understanding of protein biochemical properties, and position-specific scoring matrix, secondary structure, predicted relative solvent accessibility, and predicted disorder (PSPP) features, which are derived from four protein sequences and structure-oriented characteristics. This feature combination offers a comprehensive view of the molecular dynamics involving nsSNPs. Our model employs the transformer's self-attention mechanisms across multiple layers, extracting higher-level and abstract representations. Simultaneously, varied-level features are captured by multiscale convolutions, enriching feature abstraction at multiple echelons. Our comparative analyses with existing methodologies highlight significant improvements made possible by the integrated feature fusion approach adopted in FCMSTrans. This is further substantiated by performance assessments based on diverse data sets, such as PredictSNP, MMP, and PMD, with areas under the curve (AUCs) of 0.869, 0.819, and 0.693, respectively. Furthermore, FCMSTrans shows robustness and superiority by outperforming the current best predictor, PROVEAN, in a blind test conducted on a third-party data set, achieving an impressive AUC score of 0.7838. The Python code of FCMSTrans is available at https://github.com/gc212/FCMSTrans for academic usage.
Affiliation(s)
- Ming Zhang
  - School of Computer, Jiangsu University of Science and Technology, 666 Changhui Road, Zhenjiang 212100, China
- Chao Gong
  - School of Computer, Jiangsu University of Science and Technology, 666 Changhui Road, Zhenjiang 212100, China
- Fang Ge
  - State Key Laboratory of Organic Electronics and Information Displays & Institute of Advanced Materials (IAM), Nanjing University of Posts & Telecommunications, 9 Wenyuan Road, Nanjing 210023, China
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China

3. Shahjahan, Dey JK, Dey SK. Translational bioinformatics approach to combat cardiovascular disease and cancers. Advances in Protein Chemistry and Structural Biology 2024; 139:221-261. [PMID: 38448136 DOI: 10.1016/bs.apcsb.2023.11.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/08/2024]
Abstract
Bioinformatics is an interdisciplinary subject that draws on diverse fields, including biology, chemistry, physics, statistics, mathematics, and computer science, to answer complicated physiological problems. The key intention of bioinformatics is to store, analyze, organize, and retrieve essential information about the genome, proteome, transcriptome, and metabolome, as well as about organisms, in order to investigate the biological system along with its dynamics, if any. The outcome of bioinformatics depends on the type, quantity, and quality of the raw data provided and on the algorithm employed to analyze them. Despite the several approved medicines available, cardiovascular disorders (CVDs) and cancers are the two leading causes of human deaths. Understanding the unknown facts of both these non-communicable disorders is essential to discover new pathways, find new drug targets, and eventually develop newer drugs to combat them successfully. Since all these goals involve complex investigation and handling of various types of macromolecules and small molecules of the human body, bioinformatics plays a key role in such processes. Results from such investigations have direct human application, and thus we call this field translational bioinformatics. The current book chapter thus deals with the diverse scope and applications of translational bioinformatics in finding cures for, diagnosing, and understanding the mechanisms of CVDs and cancers. Developing complex algorithms, whether short or long, to address such problems is very common in translational bioinformatics. Structure-based drug discovery and AI-guided invention of novel antibodies, with very high accuracy and speed and with considerably low investment, are some of the astonishing features of translational bioinformatics and its applications in the fields of CVDs and cancers.
Affiliation(s)
- Shahjahan
  - Laboratory for Structural Biology of Membrane Proteins, Dr. B.R. Ambedkar Center for Biomedical Research, University of Delhi, Delhi, India
- Joy Kumar Dey
  - Central Council for Research in Homoeopathy, Ministry of Ayush, Govt. of India, New Delhi, Delhi, India
- Sanjay Kumar Dey
  - Laboratory for Structural Biology of Membrane Proteins, Dr. B.R. Ambedkar Center for Biomedical Research, University of Delhi, Delhi, India

4. Capriotti E, Fariselli P. Evaluating the relevance of sequence conservation in the prediction of pathogenic missense variants. Hum Genet 2022; 141:1649-1658. [DOI: 10.1007/s00439-021-02419-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/06/2021] [Accepted: 12/12/2021] [Indexed: 12/28/2022]

5. MutTMPredictor: Robust and accurate cascade XGBoost classifier for prediction of mutations in transmembrane proteins. Comput Struct Biotechnol J 2021; 19:6400-6416. [PMID: 34938415 PMCID: PMC8649221 DOI: 10.1016/j.csbj.2021.11.024] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 08/03/2021] [Revised: 11/05/2021] [Accepted: 11/15/2021] [Indexed: 12/11/2022] Open
Abstract
Prediction of mutations in transmembrane proteins is significant for disease diagnosis. Building on evolutionary information, the Gaussian WAPSSM algorithm is proposed. Based on WAPSSM and on sequence- and structure-based features, the cascade XGBoost algorithm is proposed. The webserver is freely available at http://csbio.njust.edu.cn/bioinf/ffmsresmutp/. MutTMPredictor is implemented to predict mutations in transmembrane proteins.
Transmembrane proteins have critical biological functions and play a role in a multitude of cellular processes including cell signaling, transport of molecules and ions across membranes. Approximately 60% of transmembrane proteins are considered as drug targets. Missense mutations in such proteins can lead to many diverse diseases and disorders, such as neurodegenerative diseases and cystic fibrosis. However, there are limited studies on mutations in transmembrane proteins. In this work, we first design a new feature encoding method, termed weight attenuation position-specific scoring matrix (WAPSSM), which builds upon the protein evolutionary information. Then, we propose a new mutation prediction algorithm (cascade XGBoost) by leveraging the idea learned from consensus predictors and gcForest. Multi-level experiments illustrate the effectiveness of WAPSSM and cascade XGBoost algorithms. Finally, based on WAPSSM and other three types of features, in combination with the cascade XGBoost algorithm, we develop a new transmembrane protein mutation predictor, named MutTMPredictor. We benchmark the performance of MutTMPredictor against several existing predictors on seven datasets. On the 546 mutations dataset, MutTMPredictor achieves the accuracy (ACC) of 0.9661 and the Matthew’s Correlation Coefficient (MCC) of 0.8950. While on the 67,584 dataset, MutTMPredictor achieves an MCC of 0.7523 and area under curve (AUC) of 0.8746, which are 0.1625 and 0.0801 respectively higher than those of the existing best predictor (fathmm). Besides, MutTMPredictor also outperforms two specific predictors on the Pred-MutHTP datasets. The results suggest that MutTMPredictor can be used as an effective method for predicting and prioritizing missense mutations in transmembrane proteins. The MutTMPredictor webserver and datasets are freely accessible at http://csbio.njust.edu.cn/bioinf/muttmpredictor/ for academic use.
Key Words
- 1000 Genomes, 1000 genomes project consortium
- APOGEE, pathogenicity prediction through the logistic model tree
- BorodaTM, boosted regression trees for disease-associated mutations in transmembrane proteins
- COSMIC, catalogue of somatic mutations in cancer
- Cascade XGBoost
- ClinVar, clinical variants
- Condel, consensus deleteriousness score of missense mutations
- Disease-associated mutations
- Entprise, entropy and predicted protein structure
- ExAC, the exome aggregation consortium
- Meta-SNP, meta single nucleotide polymorphism
- Mutation prediction
- PROVEAN, protein variation effect analyzer
- PolyPhen, polymorphism phenotyping
- PolyPhen-2, polymorphism phenotyping v2
- Pred-MutHTP, prediction of mutations in human transmembrane proteins
- PredictSNP1, predict single nucleotide polymorphism v1
- Protein evolutionary information
- REVEL, rare exome variant ensemble learner
- SDM, site-directed mutate
- SIFT, sorting intolerant from tolerant
- SNAP, screening for non-acceptable polymorphisms
- SNP&GO, single nucleotide polymorphisms and gene ontology annotations
- SwissVar, variants in UniProtKB/Swiss-Prot
- TMSNP, transmembrane single nucleotide polymorphisms
- Transmembrane protein
- WEKA, waikato environment for knowledge analysis
- fathmm, functional analysis through hidden markov models
- humsavar, human polymorphisms and disease mutations

6. Tang YY, Wei PJ, Zhao JP, Xia J, Cao RF, Zheng CH. Identification of driver genes based on gene mutational effects and network centrality. BMC Bioinformatics 2021; 22:457. [PMID: 34560840 PMCID: PMC8461858 DOI: 10.1186/s12859-021-04377-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2021] [Accepted: 08/23/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: As one of the deadliest diseases in the world, cancer is driven by a few somatic mutations that disrupt the normal growth of cells and lead to abnormal proliferation and tumor development. The vast majority of somatic mutations do not affect the occurrence and development of cancer; thus, identifying the mutations responsible for tumor occurrence and development is one of the main targets of current cancer treatments. RESULTS: To effectively identify driver genes, we adopted a semi-local centrality measure and a gene mutation effect function to assess the effect of gene mutations on changes in gene expression patterns. Firstly, we calculated the mutation score for each gene. Secondly, we identified differentially expressed genes (DEGs) in the cohort by comparing the expression profiles of tumor samples and normal samples, and then constructed a local network for each mutant gene using DEGs and mutant genes according to the protein-protein interaction network. Finally, we calculated the score of each mutant gene according to the objective function. The top-ranking mutant genes were selected as driver genes. We named the proposed method "mutations effect and network centrality". CONCLUSIONS: Four types of cancer data in The Cancer Genome Atlas were tested. The experimental data showed that our method was superior to the existing network-centric method, as it was able to quickly and easily identify driver genes and rare driver factors.
Affiliation(s)
- Yun-Yun Tang
  - Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, College of Computer Science and Technology, Anhui University, Hefei, China
- Pi-Jing Wei
  - Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, College of Computer Science and Technology, Anhui University, Hefei, China
- Jian-Ping Zhao
  - College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
- Junfeng Xia
  - Institute of Physical Science and Information Technology, Anhui University, Hefei, China
- Rui-Fen Cao
  - Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, College of Computer Science and Technology, Anhui University, Hefei, China
  - Engineering Research Center of Big Data Application in Private Health Medicine, Fujian Province University, Putian, Fujian, China
- Chun-Hou Zheng
  - Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, College of Computer Science and Technology, Anhui University, Hefei, China
  - College of Mathematics and System Sciences, Xinjiang University, Urumqi, China

7. Periwal N, Rathod SB, Pal R, Sharma P, Nebhnani L, Barnwal RP, Arora P, Srivastava KR, Sood V. In silico characterization of mutations circulating in SARS-CoV-2 structural proteins. J Biomol Struct Dyn 2021; 40:8216-8231. [PMID: 33797336 PMCID: PMC8043164 DOI: 10.1080/07391102.2021.1908170] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/15/2022]
Abstract
SARS-CoV-2 has recently emerged as a pandemic that has caused more than 2.4 million deaths worldwide. Since the onset of infections, several full-length sequences of the viral genome have been made available, which have been used to gain insights into viral dynamics. We utilised a meta-data driven comparative analysis tool for sequences (Meta-CATS) algorithm to identify mutations in 829 SARS-CoV-2 genomes from around the world. The algorithm predicted sixty-one mutations among SARS-CoV-2 genomes. We observed that most of the mutations were concentrated around three protein coding genes, viz. nsp3 (non-structural protein 3), RdRp (RNA-directed RNA polymerase) and Nucleocapsid (N) proteins of SARS-CoV-2. We used various computational tools including normal mode analysis (NMA), C-α discrete molecular dynamics (DMD) and all-atom molecular dynamic simulations (MD) to study the effect of mutations on functionality, stability and flexibility of SARS-CoV-2 structural proteins including envelope (E), N and spike (S) proteins. The PredictSNP predictor suggested that four mutations out of seven (L37H in E, R203K and P344S in N and D614G in S) were neutral, whilst the remaining ones (P13L, S197L and G204R in N) were predicted to be deleterious in nature, thereby impacting protein functionality. NMA, C-α DMD and all-atom MD suggested that some mutations have stabilizing roles (P13L, S197L and R203K in N protein), whereas the remaining ones were predicted to destabilize the mutant protein. In summary, we identified significant mutations in SARS-CoV-2 genomes and used computational approaches to further characterize the possible effect of highly significant mutations on SARS-CoV-2 structural proteins. Communicated by Ramaswamy H. Sarma
Affiliation(s)
- Neha Periwal
  - Department of Biochemistry, School of Chemical & Life Sciences, Jamia Hamdard, New Delhi, India
- Shravan B Rathod
  - Department of Chemistry, Smt. S. M. Panchal Science College, Talod, India
- Ranjan Pal
  - Biocatalysis and Enzyme Engineering Lab, Regional Centre for Biotechnology, Faridabad, India
- Priya Sharma
  - Department of Biochemistry, School of Chemical & Life Sciences, Jamia Hamdard, New Delhi, India
- Lata Nebhnani
  - Department of Chemistry, Gujarat University, Ahmedabad, India
- Ravi P Barnwal
  - Department of Biophysics, Panjab University, Chandigarh, India
- Pooja Arora
  - Department of Zoology, Hansraj College, University of Delhi, New Delhi, India
- Kinshuk Raj Srivastava
  - Biocatalysis and Enzyme Engineering Lab, Regional Centre for Biotechnology, Faridabad, India
- Vikas Sood
  - Department of Biochemistry, School of Chemical & Life Sciences, Jamia Hamdard, New Delhi, India

8. Benevenuta S, Capriotti E, Fariselli P. Calibrating variant-scoring methods for clinical decision making. Bioinformatics 2021; 36:5709-5711. [PMID: 33492342 PMCID: PMC8023678 DOI: 10.1093/bioinformatics/btaa943] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/28/2020] [Revised: 09/27/2020] [Accepted: 10/28/2020] [Indexed: 12/22/2022] Open
Abstract
Summary: Identifying pathogenic variants and annotating them is a major challenge in human genetics, especially for the non-coding ones. Several tools have been developed and used to predict the functional effect of genetic variants. However, the calibration assessment of the predictions has received little attention. Calibration refers to the idea that if a model predicts a group of variants to be pathogenic with a probability P, it is expected that the same fraction P of true positives is found in the observed set. For instance, a well-calibrated classifier should label the variants such that, among the ones to which it gave a probability value close to 0.7, approximately 70% actually belong to the pathogenic class. Poorly calibrated algorithms can be misleading and potentially harmful for clinical decision making. Availability and implementation: The dataset used for testing the methods is available through the DOI:10.5281/zenodo.4448197. Supplementary information: Supplementary data are available at Bioinformatics online.
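The notion of calibration described above can be checked with a simple binned comparison of predicted probability against observed pathogenic fraction. The sketch below is a generic illustration and not the authors' procedure: the function names, the equal-width binning, and the summary score (a weighted gap, often called expected calibration error) are assumptions introduced here.

```python
def calibration_table(probs, labels, n_bins=10):
    """Group predictions into equal-width probability bins.

    Returns one (mean predicted probability, observed positive fraction, count)
    tuple per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[i].append((p, y))
    rows = []
    for items in bins:
        if items:
            mean_p = sum(p for p, _ in items) / len(items)
            frac_pos = sum(y for _, y in items) / len(items)
            rows.append((mean_p, frac_pos, len(items)))
    return rows


def expected_calibration_error(probs, labels, n_bins=10):
    """Average |predicted - observed| gap over bins, weighted by bin size."""
    n = len(probs)
    return sum(
        count / n * abs(mean_p - frac_pos)
        for mean_p, frac_pos, count in calibration_table(probs, labels, n_bins)
    )


# Ten variants scored 0.7, seven of them truly pathogenic: well calibrated.
well = expected_calibration_error([0.7] * 10, [1] * 7 + [0] * 3)
# Ten variants scored 0.9, only five truly pathogenic: overconfident.
over = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
print(round(well, 3), round(over, 3))
```

A gap near zero in every bin corresponds to the "70% of variants scored 0.7 are pathogenic" situation in the abstract; a large gap signals over- or under-confident scores.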
Affiliation(s)
- Silvia Benevenuta
  - Department of Medical Sciences, University of Torino, Via Santena, 19, 10126, Torino, Italy
- Emidio Capriotti
  - BioFolD Unit, Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Via Selmi 3, 40126, Bologna, Italy
- Piero Fariselli
  - Department of Medical Sciences, University of Torino, Via Santena, 19, 10126, Torino, Italy

9. Ge F, Hu J, Zhu YH, Arif M, Yu DJ. TargetMM: Accurate Missense Mutation Prediction by Utilizing Local and Global Sequence Information with Classifier Ensemble. Comb Chem High Throughput Screen 2021; 25:38-52. [DOI: 10.2174/1386207323666201204140438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2020] [Revised: 10/22/2020] [Accepted: 10/26/2020] [Indexed: 11/22/2022]
Abstract
Aim and Objective: Missense mutation (MM) may lead to various human diseases by disabling proteins. Accurate prediction of MM is important and challenging for both protein function annotation and drug design. Although several computational methods yielded acceptable success rates, there is still room for further enhancing the prediction performance of MM.
Materials and Methods: In the present study, we designed a new feature extracting method, which considers the impact degree of residues in the microenvironment range to the mutation site. Stringent cross-validation and independent test on benchmark datasets were performed to evaluate the efficacy of the proposed feature extracting method. Furthermore, three heterogeneous prediction models were trained and then ensembled for the final prediction. By combining the feature representation method and classifier ensemble technique, we reported a novel MM predictor called TargetMM for identifying the pathogenic mutations from the neutral ones.
Results: Comparison outcomes based on statistical evaluation demonstrate that TargetMM outperforms the prior advanced methods on the independent test data. The source codes and benchmark datasets of TargetMM are freely available at https://github.com/sera616/TargetMM.git for academic use.
Affiliation(s)
- Fang Ge
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Jun Hu
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Yi-Heng Zhu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Muhammad Arif
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

10. Sanavia T, Birolo G, Montanucci L, Turina P, Capriotti E, Fariselli P. Limitations and challenges in protein stability prediction upon genome variations: towards future applications in precision medicine. Comput Struct Biotechnol J 2020; 18:1968-1979. [PMID: 32774791 PMCID: PMC7397395 DOI: 10.1016/j.csbj.2020.07.011] [Citation(s) in RCA: 74] [Impact Index Per Article: 18.5] [Received: 03/15/2020] [Revised: 07/10/2020] [Accepted: 07/14/2020] [Indexed: 12/13/2022] Open
Abstract
Protein stability predictions are becoming essential in medicine to develop novel immunotherapeutic agents and for drug discovery. Despite the large number of computational approaches for predicting the protein stability upon mutation, there are still critical unsolved problems: 1) the limited number of thermodynamic measurements for proteins provided by current databases; 2) the large intrinsic variability of ΔΔG values due to different experimental conditions; 3) biases in the development of predictive methods caused by ignoring the anti-symmetry of ΔΔG values between mutant and native protein forms; 4) over-optimistic prediction performance, due to sequence similarity between proteins used in training and test datasets. Here, we review these issues, highlighting new challenges required to improve current tools and to achieve more reliable predictions. In addition, we provide a perspective of how these methods will be beneficial for designing novel precision medicine approaches for several genetic disorders caused by mutations, such as cancer and neurodegenerative diseases.
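Point 3 above refers to the thermodynamic requirement that the stability change of a mutation and of its reverse sum to zero, i.e. ΔΔG(mutant → native) = −ΔΔG(native → mutant). The sketch below shows one way a predictor's outputs could be checked against that property; the two summary measures (mean bias and the Pearson correlation between direct and reverse predictions) and the function name are illustrative assumptions, not a procedure quoted from the review.

```python
def antisymmetry_report(ddg_direct, ddg_reverse):
    """Summarize how well paired predictions satisfy ddG(reverse) = -ddG(direct).

    Returns (bias, r): bias is the mean of (direct + reverse) / 2 and is 0 for a
    perfectly anti-symmetric predictor; r is the Pearson correlation between the
    direct and reverse predictions and is -1 in the ideal case.
    """
    n = len(ddg_direct)
    bias = sum(d + r for d, r in zip(ddg_direct, ddg_reverse)) / (2 * n)
    mean_d = sum(ddg_direct) / n
    mean_r = sum(ddg_reverse) / n
    cov = sum((d - mean_d) * (r - mean_r) for d, r in zip(ddg_direct, ddg_reverse))
    var_d = sum((d - mean_d) ** 2 for d in ddg_direct)
    var_r = sum((r - mean_r) ** 2 for r in ddg_reverse)
    corr = cov / (var_d * var_r) ** 0.5
    return bias, corr


# Hypothetical predictions (kcal/mol) for three mutations and their reverses.
direct = [1.0, -0.5, 2.0]
ideal_reverse = [-1.0, 0.5, -2.0]    # perfectly anti-symmetric
biased_reverse = [0.2, 1.1, -0.9]    # a predictor that leans toward positive values
print(antisymmetry_report(direct, ideal_reverse))
print(antisymmetry_report(direct, biased_reverse))
```

A nonzero bias in the second call is the kind of systematic skew that the abstract attributes to methods developed without regard for anti-symmetry.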
Affiliation(s)
- Tiziana Sanavia
  - Department of Medical Sciences, University of Torino, Via Santena 19, 10126 Torino, Italy
- Giovanni Birolo
  - Department of Medical Sciences, University of Torino, Via Santena 19, 10126 Torino, Italy
- Ludovica Montanucci
  - Department of Comparative Biomedicine and Food Science (BCA), University of Padova, Viale dell'Università 16, 35020 Legnaro, Italy
- Paola Turina
  - Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Via F. Selmi 3, 40126 Bologna, Italy
- Emidio Capriotti
  - Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Via F. Selmi 3, 40126 Bologna, Italy
- Piero Fariselli
  - Department of Medical Sciences, University of Torino, Via Santena 19, 10126 Torino, Italy

11. Capriotti E, Montanucci L, Profiti G, Rossi I, Giannuzzi D, Aresu L, Fariselli P. Fido-SNP: the first webserver for scoring the impact of single nucleotide variants in the dog genome. Nucleic Acids Res 2020; 47:W136-W141. [PMID: 31114899 PMCID: PMC6602425 DOI: 10.1093/nar/gkz420] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/24/2019] [Revised: 04/19/2019] [Accepted: 05/06/2019] [Indexed: 12/22/2022] Open
Abstract
As the amount of genomic variation data increases, tools that are able to score the functional impact of single nucleotide variants become more and more necessary. While there are several prediction servers available for interpreting the effects of variants in the human genome, only a few have been developed for other species, and none were specifically designed for species of veterinary interest such as the dog. Here, we present Fido-SNP, the first predictor able to discriminate between Pathogenic and Benign single-nucleotide variants in the dog genome. Fido-SNP is a binary classifier based on the Gradient Boosting algorithm. It is able to classify and score the impact of variants in both coding and non-coding regions based on sequence features within seconds. When validated on a previously unseen set of annotated variants from the OMIA database, Fido-SNP reaches 88% overall accuracy, 0.77 Matthews correlation coefficient and 0.91 Area Under the ROC Curve.
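Overall accuracy and the Matthews correlation coefficient (MCC), two of the figures quoted above, are both computed from the counts of a binary confusion matrix. The sketch below uses the standard formulas; the confusion counts in the example are hypothetical and are not the Fido-SNP validation data.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all variants that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1.

    Returns 0.0 when any marginal total is zero, a common convention for the
    otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for a Pathogenic (positive) vs Benign (negative) test set.
tp, tn, fp, fn = 40, 48, 2, 10
print(accuracy(tp, tn, fp, fn), round(mcc(tp, tn, fp, fn), 3))
```

Unlike accuracy, the MCC stays informative when the two classes are unbalanced, which is one reason the two are often reported together.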
Affiliation(s)
- Emidio Capriotti
  - BioFolD Unit, Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Via F. Selmi 3, 40126 Bologna, Italy
- Ludovica Montanucci
  - Department of Comparative Biomedicine and Food Science, University of Padova, Viale dell'Università, 16, 35020 Legnaro (Padova), Italy
- Giuseppe Profiti
  - BioDec srl, Via Calzavecchio 20, 40033 Casalecchio di Reno (Bologna), Italy
- Ivan Rossi
  - BioDec srl, Via Calzavecchio 20, 40033 Casalecchio di Reno (Bologna), Italy
- Diana Giannuzzi
  - Department of Comparative Biomedicine and Food Science, University of Padova, Viale dell'Università, 16, 35020 Legnaro (Padova), Italy
- Luca Aresu
  - Department of Veterinary Sciences, University of Torino, Largo P. Braccini 2, 10095 Grugliasco (Torino), Italy
- Piero Fariselli
  - Department of Comparative Biomedicine and Food Science, University of Padova, Viale dell'Università, 16, 35020 Legnaro (Padova), Italy
  - Department of Medical Sciences, University of Torino, Via Santena 19, 10126 Torino, Italy

12. Abolhassani H, Marcotte H, Fang M, Hammarström L. Clinical implications of experimental analyses of AID function on predictive computational tools: Challenge of missense variants. Clin Genet 2020; 97:844-856. [PMID: 32162335 DOI: 10.1111/cge.13737] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2020] [Revised: 02/27/2020] [Accepted: 03/03/2020] [Indexed: 11/30/2022]
Abstract
Due to the increased usage of high throughput sequencing for the diagnosis of genetically inherited disorders, it is vital to evaluate the risk of new variants and novel genes before accepting them in clinical practice. However, discordant in silico and in vitro results challenge estimations of the effect of an identified genetic variant. We aimed to comprehensively evaluate pathogenic and polymorphic variants using the activation-induced-cytidine-deaminase (AICDA) gene as a model. We systematically searched and identified patients with confirmed AICDA-mutations. Population-based-databases were screened for germline-polymorphic-AICDA-variants. Activity of AICDA-mutants and severity of the clinical and immunologic-phenotype were shown by comparing 108 population-based-variants with 48 pathogenic mutations (12 overlapping-variants). Discordant predictions of different algorithms were observed on average in 38% of the population-database variants, mainly for missense mutations. Functional activity in mutations observed only in patients was significantly lower than that of variants in the population databases and of overlapping-variants between patients and the general-population. Surprisingly, overlapping-variants had an even higher functional activity than the most common polymorphic-variants; however, their pathogenicity was still distinguishable when their function was compared with wild-type AICDA. Classifications of genetic variants cannot readily be translated into a clinical implication. Combined databases of functional and computational assays should therefore be developed for each specific gene.
Affiliation(s)
- Hassan Abolhassani
  - Division of Clinical Immunology, Department of Laboratory Medicine, Karolinska Institutet at Karolinska University Hospital Huddinge, Stockholm, Sweden
- Harold Marcotte
  - Division of Clinical Immunology, Department of Laboratory Medicine, Karolinska Institutet at Karolinska University Hospital Huddinge, Stockholm, Sweden
- Mingyan Fang
  - Division of Clinical Immunology, Department of Laboratory Medicine, Karolinska Institutet at Karolinska University Hospital Huddinge, Stockholm, Sweden
  - BGI-Shenzhen, Shenzhen, China
  - China National GeneBank, Shenzhen, China
- Lennart Hammarström
  - Division of Clinical Immunology, Department of Laboratory Medicine, Karolinska Institutet at Karolinska University Hospital Huddinge, Stockholm, Sweden
  - BGI-Shenzhen, Shenzhen, China
  - China National GeneBank, Shenzhen, China
13
|
Pharmacogenes (PGx-genes): Current understanding and future directions. Gene 2019; 718:144050. [DOI: 10.1016/j.gene.2019.144050] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 03/12/2019] [Revised: 08/13/2019] [Accepted: 08/14/2019] [Indexed: 12/14/2022]
14
|
Voskanian A, Katsonis P, Lichtarge O, Pejaver V, Radivojac P, Mooney SD, Capriotti E, Bromberg Y, Wang Y, Miller M, Martelli PL, Savojardo C, Babbi G, Casadio R, Cao Y, Sun Y, Shen Y, Garg A, Pal D, Yu Y, Huff CD, Tavtigian SV, Young E, Neuhausen SL, Ziv E, Pal LR, Andreoletti G, Brenner S, Kann MG. Assessing the performance of in silico methods for predicting the pathogenicity of variants in the gene CHEK2, among Hispanic females with breast cancer. Hum Mutat 2019; 40:1612-1622. [PMID: 31241222 PMCID: PMC6744287 DOI: 10.1002/humu.23849] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 02/25/2019] [Revised: 05/23/2019] [Accepted: 06/21/2019] [Indexed: 01/22/2023]
Abstract
The availability of disease-specific genomic data is critical for developing new computational methods that predict the pathogenicity of human variants and advance the field of precision medicine. However, the lack of gold standards to properly train and benchmark such methods is one of the greatest challenges in the field. In response to this challenge, the scientific community is invited to participate in the Critical Assessment for Genome Interpretation (CAGI), where unpublished disease variants are available for classification by in silico methods. As part of the CAGI-5 challenge, we evaluated the performance of 18 submissions and three additional methods in predicting the pathogenicity of single nucleotide variants (SNVs) in checkpoint kinase 2 (CHEK2) for cases of breast cancer in Hispanic females. As part of the assessment, the efficacy of the analysis method and the setup of the challenge were also considered. The results indicated that though the challenge could benefit from additional participant data, the combined generalized linear model analysis and odds of pathogenicity analysis provided a framework to evaluate the methods submitted for SNV pathogenicity identification and for comparison to other available methods. The outcome of this challenge and the approaches used can help guide further advancements in identifying SNV-disease relationships.
Affiliation(s)
- Alin Voskanian
  - Department of Biological Sciences, University of Maryland, Baltimore County, MD, U.S.A
- Panagiotis Katsonis
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, U.S.A
- Olivier Lichtarge
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, U.S.A
  - Department of Pharmacology, Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Vikas Pejaver
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington, U.S.A
  - The eScience Institute, University of Washington, Seattle, Washington, U.S.A
- Predrag Radivojac
  - Khoury College of Computer and Information Sciences, Northeastern University, Boston, Massachusetts, U.S.A
- Sean D. Mooney
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington, U.S.A
- Emidio Capriotti
  - BioFolD Unit, Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Via Selmi 3, 40126 Bologna, Italy
- Yana Bromberg
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey, U.S.A
  - Department of Genetics, Rutgers University, New Brunswick, New Jersey, U.S.A
  - Technical University of Munich Institute for Advanced Study, (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany
- Yanran Wang
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey, U.S.A
- Max Miller
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey, U.S.A
- Pier Luigi Martelli
  - Biocomputing Group, BiGeA/Giorgio Prodi Interdepartmental Center for Cancer Research, University of Bologna, Via F. Selmi 3, Bologna, 40126, Italy
- Castrense Savojardo
  - Biocomputing Group, BiGeA/Giorgio Prodi Interdepartmental Center for Cancer Research, University of Bologna, Via F. Selmi 3, Bologna, 40126, Italy
- Giulia Babbi
  - Biocomputing Group, BiGeA/Giorgio Prodi Interdepartmental Center for Cancer Research, University of Bologna, Via F. Selmi 3, Bologna, 40126, Italy
- Rita Casadio
  - Biocomputing Group, BiGeA/Giorgio Prodi Interdepartmental Center for Cancer Research, University of Bologna, Via F. Selmi 3, Bologna, 40126, Italy
- Yue Cao
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, U.S.A
- Yuanfei Sun
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, U.S.A
- Yang Shen
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, U.S.A
- Aditi Garg
  - Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru 560 012, India
- Debnath Pal
  - Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru 560 012, India
- Yao Yu
  - Department of Epidemiology, University of Texas MD Anderson Cancer Center, Houston, TX 77030, U.S.A
- Chad D. Huff
  - Department of Epidemiology, University of Texas MD Anderson Cancer Center, Houston, TX 77030, U.S.A
- Sean V. Tavtigian
  - Huntsman Cancer Institute, University of Utah School of Medicine, Salt Lake City, UT 84132, U.S.A
- Erin Young
  - Huntsman Cancer Institute, University of Utah School of Medicine, Salt Lake City, UT 84132, U.S.A
- Susan L. Neuhausen
  - Department of Population Sciences, Beckman Research Institute of City of Hope, Duarte, CA, 91010 U.S.A
- Elad Ziv
  - Division of General Internal Medicine, Department of Medicine, Institute of Human Genetics, Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, U.S.A
- Lipika R. Pal
  - Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Gaia Andreoletti
  - Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
- Steven Brenner
  - Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
- Maricel G. Kann
  - Department of Biological Sciences, University of Maryland, Baltimore County, MD, U.S.A
15
|
Bromberg Y, Capriotti E, Carter H. VarI-COSI 2018: a forum for research advances in variant interpretation and diagnostics. BMC Genomics 2019; 20:550. [PMID: 31307380 PMCID: PMC6631439 DOI: 10.1186/s12864-019-5862-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/13/2022] Open
Affiliation(s)
- Yana Bromberg
  - Department of Biochemistry and Microbiology, Lipman Hall 218, Rutgers University, New Brunswick, NJ, 08901, USA.
  - Department of Genetics, Lipman Hall 218, Rutgers University, New Brunswick, NJ, 08901, USA.
- Emidio Capriotti
  - Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, 40126, Bologna, Italy.
- Hannah Carter
  - Division of Medical Genetics, Department of Medicine, University of California, San Diego, CA, 92093, USA.
16
|
Capriotti E, Fariselli P. PhD-SNPg: a webserver and lightweight tool for scoring single nucleotide variants. Nucleic Acids Res 2017; 45:W247-W252. [PMID: 28482034 PMCID: PMC5570245 DOI: 10.1093/nar/gkx369] [Citation(s) in RCA: 112] [Impact Index Per Article: 22.4] [Received: 01/31/2017] [Accepted: 04/24/2017] [Indexed: 12/15/2022] Open
Abstract
One of the major challenges in human genetics is to identify the functional effects of coding and non-coding single nucleotide variants (SNVs). In the past, several methods have been developed to identify disease-related single amino acid changes, but only a few tools are able to score the impact of non-coding variants. Among the most popular algorithms, CADD and FATHMM predict the effect of SNVs in non-coding regions by combining sequence conservation with several functional features derived from the ENCODE project data. Thus, to run CADD or FATHMM locally, the installation process requires downloading a large set of pre-calculated information. To facilitate the process of variant annotation, we developed PhD-SNPg, a new easy-to-install and lightweight machine learning method that depends only on sequence-based features. Despite this, PhD-SNPg performs similarly to or better than more complex methods. This makes PhD-SNPg ideal for quick SNV interpretation, and as a benchmark for tool development. Availability: PhD-SNPg is accessible at http://snps.biofold.org/phd-snpg.
Affiliation(s)
- Emidio Capriotti
  - Department of Biological, Geological, and Environmental Sciences (BiGeA), University of Bologna, Via F. Selmi 3, Bologna 40126, Italy
- Piero Fariselli
  - Department of Comparative Biomedicine and Food Science, University of Padova, Viale dell'Università, 16, 35020 Legnaro, PD, Italy
17
|
Bhyan SB, Wee Y, Liu Y, Cummins S, Zhao M. Integrative analysis of common genes and driver mutations implicated in hormone stimulation for four cancers in women. PeerJ 2019; 7:e6872. [PMID: 31205821 PMCID: PMC6556371 DOI: 10.7717/peerj.6872] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/29/2019] [Accepted: 03/28/2019] [Indexed: 12/11/2022] Open
Abstract
Cancer is one of the leading causes of death of women worldwide, and breast, ovarian, endometrial and cervical cancers contribute significantly to this every year. Developing early genetic-based diagnostic tools may be an effective approach to increase the chances of survival and provide more treatment opportunities. However, current cancer genetic studies are mainly conducted independently and hence lack common driver genes involved in cancers in women. To explore the potential common molecular mechanism, we integrated four comprehensive literature-based databases to explore the shared implicated genetic effects. Using a total of 460 endometrial, 2,068 ovarian, 2,308 breast and 537 cervical cancer-implicated genes, we identified 52 genes which are common in all four types of cancers in women. Furthermore, we defined their potential functional role in endogenous hormonal regulation pathways within the context of four cancers in women. For example, these genes are strongly associated with hormonal stimulation, which may facilitate rapid diagnosis and treatment management decision making. Additional mutational analyses of combined The Cancer Genome Atlas datasets, consisting of 5,919 gynaecological and breast tumor samples, were conducted to identify the frequently mutated genes across cancer types. For those common implicated genes for hormonal stimulants, we found that three quarters of the 5,919 samples had genomic alterations, with the highest frequency in MYC (22%), followed by NDRG1 (19%), ERBB2 (14%), PTEN (13%), PTGS2 (13%) and CDH1 (11%). We also identified 38 hormone-related genes, eight of which are associated with the ovulation cycle. A further systems biology analysis of the shared genes identified 20 novel genes, of which 12 were involved in hormone regulation in these four cancers in women. Identification of common driver genes for hormone stimulation provides a unique angle on the potential of hormone stimulant-related genes for cancer diagnosis and prognosis.
Affiliation(s)
- Salma Begum Bhyan
  - Faculty of Science, Health, Education and Engineering, University of the Sunshine Coast, Sunshine Coast, QLD, Australia
- YongKiat Wee
  - Faculty of Science, Health, Education and Engineering, University of the Sunshine Coast, Sunshine Coast, QLD, Australia
- Yining Liu
  - The School of Public Health, Institute for Chemical Carcinogenesis, Guangzhou Medical University, Guangzhou, China
- Scott Cummins
  - Faculty of Science, Health, Education and Engineering, University of the Sunshine Coast, Sunshine Coast, QLD, Australia
- Min Zhao
  - Faculty of Science, Health, Education and Engineering, University of the Sunshine Coast, Sunshine Coast, QLD, Australia
18
|
Capriotti E, Ozturk K, Carter H. Integrating molecular networks with genetic variant interpretation for precision medicine. Wiley Interdiscip Rev Syst Biol Med 2018; 11:e1443. [PMID: 30548534 PMCID: PMC6450710 DOI: 10.1002/wsbm.1443] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 08/15/2018] [Revised: 10/23/2018] [Accepted: 10/30/2018] [Indexed: 02/01/2023]
Abstract
More reliable and cheaper sequencing technologies have revealed the vast mutational landscapes characteristic of many phenotypes. The analysis of such genetic variants has led to successful identification of altered proteins underlying many Mendelian disorders. Nevertheless the simple one‐variant one‐phenotype model valid for many monogenic diseases does not capture the complexity of polygenic traits and disorders. Although experimental and computational approaches have improved detection of functionally deleterious variants and important interactions between gene products, the development of comprehensive models relating genotype and phenotypes remains a challenge in the field of genomic medicine. In this context, a new view of the pathologic state as significant perturbation of the network of interactions between biomolecules is crucial for the identification of biochemical pathways associated with complex phenotypes. Seminal studies in systems biology combined the analysis of genetic variation with protein–protein interaction networks to demonstrate that even as biological systems evolve to be robust to genetic variation, their topologies create disease vulnerabilities. More recent analyses model the impact of genetic variants as changes to the “wiring” of the interactome to better capture heterogeneity in genotype–phenotype relationships. These studies lay the foundation for using networks to predict variant effects at scale using machine‐learning or algorithmic approaches. A wealth of databases and resources for the annotation of genotype–phenotype relationships have been developed to support developments in this area. This overview describes how study of the molecular interactome has generated insights linking the organization of biological systems to disease mechanism, and how this information can enable precision medicine. This article is categorized under:
Translational, Genomic, and Systems Medicine > Translational Medicine; Biological Mechanisms > Cell Signaling; Models of Systems Properties and Processes > Mechanistic Models; Analytical and Computational Methods > Computational Methods
Affiliation(s)
- Emidio Capriotti
  - Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Bologna, Italy
- Kivilcim Ozturk
  - Bioinformatics and Systems Biology Program, University of California, San Diego, La Jolla, California
- Hannah Carter
  - Department of Medicine and Institute for Genomic Medicine, University of California, San Diego, La Jolla, California
19
|
Capriotti E, Martelli PL, Fariselli P, Casadio R. Blind prediction of deleterious amino acid variations with SNPs&GO. Hum Mutat 2017; 38:1064-1071. [PMID: 28102005 PMCID: PMC5522651 DOI: 10.1002/humu.23179] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 09/24/2016] [Revised: 11/08/2016] [Accepted: 01/10/2017] [Indexed: 01/09/2023]
Abstract
SNPs&GO is a machine learning method for predicting the association of single amino acid variations (SAVs) to disease, considering protein functional annotation. The method is a binary classifier that implements a support vector machine algorithm to discriminate between disease-related and neutral SAVs. SNPs&GO combines information from protein sequence with functional annotation encoded by gene ontology (GO) terms. Tested in sequence mode on more than 38,000 SAVs from the SwissVar dataset, our method reached 81% overall accuracy and an area under the receiver operating characteristic curve of 0.88 with a low false-positive rate. In almost all the editions of the Critical Assessment of Genome Interpretation (CAGI) experiments, SNPs&GO ranked among the most accurate algorithms for predicting the effect of SAVs. In this paper, we summarize the best results obtained by SNPs&GO on disease-related variations of four CAGI challenges relative to the following genes: CHEK2 (CAGI 2010), RAD50 (CAGI 2011), p16-INK (CAGI 2013), and NAGLU (CAGI 2016). Result evaluation provides insights about the accuracy of our algorithm and the relevance of GO terms in annotating the effect of the variants. It also helps to define good practices for the detection of deleterious SAVs.
Affiliation(s)
- Emidio Capriotti
  - Biocomputing Group, BiGeA/Giorgio Prodi Interdepartmental Center for Cancer Research, University of Bologna, Via F. Selmi 3, Bologna, 40126, Italy
- Pier Luigi Martelli
  - Biocomputing Group, BiGeA/Giorgio Prodi Interdepartmental Center for Cancer Research, University of Bologna, Via F. Selmi 3, Bologna, 40126, Italy
- Piero Fariselli
  - Department of Comparative Biomedicine and Food Science, University of Padova, Viale dell’Università, 16, 35020 Legnaro (PD), Italy
- Rita Casadio
  - Biocomputing Group, BiGeA/Giorgio Prodi Interdepartmental Center for Cancer Research, University of Bologna, Via F. Selmi 3, Bologna, 40126, Italy
20
|
Soualmia LF, Lecroq T. Bioinformatics Methods and Tools to Advance Clinical Care. Findings from the Yearbook 2015 Section on Bioinformatics and Translational Informatics. Yearb Med Inform 2015; 10:170-3. [PMID: 26293864 DOI: 10.15265/iy-2015-026] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/14/2022] Open
Abstract
OBJECTIVES: To summarize excellent current research in the field of Bioinformatics and Translational Informatics with application in the health domain and clinical care. METHOD: We provide a synopsis of the articles selected for the IMIA Yearbook 2015, from which we attempt to derive a synthetic overview of current and future activities in the field. As last year, a first selection step was performed by querying MEDLINE with a list of MeSH descriptors complemented by a list of terms adapted to the section. Each section editor separately evaluated the set of 1,594 articles, and the evaluation results were merged to retain 15 articles for peer review. RESULTS: The selection and evaluation process of this Yearbook's section on Bioinformatics and Translational Informatics yielded four excellent articles regarding data management and genome medicine that are mainly tool-based papers. In the first article, the authors present PPISURV, a tool for uncovering the role of specific genes in cancer survival outcome. The second article describes the classifier PredictSNP, which combines six well-performing tools for predicting disease-related mutations. In the third article, by presenting a high-coverage map of the human proteome using high-resolution mass spectrometry, the authors highlight the need for using mass spectrometry to complement genome annotation. The fourth article is also related to patient survival and decision support: the authors present data mining methods for large-scale datasets of past transplants, with the objective of identifying chances of survival. CONCLUSIONS: Current research activities still attest to the continuous convergence of Bioinformatics and Medical Informatics, with a focus this year on dedicated tools and methods to advance clinical care. Indeed, there is a need for powerful tools for managing and interpreting complex, large-scale genomic and biological datasets, but also a need for user-friendly tools developed for clinicians in their daily practice. All the recent research and development efforts contribute to the challenge of translating the obtained results into clinical impact, towards personalized medicine.
Affiliation(s)
- L F Soualmia
  - Dr Lina F. Soualmia, Normandie Univ., Rouen University and Hospital, SIBM & LITIS EA 4108, Information Processing in Biology & Health, 1, rue de Germont, Cour Leschevin porte 21, 76031 Rouen Cedex, France, Tel: +33 232 885 869
21
|
Rost B, Radivojac P, Bromberg Y. Protein function in precision medicine: deep understanding with machine learning. FEBS Lett 2016; 590:2327-41. [PMID: 27423136 PMCID: PMC5937700 DOI: 10.1002/1873-3468.12307] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Received: 06/27/2016] [Revised: 07/12/2016] [Accepted: 07/12/2016] [Indexed: 12/21/2022]
Abstract
Precision medicine and personalized health efforts propose leveraging complex molecular, medical and family history, along with other types of personal data, toward a better life. We argue that this ambitious objective will require advanced and specialized machine learning solutions. Simply skimming some low-hanging results off the data wealth might have limited potential. Instead, we need to better understand all parts of the system to define medically relevant causes and effects: how do particular sequence variants affect particular proteins and pathways? How do these effects, in turn, cause the health or disease-related phenotype? Toward this end, deeper understanding will not simply diffuse from deeper machine learning, but from a more explicit focus on understanding protein function, context-specific protein interaction networks, and the impact of variation on both.
Affiliation(s)
- Burkhard Rost
  - Department of Informatics and Bioinformatics, Institute for Advanced Studies, Technical University of Munich, Garching, Germany
- Predrag Radivojac
  - School of Informatics and Computing, Indiana University, Bloomington, IN, USA
- Yana Bromberg
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, USA
22
|
Bromberg Y, Capriotti E, Carter H. VarI-SIG 2015: methods for personalized medicine - the role of variant interpretation in research and diagnostics. BMC Genomics 2016; 17 Suppl 2:425. [PMID: 27357578 PMCID: PMC4928159 DOI: 10.1186/s12864-016-2721-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Affiliation(s)
- Yana Bromberg
  - Department of Biochemistry and Microbiology, Rutgers University, Lipman Hall 218, 08901, New Brunswick, NJ, USA.
  - Department of Genetics, Rutgers University, Lipman Hall 218, 08901, New Brunswick, NJ, USA.
- Emidio Capriotti
  - Institute for Mathematical Modeling of Biological Systems, Department of Biology, Heinrich Heine University Düsseldorf, Universitaetsstr. 1, 40225, Düsseldorf, Germany.
- Hannah Carter
  - Division of Medical Genetics, Department of Medicine, University of California, San Diego, 9500 Gilman Dr., 92093, La Jolla, CA, USA.
23
|
Tang H, Thomas PD. Tools for Predicting the Functional Impact of Nonsynonymous Genetic Variation. Genetics 2016; 203:635-47. [PMID: 27270698 PMCID: PMC4896183 DOI: 10.1534/genetics.116.190033] [Citation(s) in RCA: 77] [Impact Index Per Article: 9.6] [Received: 09/04/2015] [Accepted: 04/01/2016] [Indexed: 01/09/2023] Open
Abstract
As personal genome sequencing becomes a reality, understanding the effects of genetic variants on phenotype, particularly the impact of germline variants on disease risk and the impact of somatic variants on cancer development and treatment, continues to increase in importance. Because of their clear potential for affecting phenotype, nonsynonymous genetic variants (variants that cause a change in the amino acid sequence of a protein encoded by a gene) have long been the target of efforts to predict the effects of genetic variation. Whole-genome sequencing is identifying large numbers of nonsynonymous variants in each genome, intensifying the need for computational methods that accurately predict which of these are likely to impact disease phenotypes. This review focuses on nonsynonymous variant prediction with two aims in mind: (1) to review the prioritization methods that have been developed to date and the principles on which they are based and (2) to discuss the challenges to further improving these methods.
Affiliation(s)
- Haiming Tang
  - Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, California 90033
- Paul D Thomas
  - Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, California 90033
24
|
Bendl J, Musil M, Štourač J, Zendulka J, Damborský J, Brezovský J. PredictSNP2: A Unified Platform for Accurately Evaluating SNP Effects by Exploiting the Different Characteristics of Variants in Distinct Genomic Regions. PLoS Comput Biol 2016; 12:e1004962. [PMID: 27224906 PMCID: PMC4880439 DOI: 10.1371/journal.pcbi.1004962] [Citation(s) in RCA: 133] [Impact Index Per Article: 16.6] [Received: 12/16/2015] [Accepted: 05/05/2016] [Indexed: 12/20/2022] Open
Abstract
An important message taken from human genome sequencing projects is that the human population exhibits approximately 99.9% genetic similarity. Variations in the remaining parts of the genome determine our identity, trace our history and reveal our heritage. The precise delineation of phenotypically causal variants plays a key role in providing accurate personalized diagnosis, prognosis, and treatment of inherited diseases. Several computational methods for achieving such delineation have been reported recently. However, their ability to pinpoint potentially deleterious variants is limited by the fact that their mechanisms of prediction do not account for the existence of different categories of variants. Consequently, their output is biased towards the variant categories that are most strongly represented in the variant databases. Moreover, most such methods provide numeric scores but not binary predictions of the deleteriousness of variants or confidence scores that would be more easily understood by users. We have constructed three datasets covering different types of disease-related variants, which were divided across five categories: (i) regulatory, (ii) splicing, (iii) missense, (iv) synonymous, and (v) nonsense variants. These datasets were used to develop category-optimal decision thresholds and to evaluate six tools for variant prioritization: CADD, DANN, FATHMM, FitCons, FunSeq2 and GWAVA. This evaluation revealed some important advantages of the category-based approach. The results obtained with the five best-performing tools were then combined into a consensus score. Additional comparative analyses showed that in the case of missense variations, protein-based predictors perform better than DNA sequence-based predictors. A user-friendly web interface was developed that provides easy access to the five tools’ predictions, and their consensus scores, in a user-understandable format tailored to the specific features of different categories of variations. To enable comprehensive evaluation of variants, the predictions are complemented with annotations from eight databases. The web server is freely available to the community at http://loschmidt.chemi.muni.cz/predictsnp2.
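The idea of category-specific decision thresholds followed by a consensus across tools, as described in the abstract above, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tool names (`tool_a`, `tool_b`, `tool_c`), the threshold values, and the simple majority-vote rule are hypothetical and are not taken from PredictSNP2, whose actual consensus scoring is not specified here.

```python
from typing import Dict

# Hypothetical per-category decision thresholds for each tool.
# The values are illustrative only and do not come from any published method.
THRESHOLDS: Dict[str, Dict[str, float]] = {
    "missense": {"tool_a": 0.6, "tool_b": 0.5, "tool_c": 0.7},
    "synonymous": {"tool_a": 0.4, "tool_b": 0.3, "tool_c": 0.5},
}


def consensus_call(category: str, scores: Dict[str, float]) -> bool:
    """Return True (deleterious) when a strict majority of tools reach
    their threshold for the given variant category."""
    cutoffs = THRESHOLDS[category]
    # Binarize each tool's numeric score with its category-specific cutoff.
    votes = [scores[tool] >= cutoffs[tool] for tool in scores]
    # Simple majority vote (an assumed stand-in for a real consensus score).
    return sum(votes) * 2 > len(votes)


if __name__ == "__main__":
    example = {"tool_a": 0.45, "tool_b": 0.35, "tool_c": 0.1}
    # The same raw scores can lead to different calls in different categories.
    print(consensus_call("missense", example))
    print(consensus_call("synonymous", example))
```

The point of the sketch is that binarizing scores per category, before combining them, lets identical raw scores yield different calls depending on the variant type.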
Affiliation(s)
- Jaroslav Bendl
  - Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
  - Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
  - International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
- Miloš Musil
  - Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
  - Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
- Jan Štourač
  - Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
  - International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
- Jaroslav Zendulka
  - Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
- Jiří Damborský
  - Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
  - International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
- Jan Brezovský
  - Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
  - International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
25
|
Niroula A, Vihinen M. Variation Interpretation Predictors: Principles, Types, Performance, and Choice. Hum Mutat 2016; 37:579-97. [DOI: 10.1002/humu.22987] [Citation(s) in RCA: 90] [Impact Index Per Article: 11.3] [Received: 11/16/2015] [Accepted: 03/07/2016] [Indexed: 12/18/2022]
Affiliation(s)
- Abhishek Niroula
- Department of Experimental Medical Science; Lund University; BMC B13 Lund SE-22184 Sweden
| | - Mauno Vihinen
- Department of Experimental Medical Science; Lund University; BMC B13 Lund SE-22184 Sweden
| |
|
26
|
Mahmood ASMA, Wu TJ, Mazumder R, Vijay-Shanker K. DiMeX: A Text Mining System for Mutation-Disease Association Extraction. PLoS One 2016; 11:e0152725. [PMID: 27073839 PMCID: PMC4830514 DOI: 10.1371/journal.pone.0152725] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2015] [Accepted: 03/19/2016] [Indexed: 11/22/2022] Open
Abstract
The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation-disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has also been evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied to a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators to enrich mutation databases.
Affiliation(s)
- A. S. M. Ashique Mahmood
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
- * E-mail:
| | - Tsung-Jung Wu
- Department of Biochemistry and Molecular Medicine, George Washington University, Washington, District of Columbia, United States of America
| | - Raja Mazumder
- Department of Biochemistry and Molecular Medicine, George Washington University, Washington, District of Columbia, United States of America
- McCormick Genomic and Proteomic Center, George Washington University, Washington, District of Columbia, United States of America
| | - K. Vijay-Shanker
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
| |
|
27
|
Douville C, Masica DL, Stenson PD, Cooper DN, Gygax DM, Kim R, Ryan M, Karchin R. Assessing the Pathogenicity of Insertion and Deletion Variants with the Variant Effect Scoring Tool (VEST-Indel). Hum Mutat 2016; 37:28-35. [PMID: 26442818 PMCID: PMC5057310 DOI: 10.1002/humu.22911] [Citation(s) in RCA: 89] [Impact Index Per Article: 11.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2015] [Accepted: 09/14/2015] [Indexed: 12/11/2022]
Abstract
Insertion/deletion variants (indels) alter protein sequence and length, yet are highly prevalent in healthy populations, presenting a challenge to bioinformatics classifiers. Commonly used features--DNA and protein sequence conservation, indel length, and occurrence in repeat regions--are useful for inference of protein damage. However, these features can cause false positives when predicting the impact of indels on disease. Existing methods for indel classification suffer from low specificities, severely limiting clinical utility. Here, we further develop our variant effect scoring tool (VEST) to include the classification of in-frame and frameshift indels (VEST-indel) as pathogenic or benign. We apply 24 features, including a new "PubMed" feature, to estimate a gene's importance in human disease. When compared with four existing indel classifiers, our method achieves a drastically reduced false-positive rate, improving specificity by as much as 90%. This approach of estimating gene importance might be generally applicable to missense and other bioinformatics pathogenicity predictors, which often fail to achieve high specificity. Finally, we tested all possible meta-predictors that can be obtained from combining the four different indel classifiers using Boolean conjunctions and disjunctions, and derived a meta-predictor with improved performance over any individual method.
Affiliation(s)
- Christopher Douville
- Department of Biomedical Engineering and Institute for Computational Medicine, The Johns Hopkins University, Baltimore, Maryland
| | - David L. Masica
- Department of Biomedical Engineering and Institute for Computational Medicine, The Johns Hopkins University, Baltimore, Maryland
| | - Peter D. Stenson
- Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff, UK
| | - David N. Cooper
- Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff, UK
| | | | - Rick Kim
- In Silico Solutions, Fairfax, Virginia
| | | | - Rachel Karchin
- Department of Biomedical Engineering and Institute for Computational Medicine, The Johns Hopkins University, Baltimore, Maryland
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, Maryland
| |
|
28
|
Cheng R, Leung RKK, Chen Y, Pan Y, Tong Y, Li Z, Ning L, Ling XB, He J. Virtual Pharmacist: A Platform for Pharmacogenomics. PLoS One 2015; 10:e0141105. [PMID: 26496198 PMCID: PMC4619711 DOI: 10.1371/journal.pone.0141105] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2015] [Accepted: 10/03/2015] [Indexed: 01/15/2023] Open
Abstract
We present Virtual Pharmacist, a web-based platform that takes common types of high-throughput data, namely microarray SNP genotyping data, FASTQ and Variant Call Format (VCF) files as inputs, and reports potential drug responses in terms of efficacy, dosage and toxicity at a glance. Batch submission facilitates multivariate analysis or data mining of targeted groups. Individual analysis consists of a report that is readily comprehensible to patients and practitioners who have basic knowledge in pharmacology, a table that summarizes variants and potentially affected drug responses according to the US Food and Drug Administration pharmacogenomic biomarker labeled drug list and PharmGKB, and visualization of a gene-drug-target network. Group analysis provides the distribution of the variants and potentially affected drug responses of a target group, a sample-gene variant count table, and a sample-drug count table. Our analysis of genomes from the 1000 Genomes Project underlines the potentially differential drug responses among different human populations. Even within the same population, the findings from Watson's genome highlight the importance of personalized medicine. Virtual Pharmacist can be accessed freely at http://www.sustc-genome.org.cn/vp or installed as a local web server. The code and documentation are available at the GitHub repository (https://github.com/VirtualPharmacist/vp). Administrators can download the source code to customize access settings for further development.
Affiliation(s)
- Ronghai Cheng
- Department of Biology, South University of Science and Technology of China, Shenzhen, China
| | - Ross Ka-Kit Leung
- Division of Genomics and Bioinformatics, The Chinese University of Hong Kong, Hong Kong, China
| | - Yao Chen
- Department of Biology, South University of Science and Technology of China, Shenzhen, China
| | - Yidan Pan
- Department of Biology, South University of Science and Technology of China, Shenzhen, China
| | - Yin Tong
- Department of Biology, South University of Science and Technology of China, Shenzhen, China
| | - Zhoufang Li
- Department of Biology, South University of Science and Technology of China, Shenzhen, China
| | - Luwen Ning
- Department of Biology, South University of Science and Technology of China, Shenzhen, China
| | - Xuefeng B. Ling
- Departments of Surgery, Stanford University, Stanford, California, United States of America
| | - Jiankui He
- Department of Biology, South University of Science and Technology of China, Shenzhen, China
- * E-mail:
| |
|
29
|
Regan K, Payne PRO. From Molecules to Patients: The Clinical Applications of Translational Bioinformatics. Yearb Med Inform 2015; 10:164-9. [PMID: 26293863 PMCID: PMC4587059 DOI: 10.15265/iy-2015-005] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022] Open
Abstract
OBJECTIVE In order to realize the promise of personalized medicine, Translational Bioinformatics (TBI) research will need to continue to address implementation issues across the clinical spectrum. In this review, we aim to evaluate the expanding field of TBI towards clinical applications, and define common themes and current gaps in order to motivate future research. METHODS Here we present the state-of-the-art of clinical implementation of TBI-based tools and resources. Our thematic analyses of a targeted literature search of recent TBI-related articles ranged across topics in genomics, data management, hypothesis generation, molecular epidemiology, diagnostics, therapeutics and personalized medicine. RESULTS Open areas of clinically-relevant TBI research identified in this review include developing data standards and best practices, publicly available resources, integrative systems-level approaches, user-friendly tools for clinical support, cloud computing solutions, emerging technologies and means to address pressing legal, ethical and social issues. CONCLUSIONS There is a need for further research bridging the gap from foundational TBI-based theories and methodologies to clinical implementation. We have organized the topic themes presented in this review into four conceptual foci - domain analyses, knowledge engineering, computational architectures and computation methods alongside three stages of knowledge development in order to orient future TBI efforts to accelerate the goals of personalized medicine.
Affiliation(s)
| | - P R O Payne
- Philip R.O. Payne, PhD, FACMI, The Ohio State University, Department of Biomedical Informatics, 250 Lincoln Tower, 1800 Cannon Drive, Columbus, OH 43210, USA, Tel: +1 614 292 4778, E-mail:
| |
|
30
|
Bromberg Y, Capriotti E. VarI-SIG 2014--From SNPs to variants: interpreting different types of genetic variants. BMC Genomics 2015; 16 Suppl 8:I1. [PMID: 26110281 PMCID: PMC4480323 DOI: 10.1186/1471-2164-16-s8-i1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023] Open
|
31
|
Tian R, Basu MK, Capriotti E. Computational methods and resources for the interpretation of genomic variants in cancer. BMC Genomics 2015; 16 Suppl 8:S7. [PMID: 26111056 PMCID: PMC4480958 DOI: 10.1186/1471-2164-16-s8-s7] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022] Open
Abstract
The recent improvement of high-throughput sequencing technologies is having a strong impact on the detection of genetic variations associated with cancer. Several institutions worldwide have been sequencing the whole exomes and/or genomes of cancer patients in the thousands, thereby providing an invaluable collection of new somatic mutations in different cancer types. These initiatives promoted the development of methods and tools for the analysis of cancer genomes that are aimed at studying the relationship between genotype and phenotype in cancer. In this article we review the online resources and computational tools for the analysis of cancer genomes. First, we describe the available repositories of cancer genome data. Next, we provide an overview of the methods for the detection of genetic variation and computational tools for the prioritization of cancer related genes and causative somatic variations. Finally, we discuss the future perspectives in cancer genomics focusing on the impact of computational methods and quantitative approaches for defining personalized strategies to improve the diagnosis and treatment of cancer.
|
32
|
Fariselli P, Martelli PL, Savojardo C, Casadio R. INPS: predicting the impact of non-synonymous variations on protein stability from sequence. Bioinformatics 2015; 31:2816-21. [DOI: 10.1093/bioinformatics/btv291] [Citation(s) in RCA: 77] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2015] [Accepted: 05/02/2015] [Indexed: 12/22/2022] Open
|
33
|
Limongelli I, Marini S, Bellazzi R. PaPI: pseudo amino acid composition to score human protein-coding variants. BMC Bioinformatics 2015; 16:123. [PMID: 25928477 PMCID: PMC4411653 DOI: 10.1186/s12859-015-0554-8] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2014] [Accepted: 01/15/2015] [Indexed: 12/31/2022] Open
Abstract
Background High-throughput sequencing technologies are able to identify the whole genomic variation of an individual. Gene-targeted and whole-exome experiments are mainly focused on coding sequence variants related to a single or multiple nucleotides. The analysis of the biological significance of this multitude of genomic variants is challenging and computationally demanding. Results We present PaPI, a new machine-learning approach to classify and score human coding variants by estimating their probability of damaging protein-related function. The novelty of this approach consists in using pseudo amino acid composition through which wild and mutated protein sequences are represented in a discrete model. A machine learning classifier has been trained on a set of known deleterious and benign coding variants with the aim of scoring unobserved variants by taking into account hidden sequence patterns in the human genome potentially leading to diseases. We show how the combination of amphiphilic pseudo amino acid composition, evolutionary conservation and homologous-protein-based methods outperforms several prediction algorithms and is also able to score complex variants such as deletions, insertions and indels. Conclusions This paper describes a machine-learning approach to predict the deleteriousness of human coding variants. A freely available web application (http://papi.unipv.it) has been developed with the presented method, able to score up to thousands of variants in a single run. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0554-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Ivan Limongelli
- IRCCS Policlinico S. Matteo, Pzz.le Volontari del Sangue 2, 27100, Pavia, Italy; Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata 1, 27100, Pavia, Italy.
| | - Simone Marini
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata 1, 27100, Pavia, Italy.
| | - Riccardo Bellazzi
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata 1, 27100, Pavia, Italy.
| |
|
34
|
Luxembourg B, D'Souza M, Körber S, Seifried E. Prediction of the pathogenicity of antithrombin sequence variations by in silico methods. Thromb Res 2015; 135:404-9. [DOI: 10.1016/j.thromres.2014.11.022] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2014] [Revised: 10/30/2014] [Accepted: 11/30/2014] [Indexed: 10/24/2022]
|
35
|
Tian R, Basu MK, Capriotti E. ContrastRank: a new method for ranking putative cancer driver genes and classification of tumor samples. Bioinformatics 2015; 30:i572-8. [PMID: 25161249 PMCID: PMC4147919 DOI: 10.1093/bioinformatics/btu466] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023]
Abstract
Motivation: The recent advance in high-throughput sequencing technologies is generating a huge amount of data that are becoming an important resource for deciphering the genotype underlying a given phenotype. Genome sequencing has been extensively applied to the study of cancer genomes. Although a few methods have already been proposed for the detection of cancer-related genes, their automatic identification is still a challenging task. Using the genomic data made available by The Cancer Genome Atlas Consortium (TCGA), we propose a new prioritization approach based on the analysis of the distribution of putative deleterious variants in a large cohort of cancer samples. Results: In this paper, we present ContrastRank, a new method for the prioritization of putative impaired genes in cancer. The method is based on the comparison of the putative defective rate of each gene in tumor versus normal and 1000 Genomes samples. We show that the method is able to provide a ranked list of putative impaired genes for colon, lung and prostate adenocarcinomas. The list significantly overlaps with the list of known cancer driver genes previously published. More importantly, by using our scoring approach, we can successfully discriminate between TCGA normal and tumor samples. A binary classifier based on ContrastRank score reaches an overall accuracy >90% and the area under the curve (AUC) of receiver operating characteristics (ROC) >0.95 for all the three types of adenocarcinoma analyzed in this paper. In addition, using ContrastRank score, we are able to discriminate the three tumor types with a minimum overall accuracy of 77% and AUC of 0.83. Conclusions: We describe ContrastRank, a method for prioritizing putative impaired genes in cancer. The method is based on the comparison of exome sequencing data from different cohorts and can detect putative cancer driver genes.
ContrastRank can also be used to estimate a global score for an individual genome about the risk of adenocarcinoma based on the genetic variants information from a whole-exome VCF (Variant Call Format) file. We believe that the application of ContrastRank can be an important step in genomic medicine to enable genome-based diagnosis. Availability and implementation: The lists of ContrastRank scores of all genes in each tumor type are available as supplementary materials. A webserver for evaluating the risk of the three studied adenocarcinomas starting from whole-exome VCF file is under development. Contact: emidio@uab.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Rui Tian
- Division of Informatics, Department of Pathology, Department of Clinical and Diagnostic Sciences and Department of Biomedical Engineering, University of Alabama at Birmingham, Birmingham, AL 35249, USA
| | - Malay K Basu
- Division of Informatics, Department of Pathology, Department of Clinical and Diagnostic Sciences and Department of Biomedical Engineering, University of Alabama at Birmingham, Birmingham, AL 35249, USA
| | - Emidio Capriotti
- Division of Informatics, Department of Pathology, Department of Clinical and Diagnostic Sciences and Department of Biomedical Engineering, University of Alabama at Birmingham, Birmingham, AL 35249, USA
| |
|
36
|
Doncheva NT, Klein K, Morris JH, Wybrow M, Domingues FS, Albrecht M. Integrative visual analysis of protein sequence mutations. BMC Proc 2014; 8:S2. [PMID: 25237389 PMCID: PMC4155609 DOI: 10.1186/1753-6561-8-s2-s2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/11/2023] Open
Abstract
Background An important aspect of studying the relationship between protein sequence, structure and function is the molecular characterization of the effect of protein mutations. To understand the functional impact of amino acid changes, the multiple biological properties of protein residues have to be considered together. Results Here, we present a novel visual approach for analyzing residue mutations. It combines different biological visualizations and integrates them with molecular data derived from external resources. To show various aspects of the biological information on different scales, our approach includes one-dimensional sequence views, three-dimensional protein structure views and two-dimensional views of residue interaction networks as well as aggregated views. The views are linked tightly and synchronized to reduce the cognitive load of the user when switching between them. In particular, the protein mutations are mapped onto the views together with further functional and structural information. We also assess the impact of individual amino acid changes by the detailed analysis and visualization of the involved residue interactions. We demonstrate the effectiveness of our approach and the developed software on the data provided for the BioVis 2013 data contest. Conclusions Our visual approach and software greatly facilitate the integrative and interactive analysis of protein mutations based on complementary visualizations. The different data views offered to the user are enriched with information about molecular properties of amino acid residues and further biological knowledge.
Affiliation(s)
- Nadezhda T Doncheva
- Max Planck Institute for Informatics, 66123 Saarbrücken, Germany; University of California, San Francisco, 94143-2240 San Francisco, USA
| | | | - John H Morris
- University of California, San Francisco, 94143-2240 San Francisco, USA
| | | | | | - Mario Albrecht
- University Medicine Greifswald, 17475 Greifswald, Germany; Graz University of Technology, 8010 Graz, Austria; BioTechMed-Graz, 8010 Graz, Austria
| |
|
37
|
Li B, Seligman C, Thusberg J, Miller JL, Auer J, Whirl-Carrillo M, Capriotti E, Klein TE, Mooney SD. In silico comparative characterization of pharmacogenomic missense variants. BMC Genomics 2014; 15 Suppl 4:S4. [PMID: 25057096 PMCID: PMC4092878 DOI: 10.1186/1471-2164-15-s4-s4] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Missense pharmacogenomic (PGx) variants refer to amino acid substitutions that potentially affect the pharmacokinetic (PK) or pharmacodynamic (PD) response to drug therapies. The PGx variants, as compared to disease-associated variants, have not been investigated as deeply. The ability to computationally predict future PGx variants is desirable; however, it is not clear what data sets should be used or what features are beneficial to this end. Hence we carried out a comparative characterization of PGx variants with annotated neutral and disease variants from UniProt, to test the predictive power of sequence conservation and structural information in discriminating these three groups. RESULTS 126 PGx variants of high quality from PharmGKB were selected and two data sets were created: one set contained 416 variants with structural and sequence information, and the other set contained 1,265 variants with sequence information only. In terms of sequence conservation, PGx variants are more conserved than neutral variants and much less conserved than disease variants. A weighted random forest was used to strike a more balanced classification for PGx variants. Generally, structural features are helpful in discriminating PGx variants from the other two groups, but classification of PGx variants from neutral polymorphisms is still much less effective than between disease and neutral variants. CONCLUSIONS We found that PGx variants are much more similar to neutral variants than to disease variants in the feature space consisting of residue conservation, neighboring residue conservation, number of neighbors, and protein solvent accessibility. Such similarity poses great difficulty in the classification of PGx variants and polymorphisms.
|
38
|
SNP-SIG 2013: from coding to non-coding--new approaches for genomic variant interpretation. BMC Genomics 2014; 15 Suppl 4:S1. [PMID: 25056427 PMCID: PMC4083406 DOI: 10.1186/1471-2164-15-s4-s1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open
|
39
|
Pires DEV, Ascher DB, Blundell TL. DUET: a server for predicting effects of mutations on protein stability using an integrated computational approach. Nucleic Acids Res 2014; 42:W314-9. [PMID: 24829462 PMCID: PMC4086143 DOI: 10.1093/nar/gku411] [Citation(s) in RCA: 588] [Impact Index Per Article: 58.8] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open
Abstract
Cancer genome and other sequencing initiatives are generating extensive data on non-synonymous single nucleotide polymorphisms (nsSNPs) in human and other genomes. In order to understand the impacts of nsSNPs on the structure and function of the proteome, as well as to guide protein engineering, accurate in silico methodologies are required to study and predict their effects on protein stability. Despite the diversity of available computational methods in the literature, none has proven accurate and dependable on its own under all scenarios where mutation analysis is required. Here we present DUET, a web server for an integrated computational approach to study missense mutations in proteins. DUET consolidates two complementary approaches (mCSM and SDM) in a consensus prediction, obtained by combining the results of the separate methods in an optimized predictor using Support Vector Machines (SVM). We demonstrate that the proposed method improves overall accuracy of the predictions in comparison with either method individually and performs as well as or better than similar methods. The DUET web server is freely and openly available at http://structure.bioc.cam.ac.uk/duet.
Affiliation(s)
- Douglas E V Pires
- Department of Biochemistry, University of Cambridge, Cambridge, CB2 1GA, UK
| | - David B Ascher
- Department of Biochemistry, University of Cambridge, Cambridge, CB2 1GA, UK; ACRF Rational Drug Discovery Centre and Biota Structural Biology Laboratory, St Vincent's Institute of Medical Research, Fitzroy, VIC 3065, Australia
| | - Tom L Blundell
- Department of Biochemistry, University of Cambridge, Cambridge, CB2 1GA, UK
| |
|
40
|
Wu TJ, Shamsaddini A, Pan Y, Smith K, Crichton DJ, Simonyan V, Mazumder R. A framework for organizing cancer-related variations from existing databases, publications and NGS data using a High-performance Integrated Virtual Environment (HIVE). Database: The Journal of Biological Databases and Curation 2014; 2014:bau022. [PMID: 24667251 PMCID: PMC3965850 DOI: 10.1093/database/bau022] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/22/2023]
Abstract
Years of sequence feature curation by UniProtKB/Swiss-Prot, PIR-PSD, NCBI-CDD, RefSeq and other database biocurators has led to a rich repository of information on functional sites of genes and proteins. This information along with variation-related annotation can be used to scan human short sequence reads from next-generation sequencing (NGS) pipelines for presence of non-synonymous single-nucleotide variations (nsSNVs) that affect functional sites. This and similar workflows are becoming more important because thousands of NGS data sets are being made available through projects such as The Cancer Genome Atlas (TCGA), and researchers want to evaluate their biomarkers in genomic data. BioMuta, an integrated sequence feature database, provides a framework for automated and manual curation and integration of cancer-related sequence features so that they can be used in NGS analysis pipelines. Sequence feature information in BioMuta is collected from the Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, UniProtKB and through biocuration of information available from publications. Additionally, nsSNVs identified through automated analysis of NGS data from TCGA are also included in the database. Because of the petabytes of data and information present in NGS primary repositories, a platform HIVE (High-performance Integrated Virtual Environment) for storing, analyzing, computing and curating NGS data and associated metadata has been developed. Using HIVE, 31 979 nsSNVs were identified in TCGA-derived NGS data from breast cancer patients. All variations identified through this process are stored in a Curated Short Read archive, and the nsSNVs from the tumor samples are included in BioMuta. Currently, BioMuta has 26 cancer types with 13 896 small-scale and 308 986 large-scale study-derived variations. Integration of variation data allows identifications of novel or common nsSNVs that can be prioritized in validation studies. 
Database URL: BioMuta: http://hive.biochemistry.gwu.edu/tools/biomuta/index.php; CSR: http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr; HIVE: http://hive.biochemistry.gwu.edu
Affiliation(s)
- Tsung-Jung Wu
- Department of Biochemistry and Molecular Medicine, George Washington University, Washington, DC 20037, USA; Data Systems and Technology, Jet Propulsion Laboratory, 4800 Oak Grove Drive, Pasadena, CA 91109; Center for Biologics Evaluation and Research, Food and Drug Administration, Rockville, MD 20852, USA; and McCormick Genomic and Proteomic Center, George Washington University, Washington, DC 20037, USA
| | | | | | | | | | | | | |
|
41
|
Cole C, Krampis K, Karagiannis K, Almeida JS, Faison WJ, Motwani M, Wan Q, Golikov A, Pan Y, Simonyan V, Mazumder R. Non-synonymous variations in cancer and their effects on the human proteome: workflow for NGS data biocuration and proteome-wide analysis of TCGA data. BMC Bioinformatics 2014; 15:28. [PMID: 24467687 PMCID: PMC3916084 DOI: 10.1186/1471-2105-15-28] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2013] [Accepted: 01/22/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it. RESULTS To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates the Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program - The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR available clinical information on patients, mapping of the reads to the reference followed by identification of non-synonymous Single Nucleotide Variations (nsSNVs) and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome.
Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr). CONCLUSIONS Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual level SNVs and their effect on the human proteome beyond what the dbSNP database provides.
Affiliation(s)
- Raja Mazumder
- Department of Biochemistry and Molecular Medicine, George Washington University Medical Center, Washington, DC 20037, USA.
42
|
Bendl J, Stourac J, Salanda O, Pavelka A, Wieben ED, Zendulka J, Brezovsky J, Damborsky J. PredictSNP: robust and accurate consensus classifier for prediction of disease-related mutations. PLoS Comput Biol 2014; 10:e1003440. [PMID: 24453961 PMCID: PMC3894168 DOI: 10.1371/journal.pcbi.1003440] [Citation(s) in RCA: 529] [Impact Index Per Article: 52.9] [Received: 08/20/2013] [Accepted: 12/03/2013] [Indexed: 02/07/2023] Open
Abstract
Single nucleotide variants represent a prevalent form of genetic variation. Mutations in the coding regions are frequently associated with the development of various genetic diseases. Computational tools for the prediction of the effects of mutations on protein function are very important for analysis of single nucleotide variants and their prioritization for experimental characterization. Many computational tools are already widely employed for this purpose. Unfortunately, their comparison and further improvement is hindered by large overlaps between the training datasets and benchmark datasets, which lead to biased and overly optimistic reported performances. In this study, we have constructed three independent datasets by removing all duplicates, inconsistencies and mutations previously used in the training of evaluated tools. The benchmark dataset containing over 43,000 mutations was employed for the unbiased evaluation of eight established prediction tools: MAPP, nsSNPAnalyzer, PANTHER, PhD-SNP, PolyPhen-1, PolyPhen-2, SIFT and SNAP. The six best performing tools were combined into a consensus classifier, PredictSNP, which showed significantly improved prediction performance and at the same time returned results for all mutations, confirming that consensus prediction represents an accurate and robust alternative to the predictions delivered by individual tools. A user-friendly web interface enables easy access to all eight prediction tools, the consensus classifier PredictSNP and annotations from the Protein Mutant Database and the UniProt database. The web server and the datasets are freely available to the academic community at http://loschmidt.chemi.muni.cz/predictsnp.
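The idea of combining several tools into one consensus call can be illustrated with a minimal sketch. This is plain unweighted majority voting and is only an assumption for illustration; the abstract does not specify how PredictSNP weighs its constituent tools. The function name `consensus_vote`, the label strings, and the example calls are hypothetical.

```python
from collections import Counter


def consensus_vote(predictions):
    """Return the majority label among tool predictions.

    `predictions` maps a tool name to its label, or to None when the
    tool returned no result for this mutation. Tools without a result
    are skipped, so a consensus can still be reported as long as at
    least one tool produced a call.
    """
    votes = Counter(p for p in predictions.values() if p is not None)
    if not votes:
        return None  # no tool produced a prediction
    (top_label, top_count), *rest = votes.most_common()
    if rest and rest[0][1] == top_count:
        return "tie"  # unweighted voting cannot break an even split
    return top_label


# Hypothetical calls for a single mutation (not real tool output).
calls = {"SIFT": "deleterious", "PhD-SNP": "deleterious",
         "SNAP": "neutral", "MAPP": None}
print(consensus_vote(calls))
```

Skipping missing predictions is what lets a consensus return a result for every mutation even when individual tools fail on some inputs.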
Affiliation(s)
- Jaroslav Bendl
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment, Faculty of Science, Masaryk University, Brno, Czech Republic
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
- Center of Biomolecular and Cellular Engineering, International Centre for Clinical Research, St. Anne's University Hospital Brno, Brno, Czech Republic
- Jan Stourac
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment, Faculty of Science, Masaryk University, Brno, Czech Republic
- Center of Biomolecular and Cellular Engineering, International Centre for Clinical Research, St. Anne's University Hospital Brno, Brno, Czech Republic
- Ondrej Salanda
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
- Antonin Pavelka
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment, Faculty of Science, Masaryk University, Brno, Czech Republic
- Eric D. Wieben
- Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, Minnesota, United States of America
- Jaroslav Zendulka
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
- Jan Brezovsky
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment, Faculty of Science, Masaryk University, Brno, Czech Republic
- Jiri Damborsky
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment, Faculty of Science, Masaryk University, Brno, Czech Republic
- Center of Biomolecular and Cellular Engineering, International Centre for Clinical Research, St. Anne's University Hospital Brno, Brno, Czech Republic
43
|
Compiani M, Capriotti E. Computational and theoretical methods for protein folding. Biochemistry 2013; 52:8601-24. [PMID: 24187909 DOI: 10.1021/bi4001529] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Indexed: 12/12/2022]
Abstract
A computational approach is essential whenever the complexity of the process under study is such that direct theoretical or experimental approaches are not viable. This is the case for protein folding, for which a significant amount of data are being collected. This paper reports on the essential role of in silico methods and the unprecedented interplay of computational and theoretical approaches, which is a defining point of the interdisciplinary investigations of the protein folding process. Besides giving an overview of the available computational methods and tools, we argue that computation plays not merely an ancillary role but has a more constructive function in that computational work may precede theory and experiments. More precisely, computation can provide the primary conceptual clues to inspire subsequent theoretical and experimental work even in a case where no preexisting evidence or theoretical frameworks are available. This is cogently manifested in the application of machine learning methods to come to grips with the folding dynamics. These close relationships suggested complementing the review of computational methods within the appropriate theoretical context to provide a self-contained outlook of the basic concepts that have converged into a unified description of folding and have grown in a synergic relationship with their computational counterpart. Finally, the advantages and limitations of current computational methodologies are discussed to show how the smart analysis of large amounts of data and the development of more effective algorithms can improve our understanding of protein folding.
Affiliation(s)
- Mario Compiani
- School of Sciences and Technology, University of Camerino, Camerino, Macerata 62032, Italy
44
|
SIFT Indel: predictions for the functional effects of amino acid insertions/deletions in proteins. PLoS One 2013; 8:e77940. [PMID: 24194902 PMCID: PMC3806772 DOI: 10.1371/journal.pone.0077940] [Citation(s) in RCA: 94] [Impact Index Per Article: 8.5] [Received: 07/02/2013] [Accepted: 09/05/2013] [Indexed: 12/02/2022] Open
Abstract
Indels in the coding regions of a gene can either cause frameshifts or amino acid insertions/deletions. Frameshifting indels are indels that have a length that is not divisible by 3 and subsequently cause frameshifts. Indels that have a length divisible by 3 cause amino acid insertions/deletions or block substitutions; we call these 3n indels. The new amino acid changes resulting from 3n indels could potentially affect protein function. Therefore, we construct a SIFT Indel prediction algorithm for 3n indels which achieves 82% accuracy, 81% sensitivity, 82% specificity, 82% precision, 0.63 MCC, and 0.87 AUC by 10-fold cross-validation. We have previously published a prediction algorithm for frameshifting indels. The rules for the prediction of 3n indels are different from the rules for the prediction of frameshifting indels and reflect the biological differences of these two different types of variations. SIFT Indel was applied to human 3n indels from the 1000 Genomes Project and the Exome Sequencing Project. We found that common variants are less likely to be deleterious than rare variants. The SIFT indel prediction algorithm for 3n indels is available at http://sift-dna.org/
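The length rule stated in the abstract (a length divisible by 3 gives a "3n" indel, any other length gives a frameshift) can be written down as a small sketch. The function name `classify_indel` and the REF/ALT string representation are assumptions for illustration; this covers only the length test, not the SIFT Indel prediction algorithm itself.

```python
def classify_indel(ref: str, alt: str) -> str:
    """Classify a coding indel by its net length change.

    `ref` and `alt` are the reference and alternate allele strings
    (a hypothetical input format). A net change divisible by 3 keeps
    the reading frame ("3n"); any other change shifts it.
    """
    net_length = abs(len(ref) - len(alt))
    if net_length == 0:
        raise ValueError("alleles have equal length; not an insertion or deletion")
    return "3n" if net_length % 3 == 0 else "frameshift"


print(classify_indel("A", "ACGT"))  # 3 inserted bases
print(classify_indel("ACG", "A"))   # 2 deleted bases
```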
45
|
Bromberg Y. Building a genome analysis pipeline to predict disease risk and prevent disease. J Mol Biol 2013; 425:3993-4005. [PMID: 23928561 DOI: 10.1016/j.jmb.2013.07.038] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 06/03/2013] [Revised: 07/26/2013] [Accepted: 07/28/2013] [Indexed: 12/24/2022]
Abstract
Reduced costs and increased speed and accuracy of sequencing can bring the genome-based evaluation of individual disease risk to the bedside. While past efforts have identified a number of actionable mutations, the bulk of genetic risk remains hidden in sequence data. The biggest challenge facing genomic medicine today is the development of new techniques to predict the specifics of a given human phenome (set of all expressed phenotypes) encoded by each individual variome (full set of genome variants) in the context of the given environment. Numerous tools exist for the computational identification of the functional effects of a single variant. However, the pipelines taking advantage of full genomic, exomic, transcriptomic (and other) sequences have only recently become a reality. This review looks at the building of methodologies for predicting "variome"-defined disease risk. It also discusses some of the challenges for incorporating such a pipeline into everyday medical practice.
Affiliation(s)
- Y Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Drive, New Brunswick, NJ 08873, USA.
46
|
Capriotti E, Calabrese R, Fariselli P, Martelli PL, Altman RB, Casadio R. WS-SNPs&GO: a web server for predicting the deleterious effect of human protein variants using functional annotation. BMC Genomics 2013; 14 Suppl 3:S6. [PMID: 23819482 PMCID: PMC3665478 DOI: 10.1186/1471-2164-14-s3-s6] [Citation(s) in RCA: 216] [Impact Index Per Article: 19.6] [Indexed: 01/31/2023] Open
Abstract
BACKGROUND SNPs&GO is a method for the prediction of deleterious Single Amino acid Polymorphisms (SAPs) using protein functional annotation. In this work, we present the web server implementation of SNPs&GO (WS-SNPs&GO). The server is based on Support Vector Machines (SVM) and, for a given protein, its input comprises the sequence and/or its three-dimensional structure (when available), a set of target variations and its functional Gene Ontology (GO) terms. The output of the server provides, for each protein variation, the probability of being associated with human disease. RESULTS The server consists of two main components, including updated versions of the sequence-based SNPs&GO (recently scored as one of the best algorithms for predicting deleterious SAPs) and of the structure-based SNPs&GO(3d) programs. Sequence and structure based algorithms are extensively tested on a large set of annotated variations extracted from the SwissVar database. Selecting a balanced dataset with more than 38,000 SAPs, the sequence-based approach achieves 81% overall accuracy, 0.61 correlation coefficient and an Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve of 0.88. For the subset of ~6,600 variations mapped on protein structures available at the Protein Data Bank (PDB), the structure-based method achieves 84% overall accuracy, 0.68 correlation coefficient, and 0.91 AUC. When tested on a new blind set of variations, the results of the server are 79% and 83% overall accuracy for the sequence-based and structure-based inputs, respectively. CONCLUSIONS WS-SNPs&GO is a valuable tool that includes in a unique framework information derived from protein sequence, structure, evolutionary profile, and protein function. WS-SNPs&GO is freely available at http://snps.biofold.org/snps-and-go.
Affiliation(s)
- Emidio Capriotti
- Division of Informatics, Department of Pathology, University of Alabama at Birmingham, Birmingham AL, USA.
47
|
Capriotti E, Altman RB, Bromberg Y. Collective judgment predicts disease-associated single nucleotide variants. BMC Genomics 2013; 14 Suppl 3:S2. [PMID: 23819846 PMCID: PMC3839641 DOI: 10.1186/1471-2164-14-s3-s2] [Citation(s) in RCA: 176] [Impact Index Per Article: 16.0] [Indexed: 12/11/2022] Open
Abstract
Background In recent years the number of human genetic variants deposited into the publicly available databases has been increasing exponentially. The latest version of dbSNP, for example, contains ~50 million validated Single Nucleotide Variants (SNVs). SNVs make up most of human variation and are often the primary causes of disease. The non-synonymous SNVs (nsSNVs) result in single amino acid substitutions and may affect protein function, often causing disease. Although several methods for the detection of nsSNV effects have already been developed, the consistent increase in annotated data is offering the opportunity to improve prediction accuracy. Results Here we present a new approach for the detection of disease-associated nsSNVs (Meta-SNP) that integrates four existing methods: PANTHER, PhD-SNP, SIFT and SNAP. We first tested the accuracy of each method using a dataset of 35,766 disease-annotated mutations from 8,667 proteins extracted from the SwissVar database. The four methods reached overall accuracies of 64%-76% with a Matthews correlation coefficient (MCC) of 0.38-0.53. We then used the outputs of these methods to develop a machine learning based approach that discriminates between disease-associated and polymorphic variants (Meta-SNP). In testing, the combined method reached 79% overall accuracy and 0.59 MCC, ~3% higher accuracy and ~0.05 higher correlation with respect to the best-performing method. Moreover, for the hardest-to-define subset of nsSNVs, i.e. variants for which half of the predictors disagreed with the other half, Meta-SNP attained 8% higher accuracy than the best predictor. Conclusions Here we find that the Meta-SNP algorithm achieves better performance than the best single predictor. This result suggests that the methods used for the prediction of variant-disease associations are orthogonal, encoding different biologically relevant relationships.
Careful combination of predictions from various resources is therefore a good strategy for the selection of high reliability predictions. Indeed, for the subset of nsSNVs where all predictors were in agreement (46% of all nsSNVs in the set), our method reached 87% overall accuracy and 0.73 MCC. Meta-SNP server is freely accessible at http://snps.biofold.org/meta-snp.
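The overall accuracy and Matthews correlation coefficient (MCC) figures quoted in this and neighbouring abstracts follow from the four confusion-matrix counts of a binary classifier. A minimal sketch of the standard definitions is given below; the function name `accuracy_and_mcc` and the example counts are hypothetical and are not taken from the Meta-SNP evaluation.

```python
import math


def accuracy_and_mcc(tp: int, tn: int, fp: int, fn: int):
    """Compute overall accuracy and MCC from confusion-matrix counts.

    tp/tn are correctly predicted disease/neutral variants; fp/fn are
    the corresponding errors. MCC is conventionally set to 0 when any
    marginal total is zero, which is what this sketch does.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Hypothetical balanced test set of 100 variants.
acc, mcc = accuracy_and_mcc(tp=40, tn=40, fp=10, fn=10)
print(f"accuracy={acc:.2f}, MCC={mcc:.2f}")  # accuracy=0.80, MCC=0.60
```

Unlike accuracy, MCC stays informative on unbalanced data because it uses all four counts, which is one reason these abstracts report both.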
Affiliation(s)
- Emidio Capriotti
- Division of Informatics, Department of Pathology, University of Alabama at Birmingham, Birmingham, AL, USA.
48
|
Bromberg Y, Capriotti E. Thoughts from SNP-SIG 2012: future challenges in the annotation of genetic variations. BMC Genomics 2013; 14 Suppl 3:S1. [PMID: 23819751 PMCID: PMC3665538 DOI: 10.1186/1471-2164-14-s3-s1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/13/2022] Open
Affiliation(s)
- Yana Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, USA.
49
|
Wei CH, Harris BR, Kao HY, Lu Z. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013; 29:1433-9. [PMID: 23564842 DOI: 10.1093/bioinformatics/btt156] [Citation(s) in RCA: 98] [Impact Index Per Article: 8.9] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Text-mining mutation information from the literature has become a critical part of the bioinformatics approach for the analysis and interpretation of sequence variations in complex diseases in the post-genomic era. It has also been used for assisting the creation of disease-related mutation databases. Most existing approaches are rule-based and focus on limited types of sequence variations, such as protein point mutations. Thus, extending their extraction scope requires significant manual effort in examining new instances and developing corresponding rules. As such, new automatic approaches are greatly needed for extracting different kinds of mutations with high accuracy. RESULTS Here, we report tmVar, a text-mining approach based on conditional random field (CRF) for extracting a wide range of sequence variants described at protein, DNA and RNA levels according to a standard nomenclature developed by the Human Genome Variation Society. By doing so, we cover several important types of mutations that were not considered in past studies. Using a novel CRF label model and feature set, our method achieves higher performance than a state-of-the-art method on both our corpus (91.4 versus 78.1% in F-measure) and their own gold standard (93.9 versus 89.4% in F-measure). These results suggest that tmVar is a high-performance method for mutation extraction from biomedical literature. AVAILABILITY tmVar software and its corpus of 500 manually curated abstracts are available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/pub/tmVar
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), 8600 Rockville Pike, Bethesda, MD 20894, USA
50
|
Hegyi H. GABBR1 has a HERV-W LTR in its regulatory region--a possible implication for schizophrenia. Biol Direct 2013; 8:5. [PMID: 23391219 PMCID: PMC3574838 DOI: 10.1186/1745-6150-8-5] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 12/10/2012] [Accepted: 02/04/2013] [Indexed: 11/25/2022] Open
Abstract
Schizophrenia is a complex disease with uncertain aetiology. We suggest that GABBR1 (GABA receptor B1) is implicated in schizophrenia, based on a HERV-W LTR in the regulatory region of GABBR1. Our hypothesis is supported by the following: (i) GABBR1 is in the 6p22 genomic region most often implicated in schizophrenia; (ii) microarray studies found that only presynaptic pathway-related genes, including GABA receptors, have altered expression in schizophrenic patients; and (iii) it explains how HERV-W elements, expressed in schizophrenia, play a role in the disease: by altering the expression of GABBR1 via a long terminal repeat that is also a regulatory element of GABBR1.
Affiliation(s)
- Hedi Hegyi
- CEITEC-Central European Institute of Technology, Masaryk University, CZ-62500, Brno, Czech Republic.