1
Andrades R, Recamonde-Mendoza M. Machine learning methods for prediction of cancer driver genes: a survey paper. Brief Bioinform 2022; 23:6551145. [PMID: 35323900 DOI: 10.1093/bib/bbac062] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 10/06/2021] [Revised: 02/06/2022] [Accepted: 02/08/2022] [Indexed: 12/21/2022] Open
Abstract
Identifying the genes and mutations that drive the emergence of tumors is a critical step to improving our understanding of cancer and identifying new directions for disease diagnosis and treatment. Despite the large volume of genomics data, the precise detection of driver mutations and their carrying genes, known as cancer driver genes, from the millions of possible somatic mutations remains a challenge. Computational methods play an increasingly important role in discovering genomic patterns associated with cancer drivers and developing predictive models to identify these elements. Machine learning (ML), including deep learning, has been the engine behind many of these efforts and provides excellent opportunities for tackling remaining gaps in the field. Thus, this survey aims to perform a comprehensive analysis of ML-based computational approaches to identify cancer driver mutations and genes, providing an integrated, panoramic view of the broad data and algorithmic landscape within this scientific problem. We discuss how the interactions among data types and ML algorithms have been explored in previous solutions and outline current analytical limitations that deserve further attention from the scientific community. We hope that by helping readers become more familiar with significant developments in the field brought by ML, we may inspire new researchers to address open problems and advance our knowledge towards cancer driver discovery.
Affiliation(s)
- Renan Andrades
- Institute of Informatics, Universidade Federal do Rio Grande do Sul, Porto Alegre/RS, Brazil; Bioinformatics Core, Hospital de Clínicas de Porto Alegre, Porto Alegre/RS, Brazil
- Mariana Recamonde-Mendoza
- Institute of Informatics, Universidade Federal do Rio Grande do Sul, Porto Alegre/RS, Brazil; Bioinformatics Core, Hospital de Clínicas de Porto Alegre, Porto Alegre/RS, Brazil
2
Prediction and prioritization of rare oncogenic mutations in the cancer Kinome using novel features and multiple classifiers. PLoS Comput Biol 2014; 10:e1003545. [PMID: 24743239 PMCID: PMC3990476 DOI: 10.1371/journal.pcbi.1003545] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 05/21/2013] [Accepted: 02/18/2014] [Indexed: 01/18/2023] Open
Abstract
Cancer is a genetic disease that develops through a series of somatic mutations, a subset of which drive cancer progression. Although cancer genome sequencing studies are beginning to reveal the mutational patterns of genes in various cancers, identifying the small subset of “causative” mutations from the large subset of “non-causative” mutations, which accumulate as a consequence of the disease, is a challenge. In this article, we present an effective machine learning approach for identifying cancer-associated mutations in human protein kinases, a class of signaling proteins known to be frequently mutated in human cancers. We evaluate the performance of 11 well known supervised learners and show that a multiple-classifier approach, which combines the performances of individual learners, significantly improves the classification of known cancer-associated mutations. We introduce several novel features related specifically to structural and functional characteristics of protein kinases and find that the level of conservation of the mutated residue at specific evolutionary depths is an important predictor of oncogenic effect. We consolidate the novel features and the multiple-classifier approach to prioritize and experimentally test a set of rare unconfirmed mutations in the epidermal growth factor receptor tyrosine kinase (EGFR). Our studies identify T725M and L861R as rare cancer-associated mutations inasmuch as these mutations increase EGFR activity in the absence of the activating EGF ligand in cell-based assays.

Cancer progresses by accumulation of mutations in a subset of genes that confer growth advantage. The 518 protein kinase genes encoded in the human genome, collectively called the kinome, represent one of the largest families of oncogenes. Targeted sequencing studies of many different cancers have shown that the mutational landscape comprises both cancer-causing “driver” mutations and harmless “passenger” mutations. While the frequent recurrence of some driver mutations in human cancers helps distinguish them from the large number of passenger mutations, a significant challenge is to identify the rare “driver” mutations that are less frequently observed in patient samples and yet are causative. Here we combine computational and experimental approaches to identify rare cancer-associated mutations in Epidermal Growth Factor receptor kinase (EGFR), a signaling protein frequently mutated in cancers. Specifically, we evaluate a novel multiple-classifier approach and features specific to the protein kinase super-family in distinguishing known cancer-associated mutations from benign mutations. We then apply the multiple classifier to identify and test the functional impact of rare cancer-associated mutations in EGFR. We report, for the first time, that the EGFR mutations T725M and L861R, which are infrequently observed in cancers, constitutively activate EGFR in a manner analogous to the frequently observed driver mutations.
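The abstract above does not say how the individual learners are combined. As an illustration only, the following minimal sketch assumes simple majority voting, one common way to build a multiple-classifier predictor. The function names, labels, and example predictions are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine the labels given by several classifiers for one sample.

    `predictions` holds one label per classifier. Ties are broken in
    favour of the label that appears first in the list (an assumption).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def ensemble_predict(per_classifier_predictions):
    """Apply majority voting across classifiers for every sample.

    `per_classifier_predictions` is a list of equal-length label lists,
    one list per classifier.
    """
    return [majority_vote(list(votes)) for votes in zip(*per_classifier_predictions)]


# Hypothetical outputs of three classifiers on four mutations.
clf_a = ["driver", "passenger", "driver", "passenger"]
clf_b = ["driver", "driver", "passenger", "passenger"]
clf_c = ["passenger", "driver", "driver", "passenger"]
combined = ensemble_predict([clf_a, clf_b, clf_c])
```

Other combination rules, such as weighting each learner by its validation performance, would fit the same interface; the choice here is only for brevity.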
3
Computational Approaches and Resources in Single Amino Acid Substitutions Analysis Toward Clinical Research. Advances in Protein Chemistry and Structural Biology 2014; 94:365-423. [DOI: 10.1016/b978-0-12-800168-4.00010-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 02/08/2023]
4
Mele A, Cervantes JR, Chien V, Friedman D, Ferran C. Single nucleotide polymorphisms at the TNFAIP3/A20 locus and susceptibility/resistance to inflammatory and autoimmune diseases. Advances in Experimental Medicine and Biology 2014; 809:163-83. [PMID: 25302371 DOI: 10.1007/978-1-4939-0398-6_10] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Indexed: 12/24/2022]
Abstract
The anti-inflammatory and immune regulatory functions of the ubiquitin-editing and NF-kappaB inhibitory protein A20 are well documented in vitro and in multiple animal models. The high rank held by A20 in the cell's physiologic anti-inflammatory defense mechanisms is highlighted by the striking phenotype of A20 knockout mice, characterized by cachexia, multi-organ failure, and premature death. Even partial depletion of A20, as in A20 heterozygous mice, significantly alters NF-kappaB activation in response to pro-inflammatory activators, even though these mice are phenotypically unremarkable at baseline. A recent burst of genome-wide association studies (GWAS), fueled by advances in genomic technologies and analysis tools, uncovered associations between single nucleotide polymorphisms (SNPs) at the TNFAIP3/A20 gene locus and multiple autoimmune and inflammatory diseases in humans. Interestingly, some of these studies emphasized significant associations between TNFAIP3/A20 SNPs imparting decreased expression or loss of NF-kappaB inhibitory function, and susceptibility to systemic lupus erythematosus (SLE) and coronary artery disease (CAD). These clinical data phenocopy partial loss of A20 in mouse models of inflammatory diseases, thereby incriminating TNFAIP3/A20 deficiency as a pathogenic culprit in autoimmune and inflammatory diseases. In this chapter, we undertook a thorough review of studies that explored association between TNFAIP3/A20 SNPs and human autoimmune and inflammatory diseases. Beyond the prognostic value of TNFAIP3/A20 SNPs for assessing disease risk, their implication in the pathogenic processes of these maladies prompts the pursuit of A20-targeted therapies for disease prevention/treatment in patients harboring susceptibility haplotypes.
5
Peterson TA, Doughty E, Kann MG. Towards precision medicine: advances in computational approaches for the analysis of human variants. J Mol Biol 2013; 425:4047-63. [PMID: 23962656 PMCID: PMC3807015 DOI: 10.1016/j.jmb.2013.08.008] [Citation(s) in RCA: 106] [Impact Index Per Article: 9.6] [Received: 06/04/2013] [Revised: 08/07/2013] [Accepted: 08/08/2013] [Indexed: 12/26/2022]
Abstract
Variations and similarities in our individual genomes are part of our history, our heritage, and our identity. Some human genomic variants are associated with common traits such as hair and eye color, while others are associated with susceptibility to disease or response to drug treatment. Identifying the human variations producing clinically relevant phenotypic changes is critical for providing accurate and personalized diagnosis, prognosis, and treatment for diseases. Furthermore, a better understanding of the molecular underpinning of disease can lead to development of new drug targets for precision medicine. Several resources have been designed for collecting and storing human genomic variations in highly structured, easily accessible databases. Unfortunately, a vast amount of information about these genetic variants and their functional and phenotypic associations is currently buried in the literature, only accessible by manual curation or sophisticated text-mining technology to extract the relevant information. In addition, the low cost of sequencing technologies coupled with increasing computational power has enabled the development of numerous computational methodologies to predict the pathogenicity of human variants. This review provides a detailed comparison of current human variant resources, including HGMD, OMIM, ClinVar, and UniProt/Swiss-Prot, followed by an overview of the computational methods and techniques used to leverage the available data to predict novel deleterious variants. We expect these resources and tools to become the foundation for understanding the molecular details of genomic variants leading to disease, which in turn will enable the promise of precision medicine.
Affiliation(s)
- Thomas A Peterson
- Department of Biological Sciences, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA
- Emily Doughty
- Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA
- Maricel G Kann
- Department of Biological Sciences, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA
6
Yates CM, Sternberg MJE. Proteins and domains vary in their tolerance of non-synonymous single nucleotide polymorphisms (nsSNPs). J Mol Biol 2013; 425:1274-86. [PMID: 23357174 DOI: 10.1016/j.jmb.2013.01.026] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Received: 09/10/2012] [Revised: 01/11/2013] [Accepted: 01/19/2013] [Indexed: 02/05/2023]
Abstract
The widespread application of whole-genome sequencing is identifying numerous non-synonymous single nucleotide polymorphisms (nsSNPs), many of which are associated with disease. We analyzed nsSNPs from Humsavar and the 1000 Genomes Project to investigate why some proteins and domains are more tolerant of mutations than others. We identified 311 proteins and 112 Pfam families, corresponding to 2910 domains, as disease-susceptible and 32 proteins and 67 Pfam families (10,783 domains) as disease-resistant based on the relative numbers of disease-associated and neutral polymorphisms. Proteins with no significant difference from expected numbers of disease and polymorphism nsSNPs are classified as other. This classification takes into account the phenotypes of all known mutations in the protein or domain rather than simply classifying based on the presence or absence of disease nsSNPs. Of the two hypotheses suggested, our results support the model that disease-resistant domains and proteins are more able to tolerate mutations rather than having more lethal mutations that are not observed. Disease-resistant proteins and domains show significantly higher mutation rates and lower sequence conservation than disease-susceptible proteins and domains. Disease-susceptible proteins are more likely to be encoded by essential genes, are more central in protein-protein interaction networks and are less likely to contain loss-of-function mutations in healthy individuals. We use this classification for nsSNP phenotype prediction, predicting nsSNPs in disease-susceptible domains to be disease and those in disease-resistant domains to be polymorphism. In this way, we achieve higher accuracy than SIFT, a state-of-the-art algorithm.
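The three-way classification described above (disease-susceptible, disease-resistant, other) can be illustrated with a small sketch. The abstract does not name the statistical test used, so the one-sided binomial tests, the significance threshold `alpha`, and the function names below are assumptions made for illustration. No multiple-testing correction is applied here, although a real analysis over many domains would likely need one.

```python
from math import comb


def binom_tail(k, n, p, upper=True):
    """Return P(X >= k) if upper, else P(X <= k), for X ~ Binomial(n, p)."""
    ks = range(k, n + 1) if upper else range(0, k + 1)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in ks)


def classify_domain(n_disease, n_neutral, background_disease_fraction, alpha=0.05):
    """Label a protein or domain from its counts of disease and neutral nsSNPs.

    `background_disease_fraction` is the fraction of disease nsSNPs
    expected by chance (for example, the fraction across the whole data set).
    """
    n = n_disease + n_neutral
    if n == 0:
        return "other"
    p = background_disease_fraction
    # Significantly more disease nsSNPs than expected.
    if binom_tail(n_disease, n, p, upper=True) < alpha:
        return "disease-susceptible"
    # Significantly fewer disease nsSNPs than expected.
    if binom_tail(n_disease, n, p, upper=False) < alpha:
        return "disease-resistant"
    return "other"


# Hypothetical counts with a 50% background disease fraction.
labels = [
    classify_domain(18, 2, 0.5),
    classify_domain(2, 18, 0.5),
    classify_domain(10, 10, 0.5),
]
```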
Affiliation(s)
- Christopher M Yates
- Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, Sir Ernst Chain Building, South Kensington, London SW7 2AZ, UK.
7
Gong S, Worth CL, Cheng TMK, Blundell TL. Meet Me Halfway: When Genomics Meets Structural Bioinformatics. J Cardiovasc Transl Res 2011; 4:281-303. [DOI: 10.1007/s12265-011-9259-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 11/30/2010] [Accepted: 02/08/2011] [Indexed: 01/08/2023]
8
Abstract
Identification and annotation of mutated genes or proteins involved in oncogenesis and tumor progression are crucial for both cancer biology and clinical applications. We have developed a human Cancer Proteome Variation Database (CanProVar) by integrating information on protein sequence variations from various public resources, with a focus on cancer-related variations (crVAR). We have also built a user-friendly interface for querying the database. The current version of CanProVar comprises 8,570 crVARs in 2,921 proteins derived from existing genome variation databases and recently published large-scale cancer genome resequencing studies. It also includes 41,541 non-cancer specific variations (ncsVARs) in 30,322 proteins derived from the dbSNP database. CanProVar provides quick access to known crVARs in protein sequences along with related cancer samples, relevant publications, data sources, and functional information such as Gene Ontology (GO) annotations for the proteins, protein domains in which the variation occurs, and protein interaction partners with crVARs. CanProVar also helps reveal functional characteristics of crVARs and proteins bearing these variations. Our analysis showed that crVARs were enriched in certain protein domains. We also showed that proteins bearing crVARs were more likely to interact with each other in the protein interaction network. CanProVar can be accessed from http://bioinfo.vanderbilt.edu/canprovar.
Affiliation(s)
- Jing Li
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee 37232, USA
9
A survey of proteins encoded by non-synonymous single nucleotide polymorphisms reveals a significant fraction with altered stability and activity. Biochem J 2009; 424:15-26. [DOI: 10.1042/bj20090723] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Indexed: 11/17/2022]
Abstract
On average, each human gene has approximately four SNPs (single nucleotide polymorphisms) in the coding region, half of which are nsSNPs (non-synonymous SNPs) or missense SNPs. Current attention is focused on those that are known to perturb function and are strongly linked to disease. However, the vast majority of SNPs have not been investigated for the possibility of causing disease. We set out to assess the fraction of nsSNPs that encode proteins that have altered stability and activity, for this class of variants would be candidates to perturb cellular function. We tested the thermostability and, where possible, the catalytic activity for the most common variant (wild-type) and minor variants (total of 46 SNPs) for 16 human enzymes for which the three-dimensional structures were known. There were significant differences in the stability of almost half of the variants (48%) compared with their wild-type counterparts. The catalytic efficiency of approx. 14 variants was significantly altered, including several variants of human PKM2 (pyruvate kinase muscle 2). Two PKM2 variants, S437Y and E28K, also exhibited changes in their allosteric regulation compared with the wild-type enzyme. The high proportion of nsSNPs that affect protein stability and function, albeit subtly, underscores the need for experimental analysis of the diverse human proteome.
10
Kooloos WM, Wessels JA, van der Straaten T, Huizinga TW, Guchelaar HJ. Criteria for the selection of single nucleotide polymorphisms in pathway pharmacogenetics: TNF inhibitors as a case study. Drug Discov Today 2009; 14:837-44. [DOI: 10.1016/j.drudis.2009.05.017] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 11/18/2008] [Revised: 05/20/2009] [Accepted: 05/27/2009] [Indexed: 12/11/2022]
11
Abstract
Models that explicitly account for the effect of selection on new mutations have been proposed to account for "codon bias" or the excess of "preferred" codons that results from selection for translational efficiency and/or accuracy. In principle, such models can be applied to any mutation that results in a preferred allele, but in most cases, the fitness effect of a specific mutation cannot be predicted. Here we show that it is possible to assign preferred and unpreferred states to amino acid changing mutations that occur in protein domains. We propose that mutations that lead to more common amino acids (at a given position in a domain) can be considered "preferred alleles" just as are synonymous mutations leading to codons for more abundant tRNAs. We use genome-scale polymorphism data to show that alleles for preferred amino acids in protein domains occur at higher frequencies in the population, as has been shown for preferred codons. We show that this effect is quantitative, such that there is a correlation between the shift in frequency of preferred alleles and the predicted fitness effect. As expected, we also observe a reduction in the numbers of polymorphisms and substitutions at more important positions in domains, consistent with stronger selection at those positions. We examine the derived allele frequency distribution and polymorphism to divergence ratios of preferred and unpreferred differences and find evidence for both negative and positive selections acting to maintain protein domains in the human population. Finally, we analyze a model for selection on amino acid preferences in protein domains and find that it is consistent with the quantitative effects that we observe.
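The idea of a "preferred" amino acid change can be made concrete with a short sketch. It assumes that preference is judged by comparing the relative frequencies of the ancestral and derived residues in one column of a domain alignment. The function names, the gap symbol, and the example column are hypothetical, and the published method may weight or estimate frequencies differently.

```python
from collections import Counter


def column_frequencies(column):
    """Relative amino acid frequencies at one domain position, ignoring gaps."""
    residues = [aa for aa in column if aa != "-"]
    counts = Counter(residues)
    total = len(residues)
    return {aa: c / total for aa, c in counts.items()}


def classify_change(column, ancestral, derived):
    """Label a change as preferred when it leads to a more common residue."""
    freqs = column_frequencies(column)
    f_anc = freqs.get(ancestral, 0.0)
    f_der = freqs.get(derived, 0.0)
    if f_der > f_anc:
        return "preferred"
    if f_der < f_anc:
        return "unpreferred"
    return "equal"


# Hypothetical alignment column: L is most common, then I, then V.
column = "LLLLIIV-"
v_to_l = classify_change(column, "V", "L")
l_to_v = classify_change(column, "L", "V")
```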
Affiliation(s)
- Alan M Moses
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK.
12
Yang JO, Hwang S, Oh J, Bhak J, Sohn TK. An integrated database-pipeline system for studying single nucleotide polymorphisms and diseases. BMC Bioinformatics 2008; 9 Suppl 12:S19. [PMID: 19091018 PMCID: PMC2638159 DOI: 10.1186/1471-2105-9-s12-s19] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Indexed: 12/01/2022] Open
Abstract
BACKGROUND Studies on the relationship between disease and genetic variations such as single nucleotide polymorphisms (SNPs) are important. Genetic variations can cause disease by influencing important biological regulation processes. Despite the need to analyze correlations between SNPs and disease, most existing databases provide information only on functional variants at specific locations on the genome, or deal with only a few genes associated with disease. There is no combined resource to widely support gene-, SNP-, and disease-related information, and to capture relationships among such data. Therefore, we developed an integrated database-pipeline system for studying SNPs and diseases.

RESULTS To implement the pipeline system for the integrated database, we first unified complicated and redundant disease terms and gene names using the Unified Medical Language System (UMLS) for classification and noun modification, and the HUGO Gene Nomenclature Committee (HGNC) and NCBI gene databases. Next, we collected and integrated representative databases for three categories of information. For genes and proteins, we examined the NCBI mRNA, UniProt, UCSC Table Track and MitoDat databases. For genetic variants, we used the dbSNP, JSNP, ALFRED, and HGVbase databases. For disease, we employed the OMIM, GAD, and HGMD databases. The database-pipeline system provides a disease thesaurus, including genes and SNPs associated with disease. The search results for these categories are available on the web page http://diseasome.kobic.re.kr/, and a genome browser is also available to highlight findings, as well as to permit the convenient review of potentially deleterious SNPs among genes strongly associated with specific diseases and clinical phenotypes.

CONCLUSION Our system is designed to capture the relationships between SNPs associated with disease and disease-causing genes. The integrated database-pipeline provides a list of candidate genes and SNP markers for evaluation in both epidemiological and molecular biological approaches to disease-gene association studies. Furthermore, researchers can then select the data set for association studies semi-automatically while considering the relationships between genetic variation and diseases. The database can also make disease-association studies more economical and facilitate an understanding of the processes that cause disease. Currently, the database contains 14,674 SNP records and 109,715 gene records associated with human diseases, and it is updated at regular intervals.
Affiliation(s)
- Jin Ok Yang
- Korean BioInformation Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 305-806, Korea
- Sohyun Hwang
- Korean BioInformation Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 305-806, Korea
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea
- Jeongsu Oh
- Korean BioInformation Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 305-806, Korea
- Jong Bhak
- Korean BioInformation Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 305-806, Korea
- Tae-Kwon Sohn
- Korean BioInformation Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 305-806, Korea
- Department of Biochemistry, Yonsei University, Seoul, Korea
13
Kim YU, Kim SH, Jin H, Park YK, Ji MH, Kim YJ. The Korean HapMap Project Website. Genomics Inform 2008. [DOI: 10.5808/gi.2008.6.2.091] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/20/2022] Open
14
Kim BC, Kim WY, Park D, Chung WH, Shin KS, Bhak J. SNP@Promoter: a database of human SNPs (single nucleotide polymorphisms) within the putative promoter regions. BMC Bioinformatics 2008; 9 Suppl 1:S2. [PMID: 18315851 PMCID: PMC2259403 DOI: 10.1186/1471-2105-9-s1-s2] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.6] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Analysis of single nucleotide polymorphisms (SNPs) is becoming a key research area in genomics. Many functional analyses of SNPs have been carried out for coding regions and splicing sites that can alter proteins and mRNA splicing. However, SNPs in non-coding regulatory regions can also influence important biological regulation. Presently, there are few databases for SNPs in non-coding regulatory regions.

DESCRIPTION We identified 488,452 human SNPs in the putative promoter regions that extended from the +5000 bp to -500 bp region of the transcription start sites. Some SNPs occurring in transcription factor (TF) binding sites were also predicted (47,832 SNPs; 9.8%). The results are stored in the SNP@Promoter database. Users can search the SNP@Promoter database in three ways: 1) by SNP identifier (rs number from dbSNP), 2) by gene (gene name, gene symbol, RefSeq ID), and 3) by disease term. The SNP@Promoter database provides extensive genetic information and graphical views of queried terms.

CONCLUSION We present the SNP@Promoter database. It was created in order to predict functional SNPs in putative promoter regions and predicted transcription factor binding sites. SNP@Promoter will help researchers to identify functional SNPs in non-coding regions.
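Selecting SNPs that fall in a putative promoter region amounts to a strand-aware window test around each transcription start site (TSS). The sketch below is illustrative only: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the defaults of 5000 bp upstream and 500 bp downstream are one reading of the window quoted in the abstract, whose sign convention is not spelled out there.

```python
def promoter_window(tss, strand, upstream, downstream):
    """Return the inclusive (start, end) genomic window around a TSS.

    On the minus strand, "upstream" lies at higher genomic coordinates,
    so the two extents swap sides.
    """
    if strand == "+":
        return tss - upstream, tss + downstream
    if strand == "-":
        return tss - downstream, tss + upstream
    raise ValueError("strand must be '+' or '-'")


def snp_in_promoter(snp_pos, tss, strand, upstream=5000, downstream=500):
    """Check whether a SNP position falls inside the promoter window."""
    start, end = promoter_window(tss, strand, upstream, downstream)
    return start <= snp_pos <= end


# Hypothetical gene with its TSS at position 10000.
plus_window = promoter_window(10000, "+", 5000, 500)
minus_window = promoter_window(10000, "-", 5000, 500)
hit_plus = snp_in_promoter(6000, 10000, "+")
hit_minus = snp_in_promoter(6000, 10000, "-")
```

Keeping both extents as parameters lets the window be adjusted if the intended convention differs from the assumed defaults.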
Affiliation(s)
- Byoung-Chul Kim
- Korean BioInformation Center (KOBIC), KRIBB, Daejeon 305-806, Korea.
15
Uzun A, Leslin CM, Abyzov A, Ilyin V. Structure SNP (StSNP): a web server for mapping and modeling nsSNPs on protein structures with linkage to metabolic pathways. Nucleic Acids Res 2007; 35:W384-92. [PMID: 17537826 PMCID: PMC1933130 DOI: 10.1093/nar/gkm232] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.1] [Indexed: 11/13/2022] Open
Abstract
SNPs located within the open reading frame of a gene that result in an alteration in the amino acid sequence of the encoded protein [nonsynonymous SNPs (nsSNPs)] might directly or indirectly affect functionality of the protein, alone or in the interactions in a multi-protein complex, by increasing/decreasing the activity of the metabolic pathway. Understanding the functional consequences of such changes and drawing conclusions about the molecular basis of diseases involves integrating information from multiple heterogeneous sources including sequence, structure data and pathway relations between proteins. The data from NCBI's SNP database (dbSNP), gene and protein databases from Entrez, protein structures from the PDB and pathway information from KEGG have all been cross-referenced into the StSNP web server, in an effort to provide combined, integrated reports about nsSNPs. StSNP provides 'on the fly' comparative modeling of nsSNPs with links to metabolic pathway information, along with real-time visual comparative analysis of the modeled structures using the Friend software application. The use of metabolic pathways in StSNP allows a researcher to examine possible disease-related pathways associated with a particular nsSNP(s), and link the diseases with the current available molecular structure data. The server is publicly available at http://glinka.bio.neu.edu/StSNP/.
Affiliation(s)
- Valentin Ilyin
- To whom correspondence should be addressed. +617 373 7048; +617 373 3724
16
Park J, Hwang S, Lee YS, Kim SC, Lee D. SNP@Ethnos: a database of ethnically variant single-nucleotide polymorphisms. Nucleic Acids Res 2006; 35:D711-5. [PMID: 17135185 PMCID: PMC1747186 DOI: 10.1093/nar/gkl962] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Indexed: 11/13/2022] Open
Abstract
Inherited genetic variation plays a critical but largely uncharacterized role in human differentiation. The completion of the International HapMap Project makes it possible to identify loci that may cause human differentiation. We have devised an approach to find such ethnically variant single-nucleotide polymorphisms (ESNPs) from the genotype profile of the populations included in the International HapMap database. We selected ESNPs using the nearest shrunken centroid method (NSCM), and performed multiple tests for genetic heterogeneity and frequency spectrum on genes having ESNPs. The function and disease association of the selected SNPs were also annotated. This resulted in the identification of 100 736 SNPs that appeared uniquely in each ethnic group. Of these SNPs, 1009 were within disease-associated genes, and 85 were predicted as damaging using the Sorting Intolerant From Tolerant system. This study resulted in the creation of the SNP@Ethnos database, which is designed to make this type of detailed genetic variation approach available to a wider range of researchers. SNP@Ethnos is a public database of ESNPs with annotation information that currently contains 100 736 ESNPs from 10 138 genes, and can be accessed at http://variome.net and http://bioportal.net/ or directly at http://bioportal.kobic.re.kr/SNPatETHNIC/.
Affiliation(s)
- Jungsun Park
- Korean BioInformation Center, KRIBB, Daejeon 305-806, Korea.