1. Saberian N, Shafi A, Peyvandipour A, Draghici S. MAGPEL: an autoMated pipeline for inferring vAriant-driven Gene PanEls from the full-length biomedical literature. Sci Rep 2020; 10:12365. [PMID: 32703994 PMCID: PMC7378213 DOI: 10.1038/s41598-020-68649-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/13/2019] [Accepted: 06/17/2020] [Indexed: 11/09/2022]
Abstract
In spite of the efforts in developing and maintaining accurate variant databases, a large number of disease-associated variants are still hidden in the biomedical literature. Curation of the biomedical literature in an effort to extract this information is a challenging task due to: (i) the complexity of natural language processing, (ii) inconsistent use of standard recommendations for variant description, and (iii) the lack of clarity and consistency in describing the variant-genotype-phenotype associations in the biomedical literature. In this article, we employ text mining and word cloud analysis techniques to address these challenges. The proposed framework extracts the variant-gene-disease associations from the full-length biomedical literature and designs an evidence-based variant-driven gene panel for a given condition. We validate the identified genes by showing their diagnostic abilities to predict the patients' clinical outcome on several independent validation cohorts. As representative examples, we present our results for acute myeloid leukemia (AML), breast cancer and prostate cancer. We compare these panels with other variant-driven gene panels obtained from Clinvar, Mastermind and others from literature, as well as with a panel identified with a classical differentially expressed genes (DEGs) approach. The results show that the panels obtained by the proposed framework yield better results than the other gene panels currently available in the literature.
Affiliation(s)
- Nafiseh Saberian
  - Department of Computer Science, Wayne State University, Detroit, MI, USA
- Adib Shafi
  - Department of Computer Science, Wayne State University, Detroit, MI, USA
- Azam Peyvandipour
  - Department of Computer Science, Wayne State University, Detroit, MI, USA
- Sorin Draghici
  - Department of Computer Science, Wayne State University, Detroit, MI, USA
  - Department of Obstetrics and Gynecology, Wayne State University, Detroit, MI, USA
2. Mahmood ASMA, Rao S, McGarvey P, Wu C, Madhavan S, Vijay-Shanker K. eGARD: Extracting associations between genomic anomalies and drug responses from text. PLoS One 2017; 12:e0189663. [PMID: 29261751 PMCID: PMC5738129 DOI: 10.1371/journal.pone.0189663] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 05/26/2017] [Accepted: 11/29/2017] [Indexed: 12/25/2022]
Abstract
Tumor molecular profiling plays an integral role in identifying genomic anomalies which may help in personalizing cancer treatments, improving patient outcomes and minimizing risks associated with different therapies. However, critical information regarding the evidence of clinical utility of such anomalies is largely buried in biomedical literature. It is becoming prohibitive for biocurators, clinical researchers and oncologists to keep up with the rapidly growing volume and breadth of information, especially those that describe therapeutic implications of biomarkers and therefore relevant for treatment selection. In an effort to improve and speed up the process of manually reviewing and extracting relevant information from literature, we have developed a natural language processing (NLP)-based text mining (TM) system called eGARD (extracting Genomic Anomalies association with Response to Drugs). This system relies on the syntactic nature of sentences coupled with various textual features to extract relations between genomic anomalies and drug response from MEDLINE abstracts. Our system achieved high precision, recall and F-measure of up to 0.95, 0.86 and 0.90, respectively, on annotated evaluation datasets created in-house and obtained externally from PharmGKB. Additionally, the system extracted information that helps determine the confidence level of extraction to support prioritization of curation. Such a system will enable clinical researchers to explore the use of published markers to stratify patients upfront for 'best-fit' therapies and readily generate hypotheses for new clinical trials.
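As an illustrative aside that is not taken from the eGARD paper, the relationship between the three reported figures can be checked with the standard F-measure formula (the harmonic mean of precision and recall). The sketch below assumes that the quoted best precision (0.95) and recall (0.86) come from the same evaluation set, which the abstract does not state; the function name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score: a weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        # Avoid division by zero when the system returns nothing correct.
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Best precision and recall quoted in the abstract
# (assumption: both were measured on the same dataset).
score = f_measure(0.95, 0.86)
print(round(score, 2))
```

Under that assumption the computed value rounds to the 0.90 F-measure quoted in the abstract.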
Affiliation(s)
- A. S. M. Ashique Mahmood
  - Department of Computer and Information Science, University of Delaware, Newark, Delaware, United States of America
- Shruti Rao
  - Innovation Center For Biomedical Informatics, Georgetown University, Washington D.C, United States of America
- Peter McGarvey
  - Innovation Center For Biomedical Informatics, Georgetown University, Washington D.C, United States of America
  - Protein Information Resource, Georgetown University Medical Center, Washington D.C, United States of America
- Cathy Wu
  - Department of Computer and Information Science, University of Delaware, Newark, Delaware, United States of America
  - Protein Information Resource, Georgetown University Medical Center, Washington D.C, United States of America
  - Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware, United States of America
- Subha Madhavan
  - Innovation Center For Biomedical Informatics, Georgetown University, Washington D.C, United States of America
  - Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington D.C, United States of America
- K. Vijay-Shanker
  - Department of Computer and Information Science, University of Delaware, Newark, Delaware, United States of America
3. Langlais D, Fodil N, Gros P. Genetics of Infectious and Inflammatory Diseases: Overlapping Discoveries from Association and Exome-Sequencing Studies. Annu Rev Immunol 2017; 35:1-30. [DOI: 10.1146/annurev-immunol-051116-052442] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Indexed: 11/09/2022]
Affiliation(s)
- David Langlais
  - McGill University Research Centre on Complex Traits, McGill University, Montreal, Quebec H3G 0B1, Canada
  - Department of Biochemistry, McGill University, Montreal, Quebec H3G 0B1, Canada
- Nassima Fodil
  - McGill University Research Centre on Complex Traits, McGill University, Montreal, Quebec H3G 0B1, Canada
  - Department of Biochemistry, McGill University, Montreal, Quebec H3G 0B1, Canada
- Philippe Gros
  - McGill University Research Centre on Complex Traits, McGill University, Montreal, Quebec H3G 0B1, Canada
  - Department of Biochemistry, McGill University, Montreal, Quebec H3G 0B1, Canada
4. Mahmood ASMA, Wu TJ, Mazumder R, Vijay-Shanker K. DiMeX: A Text Mining System for Mutation-Disease Association Extraction. PLoS One 2016; 11:e0152725. [PMID: 27073839 PMCID: PMC4830514 DOI: 10.1371/journal.pone.0152725] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Received: 04/17/2015] [Accepted: 03/19/2016] [Indexed: 11/22/2022]
Abstract
The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation to disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has been also evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied on a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators to enrich mutation databases.
Affiliation(s)
- A. S. M. Ashique Mahmood
  - Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
- Tsung-Jung Wu
  - Department of Biochemistry and Molecular Medicine, George Washington University, Washington, District of Columbia, United States of America
- Raja Mazumder
  - Department of Biochemistry and Molecular Medicine, George Washington University, Washington, District of Columbia, United States of America
  - McCormick Genomic and Proteomic Center, George Washington University, Washington, District of Columbia, United States of America
- K. Vijay-Shanker
  - Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
5. Chang TJ, Wang WC, Hsiung CA, He CT, Lin MW, Sheu WHH, Chang YC, Quertermous T, Chen I, Rotter J, Chuang LM. Genetic Variation in the Human SORBS1 Gene is Associated With Blood Pressure Regulation and Age at Onset of Hypertension: A SAPPHIRe Cohort Study. Medicine (Baltimore) 2016; 95:e2970. [PMID: 26962801 PMCID: PMC4998882 DOI: 10.1097/md.0000000000002970] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 10/16/2015] [Revised: 01/19/2016] [Accepted: 02/09/2016] [Indexed: 01/11/2023]
Abstract
Essential hypertension is a complex disease involving multiple genetic and environmental factors. A human gene containing a sorbin homology domain and 3 SH3 domains in the C-terminal region, termed SORBS1, plays a significant role in insulin signaling. We previously found a significant association between the T228A polymorphism and insulin resistance, obesity, and type 2 diabetes. It has been hypothesized that a set of genes responsible for insulin resistance may be closely linked with genes susceptible to the development of hypertension. Identification of insulin resistance-related genetic factors may, therefore, enhance our understanding of essential hypertension. This study aimed to examine whether common SORBS1 genetic variations are associated with blood pressure and age at onset of hypertension in an ethnic Chinese cohort. We genotyped 9 common tagged single nucleotide polymorphisms of the SORBS1 gene in 1136 subjects of Chinese origin from the Stanford Asia-Pacific Program for Hypertension and Insulin Resistance family study. Blood pressure was measured upon enrolment. The associations of the SORBS1 single nucleotide polymorphisms with blood pressure and the presence of hypertension were analyzed with a generalized estimating equation model. We used the false-discovery rate measure Q value with a cutoff <0.1 to adjust for multiple comparisons. In the Cox regression analysis for hypertension-free survival, a robust sandwich variance estimator was used to deal with the within-family correlations with age at onset of hypertension. Gender, body mass index, and antihypertension medication were adjustment covariates in the Cox regression analysis. In this study, genetic variants of rs2281939 and rs2274490 were significantly associated with both systolic and diastolic blood pressure. A genetic variant of rs2274490 was also significantly associated with the presence of hypertension. Furthermore, genetic variants of rs2281939 and rs2274490 were associated with age at onset of hypertension after adjustment for gender, body mass index, and antihypertension medication. In conclusion, we provide evidence for an association between common SORBS1 genetic variations and blood pressure, presence of hypertension, and age at onset of hypertension. The biological mechanism of genetic variation associated with blood pressure regulation needs further investigation.
Affiliation(s)
- Tien-Jyun Chang
  - From the Department of Internal Medicine, National Taiwan University Hospital, Taipei, Taiwan (T-JC, Y-CC, L-MC); The Ph.D. Program for Translational Medicine, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan (W-CW); Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan, Taiwan (W-CW, C-AH); Department of Endocrinology and Metabolism, Tri-Service General Hospital, Taipei, Taiwan (C-TH); Institute of Public Health, National Yang-Ming University, Taipei, Taiwan (M-WL); Department of Medical Research & Education, Taipei Veterans General Hospital, Taipei, Taiwan (M-WL); Department of Endocrinology and Metabolism, Taichung Veterans General Hospital, Taichung, Taiwan (WH-HS); Graduate Institute of Medical Genomics and Proteomics, National Taiwan University Medical College, Taipei, Taiwan (Y-CC); Division of Cardiovascular Medicine, Falk CVRC, Stanford University School of Medicine, Stanford, CA (TQ); Los Angeles Biomedical Research Institute, Los Angeles, CA (IC, JR); Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan (L-MC)
6. Fodil N, Langlais D, Gros P. Primary Immunodeficiencies and Inflammatory Disease: A Growing Genetic Intersection. Trends Immunol 2016; 37:126-140. [PMID: 26791050 DOI: 10.1016/j.it.2015.12.006] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Received: 11/19/2015] [Revised: 12/10/2015] [Accepted: 12/13/2015] [Indexed: 02/08/2023]
Abstract
Recent advances in genome analysis have provided important insights into the genetic architecture of infectious and inflammatory diseases. The combined analysis of loci detected by genome-wide association studies (GWAS) in 22 inflammatory diseases has revealed a shared genetic core and associated biochemical pathways that play a central role in pathological inflammation. Parallel whole-exome sequencing studies have identified 265 genes mutated in primary immunodeficiencies (PID). Here, we examine the overlap between these two data sets, and find that it consists of genes essential for protection against infections and in which persistent activation causes pathological inflammation. Based on this intersection, we propose that, although strong or inactivating mutations (rare variants) in these genes may cause severe disease (PIDs), their more subtle modulation potentially by common regulatory/coding variants may contribute to chronic inflammation.
Affiliation(s)
- Nassima Fodil
  - Department of Biochemistry, Complex Traits Group, McGill University, Montreal, QC, Canada
- David Langlais
  - Department of Biochemistry, Complex Traits Group, McGill University, Montreal, QC, Canada
- Philippe Gros
  - Department of Biochemistry, Complex Traits Group, McGill University, Montreal, QC, Canada
7. Genetic polymorphisms of PCSK2 are associated with glucose homeostasis and progression to type 2 diabetes in a Chinese population. Sci Rep 2015; 5:14380. [PMID: 26607656 PMCID: PMC4660384 DOI: 10.1038/srep14380] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 03/20/2015] [Accepted: 08/21/2015] [Indexed: 01/07/2023]
Abstract
Proprotein convertase subtilisin/kexin type 2 (PCSK2) is a prohormone processing enzyme involved in insulin and glucagon biosynthesis. We previously found the genetic polymorphism of PCSK2 on chromosome 20 was responsible for the linkage peak of several glucose homeostasis parameters. The aim of this study is to investigate the association between genetic variants of PCSK2 and glucose homeostasis parameters and incident diabetes. A total of 1142 Chinese participants were recruited from the Stanford Asia-Pacific Program for Hypertension and Insulin Resistance (SAPPHIRe) family study, and 759 participants were followed up for 5 years. Ten SNPs of the PCSK2 gene were genotyped. Variants of rs6044695 and rs2284912 were associated with fasting plasma glucose, and variants of rs2269023 were associated with fasting plasma glucose and 1-hour plasma glucose during OGTT. Haplotypes of rs4814605/rs1078199 were associated with fasting plasma insulin levels and HOMA-IR. Haplotypes of rs890609/rs2269023 were also associated with fasting plasma glucose, fasting insulin and HOMA-IR. In the longitudinal study, we found individuals carrying TA/AA genotypes of rs6044695 or TC/CC genotypes of rs2284912 had lower incidence of diabetes during the 5-year follow-up. Our results indicated that PCSK2 gene polymorphisms are associated with pleiotropic effects on various traits of glucose homeostasis and incident diabetes.
8. Winnier DA, Fourcaudot M, Norton L, Abdul-Ghani MA, Hu SL, Farook VS, Coletta DK, Kumar S, Puppala S, Chittoor G, Dyer TD, Arya R, Carless M, Lehman DM, Curran JE, Cromack DT, Tripathy D, Blangero J, Duggirala R, Göring HHH, DeFronzo RA, Jenkinson CP. Transcriptomic identification of ADH1B as a novel candidate gene for obesity and insulin resistance in human adipose tissue in Mexican Americans from the Veterans Administration Genetic Epidemiology Study (VAGES). PLoS One 2015; 10:e0119941. [PMID: 25830378 PMCID: PMC4382323 DOI: 10.1371/journal.pone.0119941] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Received: 02/19/2014] [Accepted: 02/04/2015] [Indexed: 01/01/2023]
Abstract
Type 2 diabetes (T2D) is a complex metabolic disease that is more prevalent in ethnic groups such as Mexican Americans, and is strongly associated with the risk factors obesity and insulin resistance. The goal of this study was to perform whole genome gene expression profiling in adipose tissue to detect common patterns of gene regulation associated with obesity and insulin resistance. We used phenotypic and genotypic data from 308 Mexican American participants from the Veterans Administration Genetic Epidemiology Study (VAGES). Basal fasting RNA was extracted from adipose tissue biopsies from a subset of 75 unrelated individuals, and gene expression data generated on the Illumina BeadArray platform. The number of gene probes with significant expression above baseline was approximately 31,000. We performed multiple regression analysis of all probes with 15 metabolic traits. Adipose tissue had 3,012 genes significantly associated with the traits of interest (false discovery rate, FDR ≤ 0.05). The significance of gene expression changes was used to select 52 genes with significant (FDR ≤ 10⁻⁴) gene expression changes across multiple traits. Gene sets/Pathways analysis identified one gene, alcohol dehydrogenase 1B (ADH1B) that was significantly enriched (P < 10⁻⁶⁰) as a prime candidate for involvement in multiple relevant metabolic pathways. Illumina BeadChip derived ADH1B expression data was consistent with quantitative real time PCR data. We observed significant inverse correlations with waist circumference (2.8 × 10⁻⁹), BMI (5.4 × 10⁻⁶), and fasting plasma insulin (P < 0.001). These findings are consistent with a central role for ADH1B in obesity and insulin resistance and provide evidence for a novel genetic regulatory mechanism for human metabolic diseases related to these traits.
Affiliation(s)
- Deidre A. Winnier
  - Division of Diabetes, Department of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, TX, United States of America
- Marcel Fourcaudot
  - Division of Diabetes, Department of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, TX, United States of America
- Luke Norton
  - Division of Diabetes, Department of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, TX, United States of America
- Muhammad A. Abdul-Ghani
  - Division of Diabetes, Department of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, TX, United States of America
- Shirley L. Hu
  - Division of Nephrology, Department of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, TX, United States of America
- Vidya S. Farook
  - Texas Biomedical Research Institute, San Antonio, TX, United States of America
- Dawn K. Coletta
  - School of Life Sciences, Arizona State University, Tempe, AZ, United States of America
- Satish Kumar
  - Texas Biomedical Research Institute, San Antonio, TX, United States of America
- Sobha Puppala
  - Texas Biomedical Research Institute, San Antonio, TX, United States of America
- Geetha Chittoor
  - Texas Biomedical Research Institute, San Antonio, TX, United States of America
- Thomas D. Dyer
  - Texas Biomedical Research Institute, San Antonio, TX, United States of America
- Rector Arya
  - Division of Endocrinology and Diabetes, Department of Pediatrics, University of Texas Health Science Center at San Antonio, San Antonio, TX, United States of America
- Melanie Carless
  - Texas Biomedical Research Institute, San Antonio, TX, United States of America
- Donna M. Lehman
  - Division of Clinical Epidemiology, Department of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, TX, United States of America
- Joanne E. Curran
  - Texas Biomedical Research Institute, San Antonio, TX, United States of America
- Douglas T. Cromack
  - Division of Orthopedics, Department of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, TX, United States of America
  - South Texas Veterans Health Care System, San Antonio, TX, United States of America
- Devjit Tripathy
  - Division of Diabetes, Department of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, TX, United States of America
  - South Texas Veterans Health Care System, San Antonio, TX, United States of America
- John Blangero
  - Texas Biomedical Research Institute, San Antonio, TX, United States of America
- Harald H. H. Göring
  - Texas Biomedical Research Institute, San Antonio, TX, United States of America
- Ralph A. DeFronzo
  - Division of Diabetes, Department of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, TX, United States of America
  - South Texas Veterans Health Care System, San Antonio, TX, United States of America
- Christopher P. Jenkinson
  - Division of Diabetes, Department of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, TX, United States of America
  - Texas Biomedical Research Institute, San Antonio, TX, United States of America
  - South Texas Veterans Health Care System, San Antonio, TX, United States of America
10. Pletscher-Frankild S, Pallejà A, Tsafou K, Binder JX, Jensen LJ. DISEASES: text mining and data integration of disease-gene associations. Methods 2014; 74:83-9. [PMID: 25484339 DOI: 10.1016/j.ymeth.2014.11.020] [Citation(s) in RCA: 334] [Impact Index Per Article: 33.4] [Received: 01/21/2014] [Revised: 11/15/2014] [Accepted: 11/25/2014] [Indexed: 12/18/2022]
Abstract
Text mining is a flexible technology that can be applied to numerous different tasks in biology and medicine. We present a system for extracting disease-gene associations from biomedical abstracts. The system consists of a highly efficient dictionary-based tagger for named entity recognition of human genes and diseases, which we combine with a scoring scheme that takes into account co-occurrences both within and between sentences. We show that this approach is able to extract half of all manually curated associations with a false positive rate of only 0.16%. Nonetheless, text mining should not stand alone, but be combined with other types of evidence. For this reason, we have developed the DISEASES resource, which integrates the results from text mining with manually curated disease-gene associations, cancer mutation data, and genome-wide association studies from existing databases. The DISEASES resource is accessible through a web interface at http://diseases.jensenlab.org/, where the text-mining software and all associations are also freely available for download.
Affiliation(s)
- Sune Pletscher-Frankild
  - Department of Disease Systems Biology, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Albert Pallejà
  - Department of Disease Systems Biology, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark; Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Kalliopi Tsafou
  - Department of Disease Systems Biology, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Janos X Binder
  - Structural and Computational Biology Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany; Bioinformatics Core Facility, Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
- Lars Juhl Jensen
  - Department of Disease Systems Biology, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
11. Leslie R, O'Donnell CJ, Johnson AD. GRASP: analysis of genotype-phenotype results from 1390 genome-wide association studies and corresponding open access database. Bioinformatics 2014; 30:i185-94. [PMID: 24931982 DOI: 10.1093/bioinformatics/btu273] [Citation(s) in RCA: 179] [Impact Index Per Article: 17.9] [Indexed: 12/16/2022]
Abstract
SUMMARY: We created a deeply extracted and annotated database of genome-wide association studies (GWAS) results. GRASP v1.0 contains >6.2 million SNP-phenotype associations from among 1390 GWAS studies. We re-annotated GWAS results with 16 annotation sources including some rarely compared to GWAS results (e.g. RNA editing sites, lincRNAs, PTMs).
MOTIVATION: To create a high-quality resource to facilitate further use and interpretation of human GWAS results in order to address important scientific questions.
RESULTS: GWAS have grown exponentially, with increases in sample sizes and markers tested, and continuing bias toward European ancestry samples. GRASP contains >100 000 phenotypes, roughly: eQTLs (71.5%), metabolite QTLs (21.2%), methylation QTLs (4.4%) and diseases, biomarkers and other traits (2.8%). cis-eQTLs, meQTLs, mQTLs and MHC region SNPs are highly enriched among significant results. After removing these categories, GRASP still contains a greater proportion of studies and results than comparable GWAS catalogs. Cardiovascular disease and related risk factors predominate among the remaining GWAS results, followed by immunological, neurological and cancer traits. Significant results in GWAS display a highly gene-centric tendency. Sex chromosome X (OR = 0.18[0.16-0.20]) and Y (OR = 0.003[0.001-0.01]) genes are depleted for GWAS results. Gene length is correlated with GWAS results at nominal significance (P ≤ 0.05) levels. We show this gene-length correlation decays at increasingly more stringent P-value thresholds. Potential pleiotropic genes and SNPs enriched for multi-phenotype association in GWAS are identified. However, we note possible population stratification at some of these loci. Finally, via re-annotation we identify compelling functional hypotheses at GWAS loci, in some cases unrealized in studies to date.
CONCLUSION: Pooling summary-level GWAS results and re-annotating with bioinformatics predictions and molecular features provides a good platform for new insights.
AVAILABILITY: The GRASP database is available at http://apps.nhlbi.nih.gov/grasp.
Affiliation(s)
- Richard Leslie
  - Cardiovascular Epidemiology and Human Genomics Branch, National Heart, Lung and Blood Institute, The Framingham Heart Study, Framingham, MA 01702, University of Massachusetts Medical School, Worcester, MA 01655 and Division of Cardiology, Massachusetts General Hospital, Boston, MA 02114, USA
- Christopher J O'Donnell
  - Cardiovascular Epidemiology and Human Genomics Branch, National Heart, Lung and Blood Institute, The Framingham Heart Study, Framingham, MA 01702, University of Massachusetts Medical School, Worcester, MA 01655 and Division of Cardiology, Massachusetts General Hospital, Boston, MA 02114, USA
- Andrew D Johnson
  - Cardiovascular Epidemiology and Human Genomics Branch, National Heart, Lung and Blood Institute, The Framingham Heart Study, Framingham, MA 01702, University of Massachusetts Medical School, Worcester, MA 01655 and Division of Cardiology, Massachusetts General Hospital, Boston, MA 02114, USA
12. Burger JD, Doughty E, Khare R, Wei CH, Mishra R, Aberdeen J, Tresner-Kirsch D, Wellner B, Kann MG, Lu Z, Hirschman L. Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing. Database: The Journal of Biological Databases and Curation 2014; 2014:bau094. [PMID: 25246425 PMCID: PMC4170591 DOI: 10.1093/database/bau094] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Indexed: 11/23/2022]
Abstract
Background: This article describes capture of biological information using a hybrid approach that combines natural language processing to extract biological entities and crowdsourcing with annotators recruited via Amazon Mechanical Turk to judge correctness of candidate biological relations. These techniques were applied to extract gene–mutation relations from biomedical abstracts with the goal of supporting production scale capture of gene–mutation–disease findings as an open source resource for personalized medicine. Results: The hybrid system could be configured to provide good performance for gene–mutation extraction (precision ∼82%; recall ∼70% against an expert-generated gold standard) at a cost of $0.76 per abstract. This demonstrates that crowd labor platforms such as Amazon Mechanical Turk can be used to recruit quality annotators, even in an application requiring subject matter expertise; aggregated Turker judgments for gene–mutation relations exceeded 90% accuracy. Over half of the precision errors were due to mismatches against the gold standard hidden from annotator view (e.g. incorrect EntrezGene identifier or incorrect mutation position extracted), or incomplete task instructions (e.g. the need to exclude nonhuman mutations). Conclusions: The hybrid curation model provides a readily scalable cost-effective approach to curation, particularly if coupled with expert human review to filter precision errors. We plan to generalize the framework and make it available as open source software. Database URL: http://www.mitre.org/publications/technical-papers/hybrid-curation-of-gene-mutation-relations-combining-automated
Affiliation(s)
- John D Burger, Emily Doughty, Ritu Khare, Chih-Hsuan Wei, Rajashree Mishra, John Aberdeen, David Tresner-Kirsch, Ben Wellner, Maricel G Kann, Zhiyong Lu, Lynette Hirschman
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
|
13
|
Pharmacogenomics for Precision Medicine in the Era of Collaborative Co-creation and Crowdsourcing. Current Genetic Medicine Reports 2014. [DOI: 10.1007/s40142-014-0041-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
|
14
|
Ning S, Zhao Z, Ye J, Wang P, Zhi H, Li R, Wang T, Li X. LincSNP: a database of linking disease-associated SNPs to human large intergenic non-coding RNAs. BMC Bioinformatics 2014; 15:152. [PMID: 24885522 PMCID: PMC4038069 DOI: 10.1186/1471-2105-15-152] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.4] [Received: 01/20/2014] [Accepted: 05/14/2014] [Indexed: 02/04/2023]
Abstract
Background: Genome-wide association studies (GWAS) have successfully identified a large number of single nucleotide polymorphisms (SNPs) that are associated with a wide range of human diseases. However, many of these disease-associated SNPs are located in non-coding regions and have remained largely unexplained. Recent findings indicate that disease-associated SNPs in human large intergenic non-coding RNA (lincRNA) may lead to susceptibility to diseases through their effects on lincRNA expression. There is, therefore, a need to specifically record these SNPs and annotate them as potential candidates for disease. Description: We have built LincSNP, an integrated database, to identify and annotate disease-associated SNPs in human lincRNAs. The current release of LincSNP contains approximately 140,000 disease-associated SNPs (or linkage disequilibrium SNPs), which can be mapped to around 5,000 human lincRNAs, together with their comprehensive functional annotations. The database also contains annotated, experimentally supported SNP-lincRNA-disease associations and disease-associated lincRNAs. It provides flexible search options for data extraction and searches can be performed by disease/phenotype name, SNP ID, lincRNA name and chromosome region. In addition, we provide users with a link to download all the data from LincSNP and have developed a web interface for the submission of novel identified SNP-lincRNA-disease associations. Conclusions: The LincSNP database aims to integrate disease-associated SNPs and human lincRNAs, which will be an important resource for the investigation of the functions and mechanisms of lincRNAs in human disease. The database is available at http://bioinfo.hrbmu.edu.cn/LincSNP.
Affiliation(s)
- Xia Li
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China.
|
15
|
Ryu D, Cho S, Kim H, Lee S, Kim W. GEPdb: a database for investigating the ternary association of genotype, gene expression and phenotype. Bioinformatics 2014; 30:2540-2. [PMID: 24812343 DOI: 10.1093/bioinformatics/btu240] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022]
Abstract
GEPdb integrates both genome-wide association studies and expression quantitative trait loci information, the two primary sources of genome-wide mapping for genotype-phenotype and genotype-expression associations, together with phenotype-associated gene lists. The GEPdb provides simultaneous interpretation of both genetic risks and potential gene regulatory pathways toward phenotypic outcome by establishing the ternary relationship of genotype-expression-phenotype (GEP). The analytic scope is further extended by linkage disequilibrium from five different populations of the international HapMap Project. Availability and implementation: http://ercsbweb.ewha.ac.kr/gepdb.
Affiliation(s)
- Daeun Ryu, SeongBeom Cho, Hun Kim, Sanghyuk Lee, Wankyu Kim
- Ewha Research Center for Systems Biology (ERCSB), Ewha Womans University, Seoul 120-750, Division of Biomedical Informatics, National Institute of Health, Korea Center for Disease Control, Cheongwon 363-951 and Ewha Global Top5 Research Program, Ewha Womans University, Seoul 120-750, Republic of Korea
|
16
|
Lin Y, He Y. The ontology of genetic susceptibility factors (OGSF) and its application in modeling genetic susceptibility to vaccine adverse events. J Biomed Semantics 2014; 5:19. [PMID: 24963371 PMCID: PMC4068904 DOI: 10.1186/2041-1480-5-19] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 08/17/2013] [Accepted: 02/20/2014] [Indexed: 01/12/2023]
Abstract
Background: Due to human variations in genetic susceptibility, vaccination often triggers adverse events in a small population of vaccinees. Based on our previous work on ontological modeling of genetic susceptibility to disease, we developed an Ontology of Genetic Susceptibility Factors (OGSF), a biomedical ontology in the domain of genetic susceptibility and genetic susceptibility factors. The OGSF framework was then applied in the area of vaccine adverse events (VAEs). Results: OGSF aligns with the Basic Formal Ontology (BFO). OGSF defines ‘genetic susceptibility’ as a subclass of BFO:disposition and has a material basis ‘genetic susceptibility factor’. The ‘genetic susceptibility to pathological bodily process’ is a subclass of ‘genetic susceptibility’. A VAE is a type of pathological bodily process. OGSF represents different types of genetic susceptibility factors including various susceptibility alleles (e.g., SNP and gene). A general OGSF design pattern was developed to represent genetic susceptibility to VAE and associated genetic susceptibility factors using experimental results in genetic association studies. To test and validate the design pattern, two case studies were populated in OGSF. In the first case study, human gene allele DBR*15:01 is susceptible to influenza vaccine Pandemrix-induced Multiple Sclerosis. The second case study reports genetic susceptibility polymorphisms associated with systemic smallpox VAEs. After the data of the Case Study 2 were represented using OGSF-based axioms, SPARQL was successfully developed to retrieve the susceptibility factors stored in the populated OGSF. A network of data from the Case Study 2 was constructed by using ontology terms and individuals as nodes and ontology relations as edges. Different social network analysis (SNA) methods were then applied to verify core OGSF terms. Interestingly, a SNA hub analysis verified all susceptibility alleles of SNPs and a SNA closeness analysis verified the susceptibility genes in Case Study 2. These results validated the proper OGSF structure and identified different ontology aspects with SNA methods. Conclusions: OGSF provides a verified and robust framework for representing various genetic susceptibility types and genetic susceptibility factors annotated from experimental VAE genetic association studies. The RDF/OWL formulated ontology data can be queried using SPARQL and analyzed using centrality-based network analysis methods.
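To make the centrality-based analysis mentioned in the abstract above more concrete, here is a minimal sketch of closeness centrality on an undirected graph, computed with breadth-first search. The node names and edges are a hypothetical toy network, not data from the OGSF case studies, and the abstract does not specify which closeness or hub definitions the authors used; node degree is shown only as a simple stand-in for hub-like importance.

```python
from collections import deque


def closeness_centrality(graph, source):
    """Closeness of `source`: number of reachable nodes / sum of shortest-path lengths."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    total = sum(distances.values())
    return (len(distances) - 1) / total if total else 0.0


def degree(graph, node):
    """Number of direct neighbors, a crude proxy for hub-like nodes."""
    return len(graph[node])


# Hypothetical toy network: ontology terms/individuals as nodes, relations as edges.
graph = {
    "susceptibility_allele": {"gene", "adverse_event", "study"},
    "gene": {"susceptibility_allele"},
    "adverse_event": {"susceptibility_allele", "vaccine"},
    "study": {"susceptibility_allele"},
    "vaccine": {"adverse_event"},
}

# Rank nodes from most to least central.
ranked = sorted(graph, key=lambda n: closeness_centrality(graph, n), reverse=True)
for node in ranked:
    print(node, round(closeness_centrality(graph, node), 3), degree(graph, node))
```

In this toy graph the node with the most direct links is also the closest on average to every other node; in larger networks the two rankings can diverge, which is one reason to compare several centrality measures.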
Affiliation(s)
- Yu Lin, Yongqun He
- Unit for Laboratory Animal Medicine, University of Michigan Medical School, Ann Arbor, MI 48109, USA ; Department of Microbiology and Immunology, University of Michigan Medical School, Ann Arbor, MI 48109, USA ; Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
|
17
|
Snoek LB, Joeri van der Velde K, Li Y, Jansen RC, Swertz MA, Kammenga JE. Worm variation made accessible: Take your shopping cart to store, link, and investigate! Worm 2014; 3:e28357. [PMID: 24843834 PMCID: PMC4024057 DOI: 10.4161/worm.28357] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 01/21/2014] [Revised: 02/17/2014] [Accepted: 02/25/2014] [Indexed: 11/20/2022]
Abstract
In Caenorhabditis elegans, the recent advances in high-throughput quantitative analyses of natural genetic and phenotypic variation have led to a wealth of data on genotype–phenotype relations. This data has resulted in the discovery of genes with major allelic effects and insights into the effect of natural genetic variation on a whole range of complex traits as well as how this variation is distributed across the genome. Regardless of the advances presented in specific studies, the majority of the data generated in these studies had yet to be made easily accessible, allowing for meta-analysis. Not only data in figures or tables but meta-data should be accessible for further investigation and comparison between studies. A platform was created where all the data, phenotypic measurements, genotypes, and mappings can be stored, compared, and new linkages within and between published studies can be discovered. WormQTL focuses on quantitative genetics in Caenorhabditis and other nematode species, whereas WormQTLHD quantitatively links gene expression quantitative trait loci (eQTL) in C. elegans to gene–disease associations in humans.
Affiliation(s)
- L Basten Snoek
- Laboratory of Nematology; Wageningen University; The Netherlands
- K Joeri van der Velde
- Genomics Coordination Center; University of Groningen; University Medical Center Groningen; The Netherlands ; Groningen Bioinformatics Center; University of Groningen; The Netherlands ; Department of Genetics; University of Groningen; University Medical Center Groningen; The Netherlands
- Yang Li
- Genomics Coordination Center; University of Groningen; University Medical Center Groningen; The Netherlands ; Groningen Bioinformatics Center; University of Groningen; The Netherlands
- Ritsert C Jansen
- Groningen Bioinformatics Center; University of Groningen; The Netherlands
- Morris A Swertz
- Genomics Coordination Center; University of Groningen; University Medical Center Groningen; The Netherlands ; Groningen Bioinformatics Center; University of Groningen; The Netherlands ; Department of Genetics; University of Groningen; University Medical Center Groningen; The Netherlands
- Jan E Kammenga
- Laboratory of Nematology; Wageningen University; The Netherlands
|
18
|
Elhaik E, Greenspan E, Staats S, Krahn T, Tyler-Smith C, Xue Y, Tofanelli S, Francalacci P, Cucca F, Pagani L, Jin L, Li H, Schurr TG, Greenspan B, Spencer Wells R. The GenoChip: a new tool for genetic anthropology. Genome Biol Evol 2013; 5:1021-31. [PMID: 23666864 PMCID: PMC3673633 DOI: 10.1093/gbe/evt066] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.3] [Indexed: 12/22/2022]
Abstract
The Genographic Project is an international effort aimed at charting human migratory history. The project is nonprofit and nonmedical, and, through its Legacy Fund, supports locally led efforts to preserve indigenous and traditional cultures. Although the first phase of the project was focused on uniparentally inherited markers on the Y-chromosome and mitochondrial DNA (mtDNA), the current phase focuses on markers from across the entire genome to obtain a more complete understanding of human genetic variation. Although many commercial arrays exist for genome-wide single-nucleotide polymorphism (SNP) genotyping, they were designed for medical genetic studies and contain medically related markers that are inappropriate for global population genetic studies. GenoChip, the Genographic Project’s new genotyping array, was designed to resolve these issues and enable higher resolution research into outstanding questions in genetic anthropology. The GenoChip includes ancestry informative markers obtained for over 450 human populations, an ancient human (Saqqaq), and two archaic hominins (Neanderthal and Denisovan) and was designed to identify all known Y-chromosome and mtDNA haplogroups. The chip was carefully vetted to avoid inclusion of medically relevant markers. To demonstrate its capabilities, we compared the FST distributions of GenoChip SNPs to those of two commercial arrays. Although all arrays yielded similarly shaped (inverse J) FST distributions, the GenoChip autosomal and X-chromosomal distributions had the highest mean FST, attesting to its ability to discern subpopulations. The chip performances are illustrated in a principal component analysis for 14 worldwide populations. In summary, the GenoChip is a dedicated genotyping platform for genetic anthropology. With an unprecedented number of approximately 12,000 Y-chromosomal and approximately 3,300 mtDNA SNPs and over 130,000 autosomal and X-chromosomal SNPs without any known health, medical, or phenotypic relevance, the GenoChip is a useful tool for genetic anthropology and population genetics.
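The FST comparison in the abstract above can be illustrated with a small worked example. The abstract does not state which FST estimator was used, so the sketch below assumes the textbook heterozygosity-based definition, FST = (H_T - H_S) / H_T, for a single biallelic SNP and equally sized subpopulations; the allele frequencies are hypothetical.

```python
def fst_biallelic(allele_freqs):
    """FST for one biallelic SNP from per-subpopulation allele frequencies.

    Assumes equally weighted subpopulations and expected heterozygosity 2p(1-p).
    """
    n = len(allele_freqs)
    # Mean expected heterozygosity within subpopulations (H_S).
    h_s = sum(2 * p * (1 - p) for p in allele_freqs) / n
    # Expected heterozygosity of the pooled population (H_T).
    p_bar = sum(allele_freqs) / n
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # SNP is monomorphic overall, so there is no variation to partition
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies of three SNPs in two subpopulations.
snps = {
    "snp_differentiated": [0.1, 0.9],
    "snp_shared": [0.5, 0.5],
    "snp_fixed": [1.0, 1.0],
}

fst_values = {name: fst_biallelic(freqs) for name, freqs in snps.items()}
mean_fst = sum(fst_values.values()) / len(fst_values)

for name, value in fst_values.items():
    print(name, round(value, 3))
print("mean FST:", round(mean_fst, 3))
```

A SNP whose allele frequency differs sharply between subpopulations gets a high FST, while one with identical frequencies gets zero, which is why a higher mean FST across an array suggests better power to separate subpopulations.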
Affiliation(s)
- Eran Elhaik
- Department of Mental Health, Johns Hopkins University Bloomberg School of Public Health, USA
|
19
|
GWAS Central: a comprehensive resource for the comparison and interrogation of genome-wide association studies. Eur J Hum Genet 2013; 22:949-52. [PMID: 24301061 PMCID: PMC4060122 DOI: 10.1038/ejhg.2013.274] [Citation(s) in RCA: 113] [Impact Index Per Article: 10.3] [Received: 06/26/2013] [Revised: 10/04/2013] [Accepted: 10/25/2013] [Indexed: 01/29/2023]
Abstract
To facilitate broad and convenient integrative visualization of and access to GWAS data, we have created the GWAS Central resource (http://www.gwascentral.org). This database seeks to provide a comprehensive collection of summary-level genetic association data, structured both for maximal utility and for safe open access (i.e., non-directional signals to fully preclude research subject identification). The resource emphasizes advanced tools that allow comparison and discovery of relevant data sets from the perspective of genes, genome regions, phenotypes or traits. Tested markers and relevant genomic features can be visually interrogated across up to 16 association data sets in a single view, starting at a chromosome-wide view and increasing in resolution down to individual bases. In addition, users can privately upload and view their own data as temporary files. Search and display utility is further enhanced by exploiting phenotype ontology annotations to allow genetic variants associated with phenotypes and traits of interest to be precisely identified, across all studies. Data submissions are accepted from individual researchers, groups and consortia, and we also actively gather data sets from various public sources. As a result, the resource now provides over 67 million P-values for over 1600 studies, making it the world's largest openly accessible online collection of summary-level GWAS association information.
|
20
|
van der Velde KJ, de Haan M, Zych K, Arends D, Snoek LB, Kammenga JE, Jansen RC, Swertz MA, Li Y. WormQTLHD--a web database for linking human disease to natural variation data in C. elegans. Nucleic Acids Res 2013; 42:D794-801. [PMID: 24217915 PMCID: PMC3965109 DOI: 10.1093/nar/gkt1044] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Indexed: 11/23/2022]
Abstract
Interactions between proteins are highly conserved across species. As a result, the molecular basis of multiple diseases affecting humans can be studied in model organisms that offer many alternative experimental opportunities. One such organism, Caenorhabditis elegans, has been used to produce much molecular quantitative genetics and systems biology data over the past decade. We present WormQTLHD (Human Disease), a database that quantitatively and systematically links expression Quantitative Trait Loci (eQTL) findings in C. elegans to gene–disease associations in man. WormQTLHD, available online at http://www.wormqtl-hd.org, is a user-friendly set of tools to reveal functionally coherent, evolutionarily conserved gene networks. These can be used to predict novel gene-to-gene associations and the functions of genes underlying the disease of interest. We created a new database that links C. elegans eQTL data sets to human diseases (34 337 gene–disease associations from OMIM, DGA, GWAS Central and NHGRI GWAS Catalogue) based on overlapping sets of orthologous genes associated with phenotypes in these two species. We utilized QTL results, high-throughput molecular phenotypes, classical phenotypes and genotype data covering different developmental stages and environments from the WormQTL database. All software is available as open source, built on MOLGENIS and xQTL workbench.
Affiliation(s)
- K Joeri van der Velde
- Genomics Coordination Center, University of Groningen, University Medical Center Groningen, P.O. Box 30001, 9700 RB Groningen, The Netherlands, Groningen Bioinformatics Center, University of Groningen, P.O. Box 11103, 9700 CC Groningen, The Netherlands, Department of Genetics, University of Groningen, University Medical Center Groningen, P.O. Box 30001, 9700 RB Groningen, The Netherlands, Department of Bioinformatics, Hanze University of Applied Sciences, Groningen, Zernikeplein 11, 9747 AS, The Netherlands and Laboratory of Nematology, Wageningen University, 6708 PB Wageningen, The Netherlands
|
21
|
Nagai Y, Imanishi T. RAvariome: a genetic risk variants database for rheumatoid arthritis based on assessment of reproducibility between or within human populations. Database: The Journal of Biological Databases and Curation 2013; 2013:bat073. [PMID: 24158836 PMCID: PMC3807080 DOI: 10.1093/database/bat073] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 12/25/2022]
Abstract
Rheumatoid arthritis (RA) is a common autoimmune inflammatory disease of the joints and is caused by both genetic and environmental factors. In the past six years, genome-wide association studies (GWASs) have identified many risk variants associated with RA. However, not all associations reported from GWASs are reproduced when tested in follow-up studies. To establish a reliable set of RA risk variants, we systematically classified common variants identified in GWASs by the degree of reproducibility among independent studies. We collected comprehensive genetic associations from 90 papers of GWASs and meta-analysis. The genetic variants were assessed according to the statistical significance and reproducibility between or within nine geographical populations. As a result, 82 and 19 single nucleotide polymorphisms (SNPs) were confirmed as intra- and inter-population-reproduced variants, respectively. Interestingly, the majority of the intra-population-reproduced variants from European and East Asian populations were not common to the two populations, but their nearby genes appeared to be components of common pathways. Furthermore, a tool to predict the individual’s genetic risk of RA was developed to facilitate personalized medicine and preventive health care. For further clinical research, the list of reliable genetic variants of RA and the genetic risk prediction tool are provided by the open access database RAvariome. Database URL: http://hinv.jp/hinv/rav/
Affiliation(s)
- Yoko Nagai
- Department of Molecular Life Science, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine, 143 Shimokasuya, Isehara, Kanagawa 259-1193, Japan and Data Management and Integration Team, Molecular Profiling Research Center for Drug Discovery, National Institute of Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, Japan
|
22
|
Johansen MB, Izarzugaza JMG, Brunak S, Petersen TN, Gupta R. Prediction of disease causing non-synonymous SNPs by the Artificial Neural Network Predictor NetDiseaseSNP. PLoS One 2013; 8:e68370. [PMID: 23935863 PMCID: PMC3723835 DOI: 10.1371/journal.pone.0068370] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 02/06/2013] [Accepted: 05/29/2013] [Indexed: 01/10/2023]
Abstract
We have developed a sequence conservation-based artificial neural network predictor called NetDiseaseSNP which classifies nsSNPs as disease-causing or neutral. Our method uses the excellent alignment generation algorithm of SIFT to identify related sequences and a combination of 31 features assessing sequence conservation and the predicted surface accessibility to produce a single score which can be used to rank nsSNPs based on their potential to cause disease. NetDiseaseSNP successfully classifies disease-causing and neutral mutations. In addition, we show that NetDiseaseSNP discriminates cancer driver and passenger mutations satisfactorily. Our method outperforms other state-of-the-art methods on several disease/neutral datasets as well as on cancer driver/passenger mutation datasets and can thus be used to pinpoint and prioritize plausible disease candidates among nsSNPs for further investigation. NetDiseaseSNP is publicly available as an online tool as well as a web service: http://www.cbs.dtu.dk/services/NetDiseaseSNP.
Affiliation(s)
- Morten Bo Johansen
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kongens Lyngby, Denmark
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Hørsholm, Denmark
- Jose M. G. Izarzugaza
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kongens Lyngby, Denmark
- Søren Brunak
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kongens Lyngby, Denmark
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
- Thomas Nordahl Petersen
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Hørsholm, Denmark
- Ramneek Gupta
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kongens Lyngby, Denmark
|
23
|
Georgitsi M, Patrinos GP. Genetic databases in pharmacogenomics: the Frequency of Inherited Disorders Database (FINDbase). Methods Mol Biol 2013; 1015:321-336. [PMID: 23824866 DOI: 10.1007/978-1-62703-435-7_21] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/02/2023]
Abstract
Pharmacogenomics studies how the variations of the individuals' genetic makeup are correlated with a person's response to certain drugs in relation to the therapeutic efficiency, clinical outcome, or even survival, and how they affect drug metabolism, transport, or clearance. Yet, since the incidence of these polymorphisms, being either single-point variations or small insertions/deletions, varies among different populations, a systematic collection and documentation of these variations is warranted, in order to facilitate implementation of pharmacogenomics in different populations. Here we review the existing electronic databases related to pharmacogenomics and pay particular attention to the description of the pharmacogenomics module of the Frequency of Inherited Disorders database (FINDbase), which documents curated allelic frequency data pertaining to 144 pharmacogenomics markers across 14 genes, representing approximately 87,000 individuals from 150 populations and ethnic groups worldwide. Long-term sustainability of these resources aims to contribute to the design, development, and implementation of pharmacogenomics testing towards the application of personalized approaches in medical treatment.
Affiliation(s)
- Marianthi Georgitsi
- Department of Pharmacy, School of Health Sciences, University of Patras, Patras, Greece
|
24
|
Stucki D, Gagneux S. Single nucleotide polymorphisms in Mycobacterium tuberculosis and the need for a curated database. Tuberculosis (Edinb) 2012; 93:30-9. [PMID: 23266261 DOI: 10.1016/j.tube.2012.11.002] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Received: 11/17/2012] [Accepted: 11/25/2012] [Indexed: 12/12/2022]
Abstract
Recent advances in DNA sequencing have led to the discovery of thousands of single nucleotide polymorphisms (SNPs) in clinical isolates of Mycobacterium tuberculosis complex (MTBC). This genetic variation has changed our understanding of the differences and phylogenetic relationships between strains. Many of these mutations can serve as phylogenetic markers for strain classification, while others cause drug resistance. Moreover, SNPs can affect the bacterial phenotype in various ways, which may have an impact on the outcome of tuberculosis (TB) infection and disease. Despite the importance of SNPs for our understanding of the diversity of MTBC populations, the research community currently lacks a comprehensive, well-curated and user-friendly database dedicated to SNP data. First attempts to catalogue and annotate SNPs in MTBC have been made, but more work is needed. In this review, we discuss the biological and epidemiological relevance of SNPs in MTBC. We then review some of the analytical challenges involved in processing SNP data, and end with a list of features, which should be included in a new SNP database for MTBC.
Affiliation(s)
- David Stucki
- Swiss Tropical and Public Health Institute, Basel, Switzerland
|
25
|
Beck T, Free RC, Thorisson GA, Brookes AJ. Semantically enabling a genome-wide association study database. J Biomed Semantics 2012; 3:9. [PMID: 23244533 PMCID: PMC3579732 DOI: 10.1186/2041-1480-3-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 04/03/2012] [Accepted: 08/22/2012] [Indexed: 01/03/2023] Open
Abstract
Background: The amount of data generated from genome-wide association studies (GWAS) has grown rapidly, but considerations for GWAS phenotype data reuse and interchange have not kept pace. This impacts on the work of GWAS Central – a free and open access resource for the advanced querying and comparison of summary-level genetic association data. The benefits of employing ontologies for standardising and structuring data are widely accepted. The complex spectrum of observed human phenotypes (and traits), and the requirement for cross-species phenotype comparisons, calls for reflection on the most appropriate solution for the organisation of human phenotype data. The Semantic Web provides standards for the possibility of further integration of GWAS data and the ability to contribute to the web of Linked Data. Results: A pragmatic consideration when applying phenotype ontologies to GWAS data is the ability to retrieve all data, at the most granular level possible, from querying a single ontology graph. We found the Medical Subject Headings (MeSH) terminology suitable for describing all traits (diseases and medical signs and symptoms) at various levels of granularity and the Human Phenotype Ontology (HPO) most suitable for describing phenotypic abnormalities (medical signs and symptoms) at the most granular level. Diseases within MeSH are mapped to HPO to infer the phenotypic abnormalities associated with diseases. Building on the rich semantic phenotype annotation layer, we are able to make cross-species phenotype comparisons and publish a core subset of GWAS data as RDF nanopublications. Conclusions: We present a methodology for applying phenotype annotations to a comprehensive genome-wide association dataset and for ensuring compatibility with the Semantic Web. The annotations are used to assist with cross-species genotype and phenotype comparisons. However, further processing and deconstructions of terms may be required to facilitate automatic phenotype comparisons. The provision of GWAS nanopublications enables a new dimension for exploring GWAS data, by way of intrinsic links to related data resources within the Linked Data web. The value of such annotation and integration will grow as more biomedical resources adopt the standards of the Semantic Web.
Affiliation(s)
- Tim Beck
- Department of Genetics, University of Leicester, University Road, Leicester, UK.
|
26
|
Fernández-Suárez XM, Galperin MY. The 2013 Nucleic Acids Research Database Issue and the online molecular biology database collection. Nucleic Acids Res 2012. [PMID: 23203983 PMCID: PMC3531151 DOI: 10.1093/nar/gks1297] [Citation(s) in RCA: 67] [Impact Index Per Article: 5.6] [Indexed: 12/14/2022] Open
Abstract
The 20th annual Database Issue of Nucleic Acids Research includes 176 articles, half of which describe new online molecular biology databases, while the other half provide updates on databases previously featured in NAR and other journals. This year’s highlights include two databases of DNA repeat elements; several databases of transcription factors and transcription factor-binding sites; databases on various aspects of protein structure and protein–protein interactions; databases for metagenomic and rRNA sequence analysis; and four databases specifically dedicated to Escherichia coli. The increased emphasis on using genome data to improve human health is reflected in the development of the databases of genomic structural variation (NCBI’s dbVar and EBI’s DGVa), the NIH Genetic Testing Registry and several other databases centered on the genetic basis of human disease, potential drugs, their targets and the mechanisms of protein–ligand binding. Two new databases present genomic and RNAseq data for monkeys, providing a wealth of data on our closest relatives for comparative genomics purposes. The NAR online Molecular Biology Database Collection, available at http://www.oxfordjournals.org/nar/database/a/, has been updated and currently lists 1512 online databases. The full content of the Database Issue is freely available online on the Nucleic Acids Research website (http://nar.oxfordjournals.org/).
|
27
|
Doelken SC, Köhler S, Mungall CJ, Gkoutos GV, Ruef BJ, Smith C, Smedley D, Bauer S, Klopocki E, Schofield PN, Westerfield M, Robinson PN, Lewis SE. Phenotypic overlap in the contribution of individual genes to CNV pathogenicity revealed by cross-species computational analysis of single-gene mutations in humans, mice and zebrafish. Dis Model Mech 2012; 6:358-72. [PMID: 23104991 PMCID: PMC3597018 DOI: 10.1242/dmm.010322] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Indexed: 12/16/2022] Open
Abstract
Numerous disease syndromes are associated with regions of copy number variation (CNV) in the human genome and, in most cases, the pathogenicity of the CNV is thought to be related to altered dosage of the genes contained within the affected segment. However, establishing the contribution of individual genes to the overall pathogenicity of CNV syndromes is difficult and often relies on the identification of potential candidates through manual searches of the literature and online resources. We describe here the development of a computational framework to comprehensively search phenotypic information from model organisms and single-gene human hereditary disorders, and thus speed the interpretation of the complex phenotypes of CNV disorders. There are currently more than 5000 human genes about which nothing is known phenotypically but for which detailed phenotypic information for the mouse and/or zebrafish orthologs is available. Here, we present an ontology-based approach to identify similarities between human disease manifestations and the mutational phenotypes in characterized model organism genes; this approach can therefore be used even in cases where there is little or no information about the function of the human genes. We applied this algorithm to detect candidate genes for 27 recurrent CNV disorders and identified 802 gene-phenotype associations, approximately half of which involved genes that were previously reported to be associated with individual phenotypic features and half of which were novel candidates. A total of 431 associations were made solely on the basis of model organism phenotype data. Additionally, we observed a striking, statistically significant tendency for individual disease phenotypes to be associated with multiple genes located within a single CNV region, a phenomenon that we denote as pheno-clustering. 
Many of the clusters also display statistically significant similarities in protein function or vicinity within the protein-protein interaction network. Our results provide a basis for understanding previously un-interpretable genotype-phenotype correlations in pathogenic CNVs and for mobilizing the large amount of model organism phenotype data to provide insights into human genetic disorders.
Affiliation(s)
- Sandra C Doelken
- Institute for Medical and Human Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
|
28
|
Nair PS, Vihinen M. VariBench: A Benchmark Database for Variations. Hum Mutat 2012; 34:42-9. [DOI: 10.1002/humu.22204] [Citation(s) in RCA: 106] [Impact Index Per Article: 8.8] [Received: 01/20/2012] [Accepted: 07/31/2012] [Indexed: 12/21/2022]
|
29
|
Brahmachari SK. Introducing the medical bioinformatics in Journal of Translational Medicine. J Transl Med 2012; 10:202. [PMID: 23013487 PMCID: PMC3533924 DOI: 10.1186/1479-5876-10-202] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 09/11/2012] [Accepted: 09/11/2012] [Indexed: 11/10/2022] Open
Abstract
The explosion of genome sequencing data, along with genotype-to-phenotype correlation studies, has created a data deluge in the biomedical sciences. The aim of the Medical Bioinformatics section is to aid the development and maturation of the field by providing a platform for the translation of these datasets into useful clinical applications. The increase in computing capabilities and the availability of different data from advanced technologies will allow researchers to build Systems Biology models of various diseases in order to efficiently develop new therapeutic interventions and reduce the current prohibitively large costs of drug discovery. The section welcomes studies on the development of Biomedical Informatics for translational medicine and clinical applications, including tools, methodologies and data integration.
|
30
|
Schofield PN, Hancock JM. Integration of global resources for human genetic variation and disease. Hum Mutat 2012; 33:813-6. [DOI: 10.1002/humu.22079] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 02/27/2012] [Accepted: 03/02/2012] [Indexed: 01/22/2023]
|
31
|
Adamusiak T, Parkinson H, Muilu J, Roos E, van der Velde KJ, Thorisson GA, Byrne M, Pang C, Gollapudi S, Ferretti V, Hillege H, Brookes AJ, Swertz MA. Observ-OM and Observ-TAB: Universal syntax solutions for the integration, search, and exchange of phenotype and genotype information. Hum Mutat 2012; 33:867-73. [DOI: 10.1002/humu.22070] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 12/21/2011] [Accepted: 02/22/2012] [Indexed: 11/12/2022]
|
32
|
Luu TD, Rusu AM, Walter V, Ripp R, Moulinier L, Muller J, Toursel T, Thompson JD, Poch O, Nguyen H. MSV3d: database of human MisSense Variants mapped to 3D protein structure. Database (Oxford) 2012; 2012:bas018. [PMID: 22491796 PMCID: PMC3317913 DOI: 10.1093/database/bas018] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 01/03/2023]
Abstract
The elucidation of the complex relationships linking genotypic and phenotypic variations to protein structure is a major challenge in the post-genomic era. We present MSV3d (Database of human MisSense Variants mapped to 3D protein structure), a new database that contains detailed annotation of missense variants of all human proteins (20 199 proteins). The multi-level characterization includes details of the physico-chemical changes induced by amino acid modification, as well as information related to the conservation of the mutated residue and its position relative to functional features in the available or predicted 3D model. Major releases of the database are automatically generated and updated regularly in line with the dbSNP (database of Single Nucleotide Polymorphism) and SwissVar releases, by exploiting the extensive Décrypthon computational grid resources. The database (http://decrypthon.igbmc.fr/msv3d) is easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in XML or flat file formats. Database URL: http://decrypthon.igbmc.fr/msv3d
Affiliation(s)
- Tien-Dao Luu
- Laboratoire de Bioinformatique et Génomique Intégratives, Institut de Génétique et de Biologie Moléculaire et Cellulaire (UMR7104), 67404 Illkirch
|
33
|
Bray JE. Target selection for structural genomics based on combining fold recognition and crystallisation prediction methods: application to the human proteome. J Struct Funct Genomics 2012; 13:37-46. [PMID: 22354707 DOI: 10.1007/s10969-012-9130-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 12/01/2011] [Accepted: 02/07/2012] [Indexed: 11/29/2022]
Abstract
The objective of this study is to automatically identify regions of the human proteome that are suitable for 3D structure determination by X-ray crystallography and to annotate them according to their likelihood to produce diffraction quality crystals. The results provide a powerful tool for structural genomics laboratories who wish to select human proteins based on the statistical likelihood of crystallisation success. Combining fold recognition and crystallisation prediction algorithms enables the efficient calculation of the crystallisability of the entire human proteome. This novel study estimates that there are approximately 40,000 crystallisable regions in the human proteome. Currently, only 15% of these regions (approx. 6,000 sequences) have been solved to at least 95% sequence identity. The remaining unsolved regions have been categorised into 5 crystallisation classes and an integral membrane protein (IMP) class, based on established structure prediction, crystallisation prediction and transmembrane (TM) helix prediction algorithms. Approximately 750 unsolved regions (2% of the proteome) have been identified as having a PDB fold representative (template) and an 'optimal' likelihood of crystallisation. At the other end of the spectrum, more than 10,500 non-IMP regions with a PDB template are classified as 'very difficult' to crystallise (26%) and almost 2,500 regions (6%) were predicted to contain at least 3 TM helices. The 3D-SPECS (3D Structural Proteomics Explorer with Crystallisation Scores) website contains crystallisation predictions for the entire human proteome and can be found at http://www.bioinformaticsplus.org/3dspecs.
Affiliation(s)
- James E Bray
- Structural Genomics Consortium, University of Oxford, Old Road Campus Research Building, Roosevelt Drive, Oxford, OX3 7DQ, UK.
|
34
|
Zheng J, Stoyanovich J, Manduchi E, Liu J, Stoeckert CJ. AnnotCompute: annotation-based exploration and meta-analysis of genomics experiments. Database (Oxford) 2011; 2011:bar045. [PMID: 22190598 PMCID: PMC3244265 DOI: 10.1093/database/bar045] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 01/14/2023]
Abstract
The ever-increasing scale of biological data sets, particularly those arising in the context of high-throughput technologies, requires the development of rich data exploration tools. In this article, we present AnnotCompute, an information discovery platform for repositories of functional genomics experiments such as ArrayExpress. Our system leverages semantic annotations of functional genomics experiments with controlled vocabulary and ontology terms, such as those from the MGED Ontology, to compute conceptual dissimilarities between pairs of experiments. These dissimilarities are then used to support two types of exploratory analysis—clustering and query-by-example. We show that our proposed dissimilarity measures correspond to a user's intuition about conceptual dissimilarity, and can be used to support effective query-by-example. We also evaluate the quality of clustering based on these measures. While AnnotCompute can support a richer data exploration experience, its effectiveness is limited in some cases, due to the quality of available annotations. Nonetheless, tools such as AnnotCompute may provide an incentive for richer annotations of experiments. Code is available for download at http://www.cbil.upenn.edu/downloads/AnnotCompute. Database URL: http://www.cbil.upenn.edu/annotCompute/
Affiliation(s)
- Jie Zheng
- Department of Genetics, Center for Bioinformatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA
|
35
|
Li MJ, Wang P, Liu X, Lim EL, Wang Z, Yeager M, Wong MP, Sham PC, Chanock SJ, Wang J. GWASdb: a database for human genetic variants identified by genome-wide association studies. Nucleic Acids Res 2011; 40:D1047-54. [PMID: 22139925 PMCID: PMC3245026 DOI: 10.1093/nar/gkr1182] [Citation(s) in RCA: 152] [Impact Index Per Article: 11.7] [Indexed: 12/26/2022] Open
Abstract
Recent advances in genome-wide association studies (GWAS) have enabled us to identify thousands of genetic variants (GVs) that are associated with human diseases. As next-generation sequencing technologies become less expensive, more GVs will be discovered in the near future. Existing databases, such as NHGRI GWAS Catalog, collect GVs with only genome-wide level significance. However, many true disease susceptibility loci have relatively moderate P values and are not included in these databases. We have developed GWASdb that contains 20 times more data than the GWAS Catalog and includes less significant GVs (P < 1.0 × 10⁻³) manually curated from the literature. In addition, GWASdb provides comprehensive functional annotations for each GV, including genomic mapping information, regulatory effects (transcription factor binding sites, microRNA target sites and splicing sites), amino acid substitutions, evolution, gene expression and disease associations. Furthermore, GWASdb classifies these GVs according to diseases using Disease-Ontology Lite and Human Phenotype Ontology. It can conduct pathway enrichment and PPI network association analysis for these diseases. GWASdb provides an intuitive, multifunctional database for biologists and clinicians to explore GVs and their functional inferences. It is freely available at http://jjwanglab.org/gwasdb and will be updated frequently.
Affiliation(s)
- Mulin Jun Li
- Department of Biochemistry, The University of Hong Kong, Hong Kong SAR, China
|
36
|
Abstract
Background: The SNP (Single Nucleotide Polymorphism), the most common type of genetic variation between human beings, is believed to offer a promising route towards personalized medicine. As more and more research on SNPs is being conducted, non-standard nomenclatures may generate potential problems. The most serious issue is that researchers cannot perform cross-referencing among different SNP databases. As a result, more resources and time are required to track SNPs, which could be detrimental to the entire academic community. Results: UASIS (Universal Automated SNP Identification System) is a web-based server for SNP nomenclature standardization and translation at the DNA level. Three utilities are available: UASIS Aligner, Universal SNP Name Generator and SNP Name Mapper. UASIS efficiently maps SNPs from different databases, including dbSNP, GWAS, HapMap and JSNP, into a uniform view using a proposed universal nomenclature and state-of-the-art alignment algorithms. UASIS is freely available at http://www.uasis.tk with no log-in required. Conclusions: UASIS is a helpful platform for SNP cross-referencing and tracking. By providing an informative, unique and unambiguous nomenclature, which utilizes the unique position of a SNP, we aim to resolve the ambiguity of the SNP nomenclatures currently practised. Our universal nomenclature is a good complement to mainstream SNP notations such as rs# and HGVS guidelines. UASIS acts as a bridge to connect heterogeneous representations of SNPs.
Affiliation(s)
- Danny C C Poo
- Department of Information Systems, School of Computing, National University of Singapore, 13 Computing Drive, Singapore 117417.
|
37
|
Pallejà A, Horn H, Eliasson S, Jensen LJ. DistiLD Database: diseases and traits in linkage disequilibrium blocks. Nucleic Acids Res 2011; 40:D1036-40. [PMID: 22058129 PMCID: PMC3245128 DOI: 10.1093/nar/gkr899] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Indexed: 01/07/2023] Open
Abstract
Genome-wide association studies (GWAS) have identified thousands of single nucleotide polymorphisms (SNPs) associated with the risk of hundreds of diseases. However, there is currently no database that enables non-specialists to answer the following simple questions: which SNPs associated with diseases are in linkage disequilibrium (LD) with a gene of interest? Which chromosomal regions have been associated with a given disease, and which are the potentially causal genes in each region? To answer these questions, we use data from the HapMap Project to partition each chromosome into so-called LD blocks, so that SNPs in LD with each other are preferentially in the same block, whereas SNPs not in LD are in different blocks. By projecting SNPs and genes onto LD blocks, the DistiLD database aims to increase usage of existing GWAS results by making it easy to query and visualize disease-associated SNPs and genes in their chromosomal context. The database is available at http://distild.jensenlab.org/.
Affiliation(s)
- Albert Pallejà
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
|
38
|
Craig DW, Goor RM, Wang Z, Paschall J, Ostell J, Feolo M, Sherry ST, Manolio TA. Assessing and managing risk when sharing aggregate genetic variant data. Nat Rev Genet 2011; 12:730-6. [PMID: 21921928 PMCID: PMC3349221 DOI: 10.1038/nrg3067] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.2] [Indexed: 12/14/2022]
Abstract
Access to genetic data across studies is an important aspect of identifying new genetic associations through genome-wide association studies (GWASs). Meta-analysis across multiple GWASs with combined cohort sizes of tens of thousands of individuals often uncovers many more genome-wide associated loci than the original individual studies; this emphasizes the importance of tools and mechanisms for data sharing. However, even sharing summary-level data, such as allele frequencies, inherently carries some degree of privacy risk to study participants. Here we discuss mechanisms and resources for sharing data from GWASs, particularly focusing on approaches for assessing and quantifying the privacy risks to participants that result from the sharing of summary-level data.
Affiliation(s)
- David W Craig
- Translational Genomics Research Institute (TGen), Phoenix, Arizona 85004, USA.
|
39
|
Abstract
Locus-specific databases are the most useful repositories of the sequence information underlying medical genetic conditions and, for this reason, they need our continued support.
Affiliation(s)
- Mark E Samuels
- Ste-Justine Hospital Research Center and Department of Medicine, University of Montreal, 3175 Cote Ste-Catherine, Montreal, Quebec, Canada
|
40
|
Webb AJ, Thorisson GA, Brookes AJ. An informatics project and online "Knowledge Centre" supporting modern genotype-to-phenotype research. Hum Mutat 2011; 32:543-50. [PMID: 21438073 DOI: 10.1002/humu.21469] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.5] [Received: 01/20/2011] [Accepted: 01/28/2011] [Indexed: 11/06/2022]
Abstract
Explosive growth in the generation of genotype-to-phenotype (G2P) data necessitates a concerted effort to tackle the logistical and informatics challenges this presents. The GEN2PHEN Project represents one such effort, with a broad strategy of uniting disparate G2P resources into a hybrid centralized-federated network. This is achieved through a holistic strategy focussed on three overlapping areas: data input standards and pipelines through which to submit and collect data (data in); federated, independent, extendable, yet interoperable database platforms on which to store and curate widely diverse datasets (data storage); and data formats and mechanisms with which to exchange, combine, and extract data (data exchange and output). To fully leverage this data network, we have constructed the "G2P Knowledge Centre" (http://www.gen2phen.org). This central platform provides holistic searching of the G2P data domain allied with facilities for data annotation and user feedback, access to extensive G2P and informatics resources, and tools for constructing online working communities centered on the G2P domain. Through the efforts of GEN2PHEN, and through combining data with broader community-derived knowledge, the Knowledge Centre opens up exciting possibilities for organizing, integrating, sharing, and interpreting new waves of G2P data in a collaborative fashion.
Affiliation(s)
- Adam J Webb
- Department of Genetics, University of Leicester, University Road, Leicester, United Kingdom.
|
41
|
Swertz MA, Dijkstra M, Adamusiak T, van der Velde JK, Kanterakis A, Roos ET, Lops J, Thorisson GA, Arends D, Byelas G, Muilu J, Brookes AJ, de Brock EO, Jansen RC, Parkinson H. The MOLGENIS toolkit: rapid prototyping of biosoftware at the push of a button. BMC Bioinformatics 2010; 11 Suppl 12:S12. [PMID: 21210979 PMCID: PMC3040526 DOI: 10.1186/1471-2105-11-s12-s12] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.4] [Indexed: 11/10/2022] Open
Abstract
Background: There is a huge demand on bioinformaticians to provide their biologists with user-friendly and scalable software infrastructures to capture, exchange, and exploit the unprecedented amounts of new *omics data. We here present MOLGENIS, a generic, open source, software toolkit to quickly produce the bespoke MOLecular GENetics Information Systems needed. Methods: The MOLGENIS toolkit provides bioinformaticians with a simple language to model biological data structures and user interfaces. At the push of a button, MOLGENIS' generator suite automatically translates these models into a feature-rich, ready-to-use web application including database, user interfaces, exchange formats, and scriptable interfaces. Each generator is a template of SQL, Java, R, or HTML code that would require much effort to write by hand. This 'model-driven' method ensures reuse of best practices and improves quality because the modeling language and generators are shared between all MOLGENIS applications, so that errors are found quickly and improvements are shared easily by a re-generation. A plug-in mechanism ensures that both the generator suite and generated product can be customized just as much as hand-written software. Results: In recent years we have successfully evaluated the MOLGENIS toolkit for the rapid prototyping of many types of biomedical applications, including next-generation sequencing, GWAS, QTL, proteomics and biobanking. Writing 500 lines of model XML typically replaces 15,000 lines of hand-written programming code, which allows for quick adaptation if the information system is not yet to the biologist's satisfaction. Each application generated with MOLGENIS comes with an optimized database back-end, user interfaces for biologists to manage and exploit their data, programming interfaces for bioinformaticians to script analysis tools in R, Java, SOAP, REST/JSON and RDF, a tab-delimited file format to ease upload and exchange of data, and detailed technical documentation. Existing databases can be quickly enhanced with MOLGENIS generated interfaces using the 'ExtractModel' procedure. Conclusions: The MOLGENIS toolkit provides bioinformaticians with a simple model to quickly generate flexible web platforms for all possible genomic, molecular and phenotypic experiments with a richness of interfaces not provided by other tools. All the software and manuals are available free as LGPLv3 open source at http://www.molgenis.org.
Affiliation(s)
- Morris A Swertz
- Genomics Coordination Center, Groningen Bioinformatics Center, University of Groningen & Department of Genetics, University Medical Center Groningen, P.O. Box 30001, 9700 RB Groningen, The Netherlands.
|
42
|
Narang A, Roy RD, Chaurasia A, Mukhopadhyay A, Mukerji M, Dash D. IGVBrowser--a genomic variation resource from diverse Indian populations. Database (Oxford) 2010; 2010:baq022. [PMID: 20843867 PMCID: PMC2942067 DOI: 10.1093/database/baq022] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Indexed: 01/10/2023]
Abstract
The Indian Genome Variation Consortium (IGVC) project, an initiative of the Council for Scientific and Industrial Research, has been the first large-scale comprehensive study of the Indian population. One of the major aims of the project is to study and catalog the variations in nearly a thousand candidate genes related to diseases and drug response for predictive marker discovery and founder identification, and also to address questions related to ethnic diversity, migrations, and the extent of relatedness with other world populations. Phase I of the project aimed at providing a set of reference populations that would represent the entire genetic spectrum of India in terms of language, ethnicity and geography, and Phase II at providing variation data on candidate genes and genome-wide neutral markers in this reference set of populations. We report here the development of the IGVBrowser, which provides allele and genotype frequency data generated in the IGVC project. The database harbors 4229 SNPs from more than 900 candidate genes in contrasting Indian populations. Analysis shows that most of the markers are from genic regions. Further, a large fraction of genes are implicated in cardiovascular, metabolic, cancer and immune system-related diseases. Thus, the IGVC data provide basal-level variation data for the Indian population to study genetic diseases and pharmacology. Additionally, the database houses data on ∼50 000 (Affy 50 K array) genome-wide neutral markers in these reference populations. In IGVBrowser one can analyze and compare genomic variations in the Indian population with those reported in HapMap, along with annotation information from various primary data sources. Database URL: http://igvbrowser.igib.res.in
Affiliation(s)
- Ankita Narang
- GN Ramachandran Knowledge Centre for Genome Informatics, Institute of Genomics and Integrative Biology, Council for Scientific and Industrial Research, Mall Road, Delhi-110007, India
|
43
|
Schofield PN, Gkoutos GV, Gruenberger M, Sundberg JP, Hancock JM. Phenotype ontologies for mouse and man: bridging the semantic gap. Dis Model Mech 2010; 3:281-9. [PMID: 20427557 DOI: 10.1242/dmm.002790] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.6] [Indexed: 12/30/2022] Open
Abstract
A major challenge of the post-genomic era is coding phenotype data from humans and model organisms such as the mouse, to permit the meaningful translation of phenotype descriptions between species. This ability is essential if we are to facilitate phenotype-driven gene function discovery and empower comparative pathobiology. Here, we review the current state of the art for phenotype and disease description in mice and humans, and discuss ways in which the semantic gap between coding systems might be bridged to facilitate the discovery and exploitation of new mouse models of human diseases.
Affiliation(s)
- Paul N Schofield
- Department of Physiology, Development and Neuroscience, University of Cambridge, Cambridge, UK.
|
44
|
Paananen J, Ciszek R, Wong G. Varietas: a functional variation database portal. Database: The Journal of Biological Databases and Curation 2010; 2010:baq016. [PMID: 20671203 PMCID: PMC2997604 DOI: 10.1093/database/baq016] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 01/11/2023]
Abstract
Current high-throughput technologies for investigating genomic variation in large population-based samples produce data on a scale of millions of variations. Browsing through these results and identifying relevant functional variations is a major hurdle in these genome-wide association studies. In order to help researchers locate the most promising associations, we have developed a web-based database portal called Varietas. Varietas can be used for retrieving information concerning genomic variations such as single-nucleotide polymorphisms (SNPs), copy number variants and insertions/deletions, while enabling users to annotate large numbers of variations in a batch-like manner and to find information about related genes, phenotypes and diseases. Varietas also links out to various external genomic databases, allowing users to quickly browse through a set of variations and follow the most promising leads. Varietas periodically integrates data from the major SNP and genome databases, including the Ensembl genome database, the NCBI dbSNP database, The Genomic Association Database and SNPedia. Database URL: http://kokki.uku.fi/bioinformatics/varietas/
Affiliation(s)
- Jussi Paananen
- Laboratory of Functional Genomics and Bioinformatics, Department of Neurobiology, A.I. Virtanen Institute for Molecular Sciences and Biocenter Finland, University of Eastern Finland, P.O. Box 1627, FIN-70211 Kuopio, Finland.
|
45
|
Coassin S, Brandstätter A, Kronenberg F. Lost in the space of bioinformatic tools: a constantly updated survival guide for genetic epidemiology. The GenEpi Toolbox. Atherosclerosis 2009; 209:321-35. [PMID: 19963217 DOI: 10.1016/j.atherosclerosis.2009.10.026] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Received: 07/23/2009] [Revised: 10/08/2009] [Accepted: 10/14/2009] [Indexed: 12/13/2022]
Abstract
Genome-wide association studies (GWASs) have led to impressive advances in the elucidation of genetic factors underlying complex phenotypes and diseases. However, the ability of GWAS to identify new susceptibility loci in a hypothesis-free approach requires tools to quickly retrieve comprehensive information about a genomic region and analyze the potential effects of coding and non-coding SNPs in a candidate gene region. Furthermore, once a candidate region is chosen for resequencing and fine-mapping studies, the identification of several rare mutations is likely and requires strong bioinformatic support to properly evaluate and prioritize the found mutations for further analysis. Due to the variety of regulatory layers that can be affected by a mutation, a comprehensive in-silico evaluation of candidate SNPs can be a demanding and very time-consuming task. Although many bioinformatic tools that significantly simplify this task have been made available in recent years, their utility is often still unknown to researchers not intensively involved in bioinformatics. We present a comprehensive guide to 64 tools and databases for bioinformatically analyzing gene regions of interest to predict SNP effects. In addition, we discuss tools to perform data mining of large genetic regions, predict the presence of regulatory elements, make in-silico evaluations of SNP effects and address issues ranging from interactome analysis to graphically annotated protein sequences. Finally, we exemplify the use of these tools by applying them to hits of a recently performed GWAS. Taken together, the discussed tools are summarized and constantly updated in the web-based "GenEpi Toolbox" (http://genepi_toolbox.i-med.ac.at) and can help to get a glimpse of the potential functional relevance of both large genetic regions and single nucleotide mutations, which might help to prioritize the next steps.
Affiliation(s)
- Stefan Coassin
- Division of Genetic Epidemiology, Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Schöpfstr. 41, A-6020 Innsbruck, Austria
|
46
|
Chandras C, Weaver T, Zouberakis M, Smedley D, Schughart K, Rosenthal N, Hancock JM, Kollias G, Schofield PN, Aidinis V. Models for financial sustainability of biological databases and resources. Database: The Journal of Biological Databases and Curation 2009; 2009:bap017. [PMID: 20157490 PMCID: PMC2790311 DOI: 10.1093/database/bap017] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.5] [Received: 07/24/2009] [Revised: 09/02/2009] [Accepted: 09/16/2009] [Indexed: 11/23/2022]
Abstract
Following the technological advances that have enabled genome-wide analysis in most model organisms over the last decade, there has been unprecedented growth in genomic and post-genomic science with concomitant generation of an exponentially increasing volume of data and material resources. As a result, numerous repositories have been created to store and archive data, organisms and material, which are of substantial value to the whole community. Sustained access, facilitating re-use of these resources, is essential, not only for validation, but for re-analysis, testing of new hypotheses and developing new technologies/platforms. A common challenge for most data resources and biological repositories today is finding financial support for maintenance and development to best serve the scientific community. In this study, we examine the problems that currently confront the data and resource infrastructure underlying the biomedical sciences. We discuss the financial sustainability issues and potential business models that could be adopted by biological resources, and consider long-term preservation issues within the context of mouse functional genomics efforts in Europe.
Affiliation(s)
- Christina Chandras
- Institute of Immunology, Biomedical Sciences Research Center Alexander Fleming, 34 Fleming Street, 16672 Athens, Greece, MRC Mary Lyon Centre, Harwell Science and Innovation Campus, Oxfordshire, OX11 0RD, European Bioinformatics Institute, EMBL, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK, Experimental Mouse Genetics, Helmholtz Centre for Infection Research & University of Veterinary Medicine, Hannover, Inhoffenstraße 7, D-38124 Braunschweig, Germany, EMBL-Monterotondo Outstation, Via Ramarini 32, 00015 Monterotondo-Scalo (RM), Italy, Bioinformatics Group, MRC Harwell, Harwell, Oxfordshire, OX11 0RD and Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
|
47
|
Winnenburg R, Plake C, Schroeder M. Improved mutation tagging with gene identifiers applied to membrane protein stability prediction. BMC Bioinformatics 2009; 10 Suppl 8:S3. [PMID: 19758467 PMCID: PMC2745585 DOI: 10.1186/1471-2105-10-s8-s3] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/23/2022] Open
Abstract
Background: The automated retrieval and integration of information about protein point mutations in combination with structure, domain and interaction data from literature and databases promises to be a valuable approach to study structure-function relationships in biomedical data sets. Results: We developed a rule- and regular expression-based protein point mutation retrieval pipeline for PubMed abstracts, which shows an F-measure of 87% for the mutation retrieval task on a benchmark dataset. In order to link mutations to their proteins, we utilize a named entity recognition algorithm for the identification of gene names co-occurring in the abstract, and establish links based on sequence checks. Vice versa, we could show that gene recognition improved from 77% to 91% F-measure when considering mutation information given in the text. To demonstrate practical relevance, we utilize mutation information from text to evaluate a novel solvation energy-based model for the prediction of stabilizing regions in membrane proteins. For five G protein-coupled receptors we identified 35 relevant single mutations and associated phenotypes, none of which had been annotated in the UniProt or PDB databases. In 71% of cases, the reported phenotypes were in compliance with the model predictions, supporting a relation between mutations and stability issues in membrane proteins. Conclusion: We present a reliable approach for the retrieval of protein mutations from PubMed abstracts for any set of genes or proteins of interest. We further demonstrate how amino acid substitution information from text can be utilized for protein structure stability studies on the basis of a novel energy model.
Affiliation(s)
- Rainer Winnenburg
- Biotechnology Center, Technische Universität Dresden, Tatzberg, Germany.
|
48
|
Brookes AJ, Lehvaslaiho H, Muilu J, Shigemoto Y, Oroguchi T, Tomiki T, Mukaiyama A, Konagaya A, Kojima T, Inoue I, Kuroda M, Mizushima H, Thorisson GA, Dash D, Rajeevan H, Darlison MW, Woon M, Fredman D, Smith AV, Senger M, Naito K, Sugawara H. The phenotype and genotype experiment object model (PaGE-OM): a robust data structure for information related to DNA variation. Hum Mutat 2009; 30:968-77. [DOI: 10.1002/humu.20973] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 11/10/2022]
|