1
Warmerdam R, Lanting P, Deelen P, Franke L. Idéfix: identifying accidental sample mix-ups in biobanks using polygenic scores. Bioinformatics 2021; 38:1059-1066. [PMID: 34792549 PMCID: PMC8796367 DOI: 10.1093/bioinformatics/btab783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2021] [Revised: 10/07/2021] [Accepted: 11/15/2021] [Indexed: 02/03/2023]
Abstract
MOTIVATION: Identifying sample mix-ups in biobanks is essential to allow the repurposing of genetic data for clinical pharmacogenetics. Pharmacogenetic advice based on the genetic information of another individual is potentially harmful. Existing methods for identifying mix-ups are limited to datasets in which additional omics data (e.g. gene expression) are available. Cohorts lacking such data can only use sex, which can reveal only half of the mix-ups. Here, we describe Idéfix, a method for the identification of accidental sample mix-ups in biobanks using polygenic scores.
RESULTS: In the Lifelines population-based biobank, we calculated polygenic scores (PGSs) for 25 traits for 32 786 participants. We then applied Idéfix to compare the actual phenotypes to PGSs, and to use the relative discordance that is expected for mix-ups, compared to correct samples. In a simulation, using induced mix-ups, Idéfix reaches an AUC of 0.90 using 25 polygenic scores and sex. This is a substantial improvement over using only sex, which has an AUC of 0.75. Subsequent simulations present Idéfix's potential in varying datasets with more powerful PGSs. This suggests its performance will likely improve when more highly powered GWASs for commonly measured traits become available. Idéfix can be used to identify a set of high-quality participants for whom it is very unlikely that they reflect sample mix-ups, and for these participants we can use genetic data for clinical purposes, such as pharmacogenetic profiles. For instance, in Lifelines, we can select 34.4% of participants, reducing the sample mix-up rate from 0.15% to 0.01%.
AVAILABILITY AND IMPLEMENTATION: Idéfix is freely available at https://github.com/molgenis/systemsgenetics/wiki/Idefix. The individual-level data that support the findings were obtained from the Lifelines biobank under project application number ov16_0365. Data is made available upon reasonable request submitted to the LifeLines Research office (research@lifelines.nl, https://www.lifelines.nl/researcher/how-to-apply/apply-here).
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
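The sex-only AUC of 0.75 reported in this abstract is what one would expect if sex reveals exactly half of the mix-ups. The sketch below illustrates that arithmetic only; it is not the Idéfix implementation. The toy data and the helper name `auc_with_ties` are hypothetical, and the sketch assumes that a correctly labelled sample never shows a sex mismatch.

```python
def auc_with_ties(pos_scores, neg_scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical toy data: score 1 means genetic sex disagrees with reported sex.
mixups = [1] * 50 + [0] * 50   # assumption: half of the mix-ups show a mismatch
correct = [0] * 100            # assumption: correct samples never mismatch

sex_only_auc = auc_with_ties(mixups, correct)
print(sex_only_auc)  # 0.75
```

The detectable half of the mix-ups always outranks correct samples, while the other half ties with them, giving 0.5 × 1 + 0.5 × 0.5 = 0.75.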
Affiliation(s)
- Robert Warmerdam
- Department of Genetics, University Medical Center Groningen, University of Groningen, 9700AB Groningen, The Netherlands
- Pauline Lanting
- Department of Genetics, University Medical Center Groningen, University of Groningen, 9700AB Groningen, The Netherlands
- Patrick Deelen
- Department of Genetics, University Medical Center Groningen, University of Groningen, 9700AB Groningen, The Netherlands; Department of Genetics, University Medical Center Utrecht, 3508GA Utrecht, The Netherlands
2
He KY, Ge D, He MM. Big Data Analytics for Genomic Medicine. Int J Mol Sci 2017; 18:ijms18020412. [PMID: 28212287 PMCID: PMC5343946 DOI: 10.3390/ijms18020412] [Citation(s) in RCA: 104] [Impact Index Per Article: 14.9] [Received: 10/24/2016] [Revised: 02/08/2017] [Accepted: 02/09/2017] [Indexed: 12/25/2022]
Abstract
Genomic medicine attempts to build individualized strategies for diagnostic or therapeutic decision-making by utilizing patients’ genomic information. Big Data analytics uncovers hidden patterns, unknown correlations, and other insights through examining large-scale various data sets. While integration and manipulation of diverse genomic data and comprehensive electronic health records (EHRs) on a Big Data infrastructure exhibit challenges, they also provide a feasible opportunity to develop an efficient and effective approach to identify clinically actionable genetic variants for individualized diagnosis and therapy. In this paper, we review the challenges of manipulating large-scale next-generation sequencing (NGS) data and diverse clinical data derived from the EHRs for genomic medicine. We introduce possible solutions for different challenges in manipulating, managing, and analyzing genomic and clinical data to implement genomic medicine. Additionally, we also present a practical Big Data toolset for identifying clinically actionable genetic variants using high-throughput NGS data and EHRs.
Affiliation(s)
- Karen Y He
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH 44106, USA.
- Max M He
- BioSciKin Co., Ltd., Nanjing 210042, China.
- Computation and Informatics in Biology and Medicine, University of Wisconsin-Madison, Madison, WI 53706, USA.
3
Peissig PL, Santos Costa V, Caldwell MD, Rottscheit C, Berg RL, Mendonca EA, Page D. Relational machine learning for electronic health record-driven phenotyping. J Biomed Inform 2014; 52:260-70. [PMID: 25048351 PMCID: PMC4261015 DOI: 10.1016/j.jbi.2014.07.007] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Received: 02/27/2014] [Revised: 05/21/2014] [Accepted: 07/08/2014] [Indexed: 01/19/2023]
Abstract
OBJECTIVE: Electronic health records (EHR) offer medical and pharmacogenomics research unprecedented opportunities to identify and classify patients at risk. EHRs are collections of highly inter-dependent records that include biological, anatomical, physiological, and behavioral observations. They comprise a patient's clinical phenome, where each patient has thousands of date-stamped records distributed across many relational tables. Development of EHR computer-based phenotyping algorithms requires time and medical insight from clinical experts, who most often can only review a small patient subset representative of the total EHR records, to identify phenotype features. In this research we evaluate whether relational machine learning (ML) using inductive logic programming (ILP) can contribute to addressing these issues as a viable approach for EHR-based phenotyping.
METHODS: Two relational learning ILP approaches and three well-known WEKA (Waikato Environment for Knowledge Analysis) implementations of non-relational approaches (PART, J48, and JRIP) were used to develop models for nine phenotypes. International Classification of Diseases, Ninth Revision (ICD-9) coded EHR data were used to select training cohorts for the development of each phenotypic model. Accuracy, precision, recall, F-Measure, and Area Under the Receiver Operating Characteristic (AUROC) curve statistics were measured for each phenotypic model based on independent manually verified test cohorts. A two-sided binomial distribution test (sign test) compared the five ML approaches across phenotypes for statistical significance.
RESULTS: We developed an approach to automatically label training examples using ICD-9 diagnosis codes for the ML approaches being evaluated. Nine phenotypic models for each ML approach were evaluated, resulting in better overall model performance in AUROC using ILP when compared to PART (p=0.039), J48 (p=0.003) and JRIP (p=0.003).
DISCUSSION: ILP has the potential to improve phenotyping by independently delivering clinically expert interpretable rules for phenotype definitions, or intuitive phenotypes to assist experts.
CONCLUSION: Relational learning using ILP offers a viable approach to EHR-driven phenotyping.
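The two-sided sign test named in the methods can be sketched as follows. This is an illustration of the general test, not the authors' code. The win count used in the example (8 of 9 phenotypes) is a hypothetical input chosen for illustration, since the abstract does not report per-phenotype outcomes. `math.comb` requires Python 3.8 or later.

```python
from math import comb


def sign_test_two_sided(wins, n):
    """Exact two-sided sign test p-value under H0: P(win) = 0.5.

    `wins` is the number of paired comparisons won by one method out of
    `n` comparisons (ties are assumed to have been dropped beforehand).
    """
    k = max(wins, n - wins)  # size of the more extreme side
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical example: one method wins on 8 of 9 phenotypes.
p_8_of_9 = sign_test_two_sided(8, 9)
print(round(p_8_of_9, 3))  # 0.039
```

With only nine paired comparisons the set of attainable p-values is coarse, so small differences in win counts move the result substantially.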
Affiliation(s)
- Peggy L Peissig
- Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, WI, USA.
- Vitor Santos Costa
- DCC-FCUP and CRACS INESC-TEC, Department de Ciência de Computadores, Universidade do Porto, Portugal
- Carla Rottscheit
- Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, WI, USA
- Richard L Berg
- Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, WI, USA
- Eneida A Mendonca
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, USA; Department of Pediatrics, University of Wisconsin-Madison, USA
- David Page
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, USA; Department of Computer Sciences, University of Wisconsin-Madison, USA
4
Liao J, Li X, Wong TY, Wang JJ, Khor CC, Tai ES, Aung T, Teo YY, Cheng CY. Impact of measurement error on testing genetic association with quantitative traits. PLoS One 2014; 9:e87044. [PMID: 24475218 PMCID: PMC3901720 DOI: 10.1371/journal.pone.0087044] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 08/30/2013] [Accepted: 12/17/2013] [Indexed: 12/23/2022]
Abstract
Measurement error of a phenotypic trait reduces the power to detect genetic associations. We examined the impact of sample size, allele frequency and effect size in the presence of measurement error for quantitative traits. The statistical power to detect genetic association with phenotype mean and variability was investigated analytically. The non-centrality parameter for a non-central F distribution was derived and verified using computer simulations. We obtained equivalent formulas for the cost of phenotype measurement error. Effects of differences in measurements were examined in a genome-wide association study (GWAS) of two grading scales for cataract and a replication study of genetic variants influencing blood pressure. The mean absolute difference between the analytic power and simulation power for comparison of phenotypic means and variances was less than 0.005, and the absolute difference did not exceed 0.02. To maintain the same power, a one standard deviation (SD) in measurement error of a standard normal distributed trait required a one-fold increase in sample size for comparison of means, and a three-fold increase in sample size for comparison of variances. GWAS results revealed almost no overlap in the significant SNPs (p<10⁻⁵) for the two cataract grading scales, while replication results in genetic variants of blood pressure displayed no significant differences between averaged blood pressure measurements and single blood pressure measurements. We have developed a framework for researchers to quantify power in the presence of measurement error, which will be applicable to studies of phenotypes in which the measurement is highly variable.
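The sample-size figures in this abstract can be illustrated with a small calculation. The sketch below is not the paper's derivation. It assumes independent, additive measurement error, that the required sample size for comparing means scales with the total variance, and that the requirement for comparing variances scales with the square of the total variance. The second scaling is an assumption chosen because it reproduces the reported figures. The function name is hypothetical.

```python
def sample_size_multiplier(trait_sd, error_sd, compare="means"):
    """Factor by which the sample size must grow to keep the same power
    when independent measurement error with SD `error_sd` is added to a
    trait with SD `trait_sd` (illustrative assumptions, see lead-in)."""
    ratio = (trait_sd ** 2 + error_sd ** 2) / trait_sd ** 2
    if compare == "means":
        return ratio        # assumed: scales with total variance
    if compare == "variances":
        return ratio ** 2   # assumed: scales with squared total variance
    raise ValueError("compare must be 'means' or 'variances'")


# Standard normal trait (SD 1) measured with an error of 1 SD.
means_factor = sample_size_multiplier(1.0, 1.0, "means")
variances_factor = sample_size_multiplier(1.0, 1.0, "variances")
print(means_factor, variances_factor)  # 2.0 4.0
```

A factor of 2 corresponds to the "one-fold increase" for means and a factor of 4 to the "three-fold increase" for variances.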
Affiliation(s)
- Jiemin Liao
- Department of Ophthalmology, National University of Singapore and National University Health System, Singapore, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Xiang Li
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore
- Tien-Yin Wong
- Department of Ophthalmology, National University of Singapore and National University Health System, Singapore, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore
- Jie Jin Wang
- Centre for Vision Research, University of Sydney, Sydney, Australia
- Chiea Chuen Khor
- Department of Ophthalmology, National University of Singapore and National University Health System, Singapore, Singapore
- Division of Human Genetics, Genome Institute of Singapore, Singapore, Singapore
- E. Shyong Tai
- Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore
- Department of Medicine, National University of Singapore and National University Health System, Singapore, Singapore
- Duke-NUS Graduate Medical School, Singapore, Singapore
- Tin Aung
- Department of Ophthalmology, National University of Singapore and National University Health System, Singapore, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Yik-Ying Teo
- Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore
- Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore
- Ching-Yu Cheng
- Department of Ophthalmology, National University of Singapore and National University Health System, Singapore, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore
- Duke-NUS Graduate Medical School, Singapore, Singapore
5
Londono D, Chen KM, Musolf A, Wang R, Shen T, Brandon J, Herring JA, Wise CA, Zou H, Jin M, Yu L, Finch SJ, Matise TC, Gordon D. A novel method for analyzing genetic association with longitudinal phenotypes. Stat Appl Genet Mol Biol 2013; 12:241-61. [PMID: 23502345 DOI: 10.1515/sagmb-2012-0070] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 11/15/2022]
Abstract
Knowledge of genes influencing longitudinal patterns may offer information about predicting disease progression. We developed a systematic procedure for testing association between SNP genotypes and longitudinal phenotypes. We evaluated false positive rates and statistical power to localize genes for disease progression. We used genome-wide SNP data from the Framingham Heart Study. With longitudinal data from two real studies unrelated to Framingham, we estimated three trajectory curves from each study. We performed simulations by randomly selecting 500 individuals. In each simulation replicate, we assigned each individual to one of the three trajectory groups based on the underlying hypothesis (null or alternative), and generated corresponding longitudinal data. Individual Bayesian posterior probabilities (BPPs) for belonging to a specific trajectory curve were estimated. These BPPs were treated as a quantitative trait and tested (using the Wald test) for genome-wide association. Empirical false positive rates and power were calculated. Our method maintained the expected false positive rate for all simulation models. Also, our method achieved high empirical power for most simulations. Our work presents a method for disease progression gene mapping. This method is potentially clinically significant as it may allow doctors to predict disease progression based on genotype and determine treatment accordingly.
Affiliation(s)
- Douglas Londono
- Department of Genetics, Rutgers, The State University of New Jersey, 145 Bevier Road, Piscataway, NJ 08854, USA
6
Gurwitz D. High-Quality Phenomics are Crucial for Informative Omics Studies. Drug Dev Res 2012. [DOI: 10.1002/ddr.21025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Affiliation(s)
- David Gurwitz
- Department of Human Molecular Genetics and Biochemistry; Sackler Faculty of Medicine; Tel-Aviv University; Tel-Aviv; 69978; Israel
7
Strauss KA, Puffenberger EG, Morton DH. One community's effort to control genetic disease. Am J Public Health 2012; 102:1300-6. [PMID: 22594747 PMCID: PMC3477994 DOI: 10.2105/ajph.2011.300569] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Accepted: 11/06/2011] [Indexed: 01/18/2023]
Abstract
In 1989, we established a small community health clinic to provide care for uninsured Amish and Mennonite children with genetic disorders. Over 20 years, we have used publicly available molecular data and sophisticated technologies to improve diagnostic efficiency, control laboratory costs, reduce hospitalizations, and prevent major neurological impairments within a rural underserved community. These actions allowed the clinic's 2010 operating budget of $1.5 million to save local communities an estimated $20 to $25 million in aggregate medical costs. This exposes an unsettling fact: our failure to improve the lot of most people stricken with genetic disease is no longer a matter of scientific ignorance or prohibitive costs but of choices we make about how to implement existing knowledge and resources.
8
Melendez RI, McGinty JF, Kalivas PW, Becker HC. Brain region-specific gene expression changes after chronic intermittent ethanol exposure and early withdrawal in C57BL/6J mice. Addict Biol 2012; 17:351-64. [PMID: 21812870 DOI: 10.1111/j.1369-1600.2011.00357.x] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.7] [Indexed: 01/01/2023]
Abstract
Neuroadaptations that participate in the ontogeny of alcohol dependence are likely a result of altered gene expression in various brain regions. The present study investigated brain region-specific changes in the pattern and magnitude of gene expression immediately following chronic intermittent ethanol (CIE) exposure and 8 hours following final ethanol exposure [i.e. early withdrawal (EWD)]. High-density oligonucleotide microarrays (Affymetrix 430A 2.0, Affymetrix, Santa Clara, CA, USA) and bioinformatics analysis were used to characterize gene expression and function in the prefrontal cortex (PFC), hippocampus (HPC) and nucleus accumbens (NAc) of C57BL/6J mice (Jackson Laboratories, Bar Harbor, ME, USA). Gene expression levels were determined using gene chip robust multi-array average followed by statistical analysis of microarrays and validated by quantitative real-time reverse transcription polymerase chain reaction and Western blot analysis. Results indicated that immediately following CIE exposure, changes in gene expression were strikingly greater in the PFC (284 genes) compared with the HPC (16 genes) and NAc (32 genes). Bioinformatics analysis revealed that most of the transcriptionally responsive genes in the PFC were involved in Ras/MAPK signaling, notch signaling or ubiquitination. In contrast, during EWD, changes in gene expression were greatest in the HPC (139 genes) compared with the PFC (four genes) and NAc (eight genes). The most transcriptionally responsive genes in the HPC were involved in mRNA processing or actin dynamics. Of the few genes detected in the NAc, the most representative were involved in circadian rhythms. Overall, these findings indicate that brain region-specific and time-dependent neuroadaptive alterations in gene expression play an integral role in the development of alcohol dependence and withdrawal.
Affiliation(s)
- Roberto I Melendez
- Department of Neurosciences, Medical University of South Carolina, Charleston, SC 29425, USA
9
Westra HJ, Jansen RC, Fehrmann RSN, te Meerman GJ, van Heel D, Wijmenga C, Franke L. MixupMapper: correcting sample mix-ups in genome-wide datasets increases power to detect small genetic effects. Bioinformatics 2011; 27:2104-11. [PMID: 21653519 DOI: 10.1093/bioinformatics/btr323] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.5] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Sample mix-ups can arise during sample collection, handling, genotyping or data management. It is unclear how often sample mix-ups occur in genome-wide studies, as there currently are no post hoc methods that can identify these mix-ups in unrelated samples. We have therefore developed an algorithm (MixupMapper) that can both detect and correct sample mix-ups in genome-wide studies that study gene expression levels.
RESULTS: We applied MixupMapper to five publicly available human genetical genomics datasets. On average, 3% of all analyzed samples had been assigned incorrect expression phenotypes: in one of the datasets 23% of the samples had incorrect expression phenotypes. The consequences of sample mix-ups are substantial: when we corrected these sample mix-ups, we identified on average 15% more significant cis-expression quantitative trait loci (cis-eQTLs). In one dataset, we identified three times as many significant cis-eQTLs after correction. Furthermore, we show through simulations that sample mix-ups can lead to an underestimation of the explained heritability of complex traits in genome-wide association datasets.
AVAILABILITY AND IMPLEMENTATION: MixupMapper is freely available at http://www.genenetwork.nl/mixupmapper/
Affiliation(s)
- Harm-Jan Westra
- Department of Genetics, University Medical Center Groningen, Groningen, The Netherlands
10
Abstract
Access to one's own complete genome was unheard of just a few years ago. At present we have a smattering of identifiable complete human genomes, but the coming months and years will undoubtedly bring thousands more. What will this mean for the practice of medicine in the US? No one knows, but given the remarkable drop in the cost of DNA sequencing over the last few years, it seems a safe bet that within the next decade, primary care physicians will order patients' whole genome sequences with no more fanfare than they would a complete blood count. But the challenges of transforming that easily accessible information into cost savings and better health outcomes will be daunting. Obviously, we lack interpretive abilities and phenotypic information commensurate with our skill in amassing DNA sequences. Worse, we have exacerbated these problems by failing to embrace the increasing ubiquity of genomic information, the populace's interest in it, and its relevance to virtually every medical specialty. The success of personal genomics will require a profound cultural shift by every entity with a stake in human health.
Affiliation(s)
- Misha Angrist
- Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina 27708-1009, USA.
11
Gurwitz D, Pirmohamed M. Pharmacogenomics: the importance of accurate phenotypes. Pharmacogenomics 2010; 11:469-70. [PMID: 20350123 DOI: 10.2217/pgs.10.41] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Indexed: 12/16/2022]
Abstract
Lack of knowledge regarding genotype-phenotype correlations is often cited as the major barrier delaying the uptake of pharmacogenomics into routine medical practice. When we look forward to genome-wide association studies as one of the most promising tools for overcoming the pharmacogenomics knowledge barrier, we must keep in mind that having large patient cohorts may not help improve our understanding of alleles implicated in drug-response phenotypes, unless we ensure that such phenotypes are precise and pertinent. It may be wiser, and far more cost effective, to invest scarce research funding in accurate patient drug-response phenotyping than to genotype (or fully sequence) hundreds to thousands of study participants. Biobanks created with personalized medicine research in mind should, when possible, have access to donors' clinical data, including detailed disease- and drug-response phenotypes.
Affiliation(s)
- David Gurwitz
- Department of Human Molecular Genetics & Biochemistry, Sackler Faculty of Medicine, Tel-Aviv University, Tel-Aviv, 69978 Israel.
12
Samuels DC, Chinnery PF. Reply to Lee and Sawcer. Trends Genet 2010. [DOI: 10.1016/j.tig.2010.03.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
13
Lee K, Sawcer S. Detecting genes in complex disease: does phenotype accuracy limit the horizon? Trends Genet 2010; 26:241-2; author reply 242-3. [PMID: 20417577 DOI: 10.1016/j.tig.2010.03.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 02/19/2010] [Revised: 03/29/2010] [Accepted: 03/29/2010] [Indexed: 11/26/2022]
14
Only Connect. Mol Diagn Ther 2010. [DOI: 10.1007/bf03256355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]