1
Beesley LJ, Salvatore M, Fritsche LG, Pandit A, Rao A, Brummett C, Willer CJ, Lisabeth LD, Mukherjee B. The emerging landscape of health research based on biobanks linked to electronic health records: Existing resources, statistical challenges, and potential opportunities. Stat Med 2020; 39:773-800. [PMID: 31859414 PMCID: PMC7983809 DOI: 10.1002/sim.8445] [Citation(s) in RCA: 48] [Impact Index Per Article: 12.0] [Received: 10/10/2018] [Revised: 09/10/2019] [Accepted: 11/16/2019] [Indexed: 01/03/2023]
Abstract
Biobanks linked to electronic health records provide rich resources for health-related research. With improvements in administrative and informatics infrastructure, the availability and utility of data from biobanks have dramatically increased. In this paper, we first aim to characterize the current landscape of available biobanks and to describe specific biobanks, including their place of origin, size, and data types. The development and accessibility of large-scale biorepositories provide the opportunity to accelerate agnostic searches, expedite discoveries, and conduct hypothesis-generating studies of disease-treatment, disease-exposure, and disease-gene associations. Rather than designing and implementing a single study focused on a few targeted hypotheses, researchers can potentially use biobanks' existing resources to answer an expanded selection of exploratory questions as quickly as they can analyze them. However, there are many obvious and subtle challenges with the design and analysis of biobank-based studies. Our second aim is to discuss statistical issues related to biobank research such as study design, sampling strategy, phenotype identification, and missing data. We focus our discussion on biobanks that are linked to electronic health records. Some of the analytic issues are illustrated using data from the Michigan Genomics Initiative and UK Biobank, two biobanks with two different recruitment mechanisms. We summarize the current body of literature for addressing these challenges and discuss some standing open problems. This work complements and extends recent reviews about biobank-based research and serves as a resource catalog with analytical and practical guidance for statisticians, epidemiologists, and other medical researchers pursuing research using biobanks.
Affiliation(s)
- Anita Pandit
  - University of Michigan, Department of Biostatistics
- Arvind Rao
  - University of Michigan, Department of Computational Medicine and Bioinformatics
- Chad Brummett
  - University of Michigan, Department of Anesthesiology
- Cristen J. Willer
  - University of Michigan, Department of Computational Medicine and Bioinformatics
2
Zhuang Y, Wade K, Saba LM, Kechris K. Development of a tissue augmented Bayesian model for expression quantitative trait loci analysis. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2019; 17:122-143. [PMID: 31731343 PMCID: PMC7384761 DOI: 10.3934/mbe.2020007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/30/2023]
Abstract
Expression quantitative trait loci (eQTL) analyses detect genetic variants (SNPs) associated with RNA expression levels of genes. The conventional eQTL analysis is to perform individual tests for each gene-SNP pair using simple linear regression and to perform the test on each tissue separately ignoring the extensive information known about RNA expression in other tissue(s). Although Bayesian models have been recently developed to improve eQTL prediction on multiple tissues, they are often based on uninformative priors or treat all tissues equally. In this study, we develop a novel tissue augmented Bayesian model for eQTL analysis (TA-eQTL), which takes prior eQTL information from a different tissue into account to better predict eQTL for another tissue. We demonstrate that our modified Bayesian model has comparable performance to several existing methods in terms of sensitivity and specificity using allele-specific expression (ASE) as the gold standard. Furthermore, the tissue augmented Bayesian model improves the power and accuracy for local-eQTL prediction especially when the sample size is small. In summary, TA-eQTL's performance is comparable to existing methods but has additional flexibility to evaluate data from different platforms, can focus prediction on one tissue using only summary statistics from the secondary tissue(s), and provides a closed form solution for estimation.
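As a rough illustration of the conventional per-pair test this abstract describes (not the TA-eQTL model itself), the sketch below fits a simple linear regression of expression on genotype dosage for one gene-SNP pair. The function name `eqtl_slope` and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def eqtl_slope(genotype, expression):
    """Simple linear regression for one gene-SNP pair.

    Returns the estimated slope (effect of one extra allele on expression)
    and its t-statistic. Hypothetical helper, not taken from the paper.
    """
    x = np.asarray(genotype, dtype=float)
    y = np.asarray(expression, dtype=float)
    n = x.size
    xc = x - x.mean()                       # centered genotype dosage
    sxx = np.sum(xc ** 2)
    slope = np.sum(xc * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    # Standard error of the slope with n - 2 residual degrees of freedom
    se = np.sqrt(np.sum(resid ** 2) / (n - 2) / sxx)
    return slope, slope / se

# Toy data: allele dosages (0, 1, 2) and expression for six samples
slope, t_stat = eqtl_slope([0, 1, 2, 0, 1, 2], [1.0, 3.1, 4.9, 1.1, 2.9, 5.0])
```

In a conventional analysis this test would be repeated for every gene-SNP pair within each tissue separately, which is the baseline the abstract contrasts with its tissue-augmented prior.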
Affiliation(s)
- Yonghua Zhuang
  - Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver Anschutz Medical Campus, Mail Stop B119, 13001 E. 17th Place, Aurora, 80045, USA
- Kristen Wade
  - Human Medical Genetics and Genomics Program, School of Medicine, University of Colorado Denver Anschutz Medical Campus, 80045, Aurora, USA
- Laura M. Saba
  - Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Denver Anschutz Medical Campus, 80045, Aurora, USA
- Katerina Kechris
  - Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver Anschutz Medical Campus, Mail Stop B119, 13001 E. 17th Place, Aurora, 80045, USA
  - Correspondence: Tel: +13037244363, +13037249697
3
Cristea IA, Ioannidis JPA. P values in display items are ubiquitous and almost invariably significant: A survey of top science journals. PLoS One 2018; 13:e0197440. [PMID: 29763472 PMCID: PMC5953482 DOI: 10.1371/journal.pone.0197440] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 01/07/2018] [Accepted: 05/02/2018] [Indexed: 12/18/2022]
Abstract
P values represent a widely used, but pervasively misunderstood and fiercely contested method of scientific inference. Display items, such as figures and tables, often containing the main results, are an important source of P values. We conducted a survey comparing the overall use of P values and the occurrence of significant P values in display items of a sample of articles in the three top multidisciplinary journals (Nature, Science, PNAS) in 2017 and, respectively, in 1997. We also examined the reporting of multiplicity corrections and its potential influence on the proportion of statistically significant P values. Our findings demonstrated substantial and growing reliance on P values in display items, with increases of 2.5 to 14.5 times in 2017 compared to 1997. The overwhelming majority of P values (94%, 95% confidence interval [CI] 92% to 96%) were statistically significant. Methods to adjust for multiplicity were almost non-existent in 1997, but reported in many articles relying on P values in 2017 (Nature 68%, Science 48%, PNAS 38%). In their absence, almost all reported P values were statistically significant (98%, 95% CI 96% to 99%). Conversely, when any multiplicity corrections were described, 88% (95% CI 82% to 93%) of reported P values were statistically significant. Use of Bayesian methods was scant (2.5%) and rarely (0.7%) articles relied exclusively on Bayesian statistics. Overall, wider appreciation of the need for multiplicity corrections is a welcome evolution, but the rapid growth of reliance on P values and implausibly high rates of reported statistical significance are worrisome.
Affiliation(s)
- Ioana Alina Cristea
  - Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, California, United States of America
  - Department of Clinical Psychology and Psychotherapy, Babes-Bolyai University, Cluj-Napoca, Romania
- John P. A. Ioannidis
  - Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, California, United States of America
  - Departments of Medicine, Stanford University, Stanford, California, United States of America
  - Department of Health Research and Policy, Stanford University, Stanford, California, United States of America
  - Department of Biomedical Data Science, Stanford University, Stanford, California, United States of America
  - Department of Statistics, Stanford University, Stanford, California, United States of America
4
Davis J, Fresard L, Knowles D, Pala M, Bustamante C, Battle A, Montgomery S. An Efficient Multiple-Testing Adjustment for eQTL Studies that Accounts for Linkage Disequilibrium between Variants. Am J Hum Genet 2016; 98:216-24. [PMID: 26749306 DOI: 10.1016/j.ajhg.2015.11.021] [Citation(s) in RCA: 57] [Impact Index Per Article: 7.1] [Received: 08/30/2015] [Accepted: 11/18/2015] [Indexed: 10/22/2022]
Abstract
Methods for multiple-testing correction in local expression quantitative trait locus (cis-eQTL) studies are a trade-off between statistical power and computational efficiency. Bonferroni correction, though computationally trivial, is overly conservative and fails to account for linkage disequilibrium between variants. Permutation-based methods are more powerful, though computationally far more intensive. We present an alternative correction method called eigenMT, which runs over 500 times faster than permutations and has adjusted p values that closely approximate empirical ones. To achieve this speed while also maintaining the accuracy of permutation-based methods, we estimate the effective number of independent variants tested for association with a particular gene, termed Meff, by using the eigenvalue decomposition of the genotype correlation matrix. We employ a regularized estimator of the correlation matrix to ensure Meff is robust and yields adjusted p values that closely approximate p values from permutations. Finally, using a common genotype matrix, we show that eigenMT can be applied with even greater efficiency to studies across tissues or conditions. Our method provides a simpler, more efficient approach to multiple-testing correction than existing methods and fits within existing pipelines for eQTL discovery.
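The effective-number-of-tests idea summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the eigenMT implementation: the regularization here is a plain shrinkage of the correlation matrix toward the identity (the `shrinkage` parameter is hypothetical), and Meff is taken as the number of leading eigenvalues needed to reach an assumed variance threshold (`var_explained`, set to 0.99 for illustration). Neither detail is specified in the abstract.

```python
import numpy as np

def effective_tests(genotypes, shrinkage=0.1, var_explained=0.99):
    """Estimate an effective number of independent variants (Meff).

    genotypes: array of shape (samples, variants).
    shrinkage and var_explained are illustrative assumptions.
    """
    corr = np.corrcoef(genotypes, rowvar=False)        # variant-by-variant correlation
    m = corr.shape[0]
    # Assumed regularization: shrink the sample correlation toward the identity
    corr_reg = (1.0 - shrinkage) * corr + shrinkage * np.eye(m)
    eigvals = np.linalg.eigvalsh(corr_reg)[::-1]       # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)              # guard against tiny negatives
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # Smallest number of eigenvalues whose share reaches the threshold
    return min(m, int(np.searchsorted(cumulative, var_explained)) + 1)

def adjust_p(p_value, meff):
    """Bonferroni-style adjustment using Meff instead of the raw test count."""
    return min(1.0, p_value * meff)
```

When variants are in strong linkage disequilibrium, a few eigenvalues carry most of the variance, so Meff falls well below the number of variants and the adjustment is less conservative than Bonferroni correction over all variants.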
5
Abstract
RNA sequencing (RNA-Seq) uses the capabilities of high-throughput sequencing methods to provide insight into the transcriptome of a cell. Compared to previous Sanger sequencing- and microarray-based methods, RNA-Seq provides far higher coverage and greater resolution of the dynamic nature of the transcriptome. Beyond quantifying gene expression, the data generated by RNA-Seq facilitate the discovery of novel transcripts, identification of alternatively spliced genes, and detection of allele-specific expression. Recent advances in the RNA-Seq workflow, from sample preparation to library construction to data analysis, have enabled researchers to further elucidate the functional complexity of the transcription. In addition to polyadenylated messenger RNA (mRNA) transcripts, RNA-Seq can be applied to investigate different populations of RNA, including total RNA, pre-mRNA, and noncoding RNA, such as microRNA and long ncRNA. This article provides an introduction to RNA-Seq methods, including applications, experimental design, and technical challenges.
Affiliation(s)
- Kimberly R Kukurba
  - Department of Pathology, Stanford University School of Medicine, Stanford, California 94305; Department of Genetics, Stanford University School of Medicine, Stanford, California 94305
- Stephen B Montgomery
  - Department of Pathology, Stanford University School of Medicine, Stanford, California 94305; Department of Genetics, Stanford University School of Medicine, Stanford, California 94305; Department of Computer Science, Stanford University School of Medicine, Stanford, California 94305
6
Sahadevan S, Gunawan A, Tholen E, Große-Brinkhaus C, Tesfaye D, Schellander K, Hofmann-Apitius M, Cinar MU, Uddin MJ. Pathway based analysis of genes and interactions influencing porcine testis samples from boars with divergent androstenone content in back fat. PLoS One 2014; 9:e91077. [PMID: 24614349 PMCID: PMC3948775 DOI: 10.1371/journal.pone.0091077] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 09/10/2013] [Accepted: 02/07/2014] [Indexed: 12/21/2022]
Abstract
One of the primary factors contributing to boar taint is the level of androstenone in porcine adipose tissues. A majority of the studies performed to identify candidate biomarkers for the synthesis of androstenone in testis tissues follow a reductionist approach, identifying and studying the effect of biomarkers individually. Although these studies provide detailed information on individual biomarkers, a global picture of changes in metabolic pathways that lead to the difference in androstenone synthesis is still missing. The aim of this work was to identify major pathways and interactions influencing steroid hormone synthesis and androstenone biosynthesis using an integrative approach to provide a bird's eye view of the factors causing difference in steroidogenesis and androstenone biosynthesis. For this purpose, we followed an analysis procedure merging together gene expression data from boars with divergent levels of androstenone and pathway mapping and interaction network retrieved from KEGG database. The interaction networks were weighted with Pearson correlation coefficients calculated from gene expression data and significant interactions and enriched pathways were identified based on these networks. The results show that 1,023 interactions were significant for high and low androstenone animals and that a total of 92 pathways were enriched for significant interactions. Although published articles show that a number of these enriched pathways were activated as a result of downstream signaling of steroid hormones, we speculate that the significant interactions in pathways such as glutathione metabolism, sphingolipid metabolism, fatty acid metabolism and significant interactions in cAMP-PKA/PKC signaling might be the key factors determining the difference in steroidogenesis and androstenone biosynthesis between boars with divergent androstenone levels in our study. 
The results and assumptions presented in this study are from an in-silico analysis done at the gene expression level and further laboratory experiments at genomic, proteomic or metabolomic level are necessary to validate these findings.
Affiliation(s)
- Sudeep Sahadevan
  - Institute of Animal Science, University of Bonn, Bonn, Germany
  - Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, Germany
- Asep Gunawan
  - Institute of Animal Science, University of Bonn, Bonn, Germany
  - Department of Animal Production and Technology, Faculty of Animal Science, Bogor Agricultural University, Bogor, Indonesia
- Ernst Tholen
  - Institute of Animal Science, University of Bonn, Bonn, Germany
- Dawit Tesfaye
  - Institute of Animal Science, University of Bonn, Bonn, Germany
- Martin Hofmann-Apitius
  - Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, Germany
  - Bonn-Aachen International Center for Information Technology (B-IT), Bonn, Germany
- Mehmet Ulas Cinar
  - Department of Animal Science, Faculty of Agriculture, Erciyes University, Kayseri, Turkey
7
Hirsch HVB, Lnenicka G, Possidente D, Possidente B, Garfinkel MD, Wang L, Lu X, Ruden DM. Drosophila melanogaster as a model for lead neurotoxicology and toxicogenomics research. Front Genet 2012; 3:68. [PMID: 22586431 PMCID: PMC3343274 DOI: 10.3389/fgene.2012.00068] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 02/29/2012] [Accepted: 04/09/2012] [Indexed: 01/01/2023]
Abstract
Drosophila melanogaster is an excellent model animal for studying the neurotoxicology of lead. It has been known since ancient Roman times that long-term exposure to low levels of lead results in behavioral abnormalities, such as what is now known as attention deficit hyperactivity disorder (ADHD). Because lead alters mechanisms that underlie developmental neuronal plasticity, chronic exposure of children, even at blood lead levels below the current CDC community action level (10 μg/dl), can result in reduced cognitive ability, increased likelihood of delinquency, behaviors associated with ADHD, changes in activity level, altered sensory function, delayed onset of sexual maturity in girls, and changes in immune function. In order to better understand how lead affects neuronal plasticity, we will describe recent findings from a Drosophila behavioral genetics laboratory, a Drosophila neurophysiology laboratory, and a Drosophila quantitative genetics laboratory who have joined forces to study the effects of lead on the Drosophila nervous system. Studying the effects of lead on Drosophila nervous system development will give us a better understanding of the mechanisms of Pb neurotoxicity in the developing human nervous system.
Affiliation(s)
- Helmut V B Hirsch
  - Department of Biological Sciences, University at Albany, State University of New York, Albany, NY, USA
8
Functional genomic architecture of predisposition to voluntary exercise in mice: expression QTL in the brain. Genetics 2012; 191:643-54. [PMID: 22466041 DOI: 10.1534/genetics.112.140509] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Indexed: 12/21/2022]
Abstract
The biological basis of voluntary exercise is complex and simultaneously controlled by peripheral (ability) and central (motivation) mechanisms. The accompanying natural reward, potential addiction, and the motivation associated with exercise are hypothesized to be regulated by multiple brain regions, neurotransmitters, peptides, and hormones. We generated a large (n = 815) advanced intercross line of mice (G(4)) derived from a line selectively bred for increased wheel running (high runner) and the C57BL/6J inbred strain. We previously mapped multiple quantitative trait loci (QTL) that contribute to the biological control of voluntary exercise levels, body weight, and composition, as well as changes in body weight and composition in response to short-term exercise. Currently, using a subset of the G(4) population (n = 244), we examined the transcriptional landscape relevant to neurobiological aspects of voluntary exercise by means of global mRNA expression profiles from brain tissue. We identified genome-wide expression quantitative trait loci (eQTL) regulating variation in mRNA abundance and determined the mode of gene action and the cis- and/or trans-acting nature of each eQTL. Subsets of cis-acting eQTL, colocalizing with QTL for exercise or body composition traits, were used to identify candidate genes based on both positional and functional evidence, which were further filtered by correlational and exclusion mapping analyses. Specifically, we discuss six plausible candidate genes (Insig2, Socs2, DBY, Arrdc4, Prcp, IL15) and their potential role in the regulation of voluntary activity, body composition, and their interactions. These results develop a potential initial model of the underlying functional genomic architecture of predisposition to voluntary exercise and its effects on body weight and composition within a neurophysiological framework.