1
Zhang G, Beesley LJ, Mukherjee B, Shi X. Patient recruitment using electronic health records under selection bias: a two-phase sampling framework. Ann Appl Stat 2024; 18:1858-1878. [PMID: 39149424 PMCID: PMC11323140 DOI: 10.1214/23-aoas1860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/17/2024]
Abstract
Electronic health records (EHRs) are increasingly recognized as a cost-effective resource for patient recruitment in clinical research. However, how to optimally select a cohort from millions of individuals to answer a scientific question of interest remains unclear. Consider a study to estimate the mean or mean difference of an expensive outcome. Inexpensive auxiliary covariates predictive of the outcome may often be available in patients' health records, presenting an opportunity to recruit patients selectively, which may improve efficiency in downstream analyses. In this paper we propose a two-phase sampling design that leverages available information on auxiliary covariates in EHR data. A key challenge in using EHR data for multiphase sampling is the potential selection bias, because EHR data are not necessarily representative of the target population. Extending existing literature on two-phase sampling design, we derive an optimal two-phase sampling method that improves efficiency over random sampling while accounting for the potential selection bias in EHR data. We demonstrate the efficiency gain from our sampling design via simulation studies and an application evaluating the prevalence of hypertension among U.S. adults leveraging data from the Michigan Genomics Initiative, a longitudinal biorepository in Michigan Medicine.
Affiliation(s)
- Xu Shi
  - Department of Biostatistics, University of Michigan
2
Nam Y, Kim J, Jung SH, Woerner J, Suh EH, Lee DG, Shivakumar M, Lee ME, Kim D. Harnessing Artificial Intelligence in Multimodal Omics Data Integration: Paving the Path for the Next Frontier in Precision Medicine. Annu Rev Biomed Data Sci 2024; 7:225-250. [PMID: 38768397 DOI: 10.1146/annurev-biodatasci-102523-103801] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024]
Abstract
The integration of multiomics data with detailed phenotypic insights from electronic health records marks a paradigm shift in biomedical research, offering unparalleled holistic views into health and disease pathways. This review delineates the current landscape of multimodal omics data integration, emphasizing its transformative potential in generating a comprehensive understanding of complex biological systems. We explore robust methodologies for data integration, ranging from concatenation-based to transformation-based and network-based strategies, designed to harness the intricate nuances of diverse data types. Our discussion extends from incorporating large-scale population biobanks to dissecting high-dimensional omics layers at the single-cell level. The review underscores the emerging role of large language models in artificial intelligence, anticipating their influence as a near-future pivot in data integration approaches. Highlighting both achievements and hurdles, we advocate for a concerted effort toward sophisticated integration models, fortifying the foundation for groundbreaking discoveries in precision medicine.
Affiliation(s)
- Yonghyun Nam
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Jaesik Kim
  - Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Department of Bioengineering, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Sang-Hyuk Jung
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Jakob Woerner
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Erica H Suh
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Dong-Gi Lee
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Manu Shivakumar
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Matthew E Lee
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Dokyoon Kim
  - Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
3
Goleva SB, Williams A, Schlueter DJ, Keaton JM, Tran TC, Waxse BJ, Ferrara TM, Cassini T, Mo H, Denny JC. Racial and Ethnic Disparities in Antihypertensive Medication Prescribing Patterns and Effectiveness. Clin Pharmacol Ther 2024. [PMID: 39051523 DOI: 10.1002/cpt.3360] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2024] [Accepted: 06/08/2024] [Indexed: 07/27/2024]
Abstract
Variability in drug effectiveness and provider prescribing patterns have been reported in different racial and ethnic populations. We sought to evaluate antihypertensive drug effectiveness and prescribing patterns among self-identified Hispanic/Latino (Hispanic), Non-Hispanic Black (Black), and Non-Hispanic White (White) populations that enrolled in the NIH All of Us Research Program, a US longitudinal cohort. We employed a self-controlled case study method using electronic health record and survey data from 17,718 White, Hispanic, and Black participants who were diagnosed with essential hypertension and prescribed at least one of 19 commonly used antihypertensive medications. Effectiveness was determined by calculating the reduction in systolic blood pressure measurements after 28 or more days of drug exposure. Starting systolic blood pressure and effectiveness for each medication were compared for self-reported Black, Hispanic, and White participants using adjusted linear regressions. Black and Hispanic participants were started on antihypertensive medications at significantly higher SBP than White participants in 13 and 7 out of 19 medications, respectively. More Black participants were prescribed multiple antihypertensive medications (58.46%) than White (52.35%) or Hispanic (49.9%) participants. First-line HTN medications differed by race and ethnicity. Following the 2017 American College of Cardiology and the American Heart Association High Blood Pressure Guideline release, around 64% of Black participants were prescribed a recommended first-line antihypertensive drug compared with 76% of White and 82% of Hispanic participants. Effect sizes suggested that most antihypertensive drugs were less effective in Hispanic and Black, compared with White, participants, and statistical significance was reached in 6 out of 19 drugs. 
These results indicate that Black and Hispanic populations may benefit from earlier intervention and screening and highlight the potential benefits of personalizing first-line medications.
Affiliation(s)
- Slavina B Goleva
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
- Ariel Williams
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
- David J Schlueter
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
  - Department of Health and Society, University of Toronto Scarborough, Toronto, Ontario, Canada
- Jacob M Keaton
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
- Tam C Tran
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
- Bennett J Waxse
  - National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA
- Tracey M Ferrara
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
- Thomas Cassini
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
  - Division of Medical Genetics and Genomic Medicine, Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Huan Mo
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
  - Cohort Analytics Core, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
- Joshua C Denny
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
  - All of Us Research Program, National Institutes of Health, Bethesda, Maryland, USA
4
Venkatesh SS, Ganjgahi H, Palmer DS, Coley K, Linchangco GV, Hui Q, Wilson P, Ho YL, Cho K, Arumäe K, Wittemans LBL, Nellåker C, Vainik U, Sun YV, Holmes C, Lindgren CM, Nicholson G. Characterising the genetic architecture of changes in adiposity during adulthood using electronic health records. Nat Commun 2024; 15:5801. [PMID: 38987242 PMCID: PMC11237142 DOI: 10.1038/s41467-024-49998-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2023] [Accepted: 06/25/2024] [Indexed: 07/12/2024] Open
Abstract
Obesity is a heritable disease, characterised by excess adiposity that is measured by body mass index (BMI). While over 1,000 genetic loci are associated with BMI, less is known about the genetic contribution to adiposity trajectories over adulthood. We derive adiposity-change phenotypes from 24.5 million primary-care health records in over 740,000 individuals in the UK Biobank, Million Veteran Program USA, and Estonian Biobank, to discover and validate the genetic architecture of adiposity trajectories. Using multiple BMI measurements over time increases power to identify genetic factors affecting baseline BMI by 14%. In the largest reported genome-wide study of adiposity-change in adulthood, we identify novel associations with BMI-change at six independent loci, including rs429358 (APOE missense variant). The SNP-based heritability of BMI-change (1.98%) is 9-fold lower than that of BMI. The modest genetic correlation between BMI-change and BMI (45.2%) indicates that genetic studies of longitudinal trajectories could uncover novel biology of quantitative traits in adulthood.
Affiliation(s)
- Samvida S Venkatesh
  - Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, UK
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
- Habib Ganjgahi
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
  - Department of Statistics, University of Oxford, Oxford, UK
- Duncan S Palmer
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
  - Nuffield Department of Population Health, Medical Sciences Division, University of Oxford, Oxford, UK
- Kayesha Coley
  - Department of Population Health Sciences, University of Leicester, Leicester, UK
- Gregorio V Linchangco
  - Department of Epidemiology, Emory University Rollins School of Public Health, Atlanta, GA, USA
  - Atlanta VA Health Care System, Decatur, GA, USA
- Qin Hui
  - Department of Epidemiology, Emory University Rollins School of Public Health, Atlanta, GA, USA
  - Atlanta VA Health Care System, Decatur, GA, USA
- Peter Wilson
  - Atlanta VA Health Care System, Decatur, GA, USA
  - Department of Medicine, Emory University School of Medicine, Atlanta, GA, USA
- Yuk-Lam Ho
  - Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), Veterans Affairs Boston Healthcare System, Boston, MA, USA
- Kelly Cho
  - Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), Veterans Affairs Boston Healthcare System, Boston, MA, USA
  - Division of Aging, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Kadri Arumäe
  - Institute of Psychology, Faculty of Social Sciences, University of Tartu, Tartu, Estonia
- Laura B L Wittemans
  - Novo Nordisk Research Centre Oxford, Oxford, UK
  - Nuffield Department of Women's and Reproductive Health, Medical Sciences Division, University of Oxford, Oxford, UK
- Christoffer Nellåker
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
  - Nuffield Department of Women's and Reproductive Health, Medical Sciences Division, University of Oxford, Oxford, UK
- Uku Vainik
  - Institute of Psychology, Faculty of Social Sciences, University of Tartu, Tartu, Estonia
  - Estonian Genome Centre, Institute of Genomics, Faculty of Science and Technology, University of Tartu, Tartu, Estonia
  - Department of Neurology and Neurosurgery, Faculty of Medicine and Health Sciences, University of McGill, Montreal, Canada
- Yan V Sun
  - Department of Epidemiology, Emory University Rollins School of Public Health, Atlanta, GA, USA
  - Atlanta VA Health Care System, Decatur, GA, USA
- Chris Holmes
  - Department of Statistics, University of Oxford, Oxford, UK
  - Nuffield Department of Medicine, Medical Sciences Division, University of Oxford, Oxford, UK
  - The Alan Turing Institute, London, UK
- Cecilia M Lindgren
  - Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, UK
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
  - Nuffield Department of Women's and Reproductive Health, Medical Sciences Division, University of Oxford, Oxford, UK
  - Broad Institute of Harvard and MIT, Cambridge, MA, USA
5
McCaw ZR, Gao J, Lin X, Gronsbell J. Synthetic surrogates improve power for genome-wide association studies of partially missing phenotypes in population biobanks. Nat Genet 2024; 56:1527-1536. [PMID: 38872030 DOI: 10.1038/s41588-024-01793-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2023] [Accepted: 05/08/2024] [Indexed: 06/15/2024]
Abstract
Within population biobanks, incomplete measurement of certain traits limits the power for genetic discovery. Machine learning is increasingly used to impute the missing values from the available data. However, performing genome-wide association studies (GWAS) on imputed traits can introduce spurious associations, identifying genetic variants that are not associated with the original trait. Here we introduce a new method, synthetic surrogate (SynSurr) analysis, which makes GWAS on imputed phenotypes robust to imputation errors. Rather than replacing missing values, SynSurr jointly analyzes the original and imputed traits. We show that SynSurr estimates the same genetic effect as standard GWAS and improves power in proportion to the quality of the imputations. SynSurr requires a commonly made missing-at-random assumption but relaxes the requirements of existing imputation methods by not requiring correct model specification. We present extensive simulations and ablation analyses to validate SynSurr and apply it to empower the GWAS of dual-energy X-ray absorptiometry traits within the UK Biobank.
Affiliation(s)
- Zachary R McCaw
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Jianhui Gao
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Xihong Lin
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Department of Statistics, Harvard University, Cambridge, MA, USA
- Jessica Gronsbell
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
6
Martínez-Magaña JJ, Hurtado-Soriano J, Rivero-Segura NA, Montalvo-Ortiz JL, Garcia-delaTorre P, Becerril-Rojas K, Gomez-Verjan JC. Towards a Novel Frontier in the Use of Epigenetic Clocks in Epidemiology. Arch Med Res 2024; 55:103033. [PMID: 38955096 DOI: 10.1016/j.arcmed.2024.103033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2024] [Revised: 05/10/2024] [Accepted: 06/17/2024] [Indexed: 07/04/2024]
Abstract
Health problems associated with aging are a major public health concern for the future. Aging is a complex process with wide intervariability among individuals. Therefore, there is a need for innovative public health strategies that target factors associated with aging and the development of tools to assess the effectiveness of these strategies accurately. Novel approaches to measure biological age, such as epigenetic clocks, have become relevant. These clocks use non-sequential variable information from the genome and employ mathematical algorithms to estimate biological age based on DNA methylation levels. Therefore, in the present study, we comprehensively review the current status of the epigenetic clocks and their associations across the human phenome. We emphasize the potential utility of these tools in an epidemiological context, particularly in evaluating the impact of public health interventions focused on promoting healthy aging. Our review describes associations between epigenetic clocks and multiple traits across the life and health span. Additionally, we highlighted the evolution of studies beyond mere associations to establish causal mechanisms between epigenetic age and disease. We explored the application of epigenetic clocks to measure the efficacy of interventions focusing on rejuvenation.
Affiliation(s)
- José Jaime Martínez-Magaña
  - Department of Psychiatry, Yale University School of Medicine, New Haven, CT, USA; U.S. Department of Veterans Affairs National Center for Post-Traumatic Stress Disorder, Clinical Neuroscience Division, West Haven, CT, USA; VA Connecticut Healthcare System, West Haven, CT, USA
- Janitza L Montalvo-Ortiz
  - Department of Psychiatry, Yale University School of Medicine, New Haven, CT, USA; U.S. Department of Veterans Affairs National Center for Post-Traumatic Stress Disorder, Clinical Neuroscience Division, West Haven, CT, USA; VA Connecticut Healthcare System, West Haven, CT, USA
- Paola Garcia-delaTorre
  - Unidad de Investigación Epidemiológica y en Servicios de Salud, Área de Envejecimiento, Centro Médico Nacional, Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
7
Salvatore M, Kundu R, Shi X, Friese CR, Lee S, Fritsche LG, Mondul AM, Hanauer D, Pearce CL, Mukherjee B. To weight or not to weight? The effect of selection bias in 3 large electronic health record-linked biobanks and recommendations for practice. J Am Med Inform Assoc 2024; 31:1479-1492. [PMID: 38742457 PMCID: PMC11187425 DOI: 10.1093/jamia/ocae098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2024] [Revised: 04/14/2024] [Accepted: 04/18/2024] [Indexed: 05/16/2024] Open
Abstract
OBJECTIVES To develop recommendations regarding the use of weights to reduce selection bias for commonly performed analyses using electronic health record (EHR)-linked biobank data. MATERIALS AND METHODS We mapped diagnosis (ICD code) data to standardized phecodes from 3 EHR-linked biobanks with varying recruitment strategies: All of Us (AOU; n = 244 071), Michigan Genomics Initiative (MGI; n = 81 243), and UK Biobank (UKB; n = 401 167). Using 2019 National Health Interview Survey data, we constructed selection weights for AOU and MGI to be more representative of the US adult population. We used weights previously developed for UKB to represent the UKB-eligible population. We conducted 4 common analyses comparing unweighted and weighted results. RESULTS For AOU and MGI, estimated phecode prevalences decreased after weighting (weighted-unweighted median phecode prevalence ratio [MPR]: 0.82 and 0.61), while UKB estimates increased (MPR: 1.06). Weighting minimally impacted latent phenome dimensionality estimation. Comparing weighted versus unweighted phenome-wide association study for colorectal cancer, the strongest associations remained unaltered, with considerable overlap in significant hits. Weighting affected the estimated log-odds ratio for sex and colorectal cancer to align more closely with national registry-based estimates. DISCUSSION Weighting had a limited impact on dimensionality estimation and large-scale hypothesis testing but impacted prevalence and association estimation. When interested in estimating effect size, specific signals from untargeted association analyses should be followed up by weighted analysis. CONCLUSION EHR-linked biobanks should report recruitment and selection mechanisms and provide selection weights with defined target populations. Researchers should consider their intended estimands, specify source and target populations, and weight EHR-linked biobank analyses accordingly.
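The weighted-to-unweighted prevalence ratio reported in this abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names and the toy data are hypothetical, and the weighted prevalence is assumed here to be the usual weighted mean of a 0/1 case indicator.

```python
def weighted_prevalence(cases, weights):
    """Weighted share of cases: sum(w_i * y_i) / sum(w_i), with y_i in {0, 1}."""
    if len(cases) != len(weights):
        raise ValueError("cases and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * y for y, w in zip(cases, weights)) / total


def prevalence_ratio(cases, weights):
    """Ratio of weighted to unweighted prevalence for one diagnosis code."""
    unweighted = sum(cases) / len(cases)
    if unweighted == 0:
        raise ValueError("unweighted prevalence is zero; ratio is undefined")
    return weighted_prevalence(cases, weights) / unweighted


# Hypothetical toy cohort: cases are over-represented, so they get smaller weights.
cases = [1, 1, 0, 0]
weights = [0.5, 0.5, 1.5, 1.5]
ratio = prevalence_ratio(cases, weights)
```

A ratio below 1, as in this toy cohort, corresponds to a prevalence estimate that decreases after weighting; a median over many diagnosis codes would then summarize the shift across the phenome.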
Affiliation(s)
- Maxwell Salvatore
  - Department of Epidemiology, University of Michigan, Ann Arbor, MI 48109-2029, United States
  - Center for Precision Health Data Science, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, United States
- Ritoban Kundu
  - Center for Precision Health Data Science, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, United States
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, United States
- Xu Shi
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, United States
- Christopher R Friese
  - Rogel Cancer Center, Michigan Medicine, University of Michigan, Ann Arbor, MI 48109-2029, United States
  - Center for Improving Patient and Population Health, School of Nursing, University of Michigan, Ann Arbor, MI 48109-2029, United States
  - Department of Health Management and Policy, University of Michigan, Ann Arbor, MI 48109-2029, United States
- Seunggeun Lee
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, United States
  - Graduate School of Data Science, Seoul National University, Gwanak-gu, Seoul, Republic of Korea
- Lars G Fritsche
  - Center for Precision Health Data Science, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, United States
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, United States
  - Rogel Cancer Center, Michigan Medicine, University of Michigan, Ann Arbor, MI 48109-2029, United States
- Alison M Mondul
  - Department of Epidemiology, University of Michigan, Ann Arbor, MI 48109-2029, United States
  - Rogel Cancer Center, Michigan Medicine, University of Michigan, Ann Arbor, MI 48109-2029, United States
- David Hanauer
  - Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, MI 48109-2054, United States
- Celeste Leigh Pearce
  - Department of Epidemiology, University of Michigan, Ann Arbor, MI 48109-2029, United States
  - Rogel Cancer Center, Michigan Medicine, University of Michigan, Ann Arbor, MI 48109-2029, United States
- Bhramar Mukherjee
  - Department of Epidemiology, University of Michigan, Ann Arbor, MI 48109-2029, United States
  - Center for Precision Health Data Science, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, United States
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, United States
8
van Alten S, Domingue BW, Faul J, Galama T, Marees AT. Reweighting UK Biobank corrects for pervasive selection bias due to volunteering. Int J Epidemiol 2024; 53:dyae054. [PMID: 38715336 PMCID: PMC11076923 DOI: 10.1093/ije/dyae054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Accepted: 04/10/2024] [Indexed: 05/12/2024] Open
Abstract
BACKGROUND Biobanks typically rely on volunteer-based sampling. This results in large samples (power) at the cost of representativeness (bias). The problem of volunteer bias is debated. Here, we (i) show that volunteering biases associations in UK Biobank (UKB) and (ii) estimate inverse probability (IP) weights that correct for volunteer bias in UKB. METHODS Drawing on UK Census data, we constructed a subsample representative of UKB's target population, which consists of all individuals invited to participate. Based on demographic variables shared between the UK Census and UKB, we estimated IP weights (IPWs) for each UKB participant. We compared 21 weighted and unweighted bivariate associations between these demographic variables to assess volunteer bias. RESULTS Volunteer bias in all associations, as naively estimated in UKB, was substantial, in some cases so severe that unweighted estimates had the opposite sign of the association in the target population. For example, older individuals in UKB reported being in better health, in contrast to evidence from the UK Census. Using IPWs in weighted regressions reduced 87% of volunteer bias on average. Volunteer-based sampling reduced the effective sample size of UKB substantially, to 32% of its original size. CONCLUSIONS Estimates from large-scale biobanks may be misleading due to volunteer bias. We recommend IP weighting to correct for such bias. To aid in the construction of the next generation of biobanks, we provide suggestions on how to best ensure representativeness in a volunteer-based design. For UKB, IPWs have been made available.
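The link between inverse probability weights and a reduced effective sample size can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the participation probabilities are made up, the helper names are hypothetical, and the effective sample size uses Kish's approximation, (sum of w)^2 / (sum of w^2), which the abstract does not confirm as the formula used.

```python
def ip_weights(participation_probs):
    """Inverse probability weights: w_i = 1 / p_i for each participant."""
    for p in participation_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("participation probabilities must lie in (0, 1]")
    return [1.0 / p for p in participation_probs]


def kish_effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0:
        raise ValueError("at least one weight must be non-zero")
    return total * total / total_sq


# Hypothetical participation probabilities for four participants.
probs = [0.5, 0.5, 0.25, 0.25]
weights = ip_weights(probs)
n_eff = kish_effective_sample_size(weights)
share_retained = n_eff / len(weights)
```

With equal weights the effective sample size equals the number of participants; the more the weights vary, the smaller it becomes, which is one way to read a statement such as "32% of its original size".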
Affiliation(s)
- Sjoerd van Alten
  - School of Business and Economics, Vrije Universiteit Amsterdam, Amsterdam, Netherlands
  - Tinbergen Institute, Amsterdam, Netherlands
- Jessica Faul
  - Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI, USA
- Titus Galama
  - School of Business and Economics, Vrije Universiteit Amsterdam, Amsterdam, Netherlands
  - Tinbergen Institute, Amsterdam, Netherlands
  - Center for Economic and Social Research and Department of Economics, University of Southern California, Los Angeles, CA, USA
- Andries T Marees
  - School of Business and Economics, Vrije Universiteit Amsterdam, Amsterdam, Netherlands
9
Salvatore M, Kundu R, Shi X, Friese CR, Lee S, Fritsche LG, Mondul AM, Hanauer D, Pearce CL, Mukherjee B. To weight or not to weight? Studying the effect of selection bias in three large EHR-linked biobanks. medRxiv 2024:2024.02.12.24302710. [PMID: 38405832 PMCID: PMC10888982 DOI: 10.1101/2024.02.12.24302710] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
Objective To explore the role of selection bias adjustment by weighting electronic health record (EHR)-linked biobank data for commonly performed analyses. Materials and methods We mapped diagnosis (ICD code) data to standardized phecodes from three EHR-linked biobanks with varying recruitment strategies: All of Us (AOU; n=244,071), Michigan Genomics Initiative (MGI; n=81,243), and UK Biobank (UKB; n=401,167). Using 2019 National Health Interview Survey data, we constructed selection weights for AOU and MGI to be more representative of the US adult population. We used weights previously developed for UKB to represent the UKB-eligible population. We conducted four common descriptive and analytic tasks comparing unweighted and weighted results. Results For AOU and MGI, estimated phecode prevalences decreased after weighting (weighted-unweighted median phecode prevalence ratio [MPR]: 0.82 and 0.61), while UKB's estimates increased (MPR: 1.06). Weighting minimally impacted latent phenome dimensionality estimation. Comparing weighted versus unweighted PheWAS for colorectal cancer, the strongest associations remained unaltered and there was large overlap in significant hits. Weighting affected the estimated log-odds ratio for sex and colorectal cancer to align more closely with national registry-based estimates. Discussion Weighting had limited impact on dimensionality estimation and large-scale hypothesis testing but impacted prevalence and association estimation more. Results from untargeted association analyses should be followed by weighted analysis when effect size estimation is of interest for specific signals. Conclusion EHR-linked biobanks should report recruitment and selection mechanisms and provide selection weights with defined target populations. Researchers should consider their intended estimands, specify source and target populations, and weight EHR-linked biobank analyses accordingly.
Collapse
Affiliation(s)
- Maxwell Salvatore
- Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA
- Center for Precision Health Data Science, University of Michigan, Ann Arbor, MI, USA
| | - Ritoban Kundu
- Center for Precision Health Data Science, University of Michigan, Ann Arbor, MI, USA
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
| | - Xu Shi
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
| | - Christopher R Friese
- Rogel Cancer Center, University of Michigan, Ann Arbor, MI, USA
- Center for Improving Patient and Population Health, School of Nursing, University of Michigan, Ann Arbor, MI, USA
- Department of Health Management and Policy, University of Michigan, Ann Arbor, MI, USA
| | - Seunggeun Lee
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Graduate School of Data Science, Seoul National University, Seoul, Republic of Korea
| | - Lars G Fritsche
- Center for Precision Health Data Science, University of Michigan, Ann Arbor, MI, USA
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Rogel Cancer Center, University of Michigan, Ann Arbor, MI, USA
| | - Alison M Mondul
- Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA
- Rogel Cancer Center, University of Michigan, Ann Arbor, MI, USA
| | - David Hanauer
- Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, MI, USA
| | - Celeste Leigh Pearce
- Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA
- Rogel Cancer Center, University of Michigan, Ann Arbor, MI, USA
| | - Bhramar Mukherjee
- Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA
- Center for Precision Health Data Science, University of Michigan, Ann Arbor, MI, USA
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
| |
|
10
|
Jordan DM, Vy HMT, Do R. A deep learning transformer model predicts high rates of undiagnosed rare disease in large electronic health systems. medRxiv 2023:2023.12.21.23300393. [PMID: 38196638 PMCID: PMC10775679 DOI: 10.1101/2023.12.21.23300393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/11/2024]
Abstract
It is estimated that as many as 1 in 16 people worldwide suffer from rare diseases. Rare disease patients face difficulty finding diagnosis and treatment for their conditions, including long diagnostic odysseys, multiple incorrect diagnoses, and unavailable or prohibitively expensive treatments. As a result, it is likely that large electronic health record (EHR) systems include high numbers of participants suffering from undiagnosed rare disease. While this has been shown in detail for specific diseases, these studies are expensive and time-consuming and have only been feasible to perform for a handful of the thousands of known rare diseases. The bulk of these undiagnosed cases are effectively hidden, with no straightforward way to differentiate them from healthy controls. The ability to access them at scale would enormously expand our capacity to study and develop drugs for rare diseases, adding to tools aimed at increasing availability of study cohorts for rare disease. In this study, we train a deep learning transformer algorithm, RarePT (Rare-Phenotype Prediction Transformer), to impute undiagnosed rare disease from EHR diagnosis codes in 436,407 participants in the UK Biobank and validate it on an independent cohort of 3,333,560 individuals from the Mount Sinai Health System. We applied our model to 155 rare diagnosis codes with fewer than 250 cases each in the UK Biobank and predicted participants with elevated risk for each diagnosis, with the number of participants predicted to be at risk ranging from 85 to 22,000 for different diagnoses. These risk predictions are significantly associated with increased mortality for 65% of diagnoses, with disease burden expressed as disability-adjusted life years (DALY) for 73% of diagnoses, and with 72% of available disease-specific diagnostic tests. 
They are also highly enriched for known rare diagnoses in patients not included in the training set, with an odds ratio (OR) of 48.0 in cross-validation cohorts of the UK Biobank and an OR of 30.6 in the independent Mount Sinai Health System cohort. Most importantly, RarePT successfully screens for undiagnosed patients in 32 rare diseases with available diagnostic tests in the UK Biobank. Using the trained model to estimate the prevalence of undiagnosed disease in the UK Biobank for these 32 rare phenotypes, we find that at least 50% of patients remain undiagnosed for 20 of 32 diseases. These estimates provide empirical evidence of a high prevalence of undiagnosed rare disease, as well as demonstrating the enormous potential benefit of using RarePT to screen for undiagnosed rare disease patients in large electronic health systems.
Affiliation(s)
- Daniel M. Jordan
- Center for Genomic Data Analytics, Charles Bronfman Institute for Personalized Medicine, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Ha My T. Vy
- Center for Genomic Data Analytics, Charles Bronfman Institute for Personalized Medicine, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Ron Do
- Center for Genomic Data Analytics, Charles Bronfman Institute for Personalized Medicine, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| |
|
11
|
Leviton A, Loddenkemper T. Design, implementation, and inferential issues associated with clinical trials that rely on data in electronic medical records: a narrative review. BMC Med Res Methodol 2023; 23:271. [PMID: 37974111 PMCID: PMC10652539 DOI: 10.1186/s12874-023-02102-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2022] [Accepted: 11/08/2023] [Indexed: 11/19/2023] Open
Abstract
Real world evidence is now accepted by authorities charged with assessing the benefits and harms of new therapies. Clinical trials based on real world evidence are much less expensive than randomized clinical trials that do not rely on "real world evidence" such as contained in electronic health records (EHR). Consequently, we can expect an increase in the number of reports of these types of trials, which we identify here as 'EHR-sourced trials.' In this selected literature review, we discuss the various designs and the ethical issues they raise. EHR-sourced trials have the potential to improve/increase common data elements and other aspects of the EHR and related systems. Caution is advised, however, in drawing causal inferences about the relationships among EHR variables. Nevertheless, we anticipate that EHR-CTs will play a central role in answering research and regulatory questions.
Affiliation(s)
- Alan Leviton
- Department of Neurology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA.
| | - Tobias Loddenkemper
- Department of Neurology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
| |
|
12
|
Sánchez-Valle J, Valencia A. Molecular bases of comorbidities: present and future perspectives. Trends Genet 2023; 39:773-786. [PMID: 37482451 DOI: 10.1016/j.tig.2023.06.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/13/2023] [Revised: 06/12/2023] [Accepted: 06/12/2023] [Indexed: 07/25/2023]
Abstract
Co-occurrence of diseases decreases patient quality of life, complicates treatment choices, and increases mortality. Analyses of electronic health records present a complex scenario of comorbidity relationships that vary by age, sex, and cohort under study. The study of similarities between diseases using 'omics data, such as genes altered in diseases, gene expression, proteome, and microbiome, is fundamental to uncovering the origin of, and potential treatment for, comorbidities. Recent studies have produced a first generation of genetic interpretations for as much as 46% of the comorbidities described in large cohorts. Integrating different sources of molecular information and using artificial intelligence (AI) methods are promising approaches for the study of comorbidities. They may help to improve the treatment of comorbidities, including the potential repositioning of drugs.
Affiliation(s)
- Jon Sánchez-Valle
- Life Sciences Department, Barcelona Supercomputing Center, Barcelona, 08034, Spain.
| | - Alfonso Valencia
- Life Sciences Department, Barcelona Supercomputing Center, Barcelona, 08034, Spain; ICREA, Barcelona, 08010, Spain.
| |
|
13
|
Liang J, Li Q, Fu Z, Liu X, Shen P, Sun Y, Zhang J, Lu P, Lin H, Tang X, Gao P. Validation and comparison of cardiovascular risk prediction equations in Chinese patients with Type 2 diabetes. Eur J Prev Cardiol 2023; 30:1293-1303. [PMID: 37315163 DOI: 10.1093/eurjpc/zwad198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2023] [Revised: 06/02/2023] [Accepted: 06/08/2023] [Indexed: 06/16/2023]
Abstract
AIMS For patients with diabetes, the European guidelines updated the cardiovascular disease (CVD) risk prediction recommendations using diabetes-specific models with age-specific cut-offs, whereas American guidelines still advise models derived from the general population. We aimed to compare the performance of four cardiovascular risk models in diabetes populations. METHODS AND RESULTS Patients with diabetes from the CHERRY study, an electronic health records-based cohort study in China, were identified. Five-year CVD risk was calculated using original and recalibrated diabetes-specific models [Action in Diabetes and Vascular disease: PreterAx and diamicroN-MR Controlled Evaluation (ADVANCE) and the Hong Kong cardiovascular risk model (HK)] and general population-based models [Pooled Cohort Equations (PCE) and Prediction for Atherosclerotic cardiovascular disease Risk in China (China-PAR)]. During a median 5.8-year follow-up, 46 558 patients had 2605 CVD events. C-statistics were 0.711 [95% confidence interval: 0.693-0.729] for ADVANCE and 0.701 (0.683-0.719) for HK in men, and 0.742 (0.725-0.759) and 0.732 (0.718-0.747) in women. C-statistics were worse in two general population-based models. Recalibrated ADVANCE underestimated risk by 1.2% and 16.8% in men and women, whereas PCE underestimated risk by 41.9% and 24.2% in men and women. With the age-specific cut-offs, the overlap of the high-risk patients selected by every model pair ranged from only 22.6% to 51.2%. When utilizing the fixed cut-off at 5%, the recalibrated ADVANCE selected similar high-risk patients in men (7400) as compared to the age-specific cut-offs (7102), whereas age-specific cut-offs exhibited a reduction in the selection of high-risk patients in women (2646 under age-specific cut-offs vs. 3647 under fixed cut-off). CONCLUSION Diabetes-specific CVD risk prediction models showed better discrimination for patients with diabetes. 
High-risk patients selected by different models varied significantly. Age-specific cut-offs selected fewer patients at high CVD risk especially in women.
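The discrimination measure reported in this abstract can be illustrated with a minimal sketch of a C-statistic for a binary outcome: the share of (event, non-event) pairs in which the patient with the event received the higher predicted risk, with ties counted as one half. This is a simplification stated as an assumption; the study followed patients over time and may have used a survival-data version of the statistic that accounts for censoring. The function name and the toy risks are hypothetical.

```python
def c_statistic(risks, events):
    """Concordance for a binary outcome: P(risk of event case > risk of non-event case).

    Ties in predicted risk contribute 0.5 to the concordant count.
    """
    concordant = 0.0
    pairs = 0
    for risk_i, event_i in zip(risks, events):
        if not event_i:
            continue
        for risk_j, event_j in zip(risks, events):
            if event_j:
                continue
            pairs += 1
            if risk_i > risk_j:
                concordant += 1.0
            elif risk_i == risk_j:
                concordant += 0.5
    if pairs == 0:
        raise ValueError("need at least one event and one non-event")
    return concordant / pairs


# Hypothetical five-year predicted risks and observed CVD events (1 = event).
risks = [0.02, 0.08, 0.04, 0.10, 0.04]
events = [0, 1, 0, 1, 1]
c = c_statistic(risks, events)
print(f"C-statistic: {c:.3f}")
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, which is the scale on which the reported values of roughly 0.70 to 0.74 should be read.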
Affiliation(s)
- Jingyuan Liang
- Department of Epidemiology and Biostatistics, Peking University, 38 Xueyuan Road, Haidian District, Beijing 100191, China
| | - Qianqian Li
- Department of Epidemiology and Biostatistics, Peking University, 38 Xueyuan Road, Haidian District, Beijing 100191, China
| | - Zhangping Fu
- Department of Epidemiology and Biostatistics, Peking University, 38 Xueyuan Road, Haidian District, Beijing 100191, China
| | - Xiaofei Liu
- Department of Epidemiology and Biostatistics, Peking University, 38 Xueyuan Road, Haidian District, Beijing 100191, China
| | - Peng Shen
- Department of Chronic Diseases and Health Promotion, Yinzhou District Centre for Disease Control and Prevention, Ningbo, China
| | - Yexiang Sun
- Department of Chronic Diseases and Health Promotion, Yinzhou District Centre for Disease Control and Prevention, Ningbo, China
| | - Jingyi Zhang
- Department of Medical Big Data, Wonders Information Co. Ltd, Shanghai, China
| | - Ping Lu
- Department of Medical Big Data, Wonders Information Co. Ltd, Shanghai, China
| | - Hongbo Lin
- Department of Chronic Diseases and Health Promotion, Yinzhou District Centre for Disease Control and Prevention, Ningbo, China
| | - Xun Tang
- Department of Epidemiology and Biostatistics, Peking University, 38 Xueyuan Road, Haidian District, Beijing 100191, China
- Key Laboratory of Epidemiology of Major Diseases (Peking University), Ministry of Education, Beijing, China
| | - Pei Gao
- Department of Epidemiology and Biostatistics, Peking University, 38 Xueyuan Road, Haidian District, Beijing 100191, China
- Key Laboratory of Epidemiology of Major Diseases (Peking University), Ministry of Education, Beijing, China
- Peking University Clinical Research Institute, Peking University, Beijing, China
| |
|
14
|
Stevelink R, Campbell C, Chen S, Abou-Khalil B, Adesoji OM, Afawi Z, Amadori E, Anderson A, Anderson J, Andrade DM, Annesi G, Auce P, Avbersek A, Bahlo M, Baker MD, Balagura G, Balestrini S, Barba C, Barboza K, Bartolomei F, Bast T, Baum L, Baumgartner T, Baykan B, Bebek N, Becker AJ, Becker F, Bennett CA, Berghuis B, Berkovic SF, Beydoun A, Bianchini C, Bisulli F, Blatt I, Bobbili DR, Borggraefe I, Bosselmann C, Braatz V, Bradfield JP, Brockmann K, Brody LC, Buono RJ, Busch RM, Caglayan H, Campbell E, Canafoglia L, Canavati C, Cascino GD, Castellotti B, Catarino CB, Cavalleri GL, Cerrato F, Chassoux F, Cherny SS, Cheung CL, Chinthapalli K, Chou IJ, Chung SK, Churchhouse C, Clark PO, Cole AJ, Compston A, Coppola A, Cosico M, Cossette P, Craig JJ, Cusick C, Daly MJ, Davis LK, de Haan GJ, Delanty N, Depondt C, Derambure P, Devinsky O, Di Vito L, Dlugos DJ, Doccini V, Doherty CP, El-Naggar H, Elger CE, Ellis CA, Eriksson JG, Faucon A, Feng YCA, Ferguson L, Ferraro TN, Ferri L, Feucht M, Fitzgerald M, Fonferko-Shadrach B, Fortunato F, Franceschetti S, Franke A, French JA, Freri E, Gagliardi M, Gambardella A, Geller EB, Giangregorio T, Gjerstad L, Glauser T, Goldberg E, Goldman A, Granata T, Greenberg DA, Guerrini R, Gupta N, Haas KF, Hakonarson H, Hallmann K, Hassanin E, Hegde M, Heinzen EL, Helbig I, Hengsbach C, Heyne HO, Hirose S, Hirsch E, Hjalgrim H, Howrigan DP, Hucks D, Hung PC, Iacomino M, Imbach LL, Inoue Y, Ishii A, Jamnadas-Khoda J, Jehi L, Johnson MR, Kälviäinen R, Kamatani Y, Kanaan M, Kanai M, Kantanen AM, Kara B, Kariuki SM, Kasperavičiūte D, Kasteleijn-Nolst Trenite D, Kato M, Kegele J, Kesim Y, Khoueiry-Zgheib N, King C, Kirsch HE, Klein KM, Kluger G, Knake S, Knowlton RC, Koeleman BPC, Korczyn AD, Koupparis A, Kousiappa I, Krause R, Krenn M, Krestel H, Krey I, Kunz WS, Kurki MI, Kurlemann G, Kuzniecky R, Kwan P, Labate A, Lacey A, Lal D, Landoulsi Z, Lau YL, Lauxmann S, Leech SL, Lehesjoki AE, Lemke JR, Lerche H, Lesca G, Leu C, Lewin N, Lewis-Smith 
D, Li GHY, Li QS, Licchetta L, Lin KL, Lindhout D, Linnankivi T, Lopes-Cendes I, Lowenstein DH, Lui CHT, Madia F, Magnusson S, Marson AG, May P, McGraw CM, Mei D, Mills JL, Minardi R, Mirza N, Møller RS, Molloy AM, Montomoli M, Mostacci B, Muccioli L, Muhle H, Müller-Schlüter K, Najm IM, Nasreddine W, Neale BM, Neubauer B, Newton CRJC, Nöthen MM, Nothnagel M, Nürnberg P, O’Brien TJ, Okada Y, Ólafsson E, Oliver KL, Özkara C, Palotie A, Pangilinan F, Papacostas SS, Parrini E, Pato CN, Pato MT, Pendziwiat M, Petrovski S, Pickrell WO, Pinsky R, Pippucci T, Poduri A, Pondrelli F, Powell RHW, Privitera M, Rademacher A, Radtke R, Ragona F, Rau S, Rees MI, Regan BM, Reif PS, Rhelms S, Riva A, Rosenow F, Ryvlin P, Saarela A, Sadleir LG, Sander JW, Sander T, Scala M, Scattergood T, Schachter SC, Schankin CJ, Scheffer IE, Schmitz B, Schoch S, Schubert-Bast S, Schulze-Bonhage A, Scudieri P, Sham P, Sheidley BR, Shih JJ, Sills GJ, Sisodiya SM, Smith MC, Smith PE, Sonsma ACM, Speed D, Sperling MR, Stefansson H, Stefansson K, Steinhoff BJ, Stephani U, Stewart WC, Stipa C, Striano P, Stroink H, Strzelczyk A, Surges R, Suzuki T, Tan KM, Taneja RS, Tanteles GA, Taubøll E, Thio LL, Thomas GN, Thomas RH, Timonen O, Tinuper P, Todaro M, Topaloğlu P, Tozzi R, Tsai MH, Tumiene B, Turkdogan D, Unnsteinsdóttir U, Utkus A, Vaidiswaran P, Valton L, van Baalen A, Vetro A, Vining EPG, Visscher F, von Brauchitsch S, von Wrede R, Wagner RG, Weber YG, Weckhuysen S, Weisenberg J, Weller M, Widdess-Walsh P, Wolff M, Wolking S, Wu D, Yamakawa K, Yang W, Yapıcı Z, Yücesan E, Zagaglia S, Zahnert F, Zara F, Zhou W, Zimprich F, Zsurka G, Zulfiqar Ali Q. GWAS meta-analysis of over 29,000 people with epilepsy identifies 26 risk loci and subtype-specific genetic architecture. Nat Genet 2023; 55:1471-1482. 
[PMID: 37653029 PMCID: PMC10484785 DOI: 10.1038/s41588-023-01485-w] [Citation(s) in RCA: 22] [Impact Index Per Article: 22.0] [Received: 06/07/2022] [Accepted: 07/21/2023] [Indexed: 09/02/2023]
Abstract
Epilepsy is a highly heritable disorder affecting over 50 million people worldwide, of which about one-third are resistant to current treatments. Here we report a multi-ancestry genome-wide association study including 29,944 cases, stratified into three broad categories and seven subtypes of epilepsy, and 52,538 controls. We identify 26 genome-wide significant loci, 19 of which are specific to genetic generalized epilepsy (GGE). We implicate 29 likely causal genes underlying these 26 loci. SNP-based heritability analyses show that common variants explain between 39.6% and 90% of genetic risk for GGE and its subtypes. Subtype analysis revealed markedly different genetic architectures between focal and generalized epilepsies. Gene-set analyses of GGE signals implicate synaptic processes in both excitatory and inhibitory neurons in the brain. Prioritized candidate genes overlap with monogenic epilepsy genes and with targets of current antiseizure medications. Finally, we leverage our results to identify alternate drugs with predicted efficacy if repurposed for epilepsy treatment.
|
15
|
Salvatore M, Clark-Boucher D, Fritsche LG, Ortlieb J, Houghtby J, Driscoll A, Caldwell-Larkins B, Smith JA, Brummett CM, Kheterpal S, Lisabeth L, Mukherjee B. Epidemiologic Questionnaire (EPI-Q) - a scalable, app-based health survey linked to electronic health record and genotype data. Epidemiol Health 2023; 45:e2023074. [PMID: 37591787 PMCID: PMC10867525 DOI: 10.4178/epih.e2023074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Accepted: 07/03/2023] [Indexed: 08/19/2023] Open
Abstract
The Epidemiologic Questionnaire (EPI-Q) was established to collect broad, uniform, self-reported health data to supplement electronic health record (EHR) and genotype information from participants in the University of Michigan (UM) Precision Health cohorts. Recruitment of EPI-Q participants, who were already enrolled in 1 of 3 ongoing UM Precision Health cohorts (the Michigan Genomics Initiative, Mental Health Biobank, and Metabolism, Endocrinology, and Diabetes cohorts), began in March 2020. Of 54,043 retrospective invitations, 5,577 individuals enrolled, representing a 10.3% response rate. Of these, 3,502 (63.7%) were female, and the average age was 56.1 years (standard deviation, 15.4). The baseline survey comprises 11 modules on topics including personal and family health history, lifestyle, and cancer screening and history. Additionally, 11 optional modules cover topics including financial toxicity, occupational exposure, and life meaning. The questions are based on standardized and validated instruments used in other cohorts, and we share resources to expedite development of similar surveys. Data are collected via the MyDataHelps platform, which enables current and future participants to share non-Michigan Medicine EHR data. Recruitment is ongoing. Cohort data are available to those with institutional review board approval; for details, contact the Data Office for Clinical and Translational Research (DataOffice@umich.edu).
Affiliation(s)
- Maxwell Salvatore
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Center for Precision Health Data Science, University of Michigan, Ann Arbor, MI, USA
- Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA
| | - Dylan Clark-Boucher
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Center for Precision Health Data Science, University of Michigan, Ann Arbor, MI, USA
| | - Lars G. Fritsche
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Center for Precision Health Data Science, University of Michigan, Ann Arbor, MI, USA
- Rogel Cancer Center, University of Michigan, Ann Arbor, MI, USA
| | - Jacob Ortlieb
- Precision Health, University of Michigan, Ann Arbor, MI, USA
| | - Janet Houghtby
- Precision Health, University of Michigan, Ann Arbor, MI, USA
| | - Anisa Driscoll
- Precision Health, University of Michigan, Ann Arbor, MI, USA
| | | | - Jennifer A. Smith
- Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA
- Survey Research Center, Institute for Social Research, Ann Arbor, MI, USA
| | | | - Sachin Kheterpal
- Precision Health, University of Michigan, Ann Arbor, MI, USA
- Anesthesiology, Michigan Medicine, Ann Arbor, MI, USA
| | - Lynda Lisabeth
- Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA
| | - Bhramar Mukherjee
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Center for Precision Health Data Science, University of Michigan, Ann Arbor, MI, USA
- Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA
- Precision Health, University of Michigan, Ann Arbor, MI, USA
| |
|
16
|
Mignogna G, Carey CE, Wedow R, Baya N, Cordioli M, Pirastu N, Bellocco R, Malerbi KF, Nivard MG, Neale BM, Walters RK, Ganna A. Patterns of item nonresponse behaviour to survey questionnaires are systematic and associated with genetic loci. Nat Hum Behav 2023; 7:1371-1387. [PMID: 37386106 PMCID: PMC10444625 DOI: 10.1038/s41562-023-01632-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/05/2022] [Accepted: 05/17/2023] [Indexed: 07/01/2023]
Abstract
Response to survey questionnaires is vital for social and behavioural research, and most analyses assume full and accurate response by participants. However, nonresponse is common and impedes proper interpretation and generalizability of results. We examined item nonresponse behaviour across 109 questionnaire items in the UK Biobank (N = 360,628). Phenotypic factor scores for two participant-selected nonresponse answers, 'Prefer not to answer' (PNA) and 'I don't know' (IDK), each predicted participant nonresponse in follow-up surveys (incremental pseudo-R² = 0.056), even when controlling for education and self-reported health (incremental pseudo-R² = 0.046). After performing genome-wide association studies of our factors, PNA and IDK were highly genetically correlated with one another (rg = 0.73 (s.e. = 0.03)) and with education (rg,PNA = -0.51 (s.e. = 0.03); rg,IDK = -0.38 (s.e. = 0.02)), health (rg,PNA = 0.51 (s.e. = 0.03); rg,IDK = 0.49 (s.e. = 0.02)) and income (rg,PNA = -0.57 (s.e. = 0.04); rg,IDK = -0.46 (s.e. = 0.02)), with additional unique genetic associations observed for both PNA and IDK (P < 5 × 10⁻⁸). We discuss how these associations may bias studies of traits correlated with item nonresponse and demonstrate how this bias may substantially affect genome-wide association studies. While the UK Biobank data are deidentified, we further protected participant privacy by avoiding exploring non-response behaviour to single questions, assuring that no information can be used to associate results with any particular respondents.
Affiliation(s)
- Gianmarco Mignogna
- Analytic and Translational Genetics Unit, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Institute for Molecular Medicine Finland, University of Helsinki, Helsinki, Finland
- Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Caitlin E Carey
- Analytic and Translational Genetics Unit, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Robbee Wedow
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Department of Sociology, Purdue University, West Lafayette, IN, USA.
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA.
- AnalytiXIN (Analytics Indiana), Indianapolis, IN, USA.
- Department of Statistics, Purdue University, West Lafayette, IN, USA.
| | - Nikolas Baya
- Analytic and Translational Genetics Unit, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Mattia Cordioli
- Institute for Molecular Medicine Finland, University of Helsinki, Helsinki, Finland
| | - Nicola Pirastu
- Centre for Global Health Research, Usher Institute, University of Edinburgh, Edinburgh, Scotland
- Fondazione Human Technopole, Viale Rita Levi-Montalcini, Milan, Italy
| | - Rino Bellocco
- Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | | | - Michel G Nivard
- Department of Biological Psychiatry, Faculty of Behavioural and Movement Sciences, Vrije Universiteit, Amsterdam, the Netherlands
- Methodology Program, Amsterdam Public Health, Amsterdam, the Netherlands
- Amsterdam Neuroscience - Mood, Anxiety, Psychosis, Stress and Sleep, Amsterdam, the Netherlands
| | - Benjamin M Neale
- Analytic and Translational Genetics Unit, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Novo Nordisk Foundation for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Raymond K Walters
- Analytic and Translational Genetics Unit, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Andrea Ganna
- Analytic and Translational Genetics Unit, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA.
- Institute for Molecular Medicine Finland, University of Helsinki, Helsinki, Finland.
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
| |
|
17
|
Wang S, Quan L, Ding M, Kang JH, Koenen KC, Kubzansky LD, Branch-Elliman W, Chavarro JE, Roberts AL. Depression, worry, and loneliness are associated with subsequent risk of hospitalization for COVID-19: a prospective study. Psychol Med 2023; 53:4022-4031. [PMID: 35586906 PMCID: PMC9924056 DOI: 10.1017/s0033291722000691] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Indexed: 12/27/2022]
Abstract
BACKGROUND Pre-pandemic psychological distress is associated with increased susceptibility to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection, but associations with the coronavirus disease 2019 (COVID-19) severity are not established. The authors examined the associations between distress prior to SARS-CoV-2 infection and subsequent risk of hospitalization. METHODS Between April 2020 (baseline) and April 2021, we followed 54 781 participants from three ongoing cohorts: Nurses' Health Study II (NHSII), Nurses' Health Study 3 (NHS3), and the Growing Up Today Study (GUTS) who reported no current or prior SARS-CoV-2 infection at baseline. Chronic depression was assessed during 2010-2019. Depression, anxiety, worry about COVID-19, perceived stress, and loneliness were measured at baseline. SARS-CoV-2 infection and hospitalization due to COVID-19 were self-reported. Relative risks (RRs) were calculated by Poisson regression. RESULTS 3663 participants reported a positive SARS-CoV-2 test (mean age = 55.0 years, standard deviation = 13.8) during follow-up. Among these participants, chronic depression prior to the pandemic [RR = 1.72; 95% confidence interval (CI) 1.20-2.46], and probable depression (RR = 1.81, 95% CI 1.08-3.03), being very worried about COVID-19 (RR = 1.79; 95% CI 1.12-2.86), and loneliness (RR = 1.81, 95% CI 1.02-3.20) reported at baseline were each associated with subsequent COVID-19 hospitalization, adjusting for demographic factors and healthcare worker status. Anxiety and perceived stress were not associated with hospitalization. Depression, worry about COVID-19, and loneliness were as strongly associated with hospitalization as were high cholesterol and hypertension, established risk factors for COVID-19 severity. CONCLUSIONS Psychological distress may be a risk factor for hospitalization in patients with SARS-CoV-2 infection. Assessment of psychological distress may identify patients at greater risk of hospitalization. 
Future work should examine whether addressing distress improves physical health outcomes.
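To make the reported effect measure concrete, the sketch below computes a crude (unadjusted) relative risk with a Wald-type 95% confidence interval on the log scale from a two-by-two table. This is a deliberate simplification and an assumption on our part: the study estimated RRs by Poisson regression with covariate adjustment, which this sketch does not reproduce. The function name and all counts are hypothetical and are not taken from the paper.

```python
import math


def relative_risk(exposed_events, exposed_total,
                  unexposed_events, unexposed_total, z=1.96):
    """Crude relative risk with a log-scale Wald confidence interval."""
    risk_exposed = exposed_events / exposed_total
    risk_unexposed = unexposed_events / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR) for two independent binomial proportions.
    se_log_rr = math.sqrt(
        1 / exposed_events - 1 / exposed_total
        + 1 / unexposed_events - 1 / unexposed_total
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts: hospitalizations among infected participants
# with and without a given distress measure at baseline.
rr, lower, upper = relative_risk(30, 500, 60, 2000)
print(f"RR = {rr:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

An interval that excludes 1, as in each of the associations reported above, indicates that the data are not compatible with equal risk in the two groups at the chosen confidence level.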
Affiliation(s)
- Siwen Wang
- Department of Nutrition, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
| | - Luwei Quan
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Ming Ding
- Department of Nutrition, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Jae H Kang
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Karestan C Koenen
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Social and Behavioral Sciences, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Psychiatric Neurodevelopmental Genetics Unit, Department of Psychiatry, Massachusetts General Hospital, Boston, MA, USA
| | - Laura D Kubzansky
- Department of Social and Behavioral Sciences, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Westyn Branch-Elliman
- Department of Medicine, VA Boston Healthcare System, Boston, MA, USA
- Harvard Medical School, Boston, MA, USA
| | - Jorge E Chavarro
- Department of Nutrition, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Andrea L Roberts
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| |
|
18
|
Mezuk B, Kelly K, Bennion E, Concha JB. Leveraging a genetically-informative study design to explore depression as a risk factor for type 2 diabetes: Rationale and participant characteristics of the Mood and Immune Regulation in Twins Study. Front Clin Diabetes Healthc 2023; 4:1026402. [PMID: 37008275 PMCID: PMC10064086 DOI: 10.3389/fcdhc.2023.1026402] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/23/2022] [Accepted: 03/01/2023] [Indexed: 03/19/2023]
Abstract
BACKGROUND Comorbidity between depression and type 2 diabetes is thought to arise from the joint effects of psychological, behavioral, and biological processes. Studies of monozygotic twins may provide a unique opportunity for clarifying how these processes inter-relate. This paper describes the rationale, characteristics, and initial findings of a longitudinal co-twin study aimed at examining the biopsychosocial mechanisms linking depression and risk of diabetes in mid-life. METHODS Participants in the Mood and Immune Regulation in Twins (MIRT) Study were recruited from the Mid-Atlantic Twin Registry. MIRT consisted of 94 individuals who did not have diabetes at baseline, representing 43 twin pairs (41 monozygotic and 2 dizygotic), one set of monozygotic triplets, and 5 individuals whose co-twin did not participate. A broad set of variables were assessed including psychological factors (e.g., lifetime history of major depression (MD)); social factors (e.g., stress perceptions and experiences); and biological factors, including indicators of metabolic risk (e.g., BMI, blood pressure (BP), HbA1c) and immune functioning (e.g., pro- and anti-inflammatory cytokines), as well as collection of RNA. Participants were re-assessed 6 months later. Intra-class correlation coefficients (ICC) and descriptive comparisons were used to explore variation in these psychological, social, and biological factors across time and within pairs. RESULTS Mean age was 53 years, 68% were female, and 77% identified as white. One-third had a history of MD, and 18 sibling sets were discordant for MD. MD was associated with higher systolic (139.1 vs. 132.2 mmHg, p=0.05) and diastolic BP (87.2 vs. 80.8 mmHg, p=0.002) and IL-6 (1.47 vs. 0.93 pg/mL, p=0.001). MD was not associated with BMI, HbA1c, or other immune markers. 
While the biological characteristics of the co-twins were significantly correlated, all within-person ICCs were higher than the within-pair correlations (e.g., HbA1c within-person ICC=0.88 vs. within-pair ICC=0.49; IL-6 within-person ICC=0.64 vs. within-pair ICC=0.54). Among the pairs discordant for MD, depression was not substantially associated with metabolic or immune markers, but was positively associated with stress. CONCLUSIONS Twin studies have the potential to clarify the biopsychosocial processes linking depression and diabetes, and recently completed processing of RNA samples from MIRT permits future exploration of gene expression as a potential mechanism.
Affiliation(s)
- Briana Mezuk
- Center for Social Epidemiology and Population Health, Department of Epidemiology, University of Michigan School of Public Health, Ann Arbor, MI, United States
- Research Center for Group Dynamics, Institute for Social Research, University of Michigan, Ann Arbor, MI, United States
- *Correspondence: Briana Mezuk,
| | - Kristen Kelly
- Institute for Behavioral Genetics, University of Colorado Boulder, Boulder, CO, United States
| | - Erica Bennion
- Office of Maternal and Child Health, Utah Department of Health and Human Services, Salt Lake, UT, United States
| | - Jeannie B. Concha
- College of Health Sciences, University of Texas at El Paso, El Paso, TX, United States
|
19
|
Getz K, Hubbard RA, Linn KA. Performance of Multiple Imputation Using Modern Machine Learning Methods in Electronic Health Records Data. Epidemiology 2023; 34:206-215. [PMID: 36722803 DOI: 10.1097/ede.0000000000001578] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Indexed: 02/02/2023]
Abstract
BACKGROUND Missing data are common in studies using electronic health records (EHRs)-derived data. Missingness in EHR data is related to healthcare utilization patterns, resulting in complex and potentially missing not at random missingness mechanisms. Prior research has suggested that machine learning-based multiple imputation methods may outperform traditional methods and may perform well even in settings of missing not at random missingness. METHODS We used plasmode simulations based on a nationwide EHR-derived de-identified database for patients with metastatic urothelial carcinoma to compare the performance of multiple imputation using chained equations, random forests, and denoising autoencoders in terms of bias and precision of hazard ratio estimates under varying proportions of observations with missing values and missingness mechanisms (missing completely at random, missing at random, and missing not at random). RESULTS Multiple imputation by chained equations and random forest methods had low bias and similar standard errors for parameter estimates under missingness completely at random. Under missingness at random, denoising autoencoders had higher bias than multiple imputation by chained equations and random forests. Contrary to results of prior studies of denoising autoencoders, all methods exhibited substantial bias under missingness not at random, with bias increasing in direct proportion to the amount of missing data. CONCLUSIONS We found no advantage of denoising autoencoders for multiple imputation in the setting of an epidemiologic study conducted using EHR data. Results suggested that denoising autoencoders may overfit the data leading to poor confounder control. Use of more flexible imputation approaches does not mitigate bias induced by missingness not at random and can produce estimates with spurious precision.
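The three missingness mechanisms named in this abstract can be illustrated with a small simulation. The sketch below is not the authors' plasmode design: the function name `complete_case_means`, the toy variables `x` and `y`, and the logistic observation probabilities are all assumptions chosen for illustration. It only shows how the mean of `y` among observed rows drifts from the true value of 0 as missingness comes to depend on the data.

```python
import math
import random

def complete_case_means(n=20000, seed=0):
    """Mean of y among observed rows under three missingness mechanisms."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]   # covariate that is always observed
    y = [xi + rng.gauss(0, 1) for xi in x]    # variable subject to missingness; true mean is 0

    def expit(t):
        return 1.0 / (1.0 + math.exp(-t))

    # Probability that y is observed for each row, per mechanism (illustrative choices).
    p_observed = {
        "MCAR": [0.5] * n,                    # unrelated to x or y
        "MAR": [expit(-xi) for xi in x],      # depends only on the observed x
        "MNAR": [expit(-yi) for yi in y],     # depends on the value of y itself
    }
    means = {}
    for name, probs in p_observed.items():
        kept = [yi for yi, p in zip(y, probs) if rng.random() < p]
        means[name] = sum(kept) / len(kept)
    return means

if __name__ == "__main__":
    for mechanism, value in complete_case_means().items():
        print(f"{mechanism}: complete-case mean of y = {value:.3f}")
```

In this toy setup the complete-case mean is close to 0 only under MCAR. Under MAR the gap can in principle be recovered by conditioning on `x`, whereas under MNAR it cannot be recovered from the observed data without further assumptions.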
Affiliation(s)
- Kylie Getz
- Department of Biostatistics and Epidemiology, School of Public Health, Rutgers University, Piscataway, NJ
| | - Rebecca A Hubbard
- Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
- Abramson Cancer Center, University of Pennsylvania, Philadelphia, PA
| | - Kristin A Linn
- Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
|
20
|
Bagheri M, Chung CP, Dickson AL, Van Driest SL, Borinstein SC, Mosley JD. White blood cell ranges and frequency of neutropenia by Duffy genotype status. Blood Adv 2023; 7:406-409. [PMID: 35895516 PMCID: PMC9979714 DOI: 10.1182/bloodadvances.2022007680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2022] [Revised: 07/07/2022] [Accepted: 07/08/2022] [Indexed: 02/02/2023]
Affiliation(s)
- Minoo Bagheri
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
| | - Cecilia P. Chung
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
| | - Alyson L. Dickson
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
| | - Sara L. Van Driest
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN
| | - Scott C. Borinstein
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN
| | - Jonathan D. Mosley
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
|
21
|
Zawistowski M, Fritsche LG, Pandit A, Vanderwerff B, Patil S, Schmidt EM, VandeHaar P, Willer CJ, Brummett CM, Kheterpal S, Zhou X, Boehnke M, Abecasis GR, Zöllner S. The Michigan Genomics Initiative: A biobank linking genotypes and electronic clinical records in Michigan Medicine patients. CELL GENOMICS 2023; 3:100257. [PMID: 36819667 PMCID: PMC9932985 DOI: 10.1016/j.xgen.2023.100257] [Citation(s) in RCA: 25] [Impact Index Per Article: 25.0] [Received: 12/15/2021] [Revised: 06/07/2022] [Accepted: 01/05/2023] [Indexed: 02/04/2023]
Abstract
Biobanks of linked clinical patient histories and biological samples are an efficient strategy to generate large cohorts for modern genetics research. Biobank recruitment varies by factors such as geographic catchment and sampling strategy, which affect biobank demographics and research utility. Here, we describe the Michigan Genomics Initiative (MGI), a single-health-system biobank currently consisting of >91,000 participants recruited primarily during surgical encounters at Michigan Medicine. The surgical enrollment results in a biobank enriched for many diseases and ideally suited for a disease genetics cohort. Compared with the much larger population-based UK Biobank, MGI has higher prevalence for nearly all diagnosis-code-based phenotypes and larger absolute case counts for many phenotypes. Genome-wide association study (GWAS) results replicate known findings, thereby validating the genetic and clinical data. Our results illustrate that opportunistic biobank sampling within single health systems provides a unique and complementary resource for exploring the genetics of complex diseases.
Affiliation(s)
- Matthew Zawistowski
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48103, USA
| | - Lars G. Fritsche
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48103, USA
| | - Anita Pandit
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48103, USA
| | - Brett Vanderwerff
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48103, USA
| | - Snehal Patil
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48103, USA
| | - Ellen M. Schmidt
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48103, USA
| | - Peter VandeHaar
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48103, USA
| | - Cristen J. Willer
- Department of Internal Medicine, Division of Cardiovascular Medicine, Department of Human Genetics, University of Michigan, Ann Arbor, MI 48103, USA
| | - Chad M. Brummett
- Department of Anesthesiology, University of Michigan, Ann Arbor, MI 48103, USA
| | - Sachin Kheterpal
- Department of Anesthesiology, University of Michigan, Ann Arbor, MI 48103, USA
| | - Xiang Zhou
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48103, USA
| | - Michael Boehnke
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48103, USA
| | - Gonçalo R. Abecasis
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48103, USA
- Regeneron Genetics Center, Tarrytown, NY 10591, USA
| | - Sebastian Zöllner
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48103, USA
- Department of Psychiatry, University of Michigan, Ann Arbor, MI 48103, USA
|
22
|
Ri K, Fukasawa T, Yoshida S, Takeuchi M, Kawakami K. Risk of parkinsonism and related movement disorders with gabapentinoids or tramadol: A case-crossover study. Pharmacotherapy 2023; 43:136-144. [PMID: 36633384 DOI: 10.1002/phar.2761] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 10/27/2022] [Revised: 12/17/2022] [Accepted: 12/18/2022] [Indexed: 01/13/2023]
Abstract
INTRODUCTION A safety signal concerning parkinsonism and related movement disorders with gabapentinoids (gabapentin and pregabalin) or tramadol was detected by reviewing individual case reports and data mining in spontaneous report databases. Well-designed pharmacoepidemiological studies are needed to assess the signal. OBJECTIVE This study aimed to investigate the association of exposure to gabapentinoids or tramadol with the risk of parkinsonism and related movement disorders. METHODS We conducted a case-crossover study using a Japanese electronic medical records database. Patients with newly diagnosed parkinsonism or related movement disorders between January 1, 2007, and April 14, 2019, were identified. The diagnosis date of outcomes was defined as the index date. We assessed the exposure of each patient to gabapentinoids or tramadol during a 90-day hazard period ending 1 day before the index date and in three 90-day reference periods. Multivariable conditional logistic regression models were employed to estimate adjusted odds ratios (aORs) and 95% confidence intervals (CIs). To confirm the robustness of the primary findings, we also performed sensitivity analyses using a case-case-time-control design, a different time window for hazard and reference periods, a different definition of outcome, and different number of reference periods. RESULTS A total of 28,972 eligible cases were included in the primary analysis. Exposure to gabapentinoids (aOR, 2.12; 95% CI, 1.73-2.61) and tramadol (aOR, 2.04; 95% CI, 1.57-2.64) was associated with increased risk. Results were consistent across sensitivity analyses. CONCLUSION Our findings serve as a caution to physicians who prescribe gabapentinoids or tramadol in routine clinical practice.
Affiliation(s)
- Kairi Ri
- Department of Pharmacoepidemiology, Graduate School of Medicine and Public Health, Kyoto University, Kyoto, Japan
| | - Toshiki Fukasawa
- Department of Pharmacoepidemiology, Graduate School of Medicine and Public Health, Kyoto University, Kyoto, Japan; Department of Digital Health and Epidemiology, Graduate School of Medicine and Public Health, Kyoto University, Kyoto, Japan
| | - Satomi Yoshida
- Department of Pharmacoepidemiology, Graduate School of Medicine and Public Health, Kyoto University, Kyoto, Japan; Department of Digital Health and Epidemiology, Graduate School of Medicine and Public Health, Kyoto University, Kyoto, Japan
| | - Masato Takeuchi
- Department of Pharmacoepidemiology, Graduate School of Medicine and Public Health, Kyoto University, Kyoto, Japan
| | - Koji Kawakami
- Department of Pharmacoepidemiology, Graduate School of Medicine and Public Health, Kyoto University, Kyoto, Japan
|
23
|
Yang S, Varghese P, Stephenson E, Tu K, Gronsbell J. Machine learning approaches for electronic health records phenotyping: a methodical review. J Am Med Inform Assoc 2023; 30:367-381. [PMID: 36413056 PMCID: PMC9846699 DOI: 10.1093/jamia/ocac216] [Citation(s) in RCA: 23] [Impact Index Per Article: 23.0] [Received: 07/28/2022] [Revised: 09/27/2022] [Accepted: 10/27/2022] [Indexed: 11/23/2022]
Abstract
OBJECTIVE Accurate and rapid phenotyping is a prerequisite to leveraging electronic health records for biomedical research. While early phenotyping relied on rule-based algorithms curated by experts, machine learning (ML) approaches have emerged as an alternative to improve scalability across phenotypes and healthcare settings. This study evaluates ML-based phenotyping with respect to (1) the data sources used, (2) the phenotypes considered, (3) the methods applied, and (4) the reporting and evaluation methods used. MATERIALS AND METHODS We searched PubMed and Web of Science for articles published between 2018 and 2022. After screening 850 articles, we recorded 37 variables on 100 studies. RESULTS Most studies utilized data from a single institution and included information in clinical notes. Although chronic conditions were most commonly considered, ML also enabled the characterization of nuanced phenotypes such as social determinants of health. Supervised deep learning was the most popular ML paradigm, while semi-supervised and weakly supervised learning were applied to expedite algorithm development and unsupervised learning to facilitate phenotype discovery. ML approaches did not uniformly outperform rule-based algorithms, but deep learning offered a marginal improvement over traditional ML for many conditions. DISCUSSION Despite the progress in ML-based phenotyping, most articles focused on binary phenotypes and few articles evaluated external validity or used multi-institution data. Study settings were infrequently reported and analytic code was rarely released. CONCLUSION Continued research in ML-based phenotyping is warranted, with emphasis on characterizing nuanced phenotypes, establishing reporting and evaluation standards, and developing methods to accommodate misclassified phenotypes due to algorithm errors in downstream applications.
Affiliation(s)
- Siyue Yang
- Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
| | | | - Ellen Stephenson
- Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
| | - Karen Tu
- Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
| | - Jessica Gronsbell
- Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
|
24
|
Venkatesh SS, Ganjgahi H, Palmer DS, Coley K, Wittemans LBL, Nellaker C, Holmes C, Lindgren CM, Nicholson G. The genetic architecture of changes in adiposity during adulthood. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.01.09.23284364. [PMID: 36711652 PMCID: PMC9882550 DOI: 10.1101/2023.01.09.23284364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/13/2023]
Abstract
Obesity is a heritable disease, characterised by excess adiposity that is measured by body mass index (BMI). While over 1,000 genetic loci are associated with BMI, less is known about the genetic contribution to adiposity trajectories over adulthood. We derive adiposity-change phenotypes from 1.5 million primary-care health records in over 177,000 individuals in UK Biobank to study the genetic architecture of weight-change. Using multiple BMI measurements over time increases power to identify genetic factors affecting baseline BMI. In the largest reported genome-wide study of adiposity-change in adulthood, we identify novel associations with BMI-change at six independent loci, including rs429358 (a missense variant in APOE). The SNP-based heritability of BMI-change (1.98%) is 9-fold lower than that of BMI, and higher in women than in men. The modest genetic correlation between BMI-change and BMI (45.2%) indicates that genetic studies of longitudinal trajectories could uncover novel biology driving quantitative trait values in adulthood.
Affiliation(s)
- Samvida S. Venkatesh
- Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, UK
- Big Data Institute at the Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
| | | | - Duncan S. Palmer
- Big Data Institute at the Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Nuffield Department of Women’s and Reproductive Health, Medical Sciences Division, University of Oxford, UK
| | - Kayesha Coley
- Department of Population Health Sciences, University of Leicester, UK
| | - Laura B. L. Wittemans
- Big Data Institute at the Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Nuffield Department of Women’s and Reproductive Health, Medical Sciences Division, University of Oxford, UK
| | - Christoffer Nellaker
- Big Data Institute at the Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Nuffield Department of Women’s and Reproductive Health, Medical Sciences Division, University of Oxford, UK
| | - Chris Holmes
- Department of Statistics, University of Oxford, UK
- Nuffield Department of Medicine, Medical Sciences Division, University of Oxford, UK
- The Alan Turing Institute, London, UK
| | - Cecilia M. Lindgren
- Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, UK
- Big Data Institute at the Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Nuffield Department of Women’s and Reproductive Health, Medical Sciences Division, University of Oxford, UK
- Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
|
25
|
Klinkhammer H, Staerk C, Maj C, Krawitz PM, Mayr A. A statistical boosting framework for polygenic risk scores based on large-scale genotype data. Front Genet 2023; 13:1076440. [PMID: 36704342 PMCID: PMC9871367 DOI: 10.3389/fgene.2022.1076440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2022] [Accepted: 12/20/2022] [Indexed: 01/12/2023]
Abstract
Polygenic risk scores (PRS) evaluate the individual genetic liability to a certain trait and are expected to play an increasingly important role in clinical risk stratification. Most often, PRS are estimated based on summary statistics of univariate effects derived from genome-wide association studies. To improve the predictive performance of PRS, it is desirable to fit multivariable models directly on the genetic data. Due to the large and high-dimensional data, a direct application of existing methods is often not feasible and new efficient algorithms are required to overcome the computational burden regarding efficiency and memory demands. We develop an adapted component-wise L2-boosting algorithm to fit genotype data from large cohort studies to continuous outcomes using linear base-learners for the genetic variants. Similar to the snpnet approach implementing lasso regression, the proposed snpboost approach iteratively works on smaller batches of variants. By restricting the set of possible base-learners in each boosting step to variants most correlated with the residuals from previous iterations, the computational efficiency can be substantially increased without losing prediction accuracy. Furthermore, for large-scale data based on various traits from the UK Biobank we show that our method yields competitive prediction accuracy and computational efficiency compared to the snpnet approach and further commonly used methods. Due to the modular structure of boosting, our framework can be further extended to construct PRS for different outcome data and effect types; we illustrate this for the prediction of binary traits.
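The component-wise boosting idea that this abstract builds on can be sketched in a few lines. The code below is plain component-wise L2-boosting with one linear base-learner per column, written in pure Python; it is not the snpboost implementation and omits the batching of variants the abstract describes. The function name `componentwise_l2_boost` and its parameters `n_steps` and `learning_rate` are hypothetical names chosen for this illustration.

```python
def componentwise_l2_boost(columns, y, n_steps=200, learning_rate=0.1):
    """Plain component-wise L2-boosting with one linear base-learner per column.

    columns: list of predictor columns (each a list of floats, e.g. genotype dosages).
    Returns (intercept, coefficients) on the scale of the raw columns.
    """
    n = len(y)
    means = [sum(col) / n for col in columns]
    centred = [[v - m for v in col] for col, m in zip(columns, means)]
    sum_sq = [sum(v * v for v in col) for col in centred]

    intercept = sum(y) / n
    coef = [0.0] * len(columns)
    resid = [v - intercept for v in y]

    for _ in range(n_steps):
        best_j, best_gain, best_slope = None, 0.0, 0.0
        for j, col in enumerate(centred):
            if sum_sq[j] == 0.0:
                continue  # a constant column carries no information
            score = sum(c * r for c, r in zip(col, resid))
            gain = score * score / sum_sq[j]  # drop in residual sum of squares
            if gain > best_gain:
                best_j, best_gain, best_slope = j, gain, score / sum_sq[j]
        if best_j is None:
            break  # residuals are orthogonal to every column
        # Update only the best-fitting component, shrunk by the learning rate.
        step = learning_rate * best_slope
        coef[best_j] += step
        resid = [r - step * c for r, c in zip(resid, centred[best_j])]

    # Undo the centring so predictions can use the raw columns.
    intercept -= sum(m * b for m, b in zip(means, coef))
    return intercept, coef
```

Each iteration scans every column, which is the cost a batch-restricted scheme such as the one described above aims to reduce; stopping early (a small `n_steps`) leaves most coefficients at exactly zero, which is what gives boosting its variable-selection behaviour.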
Affiliation(s)
- Hannah Klinkhammer
- Institute for Medical Biometry, Informatics and Epidemiology, Medical Faculty, University of Bonn, Bonn, Germany
- Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University of Bonn, Bonn, Germany
| | - Christian Staerk
- Institute for Medical Biometry, Informatics and Epidemiology, Medical Faculty, University of Bonn, Bonn, Germany
| | - Carlo Maj
- Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University of Bonn, Bonn, Germany
- Center for Human Genetics, University of Marburg, Marburg, Germany
| | - Peter Michael Krawitz
- Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University of Bonn, Bonn, Germany
| | - Andreas Mayr
- Institute for Medical Biometry, Informatics and Epidemiology, Medical Faculty, University of Bonn, Bonn, Germany
|
26
|
Gu T, Lee PH, Duan R. COMMUTE: Communication-efficient transfer learning for multi-site risk prediction. J Biomed Inform 2023; 137:104243. [PMID: 36403757 PMCID: PMC9868117 DOI: 10.1016/j.jbi.2022.104243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2022] [Revised: 09/20/2022] [Accepted: 11/06/2022] [Indexed: 11/19/2022]
Abstract
OBJECTIVES We propose a communication-efficient transfer learning approach (COMMUTE) that effectively incorporates multi-site healthcare data for training a risk prediction model in a target population of interest, accounting for challenges including population heterogeneity and data sharing constraints across sites. METHODS We first train population-specific source models locally within each site. Using data from a given target population, COMMUTE learns a calibration term for each source model, which adjusts for potential data heterogeneity through flexible distance-based regularizations. In a centralized setting where multi-site data can be directly pooled, all data are combined to train the target model after calibration. When individual-level data are not shareable in some sites, COMMUTE requests only the locally trained models from these sites, with which, COMMUTE generates heterogeneity-adjusted synthetic data for training the target model. We evaluate COMMUTE via extensive simulation studies and an application to multi-site data from the electronic Medical Records and Genomics (eMERGE) Network to predict extreme obesity. RESULTS Simulation studies show that COMMUTE outperforms methods without adjusting for population heterogeneity and methods trained in a single population over a broad spectrum of settings. Using eMERGE data, COMMUTE achieves an area under the receiver operating characteristic curve (AUC) around 0.80, which outperforms other benchmark methods with AUC ranging from 0.51 to 0.70. CONCLUSION COMMUTE improves the risk prediction in a target population with limited samples and safeguards against negative transfer when some source populations are highly different from the target. In a federated setting, it is highly communication efficient as it only requires each site to share model parameter estimates once, and no iterative communication or higher-order terms are needed.
Affiliation(s)
- Tian Gu
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States
| | - Phil H Lee
- Department of Psychiatry, Harvard Medical School, Boston, MA, United States; Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, United States; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, United States
| | - Rui Duan
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States.
|
27
|
Beesley LJ, Mukherjee B. Case studies in bias reduction and inference for electronic health record data with selection bias and phenotype misclassification. Stat Med 2022; 41:5501-5516. [PMID: 36131394 PMCID: PMC9826451 DOI: 10.1002/sim.9579] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/27/2021] [Revised: 08/12/2022] [Accepted: 08/13/2022] [Indexed: 01/11/2023]
Abstract
Electronic health records (EHR) are not designed for population-based research, but they provide easy and quick access to longitudinal health information for a large number of individuals. Many statistical methods have been proposed to account for selection bias, missing data, phenotyping errors, or other problems that arise in EHR data analysis. However, addressing multiple sources of bias simultaneously is challenging. We developed a methodological framework (R package, SAMBA) for jointly handling both selection bias and phenotype misclassification in the EHR setting that leverages external data sources. These methods assume factors related to selection and misclassification are fully observed, but these factors may be poorly understood and partially observed in practice. As a follow-up to the methodological work, we demonstrate how to apply these methods for two real-world case studies, and we evaluate their performance. In both examples, we use individual patient-level data collected through the University of Michigan Health System and various external population-based data sources. In case study (a), we explore the impact of these methods on estimated associations between gender and cancer diagnosis. In case study (b), we compare corrected associations between previously identified genetic loci and age-related macular degeneration with gold standard external summary estimates. These case studies illustrate how to utilize diverse auxiliary information to achieve less biased inference in EHR-based research.
Affiliation(s)
- Lauren J. Beesley
- Department of Biostatistics, University of Michigan, Michigan, USA; Information Systems and Modeling, Los Alamos National Laboratory, New Mexico, USA
|
28
|
Khera AV, Wang M, Chaffin M, Emdin CA, Samani NJ, Schunkert H, Watkins H, McPherson R, Elosua R, Boerwinkle E, Ardissino D, Butterworth AS, Di Angelantonio E, Naheed A, Danesh J, Chowdhury R, Krumholz HM, Sheu WHH, Rich SS, Rotter JI, Chen YDI, Gabriel S, Lander ES, Saleheen D, Kathiresan S. Gene Sequencing Identifies Perturbation in Nitric Oxide Signaling as a Nonlipid Molecular Subtype of Coronary Artery Disease. CIRCULATION. GENOMIC AND PRECISION MEDICINE 2022; 15:e003598. [PMID: 36215124 PMCID: PMC9771961 DOI: 10.1161/circgen.121.003598] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/19/2021] [Accepted: 06/24/2022] [Indexed: 12/24/2022]
Abstract
BACKGROUND A key goal of precision medicine is to disaggregate common, complex diseases into discrete molecular subtypes. Rare coding variants in the low-density lipoprotein receptor gene (LDLR) are identified in 1% to 2% of coronary artery disease (CAD) patients, defining a molecular subtype with risk driven by hypercholesterolemia. METHODS To search for additional subtypes, we compared the frequency of rare, predicted loss-of-function and damaging missense variants aggregated within a given gene in 41 081 CAD cases versus 217 115 controls. RESULTS Rare variants in LDLR were most strongly associated with CAD, present in 1% of cases and associated with 4.4-fold increased CAD risk. A second subtype was characterized by variants in the endothelial nitric oxide synthase gene (NOS3), which encodes a key enzyme regulating vascular tone, endothelial function, and platelet aggregation. A rare predicted loss-of-function or damaging missense variant in NOS3 was present in 0.6% of cases and associated with 2.42-fold increased risk of CAD (95% CI, 1.80-3.26; P=5.50×10^-9). These variants were associated with higher systolic blood pressure (+3.25 mm Hg; [95% CI, 1.86-4.65]; P=5.00×10^-6) and increased risk of hypertension (adjusted odds ratio 1.31; [95% CI, 1.14-1.51]; P=2.00×10^-4) but not circulating cholesterol concentrations, suggesting that, beyond lipid pathways, nitric oxide synthesis is a key nonlipid driver of CAD risk. CONCLUSIONS Beyond LDLR, we identified an additional nonlipid molecular subtype of CAD characterized by rare variants in the NOS3 gene.
Affiliation(s)
- Amit V. Khera
- Program in Medical & Population Genetics, Broad Inst of MIT & Harvard, Cambridge, MA
- Ctr for Genomic Medicine, Massachusetts General Hospital, Boston, MA
- Dept of Medicine, Harvard Medical School, Boston, MA
- Cardiology Division, Dept of Medicine, Massachusetts General Hospital, Boston, MA
| | - Minxian Wang
- Ctr for Genomic Medicine, Massachusetts General Hospital, Boston, MA
- Program in Medical & Population Genetics, Broad Inst of MIT & Harvard, Cambridge, MA
- CAS Key Laboratory of Genome Sciences & Information, Beijing Inst of Genomics, Chinese Academy of Sciences & China National Ctr for Bioinformation, Beijing, China
| | - Mark Chaffin
- Program in Medical & Population Genetics, Broad Inst of MIT & Harvard, Cambridge, MA
| | - Connor A. Emdin
- Ctr for Genomic Medicine, Massachusetts General Hospital, Boston, MA
- Dept of Medicine, Harvard Medical School, Boston, MA
- Program in Medical & Population Genetics, Broad Inst of MIT & Harvard, Cambridge, MA
| | - Nilesh J. Samani
- Dept of Cardiovascular Sciences, Univ of Leicester, Leicester, UK
- NIHR Leicester Biomedical Research Ctr, Glenfield Hospital, Leicester, UK
| | - Heribert Schunkert
- Dept of Cardiology, German Heart Ctr Munich, Technical Univ of Munich, Munich, Germany
- DZHK (German Ctr for Cardiovascular Research), Partner site Munich, Munich Heart Alliance, Munich, Germany
| | - Hugh Watkins
- Division of Cardiovascular Medicine, Radcliffe Dept of Medicine, Univ of Oxford, Headington, UK
- Wellcome Trust Ctr for Human Genetics, Univ of Oxford, Oxford, UK
| | - Ruth McPherson
- Inst for Cardiogenetics, Univ of Lübeck, Lübeck, Schleswig-Holstein, Germany
- German Research Ctr for Cardiovascular Research, Partner Site Hamburg/Lübeck/Kiel & Univ Heart Center Lübeck (J.E.), Berlin, Brandenburg, Germany
- Depts of Medicine & Biochemistry, Univ of Ottawa Heart Inst, Ottawa, ON, Canada
| | - Roberto Elosua
- Cardiovascular Epidemiology & Genetics, Hospital del Mar Research Inst, Barcelona, Spain
- CIBER Enfermedades Cardiovasculares, Barcelona, Spain
- Facultat de Medicina, Universitat de Vic-Central de Cataluña, Barcelona, Spain
| | - Eric Boerwinkle
- Ctr for Human Genetics & Dept. of Epidemiology, Univ of Texas Health Science Ctr School of Public Health, Houston, TX
| | - Diego Ardissino
- Cardiology, Azienda Ospedaliero-Universitaria di Parma, Univ of Parma, Parma, Italy
- Associazione per lo Studio Della Trombosi in Cardiologia, Pavia, Italy
| | - Adam S. Butterworth
- British Heart Foundation Cardiovascular Epidemiology Unit, Dept of Public Health & Primary Care, Univ of Cambridge, Cambridge, UK
- National Inst for Health Research Blood & Transplant Research Unit in Donor Health & Genomics, Univ of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus & Univ of Cambridge, Cambridge, UK
| | - Emanuele Di Angelantonio
- British Heart Foundation Cardiovascular Epidemiology Unit, Dept of Public Health & Primary Care, Univ of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus & Univ of Cambridge, Cambridge, UK
- NIHR Blood & Transplant Research Unit in Donor Health & Genomics, Univ of Cambridge, Cambridge, UK
- BHF Ctr of Research Excellence, School of Clinical Medicine, Addenbrooke’s Hospital, Univ of Cambridge, Cambridge, UK
- Health Data Science Research Ctr, Human Technopole, Milan, Italy
| | - Aliya Naheed
- Initiative for Noncommunicable Diseases, Health Systems & Population Studies Division, International Ctr for Diarrhoeal Disease Research, Bangladesh, Dhaka, Bangladesh
| | - John Danesh
- British Heart Foundation Cardiovascular Epidemiology Unit, Dept of Public Health & Primary Care, Univ of Cambridge, Cambridge, UK
- National Inst for Health Research Blood & Transplant Research Unit in Donor Health & Genomics, Univ of Cambridge, Cambridge, UK
- British Heart Foundation Ctr of Research Excellence, Univ of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus & Univ of Cambridge, Cambridge, UK
- Dept of Human Genetics, Wellcome Sanger Inst, Hinxton, UK
| | - Rajiv Chowdhury
- British Heart Foundation Cardiovascular Epidemiology Unit, Dept of Public Health & Primary Care, Univ of Cambridge, Cambridge, UK
- Centre for Non-Communicable Disease Research, Dhaka, Bangladesh
| | - Harlan M. Krumholz
- Section of Cardiovascular Medicine, Dept of Medicine, Yale Univ, New Haven, CT
- Ctr for Outcomes Research & Evaluation, Yale-New Haven Hospital, New Haven, CT
| | - Wayne H-H Sheu
- Cardiovascular Research Ctr, Dept of Medicine, National Yang Ming Univ School of Medicine, Taipei, Taiwan
| | - Stephen S. Rich
- Ctr for Public Health Genomics, Univ of Virginia, Charlottesville, VA
| | - Jerome I. Rotter
- The Inst for Translational Genomics & Population Sciences, Dept of Pediatrics, The Lundquist Inst for Biomedical Innovation at Harbor-UCLA Medical Ctr, Torrance, CA
| | - Yii-der Ida Chen
- The Inst for Translational Genomics & Population Sciences, Dept of Pediatrics, The Lundquist Inst for Biomedical Innovation at Harbor-UCLA Medical Ctr, Torrance, CA
| | - Stacey Gabriel
- Program in Medical & Population Genetics, Broad Inst of MIT & Harvard, Cambridge, MA
| | - Eric S. Lander
- Program in Medical & Population Genetics, Broad Inst of MIT & Harvard, Cambridge, MA
- Dept of Biology, MIT, Cambridge, MA
- Dept of Systems Biology, Harvard Medical School, Boston, MA
| | - Danish Saleheen
- Dept of Medicine, Columbia Univ, New York, NY
- Ctr for Non-Communicable Diseases, Karachi, Sindh, Pakistan
| | - Sekar Kathiresan
- Ctr for Genomic Medicine, Massachusetts General Hospital, Boston, MA
- Dept of Medicine, Harvard Medical School, Boston, MA
- Cardiology Division, Dept of Medicine, Massachusetts General Hospital, Boston, MA
- Verve Therapeutics, Cambridge, MA
| |
|
29
|
Étiévant L, Viallon V. Causal inference under over-simplified longitudinal causal models. Int J Biostat 2022; 18:421-437. [PMID: 34727585 DOI: 10.1515/ijb-2020-0081] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/02/2020] [Accepted: 10/14/2021] [Indexed: 01/10/2023]
Abstract
Many causal models of interest in epidemiology involve longitudinal exposures, confounders and mediators. However, repeated measurements are not always available or used in practice, leading analysts to overlook the time-varying nature of exposures and work under over-simplified causal models. Our objective is to assess whether - and how - causal effects identified under such misspecified causal models relate to true causal effects of interest. We derive sufficient conditions ensuring that the quantities estimated in practice under over-simplified causal models can be expressed as weighted averages of longitudinal causal effects of interest. Unsurprisingly, these sufficient conditions are very restrictive, and our results state that the quantities estimated in practice should be interpreted with caution in general, as they usually do not relate to any longitudinal causal effect of interest. Our simulations further illustrate that the bias between the quantities estimated in practice and the weighted averages of longitudinal causal effects of interest can be substantial. Overall, our results confirm the need for repeated measurements to conduct proper analyses and/or the development of sensitivity analyses when they are not available.
Affiliation(s)
| | - Vivian Viallon
- Nutritional Methodology and Biostatistics, International Agency for Research on Cancer, Lyon 69372, France
| |
|
30
|
Ma Y, Patil S, Zhou X, Mukherjee B, Fritsche LG. ExPRSweb: An online repository with polygenic risk scores for common health-related exposures. Am J Hum Genet 2022; 109:1742-1760. [PMID: 36152628 PMCID: PMC9606385 DOI: 10.1016/j.ajhg.2022.09.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 02/10/2022] [Accepted: 08/31/2022] [Indexed: 01/25/2023] Open
Abstract
Complex traits are influenced by genetic risk factors, lifestyle, and environmental variables, so-called exposures. Some exposures, e.g., smoking or lipid levels, have common genetic modifiers identified in genome-wide association studies. Because measurements are often unfeasible, exposure polygenic risk scores (ExPRSs) offer an alternative to study the influence of exposures on various phenotypes. Here, we collected publicly available summary statistics for 28 exposures and applied four common PRS methods to generate ExPRSs in two large biobanks: the Michigan Genomics Initiative and the UK Biobank. We established ExPRSs for 27 exposures and demonstrated their applicability in phenome-wide association studies and as predictors for common chronic conditions. Especially the addition of multiple ExPRSs showed, for several chronic conditions, an improvement compared to prediction models that only included traditional, disease-focused PRSs. To facilitate follow-up studies, we share all ExPRS constructs and generated results via an online repository called ExPRSweb.
Affiliation(s)
- Ying Ma
- Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA
| | - Snehal Patil
- Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA
| | - Xiang Zhou
- Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Precision Health Data Science, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA
| | - Bhramar Mukherjee
- Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Precision Health Data Science, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Department of Epidemiology, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; University of Michigan Rogel Cancer Center, University of Michigan, Ann Arbor, MI 48109, USA; Michigan Institute for Data Science, University of Michigan, Ann Arbor, MI 48109, USA
| | - Lars G Fritsche
- Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Precision Health Data Science, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; University of Michigan Rogel Cancer Center, University of Michigan, Ann Arbor, MI 48109, USA.
| |
|
31
|
Clark-Boucher D, Boss J, Salvatore M, Smith JA, Fritsche LG, Mukherjee B. Assessing the added value of linking electronic health records to improve the prediction of self-reported COVID-19 testing and diagnosis. PLoS One 2022; 17:e0269017. [PMID: 35877617 PMCID: PMC9312965 DOI: 10.1371/journal.pone.0269017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2021] [Accepted: 05/12/2022] [Indexed: 11/19/2022] Open
Abstract
Since the beginning of the Coronavirus Disease 2019 (COVID-19) pandemic, a focus of research has been to identify risk factors associated with COVID-19-related outcomes, such as testing and diagnosis, and use them to build prediction models. Existing studies have used data from digital surveys or electronic health records (EHRs), but very few have linked the two sources to build joint predictive models. In this study, we used survey data on 7,054 patients from the Michigan Genomics Initiative biorepository to evaluate how well self-reported data could be integrated with electronic records for the purpose of modeling COVID-19-related outcomes. We observed that among survey respondents, self-reported COVID-19 diagnosis captured a larger number of cases than the corresponding EHRs, suggesting that self-reported outcomes may be better than EHRs for distinguishing COVID-19 cases from controls. In the modeling context, we compared the utility of survey- and EHR-derived predictor variables in models of survey-reported COVID-19 testing and diagnosis. We found that survey-derived predictors produced uniformly stronger models than EHR-derived predictors (likely due to their specificity, temporal proximity, and breadth) and that combining predictors from both sources offered no consistent improvement compared to using survey-based predictors alone. Our results suggest that, even though general EHRs are useful in predictive models of COVID-19 outcomes, they may not be essential in those models when rich survey data are already available. The two data sources together may offer better prediction for COVID severity, but we did not have enough severe cases in the survey respondents to assess that hypothesis in our study.
Affiliation(s)
- Dylan Clark-Boucher
- Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan, United States of America
| | - Jonathan Boss
- Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan, United States of America
| | - Maxwell Salvatore
- Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan, United States of America
- Department of Epidemiology, University of Michigan School of Public Health, Ann Arbor, Michigan, United States of America
| | - Jennifer A. Smith
- Department of Epidemiology, University of Michigan School of Public Health, Ann Arbor, Michigan, United States of America
- Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Lars G. Fritsche
- Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan, United States of America
- Rogel Cancer Center, University of Michigan, Ann Arbor, Michigan, United States of America
- Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, Michigan, United States of America
| | - Bhramar Mukherjee
- Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan, United States of America
- Department of Epidemiology, University of Michigan School of Public Health, Ann Arbor, Michigan, United States of America
- Rogel Cancer Center, University of Michigan, Ann Arbor, Michigan, United States of America
- Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, Michigan, United States of America
| |
|
32
|
He Y, Patel CJ. Shared exposure liability of type 2 diabetes and other chronic conditions in the UK Biobank. Acta Diabetol 2022; 59:851-860. [PMID: 35348899 PMCID: PMC9085680 DOI: 10.1007/s00592-022-01864-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/15/2021] [Accepted: 01/31/2022] [Indexed: 11/09/2022]
Abstract
AIMS: To investigate whether the cumulative exposure risks of incident T2D are shared with other common chronic diseases. RESEARCH DESIGN AND METHODS: We first establish and report the cross-sectional prevalence, cross-sectional co-prevalence, and incidence of seven T2D-associated chronic diseases [hypertension, atrial fibrillation, coronary artery disease, obesity, chronic obstructive pulmonary disease (COPD), and chronic kidney and liver diseases] in the UK Biobank. We use published weights of genetic variants and exposure variables to derive the T2D polygenic (PGS) and polyexposure (PXS) risk scores and test their associations to incident diseases. RESULTS: PXS was associated with higher levels of clinical risk factors including BMI, systolic blood pressure, blood glucose, triglycerides, and HbA1c in individuals without overt or diagnosed T2D. In addition to predicting incident T2D, PXS and PGS were significantly and positively associated with the incidence of all 7 other chronic diseases. There were 4% and 8% of individuals in the bottom deciles of PXS and PGS, respectively, who were prediabetic at baseline but had low risks of T2D and other chronic diseases. Compared to the remaining population, individuals in the top deciles of PGS and PXS had particularly high risks of developing chronic diseases. For instance, the hazard ratio of COPD and obesity for individuals in the top T2D PXS deciles was 2.82 (95% CI 2.39-3.35, P = 4.00 × 10^-33) and 2.54 (95% CI 2.24-2.87, P = 9.86 × 10^-50), respectively, compared to the remaining population. We also found that PXS and PGS were both significantly (P < 0.0001) and positively associated with the total number of incident diseases. CONCLUSIONS: T2D shares polyexposure risks with other common chronic diseases. Individuals with an elevated genetic and non-genetic risk of T2D also have high risks of cardiovascular, liver, lung, and kidney diseases.
Affiliation(s)
- Yixuan He
- Program in Bioinformatics and Integrative Genomics, Harvard Medical School, 10 Shattuck St, Boston, MA, 02215, USA
- Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck St, Boston, MA, USA
| | - Chirag J Patel
- Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck St, Boston, MA, USA.
| |
|
33
|
Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. NPJ Digit Med 2022; 5:66. [PMID: 35641814 PMCID: PMC9156743 DOI: 10.1038/s41746-022-00611-y] [Citation(s) in RCA: 72] [Impact Index Per Article: 36.0] [Received: 11/16/2021] [Accepted: 04/29/2022] [Indexed: 12/13/2022] Open
Abstract
Machine learning (ML) and artificial intelligence (AI) algorithms have the potential to derive insights from clinical data and improve patient outcomes. However, these highly complex systems are sensitive to changes in the environment and liable to performance decay. Even after their successful integration into clinical practice, ML/AI algorithms should be continuously monitored and updated to ensure their long-term safety and effectiveness. To bring AI into maturity in clinical care, we advocate for the creation of hospital units responsible for quality assurance and improvement of these algorithms, which we refer to as “AI-QI” units. We discuss how tools that have long been used in hospital quality assurance and quality improvement can be adapted to monitor static ML algorithms. On the other hand, procedures for continual model updating are still nascent. We highlight key considerations when choosing between existing methods and opportunities for methodological innovation.
|
34
|
Yang S, Zhou X. PGS-server: accuracy, robustness and transferability of polygenic score methods for biobank scale studies. Brief Bioinform 2022; 23:6534383. [PMID: 35193147 DOI: 10.1093/bib/bbac039] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 10/02/2021] [Revised: 12/29/2021] [Accepted: 01/26/2022] [Indexed: 01/02/2023] Open
Abstract
Polygenic scores (PGS) are important tools for carrying out genetic prediction of common diseases and disease related complex traits, facilitating the development of precision medicine. Unfortunately, despite the critical importance of PGS and the vast number of PGS methods recently developed, few comprehensive comparison studies have been performed to evaluate the effectiveness of PGS methods. To fill this critical knowledge gap, we performed a comprehensive comparison study on 12 different PGS methods through internal evaluations on 25 quantitative and 25 binary traits within the UK Biobank with sample sizes ranging from 147 408 to 336 573, and through external evaluations via 25 cross-study and 112 cross-ancestry analyses on summary statistics from multiple genome-wide association studies with sample sizes ranging from 1415 to 329 345. We evaluate the prediction accuracy, computational scalability, as well as robustness and transferability of different PGS methods across datasets and/or genetic ancestries, providing important guidelines for practitioners in choosing PGS methods. Besides method comparison, we present a simple aggregation strategy that combines multiple PGS from different methods to take advantage of their distinct benefits to achieve stable and superior prediction performance. To facilitate future applications of PGS, we also develop a PGS webserver (http://www.pgs-server.com/) that allows users to upload summary statistics and choose different PGS methods to fit the data directly. We hope that our results, method and webserver will facilitate the routine application of PGS across different research areas.
Affiliation(s)
- Sheng Yang
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
| | - Xiang Zhou
- Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
| |
|
35
|
Kawaguchi ES, Li G, Lewinger JP, Gauderman WJ. Two-step hypothesis testing to detect gene-environment interactions in a genome-wide scan with a survival endpoint. Stat Med 2022; 41:1644-1657. [PMID: 35075649 PMCID: PMC9007892 DOI: 10.1002/sim.9319] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/13/2021] [Revised: 11/10/2021] [Accepted: 12/26/2021] [Indexed: 01/13/2023]
Abstract
Defined by their genetic profile, individuals may exhibit differential clinical outcomes due to an environmental exposure. Identifying subgroups based on specific exposure-modifying genes can lead to targeted interventions and focused studies. Genome-wide interaction scans (GWIS) can be performed to identify such genes, but these scans typically suffer from low power due to the large multiple testing burden. We provide a novel framework for powerful two-step hypothesis tests for GWIS with a time-to-event endpoint under the Cox proportional hazards model. In the Cox regression setting, we develop an approach that prioritizes genes for Step-2 G × E testing based on a carefully constructed Step-1 screening procedure. Simulation results demonstrate this two-step approach can lead to substantially higher power for identifying gene-environment (G × E) interactions compared to the standard GWIS while preserving the family-wise error rate over a range of scenarios. In a taxane-anthracycline chemotherapy study for breast cancer patients, the two-step approach identifies several gene expression by treatment interactions that would not be detected using the standard GWIS.
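The screen-then-test idea in this abstract can be illustrated with a minimal sketch of generic two-step subset testing. It is not the authors' Cox-specific procedure: the function name `two_step_subset_test`, the thresholds `alpha1` and `alpha`, and the p-values are hypothetical, and the sketch assumes the Step-1 and Step-2 statistics are independent under the null, which is the usual condition for this kind of procedure to control the family-wise error rate.

```python
def two_step_subset_test(step1_p, step2_p, alpha1=0.05, alpha=0.05):
    """Return indices of markers declared significant by subset testing.

    Step 1 keeps markers whose screening p-value is below alpha1.
    Step 2 applies a Bonferroni correction over the kept markers only.
    """
    if len(step1_p) != len(step2_p):
        raise ValueError("step1_p and step2_p must have the same length")
    # Step 1: screening reduces the number of tests carried forward.
    kept = [i for i, p in enumerate(step1_p) if p < alpha1]
    if not kept:
        return []
    # Step 2: Bonferroni threshold uses the number of kept markers,
    # not the total number of markers scanned.
    threshold = alpha / len(kept)
    return [i for i in kept if step2_p[i] < threshold]


# Hypothetical p-values for four markers.
step1 = [0.001, 0.20, 0.03, 0.50]
step2 = [0.02, 0.50, 0.04, 0.30]

two_step_hits = two_step_subset_test(step1, step2)
# One-step scan for comparison: Bonferroni over all four markers.
one_step_hits = [i for i, p in enumerate(step2) if p < 0.05 / len(step2)]
print(two_step_hits, one_step_hits)
```

With these made-up inputs, marker 0 passes the two-step threshold of 0.05/2 but not the one-step threshold of 0.05/4, which is the power gain the abstract describes.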
Affiliation(s)
- Eric S Kawaguchi
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
| | - Gang Li
- Department of Biostatistics, University of California, Los Angeles, Los Angeles, California, USA; Department of Computational Medicine, University of California, Los Angeles, Los Angeles, California, USA
| | - Juan Pablo Lewinger
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
| | - W James Gauderman
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
| |
|
36
|
McGee G, Haneuse S, Coull BA, Weisskopf MG, Rotem RS. On the Nature of Informative Presence Bias in Analyses of Electronic Health Records. Epidemiology 2022; 33:105-113. [PMID: 34711733 PMCID: PMC8633193 DOI: 10.1097/ede.0000000000001432] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Indexed: 01/03/2023]
Abstract
Electronic health records (EHRs) offer unprecedented opportunities to answer epidemiologic questions. However, unlike in ordinary cohort studies or randomized trials, EHR data are collected somewhat idiosyncratically. In particular, patients who have more contact with the medical system have more opportunities to receive diagnoses, which are then recorded in their EHRs. The goal of this article is to shed light on the nature and scope of this phenomenon, known as informative presence, which can bias estimates of associations. We show how this can be characterized as an instance of misclassification bias. As a consequence, we show that informative presence bias can occur in a broader range of settings than previously thought, and that simple adjustment for the number of visits as a confounder may not fully correct for bias. Additionally, where previous work has considered only underdiagnosis, investigators are often concerned about overdiagnosis; we show how this changes the settings in which bias manifests. We report on a comprehensive series of simulations to shed light on when to expect informative presence bias, how it can be mitigated in some cases, and cases in which new methods need to be developed.
Affiliation(s)
- Glen McGee
- Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada
| | - Sebastien Haneuse
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
| | - Brent A. Coull
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
| | - Marc G. Weisskopf
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA
| | - Ran S. Rotem
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA
- Kahn-Sagol-Maccabi Research and Innovation Institute, Maccabi Healthcare Services, Tel Aviv, Israel
| |
|
37
|
Spector-Bagdady K, Tang S, Jabbour S, Price WN, Bracic A, Creary MS, Kheterpal S, Brummett CM, Wiens J. Respecting Autonomy And Enabling Diversity: The Effect Of Eligibility And Enrollment On Research Data Demographics. Health Aff (Millwood) 2021; 40:1892-1899. [PMID: 34871076 DOI: 10.1377/hlthaff.2021.01197] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 11/05/2022]
Abstract
Many promising advances in precision health and other Big Data research rely on large data sets to analyze correlations among genetic variants, behavior, environment, and outcomes to improve population health. But these data sets are generally populated with demographically homogeneous cohorts. We conducted a retrospective cohort study of patients at a major academic medical center during 2012-19 to explore how recruitment and enrollment approaches affected the demographic diversity of participants in its research biospecimen and data bank. We found that compared with the overall clinical population, patients who consented to enroll in the research data bank were significantly less diverse in terms of age, sex, race, ethnicity, and socioeconomic status. Compared with patients who were recruited for the data bank, patients who enrolled were younger and less likely to be Black or African American, Asian, or Hispanic. The overall demographic diversity of the data bank was affected as much (and in some cases more) by which patients were considered eligible for recruitment as by which patients consented to enroll. Our work underscores the need for systemic commitment to diversify data banks so that different communities can benefit from research.
Affiliation(s)
- Kayte Spector-Bagdady
- Kayte Spector-Bagdady is an assistant professor of obstetrics and gynecology and an associate director of the Center for Bioethics and Social Sciences in Medicine at the University of Michigan Medical School, in Ann Arbor, Michigan. Spector-Bagdady, Shengpu Tang, and Sarah Jabbour are co-first authors
| | - Shengpu Tang
- Shengpu Tang is a PhD candidate in computer science and engineering at the University of Michigan, in Ann Arbor, Michigan
| | - Sarah Jabbour
- Sarah Jabbour is a PhD candidate in computer science and engineering at the University of Michigan
| | - W Nicholson Price
- W. Nicholson Price II is a professor of law at the University of Michigan Law School, in Ann Arbor, Michigan
| | - Ana Bracic
- Ana Bracic is an assistant professor of political science and a member of the Minority Politics Initiative at Michigan State University, in East Lansing, Michigan
| | - Melissa S Creary
- Melissa S. Creary is an assistant professor of health management and policy at the University of Michigan School of Public Health, in Ann Arbor, Michigan, and the senior director for the Office of Public Health Initiatives at the American Thrombosis and Hemostasis Network (ATHN), in Rochester, New York
| | - Sachin Kheterpal
- Sachin Kheterpal is a professor of anesthesiology and the associate dean for research information technology at the University of Michigan Medical School
| | - Chad M Brummett
- Chad M. Brummett is a professor of anesthesiology and senior associate chair for research at the University of Michigan Medical School
| | - Jenna Wiens
- Jenna Wiens is an associate professor of computer science and engineering, associate director of the Artificial Intelligence Lab, and codirector for Precision Health at the University of Michigan
| |
|
38
|
Davitte JM, Stott-Miller M, Ehm MG, Cunnington MC, Reynolds RF. Integration of Real-World Data and Genetics to Support Target Identification and Validation. Clin Pharmacol Ther 2021; 111:63-76. [PMID: 34818443 DOI: 10.1002/cpt.2477] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/23/2021] [Revised: 10/06/2021] [Accepted: 10/27/2021] [Indexed: 01/01/2023]
Abstract
Even modest improvements in the probability of success of selecting drug targets which are ultimately approved can substantially reduce the costs of research and development. Drug targets with human genetic evidence of disease association are twice as likely to lead to approved drugs. A key enabler of identifying and validating these genetically validated targets is access to association results from genome-wide genotyping, whole-exome sequencing, and whole-genome sequencing studies with observable traits (often diseases) across large numbers of individuals. Today, linkage between genotype and real-world data (RWD) provides significant opportunities to not only increase the statistical power of genome-wide association studies by ascertaining additional cases for diseases of interest, but also to improve diversity and coverage of association studies across the disease phenome. As RWD-genetics linked resources continue to grow in diversity of participants, breadth of data captured, length of observation, and number of participants, there is a greater need to leverage the experience of RWD experts, clinicians, and highly experienced geneticists together to understand which lessons and frameworks from general research using RWD sources are relevant to improve genetics-driven drug discovery and development. This paper describes new challenges and opportunities for phenotypes enabled by diverse RWD sources, considerations in the use of RWD phenotypes for disease gene identification across the disease phenome, and challenges and opportunities in leveraging RWD phenotypes in target validation. The paper concludes with views on the future directions for phenotype development using RWD, and key questions requiring further research and development to advance this nascent field.
Affiliation(s)
| | | | | | | | - Robert F Reynolds
- GlaxoSmithKline, New York, New York, USA; Tulane School of Public Health and Tropical Medicine, New Orleans, Louisiana, USA
| |
|
39
|
Willers C, Lynch T, Chand V, Islam M, Lassere M, March L. A Versatile, Secure, and Sustainable All-in-One Biobank-Registry Data Solution: The A3BC REDCap Model. Biopreserv Biobank 2021; 20:244-259. [PMID: 34807733 DOI: 10.1089/bio.2021.0098] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/20/2022] Open
Abstract
Introduction: A key element in the big data revolution is large-scale biobanking and the associated development of high-quality data collections and supporting informatics solutions. As such, in establishing the Australian Arthritis and Autoimmune Biobank Collaborative (A3BC), we sought to establish a low-cost, nation-scale data management system capable of managing a multisite biobank registry with complex longitudinal sample and data requirements. Materials and Methods: We assessed several international commercial and nonprofit software platforms using standardized system requirement criteria and follow-up interviews. Vendor compliance scoring was prioritized to meet our project-critical requirements. Consumer/end-user codesign was integral to refining our system requirements for optimized adoption. Customization of the selected software solution was performed to optimize field auto-population between participant timepoints and forms, using modules that are transferable and that do not impact core code. Institutional and independent testing was used to ensure data security. Results: We selected the widely used research web application Research Electronic Data Capture (REDCap), which is "free" (under nonprofit license agreement terms), highly configurable, and customizable to a variety of biobank and registry needs and can be developed/maintained by biobank users with modest IT skill, time, and cost. We created a secure, comprehensive participant-centric biobank-registry database that includes: (1) best practice data security measures (incl. multisite access login using institutional user credentials), (2) permission-to-contact and dynamic itemized electronic consent, (3) a complete chain of custody from consent to longitudinal biospecimen data collection to publication, (4) complex longitudinal patient-reported surveys, (5) integration of record-level extracted/linked participant data, (6) significant form auto-population for streamlined data capture, and (7) native dashboards for operational visualizations. Conclusion: We recommend the versatile, reusable, and sustainable informatics model we have developed in REDCap for prospective chronic disease biobanks or registry biobanks (of local to national complexity) supporting holistic research into disease prediction, precision medicine, and prevention strategies.
Affiliation(s)
- Craig Willers
  - Institute of Bone and Joint Research, The Australian Arthritis and Autoimmune Biobank Collaborative, Kolling Institute, University of Sydney, Sydney, Australia
- Tom Lynch
  - Institute of Bone and Joint Research, The Australian Arthritis and Autoimmune Biobank Collaborative, Kolling Institute, University of Sydney, Sydney, Australia
- Vibhasha Chand
  - Public Health and Preventive Medicine, Monash University, Clayton, Australia
- Mohammad Islam
  - Information and Communications Technology, University of Sydney, Sydney, Australia
- Marissa Lassere
  - School of Population Health, University of New South Wales, Sydney, Australia
- Lyn March
  - Institute of Bone and Joint Research, The Australian Arthritis and Autoimmune Biobank Collaborative, Kolling Institute, University of Sydney, Sydney, Australia
  - Department of Rheumatology, Royal North Shore Hospital, St Leonards, Australia
|
40
|
Rush A, Catchpoole DR, Reaiche-Miller G, Gilbert T, Ng W, Watson PH, Byrne JA. What Do Biomedical Researchers Want from Biobanks? Results of an Online Survey. Biopreserv Biobank 2021; 20:271-282. [PMID: 34756100 DOI: 10.1089/bio.2021.0084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Aims: The purpose of biobanking is to provide biospecimens and associated data to researchers, yet the perspectives of biobank research users have been under-investigated. This study aimed to ascertain biobank research users' needs and opinions about biobanking services. Methods: An online survey was developed, which requested information about researcher demographics, localities of biobanks accessed, methods of sourcing biospecimens, and opinions on topics including but not limited to, application processes, data availability, access fees, and return of research results. There were 27 multiple choice/check box questions, 4 questions with a 10-point Likert scale, and 8 questions with provision for further comment. A web link for the survey was distributed to researchers in late 2019/early 2020 in four Australian states: New South Wales, Victoria, Western Australia, and South Australia. Results: Respondents were generally satisfied with biobank application processes and the fit for purpose of received biospecimens/data. Nonetheless, most researchers (n = 61/99, 62%) reported creating their own collections owing to gaps in sample availability and a perceived increase in efficiency. Most accessed biobanks (n = 58/74, 78%) were in close proximity (local or intrastate) to the researcher. Most researchers had limited the scope of their research owing to difficulty of obtaining biospecimens (n = 55/86, 64%) and/or data (n = 52/85, 60%), with the top three responses for additional types of data required being "more long term follow up data," "more clinical data," and "more linked government data." The top influence to use a particular biobank was cost, and the most frequently suggested improvement was reduced direct "cost of obtaining biospecimens." Conclusion: Biobanks that do not meet the needs of their end-users are unlikely to be optimally utilized or sustainable. 
This survey provides valuable insights to guide biobanks and other stakeholders, such as developing marketing and client engagement plans to encourage local research users and discouraging the creation of unnecessary new collections.
Affiliation(s)
- Amanda Rush
  - New South Wales Health Statewide Biobank, New South Wales Health Pathology, Camperdown, Australia
- Daniel R Catchpoole
  - Children's Cancer Research Unit, Kids Research, The Children's Hospital at Westmead, Westmead, Australia
- Georget Reaiche-Miller
  - Division of Research and Innovation, The University of Adelaide Biobank, Adelaide, Australia
- Thomas Gilbert
  - The University of Western Australia Medical School, University of Western Australia, Perth, Australia
- Wayne Ng
  - Victorian Cancer Biobank, Melbourne, Australia
- Peter Hamilton Watson
  - Biobanking and Biospecimen Research Services, British Columbia Cancer, Victoria, Canada
  - Canadian Tissue Repository Network, British Columbia Cancer, Vancouver, Canada
- Jennifer A Byrne
  - New South Wales Health Statewide Biobank, New South Wales Health Pathology, Camperdown, Australia
  - School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
|
41
|
Antoniades A, Papaioannou M, Malatras A, Papagregoriou G, Müller H, Holub P, Deltas C, Schizas CN. Integration of Biobanks in National eHealth Ecosystems Facilitating Long-Term Longitudinal Clinical-Omics Studies and Citizens' Engagement in Research Through eHealthBioR. Front Digit Health 2021; 3:628646. [PMID: 34713101 PMCID: PMC8521893 DOI: 10.3389/fdgth.2021.628646] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2020] [Accepted: 05/11/2021] [Indexed: 11/13/2022]
Abstract
Biobanks have long existed to support research activities with BBMRI-ERIC formed as a European research infrastructure supporting the coordination for biobanking with 20 country members and one international organization. Although the benefits of biobanks to the research community are well-established, the direct benefit to citizens is limited to the generic benefit of promoting future research. Furthermore, the advent of General Data Protection Regulation (GDPR) legislation raised a series of challenges for scientific research, especially related to biobanking-associated activities and longitudinal research studies. Electronic health record (EHR) registries have long existed in healthcare providers. In some countries, even at the national level, these record the state of the health of citizens through time for the purposes of healthcare and data portability between different providers. The potential of EHRs in research is great and has been demonstrated in many projects that have transformed EHR data into retrospective medical history information on participating subjects directly from their physician's collected records; many key challenges, however, remain. In this paper, we present a citizen-centric framework called eHealthBioR, which would enable biobanks to link to EHR systems, thus enabling not just retrospective but also lifelong prospective longitudinal studies of participating citizens. It will also ensure strict adherence to legal and ethical requirements, enabling greater control that encourages participation. Citizens would benefit from the real and direct control of their data and samples, utilizing technology, to empower them to make informed decisions about providing consent and practicing their rights related to the use of their data, as well as by having access to knowledge and data generated from samples they provided to biobanks. This is expected to motivate patient engagement in future research and even lead to participatory design methodologies with citizen/patient-centric designed studies. The development of platforms based on the eHealthBioR framework would need to overcome significant challenges. However, it would shift the burden of addressing these to experts in the field while providing solutions enabling in the long term the lower monetary and time cost of longitudinal studies coupled with the option of lifelong monitoring through EHRs.
Affiliation(s)
- Athos Antoniades
  - eHealth Lab, Department of Computer Science, University of Cyprus, Nicosia, Cyprus
- Maria Papaioannou
  - eHealth Lab, Department of Computer Science, University of Cyprus, Nicosia, Cyprus
- Apostolos Malatras
  - biobank.cy Center of Excellence in Biobanking and Biomedical Research, University of Cyprus, Nicosia, Cyprus
- Gregory Papagregoriou
  - biobank.cy Center of Excellence in Biobanking and Biomedical Research, University of Cyprus, Nicosia, Cyprus
- Heimo Müller
  - Institute of Pathology, Medical University of Graz, Graz, Austria
  - Biobanking and Biomolecular Resources Research Infrastructure - European Research Infrastructure Consortium, Biobanks and Biomolecular Resources Research Infrastructure Consortium, Graz, Austria
- Petr Holub
  - Biobanking and Biomolecular Resources Research Infrastructure - European Research Infrastructure Consortium, Biobanks and Biomolecular Resources Research Infrastructure Consortium, Graz, Austria
- Constantinos Deltas
  - biobank.cy Center of Excellence in Biobanking and Biomedical Research, University of Cyprus, Nicosia, Cyprus
- Christos N Schizas
  - eHealth Lab, Department of Computer Science, University of Cyprus, Nicosia, Cyprus
|
42
|
Coleman JR. The Validity of Brief Phenotyping in Population Biobanks for Psychiatric Genome-Wide Association Studies on the Biobank Scale. Complex Psychiatry 2021; 7:11-15. [PMID: 34883499 PMCID: PMC8443942 DOI: 10.1159/000516837] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/11/2021] [Accepted: 04/14/2021] [Indexed: 11/19/2022]
Affiliation(s)
- Jonathan R.I. Coleman
  - Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom
|
43
|
Hubbard RA. Commentary on Professor Austin Bradford Hill's Alfred Watson Memorial Lecture. Stat Med 2021; 40:29-31. [PMID: 33368363 DOI: 10.1002/sim.8826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2020] [Accepted: 11/06/2020] [Indexed: 11/08/2022]
Abstract
As availability of health care data for research opens up new frontiers in medical statistics, keeping a focus on the science behind the data is more important than ever to promote sound research and protect the validity of research results. Though the electronic databases currently amassed for research far exceed in scale and scope the observational research Professor Hill likely conceived of, his guidance to statisticians to ground our work in the biological and medical processes behind the data remains salient across the decades.
Affiliation(s)
- Rebecca A Hubbard
  - Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
|
44
|
Bi W, Lee S. Scalable and Robust Regression Methods for Phenome-Wide Association Analysis on Large-Scale Biobank Data. Front Genet 2021; 12:682638. [PMID: 34211504 PMCID: PMC8239389 DOI: 10.3389/fgene.2021.682638] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/18/2021] [Accepted: 05/17/2021] [Indexed: 02/05/2023]
Abstract
With the advances in genotyping technologies and electronic health records (EHRs), large biobanks have been great resources to identify novel genetic associations and gene-environment interactions on a genome-wide and even a phenome-wide scale. To date, several phenome-wide association studies (PheWAS) have been performed on biobank data, which provides comprehensive insights into many aspects of human genetics and biology. Although inspiring, PheWAS on large-scale biobank data encounters new challenges including computational burden, unbalanced phenotypic distribution, and genetic relationship. In this paper, we first discuss these new challenges and their potential impact on data analysis. Then, we summarize approaches that are scalable and robust in GWAS and PheWAS. This review can serve as a practical guide for geneticists, epidemiologists, and other medical researchers to identify genetic variations associated with health-related phenotypes in large-scale biobank data analysis. Meanwhile, it can also help statisticians to gain a comprehensive and up-to-date understanding of the current technical tool development.
Affiliation(s)
- Wenjian Bi
  - Department of Medical Genetics, School of Basic Medical Sciences, Peking University, Beijing, China
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, United States
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, United States
- Seunggeun Lee
  - Graduate School of Data Science, Seoul National University, Seoul, South Korea
|
45
|
Bi W, Zhou W, Dey R, Mukherjee B, Sampson JN, Lee S. Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes. Am J Hum Genet 2021; 108:825-839. [PMID: 33836139 DOI: 10.1016/j.ajhg.2021.03.019] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 01/07/2021] [Accepted: 03/22/2021] [Indexed: 12/12/2022]
Abstract
In genome-wide association studies, ordinal categorical phenotypes are widely used to measure human behaviors, satisfaction, and preferences. However, because of the lack of analysis tools, methods designed for binary or quantitative traits are commonly used inappropriately to analyze categorical phenotypes. To accurately model the dependence of an ordinal categorical phenotype on covariates, we propose an efficient mixed model association test, proportional odds logistic mixed model (POLMM). POLMM is computationally efficient to analyze large datasets with hundreds of thousands of samples, can control type I error rates at a stringent significance level regardless of the phenotypic distribution, and is more powerful than alternative methods. In contrast, the standard linear mixed model approaches cannot control type I error rates for rare variants when the phenotypic distribution is unbalanced, although they performed well when testing common variants. We applied POLMM to 258 ordinal categorical phenotypes on array genotypes and imputed samples from 408,961 individuals in UK Biobank. In total, we identified 5,885 genome-wide significant variants, of which, 424 variants (7.2%) are rare variants with MAF < 0.01.
|
46
|
Salvatore M, Beesley LJ, Fritsche LG, Hanauer D, Shi X, Mondul AM, Pearce CL, Mukherjee B. Phenotype risk scores (PheRS) for pancreatic cancer using time-stamped electronic health record data: Discovery and validation in two large biobanks. J Biomed Inform 2021; 113:103652. [PMID: 33279681 PMCID: PMC7855433 DOI: 10.1016/j.jbi.2020.103652] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 08/21/2020] [Revised: 10/27/2020] [Accepted: 11/30/2020] [Indexed: 12/31/2022]
Abstract
BACKGROUND Traditional methods for disease risk prediction and assessment, such as diagnostic tests using serum, urine, blood, saliva or imaging biomarkers, have been important for identifying high-risk individuals for many diseases, leading to early detection and improved survival. For pancreatic cancer, traditional methods for screening have been largely unsuccessful in identifying high-risk individuals in advance of disease progression leading to high mortality and poor survival. Electronic health records (EHR) linked to genetic profiles provide an opportunity to integrate multiple sources of patient information for risk prediction and stratification. We leverage a constellation of temporally associated diagnoses available in the EHR to construct a summary risk score, called a phenotype risk score (PheRS), for identifying individuals at high-risk for having pancreatic cancer. The proposed PheRS approach incorporates the time with respect to disease onset into the prediction framework. We combine and contrast the PheRS with more well-known measures of inherited susceptibility, namely, the polygenic risk scores (PRS) for prediction of pancreatic cancer. METHODOLOGY We first calculated pairwise, unadjusted associations between pancreatic cancer diagnosis and all possible other diagnoses across the medical phenome. We call these pairwise associations co-occurrences. After accounting for cross-phenotype correlations, the multivariable association estimates from a subset of relatively independent diagnoses were used to create a weighted sum PheRS. We constructed time-restricted risk scores using data from 38,359 participants in the Michigan Genomics Initiative (MGI) based on the diagnoses contained in the EHR at 0, 1, 2, and 5 years prior to the target pancreatic cancer diagnosis. The PheRS was assessed for predictability in the UK Biobank (UKB). 
We tested the relative contribution of PheRS when added to a model containing a summary measure of inherited genetic susceptibility (PRS) plus other covariates like age, sex, smoking status, drinking status, and body mass index (BMI). RESULTS Our exploration of co-occurrence patterns identified expected associations while also revealing unexpected relationships that may warrant closer attention. Solely using the pancreatic cancer PheRS at 5 years before the target diagnoses yielded an AUC of 0.60 (95% CI = [0.58, 0.62]) in UKB. A larger predictive model including PheRS, PRS, and the covariates at the 5-year threshold achieved an AUC of 0.74 (95% CI = [0.72, 0.76]) in UKB. We note that PheRS does contribute independently in the joint model. Finally, scores at the top percentiles of the PheRS distribution demonstrated promise in terms of risk stratification. Scores in the top 2% were 10.20 (95% CI = [9.34, 12.99]) times more likely to identify cases than those in the bottom 98% in UKB at the 5-year threshold prior to pancreatic cancer diagnosis. CONCLUSIONS We developed a framework for creating a time-restricted PheRS from EHR data for pancreatic cancer using the rich information content of a medical phenome. In addition to identifying hypothesis-generating associations for future research, this PheRS demonstrates a potentially important contribution in identifying high-risk individuals, even after adjusting for PRS for pancreatic cancer and other traditional epidemiologic covariates. The methods are generalizable to other phenotypic traits.
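As a rough illustration of the weighted-sum, time-restricted score this abstract describes, the sketch below adds up per-diagnosis weights for diagnoses recorded at least a given number of years before a target date. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `time_restricted_score`, the diagnosis codes, the weights, and the dates are all hypothetical, and the weights merely stand in for the multivariable association estimates mentioned in the abstract.

```python
from typing import Mapping


def time_restricted_score(
    first_diagnosis_year: Mapping[str, float],
    weights: Mapping[str, float],
    target_year: float,
    lag_years: float,
) -> float:
    """Weighted sum over diagnoses first recorded at least `lag_years`
    before `target_year`. All names and inputs here are hypothetical."""
    cutoff = target_year - lag_years
    return sum(
        (
            weight
            for code, weight in weights.items()
            # Count a diagnosis only if the patient has it and it was
            # recorded on or before the cutoff.
            if code in first_diagnosis_year and first_diagnosis_year[code] <= cutoff
        ),
        0.0,
    )


# Hypothetical weights (e.g. log odds ratios) for three made-up codes.
example_weights = {"dx_a": 0.8, "dx_b": 0.3, "dx_c": -0.2}
# Hypothetical patient: year of first record for each diagnosis held.
example_patient = {"dx_a": 2010.0, "dx_b": 2016.5}

for lag in (0, 1, 2, 5):
    print(lag, time_restricted_score(example_patient, example_weights, 2018.0, lag))
```

Widening the lag drops recent diagnoses from the sum, which is the sense in which a score computed at the 5-year threshold uses only information available well before the target diagnosis.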
Affiliation(s)
- Maxwell Salvatore
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, United States
- Lauren J Beesley
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, United States
- Lars G Fritsche
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, United States; Rogel Cancer Center, University of Michigan Medicine, Ann Arbor, MI 48109, United States; Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI, United States
- David Hanauer
  - Department of Pediatrics, University of Michigan Medical School, Ann Arbor, MI 48109, United States
- Xu Shi
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, United States
- Alison M Mondul
  - Department of Epidemiology, University of Michigan School of Public Health, Ann Arbor, MI 48109, United States
- Celeste Leigh Pearce
  - Department of Epidemiology, University of Michigan School of Public Health, Ann Arbor, MI 48109, United States
- Bhramar Mukherjee
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, United States.
|
47
|
Beesley LJ, Mukherjee B. Statistical inference for association studies using electronic health records: handling both selection bias and outcome misclassification. Biometrics 2020; 78:214-226. [PMID: 33179768 DOI: 10.1111/biom.13400] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 12/18/2019] [Revised: 10/26/2020] [Accepted: 10/29/2020] [Indexed: 12/27/2022]
Abstract
Health research using electronic health records (EHR) has gained popularity, but misclassification of EHR-derived disease status and lack of representativeness of the study sample can result in substantial bias in effect estimates and can impact power and type I error. In this paper, we develop new strategies for handling disease status misclassification and selection bias in EHR-based association studies. We first focus on each type of bias separately. For misclassification, we propose three novel likelihood-based bias correction strategies. A distinguishing feature of the EHR setting is that misclassification may be related to patient-varying factors, and the proposed methods leverage data in the EHR to estimate misclassification rates without gold standard labels. For addressing selection bias, we describe how calibration and inverse probability weighting methods from the survey sampling literature can be extended and applied to the EHR setting. Addressing misclassification and selection biases simultaneously is a more challenging problem than dealing with each on its own, and we propose several new strategies. For all methods proposed, we derive valid standard error estimators and provide software for implementation. We provide a new suite of statistical estimation and inference strategies for addressing misclassification and selection bias simultaneously that is tailored to problems arising in EHR data analysis. We apply these methods to data from The Michigan Genomics Initiative, a longitudinal EHR-linked biorepository.
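To make the inverse probability weighting idea mentioned in this abstract concrete, the sketch below computes a weighted mean in which each sampled record is weighted by the reciprocal of its selection probability. It is a generic textbook-style (Hájek-type) estimator written under the assumption that selection probabilities are known, not the estimators proposed in the paper; the function name `ipw_mean` and the toy numbers are hypothetical.

```python
from typing import Sequence


def ipw_mean(outcomes: Sequence[float], selection_probs: Sequence[float]) -> float:
    """Inverse-probability-weighted mean of `outcomes`.

    Each record is weighted by 1 / P(selected), so records from groups
    that are under-represented in the sample count for more.
    """
    if len(outcomes) != len(selection_probs) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(p <= 0.0 or p > 1.0 for p in selection_probs):
        raise ValueError("selection probabilities must lie in (0, 1]")
    weights = [1.0 / p for p in selection_probs]
    weighted_total = sum(w * y for w, y in zip(weights, outcomes))
    return weighted_total / sum(weights)


# Toy data: diseased patients (outcome 1) are over-sampled at 0.8,
# healthy patients (outcome 0) are sampled at 0.2.
toy_outcomes = [1.0, 1.0, 0.0, 0.0]
toy_probs = [0.8, 0.8, 0.2, 0.2]

naive = sum(toy_outcomes) / len(toy_outcomes)
print("naive mean:", naive)
print("IPW mean:", ipw_mean(toy_outcomes, toy_probs))
```

In this toy sample the unweighted mean overstates prevalence because cases were far more likely to be selected; weighting by the inverse selection probability pulls the estimate back toward the value implied by the sampling rates. In practice the selection probabilities are rarely known and have to be modeled, which is part of what the abstract's calibration and weighting methods address.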
Affiliation(s)
- Lauren J Beesley
  - Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan, USA
- Bhramar Mukherjee
  - Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan, USA
|
48
|
Fritsche LG, Patil S, Beesley LJ, VandeHaar P, Salvatore M, Ma Y, Peng RB, Taliun D, Zhou X, Mukherjee B. Cancer PRSweb: An Online Repository with Polygenic Risk Scores for Major Cancer Traits and Their Evaluation in Two Independent Biobanks. Am J Hum Genet 2020; 107:815-836. [PMID: 32991828 PMCID: PMC7675001 DOI: 10.1016/j.ajhg.2020.08.025] [Citation(s) in RCA: 55] [Impact Index Per Article: 13.8] [Received: 02/23/2020] [Accepted: 08/28/2020] [Indexed: 02/06/2023]
Abstract
To facilitate scientific collaboration on polygenic risk scores (PRSs) research, we created an extensive PRS online repository for 35 common cancer traits integrating freely available genome-wide association studies (GWASs) summary statistics from three sources: published GWASs, the NHGRI-EBI GWAS Catalog, and UK Biobank-based GWASs. Our framework condenses these summary statistics into PRSs using various approaches such as linkage disequilibrium pruning/p value thresholding (fixed or data-adaptively optimized thresholds) and penalized, genome-wide effect size weighting. We evaluated the PRSs in two biobanks: the Michigan Genomics Initiative (MGI), a longitudinal biorepository effort at Michigan Medicine, and the population-based UK Biobank (UKB). For each PRS construct, we provide measures on predictive performance and discrimination. Besides PRS evaluation, the Cancer-PRSweb platform features construct downloads and phenome-wide PRS association study results (PRS-PheWAS) for predictive PRSs. We expect this integrated platform to accelerate PRS-related cancer research.
Affiliation(s)
- Lars G Fritsche
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Precision Health Data Science, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; University of Michigan Rogel Cancer Center, University of Michigan, Ann Arbor, MI 48109, USA.
- Snehal Patil
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA
- Lauren J Beesley
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Precision Health Data Science, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA
- Peter VandeHaar
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA
- Maxwell Salvatore
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Precision Health Data Science, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA
- Ying Ma
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA
- Robert B Peng
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Precision Health Data Science, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Department of Statistics, Northwestern University, Evanston, IL 60208, USA
- Daniel Taliun
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA
- Xiang Zhou
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Precision Health Data Science, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA
- Bhramar Mukherjee
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Precision Health Data Science, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Michigan Institute for Data Science, University of Michigan, Ann Arbor, MI 48109, USA; Department of Epidemiology, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; University of Michigan Rogel Cancer Center, University of Michigan, Ann Arbor, MI 48109, USA.
|
49
|
King C, Mulugeta A, Nabi F, Walton R, Zhou A, Hyppönen E. Mendelian randomization case-control PheWAS in UK Biobank shows evidence of causality for smoking intensity in 28 distinct clinical conditions. EClinicalMedicine 2020; 26:100488. [PMID: 33089118 PMCID: PMC7564324 DOI: 10.1016/j.eclinm.2020.100488] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 04/30/2020] [Revised: 07/14/2020] [Accepted: 07/15/2020] [Indexed: 12/30/2022]
Abstract
BACKGROUND Smoking is one of the greatest threats to public health worldwide. We integrated phenome-wide association study (PheWAS) and Mendelian randomization (MR) approaches to explore causal effects of genetically predicted smoking intensity across the human disease spectrum. METHODS We conducted PheWAS case-control analyses in 152,483 ever smokers of White-British ancestry, aged 39-73 years. Disease diagnoses were based on hospital inpatient and mortality registrations. Smoking intensity was instrumented by four genetic variants, and disease risks estimated for one cigarette per day heavier intakes. Associations passing the FDR threshold (p<0.0025) were assessed for causality using several complementary MR approaches. FINDINGS Genetically instrumented smoking intensity was associated with 48 conditions, with MR supporting a possible causal effect for 28 distinct outcomes. Each cigarette smoked per day elevated the odds of respiratory diseases by 5% to 33% (nine distinct diseases, including pneumonia, emphysema, obstructive chronic bronchitis, pleurisy, pulmonary collapse, respiratory failure) and the odds of circulatory disease by 5% to 23% (seven diseases, including atherosclerosis, myocardial infarction, congestive heart failure, arterial embolisms). Further effects were seen for cancer within the respiratory system and other neoplasms, renal failure, septicaemia, and retinal disorders. No associations were observed in sensitivity analyses on 185,002 never smokers. INTERPRETATION These genetic data demonstrate the substantial adverse health impacts by smoking intensity and suggest notable increases in the risks of several diseases. Public health initiatives should highlight the damage caused by smoking intensity and the potential benefits of reducing or ideally quitting smoking.
Collapse
Affiliation(s)
- Catherine King
  - Australian Centre for Precision Health, University of South Australia Cancer Research Institute, Adelaide, SA 5001, Australia
  - South Australian Health and Medical Research Institute, Adelaide, Australia
  - UniSA Clinical and Health Sciences, University of South Australia, Adelaide, SA, Australia
- Anwar Mulugeta
  - Australian Centre for Precision Health, University of South Australia Cancer Research Institute, Adelaide, SA 5001, Australia
  - South Australian Health and Medical Research Institute, Adelaide, Australia
  - Department of Pharmacology and Clinical Sciences, College of Health Sciences, Addis Ababa University, Addis Ababa, Ethiopia
- Farhana Nabi
  - Australian Centre for Precision Health, University of South Australia Cancer Research Institute, Adelaide, SA 5001, Australia
- Robert Walton
  - Asthma UK Centre for Applied Research, Barts Institute of Population Health Sciences, Queen Mary University of London, London, United Kingdom
- Ang Zhou
  - Australian Centre for Precision Health, University of South Australia Cancer Research Institute, Adelaide, SA 5001, Australia
  - South Australian Health and Medical Research Institute, Adelaide, Australia
- Elina Hyppönen
  - Australian Centre for Precision Health, University of South Australia Cancer Research Institute, Adelaide, SA 5001, Australia
  - South Australian Health and Medical Research Institute, Adelaide, Australia
  - UniSA Clinical and Health Sciences, University of South Australia, Adelaide, SA, Australia
  - Corresponding author.
| |
|
50
|
Bi W, Fritsche LG, Mukherjee B, Kim S, Lee S. A Fast and Accurate Method for Genome-Wide Time-to-Event Data Analysis and Its Application to UK Biobank. Am J Hum Genet 2020; 107:222-233. [PMID: 32589924 DOI: 10.1016/j.ajhg.2020.06.003] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Received: 02/07/2020] [Accepted: 06/03/2020] [Indexed: 12/09/2022]
Abstract
With increasing biobanking efforts connecting electronic health records and national registries to germline genetics, the time-to-event data analysis has attracted increasing attention in the genetics studies of human diseases. In time-to-event data analysis, the Cox proportional hazards (PH) regression model is one of the most used approaches. However, existing methods and tools are not scalable when analyzing a large biobank with hundreds of thousands of samples and endpoints, and they are not accurate when testing low-frequency and rare variants. Here, we propose a scalable and accurate method, SPACox (a saddlepoint approximation implementation based on the Cox PH regression model), that is applicable for genome-wide scale time-to-event data analysis. SPACox requires fitting a Cox PH regression model only once across the genome-wide analysis and then uses a saddlepoint approximation (SPA) to calibrate the test statistics. Simulation studies show that SPACox is 76-252 times faster than other existing alternatives, such as gwasurvivr, 185-511 times faster than the standard Wald test, and more than 6,000 times faster than the Firth correction and can control type I error rates at the genome-wide significance level regardless of minor allele frequencies. Through the analysis of UK Biobank inpatient data of 282,871 white British European ancestry samples, we show that SPACox can efficiently analyze large sample sizes and accurately control type I error rates. We identified 611 loci associated with time-to-event phenotypes of 12 common diseases, of which 38 loci would be missed within a logistic regression framework with a binary phenotype defined as event occurrence status during the follow-up period.
|