1
Kontou PI, Bagos PG. The goldmine of GWAS summary statistics: a systematic review of methods and tools. BioData Min 2024; 17:31. [PMID: 39238044 PMCID: PMC11375927 DOI: 10.1186/s13040-024-00385-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2024] [Accepted: 08/27/2024] [Indexed: 09/07/2024] Open
Abstract
Genome-wide association studies (GWAS) have revolutionized our understanding of the genetic architecture of complex traits and diseases. GWAS summary statistics have become essential tools for various genetic analyses, including meta-analysis, fine-mapping, and risk prediction. However, the increasing number of GWAS summary statistics and the diversity of software tools available for their analysis can make it challenging for researchers to select the most appropriate tools for their specific needs. This systematic review aims to provide a comprehensive overview of the currently available software tools and databases for GWAS summary statistics analysis. We conducted a comprehensive literature search to identify relevant software tools and databases. We categorized the tools and databases by their functionality, including data management, quality control, single-trait analysis, and multiple-trait analysis. We also compared the tools and databases based on their features, limitations, and user-friendliness. Our review identified a total of 305 functioning software tools and databases dedicated to GWAS summary statistics, each with unique strengths and limitations. We provide descriptions of the key features of each tool and database, including their input/output formats, data types, and computational requirements. We also discuss the overall usability and applicability of each tool for different research scenarios. This comprehensive review will serve as a valuable resource for researchers who are interested in using GWAS summary statistics to investigate the genetic basis of complex traits and diseases. By providing a detailed overview of the available tools and databases, we aim to facilitate informed tool selection and maximize the effectiveness of GWAS summary statistics analysis.
Affiliation(s)
- Pantelis G Bagos
- Department of Computer Science and Biomedical Informatics, University of Thessaly, 35131, Lamia, Greece.
2
Wendelborn C, Anger M, Schickhardt C. Promoting Data Sharing: The Moral Obligations of Public Funding Agencies. SCIENCE AND ENGINEERING ETHICS 2024; 30:35. [PMID: 39105890 PMCID: PMC11303567 DOI: 10.1007/s11948-024-00491-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2022] [Accepted: 06/08/2024] [Indexed: 08/07/2024]
Abstract
Sharing research data has great potential to benefit science and society. However, data sharing is still not common practice. Since public research funding agencies have a particular impact on research and researchers, the question arises: Are public funding agencies morally obligated to promote data sharing? We argue from a research ethics perspective that public funding agencies have several pro tanto obligations requiring them to promote data sharing. However, there are also pro tanto obligations that speak against promoting data sharing in general as well as with regard to particular instruments of such promotion. We examine and weigh these obligations and conclude that all things considered funders ought to promote the sharing of data. Even the instrument of mandatory data sharing policies can be justified under certain conditions.
Affiliation(s)
- Christian Wendelborn
- Section for Translational Medical Ethics, German Cancer Research Center (DKFZ), National Center for Tumor Diseases (NCT) Heidelberg, Heidelberg, Germany.
- University of Konstanz, Konstanz, Germany.
- Michael Anger
- Section for Translational Medical Ethics, German Cancer Research Center (DKFZ), National Center for Tumor Diseases (NCT) Heidelberg, Heidelberg, Germany
- Christoph Schickhardt
- Section for Translational Medical Ethics, German Cancer Research Center (DKFZ), National Center for Tumor Diseases (NCT) Heidelberg, Heidelberg, Germany
3
Gadotti A, Rocher L, Houssiau F, Creţu AM, de Montjoye YA. Anonymization: The imperfect science of using data while preserving privacy. SCIENCE ADVANCES 2024; 10:eadn7053. [PMID: 39018389 PMCID: PMC466941 DOI: 10.1126/sciadv.adn7053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Accepted: 06/10/2024] [Indexed: 07/19/2024]
Abstract
Information about us, our actions, and our preferences is created at scale through surveys or scientific studies or as a result of our interaction with digital devices such as smartphones and fitness trackers. The ability to safely share and analyze such data is key for scientific and societal progress. Anonymization is considered by scientists and policy-makers as one of the main ways to share data while minimizing privacy risks. In this review, we offer a pragmatic perspective on the modern literature on privacy attacks and anonymization techniques. We discuss traditional de-identification techniques and their strong limitations in the age of big data. We then turn our attention to modern approaches to share anonymous aggregate data, such as data query systems, synthetic data, and differential privacy. We find that, although no perfect solution exists, applying modern techniques while auditing their guarantees against attacks is the best approach to safely use and share data today.
Affiliation(s)
- Andrea Gadotti
- Imperial College London, Exhibition Road, London SW7 2AZ, UK
- University of Oxford, Wellington Square, Oxford OX1 2JD, UK
- Luc Rocher
- Imperial College London, Exhibition Road, London SW7 2AZ, UK
- University of Oxford, Wellington Square, Oxford OX1 2JD, UK
- Florimond Houssiau
- Imperial College London, Exhibition Road, London SW7 2AZ, UK
- Alan Turing Institute, 96 Euston Road, London NW1 2DB, UK
- Ana-Maria Creţu
- Imperial College London, Exhibition Road, London SW7 2AZ, UK
- EPFL, CH-1015 Lausanne, Switzerland
4
Creţu AM, Guépin F, de Montjoye YA. Correlation inference attacks against machine learning models. SCIENCE ADVANCES 2024; 10:eadj9260. [PMID: 38985874 DOI: 10.1126/sciadv.adj9260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 06/05/2024] [Indexed: 07/12/2024]
Abstract
Despite machine learning models being widely used today, the relationship between a model and its training dataset is not well understood. We explore correlation inference attacks, whether and when a model leaks information about the correlations between the input variables of its training dataset. We first propose a model-less attack, where an adversary exploits the spherical parameterization of correlation matrices alone to make an informed guess. Second, we propose a model-based attack, where an adversary exploits black-box model access to infer the correlations using minimal and realistic assumptions. Third, we evaluate our attacks against logistic regression and multilayer perceptron models on three tabular datasets and show the models to leak correlations. We lastly show how extracted correlations can be used as building blocks for attribute inference attacks and enable weaker adversaries. Our results raise fundamental questions on what a model does and should remember from its training set.
5
Li W, Chen H, Jiang X, Harmanci A. FedGMMAT: Federated generalized linear mixed model association tests. PLoS Comput Biol 2024; 20:e1012142. [PMID: 39047024 PMCID: PMC11299833 DOI: 10.1371/journal.pcbi.1012142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2023] [Revised: 08/05/2024] [Accepted: 05/07/2024] [Indexed: 07/27/2024] Open
Abstract
Increasing genetic and phenotypic data size is critical for understanding the genetic determinants of diseases. Evidently, establishing practical means for collaboration and data sharing among institutions is a fundamental methodological barrier for performing high-powered studies. As the sample sizes become more heterogeneous, complex statistical approaches, such as generalized linear mixed effects models, must be used to correct for the confounders that may bias results. On another front, due to the privacy concerns around Protected Health Information (PHI), the sharing of genetic information is restricted by regulations such as the Health Insurance Portability and Accountability Act (HIPAA). This limits data sharing among institutions and hampers efforts to execute high-powered collaborative studies. Federated approaches are promising for alleviating the issues around privacy and performance, since sensitive data never leaves the local sites. Motivated by these considerations, we developed FedGMMAT, a federated genetic association testing tool that utilizes a federated statistical testing approach for efficient association tests that can correct for confounding fixed and additive polygenic random effects among different collaborating sites. Genetic data is never shared among collaborating sites, and the intermediate statistics are protected by encryption. Using simulated and real datasets, we demonstrate that FedGMMAT can achieve virtually the same results as pooled analysis under a privacy-preserving framework with practical resource requirements.
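The claim that a federated analysis can match a pooled one is easiest to see in a much simpler setting than the one the abstract describes. The sketch below is not FedGMMAT's algorithm: it uses plain one-covariate least squares instead of a generalized linear mixed model, omits encryption entirely, and all function names and toy numbers are hypothetical. It only illustrates the general idea that sites can exchange aggregate sums rather than individual-level rows.

```python
# Hypothetical illustration: each site shares only aggregate sums, never rows.
# This is ordinary least squares, NOT the GLMM used by FedGMMAT, and no
# encryption of the intermediate statistics is modelled here.

def local_summaries(x, y):
    """Per-site sufficient statistics for the regression y = a + b * x."""
    return {
        "n": len(x),
        "sx": sum(x),
        "sy": sum(y),
        "sxx": sum(v * v for v in x),
        "sxy": sum(a * b for a, b in zip(x, y)),
    }

def combine(summaries):
    """Add up the per-site statistics at a coordinating server."""
    total = {k: 0.0 for k in ("n", "sx", "sy", "sxx", "sxy")}
    for s in summaries:
        for k in total:
            total[k] += s[k]
    return total

def fit(stats):
    """Closed-form least-squares slope and intercept from aggregated sums."""
    n, sx, sy, sxx, sxy = (stats[k] for k in ("n", "sx", "sy", "sxx", "sxy"))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Two toy "sites" holding made-up genotype dosages (x) and phenotypes (y).
site_a = ([0, 1, 2, 1], [1.0, 2.1, 2.9, 2.0])
site_b = ([2, 0, 1], [3.2, 0.8, 1.9])

# Federated fit: only the five sums per site reach the server.
federated = fit(combine([local_summaries(*site_a), local_summaries(*site_b)]))

# Pooled fit: all rows in one place, for comparison.
pooled = fit(local_summaries(site_a[0] + site_b[0], site_a[1] + site_b[1]))
```

Because the least-squares estimate depends on the data only through these sums, `federated` and `pooled` agree up to floating-point rounding. Mixed models need iterative exchanges of more elaborate intermediate statistics, which is where the encryption mentioned in the abstract comes in.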
Affiliation(s)
- Wentao Li
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, United States of America
- Han Chen
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, United States of America
- School of Public Health, University of Texas Health Science Center at Houston, Houston, Texas, United States of America
- Xiaoqian Jiang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, United States of America
- Arif Harmanci
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, United States of America
6
Omidiran O, Patel A, Usman S, Mhatre I, Abdelhalim H, DeGroat W, Narayanan R, Singh K, Mendhe D, Ahmed Z. GWAS advancements to investigate disease associations and biological mechanisms. CLINICAL AND TRANSLATIONAL DISCOVERY 2024; 4:e296. [PMID: 38737752 PMCID: PMC11086745 DOI: 10.1002/ctd2.296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2024] [Accepted: 04/16/2024] [Indexed: 05/14/2024]
Abstract
Genome-wide association studies (GWAS) have been instrumental in elucidating the genetic architecture of various traits and diseases. Despite the success of GWAS, inherent limitations, such as difficulty identifying rare and ultra-rare variants, the potential for spurious associations, and difficulty pinpointing causative agents, can undermine diagnostic capabilities. This review provides an overview of GWAS and highlights recent advances in genetics that employ a range of methodologies, including Whole Genome Sequencing (WGS), Mendelian Randomization (MR), the Pangenome's high-quality T2T-CHM13 panel, and the Human BioMolecular Atlas Program (HuBMAP), as potential enablers of current and future GWAS research. The current literature demonstrates the capabilities of these techniques in enhancing the statistical power of GWAS. WGS, with its comprehensive approach, captures the entire genome, surpassing the capabilities of the traditional GWAS technique focused on predefined Single Nucleotide Polymorphism (SNP) sites. The Pangenome's T2T-CHM13 panel, with its holistic approach, aids in the analysis of regions with high sequence identity, such as segmental duplications (SDs). Mendelian Randomization has advanced causative inference, improving clinical diagnostics and facilitating definitive conclusions. Furthermore, spatial biology techniques like HuBMAP enable 3D molecular mapping of tissues at single-cell resolution, offering insights into the pathology of complex traits. This study aims to elucidate and advocate for the increased application of these technologies, highlighting their potential to shape the future of GWAS research.
Affiliation(s)
- Oluwaferanmi Omidiran
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
- Aashna Patel
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
- Sarah Usman
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
- Ishani Mhatre
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
- Habiba Abdelhalim
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
- William DeGroat
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
- Rishabh Narayanan
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
- Kritika Singh
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
- Dinesh Mendhe
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
- Zeeshan Ahmed
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson St, New Brunswick, NJ, USA
- Department of Medicine, Robert Wood Johnson Medical School, Rutgers Biomedical and Health Sciences, 125 Paterson St, New Brunswick, NJ, USA
7
Brauneck A, Schmalhorst L, Weiss S, Baumbach L, Völker U, Ellinghaus D, Baumbach J, Buchholtz G. Legal aspects of privacy-enhancing technologies in genome-wide association studies and their impact on performance and feasibility. Genome Biol 2024; 25:154. [PMID: 38872191 PMCID: PMC11170858 DOI: 10.1186/s13059-024-03296-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Accepted: 06/03/2024] [Indexed: 06/15/2024] Open
Abstract
Genomic data holds huge potential for medical progress but requires strict safety measures due to its sensitive nature to comply with data protection laws. This conflict is especially pronounced in genome-wide association studies (GWAS) which rely on vast amounts of genomic data to improve medical diagnoses. To ensure both their benefits and sufficient data security, we propose a federated approach in combination with privacy-enhancing technologies utilising the findings from a systematic review on federated learning and legal regulations in general and applying these to GWAS.
Affiliation(s)
- Alissa Brauneck
- Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany.
- Louisa Schmalhorst
- Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany
- Stefan Weiss
- Interfaculty Institute of Genetics and Functional Genomics, Department of Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- Linda Baumbach
- Department of Health Economics and Health Services Research, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Uwe Völker
- Interfaculty Institute of Genetics and Functional Genomics, Department of Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- David Ellinghaus
- Institute of Clinical Molecular Biology (IKMB), Kiel University and University Medical Center Schleswig-Holstein, Kiel, Germany
- Jan Baumbach
- Institute for Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Gabriele Buchholtz
- Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany
8
Thomas M, Mackes N, Preuss-Dodhy A, Wieland T, Bundschus M. Assessing Privacy Vulnerabilities in Genetic Data Sets: Scoping Review. JMIR BIOINFORMATICS AND BIOTECHNOLOGY 2024; 5:e54332. [PMID: 38935957 PMCID: PMC11165293 DOI: 10.2196/54332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Revised: 03/26/2024] [Accepted: 03/29/2024] [Indexed: 06/29/2024]
Abstract
BACKGROUND: Genetic data are widely considered inherently identifiable. However, genetic data sets come in many shapes and sizes, and the feasibility of privacy attacks depends on their specific content. Assessing the reidentification risk of genetic data is complex, yet there is a lack of guidelines or recommendations that support data processors in performing such an evaluation. OBJECTIVE: This study aims to gain a comprehensive understanding of the privacy vulnerabilities of genetic data and create a summary that can guide data processors in assessing the privacy risk of genetic data sets. METHODS: We conducted a 2-step search, in which we first identified 21 reviews published between 2017 and 2023 on the topic of genomic privacy and then analyzed all references cited in the reviews (n=1645) to identify 42 unique original research studies that demonstrate a privacy attack on genetic data. We then evaluated the type and components of genetic data exploited for these attacks as well as the effort and resources needed for their implementation and their probability of success. RESULTS: From our literature review, we derived 9 nonmutually exclusive features of genetic data that are both inherent to any genetic data set and informative about privacy risk: biological modality, experimental assay, data format or level of processing, germline versus somatic variation content, content of single nucleotide polymorphisms, short tandem repeats, aggregated sample measures, structural variants, and rare single nucleotide variants. CONCLUSIONS: On the basis of our literature review, the evaluation of these 9 features covers the great majority of privacy-critical aspects of genetic data and thus provides a foundation and guidance for assessing genetic data risk.
9
Drechsler J, Pauly H. [Re-identification potential of structured health data]. Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz 2024; 67:164-170. [PMID: 38231225 PMCID: PMC10834562 DOI: 10.1007/s00103-023-03820-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 12/06/2023] [Indexed: 01/18/2024]
Abstract
Broad access to health data offers great potential for science and research. However, health data often contains sensitive information that must be protected in a special way. In this context, the article deals with the re-identification potential of health data. After defining the relevant terms, we discuss factors that influence the re-identification potential. We summarize international privacy standards for health data and highlight the importance of background knowledge. Given that the reidentification potential is often underestimated in practice, we present strategies for mitigation based on the Five Safes concept. We also discuss classical data protection strategies as well as methods for generating synthetic health data. The article concludes with a brief discussion and outlook on the planned Health Data Lab at the Federal Institute for Drugs and Medical Devices.
Affiliation(s)
- Jörg Drechsler
- Institut für Arbeitsmarkt- und Berufsforschung (IAB), Regensburger Str. 104, 90478, Nürnberg, Deutschland.
- Universität Mannheim, Mannheim, Deutschland.
- Joint Program in Survey Methodology (JPSM), University of Maryland, College Park, MD, USA.
- Hannah Pauly
- Forschungsdatenzentrum Gesundheit, Bundesinstitut für Arzneimittel und Medizinprodukte (BfArM), Bonn, Deutschland
10
Greenfest‐Allen E, Valladares O, Kuksa PP, Gangadharan P, Lee W, Cifello J, Katanic Z, Kuzma AB, Wheeler N, Bush WS, Leung YY, Schellenberg G, Stoeckert CJ, Wang L. NIAGADS Alzheimer's GenomicsDB: A resource for exploring Alzheimer's disease genetic and genomic knowledge. Alzheimers Dement 2024; 20:1123-1136. [PMID: 37881831 PMCID: PMC10916966 DOI: 10.1002/alz.13509] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2023] [Revised: 08/25/2023] [Accepted: 09/21/2023] [Indexed: 10/27/2023]
Abstract
INTRODUCTION: The National Institute on Aging Genetics of Alzheimer's Disease Data Storage Site Alzheimer's Genomics Database (GenomicsDB) is a public knowledge base of Alzheimer's disease (AD) genetic datasets and genomic annotations. METHODS: GenomicsDB uses a custom systems architecture to adopt and enforce rigorous standards that facilitate harmonization of AD-relevant genome-wide association study summary statistics datasets with functional annotations, including over 230 million annotated variants from the AD Sequencing Project. RESULTS: GenomicsDB generates interactive reports compiled from the harmonized datasets and annotations. These reports contextualize AD-risk associations in a broader functional genomic setting and summarize them in the context of functionally annotated genes and variants. DISCUSSION: Created to make AD-genetics knowledge more accessible to AD researchers, the GenomicsDB is designed to guide users unfamiliar with genetic data in not only exploring but also interpreting this ever-growing volume of data. Scalable and interoperable with other genomics resources using data technology standards, the GenomicsDB can serve as a central hub for research and data analysis on AD and related dementias. HIGHLIGHTS: The National Institute on Aging Genetics of Alzheimer's Disease Data Storage Site (NIAGADS) offers to the public a unique, disease-centric collection of AD-relevant GWAS summary statistics datasets. Interpreting these data is challenging and requires significant bioinformatics expertise to standardize datasets and harmonize them with functional annotations on genome-wide scales. The NIAGADS Alzheimer's GenomicsDB helps overcome these challenges by providing a user-friendly public knowledge base for AD-relevant genetics that shares harmonized, annotated summary statistics datasets from the NIAGADS repository in an interpretable, easily searchable format.
Affiliation(s)
- Emily Greenfest‐Allen
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Otto Valladares
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Pavel P. Kuksa
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Prabhakaran Gangadharan
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Wan‐Ping Lee
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Jeffrey Cifello
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Zivadin Katanic
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Amanda B. Kuzma
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Nicholas Wheeler
- Cleveland Institute for Computational Biology, Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, Ohio, USA
- William S. Bush
- Cleveland Institute for Computational Biology, Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, Ohio, USA
- Yuk Yee Leung
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Gerard Schellenberg
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Christian J. Stoeckert
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Li‐San Wang
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
11
Spicker D, Moodie EE, Shortreed SM. Differentially Private Outcome-Weighted Learning for Optimal Dynamic Treatment Regime Estimation. Stat (Int Stat Inst) 2024; 13:e641. [PMID: 39070170 PMCID: PMC11281278 DOI: 10.1002/sta4.641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Accepted: 11/12/2023] [Indexed: 07/30/2024]
Abstract
Precision medicine is a framework for developing evidence-based medical recommendations that seeks to determine the optimal sequence of treatments tailored to all of the relevant patient-level characteristics which are observable. Because precision medicine relies on highly sensitive, patient-level data, ensuring the privacy of participants is of great importance. Dynamic treatment regimes (DTRs) provide one formalization of precision medicine in a longitudinal setting. Outcome-Weighted Learning (OWL) is a family of techniques for estimating optimal DTRs based on observational data. OWL techniques leverage support vector machine (SVM) classifiers in order to perform estimation. SVMs perform classification based on a set of influential points in the data known as support vectors. The classification rule produced by SVMs often requires direct access to the support vectors. Thus, releasing a treatment policy estimated with OWL requires the release of patient data for a subset of patients in the sample. As a result, the classification rules from SVMs constitute a severe privacy violation for those individuals whose data comprise the support vectors. This privacy violation is a major concern, particularly in light of the potentially highly sensitive medical data which are used in DTR estimation. Differential privacy has emerged as a mathematical framework for ensuring the privacy of individual-level data, with provable guarantees on the likelihood that individual characteristics can be determined by an adversary. We provide the first investigation of differential privacy in the context of DTRs and provide a differentially private OWL estimator, with theoretical results allowing us to quantify the cost of privacy in terms of the accuracy of the private estimators.
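For readers unfamiliar with the "provable guarantees" mentioned above, the standard textbook definition of ε-differential privacy is reproduced here. It is given only as background; the article may work with a variant such as (ε, δ)-differential privacy, and the symbols M, D, D' and S are notation chosen for this note rather than taken from the abstract.

```latex
% A randomized mechanism M is \varepsilon-differentially private if, for every
% pair of datasets D and D' differing in one individual's record and for every
% measurable set S of possible outputs,
\[
  \Pr\bigl[\, M(D) \in S \,\bigr] \;\le\; e^{\varepsilon} \, \Pr\bigl[\, M(D') \in S \,\bigr].
\]
```

Smaller values of ε force the two output distributions closer together, so an adversary learns less about any single individual, typically at the price of noisier estimates.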
Affiliation(s)
- Dylan Spicker
- Department of Mathematics and Statistics, University of New Brunswick (Saint John), NB, Canada
- Erica E.M. Moodie
- Department of Epidemiology, Biostatistics, and Occupational Health, McGill University, QC, Canada
- Susan M. Shortreed
- Kaiser Permanente Washington Health Research Institute, WA, USA
- Department of Biostatistics, University of Washington, WA, USA
12
Emani PS, Geradi MN, Gürsoy G, Grasty MR, Miranker A, Gerstein MB. Assessing and mitigating privacy risks of sparse, noisy genotypes by local alignment to haplotype databases. Genome Res 2023; 33:2156-2173. [PMID: 38097386 PMCID: PMC10760520 DOI: 10.1101/gr.278322.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Accepted: 11/18/2023] [Indexed: 01/04/2024]
Abstract
Single nucleotide polymorphisms (SNPs) from omics data create a reidentification risk for individuals and their relatives. Although the ability of thousands of SNPs (especially rare ones) to identify individuals has been repeatedly shown, the availability of small sets of noisy genotypes, from environmental DNA samples or functional genomics data, motivated us to quantify their informativeness. We present a computational tool suite, termed Privacy Leakage by Inference across Genotypic HMM Trajectories (PLIGHT), using population-genetics-based hidden Markov models (HMMs) of recombination and mutation to find piecewise alignment of small, noisy SNP sets to reference haplotype databases. We explore cases in which query individuals are either known to be in the database, or not, and consider several genotype queries, including those from environmental sample swabs from known individuals and from simulated "mosaics" (two-individual composites). Using PLIGHT on a database with ∼5000 haplotypes, we find for common, noise-free SNPs that only ten are sufficient to identify individuals, ∼20 can identify both components in two-individual mosaics, and 20-30 can identify first-order relatives. Using noisy environmental-sample-derived SNPs, PLIGHT identifies individuals in a database using ∼30 SNPs. Even when the individuals are not in the database, local genotype matches allow for some phenotypic information leakage based on coarse-grained SNP imputation. Finally, by quantifying privacy leakage from sparse SNP sets, PLIGHT helps determine the value of selectively sanitizing released SNPs without explicit assumptions about population membership or allele frequency. To make this practical, we provide a sanitization tool to remove the most identifying SNPs from genomic data.
Affiliation(s)
- Prashant S Emani
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - Maya N Geradi
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - Gamze Gürsoy
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - Monica R Grasty
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - Andrew Miranker
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - Mark B Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA
- Department of Statistics and Data Science, Yale University, New Haven, Connecticut 06520, USA
|
13
|
Mosca MJ, Cho H. Reconstruction of private genomes through reference-based genotype imputation. Genome Biol 2023; 24:271. [PMID: 38053191 PMCID: PMC10698978 DOI: 10.1186/s13059-023-03105-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2023] [Accepted: 11/06/2023] [Indexed: 12/07/2023] Open
Abstract
BACKGROUND Genotype imputation is an essential step in genetic studies to improve data quality and statistical power. Public imputation servers are widely used by researchers to impute their data using otherwise access-controlled reference panels of high-fidelity genomes held by these servers. RESULTS We report evidence against the prevailing assumption that providing access to panels only indirectly via imputation servers poses a negligible privacy risk to individuals in the panels. To this end, we present algorithmic strategies for adaptively constructing artificial input samples and interpreting their imputation results that lead to the accurate reconstruction of reference panel haplotypes. We illustrate this possibility on three reference panels of real genomes for a range of imputation tools and output settings. Moreover, we demonstrate that reconstructed haplotypes from the same individual could be linked via their genetic relatives using our Bayesian linking algorithm, which allows a substantial portion of the individual's diploid genome to be reassembled. We also provide population genetic estimates of the proportion of a panel that could be linked when an adversary holds a varying number of genomes from the same population. CONCLUSIONS Our results show that genomes in imputation server reference panels can be vulnerable to reconstruction, implying that additional safeguards may need to be considered. We suggest possible mitigation measures based on our findings. Our work illustrates the value of adversarial algorithms in uncovering new privacy risks to help inform the genomics community towards secure data sharing practices.
Affiliation(s)
| | - Hyunghoon Cho
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Section of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, USA.
|
14
|
Bonomi L, Lionts M, Fan L. Private Continuous Survival Analysis with Distributed Multi-Site Data. PROCEEDINGS : ... IEEE INTERNATIONAL CONFERENCE ON BIG DATA. IEEE INTERNATIONAL CONFERENCE ON BIG DATA 2023; 2023:5444-5453. [PMID: 38585488 PMCID: PMC10997374 DOI: 10.1109/bigdata59044.2023.10386571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/09/2024]
Abstract
Effective disease surveillance systems require large-scale epidemiological data to improve health outcomes and quality of care for the general population. As data may be limited within a single site, multi-site data (e.g., from a number of local/regional health systems) need to be considered. Leveraging distributed data across multiple sites for epidemiological analysis poses significant challenges. Due to the sensitive nature of epidemiological data, it is imperative to design distributed solutions that provide strong privacy protections. Current privacy solutions often assume a central site, which is responsible for aggregating the distributed data and applying privacy protection before sharing the results (e.g., aggregation via secure primitives and differential privacy for sharing aggregate results). However, identifying such a central site may be difficult in practice, and relying on a central site may introduce potential vulnerabilities (e.g., single point of failure). Furthermore, to support clinical interventions and inform policy decisions in a timely manner, epidemiological analyses need to reflect dynamic changes in the data. Yet, existing distributed privacy-protecting approaches were largely designed for static data (e.g., one-time data sharing) and cannot fulfill dynamic data requirements. In this work, we propose a privacy-protecting approach that supports the sharing of dynamic epidemiological analysis and provides strong privacy protection in a decentralized manner. We apply our solution in continuous survival analysis using the Kaplan-Meier estimation model while providing differential privacy protection. Our evaluations on a real dataset containing COVID-19 cases show that our method provides highly usable results.
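To make the combination of Kaplan-Meier estimation and differential privacy concrete, the following is a minimal single-site sketch, not the decentralized, continuous protocol the abstract describes. It assumes the total cohort size `n_total` is public, that each individual contributes exactly one event or one censoring, and that Laplace noise with scale `2 / epsilon` is added to the per-interval counts (replacing one record changes at most two counts by one each). All function and parameter names are hypothetical.

```python
import random


def kaplan_meier(events, censored, n_total):
    """Exact Kaplan-Meier survival estimate after each time interval."""
    survival, s, at_risk = [], 1.0, float(n_total)
    for d, c in zip(events, censored):
        if at_risk > 0:
            s *= 1.0 - d / at_risk
        survival.append(s)
        at_risk -= d + c
    return survival


def laplace(scale, rng):
    """Zero-mean Laplace sample as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_kaplan_meier(events, censored, n_total, epsilon, seed=None):
    """Kaplan-Meier estimate computed from Laplace-noised interval counts.

    Assumption: n_total is public and neighbouring datasets differ by
    replacing one record, so the count histogram has L1 sensitivity 2.
    Python's `random` is used for illustration only; it is not a
    cryptographically secure noise source.
    """
    rng = random.Random(seed)
    scale = 2.0 / epsilon
    survival, s, at_risk = [], 1.0, float(n_total)
    for d, c in zip(events, censored):
        # Noise the raw counts, then clamp; clamping is post-processing.
        noisy_d = max(0.0, d + laplace(scale, rng))
        noisy_c = max(0.0, c + laplace(scale, rng))
        if at_risk > 0:
            noisy_d = min(noisy_d, at_risk)
            s *= 1.0 - noisy_d / at_risk
        survival.append(s)
        # The at-risk count is derived from noisy counts, never released raw.
        at_risk = max(0.0, at_risk - noisy_d - noisy_c)
    return survival
```

Noising the event and censoring counts rather than the at-risk counts is a deliberate choice in this sketch: one person appears in many at-risk counts but in only one event or censoring cell, which keeps the sensitivity small.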
Affiliation(s)
- Luca Bonomi
- Dept. Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
| | - Marilyn Lionts
- Dept. Computer Science, Vanderbilt University, Nashville, TN
| | - Liyue Fan
- College of Computing and Informatics, University of North Carolina, Charlotte, NC
|
15
|
Ayday E, Vaidya J, Jiang X, Telenti A. Ensuring Trust in Genomics Research. ... IEEE INTERNATIONAL CONFERENCE ON TRUST, PRIVACY AND SECURITY IN INTELLIGENT SYSTEMS AND APPLICATIONS : (TPS-ISA ...). IEEE INTERNATIONAL CONFERENCE ON TRUST, PRIVACY AND SECURITY IN INTELLIGENT SYSTEMS AND APPLICATIONS 2023; 2023:1-12. [PMID: 38562180 PMCID: PMC10981793 DOI: 10.1109/tps-isa58951.2023.00011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Reproducibility, transparency, representation, and privacy underpin trust in genomics research in general and genome-wide association studies (GWAS) in particular. Concerns about these issues can be mitigated by technologies that address privacy protection, quality control, and verifiability of GWAS. However, many of the existing technological solutions have been developed in isolation and may address one aspect of reproducibility, transparency, representation, and privacy of GWAS while unknowingly impacting other aspects. As a consequence, the current patchwork of technological tools addresses issues with GWAS only partially and with overlap, sometimes even creating more problems. This paper reviews progress in a field that creates technological solutions to increase the acceptance and security of population genetic analyses. The text identifies areas that are falling behind in technical implementation or where there is insufficient research. We make the case that a full understanding of the different GWAS settings, technological tools, and new research directions can holistically address the requirements for the acceptance of GWAS.
Affiliation(s)
- Erman Ayday
- Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH
| | - Jaideep Vaidya
- Management Science and Information Systems Department, Rutgers University, Newark, NJ
| | - Xiaoqian Jiang
- Department of Data Science and Artificial Intelligence, University of Texas - Health Houston, TX
| | - Amalio Telenti
- Dept. of Integrative Structural and Computational Biology, Scripps Institute, La Jolla, CA
|
16
|
Liang X, Zhao J, Chen Y, Bandara E, Shetty S. Architectural Design of a Blockchain-Enabled, Federated Learning Platform for Algorithmic Fairness in Predictive Health Care: Design Science Study. J Med Internet Res 2023; 25:e46547. [PMID: 37902833 PMCID: PMC10644196 DOI: 10.2196/46547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2023] [Revised: 07/06/2023] [Accepted: 08/21/2023] [Indexed: 10/31/2023] Open
Abstract
BACKGROUND Developing effective and generalizable predictive models is critical for disease prediction and clinical decision-making, often requiring diverse samples to mitigate population bias and address algorithmic fairness. However, a major challenge is to retrieve learning models across multiple institutions without bringing in local biases and inequity, while preserving individual patients' privacy at each site. OBJECTIVE This study aims to understand the issues of bias and fairness in the machine learning process used in the predictive health care domain. We proposed a software architecture that integrates federated learning and blockchain to improve fairness, while maintaining acceptable prediction accuracy and minimizing overhead costs. METHODS We improved existing federated learning platforms by integrating blockchain through an iterative design approach. We used the design science research method, which involves 2 design cycles (federated learning for bias mitigation and decentralized architecture). The design involves a bias-mitigation process within the blockchain-empowered federated learning framework based on a novel architecture. Under this architecture, multiple medical institutions can jointly train predictive models using their privacy-protected data effectively and efficiently and ultimately achieve fairness in decision-making in the health care domain. RESULTS We designed and implemented our solution using the Aplos smart contract, microservices, Rahasak blockchain, and Apache Cassandra-based distributed storage. By conducting 20,000 local model training iterations and 1000 federated model training iterations across 5 simulated medical centers as peers in the Rahasak blockchain network, we demonstrated how our solution with an improved fairness mechanism can enhance the accuracy of predictive diagnosis. CONCLUSIONS Our study identified the technical challenges of prediction biases faced by existing predictive models in the health care domain. 
To overcome these challenges, we presented an innovative design solution using federated learning and blockchain, along with the adoption of a unique distributed architecture for a fairness-aware system. We have illustrated how this design can address privacy, security, prediction accuracy, and scalability challenges, ultimately improving fairness and equity in the predictive health care domain.
Affiliation(s)
- Xueping Liang
- Department of Information Systems and Business Analytics, Florida International University, Miami, FL, United States
| | - Juan Zhao
- American Heart Association, Dallas, TX, United States
| | - Yan Chen
- Department of Information Systems and Business Analytics, Florida International University, Miami, FL, United States
| | - Eranga Bandara
- Virginia Modeling, Analysis and Simulation Center, Old Dominion University, Suffolk, VA, United States
| | - Sachin Shetty
- Virginia Modeling, Analysis and Simulation Center, Old Dominion University, Suffolk, VA, United States
|
17
|
Jarmin RS, Abowd JM, Ashmead R, Cumings-Menon R, Goldschlag N, Hawes MB, Keller SA, Kifer D, Leclerc P, Reiter JP, Rodríguez RA, Schmutte I, Velkoff VA, Zhuravlev P. An in-depth examination of requirements for disclosure risk assessment. Proc Natl Acad Sci U S A 2023; 120:e2220558120. [PMID: 37831744 PMCID: PMC10614951 DOI: 10.1073/pnas.2220558120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2023] Open
Abstract
The use of formal privacy to protect the confidentiality of responses in the 2020 Decennial Census of Population and Housing has triggered renewed interest and debate over how to measure the disclosure risks and societal benefits of the published data products. We argue that any proposal for quantifying disclosure risk should be based on prespecified, objective criteria. We illustrate this approach to evaluate the absolute disclosure risk framework, the counterfactual framework underlying differential privacy, and prior-to-posterior comparisons. We conclude that satisfying all the desiderata is impossible, but counterfactual comparisons satisfy the most while absolute disclosure risk satisfies the fewest. Furthermore, we explain that many of the criticisms levied against differential privacy would be levied against any technology that is not equivalent to direct, unrestricted access to confidential data. More research is needed, but in the near term, the counterfactual approach appears best-suited for privacy versus utility analysis.
Affiliation(s)
- Ron S. Jarmin
- U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
| | - John M. Abowd
- Department of Economics, Cornell University, Ithaca, NY 14853
| | - Robert Ashmead
- U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
| | | | - Nathan Goldschlag
- U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
| | - Michael B. Hawes
- U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
| | - Sallie Ann Keller
- U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
- Biocomplexity Institute, University of Virginia, Charlottesville, VA 22904
| | - Daniel Kifer
- U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
- Department of Computer Science and Engineering, Penn State University, University Park, PA 16802
| | - Philip Leclerc
- U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
| | - Jerome P. Reiter
- U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
- Department of Statistical Science, Duke University, Durham, NC 27708
| | | | - Ian Schmutte
- Department of Economics, University of Georgia, Athens, GA 30602
| | | | - Pavel Zhuravlev
- U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
|
18
|
Blatter TU, Witte H, Fasquelle-Lopez J, Theodoros Naka C, Raisaro JL, Leichtle AB. The BioRef Infrastructure, a Framework for Real-Time, Federated, Privacy-Preserving, and Personalized Reference Intervals: Design, Development, and Application. J Med Internet Res 2023; 25:e47254. [PMID: 37851984 PMCID: PMC10620636 DOI: 10.2196/47254] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Revised: 07/13/2023] [Accepted: 07/14/2023] [Indexed: 10/20/2023] Open
Abstract
BACKGROUND Reference intervals (RIs) for patient test results are in standard use across many medical disciplines, allowing physicians to identify measurements indicating potentially pathological states with relative ease. The process of inferring cohort-specific RIs is, however, often ignored because of the high costs and cumbersome efforts associated with it. Sophisticated analysis tools are required to automatically infer relevant and locally specific RIs directly from routine laboratory data. These tools would effectively connect clinical laboratory databases to physicians and provide personalized target ranges for the respective cohort population. OBJECTIVE This study aims to describe the BioRef infrastructure, a multicentric governance and IT framework for the estimation and assessment of patient group-specific RIs from routine clinical laboratory data using an innovative decentralized data-sharing approach and a sophisticated, clinically oriented graphical user interface for data analysis. METHODS A common governance agreement and interoperability standards have been established, allowing the harmonization of multidimensional laboratory measurements from multiple clinical databases into a unified "big data" resource. International coding systems, such as the International Classification of Diseases, Tenth Revision (ICD-10); unique identifiers for medical devices from the Global Unique Device Identification Database; type identifiers from the Global Medical Device Nomenclature; and a universal transfer logic, such as the Resource Description Framework (RDF), are used to align the routine laboratory data of each data provider for use within the BioRef framework. With a decentralized data-sharing approach, the BioRef data can be evaluated by end users from each cohort site following a strict "no copy, no move" principle, that is, only data aggregates for the intercohort analysis of target ranges are exchanged. 
RESULTS The TI4Health distributed and secure analytics system was used to implement the proposed federated and privacy-preserving approach and comply with the limitations applied to sensitive patient data. Under the BioRef interoperability consensus, clinical partners enable the computation of RIs via the TI4Health graphical user interface for query without exposing the underlying raw data. The interface was developed for use by physicians and clinical laboratory specialists and allows intuitive and interactive data stratification by patient factors (age, sex, and personal medical history) as well as laboratory analysis determinants (device, analyzer, and test kit identifier). This consolidated effort enables the creation of extremely detailed and patient group-specific queries, allowing the generation of individualized, covariate-adjusted RIs on the fly. CONCLUSIONS With the BioRef-TI4Health infrastructure, a framework for clinical physicians and researchers to define precise RIs immediately in a convenient, privacy-preserving, and reproducible manner has been implemented, promoting a vital part of practicing precision medicine while streamlining compliance and avoiding transfers of raw patient data. This new approach can provide a crucial update on RIs and improve patient care for personalized medicine.
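The "no copy, no move" idea, in which only aggregates leave each site, can be illustrated with a small sketch. This is not the BioRef or TI4Health implementation: it assumes that sites agree on common bin edges, that each site shares only per-bin counts, and that the reference interval is taken as the central 95% range (2.5th to 97.5th percentile), a common convention that the abstract does not state. All names are hypothetical.

```python
import bisect


def local_histogram(values, edges):
    """Count one site's measurements per bin; only these counts are shared."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # out-of-range values are ignored in this sketch
        idx = min(bisect.bisect_right(edges, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return counts


def merge_histograms(histograms):
    """Coordinator step: add up the per-bin counts from all sites."""
    return [sum(column) for column in zip(*histograms)]


def quantile_from_histogram(counts, edges, q):
    """Approximate a quantile by linear interpolation inside its bin."""
    target = q * sum(counts)
    cumulative = 0
    for i, c in enumerate(counts):
        if c > 0 and cumulative + c >= target:
            fraction = (target - cumulative) / c
            return edges[i] + fraction * (edges[i + 1] - edges[i])
        cumulative += c
    return edges[-1]


def reference_interval(histograms, edges, lower=0.025, upper=0.975):
    """Central reference interval computed from pooled counts only."""
    merged = merge_histograms(histograms)
    return (quantile_from_histogram(merged, edges, lower),
            quantile_from_histogram(merged, edges, upper))
```

The resolution of the result is limited by the bin width, and small bin counts can still reveal information about individuals, so a real deployment would need additional safeguards beyond this aggregation step.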
Affiliation(s)
- Tobias Ueli Blatter
- University Institute of Clinical Chemistry, University Hospital Bern, Bern, Switzerland
- Graduate School for Health Sciences, University of Bern, Bern, Switzerland
| | - Harald Witte
- University Institute of Clinical Chemistry, University Hospital Bern, Bern, Switzerland
| | | | - Christos Theodoros Naka
- University Institute of Clinical Chemistry, University Hospital Bern, Bern, Switzerland
- Laboratory of Biometry, University of Thessaly, Volos, Greece
| | - Jean Louis Raisaro
- Biomedical Data Science Center, University Hospital Lausanne, Lausanne, Switzerland
| | - Alexander Benedikt Leichtle
- University Institute of Clinical Chemistry, University Hospital Bern, Bern, Switzerland
- Center for Artificial Intelligence in Medicine, University of Bern, Bern, Switzerland
|
19
|
Riaz S, Ali S, Wang G, Latif MA, Iqbal MZ. Membership inference attack on differentially private block coordinate descent. PeerJ Comput Sci 2023; 9:e1616. [PMID: 37869463 PMCID: PMC10588713 DOI: 10.7717/peerj-cs.1616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2023] [Accepted: 09/05/2023] [Indexed: 10/24/2023]
Abstract
The extraordinary success of deep learning has been made possible by the availability of crowd-sourced, large-scale training datasets. These datasets mostly contain personal and confidential information and thus have great potential for misuse, raising privacy concerns. Consequently, privacy-preserving deep learning has become a primary research interest. One prominent approach to preventing the leakage of sensitive information about the training data is to apply differential privacy during training, which aims to preserve the privacy of deep learning models. Although these models are claimed to safeguard against privacy attacks targeting sensitive information, little work in the literature practically evaluates their capability by mounting a sophisticated attack on them. Recently, DP-BCD was proposed as an alternative to the state-of-the-art DP-SGD for preserving the privacy of deep learning models, offering low privacy cost and fast convergence with highly accurate predictions. To check its practical capability, in this article we analytically evaluate the impact of a sophisticated privacy attack, the membership inference attack, against it in both black-box and white-box settings. More precisely, we inspect how much information can be inferred about a differentially private deep model's training data. We evaluate our experiments on benchmark datasets using AUC, attacker advantage, precision, recall, and F1-score as performance metrics. The experimental results show that DP-BCD keeps its promise to preserve privacy against strong adversaries while providing acceptable model utility compared to state-of-the-art techniques.
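Two of the metrics named above, AUC and attacker advantage, can be computed from attack scores as in the sketch below. It assumes each sample has a membership score (higher means "more likely a training member") and a 0/1 ground-truth label, and it uses one common definition of attacker advantage, the largest gap between true-positive and false-positive rate over all thresholds. The paper's exact evaluation code is not given in the abstract, and the function names are hypothetical.

```python
def roc_points(scores, is_member):
    """Return (FPR, TPR) pairs from thresholding at every distinct score.

    `is_member` holds 1 for training members and 0 for non-members; both
    classes must be present.
    """
    positives = sum(is_member)
    negatives = len(is_member) - positives
    ranked = sorted(zip(scores, is_member), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Samples with equal scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def attacker_advantage(points):
    """Largest TPR minus FPR over all thresholds (0 for random guessing)."""
    return max(tpr - fpr for fpr, tpr in points)
```

An AUC near 0.5 and an advantage near 0 indicate that the attack does little better than guessing, which is the outcome a well-protected model aims for.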
Affiliation(s)
- Shazia Riaz
- School of Computing, Macquarie University, Sydney, Australia
- Department of Computer Science, University of Agriculture, Faisalabad, Punjab, Pakistan
| | - Saqib Ali
- Department of Computer Science, University of Agriculture, Faisalabad, Punjab, Pakistan
- School of Computing, Guangzhou University, Guangzhou, China
| | - Guojun Wang
- School of Computing, Guangzhou University, Guangzhou, China
| | - Muhammad Ahsan Latif
- Department of Computer Science, University of Agriculture, Faisalabad, Punjab, Pakistan
| | - Muhammad Zafar Iqbal
- Department of Mathematics and Statistics, University of Agriculture Faisalabad, Faisalabad, Punjab, Pakistan
|
20
|
Wang X, Dervishi L, Li W, Ayday E, Jiang X, Vaidya J. Privacy-preserving federated genome-wide association studies via dynamic sampling. Bioinformatics 2023; 39:btad639. [PMID: 37856329 PMCID: PMC10612407 DOI: 10.1093/bioinformatics/btad639] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 09/15/2023] [Accepted: 10/18/2023] [Indexed: 10/21/2023] Open
Abstract
MOTIVATION Genome-wide association studies (GWAS) benefit from the increasing availability of genomic data and cross-institution collaborations. However, sharing data across institutional boundaries jeopardizes medical data confidentiality and patient privacy. While modern cryptographic techniques provide formal security guarantees, the substantial communication and computational overheads hinder the practical application of large-scale collaborative GWAS. RESULTS This work introduces an efficient framework for conducting collaborative GWAS on distributed datasets, maintaining data privacy without compromising the accuracy of the results. We propose a novel two-step strategy aimed at reducing communication and computational overheads, and we employ iterative and sampling techniques to ensure accurate results. We instantiate our approach using logistic regression, a commonly used statistical method for identifying associations between genetic markers and the phenotype of interest. We evaluate our proposed methods using two real genomic datasets and demonstrate their robustness in the presence of between-study heterogeneity and skewed phenotype distributions using a variety of experimental settings. The empirical results show the efficiency and applicability of the proposed method and the promise for its application for large-scale collaborative GWAS. AVAILABILITY AND IMPLEMENTATION The source code and data are available at https://github.com/amioamo/TDS.
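As a rough illustration of logistic regression over distributed datasets, the sketch below fits a model by having each site compute a gradient on its own data and share only that gradient with a coordinator. It is a generic gradient-aggregation scheme, not the two-step dynamic-sampling method of the paper, and it provides no formal privacy guarantee, since plain gradients can leak information. The feature layout (an intercept column followed by a genotype dosage) and all names are assumptions made for the example.

```python
import math


def local_gradient(weights, X, y):
    """Gradient of the logistic negative log-likelihood on one site's data."""
    grad = [0.0] * len(weights)
    for row, label in zip(X, y):
        z = sum(w * x for w, x in zip(weights, row))
        p = 1.0 / (1.0 + math.exp(-z))
        for k, x in enumerate(row):
            grad[k] += (p - label) * x
    return grad


def federated_fit(sites, n_features, lr=0.1, rounds=500):
    """Gradient descent in which each site shares only its local gradient.

    `sites` is a list of (X, y) pairs that never leave their owners in a
    real deployment; here they are passed in directly for illustration.
    """
    weights = [0.0] * n_features
    n_total = sum(len(y) for _, y in sites)
    for _ in range(rounds):
        grads = [local_gradient(weights, X, y) for X, y in sites]
        # Coordinator step: the summed gradients equal the pooled gradient.
        total = [sum(g[k] for g in grads) for k in range(n_features)]
        weights = [w - lr * t / n_total for w, t in zip(weights, total)]
    return weights
```

Because the per-site gradients add up to the gradient on the pooled data, this scheme reproduces a centralized fit up to floating-point rounding; the cost is one round of communication per iteration.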
Affiliation(s)
- Xinyue Wang
- Management Science and Information Systems Department, Rutgers University, New Brunswick, NJ 07102, United States
| | - Leonard Dervishi
- Department of Computer and Data Sciences, Cleveland, OH 44106, United States
| | - Wentao Li
- Department of Health Data Science and Artificial Intelligence, Houston, TX 77030, United States
| | - Erman Ayday
- Department of Computer and Data Sciences, Cleveland, OH 44106, United States
| | - Xiaoqian Jiang
- Department of Health Data Science and Artificial Intelligence, Houston, TX 77030, United States
| | - Jaideep Vaidya
- Management Science and Information Systems Department, Rutgers University, New Brunswick, NJ 07102, United States
|
21
|
Zhang Y, Zhao L, Wang Q. MiDA: Membership inference attacks against domain adaptation. ISA TRANSACTIONS 2023; 141:103-112. [PMID: 36702690 DOI: 10.1016/j.isatra.2023.01.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2022] [Revised: 01/03/2023] [Accepted: 01/14/2023] [Indexed: 06/18/2023]
Abstract
Domain adaptation has become an effective solution for training neural networks with insufficient training data. In this paper, we investigate a vulnerability of domain adaptation that can leak sensitive information about the training dataset. We propose a new membership inference attack against domain adaptation models to infer the membership information of samples from the target domain. By leveraging background knowledge about an additional source domain in domain adaptation tasks, our attack can exploit the similar distributions of the target and source domain data to determine, with high efficiency and accuracy, whether a specific data sample belongs to the training set. In particular, the proposed attack can be deployed in a practical scenario where the attacker cannot obtain any details of the model. We conduct extensive evaluations on object and digit recognition tasks. Experimental results show that our method attacks domain adaptation models with a high success rate.
Affiliation(s)
- Yuanjie Zhang
- Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, 430072 Wuhan, PR China.
| | - Lingchen Zhao
- Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, 430072 Wuhan, PR China.
| | - Qian Wang
- Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, 430072 Wuhan, PR China.
|
22
|
Klugman CM, Erwin CJ. Machines Like Me: 4 Corollaries for Responsible Use of AI in the Bioethics Classroom. THE AMERICAN JOURNAL OF BIOETHICS : AJOB 2023; 23:86-88. [PMID: 37812108 DOI: 10.1080/15265161.2023.2250317] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2023]
|
23
|
Li W, Kim M, Zhang K, Chen H, Jiang X, Harmanci A. COLLAGENE enables privacy-aware federated and collaborative genomic data analysis. Genome Biol 2023; 24:204. [PMID: 37697426 PMCID: PMC10496350 DOI: 10.1186/s13059-023-03039-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2022] [Accepted: 08/16/2023] [Indexed: 09/13/2023] Open
Abstract
Growing regulatory requirements set barriers around genetic data sharing and collaborations. Moreover, existing privacy-aware paradigms are challenging to deploy in collaborative settings. We present COLLAGENE, a tool base for building secure collaborative genomic data analysis methods. COLLAGENE protects data using shared-key homomorphic encryption and combines encryption with multiparty strategies for efficient privacy-aware collaborative method development. COLLAGENE provides ready-to-run tools for encryption/decryption, matrix processing, and network transfers, which can be immediately integrated into existing pipelines. We demonstrate the usage of COLLAGENE by building a practical federated GWAS protocol for binary phenotypes and a secure meta-analysis protocol. COLLAGENE is available at https://zenodo.org/record/8125935.
Affiliation(s)
- Wentao Li
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
| | - Miran Kim
- Department of Mathematics, Department of Computer Science, Hanyang University, Seoul, 04763, Republic of Korea
- Research Institute for Convergence of Basic Science, Hanyang University, Seoul, 04763, Republic of Korea
- Bio-BigData Center, Hanyang Institute of Bioscience and Biotechnology, Hanyang University, Seoul, 04763, Republic of Korea
| | - Kai Zhang
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
| | - Han Chen
- Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Center for Precision Health, D. Bradley McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
| | - Xiaoqian Jiang
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
| | - Arif Harmanci
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA.
- Center for Precision Health, D. Bradley McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.
|
24
|
Budowle B, Arnette A, Sajantila A. A cost-benefit analysis for use of large SNP panels and high throughput typing for forensic investigative genetic genealogy. Int J Legal Med 2023; 137:1595-1614. [PMID: 37341834 PMCID: PMC10421786 DOI: 10.1007/s00414-023-03029-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2023] [Accepted: 05/16/2023] [Indexed: 06/22/2023]
Abstract
Next-generation sequencing (NGS), also known as massively parallel sequencing, enables large dense SNP panel analyses which generate the genetic component of forensic investigative genetic genealogy (FIGG). While the costs of implementing large SNP panel analyses into the laboratory system may seem high and daunting, the benefits of the technology may more than justify the investment. To determine if an infrastructural investment in public laboratories and using large SNP panel analyses would reap substantial benefits to society, a cost-benefit analysis (CBA) was performed. This CBA applied the following logic: more DNA profile uploads to a DNA database (owing to the larger number of markers and the greater sensitivity of detection afforded by NGS), together with a higher hit/association rate (owing to large-SNP kinship resolution and genealogy), will increase investigative leads, will be more effective for identifying recidivists, which in turn reduces future victims of crime, and will bring greater safety and security to communities. Analyses were performed for worst case/best case scenarios as well as by simulation sampling the range spaces with multiple input values simultaneously to generate best estimate summary statistics. This study shows that the benefits, both tangible and intangible, over the lifetime of an advanced database system would be huge: an investment of less than $1 billion per year (over a 10-year period) is projected to reap on average > $4.8 billion in tangible and intangible cost-benefits per year. More importantly, on average > 50,000 individuals need not become victims if FIGG were employed, assuming investigative associations generated were acted upon. The benefit to society is immense, making the laboratory investment a nominal cost. The benefits likely are underestimated herein. There is latitude in the estimated costs, and even if they were doubled or tripled, there would still be substantial benefits gained with a FIGG-based approach. 
While the data used in this CBA are US centric (primarily because data were readily accessible), the model is generalizable and could be used by other jurisdictions to perform relevant and representative CBAs.
Affiliation(s)
- Bruce Budowle
  - Department of Forensic Medicine, University of Helsinki, Helsinki, Finland.
  - Radford University Forensic Science Institute, Radford University, Radford, VA, USA.
- Andrew Arnette
  - Department of Business Information Technology, Virginia Tech, Blacksburg, VA, USA
- Antti Sajantila
  - Department of Forensic Medicine, University of Helsinki, Helsinki, Finland
  - Forensic Medicine Unit, Finnish Institute for Health and Welfare, Helsinki, Finland
|
25
|
Casaletto J, Bernier A, McDougall R, Cline MS. Federated Analysis for Privacy-Preserving Data Sharing: A Technical and Legal Primer. Annu Rev Genomics Hum Genet 2023; 24:347-368. [PMID: 37253596 PMCID: PMC10846631 DOI: 10.1146/annurev-genom-110122-084756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Continued advances in precision medicine rely on the widespread sharing of data that relate human genetic variation to disease. However, data sharing is severely limited by legal, regulatory, and ethical restrictions that safeguard patient privacy. Federated analysis addresses this problem by transferring the code to the data, providing the technical and legal capability to analyze the data within their secure home environment rather than transferring the data to another institution for analysis. This allows researchers to gain new insights from data that cannot be moved, while respecting patient privacy and the data stewards' legal obligations. Because federated analysis is a technical solution to the legal challenges inherent in data sharing, the technology and policy implications must be evaluated together. Here, we summarize the technical approaches to federated analysis and provide a legal analysis of their policy implications.
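The "code travels to the data" idea in this abstract can be illustrated with a minimal sketch. It is not taken from the paper: the `Site` class, the function names, and the toy genotype values are all hypothetical, and a real deployment would add authentication, auditing, and disclosure checks on what each site returns.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Site:
    """Hypothetical data steward whose records never leave its environment."""
    name: str
    _genotypes: List[int]  # toy alt-allele counts (0, 1 or 2) per participant

    def run(self, analysis: Callable[[List[int]], Dict[str, int]]) -> Dict[str, int]:
        # The analysis code is shipped to the site and executed locally;
        # only its aggregate return value is sent back.
        return analysis(self._genotypes)


def local_allele_counts(genotypes: List[int]) -> Dict[str, int]:
    """Aggregate computed inside each site: alt-allele count and total alleles."""
    return {"alt": sum(genotypes), "total": 2 * len(genotypes)}


def federated_allele_frequency(sites: List[Site]) -> float:
    """Coordinator combines per-site aggregates without seeing any individual row."""
    summaries = [site.run(local_allele_counts) for site in sites]
    alt = sum(s["alt"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return alt / total


sites = [
    Site("site_a", [0, 1, 2, 0]),
    Site("site_b", [1, 1, 0, 0, 0, 2]),
]
pooled_frequency = federated_allele_frequency(sites)
```

Only the two integers per site cross the institutional boundary here; whether such aggregates are themselves safe to release is a separate question that the policy analysis has to address.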
Affiliation(s)
- James Casaletto
  - Genomics Institute, University of California, Santa Cruz, California, USA
- Alexander Bernier
  - Centre of Genomics and Policy, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Robyn McDougall
  - Centre of Genomics and Policy, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Melissa S Cline
  - Genomics Institute, University of California, Santa Cruz, California, USA
|
26
|
Knoppers BM, Bernier A, Bowers S, Kirby E. Open Data in the Era of the GDPR: Lessons from the Human Cell Atlas. Annu Rev Genomics Hum Genet 2023; 24:369-391. [PMID: 36791787 DOI: 10.1146/annurev-genom-101322-113255] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 02/17/2023]
Abstract
The Human Cell Atlas (HCA) is striving to build an open community that is inclusive of all researchers adhering to its principles and as open as possible with respect to data access and use. However, open data sharing can pose certain challenges. For instance, being a global initiative, the HCA must contend with a patchwork of local and regional privacy rules. A notable example is the implementation of the European Union General Data Protection Regulation (GDPR), which caused some concern in the biomedical and genomic data-sharing community. We examine how the HCA's large, international group of researchers is investing tremendous efforts into ensuring appropriate sharing of data. We describe the HCA's objectives and governance, how it defines open data sharing, and ethico-legal challenges encountered early in its development; in particular, we describe the challenges prompted by the GDPR. Finally, we broaden the discussion to address tools and strategies that can be used to address ethical data governance.
Affiliation(s)
- Bartha Maria Knoppers
  - Centre of Genomics and Policy, School of Biomedical Sciences, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Alexander Bernier
  - Centre of Genomics and Policy, School of Biomedical Sciences, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Emily Kirby
  - Centre of Genomics and Policy, School of Biomedical Sciences, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
|
27
|
Li W, Chen H, Jiang X, Harmanci A. Federated generalized linear mixed models for collaborative genome-wide association studies. iScience 2023; 26:107227. [PMID: 37529100 PMCID: PMC10387571 DOI: 10.1016/j.isci.2023.107227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2022] [Revised: 01/28/2023] [Accepted: 06/23/2023] [Indexed: 08/03/2023] Open
Abstract
Federated association testing is a powerful approach to conduct large-scale association studies where sites share intermediate statistics through a central server. There are, however, several standing challenges. Confounding factors like population stratification should be carefully modeled across sites. In addition, it is crucial to consider disease etiology using flexible models to prevent biases. Privacy protections for participants pose another significant challenge. Here, we propose distributed Mixed Effects Genome-wide Association study (dMEGA), a method that enables federated generalized linear mixed model-based association testing across multiple sites without explicitly sharing genotype and phenotype data. dMEGA employs a reference projection to correct for population-stratification and utilizes efficient local-gradient updates among sites, incorporating both fixed and random effects. The accuracy and efficiency of dMEGA are demonstrated through simulated and real datasets. dMEGA is publicly available at https://github.com/Li-Wentao/dMEGA.
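The pattern of sites sharing intermediate statistics through a central server can be sketched with a plain federated logistic regression. This is deliberately simpler than dMEGA: it has fixed effects only, no random effects and no reference projection, and every name and data value below is hypothetical. It only shows that summing per-site gradients reproduces the pooled fit without moving genotype or phenotype rows.

```python
import math
from typing import List, Tuple

Row = Tuple[List[float], int]  # (features including intercept, binary phenotype)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def local_gradient(rows: List[Row], beta: List[float]) -> Tuple[List[float], int]:
    """Run at a site: summed gradient of the negative log-likelihood, plus row count."""
    grad = [0.0] * len(beta)
    for x, y in rows:
        p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
        for j, xi in enumerate(x):
            grad[j] += (p - y) * xi
    return grad, len(rows)


def federated_fit(sites: List[List[Row]], n_features: int,
                  lr: float = 0.5, rounds: int = 200) -> List[float]:
    """Run at the server: aggregate site gradients and take a gradient step."""
    beta = [0.0] * n_features
    for _ in range(rounds):
        total = [0.0] * n_features
        n = 0
        for rows in sites:  # in practice each call is a network round trip
            grad, count = local_gradient(rows, beta)
            total = [t + g for t, g in zip(total, grad)]
            n += count
        beta = [b - lr * t / n for b, t in zip(beta, total)]
    return beta


# Toy data: feature vector is [intercept, genotype dosage].
site_a = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 2.0], 1), ([1.0, 0.0], 0)]
site_b = [([1.0, 1.0], 1), ([1.0, 2.0], 1), ([1.0, 0.0], 1), ([1.0, 2.0], 0)]

beta_federated = federated_fit([site_a, site_b], n_features=2)
beta_pooled = federated_fit([site_a + site_b], n_features=2)
```

Because the gradient is a sum over rows, the federated and pooled coefficients agree up to floating-point rounding. The shared gradients are still functions of private data, which is why methods in this area pair the aggregation with additional privacy protections.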
Affiliation(s)
- Wentao Li
  - School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX 77030, USA
- Han Chen
  - School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX 77030, USA
  - School of Public Health, University of Texas Health Science Center, Houston, TX 77030, USA
- Xiaoqian Jiang
  - School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX 77030, USA
- Arif Harmanci
  - School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX 77030, USA
|
28
|
Venkatesaramani R, Wan Z, Malin BA, Vorobeychik Y. Enabling tradeoffs in privacy and utility in genomic data Beacons and summary statistics. Genome Res 2023; 33:1113-1123. [PMID: 37217251 PMCID: PMC10538482 DOI: 10.1101/gr.277674.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2023] [Accepted: 04/20/2023] [Indexed: 05/24/2023]
Abstract
The collection and sharing of genomic data are becoming increasingly commonplace in research, clinical, and direct-to-consumer settings. The computational protocols typically adopted to protect individual privacy include sharing summary statistics, such as allele frequencies, or limiting query responses to the presence/absence of alleles of interest using web services called Beacons. However, even such limited releases are susceptible to likelihood ratio-based membership-inference attacks. Several approaches have been proposed to preserve privacy, which either suppress a subset of genomic variants or modify query responses for specific variants (e.g., adding noise, as in differential privacy). However, many of these approaches result in a significant utility loss, either suppressing many variants or adding a substantial amount of noise. In this paper, we introduce optimization-based approaches to explicitly trade off the utility of summary data or Beacon responses and privacy with respect to membership-inference attacks based on likelihood ratios, combining variant suppression and modification. We consider two attack models. In the first, an attacker applies a likelihood ratio test to make membership-inference claims. In the second model, an attacker uses a threshold that accounts for the effect of the data release on the separation in scores between individuals in the data set and those who are not. We further introduce highly scalable approaches for approximately solving the privacy-utility tradeoff problem when information is in the form of either summary statistics or presence/absence queries. Finally, we show that the proposed approaches outperform the state of the art in both utility and privacy through an extensive evaluation with public data sets.
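A likelihood ratio membership-inference score against a Beacon can be sketched as follows. The abstract does not give the formula, so this uses the commonly cited form from earlier Beacon attack work as an assumption: with `n` individuals in the Beacon, alternate allele frequency `f`, and sequencing error rate `delta`, a "no" answer has probability `(1 - f)**(2n)` if the target is absent and `delta * (1 - f)**(2n - 2)` if the target is present. Function and variable names are hypothetical, and the attacker is assumed to query only positions where the target carries the allele.

```python
import math
from typing import List


def beacon_lrt(responses: List[int], freqs: List[float], n: int,
               delta: float = 1e-6) -> float:
    """Log-likelihood ratio log L(not a member) - log L(member).

    Lower (more negative) scores are more consistent with membership.
    Assumes 0 < f < 1 for every queried frequency.
    """
    score = 0.0
    for x, f in zip(responses, freqs):
        d_n = (1.0 - f) ** (2 * n)        # P(none of n individuals carries the allele)
        d_n1 = (1.0 - f) ** (2 * n - 2)   # same for the other n - 1 individuals
        if x:  # Beacon answered "yes"
            score += math.log((1.0 - d_n) / (1.0 - delta * d_n1))
        else:  # Beacon answered "no": ratio simplifies to (1 - f)^2 / delta
            score += 2.0 * math.log(1.0 - f) - math.log(delta)
    return score


freqs = [0.01, 0.02, 0.05]
honest_score = beacon_lrt([1, 1, 1], freqs, n=50)
# Defence by modification: flip one truthful "yes" to "no".
flipped_score = beacon_lrt([1, 1, 0], freqs, n=50)
```

Flipping a single response moves the score sharply away from the membership region, which illustrates why a defender wants to choose which few variants to suppress or modify so that the privacy gain is large and the utility loss small.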
Affiliation(s)
- Zhiyu Wan
  - Vanderbilt University Medical Center, Nashville, Tennessee 37212, USA
- Bradley A Malin
  - Vanderbilt University Medical Center, Nashville, Tennessee 37212, USA
|
29
|
Dervishi L, Li W, Halimi A, Jiang X, Vaidya J, Ayday E. Privacy preserving identification of population stratification for collaborative genomic research. Bioinformatics 2023; 39:i168-i176. [PMID: 37387172 PMCID: PMC10311306 DOI: 10.1093/bioinformatics/btad274] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Open
Abstract
The rapid improvements in genomic sequencing technology have led to the proliferation of locally collected genomic datasets. Given the sensitivity of genomic data, it is crucial to conduct collaborative studies while preserving the privacy of the individuals. However, before starting any collaborative research effort, the quality of the data needs to be assessed. One of the essential steps of the quality control process is population stratification: identifying the presence of genetic difference in individuals due to subpopulations. One of the common methods used to group genomes of individuals based on ancestry is principal component analysis (PCA). In this article, we propose a privacy-preserving framework which utilizes PCA to assign individuals to populations across multiple collaborators as part of the population stratification step. In our proposed client-server-based scheme, we initially let the server train a global PCA model on a publicly available genomic dataset which contains individuals from multiple populations. The global PCA model is later used to reduce the dimensionality of the local data by each collaborator (client). After adding noise to achieve local differential privacy (LDP), the collaborators send metadata (in the form of their local PCA outputs) about their research datasets to the server, which then aligns the local PCA results to identify the genetic differences among collaborators' datasets. Our results on real genomic data show that the proposed framework can perform population stratification analysis with high accuracy while preserving the privacy of the research participants.
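The client-side step described here (project local data with the server's global PCA model, then add noise before sending) can be sketched as below. The abstract does not state the noise mechanism or its calibration, so the Laplace mechanism, the `sensitivity` parameter (assumed to bound the L1 change of the whole projected vector), and all names and numbers are assumptions made for illustration.

```python
import random
from typing import List


def project(sample: List[float], mean: List[float],
            components: List[List[float]]) -> List[float]:
    """Center a genotype vector and project it onto the global principal axes."""
    centered = [s - m for s, m in zip(sample, mean)]
    return [sum(c * v for c, v in zip(axis, centered)) for axis in components]


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(point: List[float], sensitivity: float, epsilon: float,
              rng: random.Random) -> List[float]:
    """Perturb a projected point locally before it is sent to the server."""
    scale = sensitivity / epsilon
    return [p + laplace_noise(scale, rng) for p in point]


# Hypothetical global model published by the server (two axes over three SNPs).
global_mean = [1.0, 1.0, 1.0]
global_axes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

rng = random.Random(0)
clean = project([2.0, 0.0, 1.0], global_mean, global_axes)
noisy = privatize(clean, sensitivity=2.0, epsilon=1.0, rng=rng)
```

Smaller `epsilon` means more noise and stronger privacy, at the cost of blurrier population clusters on the server side.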
Affiliation(s)
- Leonard Dervishi
  - Computer and Data Sciences, Case Western Reserve University, OH 44106, United States
- Wenbiao Li
  - Computer and Data Sciences, Case Western Reserve University, OH 44106, United States
- Xiaoqian Jiang
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, TX 77030, United States
- Jaideep Vaidya
  - Management Science and Information Systems Department, Rutgers University, NJ 07102, USA
- Erman Ayday
  - Computer and Data Sciences, Case Western Reserve University, OH 44106, United States
|
30
|
Martin D, Basodi S, Panta S, Rootes-Murdy K, Prae P, Sarwate AD, Kelly R, Romero J, Baker BT, Gazula H, Bockholt J, Turner JA, Esper NB, Franco AR, Plis S, Calhoun VD. Enhancing collaborative neuroimaging research: introducing COINSTAC Vaults for federated analysis and reproducibility. Front Neuroinform 2023; 17:1207721. [PMID: 37404336 PMCID: PMC10315678 DOI: 10.3389/fninf.2023.1207721] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Accepted: 06/02/2023] [Indexed: 07/06/2023] Open
Abstract
Collaborative neuroimaging research is often hindered by technological, policy, administrative, and methodological barriers, despite the abundance of available data. COINSTAC (The Collaborative Informatics and Neuroimaging Suite Toolkit for Anonymous Computation) is a platform that successfully tackles these challenges through federated analysis, allowing researchers to analyze datasets without publicly sharing their data. This paper presents a significant enhancement to the COINSTAC platform: COINSTAC Vaults (CVs). CVs are designed to further reduce barriers by hosting standardized, persistent, and highly-available datasets, while seamlessly integrating with COINSTAC's federated analysis capabilities. CVs offer a user-friendly interface for self-service analysis, streamlining collaboration, and eliminating the need for manual coordination with data owners. Importantly, CVs can also be used in conjunction with open data as well, by simply creating a CV hosting the open data one would like to include in the analysis, thus filling an important gap in the data sharing ecosystem. We demonstrate the impact of CVs through several functional and structural neuroimaging studies utilizing federated analysis showcasing their potential to improve the reproducibility of research and increase sample sizes in neuroimaging studies.
Affiliation(s)
- Dylan Martin
  - Tri-institutional Center for Translational Research in Neuroimaging and Data Science, Georgia State, Georgia Tech, Emory, Atlanta, GA, United States
- Sunitha Basodi
  - Tri-institutional Center for Translational Research in Neuroimaging and Data Science, Georgia State, Georgia Tech, Emory, Atlanta, GA, United States
- Sandeep Panta
  - Tri-institutional Center for Translational Research in Neuroimaging and Data Science, Georgia State, Georgia Tech, Emory, Atlanta, GA, United States
- Kelly Rootes-Murdy
  - Tri-institutional Center for Translational Research in Neuroimaging and Data Science, Georgia State, Georgia Tech, Emory, Atlanta, GA, United States
- Paul Prae
  - Tri-institutional Center for Translational Research in Neuroimaging and Data Science, Georgia State, Georgia Tech, Emory, Atlanta, GA, United States
- Anand D. Sarwate
  - Tri-institutional Center for Translational Research in Neuroimaging and Data Science, Georgia State, Georgia Tech, Emory, Atlanta, GA, United States
  - Department of Electrical and Computer Engineering, Rutgers University–New Brunswick, Piscataway, NJ, United States
- Ross Kelly
  - Tri-institutional Center for Translational Research in Neuroimaging and Data Science, Georgia State, Georgia Tech, Emory, Atlanta, GA, United States
- Javier Romero
  - Tri-institutional Center for Translational Research in Neuroimaging and Data Science, Georgia State, Georgia Tech, Emory, Atlanta, GA, United States
- Bradley T. Baker
  - Tri-institutional Center for Translational Research in Neuroimaging and Data Science, Georgia State, Georgia Tech, Emory, Atlanta, GA, United States
- Harshvardhan Gazula
  - Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School, Boston, MA, United States
- Jeremy Bockholt
  - Tri-institutional Center for Translational Research in Neuroimaging and Data Science, Georgia State, Georgia Tech, Emory, Atlanta, GA, United States
- Jessica A. Turner
  - Tri-institutional Center for Translational Research in Neuroimaging and Data Science, Georgia State, Georgia Tech, Emory, Atlanta, GA, United States
- Nathalia B. Esper
  - Center for the Developing Brain, Child Mind Institute, New York, NY, United States
- Alexandre R. Franco
  - Center for the Developing Brain, Child Mind Institute, New York, NY, United States
  - Center for Brain Imaging and Neuromodulation, Nathan Kline Institute for Psychiatric Research, Orangeburg, NY, United States
  - Department of Psychiatry, NYU Grossman School of Medicine, New York, NY, United States
- Sergey Plis
  - Tri-institutional Center for Translational Research in Neuroimaging and Data Science, Georgia State, Georgia Tech, Emory, Atlanta, GA, United States
- Vince D. Calhoun
  - Tri-institutional Center for Translational Research in Neuroimaging and Data Science, Georgia State, Georgia Tech, Emory, Atlanta, GA, United States
|
31
|
Wang X, Dervishi L, Li W, Jiang X, Ayday E, Vaidya J. Efficient Federated Kinship Relationship Identification. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2023; 2023:534-543. [PMID: 37351796 PMCID: PMC10283133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/24/2023]
Abstract
Kinship relationship estimation plays a significant role in today's genome studies. Since genetic data are mostly stored and protected in different silos, retrieving the desirable kinship relationships across federated data warehouses is a non-trivial problem. The ability to identify and connect related individuals is important for both research and clinical applications. In this work, we propose a new privacy-preserving kinship relationship estimation framework: Incremental Update Kinship Identification (INK). The proposed framework includes three key components that allow us to control the balance between privacy and accuracy (of kinship estimation): an incremental process coupled with the use of auxiliary information and informative scores. Our empirical evaluation shows that INK can achieve higher kinship identification correctness while exposing fewer genetic markers.
Affiliation(s)
- Erman Ayday
  - Case Western Reserve University, Cleveland, OH
|
32
|
de Hemptinne MC, Posthuma D. Addressing the ethical and societal challenges posed by genome-wide association studies of behavioral and brain-related traits. Nat Neurosci 2023:10.1038/s41593-023-01333-4. [PMID: 37217727 DOI: 10.1038/s41593-023-01333-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/29/2022] [Accepted: 04/14/2023] [Indexed: 05/24/2023]
Abstract
Genome-wide association studies have led to the identification of robust statistical associations of genetic variants with numerous brain-related traits, including neurological and psychiatric conditions, and psychological and behavioral measures. These results may provide insight into the biology underlying these traits and may facilitate clinically useful predictions. However, these results also carry the risk of harm, including possible negative effects of inaccurate predictions, violations of privacy, stigma and genomic discrimination, raising serious ethical and legal implications. Here, we discuss ethical concerns surrounding the results of genome-wide association studies for individuals, society and researchers. Given the success of genome-wide association studies and the increasing availability of nonclinical genomic prediction technologies, better laws and guidelines are urgently needed to regulate the storage, processing and responsible use of genetic data. Also, researchers should be aware of possible misuse of their results, and we provide guidance to help avoid such negative impacts on individuals and society.
Affiliation(s)
- Matthieu C de Hemptinne
  - Department of Complex Trait Genetics, Center for Neurogenomics and Cognitive Research, Amsterdam Neuroscience, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands
- Danielle Posthuma
  - Department of Complex Trait Genetics, Center for Neurogenomics and Cognitive Research, Amsterdam Neuroscience, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands.
|
33
|
Martin D, Basodi S, Panta S, Rootes-Murdy K, Prae P, Sarwate AD, Kelly R, Romero J, Baker BT, Gazula H, Bockholt J, Turner J, Esper NB, Franco AR, Plis S, Calhoun VD. Enhancing Collaborative Neuroimaging Research: Introducing COINSTAC Vaults for Federated Analysis and Reproducibility. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.08.539852. [PMID: 37214791 PMCID: PMC10197552 DOI: 10.1101/2023.05.08.539852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/24/2023]
Abstract
Collaborative neuroimaging research is often hindered by technological, policy, administrative, and methodological barriers, despite the abundance of available data. COINSTAC is a platform that successfully tackles these challenges through federated analysis, allowing researchers to analyze datasets without publicly sharing their data. This paper presents a significant enhancement to the COINSTAC platform: COINSTAC Vaults (CVs). CVs are designed to further reduce barriers by hosting standardized, persistent, and highly-available datasets, while seamlessly integrating with COINSTAC's federated analysis capabilities. CVs offer a user-friendly interface for self-service analysis, streamlining collaboration and eliminating the need for manual coordination with data owners. Importantly, CVs can also be used in conjunction with open data as well, by simply creating a CV hosting the open data one would like to include in the analysis, thus filling an important gap in the data sharing ecosystem. We demonstrate the impact of CVs through several functional and structural neuroimaging studies utilizing federated analysis showcasing their potential to improve the reproducibility of research and increase sample sizes in neuroimaging studies.
|
34
|
Dervishi L, Wang X, Li W, Halimi A, Vaidya J, Jiang X, Ayday E. Facilitating Federated Genomic Data Analysis by Identifying Record Correlations while Ensuring Privacy. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2023; 2022:395-404. [PMID: 37128365 PMCID: PMC10148342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2023]
Abstract
With the reduction of sequencing costs and the pervasiveness of computing devices, genomic data collection is continually growing. However, data collection is highly fragmented and the data is still siloed across different repositories. Analyzing all of this data would be transformative for genomics research. However, the data is sensitive, and therefore cannot be easily centralized. Furthermore, there may be correlations in the data, which if not detected, can impact the analysis. In this paper, we take the first step towards identifying correlated records across multiple data repositories in a privacy-preserving manner. The proposed framework, based on random shuffling, synthetic record generation, and local differential privacy, allows a trade-off of accuracy and computational efficiency. An extensive evaluation on real genomic data from the OpenSNP dataset shows that the proposed solution is efficient and effective.
Affiliation(s)
- Erman Ayday
  - Case Western Reserve University, Cleveland, OH
|
35
|
Zheng M, Zhang X, Ma X. Unsupervised Domain Adaptation with Differentially Private Gradient Projection. INT J INTELL SYST 2023. [DOI: 10.1155/2023/8426839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/30/2023]
Abstract
Domain adaptation is a viable solution for deep learning with small data. However, domain adaptation models trained on data with sensitive information may be a violation of personal privacy. In this article, we proposed a solution for unsupervised domain adaptation, called DP-CUDA, which is based on differentially private gradient projection and contradistinguisher. Compared with the traditional domain adaptation process, DP-CUDA involves searching for domain-invariant features between the source domain and target domain first and then transferring knowledge. Specifically, the model is trained in the source domain by supervised learning from labeled data. During the training of the target model, feature learning is used to solve the classification task in an end-to-end manner using unlabeled data directly, and the differentially private noise is injected into the gradient. We conducted extensive experiments on a variety of benchmark datasets, including MNIST, USPS, SVHN, VisDA-2017, Office-31, and Amazon Review, to demonstrate our proposed method’s utility and privacy-preserving properties.
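Injecting differentially private noise into a gradient is commonly done by clipping each example's gradient and adding Gaussian noise, in the style of DP-SGD. The sketch below shows only that generic clip-and-noise step; it is not DP-CUDA, it omits the gradient projection and the contradistinguisher, and the function names and parameter values are hypothetical.

```python
import math
import random
from typing import List


def clip(grad: List[float], max_norm: float) -> List[float]:
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def private_gradient(per_example_grads: List[List[float]], max_norm: float,
                     noise_multiplier: float, rng: random.Random) -> List[float]:
    """Clip, sum, add Gaussian noise scaled to the clipping bound, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


rng = random.Random(0)
batch = [[3.0, 4.0], [0.0, 0.5]]
noiseless = private_gradient(batch, max_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = private_gradient(batch, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single example can move the update, which is what lets the added noise translate into a formal privacy guarantee; the guarantee itself depends on a privacy accountant that is not shown here.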
|
36
|
Stasi A, Mir TUG, Pellegrino A, Wani AK, Shukla S. Forty years of research and development on forensic genetics: A bibliometric analysis. Forensic Sci Int Genet 2023; 63:102826. [PMID: 36640637 DOI: 10.1016/j.fsigen.2023.102826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2022] [Revised: 12/31/2022] [Accepted: 01/02/2023] [Indexed: 01/05/2023]
Abstract
The current study aims to investigate the research publication trends in the field of forensic genetics using Bibliometric analysis. An extensive search of the Scopus database was conducted to identify scholarly articles on forensic genetics published between 1977 and 2022, and a data set comprising 2945 articles was obtained. The analysis was carried out using VOSviewer, RStudio, MS Excel and MS Access to investigate the annual publication trend, most productive journals, organizations/authors/countries, authorship and citation patterns, most cited documents/articles and co-occurrence of keywords. The results revealed the first article in the field of forensic genetics was published in 1977. By the end of 1999, only 15 articles were published. Since then, there has been a considerable increase in the yearly number of publications and post-2006, there were more than 100 yearly published articles. USA, China, Spain, Germany and United Kingdom were found to be the most productive countries. Among various organizations, the Institute of Legal Medicine, Innsbruck Medical University, Austria was found to be the most productive organization. In terms of the number of publications and citations, Morling N. was found to be the most prolific author. The highest number of articles were published in Forensic Science International: Genetics, contributing about 34% of the total articles published in different sources/journals. The document with the highest number of citations was "HOMER N, 2008, PLOS GENET", with a total of 750 citations. The most frequent keywords were forensic genetics and forensic science, followed by STR, population genetics, DNA, mt-DNA and DNA-typing. The results also revealed that there had been collaborative research among countries, organizations and authors, which helps in the exchange of ideas across disciplines, developing new skills, getting access to financial resources and generating quality results.
Affiliation(s)
- Alessandro Stasi
  - Mahidol University International College, 999 Phutthamonthon Sai 4 Rd, Salaya, Phutthamonthon District, Nakhon Pathom 73170, Thailand.
- Tahir Ul Gani Mir
  - Department of Forensic Science, School of Bioengineering and Biosciences, Lovely Professional University, Phagwara 144411, Punjab, India.
- Alfonso Pellegrino
  - Sasin School of Management, Chulalongkorn University, Chula soi 12, Wang Mai, Pathum Wan, Bangkok 10330, Thailand.
- Atif Khurshid Wani
  - Department of Biotechnology, School of Bioengineering and Biosciences, Lovely Professional University, Phagwara 144411, Punjab, India.
- Saurabh Shukla
  - Department of Forensic Science, School of Bioengineering and Biosciences, Lovely Professional University, Phagwara 144411, Punjab, India.
|
37
|
Jiang Y, Shang T, Liu J. Secure Counting Query Protocol for Genomic Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1457-1468. [PMID: 35666798 DOI: 10.1109/tcbb.2022.3178446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Statistical analysis on genomic data can explore the relationship between gene sequence and phenotype. In particular, counting the samples that carry a genomic mutation and associating them with related phenotypes for statistical analysis can annotate the variation sites and help to diagnose genovariation. Expanding the size of the variation sample data helps to increase the accuracy of statistical analysis. It is feasible to securely share data from genomic databases on cloud platforms. In this paper, we design a secure counting query protocol that can securely share genomic data on cloud platforms. Our protocol supports statistical analysis of the genomic data in VCF (Variant Call Format) files by counting query. There are three participants: the data owner, the cloud platform and the query party. Firstly, the genomic data is preprocessed to reduce the data size. Secondly, Paillier homomorphic encryption is used so that genomic data can be securely shared and computed on the cloud platform. Finally, the decrypted results are used to implement the counting function of the protocol. Experimental results show that the protocol can implement the query counting function after homomorphic encryption. The query time is less than 1 s, which provides a feasible solution for sharing genomic data securely on a cloud platform for statistical analysis.
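How an additively homomorphic scheme supports a counting query can be shown with a toy Paillier implementation. This is a sketch under stated assumptions, not the paper's protocol: the primes are far too small for real use, `random.Random` stands in for a cryptographic random source, the preprocessing step is omitted, and which party holds the private key is left open because the abstract does not say.

```python
import math
import random
from typing import Tuple

PublicKey = Tuple[int, int]        # (n, g)
PrivateKey = Tuple[int, int, int]  # (lam, mu, n)


def keygen(p: int, q: int) -> Tuple[PublicKey, PrivateKey]:
    """Build a toy Paillier key pair from two primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1                                          # common simple choice of g
    mu = pow(lam, -1, n)                               # inverse of L(g^lam mod n^2) = lam mod n
    return (n, g), (lam, mu, n)


def encrypt(pub: PublicKey, m: int, rng: random.Random) -> int:
    n, g = pub
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def add_encrypted(pub: PublicKey, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    n, _ = pub
    return (c1 * c2) % (n * n)


def decrypt(priv: PrivateKey, c: int) -> int:
    lam, mu, n = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


rng = random.Random(0)                 # demo only, not a secure source of randomness
pub, priv = keygen(10007, 10009)
carries_variant = [1, 0, 1, 1, 0]      # hypothetical per-sample indicator bits
ciphertexts = [encrypt(pub, bit, rng) for bit in carries_variant]

# The cloud platform combines ciphertexts without learning any bit.
encrypted_count = ciphertexts[0]
for c in ciphertexts[1:]:
    encrypted_count = add_encrypted(pub, encrypted_count, c)

count = decrypt(priv, encrypted_count)
```

The platform only ever handles ciphertexts, and the count becomes visible only to whoever holds the private key. The modular inverse via `pow(lam, -1, n)` requires Python 3.8 or later.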
|
38
|
Neurogenomics in Africa: current state, challenges, opportunities, and recommendation. Ann Med Surg (Lond) 2023; 85:351-354. [PMID: 36845781 PMCID: PMC9949868 DOI: 10.1097/ms9.0000000000000158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Accepted: 12/25/2022] [Indexed: 02/28/2023] Open
Abstract
Neurological diseases are becoming more common in Africa. Current estimates indicate that Africa has a significant burden of neurological illnesses, though it is unclear what fraction of the burden may be linked to genetic transmission. In recent years, there has been a significant expansion in the knowledge of the genetic basis of neurological illnesses. This has been made possible mainly by the positional cloning research paradigm, which uses linkage studies to pinpoint specific genes on chromosomes and targeted screening of Mendelian neurological illnesses to identify the causative genes. However, there is currently very little and unequal geographic knowledge about neurogenetics in African people. The lack of collaboration between academics studying neurogenomics and bioinformatics contributes to the scarcity of large-scale neurogenomic investigations in Africa. The primary cause is a shortage of funding from the African government for clinical researchers; this has resulted in heterogeneity in research collaboration in the region as African researchers work more closely with researchers outside the region due to pulling factors of standardized laboratory resources and adequate funding. Therefore, adequate funding is required to elevate researchers' morale and give them the resources they need for their neurogenomic and bioinformatics studies. For Africa to fully benefit from this significant research area, substantial and sustainable financial investments in the training of scientists and clinicians will be required.
|
39
|
Reales G, Wallace C. Sharing GWAS summary statistics results in more citations. Commun Biol 2023; 6:116. [PMID: 36709395 PMCID: PMC9884206 DOI: 10.1038/s42003-023-04497-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/23/2022] [Accepted: 01/17/2023] [Indexed: 01/29/2023] Open
Abstract
A review of citation rates from genomic studies in the GWAS Catalog suggests that sharing summary statistics results, on average, in ~81.8% more citations, highlighting a benefit of publicly sharing GWAS summary statistics.
Affiliation(s)
- Guillermo Reales
  - Cambridge Institute of Therapeutic Immunology and Infectious Disease (CITIID), University of Cambridge, Cambridge, UK
  - Department of Medicine, University of Cambridge, Cambridge, UK
- Chris Wallace
  - Cambridge Institute of Therapeutic Immunology and Infectious Disease (CITIID), University of Cambridge, Cambridge, UK
  - Department of Medicine, University of Cambridge, Cambridge, UK
  - MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
|
40
|
Sequre: a high-performance framework for secure multiparty computation enables biomedical data sharing. Genome Biol 2023; 24:5. [PMID: 36631897 PMCID: PMC9832703 DOI: 10.1186/s13059-022-02841-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2022] [Accepted: 12/21/2022] [Indexed: 01/12/2023] Open
Abstract
Secure multiparty computation (MPC) is a cryptographic tool that allows computation on top of sensitive biomedical data without revealing private information to the involved entities. Here, we introduce Sequre, an easy-to-use, high-performance framework for developing performant MPC applications. Sequre offers a set of automatic compile-time optimizations that significantly improve the performance of MPC applications and incorporates the syntax of Python programming language to facilitate rapid application development. We demonstrate its usability and performance on various bioinformatics tasks showing up to 3-4 times increased speed over the existing pipelines with 7-fold reductions in codebase sizes.
|
41
|
Beck T, Rowlands T, Shorter T, Brookes AJ. GWAS Central: an expanding resource for finding and visualising genotype and phenotype data from genome-wide association studies. Nucleic Acids Res 2023; 51:D986-D993. [PMID: 36350644 PMCID: PMC9825503 DOI: 10.1093/nar/gkac1017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2022] [Revised: 10/18/2022] [Accepted: 10/20/2022] [Indexed: 11/10/2022] Open
Abstract
The GWAS Central resource gathers and curates extensive summary-level genome-wide association study (GWAS) data and puts a range of user-friendly but powerful website tools for the comparison and visualisation of GWAS data at the fingertips of researchers. Through our continued efforts to harmonise and import data received from GWAS authors and consortia, and data sets actively collected from public sources, the database now contains over 72.5 million P-values for over 5000 studies testing over 7.4 million unique genetic markers investigating over 1700 unique phenotypes. Here, we describe an update to integrate this extensive data collection with mouse disease model data to support insights into the functional impact of human genetic variation. GWAS Central has expanded to include mouse gene-phenotype associations observed during mouse gene knockout screens. To allow similar cross-species phenotypes to be compared, terms from mammalian and human phenotype ontologies have been mapped. New interactive interfaces to find, correlate and view human and mouse genotype-phenotype associations are included in the website toolkit. Additionally, the integrated browser for interrogating multiple association data sets has been updated and a GA4GH Beacon API endpoint has been added for discovering variants tested in GWAS. The GWAS Central resource is accessible at https://www.gwascentral.org/.
Affiliation(s)
- Tim Beck
  - Department of Genetics and Genome Biology, University of Leicester, Leicester, LE1 7RH, UK
  - Health Data Research UK (HDR UK), London, UK
- Thomas Rowlands
  - Department of Genetics and Genome Biology, University of Leicester, Leicester, LE1 7RH, UK
- Tom Shorter
  - Department of Genetics and Genome Biology, University of Leicester, Leicester, LE1 7RH, UK
- Anthony J Brookes
  - Department of Genetics and Genome Biology, University of Leicester, Leicester, LE1 7RH, UK
  - Health Data Research UK (HDR UK), London, UK
|
42
|
Washington PY, Puniwai N, Kamaka M, Gürsoy G, Tatonetti N, Brenner SE, Wall DP. Session Introduction: Towards Ethical Biomedical Informatics: Learning from Olelo Noeau, Hawaiian Proverbs. Pacific Symposium on Biocomputing 2023; 28:461-471. [PMID: 36541000 PMCID: PMC11095408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/29/2022]
Abstract
Innovations in human-centered biomedical informatics are often developed with the eventual goal of real-world translation. While biomedical research questions are usually answered in terms of how a method performs in a particular context, we argue that it is equally important to consider and formally evaluate the ethical implications of informatics solutions. Several new research paradigms have arisen as a result of the consideration of ethical issues, including but not limited to privacy-preserving computation and fair machine learning. In the spirit of the Pacific Symposium on Biocomputing, we discuss broad and fundamental principles of ethical biomedical informatics in terms of Olelo Noeau, or Hawaiian proverbs and poetical sayings that capture Hawaiian values. While we emphasize issues related to privacy and fairness in particular, there are a multitude of facets to ethical biomedical informatics that can benefit from a critical analysis grounded in ethics.
Affiliation(s)
- Peter Y Washington
  - Department of Information & Computer Sciences, University of Hawaii at Manoa, Honolulu, HI 96822, USA
|
43
|
Samlali K, Thornbury M, Venter A. Community-led risk analysis of direct-to-consumer whole-genome sequencing. Biochem Cell Biol 2022; 100:499-509. [PMID: 35939839 DOI: 10.1139/bcb-2021-0506] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022] Open
Abstract
Direct-to-consumer (DTC) genetic testing is cheaper and more accessible than ever before; however, the intention to combine, reuse, and resell this genetic information as powerful data sets is generally hidden from the consumer. This financial gain is creating a competitive DTC market, reducing the price of whole-genome sequencing (WGS) to under 300 USD. Entering this transition from single-nucleotide polymorphism-based DTC testing to WGS DTC testing, individuals looking for access to their whole-genomic information face new privacy and security risks. Differences between WGS and other methods of consumer genetic tests are left unexplored by regulation, leading to the application of legal data anonymization methods on whole-genome data, and questionable consent methods. Large representative genomic data sets are important for research and improve the standard of medicine and personalized care. However, these data can also be used by market players, law enforcement, and governments for surveillance, population analyses, marketing purposes, and discrimination. Here, we present a summary of the state of WGS DTC genetic testing and its current regulation, through a community-based lens to expose dual-use risks in consumer-facing biotechnologies.
Affiliation(s)
- Kenza Samlali
  - BricoBio Community Biology Lab, Montréal, QC, Canada; Centre for Applied Synthetic Biology, Concordia University, Montréal, QC, Canada; Department of Electrical and Computer Engineering, Concordia University, Montréal, QC, Canada
- Mackenzie Thornbury
  - BricoBio Community Biology Lab, Montréal, QC, Canada; Centre for Applied Synthetic Biology, Concordia University, Montréal, QC, Canada; Department of Biology, Concordia University, Montréal, QC, Canada
- Andrei Venter
  - BricoBio Community Biology Lab, Montréal, QC, Canada
|
44
|
Liu T, Hu X, Xu H, Shu T, Nguyen DN. High-accuracy low-cost privacy-preserving federated learning in IoT systems via adaptive perturbation. Journal of Information Security and Applications 2022. [DOI: 10.1016/j.jisa.2022.103309] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
|
45
|
Nelson SC, Gogarten SM, Fullerton SM, Isasi CR, Mitchell BD, North KE, Rich SS, Taylor MRG, Zöllner S, Sofer T. Social and scientific motivations to move beyond groups in allele frequencies: The TOPMed experience. Am J Hum Genet 2022; 109:1582-1590. [PMID: 36055210 PMCID: PMC9502047 DOI: 10.1016/j.ajhg.2022.07.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2022] [Accepted: 07/05/2022] [Indexed: 11/29/2022] Open
Abstract
For the genomics community, allele frequencies within defined groups (or "strata") are useful across multiple research and clinical contexts. Benefits include allowing researchers to identify populations for replication or "look up" studies, enabling researchers to compare population-specific frequencies to validate findings, and facilitating assessment of variant pathogenicity in clinical contexts. However, there are potential concerns with stratified allele frequencies. These include potential re-identification (determining whether or not an individual participated in a given research study based on allele frequencies and individual-level genetic data), harm from associating stigmatizing variants with specific groups, potential reification of race as a biological rather than a socio-political category, and whether presenting stratified frequencies (and the downstream applications that this presentation enables) is consistent with participants' informed consents. The NHLBI Trans-Omics for Precision Medicine (TOPMed) program considered the scientific and social implications of different approaches for adding stratified frequencies to the TOPMed BRAVO (Browse All Variants Online) variant server. We recommend a novel approach of presenting ancestry-specific allele frequencies using a statistical method based upon local genetic ancestry inference. Notably, this approach does not require grouping individuals by either predominant global ancestry or race/ethnicity and, therefore, mitigates re-identification and other concerns as the mixture distribution of ancestral allele frequencies varies across the genome. Here we describe our considerations and approach, which can assist other genomics research programs facing similar issues of how to define and present stratified frequencies in publicly available variant databases.
Affiliation(s)
- Sarah C Nelson
  - Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
- Stephanie M Fullerton
  - Department of Bioethics and Humanities, University of Washington, Seattle, WA 98195, USA
- Carmen R Isasi
  - Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, NY 10461, USA
- Braxton D Mitchell
  - Department of Medicine, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Kari E North
  - Department of Epidemiology, University of North Carolina, Chapel Hill, NC 27599, USA
- Stephen S Rich
  - Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA 22903, USA
- Matthew R G Taylor
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Sebastian Zöllner
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Psychiatry, University of Michigan, Ann Arbor, MI 48109, USA
- Tamar Sofer
  - Department of Medicine, Harvard Medical School, Brigham and Women's Hospital, Boston, MA 02115, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
|
46
|
Abstract
Genomics data are important for advancing biomedical research, improving clinical care, and informing other disciplines such as forensics and genealogy. However, privacy concerns arise when genomic data are shared. In particular, the identifying nature of genetic information, its direct relationship to health status, and the potential financial harm and stigmatization posed to individuals and their blood relatives call for a survey of the privacy issues related to sharing genetic and related data and potential solutions to overcome these issues. In this work, we provide an overview of the importance of genomic privacy, the information gleaned from genomics data, the sources of potential private information leakages in genomics, and ways to preserve privacy while utilizing the genetic information in research. We discuss the relationship between trust in the scientific community and protecting privacy, illuminating a future roadmap for data sharing and study participation.
Affiliation(s)
- Gamze Gürsoy
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA; New York Genome Center, New York, NY, USA
|
47
|
The Protection of Data Sharing for Privacy in Financial Vision. Applied Sciences-Basel 2022. [DOI: 10.3390/app12157408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Abstract
The primary motivation is to address difficulties in data interpretation or a reduction in model accuracy. Although differential privacy can provide data privacy guarantees, it also creates problems, and the appropriate noise setting for differential privacy remains inconclusive. This paper’s main contribution is finding a balance between privacy and accuracy. The training data of deep learning models may contain private or sensitive corporate information. These data may be vulnerable to attacks, leading to leakage of private data when data are shared. Many strategies exist for privacy protection, and differential privacy is the most widely applied one. Google proposed federated learning in 2016 to solve the problem of data silos. The technology can share information without exchanging original data and has made significant progress in the medical field. However, there is still a risk of data leakage in federated learning; thus, many models are now combined with differential privacy mechanisms to minimize the risk. Data in the financial field are similar to medical data in that they contain a substantial amount of personal data. Leakage may cause uncontrollable consequences, making data exchange and sharing difficult. If differential privacy were applied to the financial field, financial institutions could provide customers with higher-value, personalized services and automate credit scoring and risk management. Unfortunately, the financial sector rarely applies differential privacy and has reached no consensus on parameter settings. This study compares the data security of non-private and differentially private financial vision models. The paper finds a balance between privacy protection and model accuracy. The results show that when the privacy loss parameter ϵ is between 12.62 and 5.41, the privacy models can protect training data, and the accuracy does not decrease too much.
|
48
|
Wan Z, Hazel JW, Clayton EW, Vorobeychik Y, Kantarcioglu M, Malin BA. Sociotechnical safeguards for genomic data privacy. Nat Rev Genet 2022; 23:429-445. [PMID: 35246669 PMCID: PMC8896074 DOI: 10.1038/s41576-022-00455-y] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Accepted: 01/24/2022] [Indexed: 12/21/2022]
Abstract
Recent developments in a variety of sectors, including health care, research and the direct-to-consumer industry, have led to a dramatic increase in the amount of genomic data that are collected, used and shared. This state of affairs raises new and challenging concerns for personal privacy, both legally and technically. This Review appraises existing and emerging threats to genomic data privacy and discusses how well current legal frameworks and technical safeguards mitigate these concerns. It concludes with a discussion of remaining and emerging challenges and illustrates possible solutions that can balance protecting privacy and realizing the benefits that result from the sharing of genetic information.
Affiliation(s)
- Zhiyu Wan
  - Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- James W Hazel
  - Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA
  - Center for Biomedical Ethics and Society, Vanderbilt University, Nashville, TN, USA
- Ellen Wright Clayton
  - Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA
  - Center for Biomedical Ethics and Society, Vanderbilt University, Nashville, TN, USA
  - Vanderbilt University Law School, Nashville, TN, USA
- Yevgeniy Vorobeychik
  - Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO, USA
- Murat Kantarcioglu
  - Department of Computer Science, University of Texas at Dallas, Richardson, TX, USA
- Bradley A Malin
  - Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
|
49
|
Zhang C, Bonomi L. Mitigating Membership Inference in Deep Learning Applications with High Dimensional Genomic Data. IEEE International Conference on Healthcare Informatics 2022; 2022:10.1109/ichi54592.2022.00101. [PMID: 36120416 PMCID: PMC9473339 DOI: 10.1109/ichi54592.2022.00101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
The use of deep learning techniques in medical applications holds great promise for advancing health care. However, there are growing privacy concerns regarding what information about individual data contributors (i.e., patients in the training set) these deep models may reveal when shared with external users. In this work, we first investigate the membership privacy risks in sharing deep learning models for cancer genomics tasks, and then study the applicability of privacy-protecting strategies for mitigating these privacy risks.
Affiliation(s)
- Chonghao Zhang
  - Dept. of Computer Science and Engineering, University of California, San Diego, La Jolla, CA
- Luca Bonomi
  - Dept. of Biomedical Informatics, Vanderbilt University, Nashville, TN
|
50
|
Bonomi L, Fan L. Sharing Time-to-Event Data with Privacy Protection. Proceedings. IEEE International Conference on Healthcare Informatics 2022; 2022:10.1109/ichi54592.2022.00014. [PMID: 36120417 PMCID: PMC9473343 DOI: 10.1109/ichi54592.2022.00014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/15/2023]
Abstract
Sharing time-to-event data is beneficial for enabling collaborative research efforts (e.g., survival studies), facilitating the design of effective interventions, and advancing patient care (e.g., early diagnosis). Despite numerous privacy solutions for sharing time-to-event data, recent research studies have shown that external information may become available (e.g., self-disclosure of study participation on social media) to an adversary, posing new privacy concerns. In this work, we formulate a cohort inference attack for time-to-event data sharing, in which an informed adversary aims at inferring the membership of a target individual in a specific cohort. Our study investigates the privacy risks associated with time-to-event data and evaluates the empirical privacy protection offered by popular privacy-protecting solutions (e.g., binning, differential privacy). Furthermore, we propose a novel approach to privately release individual level time-to-event data with high utility, while providing indistinguishability guarantees for the input value. Our method TE-Sanitizer is shown to provide effective mitigation against the inference attacks and high usefulness in survival analysis. The results and discussion provide domain experts with insights on the privacy and the usefulness of the studied methods.
Affiliation(s)
- Luca Bonomi
  - Dept. of Biomedical Informatics, Vanderbilt University, Nashville, TN
- Liyue Fan
  - Dept. of Computer Science, University of North Carolina at Charlotte, Charlotte, NC
|