1. Brauneck A, Schmalhorst L, Weiss S, Baumbach L, Völker U, Ellinghaus D, Baumbach J, Buchholtz G. Legal aspects of privacy-enhancing technologies in genome-wide association studies and their impact on performance and feasibility. Genome Biol 2024; 25:154. [PMID: 38872191 PMCID: PMC11170858 DOI: 10.1186/s13059-024-03296-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Accepted: 06/03/2024] [Indexed: 06/15/2024]
Abstract
Genomic data holds huge potential for medical progress but requires strict safety measures due to its sensitive nature to comply with data protection laws. This conflict is especially pronounced in genome-wide association studies (GWAS) which rely on vast amounts of genomic data to improve medical diagnoses. To ensure both their benefits and sufficient data security, we propose a federated approach in combination with privacy-enhancing technologies utilising the findings from a systematic review on federated learning and legal regulations in general and applying these to GWAS.
Affiliation(s)
- Alissa Brauneck
- Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany
- Louisa Schmalhorst
- Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany
- Stefan Weiss
- Interfaculty Institute of Genetics and Functional Genomics, Department of Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- Linda Baumbach
- Department of Health Economics and Health Services Research, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Uwe Völker
- Interfaculty Institute of Genetics and Functional Genomics, Department of Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- David Ellinghaus
- Institute of Clinical Molecular Biology (IKMB), Kiel University and University Medical Center Schleswig-Holstein, Kiel, Germany
- Jan Baumbach
- Institute for Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Gabriele Buchholtz
- Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany

2. Thomas M, Mackes N, Preuss-Dodhy A, Wieland T, Bundschus M. Assessing Privacy Vulnerabilities in Genetic Data Sets: Scoping Review. JMIR BIOINFORMATICS AND BIOTECHNOLOGY 2024; 5:e54332. [PMID: 38935957 PMCID: PMC11165293 DOI: 10.2196/54332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Revised: 03/26/2024] [Accepted: 03/29/2024] [Indexed: 06/29/2024]
Abstract
BACKGROUND Genetic data are widely considered inherently identifiable. However, genetic data sets come in many shapes and sizes, and the feasibility of privacy attacks depends on their specific content. Assessing the reidentification risk of genetic data is complex, yet there is a lack of guidelines or recommendations that support data processors in performing such an evaluation. OBJECTIVE This study aims to gain a comprehensive understanding of the privacy vulnerabilities of genetic data and create a summary that can guide data processors in assessing the privacy risk of genetic data sets. METHODS We conducted a 2-step search, in which we first identified 21 reviews published between 2017 and 2023 on the topic of genomic privacy and then analyzed all references cited in the reviews (n=1645) to identify 42 unique original research studies that demonstrate a privacy attack on genetic data. We then evaluated the type and components of genetic data exploited for these attacks as well as the effort and resources needed for their implementation and their probability of success. RESULTS From our literature review, we derived 9 nonmutually exclusive features of genetic data that are both inherent to any genetic data set and informative about privacy risk: biological modality, experimental assay, data format or level of processing, germline versus somatic variation content, content of single nucleotide polymorphisms, short tandem repeats, aggregated sample measures, structural variants, and rare single nucleotide variants. CONCLUSIONS On the basis of our literature review, the evaluation of these 9 features covers the great majority of privacy-critical aspects of genetic data and thus provides a foundation and guidance for assessing genetic data risk.

3. Cavinato T, Rubinacci S, Malaspinas AS, Delaneau O. A resampling-based approach to share reference panels. NATURE COMPUTATIONAL SCIENCE 2024; 4:360-366. [PMID: 38745108 PMCID: PMC11136649 DOI: 10.1038/s43588-024-00630-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Accepted: 04/16/2024] [Indexed: 05/16/2024]
Abstract
For many genome-wide association studies, imputing genotypes from a haplotype reference panel is a necessary step. Over the past 15 years, reference panels have become larger and more diverse, leading to improvements in imputation accuracy. However, the latest generation of reference panels is subject to restrictions on data sharing due to concerns about privacy, limiting their usefulness for genotype imputation. In this context, here we propose RESHAPE, a method that employs a recombination Poisson process on a reference panel to simulate the genomes of hypothetical descendants after multiple generations. This data transformation helps to protect against re-identification threats and preserves data attributes, such as linkage disequilibrium patterns and, to some degree, identity-by-descent sharing, allowing for genotype imputation. Our experiments on gold-standard datasets show that simulated descendants up to eight generations can serve as reference panels without substantially reducing genotype imputation accuracy.
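The idea of turning a panel into mosaics of its own haplotypes can be illustrated with a minimal sketch. This is not the RESHAPE implementation: the function name `shuffle_haplotypes`, the random pairing of haplotypes, and the uniform placement of breakpoints along the sites are all assumptions made for illustration, and a real tool would place breakpoints according to a genetic map.

```python
import numpy as np

def shuffle_haplotypes(panel, map_length_morgans, generations, rng):
    """Return a copy of `panel` (haplotypes x sites) whose rows are mosaics.

    Hypothetical helper: each generation pairs haplotypes at random and
    exchanges segments at a Poisson-distributed number of breakpoints.
    """
    hap = panel.copy()
    n_hap, n_sites = hap.shape
    for _ in range(generations):
        order = rng.permutation(n_hap)
        # With an odd number of haplotypes, the leftover row is left unchanged.
        for a, b in zip(order[0::2], order[1::2]):
            n_breaks = rng.poisson(map_length_morgans)
            if n_breaks == 0:
                continue
            # Assumption: breakpoints fall uniformly between sites.
            breaks = np.sort(rng.integers(1, n_sites, size=n_breaks))
            # Build a mask that flips at each breakpoint, so alternating
            # segments are exchanged between the two haplotypes.
            swap = np.zeros(n_sites, dtype=bool)
            for start in breaks:
                swap[start:] = ~swap[start:]
            tmp = hap[a, swap].copy()
            hap[a, swap] = hap[b, swap]
            hap[b, swap] = tmp
    return hap
```

Because segments are only exchanged between rows, the allele count at every site is unchanged while each row stops corresponding to a single original haplotype.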
Affiliation(s)
- Théo Cavinato
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
- Simone Rubinacci
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Anna-Sapfo Malaspinas
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland

4. Malakar Y, Lacey J, Twine NA, McCrea R, Bauer DC. Balancing the safeguarding of privacy and data sharing: perceptions of genomic professionals on patient genomic data ownership in Australia. Eur J Hum Genet 2024; 32:506-512. [PMID: 36631540 PMCID: PMC11061115 DOI: 10.1038/s41431-022-01273-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2022] [Revised: 11/09/2022] [Accepted: 12/15/2022] [Indexed: 01/13/2023]
Abstract
There are inherent complexities and tensions in achieving a responsible balance between safeguarding patients' privacy and sharing genomic data for advancing health and medical science. A growing body of literature suggests establishing patient genomic data ownership, enabled by blockchain technology, as one approach for managing these priorities. We conducted an online survey, applying a mixed methods approach to collect quantitative (using scale questions) and qualitative data (using open-ended questions). We explored the views of 117 genomic professionals (clinical geneticists, genetic counsellors, bioinformaticians, and researchers) towards patient data ownership in Australia. Data analysis revealed most professionals agreed that patients have rights to data ownership. However, there is a need for a clearer understanding of the nature and implications of data ownership in this context as genomic data often is subject to collective ownership (e.g., with family members and laboratories). This research finds that while the majority of genomic professionals acknowledge the desire for patient data ownership, bioinformaticians and researchers expressed more favourable views than clinical geneticists and genetic counsellors, suggesting that their views on this issue may be shaped by how closely they interact with patients as part of their professional duties. This research also confirms that stronger health system infrastructure is a prerequisite for enabling patient data ownership, which needs to be underpinned by appropriate digital infrastructure (e.g., central vs. decentralised data storage), patient identity ownership (e.g., limited vs. self-sovereign identity), and policy at both federal and state levels.
Affiliation(s)
- Yuwan Malakar
- Responsible Innovation Future Science Platform, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Brisbane, Queensland, Australia
- Justine Lacey
- Responsible Innovation Future Science Platform, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Brisbane, Queensland, Australia
- Natalie A Twine
- Transformational Bioinformatics, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia
- Applied BioSciences, Faculty of Science and Engineering, Macquarie University, Macquarie Park, Australia
- Rod McCrea
- Responsible Innovation Future Science Platform, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Brisbane, Queensland, Australia
- Denis C Bauer
- Transformational Bioinformatics, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia
- Applied BioSciences, Faculty of Science and Engineering, Macquarie University, Macquarie Park, Australia
- Department of Biomedical Sciences, Faculty of Medicine and Health Science, Macquarie University, Macquarie Park, Australia

5. Shuffling haplotypes to share reference panels for imputation. NATURE COMPUTATIONAL SCIENCE 2024; 4:320-321. [PMID: 38778210 DOI: 10.1038/s43588-024-00640-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/25/2024]

6. Bartels K, Afonso S, Brown L, Carriles C, Kim R, Lazier J, Mercimek-Andrews S, Nelson TN, Stedman I, Thain E, Vanneste R, Chad L. Next generation of free? Points to consider when navigating sponsored genetic testing. J Med Genet 2024; 61:299-304. [PMID: 37932018 DOI: 10.1136/jmg-2023-109571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2023] [Accepted: 09/28/2023] [Indexed: 11/08/2023]
Abstract
Genetics has been integrated into patient care across many subspecialties. However, genetic and genomic testing (GT) remains expensive, with disparities in access both within Canada and internationally. It is, therefore, not surprising that sponsored GT has emerged as one alternative. Sponsored GT, for the purpose of this document, refers to clinical-grade GT partially or fully subsidised by industry. In return, industry sponsors (usually pharmaceutical or biotechnology companies) may have access to patients' genetic data, practitioner information, DNA and/or other information. The availability of sponsored GT options in the Canadian healthcare landscape has appeared to simplify patient and practitioner access to GT, but the potential ethical and legal considerations, as well as the nuances of a publicly funded healthcare system, must also be considered. This document offers preliminary guidance for Canadian healthcare practitioners encountering sponsored GT in practice. Further research and dialogue are urgently needed to explore this issue and to provide fulsome considerations that one must be aware of when availing oneself of such options.
Affiliation(s)
- Kirsten Bartels
- Department of Medicine, Providence Health Care Heart Centre, St. Paul's Hospital, Vancouver, British Columbia, Canada
- Samantha Afonso
- Heart, Lung and Vascular Program, St. Michael's Hospital, Unity Health Toronto, Toronto, Ontario, Canada
- Lindsay Brown
- Pathology & Laboratory Medicine, BC Children's Hospital, Vancouver, British Columbia, Canada
- Claudia Carriles
- Genomics Laboratory, Shared Health Manitoba, Winnipeg, Manitoba, Canada
- Raymond Kim
- Department of Medicine, Division of Medical Oncology and Hematology, Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Joanna Lazier
- Medical Genetics, Children's Hospital of Eastern Ontario, Ottawa, Ontario, Canada
- Tanya N Nelson
- Pathology & Laboratory Medicine, BC Children's Hospital, Vancouver, British Columbia, Canada
- Ian Stedman
- School of Public Policy and Administration, York University, Toronto, Ontario, Canada
- Emily Thain
- Familial Cancer Clinic, University Health Network, Toronto, Ontario, Canada
- Rachel Vanneste
- Division of Medical Genetics, Department of Pediatrics, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
- Lauren Chad
- Department of Pediatrics, The Hospital for Sick Children, Toronto, Ontario, Canada
- Department of Bioethics, The Hospital for Sick Children, Toronto, Ontario, Canada

7. Zhou J, Chen S, Wu Y, Li H, Zhang B, Zhou L, Hu Y, Xiang Z, Li Z, Chen N, Han W, Xu C, Wang D, Gao X. PPML-Omics: A privacy-preserving federated machine learning method protects patients' privacy in omic data. SCIENCE ADVANCES 2024; 10:eadh8601. [PMID: 38295178 PMCID: PMC10830108 DOI: 10.1126/sciadv.adh8601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2023] [Accepted: 12/29/2023] [Indexed: 02/02/2024]
Abstract
Modern machine learning models toward various tasks with omic data analysis give rise to threats of privacy leakage of patients involved in those datasets. Here, we proposed a secure and privacy-preserving machine learning method (PPML-Omics) by designing a decentralized differential private federated learning algorithm. We applied PPML-Omics to analyze data from three sequencing technologies and addressed the privacy concern in three major tasks of omic data under three representative deep learning models. We examined privacy breaches in depth through privacy attack experiments and demonstrated that PPML-Omics could protect patients' privacy. In each of these applications, PPML-Omics was able to outperform methods of comparison under the same level of privacy guarantee, demonstrating the versatility of the method in simultaneously balancing the privacy-preserving capability and utility in omic data analysis. Furthermore, we gave the theoretical proof of the privacy-preserving capability of PPML-Omics, suggesting the first mathematically guaranteed method with robust and generalizable empirical performance in protecting patients' privacy in omic data.
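As a rough illustration of how differential privacy is commonly combined with federated learning, the sketch below clips each client's model update and adds Gaussian noise before averaging. It shows the generic clip-and-noise recipe, not the PPML-Omics algorithm itself; the function names and the parameters `clip_norm` and `noise_multiplier` are assumptions chosen for the example.

```python
import numpy as np

def clip_update(update, clip_norm):
    """Scale `update` down so its L2 norm is at most `clip_norm`."""
    norm = np.linalg.norm(update)
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return update * scale

def dp_federated_average(updates, clip_norm, noise_multiplier, rng):
    """Average client updates after clipping and adding Gaussian noise.

    Hypothetical helper: clipping bounds each client's contribution, and
    the noise standard deviation is tied to that bound.
    """
    clipped = [clip_update(u, clip_norm) for u in updates]
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(updates)
```

Setting `noise_multiplier` to zero reduces this to a plain average of clipped updates, which is a convenient way to check the clipping step in isolation. Turning a chosen noise level into a formal privacy budget requires a separate accounting step that this sketch does not cover.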
Affiliation(s)
- Juexiao Zhou
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Siyuan Chen
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Yulian Wu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Haoyang Li
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Bin Zhang
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Longxi Zhou
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Yan Hu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Zihang Xiang
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Zhongxiao Li
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Ningning Chen
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Wenkai Han
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Chencheng Xu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Di Wang
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia

8. Schubach M, Maass T, Nazaretyan L, Röner S, Kircher M. CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Res 2024; 52:D1143-D1154. [PMID: 38183205 PMCID: PMC10767851 DOI: 10.1093/nar/gkad989] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/14/2023] [Revised: 10/14/2023] [Accepted: 10/17/2023] [Indexed: 01/07/2024]
Abstract
Machine Learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are readily provided via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/ to the community.
Affiliation(s)
- Max Schubach
- Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
- Thorben Maass
- Institute of Human Genetics, University Hospital Schleswig-Holstein, University of Lübeck, Lübeck, Germany
- Lusiné Nazaretyan
- Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
- Sebastian Röner
- Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
- Martin Kircher
- Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
- Institute of Human Genetics, University Hospital Schleswig-Holstein, University of Lübeck, Lübeck, Germany

9. Emani PS, Geradi MN, Gürsoy G, Grasty MR, Miranker A, Gerstein MB. Assessing and mitigating privacy risks of sparse, noisy genotypes by local alignment to haplotype databases. Genome Res 2023; 33:gr.278322.123. [PMID: 38097386 PMCID: PMC10760520 DOI: 10.1101/gr.278322.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Accepted: 11/18/2023] [Indexed: 01/04/2024]
Abstract
Single nucleotide polymorphisms (SNPs) from omics data create a reidentification risk for individuals and their relatives. Although the ability of thousands of SNPs (especially rare ones) to identify individuals has been repeatedly shown, the availability of small sets of noisy genotypes, from environmental DNA samples or functional genomics data, motivated us to quantify their informativeness. We present a computational tool suite, termed Privacy Leakage by Inference across Genotypic HMM Trajectories (PLIGHT), using population-genetics-based hidden Markov models (HMMs) of recombination and mutation to find piecewise alignment of small, noisy SNP sets to reference haplotype databases. We explore cases in which query individuals are either known to be in the database, or not, and consider several genotype queries, including those from environmental sample swabs from known individuals and from simulated "mosaics" (two-individual composites). Using PLIGHT on a database with ∼5000 haplotypes, we find for common, noise-free SNPs that only ten are sufficient to identify individuals, ∼20 can identify both components in two-individual mosaics, and 20-30 can identify first-order relatives. Using noisy environmental-sample-derived SNPs, PLIGHT identifies individuals in a database using ∼30 SNPs. Even when the individuals are not in the database, local genotype matches allow for some phenotypic information leakage based on coarse-grained SNP imputation. Finally, by quantifying privacy leakage from sparse SNP sets, PLIGHT helps determine the value of selectively sanitizing released SNPs without explicit assumptions about population membership or allele frequency. To make this practical, we provide a sanitization tool to remove the most identifying SNPs from genomic data.
Affiliation(s)
- Prashant S Emani
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Maya N Geradi
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Gamze Gürsoy
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Monica R Grasty
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Andrew Miranker
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Mark B Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA
- Department of Statistics and Data Science, Yale University, New Haven, Connecticut 06520, USA

10. Ayday E, Vaidya J, Jiang X, Telenti A. Ensuring Trust in Genomics Research. IEEE INTERNATIONAL CONFERENCE ON TRUST, PRIVACY AND SECURITY IN INTELLIGENT SYSTEMS AND APPLICATIONS (TPS-ISA) 2023; 2023:1-12. [PMID: 38562180 PMCID: PMC10981793 DOI: 10.1109/tps-isa58951.2023.00011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Reproducibility, transparency, representation, and privacy underpin trust in genomics research in general and in genome-wide association studies (GWAS) in particular. Concerns about these issues can be mitigated by technologies that address privacy protection, quality control, and verifiability of GWAS. However, many of the existing technological solutions have been developed in isolation and may address one aspect of reproducibility, transparency, representation, and privacy of GWAS while unknowingly impacting other aspects. As a consequence, the current patchwork of technological tools addresses issues with GWAS only partially and in an overlapping manner, sometimes even creating more problems. This paper addresses the progress in a field that creates technological solutions that augment the acceptance and security of population genetic analyses. The text identifies areas that are falling behind in technical implementation or where there is insufficient research. We make the case that a full understanding of the different GWAS settings, technological tools and new research directions can holistically address the requirements for the acceptance of GWAS.
Affiliation(s)
- Erman Ayday
- Department of Computer and Data Sciences Case Western Reserve University Cleveland, OH
- Jaideep Vaidya
- Management Science and Information Systems Department Rutgers University Newark, NJ
- Xiaoqian Jiang
- Department of Data Science and Artificial Intelligence University of Texas - Health Houston, TX
- Amalio Telenti
- Dept. of Integrative Structural and Computational Biology Scripps Institute La Jolla, CA

11. Tamuhla T, Lulamba ET, Mutemaringa T, Tiffin N. Multiple modes of data sharing can facilitate secondary use of sensitive health data for research. BMJ Glob Health 2023; 8:e013092. [PMID: 37802544 PMCID: PMC10565310 DOI: 10.1136/bmjgh-2023-013092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2023] [Accepted: 09/12/2023] [Indexed: 10/10/2023]
Abstract
Evidence-based healthcare relies on health data from diverse sources to inform decision-making across different domains, including disease prevention, aetiology, diagnostics, therapeutics and prognosis. Increasing volumes of highly granular data provide opportunities to leverage the evidence base, with growing recognition that health data are highly sensitive and onward research use may create privacy issues for individuals providing data. Concerns are heightened for data without explicit informed consent for secondary research use. Additionally, researchers, especially from under-resourced environments and the global South, may wish to participate in onward analysis of resources they collected or retain oversight of onward use to ensure ethical constraints are respected. Different data-sharing approaches may be adopted according to data sensitivity and secondary use restrictions, moving beyond the traditional Open Access model of unidirectional data transfer from generator to secondary user. We describe collaborative data sharing, which facilitates research by combining datasets and undertaking meta-analysis involving collaborating partners; federated data analysis, where partners undertake synchronous, harmonised analyses on their independent datasets and then combine their results in a coauthored report; and trusted research environments, where data are analysed in a controlled environment and only aggregate results are exported. We review how deidentification and anonymisation methods, including data perturbation, can reduce risks specifically associated with health data secondary use. In addition, we present an innovative modularised approach for building data sharing agreements incorporating a more nuanced approach to data sharing to protect privacy, and provide a framework for building the agreements for each of these data-sharing scenarios.
Affiliation(s)
- Tsaone Tamuhla
- South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa
| | - Eddie T Lulamba
- South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa
| | - Themba Mutemaringa
- Provincial Health Data Centre, Health Intelligence Directorate, Western Cape Department of Health and Wellness, Cape Town, Western Cape, South Africa
- Computational Biology Division, Department of Integrative Biomedical Sciences, University of Cape Town, Rondebosch, Western Cape, South Africa
| | - Nicki Tiffin
- South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa
|
12
|
Sadhuka S, Fridman D, Berger B, Cho H. Assessing transcriptomic reidentification risks using discriminative sequence models. Genome Res 2023; 33:1101-1112. [PMID: 37541758 PMCID: PMC10538488 DOI: 10.1101/gr.277699.123] [Received: 01/14/2023] [Accepted: 04/19/2023] [Indexed: 08/06/2023]
Abstract
Gene expression data provide molecular insights into the functional impact of genetic variation, for example, through expression quantitative trait loci (eQTLs). With an improving understanding of the association between genotypes and gene expression comes a greater concern that gene expression profiles could be matched to genotype profiles of the same individuals in another data set, known as a linking attack. Prior works demonstrating such a risk could analyze only a fraction of eQTLs that are independent, owing to restrictive model assumptions, leaving the full extent of this risk incompletely understood. To address this challenge, we introduce the discriminative sequence model (DSM), a novel probabilistic framework for predicting a sequence of genotypes based on gene expression data. By modeling the joint distribution over all known eQTLs in a genomic region, DSM improves the power of linking attacks with necessary calibration for linkage disequilibrium and redundant predictive signals. We show greater linking accuracy of DSM compared with existing approaches across a range of attack scenarios and data sets including up to 22,288 individuals, suggesting that DSM helps uncover a substantial additional risk overlooked by previous studies. Our work provides a unified framework for assessing the privacy risks of sharing diverse omics data sets beyond transcriptomics.
Affiliation(s)
- Shuvom Sadhuka
- Computer Science and AI Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| | - Daniel Fridman
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115, USA
| | - Bonnie Berger
- Computer Science and AI Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| | - Hyunghoon Cho
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
|
13
|
Sweeney SM, Hamadeh HK, Abrams N, Adam SJ, Brenner S, Connors DE, Davis GJ, Fiore L, Gawel SH, Grossman RL, Hanlon SE, Hsu K, Kelloff GJ, Kirsch IR, Louv B, McGraw D, Meng F, Milgram D, Miller RS, Morgan E, Mukundan L, O'Brien T, Robbins P, Rubin EH, Rubinstein WS, Salmi L, Schaller T, Shi G, Sigman CC, Srivastava S. Challenges to Using Big Data in Cancer. Cancer Res 2023; 83:1175-1182. [PMID: 36625843 PMCID: PMC10102837 DOI: 10.1158/0008-5472.can-22-1274] [Received: 04/15/2022] [Revised: 07/29/2022] [Accepted: 12/05/2022] [Indexed: 01/11/2023]
Abstract
Big data in healthcare can enable unprecedented understanding of diseases and their treatment, particularly in oncology. These data may include electronic health records, medical imaging, genomic sequencing, payor records, and data from pharmaceutical research, wearables, and medical devices. The ability to combine datasets and use data across many analyses is critical to the successful use of big data and is a concern for those who generate and use the data. Interoperability and data quality continue to be major challenges when working with different healthcare datasets. Mapping terminology across datasets, missing and incorrect data, and varying data structures make combining data an onerous and largely manual undertaking. Data privacy is another concern addressed by the Health Insurance Portability and Accountability Act, the Common Rule, and the General Data Protection Regulation. The use of big data is now included in the planning and activities of the FDA and the European Medicines Agency. The willingness of organizations to share data in a precompetitive fashion, agreements on data quality standards, and institution of universal and practical tenets on data privacy will be crucial to fully realizing the potential for big data in medicine.
Affiliation(s)
- Shawn M. Sweeney
- American Association for Cancer Research, Philadelphia, Pennsylvania
| | | | - Natalie Abrams
- Division of Cancer Prevention, Early Detection Research Network, National Cancer Institute, Rockville, Maryland
| | - Stacey J. Adam
- Foundation for the National Institutes of Health, Bethesda, Maryland
| | - Sara Brenner
- Office of In Vitro Diagnostics, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, Maryland
| | - Dana E. Connors
- Foundation for the National Institutes of Health, Bethesda, Maryland
| | - Gerard J. Davis
- Abbott Diagnostics Division, Abbott Laboratories, Lake Forest, Illinois
| | - Louis Fiore
- Boston University School of Medicine, Boston and New England Department of Veterans Affairs, Bedford, Massachusetts
| | - Susan H. Gawel
- Abbott Diagnostics Division, Abbott Laboratories, Lake Forest, Illinois
| | - Robert L. Grossman
- Center for Translational Data Science, The University of Chicago, Chicago, Illinois
| | - Sean E. Hanlon
- Center for Strategic Scientific Initiatives, National Cancer Institute, Bethesda, Maryland
| | | | - Gary J. Kelloff
- Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, Maryland
| | | | - Bill Louv
- Project Data Sphere, Morrisville, North Carolina
| | - Deven McGraw
- Ciitizen Platform at Invitae, San Francisco, California
| | - Frank Meng
- Boston University and Veterans Administration Boston Healthcare System, Boston, Massachusetts
| | | | - Robert S. Miller
- CancerLinQ, American Society of Clinical Oncology, Alexandria, Virginia
| | - Emily Morgan
- Foundation for the National Institutes of Health, Bethesda, Maryland
| | | | | | | | | | - Wendy S. Rubinstein
- Office of In Vitro Diagnostics, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, Maryland
| | - Liz Salmi
- Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts
| | | | - George Shi
- Abbott Diagnostics Division, Abbott Laboratories, Lake Forest, Illinois
| | - Caroline C. Sigman
- Boston University and Veterans Administration Boston Healthcare System, Boston, Massachusetts
| | - Sudhir Srivastava
- Cancer Biomarkers Research Group, Division of Cancer Prevention, National Cancer Institute, Rockville, Maryland
|
14
|
Akyüz K, Goisauf M, Chassang G, Kozera Ł, Mežinska S, Tzortzatou-Nanopoulou O, Mayrhofer MT. Post-identifiability in changing sociotechnological genomic data environments. BioSocieties 2023:1-28. [PMID: 37359141 PMCID: PMC10042674 DOI: 10.1057/s41292-023-00299-7] [Accepted: 01/13/2023] [Indexed: 03/30/2023]
Abstract
Data practices in biomedical research often rely on standards that build on normative assumptions regarding privacy and involve 'ethics work.' In an increasingly datafied research environment, identifiability gains a new temporal and spatial dimension, especially in regard to genomic data. In this paper, we analyze how genomic identifiability is considered as a specific data issue in a recent controversial case: publication of the genome sequence of the HeLa cell line. Considering developments in the sociotechnological and data environment, such as big data, biomedical, recreational, and research uses of genomics, our analysis highlights what it means to be (re-)identifiable in the postgenomic era. By showing how the risk of genomic identifiability is not a specificity of the HeLa controversy, but rather a systematic data issue, we argue that a new conceptualization is needed. With the notion of post-identifiability as a sociotechnological situation, we show how past assumptions and ideas about future possibilities come together in the case of genomic identifiability. We conclude by discussing how kinship, temporality, and openness are subject to renewed negotiations along with the changing understandings and expectations of identifiability and status of genomic data.
Affiliation(s)
- Kaya Akyüz
- Department of Science and Technology Studies, University of Vienna, Universitätsstraße 7/Stiege II/6, Stock (NIG), 1010 Vienna, Austria
- BBMRI-ERIC, Graz, Austria
| | - Melanie Goisauf
- Department of Science and Technology Studies, University of Vienna, Universitätsstraße 7/Stiege II/6, Stock (NIG), 1010 Vienna, Austria
- BBMRI-ERIC, Graz, Austria
| | - Gauthier Chassang
- CERPOP, Université de Toulouse, Inserm, Université Paul Sabatier, Toulouse, France
- Plateforme GenoToul Societal “Ethique et Biosciences”, Toulouse, France
| | | | - Signe Mežinska
- Institute of Clinical and Preventive Medicine, University of Latvia, Riga, Latvia
- BBMRI.LV, Riga, Latvia
|
15
|
Guo Y, Liu F, Zhou T, Cai Z, Xiao N. Seeing is believing: Towards interactive visual exploration of data privacy in federated learning. Inf Process Manag 2023. [DOI: 10.1016/j.ipm.2022.103162] [Indexed: 11/26/2022]
|
16
|
TogoVar: A comprehensive Japanese genetic variation database. Hum Genome Var 2022; 9:44. [PMID: 36509753 PMCID: PMC9744889 DOI: 10.1038/s41439-022-00222-9] [Received: 09/01/2022] [Revised: 11/03/2022] [Accepted: 11/07/2022] [Indexed: 12/14/2022] Open
Abstract
TogoVar (https://togovar.org) is a database that integrates allele frequencies derived from Japanese populations and provides annotations for variant interpretation. First, a scheme to reanalyze individual-level genome sequence data deposited in the Japanese Genotype-phenotype Archive (JGA), a controlled-access database, was established to make allele frequencies publicly available. As more Japanese individual-level genome sequence data are deposited in JGA, the sample size employed in TogoVar is expected to increase, contributing to genetic study as reference data for Japanese populations. Second, public datasets of Japanese and non-Japanese populations were integrated into TogoVar to easily compare allele frequencies in Japanese and other populations. Each variant detected in Japanese populations was assigned a TogoVar ID as a permanent identifier. Third, these variants were annotated with molecular consequence, pathogenicity, and literature information for interpreting and prioritizing variants. Here, we introduce the newly developed TogoVar database that compares allele frequencies among Japanese and non-Japanese populations and describes the integrated annotations.
|
17
|
Fierro-Monti I, Wright JC, Choudhary JS, Vizcaíno JA. Identifying individuals using proteomics: are we there yet? Front Mol Biosci 2022; 9:1062031. [PMID: 36523653 PMCID: PMC9744771 DOI: 10.3389/fmolb.2022.1062031] [Received: 10/05/2022] [Accepted: 11/16/2022] [Indexed: 08/31/2023] Open
Abstract
Multi-omics approaches including proteomics analyses are becoming an integral component of precision medicine. As clinical proteomics studies gain momentum and their sensitivity increases, research on identifying individuals based on their proteomics data is here examined for risks and ethics-related issues. A great deal of work has already been done on this topic for DNA/RNA sequencing data, but it has yet to be widely studied in other omics fields. The current state-of-the-art for the identification of individuals based solely on proteomics data is explained. Protein sequence variation analysis approaches are covered in more detail, including the available analysis workflows and their limitations. We also outline some previous forensic and omics proteomics studies that are relevant for the identification of individuals. Following that, we discuss the risks of patient reidentification using other proteomics data types such as protein expression abundance and post-translational modification (PTM) profiles. In light of the potential identification of individuals through proteomics data, possible legal and ethical implications are becoming increasingly important in the field.
Affiliation(s)
- Ivo Fierro-Monti
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, United Kingdom
| | | | | | - Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, United Kingdom
|
18
|
Woerner AE, Mandape S, Kapema KB, Duque TM, Smuts A, King JL, Crysup B, Wang X, Huang M, Ge J, Budowle B. Optimized variant calling for estimating kinship. Forensic Sci Int Genet 2022; 61:102785. [DOI: 10.1016/j.fsigen.2022.102785] [Received: 04/14/2022] [Revised: 08/07/2022] [Accepted: 09/29/2022] [Indexed: 11/16/2022]
|
19
|
Kim M, Wang S, Jiang X, Harmanci A. SVAT: Secure outsourcing of variant annotation and genotype aggregation. BMC Bioinformatics 2022; 23:409. [PMID: 36182914 PMCID: PMC9526274 DOI: 10.1186/s12859-022-04959-6] [Received: 08/22/2021] [Accepted: 09/20/2022] [Indexed: 11/10/2022] Open
Abstract
Background
Sequencing of thousands of samples provides genetic variants with allele frequencies spanning a very large spectrum and gives invaluable insight into genetic determinants of diseases. Protecting the genetic privacy of participants is challenging as only a few rare variants can easily re-identify an individual among millions. In certain cases, there are policy barriers against sharing genetic data from indigenous populations and stigmatizing conditions.
Results
We present SVAT, a method for secure outsourcing of variant annotation and aggregation, which are two basic steps in variant interpretation and detection of causal variants. SVAT uses homomorphic encryption to encrypt the data at the client-side. The data always stays encrypted while it is stored, in-transit, and most importantly while it is analyzed. SVAT makes use of a vectorized data representation to convert annotation and aggregation into efficient vectorized operations in a single framework. Also, SVAT utilizes a secure re-encryption approach so that multiple disparate genotype datasets can be combined for federated aggregation and secure computation of allele frequencies on the aggregated dataset.
Conclusions
Overall, SVAT provides a secure, flexible, and practical framework for privacy-aware outsourcing of annotation, filtering, and aggregation of genetic variants. SVAT is publicly available for download from https://github.com/harmancilab/SVAT.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12859-022-04959-6.
Affiliation(s)
- Miran Kim
- Department of Mathematics, Hanyang University, Seoul, 04763, Republic of Korea
| | - Su Wang
- Center for Precision Health, School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
| | - Xiaoqian Jiang
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
| | - Arif Harmanci
- Center for Precision Health, School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA.
|
20
|
TrustGWAS: A full-process workflow for encrypted GWAS using multi-key homomorphic encryption and pseudorandom number perturbation. Cell Syst 2022; 13:752-767.e6. [PMID: 36041458 DOI: 10.1016/j.cels.2022.08.001] [Received: 11/19/2021] [Revised: 04/21/2022] [Accepted: 08/04/2022] [Indexed: 01/26/2023]
Abstract
The statistical power of genome-wide association studies (GWASs) is affected by the effective sample size. However, the privacy and security concerns associated with individual-level genotype data pose great challenges for cross-institutional cooperation. The full-process cryptographic solutions are in demand but have not been covered, especially the essential principal-component analysis (PCA). Here, we present TrustGWAS, a complete solution for secure, large-scale GWAS, recapitulating gold standard results against PLINK without compromising privacy and supporting basic PLINK steps including quality control, linkage disequilibrium pruning, PCA, chi-square test, Cochran-Armitage trend test, covariate-supported logistic regression and linear regression, and their sequential combinations. TrustGWAS leverages pseudorandom number perturbations for PCA and multiparty scheme of multi-key homomorphic encryption for all other modules. TrustGWAS can evaluate 100,000 individuals with 1 million variants and complete QC-LD-PCA-regression workflow within 50 h. We further successfully discover gene loci associated with fasting blood glucose, consistent with the findings of the ChinaMAP project.
|
21
|
Abstract
Genomics data are important for advancing biomedical research, improving clinical care, and informing other disciplines such as forensics and genealogy. However, privacy concerns arise when genomic data are shared. In particular, the identifying nature of genetic information, its direct relationship to health status, and the potential financial harm and stigmatization posed to individuals and their blood relatives call for a survey of the privacy issues related to sharing genetic and related data and potential solutions to overcome these issues. In this work, we provide an overview of the importance of genomic privacy, the information gleaned from genomics data, the sources of potential private information leakages in genomics, and ways to preserve privacy while utilizing the genetic information in research. We discuss the relationship between trust in the scientific community and protecting privacy, illuminating a future roadmap for data sharing and study participation.
Affiliation(s)
- Gamze Gürsoy
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- New York Genome Center, New York, NY, USA
|
22
|
Chen Z, Qian Y, Wang Y, Fang Y. Deep Convolutional Generative Adversarial Network-Based EMG Data Enhancement for Hand Motion Classification. Front Bioeng Biotechnol 2022; 10:909653. [PMID: 36061423 PMCID: PMC9431769 DOI: 10.3389/fbioe.2022.909653] [Received: 03/31/2022] [Accepted: 06/22/2022] [Indexed: 11/13/2022] Open
Abstract
The acquisition of bio-signals from the human body requires a strict experimental setup and ethical approvals, which leads to limited data for the training of classifiers in the era of big data. It will change the situation if synthetic data can be generated based on real data. This article proposes such a kind of multiple channel electromyography (EMG) data enhancement method using a deep convolutional generative adversarial network (DCGAN). The generation procedure is as follows: First, the multiple channels of EMG signals within sliding windows are converted to grayscale images through matrix transformation, normalization, and histogram equalization. Second, the grayscale images of each class are used to train DCGAN so that synthetic grayscale images of each class can be generated with the input of random noises. To evaluate whether the synthetic data own the similarity and diversity with the real data, the classification accuracy index is adopted in this article. A public EMG dataset (that is, ISR Myo-I) for hand motion recognition is used to prove the usability of the proposed method. The experimental results show that adding synthetic data to the training data has little effect on the classification performance, indicating the similarity between real data and synthetic data. Moreover, it is also noted that the average accuracy (five classes) is slightly increased by 1%–2% for support vector machine (SVM) and random forest (RF), respectively, with additional synthetic data for training. Although the improvement is not statistically significant, it implies that the generated data by DCGAN own its new characteristics, and it is possible to enrich the diversity of the training dataset. In addition, cross-validation analysis shows that the synthetic samples have large inter-class distance, reflected by higher cross-validation accuracy of pure synthetic sample classification. Furthermore, this article also demonstrates that histogram equalization can significantly improve the performance of EMG-based hand motion recognition.
|
23
|
Jiang Y, Mosquera L, Jiang B, Kong L, El Emam K. Measuring re-identification risk using a synthetic estimator to enable data sharing. PLoS One 2022; 17:e0269097. [PMID: 35714132 PMCID: PMC9205507 DOI: 10.1371/journal.pone.0269097] [Received: 06/28/2021] [Accepted: 05/13/2022] [Indexed: 11/18/2022] Open
Abstract
Background
One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. To demonstrate that the risk of re-identification is acceptably low, re-identification risk metrics are used. There is a dearth of good risk estimators modeling the attack scenario where an adversary selects a record from the microdata sample and attempts to match it with individuals in the population.
Objectives
Develop an accurate risk estimator for the sample-to-population attack.
Methods
A type of estimator based on creating a synthetic variant of a population dataset was developed to estimate the re-identification risk for an adversary performing a sample-to-population attack. The accuracy of the estimator was evaluated through a simulation on four different datasets in terms of estimation error. Two estimators were considered, a Gaussian copula and a d-vine copula. They were compared against three other estimators proposed in the literature.
Results
Taking the average of the two copula estimates consistently had a median error below 0.05 across all sampling fractions and true risk values. This was significantly more accurate than existing methods. A sensitivity analysis of the estimator accuracy based on variation in input parameter accuracy provides further application guidance. The estimator was then used to assess re-identification risk and de-identify a large Ontario COVID-19 behavioral survey dataset.
Conclusions
The average of two copula estimators consistently provides the most accurate re-identification risk estimate and can serve as a good basis for managing privacy risks when data are de-identified and shared.
Affiliation(s)
- Yangdi Jiang
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada
- Replica Analytics Ltd., Ottawa, Ontario, Canada
| | | | - Bei Jiang
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada
| | - Linglong Kong
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada
| | - Khaled El Emam
- Replica Analytics Ltd., Ottawa, Ontario, Canada
- School of Epidemiology and Public Health, University of Ottawa, Ottawa, Ontario, Canada
- Children's Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada
|
24
|
Zhang C, Bonomi L. Mitigating Membership Inference in Deep Learning Applications with High Dimensional Genomic Data. IEEE International Conference on Healthcare Informatics 2022; 2022:10.1109/ichi54592.2022.00101. [PMID: 36120416 PMCID: PMC9473339 DOI: 10.1109/ichi54592.2022.00101] [Indexed: 06/09/2023]
Abstract
The use of deep learning techniques in medical applications holds great promises for advancing health care. However, there are growing privacy concerns regarding what information about individual data contributors (i.e., patients in the training set) these deep models may reveal when shared with external users. In this work, we first investigate the membership privacy risks in sharing deep learning models for cancer genomics tasks, and then study the applicability of privacy-protecting strategies for mitigating these privacy risks.
Affiliation(s)
- Chonghao Zhang
- Dept. of Computer Science and Engineering, University of California, San Diego, La Jolla, CA
| | - Luca Bonomi
- Dept. of Biomedical Informatics, Vanderbilt University, Nashville, TN
|
25
|
Nakagawa Y, Ohata S, Shimizu K. Efficient privacy-preserving variable-length substring match for genome sequence. Algorithms Mol Biol 2022; 17:9. [PMID: 35473587 PMCID: PMC9040336 DOI: 10.1186/s13015-022-00211-1] [Received: 11/19/2021] [Accepted: 03/01/2022] [Indexed: 11/28/2022] Open
Abstract
The development of a privacy-preserving technology is important for accelerating genome data sharing. This study proposes an algorithm that securely searches a variable-length substring match between a query and a database sequence. Our concept hinges on a technique that efficiently applies FM-index for a secret-sharing scheme. More precisely, we developed an algorithm that can achieve a secure table lookup in such a way that $V[V[\ldots V[p_0] \ldots ]]$ is computed for a given depth of recursion where $p_0$ is an initial position, and $V$ is a vector. We used the secure table lookup for vectors created based on FM-index. The notable feature of the secure table lookup is that time, communication, and round complexities are not dependent on the table length $N$, after the query input. Therefore, a substring match by reference to the FM-index-based table can also be conducted independently against the database length, and the entire search time is dramatically improved compared to previous approaches. We conducted an experiment using a human genome sequence with the length of 10 million as the database and a query with the length of 100 and found that the query response time of our protocol was at least three orders of magnitude faster than a non-indexed database search protocol under the realistic computation/network environment.
|
26
|
Bonomi L, Wu Z, Fan L. Sharing personal ECG time-series data privately. J Am Med Inform Assoc 2022; 29:1152-1160. [PMID: 35380666 PMCID: PMC9196703 DOI: 10.1093/jamia/ocac047] [Received: 09/09/2021] [Revised: 03/16/2022] [Accepted: 03/31/2022] [Indexed: 11/13/2022] Open
Abstract
Objective
Emerging technologies (eg, wearable devices) have made it possible to collect data directly from individuals (eg, time-series), providing new insights on the health and well-being of individual patients. Broadening the access to these data would facilitate the integration with existing data sources (eg, clinical and genomic data) and advance medical research. Compared to traditional health data, these data are collected directly from individuals, are highly unique and provide fine-grained information, posing new privacy challenges. In this work, we study the applicability of a novel privacy model to enable individual-level time-series data sharing while maintaining the usability for data analytics.
Methods and materials
We propose a privacy-protecting method for sharing individual-level electrocardiography (ECG) time-series data, which leverages dimensional reduction technique and random sampling to achieve provable privacy protection. We show that our solution provides strong privacy protection against an informed adversarial model while enabling useful aggregate-level analysis.
Results
We conduct our evaluations on 2 real-world ECG datasets. Our empirical results show that the privacy risk is significantly reduced after sanitization while the data usability is retained for a variety of clinical tasks (eg, predictive modeling and clustering).
Discussion
Our study investigates the privacy risk in sharing individual-level ECG time-series data. We demonstrate that individual-level data can be highly unique, requiring new privacy solutions to protect data contributors.
Conclusion
The results suggest our proposed privacy-protection method provides strong privacy protections while preserving the usefulness of the data.
Affiliation(s)
- Luca Bonomi
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Zeyun Wu
- Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, California, USA
| | - Liyue Fan
- Department of Computer Science, University of North Carolina at Charlotte, Charlotte, North Carolina, USA
| |
Collapse
|
27
|
Functional genomics data: privacy risk assessment and technological mitigation. Nat Rev Genet 2022; 23:245-258. [PMID: 34759381 DOI: 10.1038/s41576-021-00428-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Accepted: 10/18/2021] [Indexed: 12/15/2022]
Abstract
The generation of functional genomics data by next-generation sequencing has increased greatly in the past decade. Broad sharing of these data is essential for research advancement but poses notable privacy challenges, some of which are analogous to those that occur when sharing genetic variant data. However, there are also unique privacy challenges that arise from cryptic information leakage during the processing and summarization of functional genomics data from raw reads to derived quantities, such as gene expression values. Here, we review these challenges and present potential solutions for mitigating privacy risks while allowing broad data dissemination and analysis.
28
Santaló J, Berdasco M. Ethical implications of epigenetics in the era of personalized medicine. Clin Epigenetics 2022; 14:44. [PMID: 35337378 PMCID: PMC8953972 DOI: 10.1186/s13148-022-01263-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 01/27/2022] [Accepted: 03/17/2022] [Indexed: 11/10/2022] Open
Abstract
Given the increasing research activity on epigenetics to monitor human diseases and its connection with lifestyle and environmental exposures, the field of epigenetics has also attracted a great deal of interest at the ethical and societal level. In this review, we will identify and discuss current ethical, legal and social issues of epigenetics research in the context of personalized medicine. The review covers ethical aspects such as how epigenetic information should impact patient autonomy and the ability to generate an intentional and voluntary decision, the measures of data protection related to privacy and confidentiality derived from epigenome studies (e.g., risk of discrimination, patient re-identification and unexpected findings) or the debate on the distribution of responsibilities for health (i.e., personal versus public responsibilities). We pay special attention to the risk of social discrimination and stigmatization as a consequence of inferring information related to lifestyle and environmental exposures potentially contained in epigenetic data. Furthermore, as exposures to the environment and individual habits do not affect all populations equally, the violation of the principle of distributive justice in the access to the benefits of clinical epigenetics is discussed. In this regard, epigenetics represents a great opportunity for the integration of public policy measures aimed to create healthier living environments. Whether these public policies will coexist or, in contrast, compete with strategies reinforcing the personalized medicine interventions needs to be considered. The review ends with a reflection on the main challenges in epigenetic research, some of them in a technical dimension (e.g., assessing causality or establishing reference epigenomes) but also in the ethical and social sphere (e.g., risk of adding an epigenetic determinism on top of the current genetic one).
In sum, integration into life science investigation of social experiences such as exposure to risk, nutritional habits, prejudice and stigma, is imperative to understand epigenetic variation in disease. This pragmatic approach is required to locate clinical epigenetics out of the experimental laboratories and facilitate its implementation into society.
Affiliation(s)
- Josep Santaló: Facultat de Biociències, Unitat de Biologia Cel·lular, Universitat Autònoma de Barcelona, Barcelona, Spain
- María Berdasco: Cancer Epigenetics and Biology Program (PEBC), Bellvitge Biomedical Research Institute (IDIBELL), Barcelona, Catalonia, Spain; Epigenetic Therapies Group, Experimental and Clinical Hematology Program (PHEC), Josep Carreras Leukaemia Research Institute, Badalona, Barcelona, Catalonia, Spain
29
Wan Z, Hazel JW, Clayton EW, Vorobeychik Y, Kantarcioglu M, Malin BA. Sociotechnical safeguards for genomic data privacy. Nat Rev Genet 2022; 23:429-445. [PMID: 35246669 PMCID: PMC8896074 DOI: 10.1038/s41576-022-00455-y] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Accepted: 01/24/2022] [Indexed: 12/21/2022]
Abstract
Recent developments in a variety of sectors, including health care, research and the direct-to-consumer industry, have led to a dramatic increase in the amount of genomic data that are collected, used and shared. This state of affairs raises new and challenging concerns for personal privacy, both legally and technically. This Review appraises existing and emerging threats to genomic data privacy and discusses how well current legal frameworks and technical safeguards mitigate these concerns. It concludes with a discussion of remaining and emerging challenges and illustrates possible solutions that can balance protecting privacy and realizing the benefits that result from the sharing of genetic information. In this Review, the authors describe technical and legal protection mechanisms for mitigating vulnerabilities in genomic data privacy. They also discuss how these protections are dependent on the context of data use such as in research, health care, direct-to-consumer testing or forensic investigations.
Affiliation(s)
- Zhiyu Wan: Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Computer Science, Vanderbilt University, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- James W Hazel: Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA; Center for Biomedical Ethics and Society, Vanderbilt University, Nashville, TN, USA
- Ellen Wright Clayton: Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA; Center for Biomedical Ethics and Society, Vanderbilt University, Nashville, TN, USA; Vanderbilt University Law School, Nashville, TN, USA
- Yevgeniy Vorobeychik: Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO, USA
- Murat Kantarcioglu: Department of Computer Science, University of Texas at Dallas, Richardson, TX, USA
- Bradley A Malin: Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Computer Science, Vanderbilt University, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
30
Akgün M, Pfeifer N, Kohlbacher O. Efficient privacy-preserving whole-genome variant queries. Bioinformatics 2022; 38:2202-2210. [PMID: 35150254 PMCID: PMC9004657 DOI: 10.1093/bioinformatics/btac070] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/06/2021] [Revised: 01/13/2022] [Accepted: 02/03/2022] [Indexed: 02/03/2023] Open
Abstract
Motivation
Diagnosis and treatment decisions based on genomic data have become widespread as the cost of genome sequencing gradually decreases. In this context, disease-gene association studies are of great importance. However, genomic data are very sensitive compared with other data types and contain information about individuals and their relatives. Many studies have shown that this information can be obtained from the query-response pairs on genomic databases. In this work, we propose a method that uses secure multi-party computation to query genomic databases in a privacy-protected manner. The proposed solution privately outsources genomic data from arbitrarily many sources to two non-colluding proxies and allows genomic databases to be safely stored in semi-honest cloud environments. It provides data privacy, query privacy and output privacy by using XOR-based sharing and, unlike previous solutions, it allows queries to run efficiently on hundreds of thousands of genomic records.
Results
We measure the performance of our solution with parameters similar to real-world applications. It is possible to query a genomic database with 3 000 000 variants with five genomic query predicates in under 400 ms. Querying 1 048 576 genomes, each containing 1 000 000 variants, for the presence of five different query variants can be achieved in approximately 6 min with a small amount of dedicated hardware and connectivity. These execution times are in the right range to enable real-world applications in medical research and healthcare. Unlike previous studies, it is possible to query multiple databases with response times fast enough for practical application. To the best of our knowledge, this is the first solution that provides this performance for querying large-scale genomic data.
Availability and implementation
https://gitlab.com/DIFUTURE/privacy-preserving-variant-queries.
Supplementary information
Supplementary data are available at Bioinformatics online.
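As a rough illustration of the XOR-based sharing named in this abstract, the sketch below splits a bit-encoded value into two shares, one for each non-colluding proxy. It is not the authors' implementation: the helper names `xor_share` and `xor_reconstruct`, the 32-bit default width, and the toy bitmaps are all assumptions made for the example.

```python
import secrets


def xor_share(value: int, bits: int = 32) -> tuple[int, int]:
    """Split `value` into two XOR shares (hypothetical helper)."""
    if not 0 <= value < (1 << bits):
        raise ValueError("value does not fit in the given bit width")
    mask = secrets.randbits(bits)  # uniformly random share for proxy A
    return mask, value ^ mask      # proxy B receives value XOR mask


def xor_reconstruct(share_a: int, share_b: int) -> int:
    """Recombine both shares; a single share is just a random bit string."""
    return share_a ^ share_b


# Toy bitmaps (made-up data): bit i set means variant i is present.
genome_x = 0b1011_0010
genome_y = 0b0110_0110

xa, xb = xor_share(genome_x)
ya, yb = xor_share(genome_y)
assert xor_reconstruct(xa, xb) == genome_x

# XOR is linear over shares: each proxy combines its own shares locally,
# and the two local results recombine to genome_x XOR genome_y.
assert xor_reconstruct(xa ^ ya, xb ^ yb) == genome_x ^ genome_y
```

Only XOR comes for free in this setting. Evaluating an AND of two shared bits, which a presence query would need, typically requires an interactive protocol between the proxies and is not shown here.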
Affiliation(s)
- Mete Akgün (to whom correspondence should be addressed)
- Nico Pfeifer: Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany; Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Tübingen, Germany; Statistical Learning in Computational Biology, Max Planck Institute for Informatics, Saarbrücken, Germany
- Oliver Kohlbacher: Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany; Translational Bioinformatics, University Hospital Tübingen, Tübingen, Germany; Applied Bioinformatics, Department of Computer Science, University of Tübingen, Tübingen, Germany
31
Torkzadehmahani R, Nasirigerdeh R, Blumenthal DB, Kacprowski T, List M, Matschinske J, Spaeth J, Wenke NK, Baumbach J. Privacy-Preserving Artificial Intelligence Techniques in Biomedicine. Methods Inf Med 2022; 61:e12-e27. [PMID: 35062032 PMCID: PMC9246509 DOI: 10.1055/s-0041-1740630] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 12/15/2022]
Abstract
Background
Artificial intelligence (AI) has been successfully applied in numerous scientific domains. In biomedicine, AI has already shown tremendous potential, e.g., in the interpretation of next-generation sequencing data and in the design of clinical decision support systems.
Objectives
However, training an AI model on sensitive data raises concerns about the privacy of individual participants. For example, summary statistics of a genome-wide association study can be used to determine the presence or absence of an individual in a given dataset. This considerable privacy risk has led to restrictions in accessing genomic and other biomedical data, which is detrimental for collaborative research and impedes scientific progress. Hence, there has been a substantial effort to develop AI methods that can learn from sensitive data while protecting individuals' privacy.
Method
This paper provides a structured overview of recent advances in privacy-preserving AI techniques in biomedicine. It places the most important state-of-the-art approaches within a unified taxonomy and discusses their strengths, limitations, and open problems.
Conclusion
As the most promising direction, we suggest combining federated machine learning, as the more scalable approach, with additional privacy-preserving techniques. This would merge their advantages and provide privacy guarantees in a distributed way for biomedical applications. Nonetheless, more research is necessary, as hybrid approaches pose new challenges such as additional network or computation overhead.
Affiliation(s)
- Reihaneh Torkzadehmahani: Institute for Artificial Intelligence in Medicine and Healthcare, Technical University of Munich, Munich, Germany
- Reza Nasirigerdeh: Institute for Artificial Intelligence in Medicine and Healthcare, Technical University of Munich, Munich, Germany; Klinikum Rechts der Isar, Technical University of Munich, Munich, Germany
- David B Blumenthal: Department of Artificial Intelligence in Biomedical Engineering (AIBE), Friedrich-Alexander University Erlangen-Nürnberg (FAU), Erlangen, Germany
- Tim Kacprowski: Division of Data Science in Biomedicine, Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Medical School Hannover, Braunschweig, Germany; Braunschweig Integrated Centre of Systems Biology (BRICS), TU Braunschweig, Braunschweig, Germany
- Markus List: Chair of Experimental Bioinformatics, Technical University of Munich, Munich, Germany
- Julian Matschinske: E.U. Horizon2020 FeatureCloud Project Consortium; Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Julian Spaeth: E.U. Horizon2020 FeatureCloud Project Consortium; Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Nina Kerstin Wenke: E.U. Horizon2020 FeatureCloud Project Consortium; Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Jan Baumbach: E.U. Horizon2020 FeatureCloud Project Consortium; Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany; Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
32
Jafarbeiki S, Sakzad A, Kasra Kermanshahi S, Gaire R, Steinfeld R, Lai S, Abraham G, Thapa C. PrivGenDB: Efficient and privacy-preserving query executions over encrypted SNP-Phenotype database. Informatics in Medicine Unlocked 2022. [DOI: 10.1016/j.imu.2022.100988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022] Open
33
Akyüz K, Chassang G, Goisauf M, Kozera Ł, Mezinska S, Tzortzatou O, Mayrhofer MT. Biobanking and risk assessment: a comprehensive typology of risks for an adaptive risk governance. Life Sciences, Society and Policy 2021; 17:10. [PMID: 34903285 PMCID: PMC8666836 DOI: 10.1186/s40504-021-00117-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/30/2021] [Accepted: 12/01/2021] [Indexed: 05/03/2023]
Abstract
Biobanks act as the custodians for the access to and responsible use of human biological samples and related data that have been generously donated by individuals to serve the public interest and scientific advances in the health research realm. Risk assessment has become a daily practice for biobanks and has been discussed from different perspectives. This paper aims to provide a literature review on risk assessment in order to put together a comprehensive typology of diverse risks biobanks could potentially face. Methodologically set as a typology, the conceptual approach used in this paper is based on the interdisciplinary analysis of scientific literature, the relevant ethical and legal instruments and practices in biobanking to identify how risks are assessed, considered and mitigated. Through an interdisciplinary mapping exercise, we have produced a typology of potential risks in biobanking, taking into consideration the perspectives of different stakeholders, such as institutional actors and publics, including participants and representative organizations. With this approach, we have identified the following risk types: economic, infrastructural, institutional, research community risks and participant's risks. The paper concludes by highlighting the necessity of an adaptive risk governance as an integral part of good governance in biobanking. In this regard, it contributes to sustainability in biobanking by assisting in the design of relevant risk management practices, where they are not already in place or require an update. The typology is intended to be useful from the early stages of establishing such a complex and multileveled biomedical infrastructure as well as to provide a catalogue of risks for improving the risk management practices already in place.
Affiliation(s)
- Kaya Akyüz: BBMRI-ERIC, Graz, Austria; Department of Science and Technology Studies, University of Vienna, Vienna, Austria
- Gauthier Chassang: BBMRI-ERIC, Graz, Austria; CERPOP, Université de Toulouse, Inserm, Université Paul Sabatier, Toulouse, France
- Melanie Goisauf: BBMRI-ERIC, Graz, Austria; Department of Science and Technology Studies, University of Vienna, Vienna, Austria
- Signe Mezinska: BBMRI-ERIC, Graz, Austria; Institute of Clinical and Preventive Medicine, University of Latvia, Riga, Latvia
- Olga Tzortzatou: BBMRI-ERIC, Graz, Austria; Biomedical Research Foundation of the Academy of Athens, Athens, Greece
34
Wan Z, Vorobeychik Y, Xia W, Liu Y, Wooders M, Guo J, Yin Z, Clayton EW, Kantarcioglu M, Malin BA. Using game theory to thwart multistage privacy intrusions when sharing data. Science Advances 2021; 7:eabe9986. [PMID: 34890225 PMCID: PMC8664254 DOI: 10.1126/sciadv.abe9986] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/14/2020] [Accepted: 10/25/2021] [Indexed: 06/13/2023]
Abstract
Person-specific biomedical data are now widely collected, but their sharing raises privacy concerns, specifically about the re-identification of seemingly anonymous records. Formal re-identification risk assessment frameworks can inform decisions about whether and how to share data; current techniques, however, focus on scenarios where the data recipients use only one resource for re-identification purposes. This is a concern because recent attacks show that adversaries can access multiple resources, combining them in a stage-wise manner, to enhance the chance of an attack's success. In this work, we represent a re-identification game using a two-player Stackelberg game of perfect information, which can be applied to assess risk, and suggest an optimal data sharing strategy based on a privacy-utility tradeoff. We report on experiments with large-scale genomic datasets to show that, using game theoretic models accounting for adversarial capabilities to launch multistage attacks, most data can be effectively shared with low re-identification risk.
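The leader-follower logic of a Stackelberg game can be sketched with backward induction. The example below is a deliberately simplified, single-stage toy and not the multistage model of the paper: the `Strategy` fields, the payoff formulas, and every number are assumptions chosen only to show how the data sharer anticipates the adversary's best response before committing to a sharing strategy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """A hypothetical sharing option available to the data sharer (leader)."""
    name: str
    utility: float    # benefit to the sharer from releasing data this way
    reid_prob: float  # probability that an attack re-identifies the record


def adversary_attacks(s: Strategy, gain: float, cost: float) -> bool:
    """Follower's best response: attack only if the expected gain beats the cost."""
    return s.reid_prob * gain - cost > 0


def sharer_payoff(s: Strategy, gain: float, cost: float, loss: float) -> float:
    """Leader's payoff once the follower's best response is anticipated."""
    expected_loss = s.reid_prob * loss if adversary_attacks(s, gain, cost) else 0.0
    return s.utility - expected_loss


def best_strategy(strategies, gain: float, cost: float, loss: float) -> Strategy:
    """Backward induction: choose the option with the highest anticipated payoff."""
    return max(strategies, key=lambda s: sharer_payoff(s, gain, cost, loss))


# Made-up numbers for illustration only.
options = [
    Strategy("share everything", utility=10.0, reid_prob=0.5),
    Strategy("share masked subset", utility=7.0, reid_prob=0.05),
    Strategy("share nothing", utility=0.0, reid_prob=0.0),
]
choice = best_strategy(options, gain=100.0, cost=10.0, loss=50.0)
print(choice.name)
```

With these toy payoffs the masked release wins because it makes an attack unprofitable for the adversary, which is the intuition behind sharing most data at low re-identification risk.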
Affiliation(s)
- Zhiyu Wan: Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Yevgeniy Vorobeychik: Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA
- Weiyi Xia: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Yongtai Liu: Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA
- Myrna Wooders: Department of Economics, Vanderbilt University, Nashville, TN 37235, USA
- Jia Guo: Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA
- Zhijun Yin: Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Ellen Wright Clayton: Center for Biomedical Ethics and Society, Vanderbilt University Medical Center, Nashville, TN 37203, USA; School of Law, Vanderbilt University, Nashville, TN 37203, USA; Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Murat Kantarcioglu: Department of Computer Science, University of Texas at Dallas, Richardson, TX 75080, USA; Institute for Quantitative Social Science, Harvard University, Cambridge, MA 02138, USA; Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA
- Bradley A. Malin: Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
35
Dupras C, Bunnik EM. Toward a Framework for Assessing Privacy Risks in Multi-Omic Research and Databases. The American Journal of Bioethics: AJOB 2021; 21:46-64. [PMID: 33433298 DOI: 10.1080/15265161.2020.1863516] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/12/2023]
Abstract
While the accumulation and increased circulation of genomic data have captured much attention over the past decade, privacy risks raised by the diversification and integration of omics have been largely overlooked. In this paper, we propose the outline of a framework for assessing privacy risks in multi-omic research and databases. Following a comparison of privacy risks associated with genomic and epigenomic data, we dissect ten privacy risk-impacting omic data properties that affect either the risk of re-identification of research participants, or the sensitivity of the information potentially conveyed by biological data. We then propose a three-step approach for the assessment of privacy risks in the multi-omic era. Thus, we lay grounds for a data property-based, 'pan-omic' approach that moves away from genetic exceptionalism. We conclude by inviting our peers to refine these theoretical foundations, put them to the test in their respective fields, and translate our approach into practical guidance.
36
Ziegenhain C, Sandberg R. BAMboozle removes genetic variation from human sequence data for open data sharing. Nat Commun 2021; 12:6216. [PMID: 34711808 PMCID: PMC8553849 DOI: 10.1038/s41467-021-26152-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/20/2021] [Accepted: 09/20/2021] [Indexed: 11/18/2022] Open
Abstract
The risks associated with re-identification of human genetic data are severely limiting open data sharing in life sciences, even in studies where donor-related genetic variant information is not of primary interest. Here, we developed BAMboozle, a versatile tool to eliminate critical types of sensitive genetic information in human sequence data by reverting aligned reads to the genome reference sequence. Applying BAMboozle to functional genomics data, such as single-cell RNA-seq (scRNA-seq) and scATAC-seq datasets, confirmed the removal of donor-related single nucleotide polymorphisms (SNPs) and indels in a manner that did not disclose the altered positions. Importantly, BAMboozle only removes the genetic sequence variants of the sample (i.e., donor) while preserving other important aspects of the raw sequence data. For example, BAMboozled scRNA-seq data contained accurate cell-type associated gene expression signatures, splice kinetic information, and can be used for methods benchmarking. Altogether, BAMboozle efficiently removes genetic variation in aligned sequence data, which represents a step forward towards open data sharing in many areas of genomics where the genetic variant information is not of primary interest.
Affiliation(s)
- Christoph Ziegenhain: Department of Cell and Molecular Biology, Karolinska Institute, Stockholm, Sweden
- Rickard Sandberg: Department of Cell and Molecular Biology, Karolinska Institute, Stockholm, Sweden
37
Hekel R, Budis J, Kucharik M, Radvanszky J, Pös Z, Szemes T. Privacy-preserving storage of sequenced genomic data. BMC Genomics 2021; 22:712. [PMID: 34600465 PMCID: PMC8487550 DOI: 10.1186/s12864-021-07996-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/22/2021] [Accepted: 09/10/2021] [Indexed: 11/23/2022] Open
Abstract
Background
The current and future applications of genomic data may raise ethical and privacy concerns. Processing and storing these data introduces a risk of abuse by potential offenders, since the human genome contains sensitive personal information. For this reason, we have developed a privacy-preserving method, named Varlock, that provides secure storage of sequenced genomic data. We used a public set of population allele frequencies to mask the personal alleles detected in genomic reads. Each personal allele described by the public set is masked by a randomly selected population allele with respect to its frequency. Masked alleles are preserved in an encrypted confidential file that can be shared in whole or in part using public-key cryptography.
Results
Our method masked the personal variants and introduced new variants detected in a personal masked genome. Alternative alleles with lower population frequency were masked and introduced more often. We performed a joint PCA analysis of personal and masked VCFs, showing that the VCFs between the two groups cannot be trivially mapped. Moreover, the method is reversible and personal alleles in specific genomic regions can be unmasked on demand.
Conclusion
Our method masks personal alleles within genomic reads while preserving valuable non-sensitive properties of sequenced DNA fragments for further research. Personal alleles in the desired genomic regions may be restored and shared with patients, clinics, and researchers. We suggest that the method can provide an additional security layer for storing and sharing of the raw aligned reads.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12864-021-07996-2.
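The masking idea in this abstract (replace each personal allele with a population allele drawn according to its frequency, and keep the originals in a confidential record so the change is reversible) can be sketched as follows. This is an illustrative toy, not Varlock's code or file format: the functions `mask_alleles` and `unmask`, the position-keyed dictionaries, and the frequencies are all assumptions, and the confidential record is kept as a plain dictionary where the real method would encrypt it.

```python
import random


def mask_alleles(personal, pop_freqs, rng):
    """Replace each allele at a publicly described position with a
    frequency-weighted random population allele (hypothetical helper).

    Returns the masked genotype plus the originals needed to undo it.
    """
    masked, originals = {}, {}
    for pos, allele in personal.items():
        freqs = pop_freqs.get(pos)
        if freqs is None:            # position not in the public set: leave as is
            masked[pos] = allele
            continue
        alleles = list(freqs)
        weights = [freqs[a] for a in alleles]
        masked[pos] = rng.choices(alleles, weights=weights, k=1)[0]
        originals[pos] = allele      # in practice this record would be encrypted
    return masked, originals


def unmask(masked, originals, positions=None):
    """Restore original alleles, optionally only at the requested positions."""
    restored = dict(masked)
    for pos, allele in originals.items():
        if positions is None or pos in positions:
            restored[pos] = allele
    return restored


# Made-up example data.
personal = {101: "T", 205: "G", 999: "A"}
pop_freqs = {101: {"C": 0.9, "T": 0.1}, 205: {"A": 0.6, "G": 0.4}}

masked, secret = mask_alleles(personal, pop_freqs, random.Random(0))
assert unmask(masked, secret) == personal         # fully reversible
assert unmask(masked, secret, {101})[101] == "T"  # region-wise unmasking
```

Passing a set of positions to `unmask` mirrors the abstract's point that personal alleles can be restored on demand for specific genomic regions only.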
Affiliation(s)
- Rastislav Hekel: Geneton s.r.o, Bratislava, Slovakia; Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia; Slovak Centre of Scientific and Technical Information, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia
- Jaroslav Budis: Geneton s.r.o, Bratislava, Slovakia; Slovak Centre of Scientific and Technical Information, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia
- Marcel Kucharik: Geneton s.r.o, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia
- Jan Radvanszky: Geneton s.r.o, Bratislava, Slovakia; Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia; Biomedical Research Centre, Institute of Clinical and Translational Research, Slovak Academy of Sciences, Bratislava, Slovakia
- Zuzana Pös: Geneton s.r.o, Bratislava, Slovakia; Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia; Biomedical Research Centre, Institute of Clinical and Translational Research, Slovak Academy of Sciences, Bratislava, Slovakia
- Tomas Szemes: Geneton s.r.o, Bratislava, Slovakia; Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia
38
Keane TM, O'Donovan C, Vizcaíno JA. The growing need for controlled data access models in clinical proteomics and metabolomics. Nat Commun 2021; 12:5787. [PMID: 34599180 PMCID: PMC8486822 DOI: 10.1038/s41467-021-26110-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/30/2021] [Accepted: 09/17/2021] [Indexed: 01/25/2023] Open
Abstract
More and more clinical studies include potentially sensitive human proteomics or metabolomics datasets, but bioinformatics resources for managing the access to these data are not yet available. This commentary discusses current best practices and future perspectives for the responsible handling of clinical proteomics and metabolomics data.
Affiliation(s)
- Thomas M Keane: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Claire O'Donovan: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Juan Antonio Vizcaíno: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
39
Daniels H, Jones KH, Heys S, Ford DV. Exploring the Use of Genomic and Routinely Collected Data: Narrative Literature Review and Interview Study. J Med Internet Res 2021; 23:e15739. [PMID: 34559060 PMCID: PMC8501405 DOI: 10.2196/15739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2019] [Revised: 10/01/2020] [Accepted: 07/15/2021] [Indexed: 11/13/2022] Open
Abstract
Background
Advancing the use of genomic data with routinely collected health data holds great promise for health care and research. Increasing the use of these data is a high priority to understand and address the causes of disease.
Objective
This study aims to provide an outline of the use of genomic data alongside routinely collected data in health research to date. As this field prepares to move forward, it is important to take stock of the current state of play in order to highlight new avenues for development, identify challenges, and ensure that adequate data governance models are in place for safe and socially acceptable progress.
Methods
We conducted a literature review to draw information from past studies that have used genomic and routinely collected data and conducted interviews with individuals who use these data for health research. We collected data on the following: the rationale of using genomic data in conjunction with routinely collected data, types of genomic and routinely collected data used, data sources, project approvals, governance and access models, and challenges encountered.
Results
The main purpose of using genomic and routinely collected data was to conduct genome-wide and phenome-wide association studies. Routine data sources included electronic health records, disease and death registries, health insurance systems, and deprivation indices. The types of genomic data included polygenic risk scores, single nucleotide polymorphisms, and measures of genetic activity, and biobanks generally provided these data. Although the literature search showed that biobanks released data to researchers, the case studies revealed a growing tendency for use within a data safe haven. Challenges of working with these data revolved around data collection, data storage, technical, and data privacy issues.
Conclusions
Using genomic and routinely collected data holds great promise for progressing health research. Several challenges are involved, particularly in terms of privacy. Overcoming these barriers will ensure that the use of these data to progress health research can be exploited to its full potential.
Affiliation(s)
- Helen Daniels
  - Population Data Science, Swansea University, Swansea, United Kingdom
- Sharon Heys
  - Population Data Science, Swansea University, Swansea, United Kingdom
40
Liu YL, Stadler ZK. The Future of Parallel Tumor and Germline Genetic Testing: Is There a Role for All Patients With Cancer? J Natl Compr Canc Netw 2021; 19:871-878. [PMID: 34340209 PMCID: PMC11123333 DOI: 10.6004/jnccn.2021.7044] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/13/2020] [Accepted: 04/09/2021] [Indexed: 11/17/2022]
Abstract
Under the traditional paradigm of genetic testing in cancer, the role of germline testing was to assess for the inherited risk of cancer, whereas the role of tumor testing was to determine therapeutic selection. Parallel tumor-normal genetic testing uses simultaneous genetic testing of the tumor and normal tissue to identify mutations and allows their classification as either germline or somatic. The increasing adoption of parallel testing has revealed a greater number of germline findings in patients who otherwise would not have met clinical criteria for testing. This result has widespread implications for the screening and further testing of at-risk relatives and for gene discovery. It has also revealed the importance of germline testing in therapeutic actionability. Herein, we describe the pros and cons of tumor-only versus parallel tumor-normal testing and summarize the data on the prevalence of incidental actionable germline findings. Because germline testing in patients with cancer continues to expand, it is imperative that systems be in place for the proper interpretation, dissemination, and counseling for patients and at-risk relatives. We also review new therapeutic approvals with germline indications and highlight the increasing importance of germline testing in selecting therapies. Because recommendations for universal genetic testing are increasing in multiple cancer types and the number of approved therapies with germline indications is also increasing, a gradual transition toward parallel tumor-normal genetic testing in all patients with cancer is foreseeable.
Affiliation(s)
- Ying L. Liu
  - Clinical Genetics Service, Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, New York
- Zsofia K. Stadler
  - Clinical Genetics Service, Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, New York
41
Oestreich M, Chen D, Schultze JL, Fritz M, Becker M. Privacy considerations for sharing genomics data. EXCLI Journal 2021; 20:1243-1260. [PMID: 34345236 PMCID: PMC8326502 DOI: 10.17179/excli2021-4002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/19/2021] [Accepted: 07/07/2021] [Indexed: 01/23/2023]
Abstract
An increasing amount of attention has been geared towards understanding the privacy risks that arise from sharing genomic data of human origin. Most of these efforts have focused on issues in the context of genomic sequence data, but the popularity of techniques for collecting other types of genome-related data has prompted researchers to investigate privacy concerns in a broader genomic context. In this review, we give an overview of different types of genome-associated data, their individual ways of revealing sensitive information, the motivation to share them as well as established and upcoming methods to minimize information leakage. We further discuss the concise threats that are being posed, who is at risk, and how the risk level compares to potential benefits, all while addressing the topic in the context of modern technology, methodology, and information sharing culture. Additionally, we will discuss the current legal situation regarding the sharing of genomic data in a selection of countries, evaluating the scope of their applicability as well as their limitations. We will finalize this review by evaluating the development that is required in the scientific field in the near future in order to improve and develop privacy-preserving data sharing techniques for the genomic context.
Affiliation(s)
- Marie Oestreich
  - Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE), Venusberg-Campus 1/99, 53127 Bonn, Germany
- Dingfan Chen
  - CISPA Helmholtz Center for Information Security, Stuhlsatzenhaus 5, 66123 Saarbrücken, Germany
- Joachim L. Schultze
  - Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE), Venusberg-Campus 1/99, 53127 Bonn, Germany
  - Genomics and Immunoregulation, Life & Medical Sciences (LIMES) Institute, University of Bonn, Carl-Troll-Straße 31, 53115 Bonn, Germany
  - PRECISE Platform for Single Cell Genomics and Epigenomics at Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE) and the University of Bonn, Venusberg-Campus 1/99, 53127 Bonn, Germany
- Mario Fritz
  - CISPA Helmholtz Center for Information Security, Stuhlsatzenhaus 5, 66123 Saarbrücken, Germany
- Matthias Becker
  - Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE), Venusberg-Campus 1/99, 53127 Bonn, Germany
42
Bu D, Wang X, Tang H. Haplotype-based membership inference from summary genomic data. Bioinformatics 2021; 37:i161-i168. [PMID: 34252973 PMCID: PMC8275351 DOI: 10.1093/bioinformatics/btab305] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/22/2022]
Abstract
Motivation: The availability of human genomic data, together with the enhanced capacity to process them, is leading to transformative technological advances in biomedical science and engineering. However, the public dissemination of such data has been difficult due to privacy concerns. Specifically, it has been shown that the presence of a human subject in a case group can be inferred from the shared summary statistics of the group, e.g. the allele frequencies, or even the presence/absence of genetic variants (e.g. shared by the Beacon project) in the group. These methods rely on the availability of the target’s genome, i.e. the DNA profile of a target human subject, and thus are often referred to as the membership inference method. Results: In this article, we demonstrate the haplotypes, i.e. the sequence of single nucleotide variations (SNVs) showing strong genetic linkages in human genome databases, may be inferred from the summary of genomic data without using a target’s genome. Furthermore, novel haplotypes that did not appear in the database may be reconstructed solely from the allele frequencies from genomic datasets. These reconstructed haplotypes can be used for a haplotype-based membership inference algorithm to identify target subjects in a case group with greater power than existing methods based on SNVs. Availability and implementation: The implementation of the membership inference algorithms is available at https://github.com/diybu/Haplotype-based-membership-inferences.
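The SNV-based membership inference that this abstract treats as the existing baseline can be illustrated with a Homer-style distance statistic. The sketch below is not the haplotype-based algorithm of the paper (that lives in the linked repository); the function name, the availability of reference-population frequencies, and the toy numbers are all assumptions made for illustration.

```python
def membership_statistic(dosages, pool_freqs, ref_freqs):
    """Mean of |x - ref| - |x - pool| over SNVs (Homer-style statistic).

    dosages    : target's alternate-allele counts per SNV (0, 1 or 2)
    pool_freqs : allele frequencies released for the case group
    ref_freqs  : allele frequencies of a reference population (assumed available)
    Clearly positive values suggest the target's genome is in the pool.
    """
    if not (len(dosages) == len(pool_freqs) == len(ref_freqs)) or not dosages:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for g, pool, ref in zip(dosages, pool_freqs, ref_freqs):
        x = g / 2.0  # convert dosage to the individual's allele frequency
        total += abs(x - ref) - abs(x - pool)
    return total / len(dosages)


# Toy data (hypothetical): the pool frequencies are pulled towards `member`.
pool = [0.9, 0.1, 0.5, 0.8]
ref = [0.5, 0.5, 0.5, 0.5]
member = [2, 0, 1, 2]
outsider = [0, 2, 1, 0]

member_score = membership_statistic(member, pool, ref)
outsider_score = membership_statistic(outsider, pool, ref)
print(member_score, outsider_score)
```

In practice the statistic would be computed over many thousands of SNVs and compared against a null distribution; the four-SNV example only shows the direction of the effect.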
Affiliation(s)
- Diyue Bu
  - Department of Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN 47408, USA
- Xiaofeng Wang
  - Department of Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN 47408, USA
- Haixu Tang
  - Department of Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN 47408, USA
43
Lu D, Zhang Y, Zhang L, Wang H, Weng W, Li L, Cai H. Methods of privacy-preserving genomic sequencing data alignments. Brief Bioinform 2021; 22:6279828. [PMID: 34021302 DOI: 10.1093/bib/bbab151] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/05/2020] [Revised: 03/10/2021] [Accepted: 03/30/2021] [Indexed: 11/14/2022]
Abstract
Genomic data alignment, a fundamental operation in sequencing, can be utilized to map reads into a reference sequence, query on a genomic database and perform genetic tests. However, with the reduction of sequencing cost and the accumulation of genome data, privacy-preserving genomic sequencing data alignment is becoming unprecedentedly important. In this paper, we present a comprehensive review of secure genomic data comparison schemes. We discuss the privacy threats, including adversaries and privacy attacks. The attacks can be categorized into inference, membership, identity tracing and completion attacks and have been applied to obtaining the genomic privacy information. We classify the state-of-the-art genomic privacy-preserving alignment methods into three different scenarios: large-scale reads mapping, encrypted genomic datasets querying and genetic testing to ease privacy threats. A comprehensive analysis of these approaches has been carried out to evaluate the computation and communication complexity as well as the privacy requirements. The survey provides the researchers with the current trends and the insights on the significance and challenges of privacy issues in genomic data alignment.
Affiliation(s)
- Dandan Lu
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
- Yue Zhang
  - School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, 510006, China
- Ling Zhang
  - Department of Radiology, Sun Yat-sen University Cancer Center; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, 651 Dongfeng East Road, Guangzhou, 510060, P. R. China
- Haiyan Wang
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
- Wanlin Weng
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
- Li Li
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
- Hongmin Cai
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
44
Arshad S, Arshad J, Khan MM, Parkinson S. Analysis of security and privacy challenges for DNA-genomics applications and databases. J Biomed Inform 2021; 119:103815. [PMID: 34022422 DOI: 10.1016/j.jbi.2021.103815] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/28/2021] [Revised: 05/07/2021] [Accepted: 05/08/2021] [Indexed: 02/06/2023]
Abstract
DNA technology is rapidly moving towards digitization. Scientists use software tools and applications for sequencing, synthesizing, analyzing and sharing of DNA and genomic data, operate lab equipment and store genetic information in shared datastores. Using cutting-edge computing methods and techniques, researchers have decoded human genome, created organisms with new capabilities, automated drug development and transformed food safety. Such software applications are typically developed to progress scientific understanding and as such cyber security is never a concern for these applications. However, with the increasing commercialisation of DNA technologies, coupled with the sensitivity of DNA data, there is a need to adopt a security-by-design approach. In this paper we investigate bio-cyber security threats to genomic-DNA data and software applications making use of such data to advance scientific research. Specifically, we adopt an empirical approach to analyse and identify vulnerabilities within genomic-DNA databases and bioinformatics software applications that can lead to cyber-attacks affecting the confidentiality, integrity and availability of such sensitive data. We present a detailed analysis of these threats and highlight potential protection mechanisms to help researchers pursue these research directions.
Affiliation(s)
- Saadia Arshad
  - Department of Computer Science & IT, NED University of Engineering and Technology, Karachi, Pakistan
- Junaid Arshad
  - School of Computing and Digital Technology, Birmingham City University, Birmingham, UK
- Muhammad Mubashir Khan
  - Department of Computer Science & IT, NED University of Engineering and Technology, Karachi, Pakistan
- Simon Parkinson
  - Department of Computer Science, University of Huddersfield, Huddersfield, UK
45
Gürsoy G, Emani P, Brannon CM, Jolanki OA, Harmanci A, Strattan JS, Cherry JM, Miranker AD, Gerstein M. Data Sanitization to Reduce Private Information Leakage from Functional Genomics. Cell 2020; 183:905-917.e16. [PMID: 33186529 DOI: 10.1016/j.cell.2020.09.036] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 12/26/2019] [Revised: 07/23/2020] [Accepted: 09/11/2020] [Indexed: 12/30/2022]
Abstract
The generation of functional genomics datasets is surging, because they provide insight into gene regulation and organismal phenotypes (e.g., genes upregulated in cancer). The intent behind functional genomics experiments is not necessarily to study genetic variants, yet they pose privacy concerns due to their use of next-generation sequencing. Moreover, there is a great incentive to broadly share raw reads for better statistical power and general research reproducibility. Thus, we need new modes of sharing beyond traditional controlled-access models. Here, we develop a data-sanitization procedure allowing raw functional genomics reads to be shared while minimizing privacy leakage, enabling principled privacy-utility trade-offs. Our protocol works with traditional Illumina-based assays and newer technologies such as 10x single-cell RNA sequencing. It involves quantifying the privacy leakage in reads by statistically linking study participants to known individuals. We carried out these linkages using data from highly accurate reference genomes and more realistic environmental samples.
Affiliation(s)
- Gamze Gürsoy
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Prashant Emani
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Charlotte M Brannon
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Otto A Jolanki
  - Stanford University School of Medicine, Department of Genetics, Stanford, CA 94305, USA
- Arif Harmanci
  - School of Biomedical Informatics, Center for Precision Health, University of Texas Health Sciences Center, Houston, TX 77030, USA
- J Seth Strattan
  - Stanford University School of Medicine, Department of Genetics, Stanford, CA 94305, USA
- J Michael Cherry
  - Stanford University School of Medicine, Department of Genetics, Stanford, CA 94305, USA
- Andrew D Miranker
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA; Department of Chemical and Environmental Engineering, Yale University, New Haven, CT 06520, USA
- Mark Gerstein
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA; Department of Computer Science, Yale University, New Haven, CT 06520, USA; Department of Statistics and Data Science, Yale University, New Haven, CT 06520, USA
46
Ayoz K, Ayday E, Cicek AE. Genome Reconstruction Attacks Against Genomic Data-Sharing Beacons. Proceedings on Privacy Enhancing Technologies. Privacy Enhancing Technologies Symposium 2021; 2021:28-48. [PMID: 34746296 PMCID: PMC8570374 DOI: 10.2478/popets-2021-0036] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/21/2022]
Abstract
Sharing genome data in a privacy-preserving way stands as a major bottleneck in front of the scientific progress promised by the big data era in genomics. A community-driven protocol named genomic data-sharing beacon protocol has been widely adopted for sharing genomic data. The system aims to provide a secure, easy to implement, and standardized interface for data sharing by only allowing yes/no queries on the presence of specific alleles in the dataset. However, beacon protocol was recently shown to be vulnerable against membership inference attacks. In this paper, we show that privacy threats against genomic data sharing beacons are not limited to membership inference. We identify and analyze a novel vulnerability of genomic data-sharing beacons: genome reconstruction. We show that it is possible to successfully reconstruct a substantial part of the genome of a victim when the attacker knows the victim has been added to the beacon in a recent update. In particular, we show how an attacker can use the inherent correlations in the genome and clustering techniques to run such an attack in an efficient and accurate way. We also show that even if multiple individuals are added to the beacon during the same update, it is possible to identify the victim's genome with high confidence using traits that are easily accessible by the attacker (e.g., eye color or hair type). Moreover, we show how a reconstructed genome using a beacon that is not associated with a sensitive phenotype can be used for membership inference attacks to beacons with sensitive phenotypes (e.g., HIV+). The outcome of this work will guide beacon operators on when and how to update the content of the beacon and help them (along with the beacon participants) make informed decisions.
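The yes/no interface and the update-based leak described in this abstract can be sketched with a toy model. This is not the Beacon API and not the authors' attack: the `Beacon` class, the `newly_present` helper, and the variant tuples are hypothetical. The sketch also recovers only alleles that no earlier participant carried, whereas the paper additionally exploits correlations in the genome and clustering.

```python
class Beacon:
    """Toy beacon: answers only whether an allele is present in the dataset."""

    def __init__(self, genomes):
        # Each genome is a set of variant keys such as ("1", 12345, "A").
        self._alleles = set()
        for genome in genomes:
            self._alleles |= set(genome)

    def query(self, variant):
        return variant in self._alleles


def newly_present(before, after, candidates):
    """Variants answered 'no' before an update and 'yes' after it.

    If the attacker knows that exactly one person was added in the update,
    every variant returned here must belong to that person.
    """
    return {v for v in candidates if not before.query(v) and after.query(v)}


# Hypothetical variants and genomes.
a, b, c, d = ("1", 100, "A"), ("1", 200, "G"), ("1", 300, "T"), ("2", 50, "C")
before = Beacon([{a, b}])
after = Beacon([{a, b}, {b, c}])  # one newcomer carrying b and c

leaked = newly_present(before, after, [a, b, c, d])
print(leaked)
```

Only `c` is exposed directly, because `b` was already present before the update; this is the gap that the correlation-based reconstruction in the paper is meant to close.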
47
Cho JC. Human microbiome privacy risks associated with summary statistics. PLoS One 2021; 16:e0249528. [PMID: 33798253 PMCID: PMC8018636 DOI: 10.1371/journal.pone.0249528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2020] [Accepted: 03/21/2021] [Indexed: 11/25/2022]
Abstract
Recognizing that microbial community composition within the human microbiome is associated with the physiological state of the host has sparked a large number of human microbiome association studies (HMAS). With the increasing size of publicly available HMAS data, the privacy risk is also increasing because HMAS metadata could contain sensitive private information. I demonstrate that a simple test statistic based on the taxonomic profiles of an individual's microbiome along with summary statistics of HMAS data can reveal the membership of the individual's microbiome in an HMAS sample. In particular, species-level taxonomic data obtained from small-scale HMAS can be highly vulnerable to privacy risk. Minimal guidelines for HMAS data privacy are suggested, and an assessment of HMAS privacy risk using the simulation method proposed is recommended at the time of study design.
Affiliation(s)
- Jae-Chang Cho
  - Institute of Environmental Science and Department of Environmental Science, Hankuk University of Foreign Studies, Yong-In, Korea
48
Yang H, Chen L, Cheng Z, Yang M, Wang J, Lin C, Wang Y, Huang L, Chen Y, Peng S, Ke Z, Li W. Deep learning-based six-type classifier for lung cancer and mimics from histopathological whole slide images: a retrospective study. BMC Med 2021; 19:80. [PMID: 33775248 PMCID: PMC8006383 DOI: 10.1186/s12916-021-01953-2] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 12/26/2020] [Accepted: 02/26/2021] [Indexed: 12/19/2022]
Abstract
Background: Targeted therapy and immunotherapy put forward higher demands for accurate lung cancer classification, as well as benign versus malignant disease discrimination. Digital whole slide images (WSIs) witnessed the transition from traditional histopathology to computational approaches, arousing a hype of deep learning methods for histopathological analysis. We aimed at exploring the potential of deep learning models in the identification of lung cancer subtypes and cancer mimics from WSIs. Methods: We initially obtained 741 WSIs from the First Affiliated Hospital of Sun Yat-sen University (SYSUFH) for the deep learning model development, optimization, and verification. Additional 318 WSIs from SYSUFH, 212 from Shenzhen People's Hospital, and 422 from The Cancer Genome Atlas were further collected for multi-centre verification. EfficientNet-B5- and ResNet-50-based deep learning methods were developed and compared using the metrics of recall, precision, F1-score, and areas under the curve (AUCs). A threshold-based tumour-first aggregation approach was proposed and implemented for the label inferencing of WSIs with complex tissue components. Four pathologists of different levels from SYSUFH reviewed all the testing slides blindly, and the diagnosing results were used for quantitative comparisons with the best performing deep learning model. Results: We developed the first deep learning-based six-type classifier for histopathological WSI classification of lung adenocarcinoma, lung squamous cell carcinoma, small cell lung carcinoma, pulmonary tuberculosis, organizing pneumonia, and normal lung. The EfficientNet-B5-based model outperformed ResNet-50 and was selected as the backbone in the classifier. Tested on 1067 slides from four cohorts of different medical centres, AUCs of 0.970, 0.918, 0.963, and 0.978 were achieved, respectively. The classifier achieved high consistence to the ground truth and attending pathologists with high intraclass correlation coefficients over 0.873. Conclusions: Multi-cohort testing demonstrated our six-type classifier achieved consistent and comparable performance to experienced pathologists and gained advantages over other existing computational methods. The visualization of prediction heatmap improved the model interpretability intuitively. The classifier with the threshold-based tumour-first label inferencing method exhibited excellent accuracy and feasibility in classifying lung cancers and confused nonneoplastic tissues, indicating that deep learning can resolve complex multi-class tissue classification that conforms to real-world histopathological scenarios.
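The abstract does not spell out the threshold-based tumour-first aggregation rule, so the following is only one plausible reading, offered as a sketch. The label strings, the `slide_label` function, and the 10% default threshold are all assumptions, not values taken from the paper.

```python
from collections import Counter

# Hypothetical label names for the three tumour classes listed in the abstract.
TUMOUR_CLASSES = {"adenocarcinoma", "squamous_cell_carcinoma", "small_cell_carcinoma"}


def slide_label(tile_labels, tumour_threshold=0.1):
    """Aggregate tile-level predictions into one slide-level label.

    Tumour-first: if the most frequent tumour class covers at least
    `tumour_threshold` of the tiles, it wins even when benign tiles dominate.
    Otherwise the most frequent class overall is returned.
    """
    counts = Counter(tile_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tiles to aggregate")
    tumour = [(n, label) for label, n in counts.items() if label in TUMOUR_CLASSES]
    if tumour:
        n, label = max(tumour)  # most frequent tumour class
        if n / total >= tumour_threshold:
            return label
    return counts.most_common(1)[0][0]


mostly_normal_with_tumour = ["normal_lung"] * 80 + ["adenocarcinoma"] * 20
mostly_normal_with_noise = ["normal_lung"] * 98 + ["adenocarcinoma"] * 2
print(slide_label(mostly_normal_with_tumour))
print(slide_label(mostly_normal_with_noise))
```

The design intent of such a rule is that a slide containing a modest but real tumour region is not outvoted by surrounding normal tissue, while a handful of stray tumour predictions does not flip a benign slide.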
Affiliation(s)
- Huan Yang
  - Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, 510080, China
- Lili Chen
  - Department of Pathology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080, China
- Zhiqiang Cheng
  - Department of Pathology, Shenzhen People's Hospital, Shenzhen, 518020, China
- Minglei Yang
  - Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, 510080, China
- Jianbo Wang
  - Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, 510080, China
- Chenghao Lin
  - Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, 510080, China
- Yuefeng Wang
  - Department of Pathology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080, China
- Leilei Huang
  - Department of Pathology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080, China
- Yangshan Chen
  - Department of Pathology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080, China
- Sui Peng
  - Center for Precision Medicine, Sun Yat-sen University, Guangzhou, 510080, China
  - Molecular Diagnosis Center or Institute of Precision Medicine, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080, China
- Zunfu Ke
  - Department of Pathology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080, China
  - Center for Precision Medicine, Sun Yat-sen University, Guangzhou, 510080, China
  - Molecular Diagnosis Center or Institute of Precision Medicine, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080, China
- Weizhong Li
  - Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, 510080, China
  - Center for Precision Medicine, Sun Yat-sen University, Guangzhou, 510080, China
  - Key Laboratory of Tropical Disease Control (Ministry of Education), Sun Yat-sen University, Guangzhou, 510080, China
49
Lemieux VL, Hofman D, Hamouda H, Batista D, Kaur R, Pan W, Costanzo I, Regier D, Pollard S, Weymann D, Fraser R. Having Our “Omic” Cake and Eating It Too?: Evaluating User Response to Using Blockchain Technology for Private and Secure Health Data Management and Sharing. Frontiers in Blockchain 2021. [DOI: 10.3389/fbloc.2020.558705] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/13/2022]
Abstract
This paper reports on end users' perspectives on the use of a blockchain solution for private and secure individual “omics” health data management and sharing. This solution is one output of a multidisciplinary project investigating the social, data, and technical issues surrounding application of blockchain technology in the context of personalized healthcare research. The project studies potential ethical, legal, social, and cognitive constraints of self-sovereign healthcare data management and sharing, and whether such constraints can be addressed through careful design of a blockchain solution.
50
PLEIO: a method to map and interpret pleiotropic loci with GWAS summary statistics. Am J Hum Genet 2021; 108:36-48. [PMID: 33352115 DOI: 10.1016/j.ajhg.2020.11.017] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 07/22/2020] [Accepted: 11/23/2020] [Indexed: 12/31/2022]
Abstract
Identifying and interpreting pleiotropic loci is essential to understanding the shared etiology among diseases and complex traits. A common approach to mapping pleiotropic loci is to meta-analyze GWAS summary statistics across multiple traits. However, this strategy does not account for the complex genetic architectures of traits, such as genetic correlations and heritabilities. Furthermore, the interpretation is challenging because phenotypes often have different characteristics and units. We propose PLEIO (Pleiotropic Locus Exploration and Interpretation using Optimal test), a summary-statistic-based framework to map and interpret pleiotropic loci in a joint analysis of multiple diseases and complex traits. Our method maximizes power by systematically accounting for genetic correlations and heritabilities of the traits in the association test. Any set of related phenotypes, binary or quantitative traits with different units, can be combined seamlessly. In addition, our framework offers interpretation and visualization tools to help downstream analyses. Using our method, we combined 18 traits related to cardiovascular disease and identified 13 pleiotropic loci, which showed four different patterns of associations.
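The "common approach" that this abstract contrasts PLEIO with, meta-analyzing GWAS summary statistics, is typically a fixed-effect inverse-variance-weighted combination. The sketch below shows that baseline only, not PLEIO itself; the function name and the example numbers are assumptions. It treats the per-trait estimates as independent measurements of one shared effect on a common scale, which is exactly the simplification the abstract criticises (no genetic correlations, heritabilities, or differing units).

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis for one variant.

    betas : per-study (or per-trait) effect estimates
    ses   : their standard errors
    Returns (pooled beta, pooled standard error, z-score, two-sided p-value).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    weights = [1.0 / se ** 2 for se in ses]  # weight = 1 / variance
    wsum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / wsum
    se = math.sqrt(1.0 / wsum)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return beta, se, z, p


# Hypothetical summary statistics for one variant in two studies.
beta, se, z, p = fixed_effect_meta([0.1, 0.3], [0.1, 0.1])
print(beta, se, z, p)
```

With equal standard errors the pooled estimate is simply the average of the two effects, and the pooled standard error shrinks by a factor of the square root of two.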