1
Brauneck A, Schmalhorst L, Weiss S, Baumbach L, Völker U, Ellinghaus D, Baumbach J, Buchholtz G. Legal aspects of privacy-enhancing technologies in genome-wide association studies and their impact on performance and feasibility. Genome Biol 2024; 25:154. [PMID: 38872191 PMCID: PMC11170858 DOI: 10.1186/s13059-024-03296-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Accepted: 06/03/2024] [Indexed: 06/15/2024]
Abstract
Genomic data holds huge potential for medical progress but requires strict safety measures due to its sensitive nature to comply with data protection laws. This conflict is especially pronounced in genome-wide association studies (GWAS) which rely on vast amounts of genomic data to improve medical diagnoses. To ensure both their benefits and sufficient data security, we propose a federated approach in combination with privacy-enhancing technologies utilising the findings from a systematic review on federated learning and legal regulations in general and applying these to GWAS.
Affiliation(s)
- Alissa Brauneck
  - Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany
- Louisa Schmalhorst
  - Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany
- Stefan Weiss
  - Interfaculty Institute of Genetics and Functional Genomics, Department of Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- Linda Baumbach
  - Department of Health Economics and Health Services Research, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Uwe Völker
  - Interfaculty Institute of Genetics and Functional Genomics, Department of Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- David Ellinghaus
  - Institute of Clinical Molecular Biology (IKMB), Kiel University and University Medical Center Schleswig-Holstein, Kiel, Germany
- Jan Baumbach
  - Institute for Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Gabriele Buchholtz
  - Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany
2
Cavinato T, Rubinacci S, Malaspinas AS, Delaneau O. A resampling-based approach to share reference panels. NATURE COMPUTATIONAL SCIENCE 2024; 4:360-366. [PMID: 38745108 PMCID: PMC11136649 DOI: 10.1038/s43588-024-00630-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Accepted: 04/16/2024] [Indexed: 05/16/2024]
Abstract
For many genome-wide association studies, imputing genotypes from a haplotype reference panel is a necessary step. Over the past 15 years, reference panels have become larger and more diverse, leading to improvements in imputation accuracy. However, the latest generation of reference panels is subject to restrictions on data sharing due to concerns about privacy, limiting their usefulness for genotype imputation. In this context, here we propose RESHAPE, a method that employs a recombination Poisson process on a reference panel to simulate the genomes of hypothetical descendants after multiple generations. This data transformation helps to protect against re-identification threats and preserves data attributes, such as linkage disequilibrium patterns and, to some degree, identity-by-descent sharing, allowing for genotype imputation. Our experiments on gold-standard datasets show that simulated descendants up to eight generations can serve as reference panels without substantially reducing genotype imputation accuracy.
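Illustrative note (not part of the abstract): the idea of placing recombination breakpoints with a Poisson process can be sketched in a few lines of Python. This is a simplified toy and not the RESHAPE implementation. The function name `shuffle_haplotypes`, the uniform rate of `generations / 100` breakpoints per centimorgan, and the choice to swap two randomly chosen haplotypes at each breakpoint are all assumptions made for illustration.

```python
import random

def shuffle_haplotypes(haplotypes, positions_cm, generations, seed=0):
    """Return mosaic haplotypes built by swapping sources at Poisson-distributed breakpoints.

    haplotypes   -- list of equal-length allele lists (at least two haplotypes)
    positions_cm -- ascending genetic positions of the sites, in centimorgans
    generations  -- number of simulated generations (must be > 0)
    """
    rng = random.Random(seed)
    n_hap = len(haplotypes)
    # Assumed rate: 1 Morgan (100 cM) carries one expected crossover per generation.
    rate_per_cm = generations / 100.0
    source = list(range(n_hap))  # which input haplotype each output row currently copies
    mosaics = [[] for _ in range(n_hap)]
    # Gaps between events of a Poisson process are exponentially distributed.
    next_break = positions_cm[0] + rng.expovariate(rate_per_cm)
    for site, pos in enumerate(positions_cm):
        # Apply every breakpoint that falls at or before this site.
        while next_break <= pos:
            a, b = rng.sample(range(n_hap), 2)
            source[a], source[b] = source[b], source[a]
            next_break += rng.expovariate(rate_per_cm)
        for row in range(n_hap):
            mosaics[row].append(haplotypes[source[row]][site])
    return mosaics

# Hypothetical toy panel: three haplotypes, six biallelic sites.
panel = [[0, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 0], [1, 1, 1, 0, 0, 0]]
positions = [0.0, 20.0, 45.0, 70.0, 110.0, 150.0]
shuffled = shuffle_haplotypes(panel, positions, generations=8, seed=1)
```

In this toy version each site is only permuted across rows, so per-site allele frequencies are unchanged and alleles between two consecutive breakpoints stay on the same output haplotype, while no output row has to match a complete input haplotype.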
Affiliation(s)
- Théo Cavinato
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
- Simone Rubinacci
  - Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Anna-Sapfo Malaspinas
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
3
Koh AS, Bos HMW, Rothblum ED, Carone N, Gartrell NK. Donor sibling relations among adult offspring conceived via insemination by lesbian parents. Hum Reprod 2023; 38:2166-2174. [PMID: 37697711 DOI: 10.1093/humrep/dead175] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2023] [Revised: 08/13/2023] [Indexed: 09/13/2023]
Abstract
STUDY QUESTION How do adult offspring in planned lesbian-parent families feel about and relate to their donor (half) sibling(s) (DS)? SUMMARY ANSWER A majority of offspring had found DS and maintained good ongoing relationships, and all offspring (regardless of whether a DS had been identified) were satisfied with their knowledge of and contact level with the DS. WHAT IS KNOWN ALREADY The first generation of donor insemination offspring of intended lesbian-parent families is now in their 30s. Coincident with this is an increased use of DNA testing and genetic ancestry websites, facilitating the discovery of donor siblings from a common sperm donor. Few studies of offspring and their DS include sexual minority parent (SMP) families, and only sparse data separately analyze the offspring of SMP families or extend the analyses to established adult offspring. STUDY DESIGN, SIZE, DURATION This cohort study included 75 adult offspring, longitudinally followed since conception in lesbian-parent families. Quantitative analyses were performed from online surveys of the offspring in the seventh wave of the 36-year study, with a 90% family retention rate. The data were collected from March 2021 to November 2022. PARTICIPANTS/MATERIALS, SETTING, METHODS Participants were 30- to 33-year-old donor insemination offspring whose lesbian parents enrolled in a US prospective longitudinal study when these offspring were conceived. Offspring who knew of a DS were asked about their numbers found, characteristics or motivations for meeting, DS terminology, relationship quality and maintenance, and impact of the DS contact on others. All offspring (with or without known DS) were asked about the importance of knowing if they have DS and their terminology, satisfaction with information about DS, and feelings about future contact. 
MAIN RESULTS AND THE ROLE OF CHANCE Of offspring, 53% (n = 40) had found DS in modest numbers, via a DS or sperm bank registry in 45% of cases, and most of these offspring had made contact. The offspring had their meeting motivations fulfilled, viewed the DS as acquaintances more often than siblings or friends, and maintained good relationships via meetings, social media, and cell phone communication. They disclosed their DS meetings to most relatives with neutral impact. The offspring, whether with known or unknown DS, felt neutral about the importance of knowing if they had DS, were satisfied with what they knew (or did not know) of the DS, and were satisfied with their current level of DS contact. This study is the largest, longest-running longitudinal study of intended lesbian-parent families and their offspring, and due to its prospective nature, is not biased by over-sampling offspring who were already satisfied with their DS. LIMITATIONS, REASONS FOR CAUTION The sample was from the USA, and mostly White, highly educated individuals, not representative of the diversity of donor insemination offspring of lesbian-parent families. WIDER IMPLICATIONS OF THE FINDINGS While about half of the offspring found out about DS, the other half did not. Regardless of knowing of a DS, these adult offspring of lesbian parents were satisfied with their level of DS contact. Early disclosure and identity formation about being donor-conceived in a lesbian-parent family may distinguish these study participants from donor insemination offspring and adoptees in the general population, who may be more compelled to seek genetic relatives. The study participants who sought DS mostly found a modest number of them, in contrast to reports in studies that have found large numbers of DS. 
This may be because one-third of study offspring had donors known to the families since conception, who may have been less likely to participate in commercial sperm banking or internet donation sites, where quotas are difficult to enforce or nonexistent. The study results have implications for anyone considering gamete donation, gamete donors, donor-conceived offspring, and/or gamete banks, as well as the medical and public policy professionals who advise them. STUDY FUNDING/COMPETING INTEREST(S) No funding was provided for this project. The authors have no competing interests. TRIAL REGISTRATION NUMBER N/A.
Affiliation(s)
- Audrey S Koh
  - Department of Obstetrics, Gynecology and Reproductive Sciences, School of Medicine, University of California, San Francisco, San Francisco, CA, USA
- Henny M W Bos
  - Research Institute of Child Development and Education, University of Amsterdam, Amsterdam, The Netherlands
- Esther D Rothblum
  - Department of Women's Studies, San Diego State University, San Diego, CA, USA
  - Williams Institute, UCLA School of Law, Los Angeles, CA, USA
- Nicola Carone
  - Department of Brain and Behavioral Sciences, University of Pavia, Pavia, Italy
- Nanette K Gartrell
  - Research Institute of Child Development and Education, University of Amsterdam, Amsterdam, The Netherlands
  - Williams Institute, UCLA School of Law, Los Angeles, CA, USA
4
Ayday E, Vaidya J, Jiang X, Telenti A. Ensuring Trust in Genomics Research. IEEE INTERNATIONAL CONFERENCE ON TRUST, PRIVACY AND SECURITY IN INTELLIGENT SYSTEMS AND APPLICATIONS (TPS-ISA) 2023; 2023:1-12. [PMID: 38562180 PMCID: PMC10981793 DOI: 10.1109/tps-isa58951.2023.00011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Reproducibility, transparency, representation, and privacy underpin the trust on genomics research in general and genome-wide association studies (GWAS) in particular. Concerns about these issues can be mitigated by technologies that address privacy protection, quality control, and verifiability of GWAS. However, many of the existing technological solutions have been developed in isolation and may address one aspect of reproducibility, transparency, representation, and privacy of GWAS while unknowingly impacting other aspects. As a consequence, the current patchwork of technological tools only partially and in an overlapping manner address issues with GWAS, sometimes even creating more problems. This paper addresses the progress in a field that creates technological solutions that augment the acceptance and security of population genetic analyses. The text identifies areas that are falling behind in technical implementation or where there is insufficient research. We make the case that a full understanding of the different GWAS settings, technological tools and new research directions can holistically address the requirements for the acceptance of GWAS.
Affiliation(s)
- Erman Ayday
  - Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH
- Jaideep Vaidya
  - Management Science and Information Systems Department, Rutgers University, Newark, NJ
- Xiaoqian Jiang
  - Department of Data Science and Artificial Intelligence, University of Texas - Health, Houston, TX
- Amalio Telenti
  - Dept. of Integrative Structural and Computational Biology, Scripps Institute, La Jolla, CA
5
Sadhuka S, Fridman D, Berger B, Cho H. Assessing transcriptomic reidentification risks using discriminative sequence models. Genome Res 2023; 33:1101-1112. [PMID: 37541758 PMCID: PMC10538488 DOI: 10.1101/gr.277699.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2023] [Accepted: 04/19/2023] [Indexed: 08/06/2023]
Abstract
Gene expression data provide molecular insights into the functional impact of genetic variation, for example, through expression quantitative trait loci (eQTLs). With an improving understanding of the association between genotypes and gene expression comes a greater concern that gene expression profiles could be matched to genotype profiles of the same individuals in another data set, known as a linking attack. Prior works show such a risk could analyze only a fraction of eQTLs that is independent owing to restrictive model assumptions, leaving the full extent of this risk incompletely understood. To address this challenge, we introduce the discriminative sequence model (DSM), a novel probabilistic framework for predicting a sequence of genotypes based on gene expression data. By modeling the joint distribution over all known eQTLs in a genomic region, DSM improves the power of linking attacks with necessary calibration for linkage disequilibrium and redundant predictive signals. We show greater linking accuracy of DSM compared with existing approaches across a range of attack scenarios and data sets including up to 22,288 individuals, suggesting that DSM helps uncover a substantial additional risk overlooked by previous studies. Our work provides a unified framework for assessing the privacy risks of sharing diverse omics data sets beyond transcriptomics.
Affiliation(s)
- Shuvom Sadhuka
  - Computer Science and AI Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Daniel Fridman
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115, USA
- Bonnie Berger
  - Computer Science and AI Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Hyunghoon Cho
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
6
Liu W, Zhang Y, Yang H, Meng Q. A Survey on Differential Privacy for Medical Data Analysis. ANNALS OF DATA SCIENCE 2023; 11:1-15. [PMID: 38625247 PMCID: PMC10257172 DOI: 10.1007/s40745-023-00475-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/28/2023] [Revised: 05/16/2023] [Accepted: 05/22/2023] [Indexed: 12/01/2023]
Abstract
Machine learning methods promote the sustainable development of wise information technology of medicine (WITMED), and a variety of medical data brings high value and convenience to medical analysis. However, the applications of medical data have also been confronted with the risk of privacy leakage that is hard to avoid, especially when conducting correlation analysis or data sharing among multiple institutions. Data security and privacy preservation have recently played an essential role in the field of secure and private medical data analysis, where many differential privacy strategies are applied to medical data publishing and mining. In this paper, we survey research work on the applications of differential privacy for medical data analysis, discussing the necessity of medical privacy-preserving, the advantages of differential privacy, and their applications to typical medical data, such as genomic data and wearable device data. Furthermore, we discuss the challenges and potential future research directions for differential privacy in medical applications.
Affiliation(s)
- WeiKang Liu
  - Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China
- Yanchun Zhang
  - Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China
  - Institute of Sustainable Industries and Liveable Cities, Victoria University, Melbourne, Australia
- Hong Yang
  - Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China
- Qinxue Meng
  - College of Information Engineering, Suzhou University, Suzhou, China
7
Gyngell C, Lynch F, Vears D, Bowman-Smart H, Savulescu J, Christodoulou J. Storing paediatric genomic data for sequential interrogation across the lifespan. JOURNAL OF MEDICAL ETHICS 2023:jme-2022-108471. [PMID: 37263770 DOI: 10.1136/jme-2022-108471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Accepted: 03/02/2023] [Indexed: 06/03/2023]
Abstract
Genomic sequencing (GS) is increasingly used in paediatric medicine to aid in screening, research and treatment. Some health systems are trialling GS as a first-line test in newborn screening programmes. Questions about what to do with genomic data after it has been generated are becoming more pertinent. While other research has outlined the ethical reasons for storing deidentified genomic data to be used in research, the ethical case for storing data for future clinical use has not been explicated. In this paper, we examine the ethical case for storing genomic data with the intention of using it as a lifetime health resource. In this model, genomic data would be stored with the intention of reanalysis at certain points through one's life. We argue this could benefit individuals and create an important public resource. However, several ethical challenges must first be met to achieve these benefits. We explore issues related to privacy, consent, justice and equality. We conclude by arguing that health systems should be moving towards futures that allow for the sequential interrogation of genomic data throughout the lifespan.
Affiliation(s)
- Christopher Gyngell
  - Biomedical Ethics Research Group, Murdoch Children's Research Institute, Parkville, Victoria, Australia
  - Department of Paediatrics, The University of Melbourne, Melbourne, Victoria, Australia
- Fiona Lynch
  - Biomedical Ethics Research Group, Murdoch Children's Research Institute, Parkville, Victoria, Australia
  - Melbourne Law School, The University of Melbourne, Parkville, VIC, Australia
- Danya Vears
  - Biomedical Ethics Research Group, Murdoch Children's Research Institute, Parkville, Victoria, Australia
  - Department of Paediatrics, The University of Melbourne, Melbourne, Victoria, Australia
- Hilary Bowman-Smart
  - Biomedical Ethics Research Group, Murdoch Children's Research Institute, Parkville, Victoria, Australia
  - University of South Australia, Adelaide, South Australia, Australia
- Julian Savulescu
  - Biomedical Ethics Research Group, Murdoch Children's Research Institute, Parkville, Victoria, Australia
  - Faculty of Philosophy, University of Oxford, Oxford, UK
  - Centre for Biomedical Ethics - Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- John Christodoulou
  - Department of Paediatrics, The University of Melbourne, Melbourne, Victoria, Australia
  - Brain and Mitochondrial Research Group, Murdoch Children's Research Institute, Parkville, VIC, Australia
8
Jiang Y, Shang T, Liu J. Secure Counting Query Protocol for Genomic Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1457-1468. [PMID: 35666798 DOI: 10.1109/tcbb.2022.3178446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Statistical analysis on genomic data can explore the relationship between gene sequence and phenotype. Particularly, counting the genomic mutation samples and associating with related phenotypes for statistical analysis can annotate the variation sites and help to diagnose genovariation. Expansion of the size of variation sample data helps to increase the accuracy of statistical analysis. It is feasible to securely share data from genomic databases on cloud platforms. In this paper, we design a secure counting query protocol that can securely share genomic data on cloud platforms. Our protocol supports statistical analysis of the genomic data in VCF (Variant Call Format) files by counting query. There are three participants of data owner, cloud platform and query party. Firstly, the genomic data is preprocessed to reduce the data size. Secondly, Paillier homomorphic is used so that genomic data can be securely shared and calculated on cloud platform. Finally, the results which be decrypted is used to implement counting function of the protocol. Experimental results show that the protocol can implement the query counting function after homomorphic encryption. The query time is less than 1 s, which provide a feasible solution to share genomic data securely on cloud platform for statistical analysis.
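Illustrative note (not part of the abstract): the additive property of Paillier encryption that makes an encrypted counting query possible can be sketched as follows. This is a toy under stated assumptions, not the authors' protocol. The primes 1009 and 1013 are far too small for real use, the generator is fixed to g = n + 1, the per-sample 0/1 carrier flags stand in for whatever the preprocessed VCF data looks like, and all function names are hypothetical. The modular inverse via `pow(lam, -1, n)` needs Python 3.8 or later.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier key pair from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m, rng):
    """Encrypt integer m (0 <= m < n) as (n + 1)^m * r^n mod n^2."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(n, priv, c):
    """Recover m as L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) // n."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n

def add_encrypted(n, ciphertexts):
    """Multiply ciphertexts; the product decrypts to the sum of the plaintexts."""
    total = 1  # 1 is a valid ciphertext of 0
    for c in ciphertexts:
        total = total * c % (n * n)
    return total

rng = random.Random(0)
n, priv = keygen()
carrier_flags = [1, 0, 1, 1, 0, 1]           # hypothetical: 1 = sample carries the variant
encrypted_flags = [encrypt(n, f, rng) for f in carrier_flags]
encrypted_total = add_encrypted(n, encrypted_flags)   # computed without the private key
count = decrypt(n, priv, encrypted_total)
```

The party that multiplies the ciphertexts never sees the individual flags or the total; only the holder of the private key can read the count.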
9
Zhou J, Lei B, Lang H, Panaousis E, Liang K, Xiang J. Secure genotype imputation using homomorphic encryption. JOURNAL OF INFORMATION SECURITY AND APPLICATIONS 2023. [DOI: 10.1016/j.jisa.2022.103386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/09/2022]
10
Hwang S, Ozturk E, Tsudik G. Balancing Security and Privacy in Genomic Range Queries*. ACM TRANSACTIONS ON PRIVACY AND SECURITY 2022. [DOI: 10.1145/3575796] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/13/2022]
Abstract
Exciting recent advances in genome sequencing, coupled with greatly reduced storage and computation costs, make genomic testing increasingly accessible to individuals. Already today, one’s digitized DNA can be easily obtained from a sequencing lab and later used to conduct numerous tests by engaging with a testing facility. Due to the inherent sensitivity of genetic material and the often-proprietary nature of genomic tests, privacy is a natural and crucial issue. While genomic privacy received a great deal of attention within and outside the research community, genomic security has not been sufficiently studied. This is surprising since the usage of fake or altered genomes can have grave consequences, such as erroneous drug prescriptions and genetic test outcomes.
Unfortunately, in the genomic domain, privacy and security (as often happens) are at odds with each other. In this paper, we attempt to reconcile security with privacy in genomic testing by designing a novel technique for a secure and private genomic range query protocol between a genomic testing facility and an individual user. The proposed technique ensures authenticity and completeness of user-supplied genomic material while maintaining its privacy by releasing only the minimum thereof. To confirm its broad usability, we show how to apply the proposed technique to a previously proposed genomic private substring matching protocol. Experiments show that the proposed technique offers good performance and is quite practical. Furthermore, we generalize the genomic range query problem to sparse integer sets and discuss potential use cases.
11
Al Aziz MM, Thulasiraman P, Mohammed N. Parallel and private generalized suffix tree construction and query on genomic data. BMC Genom Data 2022; 23:45. [PMID: 35715724 PMCID: PMC9206251 DOI: 10.1186/s12863-022-01053-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2021] [Accepted: 04/25/2022] [Indexed: 11/10/2022]
Abstract
Background Several technological advancements and digitization of healthcare data have provided the scientific community with a large quantity of genomic data. Such datasets facilitated a deeper understanding of several diseases and our health in general. Strikingly, these genome datasets require a large storage volume and present technical challenges in retrieving meaningful information. Furthermore, the privacy aspects of genomic data limit access and often hinder timely scientific discovery. Methods In this paper, we utilize the Generalized Suffix Tree (GST); their construction and applications have been fairly studied in related areas. The main contribution of this article is the proposal of a privacy-preserving string query execution framework using GSTs and an additional tree-based hashing mechanism. Initially, we start by introducing an efficient GST construction in parallel that is scalable for a large genomic dataset. The secure indexing scheme allows the genomic data in a GST to be outsourced to an untrusted cloud server under encryption. Additionally, the proposed methods can perform several string search operations (i.e., exact, set-maximal matches) securely and efficiently using the outlined framework. Results The experimental results on different datasets and parameters in a real cloud environment exhibit the scalability of these methods as they also outperform the state-of-the-art method based on Burrows-Wheeler Transformation (BWT). The proposed method only takes around 36.7s to execute a set-maximal match whereas the BWT-based method takes around 160.85s, providing a 4× speedup. Supplementary Information The online version contains supplementary material available at (10.1186/s12863-022-01053-x).
12
Fierro-Monti I, Wright JC, Choudhary JS, Vizcaíno JA. Identifying individuals using proteomics: are we there yet? Front Mol Biosci 2022; 9:1062031. [PMID: 36523653 PMCID: PMC9744771 DOI: 10.3389/fmolb.2022.1062031] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/05/2022] [Accepted: 11/16/2022] [Indexed: 08/31/2023]
Abstract
Multi-omics approaches including proteomics analyses are becoming an integral component of precision medicine. As clinical proteomics studies gain momentum and their sensitivity increases, research on identifying individuals based on their proteomics data is here examined for risks and ethics-related issues. A great deal of work has already been done on this topic for DNA/RNA sequencing data, but it has yet to be widely studied in other omics fields. The current state-of-the-art for the identification of individuals based solely on proteomics data is explained. Protein sequence variation analysis approaches are covered in more detail, including the available analysis workflows and their limitations. We also outline some previous forensic and omics proteomics studies that are relevant for the identification of individuals. Following that, we discuss the risks of patient reidentification using other proteomics data types such as protein expression abundance and post-translational modification (PTM) profiles. In light of the potential identification of individuals through proteomics data, possible legal and ethical implications are becoming increasingly important in the field.
Affiliation(s)
- Ivo Fierro-Monti
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, United Kingdom
- Juan Antonio Vizcaíno
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, United Kingdom
13
Thaldar DW, Townsend BA, Donnelly DL, Botes M, Gooden A, van Harmelen J, Shozi B. The multidimensional legal nature of personal genomic sequence data: A South African perspective. Front Genet 2022; 13:997595. [PMID: 36437942 PMCID: PMC9681828 DOI: 10.3389/fgene.2022.997595] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/19/2022] [Accepted: 09/28/2022] [Indexed: 10/19/2023]
Abstract
This article provides a comprehensive analysis of the various dimensions in South African law applicable to personal genomic sequence data. This analysis includes property rights, personality rights, and intellectual property rights. Importantly, the under-investigated question of whether personal genomic sequence data are capable of being owned is investigated and answered affirmatively. In addition to being susceptible of ownership, personal genomic sequence data are also the object of data subjects' personality rights, and can also be the object of intellectual property rights: whether on their own qua trade secret or as part of a patented invention or copyrighted dataset. It is shown that personality rights constrain ownership rights, while the exploitation of intellectual property rights is constrained by both personality rights and ownership rights. All of these rights applicable to personal genomic sequence data should be acknowledged and harmonized for such data to be used effectively.
Affiliation(s)
- Beverley A. Townsend
  - School of Law, University of KwaZulu-Natal, Durban, South Africa
  - York Law School, University of York, York, United Kingdom
- Marietjie Botes
  - School of Law, University of KwaZulu-Natal, Durban, South Africa
  - Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, Luxembourg, Luxembourg
- Amy Gooden
  - School of Law, University of KwaZulu-Natal, Durban, South Africa
- Bonginkosi Shozi
  - School of Law, University of KwaZulu-Natal, Durban, South Africa
  - School of Law, Institute for Practical Ethics, University of California, San Diego, San Diego, CA, United States
14
Al Aziz MM, Anjum MM, Mohammed N, Jiang X. Generalized Genomic Data Sharing for Differentially Private Federated Learning. J Biomed Inform 2022; 132:104113. [PMID: 35690350 DOI: 10.1016/j.jbi.2022.104113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2021] [Revised: 03/28/2022] [Accepted: 06/05/2022] [Indexed: 10/18/2022]
Abstract
The success behind Machine Learning (ML) methods has largely been attributed to the quality and quantity of the available data which can spread across multiple owners. A Federated Learning (FL) from distributed datasets often provides a reliable solution that provides valuable insight. For a genomic dataset, such data have also proven to be sensitive which requires additional safety mechanisms before any sharing or ML operations. We propose a generalized gene expression data sharing method using a differentially private mechanism. Due to the large number of genes available, the data dimension is also reduced to accommodate smaller privacy budgets as we utilize an exponential mechanism to create a private histogram from numeric expression data. The output histogram can be used in any federated machine learning setting having multiple data owners. The proposed solution was submitted to genomic data security and privacy competition, iDash 2020 where it ranked third among 55 teams. We extend the proposed solution and experimented with two different machine learning algorithms and different settings. The experimental results show that it takes around 8 seconds to train a model while achieving 0.89 AUC with only a privacy budget of 5. The paper outlined a method to share gene expression data for Federated Learning using a privacy-preserving mechanism. Different experimental settings and recent competition results show the efficacy of the method which can be further extended to other genomic datasets and machine learning algorithms.
Affiliation(s)
- Md Momin Al Aziz: Computer Science, University of Manitoba, 66 Chancellors Circle, Winnipeg, R3T 2N2, Manitoba, Canada
- Md Monowar Anjum: Computer Science, University of Manitoba, 66 Chancellors Circle, Winnipeg, R3T 2N2, Manitoba, Canada
- Noman Mohammed: Computer Science, University of Manitoba, 66 Chancellors Circle, Winnipeg, R3T 2N2, Manitoba, Canada
- Xiaoqian Jiang: School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin Street, Houston, 77030, Texas, USA
|
15
|
Hartung M, Anastasi E, Mamdouh ZM, Nogales C, Schmidt HHHW, Baumbach J, Zolotareva O, List M. Cancer driver drug interaction explorer. Nucleic Acids Res 2022; 50:W138-W144. [PMID: 35580047 PMCID: PMC9252786 DOI: 10.1093/nar/gkac384] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/02/2022] [Revised: 04/06/2022] [Accepted: 04/29/2022] [Indexed: 12/16/2022] Open
Abstract
Cancer is a heterogeneous disease characterized by unregulated cell growth and promoted by mutations in cancer driver genes, some of which encode suitable drug targets. Since the distinct set of cancer driver genes can vary between and within cancer types, evidence-based selection of drugs is crucial for targeted therapy following the precision medicine paradigm. However, many putative cancer driver genes cannot be targeted directly, suggesting an indirect approach that considers alternative functionally related targets in the gene interaction network. Once potential drug targets have been identified, it is essential to consider all available drugs. Since tools that offer support for systematic discovery of drug repurposing candidates in oncology are lacking, we developed CADDIE, a web application integrating six human gene-gene and four drug-gene interaction databases, information regarding cancer driver genes, cancer-type specific mutation frequencies, gene expression information, genetically related diseases, and anticancer drugs. CADDIE offers access to various network algorithms for identifying drug targets and drug repurposing candidates. It guides users from the selection of seed genes to the identification of therapeutic targets or drug candidates, making network medicine algorithms accessible for clinical research. CADDIE is available at https://exbio.wzw.tum.de/caddie/ and programmatically via a Python package at https://pypi.org/project/caddiepy/.
Affiliation(s)
- Michael Hartung: Institute for Computational Systems Biology, University of Hamburg, 22607 Hamburg, Germany
- Elisa Anastasi: School of Computing, Newcastle University, 2308 Newcastle upon Tyne, UK
- Zeinab M Mamdouh: Department of Pharmacology and Personalised Medicine, Maastricht University, 6229 Maastricht, Netherlands; Department of Pharmacology and Toxicology, Faculty of Pharmacy, Zagazig University, 44519 Zagazig, Egypt
- Cristian Nogales: Department of Pharmacology and Personalised Medicine, Maastricht University, 6229 Maastricht, Netherlands
- Harald H H W Schmidt: Department of Pharmacology and Personalised Medicine, Maastricht University, 6229 Maastricht, Netherlands
- Jan Baumbach: Institute for Computational Systems Biology, University of Hamburg, 22607 Hamburg, Germany; Computational Biomedicine Lab, Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
- Olga Zolotareva: Institute for Computational Systems Biology, University of Hamburg, 22607 Hamburg, Germany; Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
- Markus List: Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
|
16
|
Nakagawa Y, Ohata S, Shimizu K. Efficient privacy-preserving variable-length substring match for genome sequence. Algorithms Mol Biol 2022; 17:9. [PMID: 35473587 PMCID: PMC9040336 DOI: 10.1186/s13015-022-00211-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/19/2021] [Accepted: 03/01/2022] [Indexed: 11/28/2022] Open
Abstract
The development of a privacy-preserving technology is important for accelerating genome data sharing. This study proposes an algorithm that securely searches a variable-length substring match between a query and a database sequence. Our concept hinges on a technique that efficiently applies FM-index for a secret-sharing scheme. More precisely, we developed an algorithm that can achieve a secure table lookup in such a way that $V[V[\ldots V[p_0] \ldots ]]$ is computed for a given depth of recursion where $p_0$ is an initial position, and $V$ is a vector. We used the secure table lookup for vectors created based on FM-index. The notable feature of the secure table lookup is that time, communication, and round complexities are not dependent on the table length N, after the query input. Therefore, a substring match by reference to the FM-index-based table can also be conducted independently against the database length, and the entire search time is dramatically improved compared to previous approaches. We conducted an experiment using a human genome sequence with the length of 10 million as the database and a query with the length of 100 and found that the query response time of our protocol was at least three orders of magnitude faster than a non-indexed database search protocol under the realistic computation/network environment.
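To make the notation in the abstract above concrete, the following is a minimal plaintext sketch of the nested lookup V[V[…V[p0]…]] for a given recursion depth. It only illustrates what quantity is being computed; it involves no secret sharing and is not the authors' secure protocol. The function name `iterated_lookup` and the toy vector are assumptions made for illustration.

```python
def iterated_lookup(V, p0, depth):
    """Plaintext illustration: apply the table V to position p0 `depth` times.

    Hypothetical helper, not from the paper. A secure version would
    operate on secret-shared V and p0 instead of plain values.
    """
    p = p0
    for _ in range(depth):
        p = V[p]  # one table lookup per level of recursion
    return p


# Toy vector (assumed for illustration): 0 -> 2 -> 3 -> 1
V = [2, 0, 3, 1]
print(iterated_lookup(V, 0, 3))  # prints 1
```

With `depth = 0` the function returns `p0` unchanged, which matches reading the expression as zero nested lookups.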
|
17
|
Yilmaz E, Ji T, Ayday E, Li P. Genomic Data Sharing under Dependent Local Differential Privacy. CODASPY : PROCEEDINGS OF THE ... ACM CONFERENCE ON DATA AND APPLICATION SECURITY AND PRIVACY. ACM CONFERENCE ON DATA AND APPLICATION SECURITY & PRIVACY 2022; 2022:77-88. [PMID: 35531063 PMCID: PMC9073402 DOI: 10.1145/3508398.3511519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Privacy-preserving genomic data sharing is important for increasing the pace of genomic research, and hence for paving the way towards personalized genomic medicine. In this paper, we introduce (ϵ, T)-dependent local differential privacy (LDP) for privacy-preserving sharing of correlated data and propose a genomic data sharing mechanism under this privacy definition. We first show that the original definition of LDP is not suitable for genomic data sharing, and then we propose a new mechanism to share genomic data. The proposed mechanism considers the correlations in data during data sharing, eliminates statistically unlikely data values beforehand, and adjusts the probability distributions for each shared data point accordingly. By doing so, we show that we can prevent an attacker from inferring the correct values of the shared data points by utilizing the correlations in the data. By adjusting the probability distributions of the shared states of each data point, we also improve the utility of shared data for the data collector. Furthermore, we develop a greedy algorithm that strategically identifies the processing order of the shared data points with the aim of maximizing the utility of the shared data. Our evaluation results on a real-life genomic dataset show the superiority of the proposed mechanism compared to the randomized response mechanism (a widely used technique to achieve LDP).
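For readers unfamiliar with the baseline named at the end of this abstract, here is a minimal sketch of generalized (k-ary) randomized response, the standard textbook way to achieve ϵ-LDP for a single categorical value. It is not the (ϵ, T)-dependent mechanism proposed in the paper and ignores correlations between data points. The function name, the `rng` parameter, and the genotype domain {0, 1, 2} are assumptions made for illustration.

```python
import math
import random


def randomized_response(value, domain, epsilon, rng=random):
    """Generalized randomized response over a finite domain (illustrative).

    Reports the true value with probability e^eps / (e^eps + k - 1),
    otherwise reports one of the other k - 1 values uniformly at random.
    """
    k = len(domain)
    p_true = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_true:
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)


# Assumed example: a SNP genotype coded as 0, 1 or 2 minor alleles
rng = random.Random(0)
noisy = [randomized_response(1, [0, 1, 2], epsilon=1.0, rng=rng) for _ in range(5)]
print(noisy)
```

A smaller `epsilon` makes the reported value closer to uniform noise; a larger `epsilon` reports the true value more often and so gives weaker privacy.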
Affiliation(s)
- Emre Yilmaz: University of Houston-Downtown, Houston, Texas
- Tianxi Ji: Case Western Reserve University, Cleveland, Ohio
- Erman Ayday: Case Western Reserve University, Cleveland, Ohio
- Pan Li: Case Western Reserve University, Cleveland, Ohio
|
18
|
Personalized workflows in reconstructive dentistry-current possibilities and future opportunities. Clin Oral Investig 2022; 26:4283-4290. [PMID: 35352184 PMCID: PMC9203374 DOI: 10.1007/s00784-022-04475-0] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 02/01/2022] [Accepted: 03/22/2022] [Indexed: 01/20/2023]
Abstract
Objectives
The increasing collection of health data coupled with continuous IT advances have enabled precision medicine with personalized workflows. Traditionally, dentistry has lagged behind general medicine in the integration of new technologies: So what is the status quo of precision dentistry? The primary focus of this review is to provide a current overview of personalized workflows in the discipline of reconstructive dentistry (prosthodontics) and to highlight the disruptive potential of novel technologies for dentistry; the possible impact on society is also critically discussed.
Material and methods
Narrative literature review.
Results
Narrative literature review.
Conclusions
In the near future, artificial intelligence (AI) will increase diagnostic accuracy, simplify treatment planning, and thus contribute to the development of personalized reconstructive workflows by analyzing e-health data to promote decision-making on an individual patient basis. Dental education will also benefit from AI systems for personalized curricula considering the individual students’ skills. Augmented reality (AR) will facilitate communication with patients and improve clinical workflows through the use of visually guided protocols. Tele-dentistry will enable opportunities for remote contact among dental professionals and facilitate remote patient consultations and post-treatment follow-up using digital devices. Finally, a personalized digital dental passport encoded using blockchain technology could enable prosthetic rehabilitation using 3D-printed dental biomaterials.
Clinical significance
Overall, AI can be seen as the door-opener and driving force for the evolution from evidence-based prosthodontics to personalized reconstructive dentistry encompassing a synoptic approach with prosthetic and implant workflows. Nevertheless, ethical concerns need to be solved and international guidelines for data management and computing power must be established prior to a widespread routine implementation.
|
19
|
Kim YG, Kang G. Secure Collaborative Platform for Healthcare Research in an Open Environment: A Perspective on Accountability in Access Control (Preprint). J Med Internet Res 2022; 24:e37978. [DOI: 10.2196/37978] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2022] [Revised: 08/02/2022] [Accepted: 08/30/2022] [Indexed: 11/13/2022] Open
|
20
|
Wan Z, Hazel JW, Clayton EW, Vorobeychik Y, Kantarcioglu M, Malin BA. Sociotechnical safeguards for genomic data privacy. Nat Rev Genet 2022; 23:429-445. [PMID: 35246669 PMCID: PMC8896074 DOI: 10.1038/s41576-022-00455-y] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Accepted: 01/24/2022] [Indexed: 12/21/2022]
Abstract
Recent developments in a variety of sectors, including health care, research and the direct-to-consumer industry, have led to a dramatic increase in the amount of genomic data that are collected, used and shared. This state of affairs raises new and challenging concerns for personal privacy, both legally and technically. This Review appraises existing and emerging threats to genomic data privacy and discusses how well current legal frameworks and technical safeguards mitigate these concerns. It concludes with a discussion of remaining and emerging challenges and illustrates possible solutions that can balance protecting privacy and realizing the benefits that result from the sharing of genetic information. In this Review, the authors describe technical and legal protection mechanisms for mitigating vulnerabilities in genomic data privacy. They also discuss how these protections are dependent on the context of data use such as in research, health care, direct-to-consumer testing or forensic investigations.
Affiliation(s)
- Zhiyu Wan: Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Computer Science, Vanderbilt University, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- James W Hazel: Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA; Center for Biomedical Ethics and Society, Vanderbilt University, Nashville, TN, USA
- Ellen Wright Clayton: Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA; Center for Biomedical Ethics and Society, Vanderbilt University, Nashville, TN, USA; Vanderbilt University Law School, Nashville, TN, USA
- Yevgeniy Vorobeychik: Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO, USA
- Murat Kantarcioglu: Department of Computer Science, University of Texas at Dallas, Richardson, TX, USA
- Bradley A Malin: Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Computer Science, Vanderbilt University, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
|
21
|
Akgün M, Pfeifer N, Kohlbacher O. Efficient privacy-preserving whole-genome variant queries. Bioinformatics 2022; 38:2202-2210. [PMID: 35150254 PMCID: PMC9004657 DOI: 10.1093/bioinformatics/btac070] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/06/2021] [Revised: 01/13/2022] [Accepted: 02/03/2022] [Indexed: 02/03/2023] Open
Abstract
Motivation
Diagnosis and treatment decisions on genomic data have become widespread as the cost of genome sequencing decreases gradually. In this context, disease-gene association studies are of great importance. However, genomic data are very sensitive when compared to other data types and contain information about individuals and their relatives. Many studies have shown that this information can be obtained from the query-response pairs on genomic databases. In this work, we propose a method that uses secure multi-party computation to query genomic databases in a privacy-protected manner. The proposed solution privately outsources genomic data from arbitrarily many sources to the two non-colluding proxies and allows genomic databases to be safely stored in semi-honest cloud environments. It provides data privacy, query privacy and output privacy by using XOR-based sharing and, unlike previous solutions, it allows queries to run efficiently on hundreds of thousands of genomic data.
Results
We measure the performance of our solution with parameters similar to real-world applications. It is possible to query a genomic database with 3 000 000 variants with five genomic query predicates under 400 ms. Querying 1 048 576 genomes, each containing 1 000 000 variants, for the presence of five different query variants can be achieved approximately in 6 min with a small amount of dedicated hardware and connectivity. These execution times are in the right range to enable real-world applications in medical research and healthcare. Unlike previous studies, it is possible to query multiple databases with response times fast enough for practical application. To the best of our knowledge, this is the first solution that provides this performance for querying large-scale genomic data.
Availability and implementation
https://gitlab.com/DIFUTURE/privacy-preserving-variant-queries.
Supplementary information
Supplementary data are available at Bioinformatics online.
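The abstract above names XOR-based sharing between two non-colluding proxies without describing it. As a minimal sketch of the general idea only (not the authors' protocol, which also covers query and output privacy), a data owner can split a byte string into two shares so that each share alone is uniformly random and only their XOR reveals the data. The function names and the example variant string are assumptions made for illustration.

```python
import secrets


def xor_share(data: bytes):
    """Split `data` into two XOR shares (illustrative, hypothetical helper).

    share1 is a fresh random mask; share2 is data XOR mask. Either share
    on its own is uniformly random and reveals nothing about `data`.
    """
    share1 = secrets.token_bytes(len(data))
    share2 = bytes(d ^ m for d, m in zip(data, share1))
    return share1, share2


def xor_reconstruct(share1: bytes, share2: bytes) -> bytes:
    """Recombine two XOR shares into the original byte string."""
    return bytes(a ^ b for a, b in zip(share1, share2))


# Assumed example record: a variant identifier encoded as bytes
record = b"chr1:12345:A>G"
s1, s2 = xor_share(record)          # s1 goes to proxy 1, s2 to proxy 2
assert xor_reconstruct(s1, s2) == record
```

The security of such a scheme rests on the two proxies not pooling their shares, which is the non-collusion assumption stated in the abstract.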
Affiliation(s)
- Mete Akgün (corresponding author)
- Nico Pfeifer: Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany; Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Tübingen, Germany; Statistical Learning in Computational Biology, Max Planck Institute for Informatics, Saarbrücken, Germany
- Oliver Kohlbacher: Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany; Translational Bioinformatics, University Hospital Tübingen, Tübingen, Germany; Applied Bioinformatics, Department of Computer Science, University of Tübingen, Tübingen, Germany
|
22
|
Alsaffar MM, Hasan M, McStay GP, Sedky M. Digital DNA lifecycle security and privacy: an overview. Brief Bioinform 2022; 23:6518049. [PMID: 35106557 DOI: 10.1093/bib/bbab607] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/13/2021] [Revised: 12/29/2021] [Accepted: 12/30/2021] [Indexed: 11/14/2022] Open
Abstract
DNA sequencing technologies have advanced significantly in the last few years leading to advancements in biomedical research which has improved personalised medicine and the discovery of new treatments for diseases. Sequencing technology advancement has also reduced the cost of DNA sequencing, which has led to the rise of direct-to-consumer (DTC) sequencing, e.g. 23andme.com, ancestry.co.uk, etc. In the meantime, concerns have emerged over privacy and security in collecting, handling, analysing and sharing DNA and genomic data. DNA data are unique and can be used to identify individuals. Moreover, those data provide information on people's current disease status and disposition, e.g. mental health or susceptibility for developing cancer. DNA privacy violation does not only affect the owner but also affects their close consanguinity due to its hereditary nature. This article introduces and defines the term 'digital DNA life cycle' and presents an overview of privacy and security threats and their mitigation techniques for predigital DNA and throughout the digital DNA life cycle. It covers DNA sequencing hardware, software and DNA sequence pipeline in addition to common privacy attacks and their countermeasures when DNA digital data are stored, queried or shared. Likewise, the article examines DTC genomic sequencing privacy and security.
Affiliation(s)
- Muhalb M Alsaffar: Department of Computing, AI and Robotics, School of Digital, Technologies and Arts, Staffordshire University, College Road, ST4 2DE, Staffordshire, United Kingdom
- Gavin P McStay: Department of Biological Sciences, School of Health, Science and Wellbeing, Staffordshire University, College Road, Stoke-on-Trent, Staffordshire, ST4 2DE, United Kingdom
- Mohamed Sedky: Department of Computing, AI and Robotics, School of Digital, Technologies and Arts, Staffordshire University, College Road, ST4 2DE, Staffordshire, United Kingdom
|
23
|
Torkzadehmahani R, Nasirigerdeh R, Blumenthal DB, Kacprowski T, List M, Matschinske J, Spaeth J, Wenke NK, Baumbach J. Privacy-Preserving Artificial Intelligence Techniques in Biomedicine. Methods Inf Med 2022; 61:e12-e27. [PMID: 35062032 PMCID: PMC9246509 DOI: 10.1055/s-0041-1740630] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 12/15/2022]
Abstract
Background
Artificial intelligence (AI) has been successfully applied in numerous scientific domains. In biomedicine, AI has already shown tremendous potential, e.g., in the interpretation of next-generation sequencing data and in the design of clinical decision support systems.
Objectives
However, training an AI model on sensitive data raises concerns about the privacy of individual participants. For example, summary statistics of a genome-wide association study can be used to determine the presence or absence of an individual in a given dataset. This considerable privacy risk has led to restrictions in accessing genomic and other biomedical data, which is detrimental for collaborative research and impedes scientific progress. Hence, there has been a substantial effort to develop AI methods that can learn from sensitive data while protecting individuals' privacy.
Method
This paper provides a structured overview of recent advances in privacy-preserving AI techniques in biomedicine. It places the most important state-of-the-art approaches within a unified taxonomy and discusses their strengths, limitations, and open problems.
Conclusion
As the most promising direction, we suggest combining federated machine learning as a more scalable approach with other additional privacy-preserving techniques. This would allow to merge the advantages to provide privacy guarantees in a distributed way for biomedical applications. Nonetheless, more research is necessary as hybrid approaches pose new challenges such as additional network or computation overhead.
Affiliation(s)
- Reihaneh Torkzadehmahani: Institute for Artificial Intelligence in Medicine and Healthcare, Technical University of Munich, Munich, Germany
- Reza Nasirigerdeh: Institute for Artificial Intelligence in Medicine and Healthcare, Technical University of Munich, Munich, Germany; Klinikum Rechts der Isar, Technical University of Munich, Munich, Germany
- David B Blumenthal: Department of Artificial Intelligence in Biomedical Engineering (AIBE), Friedrich-Alexander University Erlangen-Nürnberg (FAU), Erlangen, Germany
- Tim Kacprowski: Division of Data Science in Biomedicine, Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Medical School Hannover, Braunschweig, Germany; Braunschweig Integrated Centre of Systems Biology (BRICS), TU Braunschweig, Braunschweig, Germany
- Markus List: Chair of Experimental Bioinformatics, Technical University of Munich, Munich, Germany
- Julian Matschinske: E.U. Horizon2020 FeatureCloud Project Consortium; Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Julian Spaeth: E.U. Horizon2020 FeatureCloud Project Consortium; Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Nina Kerstin Wenke: E.U. Horizon2020 FeatureCloud Project Consortium; Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Jan Baumbach: E.U. Horizon2020 FeatureCloud Project Consortium; Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany; Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
|
24
|
Jafarbeiki S, Sakzad A, Kasra Kermanshahi S, Gaire R, Steinfeld R, Lai S, Abraham G, Thapa C. PrivGenDB: Efficient and privacy-preserving query executions over encrypted SNP-Phenotype database. INFORMATICS IN MEDICINE UNLOCKED 2022. [DOI: 10.1016/j.imu.2022.100988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022] Open
|
25
|
Ji T, Ayday E, Yilmaz E, Li P. Robust fingerprinting of genomic databases. Bioinformatics 2022; 38:i143-i152. [PMID: 35758787 PMCID: PMC9236581 DOI: 10.1093/bioinformatics/btac243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022] Open
Abstract
Motivation
Database fingerprinting has been widely used to discourage unauthorized redistribution of data by providing means to identify the source of data leakages. However, there is no fingerprinting scheme aiming at achieving liability guarantees when sharing genomic databases. Thus, we are motivated to fill in this gap by devising a vanilla fingerprinting scheme specifically for genomic databases. Moreover, since malicious genomic database recipients may compromise the embedded fingerprint (distort the steganographic marks, i.e. the embedded fingerprint bit-string) by launching effective correlation attacks, which leverage the intrinsic correlations among genomic data (e.g. Mendel’s law and linkage disequilibrium), we also augment the vanilla scheme by developing mitigation techniques to achieve robust fingerprinting of genomic databases against correlation attacks.
Results
Via experiments using a real-world genomic database, we first show that correlation attacks against fingerprinting schemes for genomic databases are very powerful. In particular, the correlation attacks can distort more than half of the fingerprint bits by causing a small utility loss (e.g. database accuracy and consistency of SNP–phenotype associations measured via P-values). Next, we experimentally show that the correlation attacks can be effectively mitigated by our proposed mitigation techniques. We validate that the attacker can hardly compromise a large portion of the fingerprint bits even if it pays a higher cost in terms of degradation of the database utility. For example, with around 24% loss in accuracy and 20% loss in the consistency of SNP–phenotype associations, the attacker can only distort about 30% fingerprint bits, which is insufficient for it to avoid being accused. We also show that the proposed mitigation techniques also preserve the utility of the shared genomic databases, e.g. the mitigation techniques only lead to around 3% loss in accuracy.
Availability and implementation
https://github.com/xiutianxi/robust-genomic-fp-github.
Affiliation(s)
- Tianxi Ji: Department of Electrical, Computer, and System Engineering, Case Western Reserve University, Cleveland, OH 44106, USA
- Erman Ayday (corresponding author)
- Pan Li: Department of Electrical, Computer, and System Engineering, Case Western Reserve University, Cleveland, OH 44106, USA
|
26
|
Wan Z, Vorobeychik Y, Xia W, Liu Y, Wooders M, Guo J, Yin Z, Clayton EW, Kantarcioglu M, Malin BA. Using game theory to thwart multistage privacy intrusions when sharing data. SCIENCE ADVANCES 2021; 7:eabe9986. [PMID: 34890225 PMCID: PMC8664254 DOI: 10.1126/sciadv.abe9986] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/14/2020] [Accepted: 10/25/2021] [Indexed: 06/13/2023]
Abstract
Person-specific biomedical data are now widely collected, but its sharing raises privacy concerns, specifically about the re-identification of seemingly anonymous records. Formal re-identification risk assessment frameworks can inform decisions about whether and how to share data; current techniques, however, focus on scenarios where the data recipients use only one resource for re-identification purposes. This is a concern because recent attacks show that adversaries can access multiple resources, combining them in a stage-wise manner, to enhance the chance of an attack’s success. In this work, we represent a re-identification game using a two-player Stackelberg game of perfect information, which can be applied to assess risk, and suggest an optimal data sharing strategy based on a privacy-utility tradeoff. We report on experiments with large-scale genomic datasets to show that, using game theoretic models accounting for adversarial capabilities to launch multistage attacks, most data can be effectively shared with low re-identification risk.
Affiliation(s)
- Zhiyu Wan: Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Yevgeniy Vorobeychik: Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA
- Weiyi Xia: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Yongtai Liu: Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA
- Myrna Wooders: Department of Economics, Vanderbilt University, Nashville, TN 37235, USA
- Jia Guo: Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA
- Zhijun Yin: Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Ellen Wright Clayton: Center for Biomedical Ethics and Society, Vanderbilt University Medical Center, Nashville, TN 37203, USA; School of Law, Vanderbilt University, Nashville, TN 37203, USA; Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Murat Kantarcioglu: Department of Computer Science, University of Texas at Dallas, Richardson, TX 75080, USA; Institute for Quantitative Social Science, Harvard University, Cambridge, MA 02138, USA; Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA
- Bradley A. Malin: Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
|
27
|
Blockchain-Based Privacy-Preserving System for Genomic Data Management Using Local Differential Privacy. ELECTRONICS 2021. [DOI: 10.3390/electronics10233019] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/16/2022]
Abstract
The advances made in genome technology have resulted in significant amounts of genomic data being generated at an increasing speed. As genomic data contain various privacy-sensitive information, security schemes that protect confidentiality and control access are essential. Many security techniques have been proposed to safeguard healthcare data. However, these techniques are inadequate for genomic data management because of their large size. Additionally, privacy problems due to the sharing of gene data are yet to be addressed. In this study, we propose a secure genomic data management system using blockchain and local differential privacy (LDP). The proposed system employs two types of storage: private storage for internal staff and semi-private storage for external users. In private storage, because encrypted gene data are stored, only internal employees can access the data. Meanwhile, in semi-private storage, gene data are irreversibly modified by LDP. Through LDP, different noises are added to each section of the genomic data. Therefore, even though the third party uses or exposes the shared data, the owner’s privacy is guaranteed. Furthermore, the access control for each storage is ensured by the blockchain, and the gene owner can trace the usage and sharing status using a decentralized application in a mobile device.
|
28
|
Dupras C, Bunnik EM. Toward a Framework for Assessing Privacy Risks in Multi-Omic Research and Databases. THE AMERICAN JOURNAL OF BIOETHICS : AJOB 2021; 21:46-64. [PMID: 33433298 DOI: 10.1080/15265161.2020.1863516] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/12/2023]
Abstract
While the accumulation and increased circulation of genomic data have captured much attention over the past decade, privacy risks raised by the diversification and integration of omics have been largely overlooked. In this paper, we propose the outline of a framework for assessing privacy risks in multi-omic research and databases. Following a comparison of privacy risks associated with genomic and epigenomic data, we dissect ten privacy risk-impacting omic data properties that affect either the risk of re-identification of research participants, or the sensitivity of the information potentially conveyed by biological data. We then propose a three-step approach for the assessment of privacy risks in the multi-omic era. Thus, we lay grounds for a data property-based, 'pan-omic' approach that moves away from genetic exceptionalism. We conclude by inviting our peers to refine these theoretical foundations, put them to the test in their respective fields, and translate our approach into practical guidance.
|
29
|
Kim M, Harmanci AO, Bossuat JP, Carpov S, Cheon JH, Chillotti I, Cho W, Froelicher D, Gama N, Georgieva M, Hong S, Hubaux JP, Kim D, Lauter K, Ma Y, Ohno-Machado L, Sofia H, Son Y, Song Y, Troncoso-Pastoriza J, Jiang X. Ultrafast homomorphic encryption models enable secure outsourcing of genotype imputation. Cell Syst 2021; 12:1108-1120.e4. [PMID: 34464590 PMCID: PMC9898842 DOI: 10.1016/j.cels.2021.07.010] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 08/25/2020] [Revised: 04/21/2021] [Accepted: 07/29/2021] [Indexed: 02/06/2023]
Abstract
Genotype imputation is a fundamental step in genomic data analysis, where missing variant genotypes are predicted using the existing genotypes of nearby "tag" variants. Although researchers can outsource genotype imputation, privacy concerns may prohibit genetic data sharing with an untrusted imputation service. Here, we developed secure genotype imputation using efficient homomorphic encryption (HE) techniques. In HE-based methods, the genotype data remain secure in transit, at rest, and in analysis, and can only be decrypted by the owner. We compared secure imputation with three state-of-the-art non-secure methods and found that HE-based methods provide genetic data security with comparable accuracy for common variants. HE-based methods have time and memory requirements that are comparable to or lower than those for the non-secure methods. Our results provide evidence that HE-based methods can practically perform resource-intensive computations for high-throughput genetic data analysis. The source code is freely available for download at https://github.com/K-miran/secure-imputation.
Affiliation(s)
- Miran Kim
  - Department of Computer Science and Engineering and Graduate School of Artificial Intelligence, Ulsan National Institute of Science and Technology, Ulsan, 44919, Republic of Korea
- Arif Ozgun Harmanci (corresponding author)
  - Center for Precision Health, School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
- Sergiu Carpov
  - Inpher, EPFL Innovation Park Bàtiment A, 3rd Fl, 1015 Lausanne, Switzerland
  - CEA, LIST, 91191 Gif-sur-Yvette Cedex, France
- Jung Hee Cheon
  - Department of Mathematical Sciences, Seoul National University, Seoul, 08826, Republic of Korea
  - Crypto Lab Inc., Seoul, 08826, Republic of Korea
- Wonhee Cho
  - Department of Mathematical Sciences, Seoul National University, Seoul, 08826, Republic of Korea
- Nicolas Gama
  - Inpher, EPFL Innovation Park Bàtiment A, 3rd Fl, 1015 Lausanne, Switzerland
- Mariya Georgieva
  - Inpher, EPFL Innovation Park Bàtiment A, 3rd Fl, 1015 Lausanne, Switzerland
- Seungwan Hong
  - Department of Mathematical Sciences, Seoul National University, Seoul, 08826, Republic of Korea
- Duhyeong Kim
  - Department of Mathematical Sciences, Seoul National University, Seoul, 08826, Republic of Korea
- Yiping Ma
  - University of Pennsylvania, Philadelphia, PA, 19104, USA
- Lucila Ohno-Machado
  - UCSD Health Department of Biomedical Informatics, University of California, San Diego, CA, 92093, USA
- Heidi Sofia
  - National Institutes of Health (NIH) - National Human Genome Research Institute, Bethesda, MD, 20892, USA
- Yongsoo Song
  - Department of Computer Science and Engineering, Seoul National University, Seoul, 08826, Republic of Korea
- Xiaoqian Jiang (corresponding author)
  - Center for Secure Artificial intelligence For hEalthcare (SAFE), School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
|
30
|
Abstract
Ensuring the privacy of participants in genomic studies is a critical responsibility of the biomedical community. Accurate and efficient implementations of secure genotype imputation highlight practical approaches to safeguard sensitive genomic data that can be adapted for numerous bioinformatics applications.
Affiliation(s)
- Maxwell A Sherman
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Harvard-MIT Health Sciences and Technology Program, Cambridge, MA, USA
  - Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
|
31
|
Hekel R, Budis J, Kucharik M, Radvanszky J, Pös Z, Szemes T. Privacy-preserving storage of sequenced genomic data. BMC Genomics 2021; 22:712. [PMID: 34600465 PMCID: PMC8487550 DOI: 10.1186/s12864-021-07996-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/22/2021] [Accepted: 09/10/2021] [Indexed: 11/23/2022] Open
Abstract
Background The current and future applications of genomic data may raise ethical and privacy concerns. Processing and storing these data introduce a risk of abuse by potential offenders, since the human genome contains sensitive personal information. For this reason, we have developed a privacy-preserving method named Varlock that provides secure storage of sequenced genomic data. We used a public set of population allele frequencies to mask the personal alleles detected in genomic reads. Each personal allele described by the public set is masked by a population allele selected at random according to its frequency. Masked alleles are preserved in an encrypted confidential file that can be shared in whole or in part using public-key cryptography. Results Our method masked the personal variants and introduced new variants detected in a personal masked genome. Alternative alleles with lower population frequency were masked and introduced more often. We performed a joint PCA analysis of personal and masked VCFs, showing that the VCFs between the two groups cannot be trivially mapped. Moreover, the method is reversible and personal alleles in specific genomic regions can be unmasked on demand. Conclusion Our method masks personal alleles within genomic reads while preserving valuable non-sensitive properties of sequenced DNA fragments for further research. Personal alleles in the desired genomic regions may be restored and shared with patients, clinics, and researchers. We suggest that the method can provide an additional security layer for storing and sharing of the raw aligned reads. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07996-2.
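The masking idea in this abstract can be sketched in a few lines of Python. This is not the Varlock implementation: the data layout (position-to-allele dictionaries), the function names, and the plain in-memory "confidential" record are assumptions for illustration, and the encryption of that record with public-key cryptography is omitted.

```python
import random

def mask_alleles(personal, population_freqs, seed=None):
    """Replace each personal allele covered by the public frequency set with a
    population allele drawn in proportion to its frequency.

    personal:         dict position -> allele (hypothetical layout)
    population_freqs: dict position -> {allele: frequency}
    Returns (masked, confidential); confidential holds the original alleles
    and would be encrypted in a real system.
    """
    rng = random.Random(seed)
    masked, confidential = {}, {}
    for pos, allele in personal.items():
        freqs = population_freqs.get(pos)
        if freqs is None:
            masked[pos] = allele  # position not in the public set: left as-is
            continue
        alleles = list(freqs)
        weights = [freqs[a] for a in alleles]
        masked[pos] = rng.choices(alleles, weights=weights, k=1)[0]
        confidential[pos] = allele
    return masked, confidential

def unmask(masked, confidential, positions=None):
    """Restore original alleles, optionally only for selected positions."""
    restored = dict(masked)
    for pos, allele in confidential.items():
        if positions is None or pos in positions:
            restored[pos] = allele
    return restored

if __name__ == "__main__":
    personal = {100: "A", 200: "G", 300: "T"}
    freqs = {100: {"A": 0.9, "C": 0.1}, 200: {"G": 0.5, "T": 0.5}}
    masked, secret = mask_alleles(personal, freqs, seed=3)
    print(masked, unmask(masked, secret, positions={100}))
```

Passing a set of positions to `unmask` mirrors the abstract's point that alleles in specific regions can be restored on demand while the rest stay masked.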
Affiliation(s)
- Rastislav Hekel
  - Geneton s.r.o, Bratislava, Slovakia
  - Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia
  - Slovak Centre of Scientific and Technical Information, Bratislava, Slovakia
  - Comenius University Science Park, Bratislava, Slovakia
- Jaroslav Budis
  - Geneton s.r.o, Bratislava, Slovakia
  - Slovak Centre of Scientific and Technical Information, Bratislava, Slovakia
  - Comenius University Science Park, Bratislava, Slovakia
- Marcel Kucharik
  - Geneton s.r.o, Bratislava, Slovakia
  - Comenius University Science Park, Bratislava, Slovakia
- Jan Radvanszky
  - Geneton s.r.o, Bratislava, Slovakia
  - Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia
  - Comenius University Science Park, Bratislava, Slovakia
  - Biomedical Research Centre, Institute of Clinical and Translational Research, Slovak Academy of Sciences, Bratislava, Slovakia
- Zuzana Pös
  - Geneton s.r.o, Bratislava, Slovakia
  - Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia
  - Comenius University Science Park, Bratislava, Slovakia
  - Biomedical Research Centre, Institute of Clinical and Translational Research, Slovak Academy of Sciences, Bratislava, Slovakia
- Tomas Szemes
  - Geneton s.r.o, Bratislava, Slovakia
  - Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia
  - Comenius University Science Park, Bratislava, Slovakia
|
32
|
Wirth FN, Meurers T, Johns M, Prasser F. Privacy-preserving data sharing infrastructures for medical research: systematization and comparison. BMC Med Inform Decis Mak 2021; 21:242. [PMID: 34384406 PMCID: PMC8359765 DOI: 10.1186/s12911-021-01602-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 01/07/2021] [Accepted: 07/31/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Data sharing is considered a crucial part of modern medical research. Unfortunately, despite its advantages, it often faces obstacles, especially data privacy challenges. As a result, various approaches and infrastructures have been developed that aim to ensure that patients and research participants remain anonymous when data is shared. However, privacy protection typically comes at a cost, e.g. restrictions regarding the types of analyses that can be performed on shared data. What is lacking is a systematization making the trade-offs taken by different approaches transparent. The aim of the work described in this paper was to develop a systematization for the degree of privacy protection provided and the trade-offs taken by different data sharing methods. Based on this contribution, we categorized popular data sharing approaches and identified research gaps by analyzing combinations of promising properties and features that are not yet supported by existing approaches. METHODS The systematization consists of different axes. Three axes relate to privacy protection aspects and were adopted from the popular Five Safes Framework: (1) safe data, addressing privacy at the input level, (2) safe settings, addressing privacy during shared processing, and (3) safe outputs, addressing privacy protection of analysis results. Three additional axes address the usefulness of approaches: (4) support for de-duplication, to enable the reconciliation of data belonging to the same individuals, (5) flexibility, to be able to adapt to different data analysis requirements, and (6) scalability, to maintain performance with increasing complexity of shared data or common analysis processes. 
RESULTS Using the systematization, we identified three different categories of approaches: distributed data analyses, which exchange anonymous aggregated data, secure multi-party computation protocols, which exchange encrypted data, and data enclaves, which store pooled individual-level data in secure environments for access for analysis purposes. We identified important research gaps, including a lack of approaches enabling the de-duplication of horizontally distributed data or providing a high degree of flexibility. CONCLUSIONS There are fundamental differences between different data sharing approaches and several gaps in their functionality that may be interesting to investigate in future work. Our systematization can make the properties of privacy-preserving data sharing infrastructures more transparent and support decision makers and regulatory authorities with a better understanding of the trade-offs taken.
Affiliation(s)
- Felix Nikolaus Wirth
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
- Thierry Meurers
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
- Marco Johns
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
- Fabian Prasser
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
|
33
|
Sarkar E, Chielle E, Gürsoy G, Mazonka O, Gerstein M, Maniatakos M. Fast and Scalable Private Genotype Imputation Using Machine Learning and Partially Homomorphic Encryption. IEEE ACCESS : PRACTICAL INNOVATIONS, OPEN SOLUTIONS 2021; 9:93097-93110. [PMID: 34476144 PMCID: PMC8409799 DOI: 10.1109/access.2021.3093005] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 06/13/2023]
Abstract
The recent advances in genome sequencing technologies provide unprecedented opportunities to understand the relationship between human genetic variation and diseases. However, genotyping whole genomes from a large cohort of individuals is still cost prohibitive. Imputation methods to predict genotypes of missing genetic variants are widely used, especially for genome-wide association studies. Accurate genotype imputation requires complex statistical methods. Due to the data and computing-intensive nature of the problem, imputation is increasingly outsourced, raising serious privacy concerns. In this work, we investigate solutions for fast, scalable, and accurate privacy-preserving genotype imputation using Machine Learning (ML) and a standardized homomorphic encryption scheme, Paillier cryptosystem. ML-based privacy-preserving inference has been largely optimized for computation-heavy non-linear functions in a single-output multi-class classification setting. However, having a large number of multi-class outputs per genome per individual calls for further optimizations and/or approximations specific to this application. Here we explore the effectiveness of linear models for genotype imputation to convert them to privacy-preserving equivalents using standardized homomorphic encryption schemes. Our results show that performance of our privacy-preserving genotype imputation method is equivalent to the state-of-the-art plaintext solutions, achieving up to 99% micro area under curve score, even on real-world large-scale datasets up to 80,000 targets.
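To make the combination of a linear model and the Paillier cryptosystem concrete, here is a toy Python sketch that evaluates a weighted sum over encrypted genotypes using only Paillier's additive homomorphism. It is a teaching example, not the authors' code: the tiny primes are insecure, the integer weights stand in for the fixed-point scaling a real model would need, and the split of roles (data owner encrypts, service holds plaintext weights) is an assumption. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy key generation with small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m, rng):
    """Enc(m) = (n+1)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def encrypted_linear_score(n, enc_genotypes, weights, enc_bias):
    """Compute Enc(bias + sum(w_i * x_i)) without decrypting any x_i.
    Multiplying ciphertexts adds plaintexts; raising to w scales by w."""
    n2 = n * n
    acc = enc_bias
    for c, w in zip(enc_genotypes, weights):
        acc = (acc * pow(c, w, n2)) % n2
    return acc

if __name__ == "__main__":
    rng = random.Random(0)
    n, priv = keygen()
    genotypes = [0, 1, 2, 1]   # tag-variant genotypes held by the data owner
    weights = [3, 0, 5, 2]     # hypothetical non-negative integer weights
    enc = [encrypt(n, g, rng) for g in genotypes]
    score = encrypted_linear_score(n, enc, weights, encrypt(n, 4, rng))
    print(decrypt(n, priv, score))
```

Only additions and multiplications by known constants are needed for a linear model, which is why a partially homomorphic scheme such as Paillier is sufficient here.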
Affiliation(s)
- Esha Sarkar
  - Tandon School of Engineering, New York University, New York, NY 11201, USA
- Eduardo Chielle
  - New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
- Gamze Gürsoy
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Oleg Mazonka
  - New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
- Mark Gerstein
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Michail Maniatakos
  - Tandon School of Engineering, New York University, New York, NY 11201, USA
|
34
|
Jungkunz M, Köngeter A, Mehlis K, Winkler EC, Schickhardt C. Secondary Use of Clinical Data in Data-Gathering, Non-Interventional Research or Learning Activities: Definition, Types, and a Framework for Risk Assessment. J Med Internet Res 2021; 23:e26631. [PMID: 34100760 PMCID: PMC8241435 DOI: 10.2196/26631] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/21/2020] [Revised: 03/10/2021] [Accepted: 05/06/2021] [Indexed: 12/16/2022] Open
Abstract
Background The secondary use of clinical data in data-gathering, non-interventional research or learning activities (SeConts) has great potential for scientific progress and health care improvement. At the same time, it poses relevant risks for the privacy and informational self-determination of patients whose data are used. Objective Since the current literature lacks a tailored framework for risk assessment in SeConts as well as a clarification of the concept and practical scope of SeConts, we aim to fill this gap. Methods In this study, we analyze each element of the concept of SeConts to provide a synthetic definition, investigate the practical relevance and scope of SeConts through a literature review, and operationalize the widespread definition of risk (as a harmful event of a certain magnitude that occurs with a certain probability) to conduct a tailored analysis of privacy risk factors typically implied in SeConts. Results We offer a conceptual clarification and definition of SeConts and provide a list of types of research and learning activities that can be subsumed under the definition of SeConts. We also offer a proposal for the classification of SeConts types into the categories non-interventional (observational) clinical research, quality control and improvement, or public health research. In addition, we provide a list of risk factors that determine the probability or magnitude of harm implied in SeConts. The risk factors provide a framework for assessing the privacy-related risks for patients implied in SeConts. We illustrate the use of risk assessment by applying it to a concrete example. Conclusions In the future, research ethics committees and data use and access committees will be able to rely on and apply the framework offered here when reviewing projects of secondary use of clinical data for learning and research purposes.
Affiliation(s)
- Martin Jungkunz
  - Section for Translational Medical Ethics, Department of Medical Oncology, National Center for Tumor Diseases, Heidelberg University Hospital, Heidelberg, Germany
- Anja Köngeter
  - Section for Translational Medical Ethics, Department of Medical Oncology, National Center for Tumor Diseases, Heidelberg University Hospital, Heidelberg, Germany
- Katja Mehlis
  - Section for Translational Medical Ethics, Department of Medical Oncology, National Center for Tumor Diseases, Heidelberg University Hospital, Heidelberg, Germany
- Eva C Winkler
  - Section for Translational Medical Ethics, Department of Medical Oncology, National Center for Tumor Diseases, Heidelberg University Hospital, Heidelberg, Germany
- Christoph Schickhardt
  - Section for Translational Medical Ethics, National Center for Tumor Diseases, German Cancer Research Center (DKFZ), Heidelberg, Germany
|
35
|
Eisenhauer ER, Tait AR, Low LK, Arslanian-Engoren CM. Women's Choices Regarding Use of Their Newborns' Residual Dried Blood Samples in Research. J Obstet Gynecol Neonatal Nurs 2021; 50:424-438. [PMID: 34033759 DOI: 10.1016/j.jogn.2021.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/01/2021] [Indexed: 09/30/2022] Open
Abstract
OBJECTIVE To determine the proportion of informed choices women made about donating their newborns' blood samples for research. DESIGN A quantitative analysis of informed choice using data on women's knowledge and attitudes from a descriptive, cross-sectional survey. SETTING The state of Michigan. PARTICIPANTS Women (N = 69, ≥18 years old) who had (a) newborns 0 to 3 months of age, (b) yes or no decisions regarding use of the blood sample for research on file, (c) no evidence of an infant death in the state database, (d) completed the knowledge scale, (e) completed the attitude scale, and (f) recalled the decision (i.e., yes or no) about donating blood samples. METHODS We used the multidimensional measure of informed choice to calculate the proportion of informed choices in data on women's knowledge, attitudes, and decisions about biospecimen research. RESULTS Fifty-five percent (38/69) of participants made informed choices about donating newborn blood samples for research, and 45% made uninformed choices (31/69). Inadequate knowledge about biospecimen research contributed to 87% of uninformed choices (27/31). Participants who declined to donate their newborns' blood samples struggled with making decisions consistent with their values. CONCLUSION Nearly half of the participants made uninformed choices about donating the blood samples of their newborns for research. Women need more information about genetics and the storage and research use of newborns' blood samples to make informed choices. Nurses need to be made aware of the ethical, legal, and social implications of such research because they are primary sources of advocacy, information, and support for childbearing women and may be charged with overseeing or obtaining informed consent. Additional research with larger, more diverse samples is needed.
|
36
|
Lu D, Zhang Y, Zhang L, Wang H, Weng W, Li L, Cai H. Methods of privacy-preserving genomic sequencing data alignments. Brief Bioinform 2021; 22:6279828. [PMID: 34021302 DOI: 10.1093/bib/bbab151] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/05/2020] [Revised: 03/10/2021] [Accepted: 03/30/2021] [Indexed: 11/14/2022] Open
Abstract
Genomic data alignment, a fundamental operation in sequencing, can be used to map reads to a reference sequence, query a genomic database, and perform genetic tests. However, with the reduction of sequencing cost and the accumulation of genome data, privacy-preserving genomic sequencing data alignment is becoming unprecedentedly important. In this paper, we present a comprehensive review of secure genomic data comparison schemes. We discuss the privacy threats, including adversaries and privacy attacks. The attacks can be categorized into inference, membership, identity tracing and completion attacks, and have been used to obtain private genomic information. We classify the state-of-the-art privacy-preserving genomic alignment methods designed to ease these threats into three scenarios: large-scale read mapping, querying of encrypted genomic datasets, and genetic testing. A comprehensive analysis of these approaches has been carried out to evaluate the computation and communication complexity as well as the privacy requirements. The survey provides researchers with current trends and insights into the significance and challenges of privacy issues in genomic data alignment.
Affiliation(s)
- Dandan Lu
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
- Yue Zhang
  - School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, 510006, China
- Ling Zhang
  - Department of Radiology, Sun Yat-sen University Cancer Center; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, 651 Dongfeng East Road, Guangzhou, P. R. China, 510060
- Haiyan Wang
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
- Wanlin Weng
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
- Li Li
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
- Hongmin Cai
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
|
37
|
Ayoz K, Ayday E, Cicek AE. Genome Reconstruction Attacks Against Genomic Data-Sharing Beacons. PROCEEDINGS ON PRIVACY ENHANCING TECHNOLOGIES. PRIVACY ENHANCING TECHNOLOGIES SYMPOSIUM 2021; 2021:28-48. [PMID: 34746296 PMCID: PMC8570374 DOI: 10.2478/popets-2021-0036] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/21/2022]
Abstract
Sharing genome data in a privacy-preserving way is a major bottleneck for the scientific progress promised by the big data era in genomics. A community-driven protocol named the genomic data-sharing beacon protocol has been widely adopted for sharing genomic data. The system aims to provide a secure, easy to implement, and standardized interface for data sharing by only allowing yes/no queries on the presence of specific alleles in the dataset. However, the beacon protocol was recently shown to be vulnerable to membership inference attacks. In this paper, we show that privacy threats against genomic data-sharing beacons are not limited to membership inference. We identify and analyze a novel vulnerability of genomic data-sharing beacons: genome reconstruction. We show that it is possible to successfully reconstruct a substantial part of the genome of a victim when the attacker knows the victim has been added to the beacon in a recent update. In particular, we show how an attacker can use the inherent correlations in the genome and clustering techniques to run such an attack in an efficient and accurate way. We also show that even if multiple individuals are added to the beacon during the same update, it is possible to identify the victim's genome with high confidence using traits that are easily accessible by the attacker (e.g., eye color or hair type). Moreover, we show how a genome reconstructed using a beacon that is not associated with a sensitive phenotype can be used for membership inference attacks on beacons with sensitive phenotypes (e.g., HIV+). The outcome of this work will guide beacon operators on when and how to update the content of the beacon and help them (along with the beacon participants) make informed decisions.
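The yes/no interface and the reason an update can leak information are easy to see in a small Python sketch. The `Beacon` class, the variant tuples, and the helper names below are hypothetical, and the sketch covers only the simplest part of the idea: comparing answers before and after an update. The correlation-based and clustering-based reconstruction described in the abstract is not modelled.

```python
class Beacon:
    """Minimal stand-in for a beacon: it answers only whether an allele
    is present somewhere in the dataset."""

    def __init__(self):
        self._alleles = set()

    def add_individual(self, variants):
        """variants: iterable of (chromosome, position, alternate_allele)."""
        self._alleles.update(variants)

    def query(self, chrom, pos, alt):
        return (chrom, pos, alt) in self._alleles

def snapshot(beacon, candidates):
    """Record the beacon's yes/no answer for each candidate variant."""
    return {v: beacon.query(*v) for v in candidates}

def newly_positive(before, after):
    """Variants whose answer flipped from 'no' to 'yes' across an update."""
    return [v for v in after if after[v] and not before[v]]

if __name__ == "__main__":
    beacon = Beacon()
    beacon.add_individual({("1", 100, "A"), ("2", 200, "T")})
    candidates = [("1", 100, "A"), ("1", 150, "G"), ("2", 200, "T"), ("3", 300, "C")]

    before = snapshot(beacon, candidates)
    # One new participant joins in the update.
    beacon.add_individual({("1", 150, "G"), ("2", 200, "T")})
    after = snapshot(beacon, candidates)

    print(newly_positive(before, after))
```

Alleles the newcomer shares with earlier participants (here the variant at position 200) do not flip, which is why a practical attack needs additional signal such as correlations between nearby variants.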
|
38
|
Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11052158] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 12/18/2022]
Abstract
Synthetic data provides a privacy protecting mechanism for the broad usage and sharing of healthcare data for secondary purposes. It is considered a safe approach for the sharing of sensitive data as it generates an artificial dataset that contains no identifiable information. Synthetic data is increasing in popularity with multiple synthetic data generators developed in the past decade, yet its utility is still a subject of research. This paper is concerned with evaluating the effect of various synthetic data generation and usage settings on the utility of the generated synthetic data and its derived models. Specifically, we investigate (i) the effect of data pre-processing on the utility of the synthetic data generated, (ii) whether tuning should be applied to the synthetic datasets when generating supervised machine learning models, and (iii) whether sharing preliminary machine learning results can improve the synthetic data models. Lastly, (iv) we investigate whether one utility measure (Propensity score) can predict the accuracy of the machine learning models generated from the synthetic data when employed in real life. We use two popular measures of synthetic data utility, propensity score and classification accuracy, to compare the different settings. We adopt a recent mechanism for the calculation of propensity, which looks carefully into the choice of model for the propensity score calculation. Accordingly, this paper takes a new direction with investigating the effect of various data generation and usage settings on the quality of the generated data and its ensuing models. The goal is to inform on the best strategies to follow when generating and using synthetic data.
|
39
|
Dupic T, Bensouda Koraichi M, Minervina AA, Pogorelyy MV, Mora T, Walczak AM. Immune fingerprinting through repertoire similarity. PLoS Genet 2021; 17:e1009301. [PMID: 33395405 PMCID: PMC7808657 DOI: 10.1371/journal.pgen.1009301] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/13/2020] [Revised: 01/14/2021] [Accepted: 12/07/2020] [Indexed: 11/18/2022] Open
Abstract
Immune repertoires provide a unique fingerprint reflecting the immune history of individuals, with potential applications in precision medicine. However, the question of how personal that information is and how it can be used to identify individuals has not been explored. Here, we show that individuals can be uniquely identified from repertoires of just a few thousand lymphocytes. We present "Immprint," a classifier using an information-theoretic measure of repertoire similarity to distinguish pairs of repertoire samples coming from the same versus different individuals. Using published T-cell receptor repertoires and statistical modeling, we tested its ability to identify individuals with great accuracy, including identical twins, by computing false positive and false negative rates < 10^-6 from samples composed of 10,000 T-cells. We verified through longitudinal datasets that the method is robust to acute infections and that the immune fingerprint is stable for at least three years. These results emphasize the private and personal nature of repertoire data.
Affiliation(s)
- Thomas Dupic
  - Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts, USA
  - Laboratoire de physique de l’École Normale Supérieure, CNRS, Sorbonne Université, Université de Paris, and École normale supérieure (PSL), Paris, France
- Meriem Bensouda Koraichi
  - Laboratoire de physique de l’École Normale Supérieure, CNRS, Sorbonne Université, Université de Paris, and École normale supérieure (PSL), Paris, France
- Mikhail V. Pogorelyy
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, Russia
  - Pirogov Russian National Research Medical University, Moscow, Russia
- Thierry Mora (corresponding author)
  - Laboratoire de physique de l’École Normale Supérieure, CNRS, Sorbonne Université, Université de Paris, and École normale supérieure (PSL), Paris, France
- Aleksandra M. Walczak (corresponding author)
  - Laboratoire de physique de l’École Normale Supérieure, CNRS, Sorbonne Université, Université de Paris, and École normale supérieure (PSL), Paris, France
|
40
|
Oliver KH, Higgs S, Clayton J. The End of Genetic Privacy in the Blade Runner Canon. JOURNAL OF LITERATURE AND SCIENCE 2021; 14:108-124. [PMID: 36506249 PMCID: PMC9731365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/17/2023]
Affiliation(s)
- Kendra H. Oliver
- Departments of Pharmacology and Communication of Science and Technology, Vanderbilt University
| | | | - Jay Clayton
- Departments of English, Cinema and Media Arts, and Communication of Science and Technology, Vanderbilt University
|
41
|
Rahman Mahdi MS, Al Aziz MM, Mohammed N, Jiang X. Privacy-preserving string search on encrypted genomic data using a generalized suffix tree. INFORMATICS IN MEDICINE UNLOCKED 2021. [DOI: 10.1016/j.imu.2021.100525] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022] Open
|
42
|
Schumacher GJ, Sawaya S, Nelson D, Hansen AJ. Genetic Information Insecurity as State of the Art. Front Bioeng Biotechnol 2020; 8:591980. [PMID: 33381496 PMCID: PMC7768984 DOI: 10.3389/fbioe.2020.591980] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2020] [Accepted: 11/16/2020] [Indexed: 11/16/2022] Open
Abstract
Genetic information is being generated at an increasingly rapid pace, offering advances in science and medicine that are paralleled only by the threats and risks present within the responsible systems. Human genetic information is identifiable and contains sensitive information, but genetic information security is only recently gaining attention. Genetic data is generated in an evolving and distributed cyber-physical system, with multiple subsystems that handle information and multiple partners that rely on and influence the whole ecosystem. This paper characterizes a general genetic information system from the point of biological material collection through long-term data sharing, storage and application in the security context. While all biotechnology stakeholders and ecosystems are valuable assets to the bioeconomy, genetic information systems are particularly vulnerable with great potential for harm and misuse. The security of post-analysis phases of data dissemination and storage has been focused on by others, but the security of wet and dry laboratories is also challenging due to distributed devices and systems that are neither designed nor implemented with security in mind. Consequently, industry standards and best operational practices threaten the security of genetic information systems. Extensive development of laboratory security will be required to realize the potential of this emerging field while protecting the bioeconomy and all of its stakeholders.
Affiliation(s)
- Garrett J. Schumacher
- GeneInfoSec Inc., Boulder, CO, United States
- Technology, Cybersecurity and Policy Program, College of Engineering and Applied Science, University of Colorado Boulder, Boulder, CO, United States
- Department of Computer Science, College of Engineering and Applied Science, University of Colorado Boulder, Boulder, CO, United States
| | | | | | - Aaron J. Hansen
- Technology, Cybersecurity and Policy Program, College of Engineering and Applied Science, University of Colorado Boulder, Boulder, CO, United States
- Department of Computer Science, College of Engineering and Applied Science, University of Colorado Boulder, Boulder, CO, United States
|
43
|
Karimi S, Jiang X, Dolin RH, Kim M, Boxwala A. A secure system for genomics clinical decision support. J Biomed Inform 2020; 112:103602. [PMID: 33080397 PMCID: PMC8577277 DOI: 10.1016/j.jbi.2020.103602] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2020] [Revised: 09/07/2020] [Accepted: 10/12/2020] [Indexed: 11/26/2022]
Abstract
We developed a prototype genomic archiving and communications system to securely store genome data and provide clinical decision support (CDS). This system operates on a client-server model. The client encrypts the data, and the server stores the data and performs the computations necessary for CDS. Computations are performed directly on encrypted data, and the client decrypts the results. The server cannot decrypt inputs or outputs, which provides strong guarantees of security. We have validated our system with three genomics-based CDS applications. The results demonstrate that it is possible to resolve a long-standing dilemma in genomic data privacy and accessibility by using a principled cryptographic framework and a mathematical representation of genome data and CDS questions.
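The client-encrypts, server-computes, client-decrypts flow can be sketched with a toy additively homomorphic scheme. The abstract does not name the cryptosystem used, so the Paillier construction below is a swapped-in stand-in; the primes, dosages, and weights are hypothetical and far too small for real use.

```python
import math
import random


def keygen(p, q):
    """Build a toy Paillier key pair from two primes (insecure sizes, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n, m, rng):
    """Client side: encrypt an integer m with 0 <= m < n."""
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(n, private_key, c):
    """Client side: recover the plaintext from ciphertext c."""
    lam, mu = private_key
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def server_weighted_sum(n, ciphertexts, weights):
    """Server side: compute an encryption of sum(w_i * m_i) without decrypting."""
    acc = 1  # a valid encryption of 0
    for c, w in zip(ciphertexts, weights):
        acc = (acc * pow(c, w, n * n)) % (n * n)
    return acc


rng = random.Random(0)
n, private_key = keygen(1009, 1013)
dosages = [0, 1, 2, 1]  # hypothetical risk-allele counts held by the client
weights = [3, 1, 2, 5]  # hypothetical integer weights held by the server
encrypted = [encrypt(n, d, rng) for d in dosages]
result = decrypt(n, private_key, server_weighted_sum(n, encrypted, weights))
print(result)
```

Only the client holds `private_key`, so the server sees ciphertexts on both the input and the output side, which mirrors the guarantee the abstract describes.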
Affiliation(s)
| | - Xiaoqian Jiang
- UT Health School of Biomedical Informatics, Houston, TX, United States
| | | | - Miran Kim
- UT Health School of Biomedical Informatics, Houston, TX, United States
| | - Aziz Boxwala
- Elimu Informatics Inc., Richmond, CA, United States
|
44
|
Yilmaz E, Ji T, Ayday E, Li P. Preserving Genomic Privacy via Selective Sharing. PROCEEDINGS OF THE ACM WORKSHOP ON PRIVACY IN THE ELECTRONIC SOCIETY. ACM WORKSHOP ON PRIVACY IN THE ELECTRONIC SOCIETY 2020; 2020:163-179. [PMID: 34485998 PMCID: PMC8411901 DOI: 10.1145/3411497.3420214] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
Abstract
Although genomic data has significant impact and widespread usage in medical research, it puts individuals' privacy in danger, even if they anonymously or partially share their genomic data. To address this problem, we present a framework inspired by differential privacy for sharing individuals' genomic data while preserving their privacy. We assume an individual with some sensitive portion of her genome (e.g., mutations or single nucleotide polymorphisms, SNPs, that reveal sensitive information about the individual) that she does not want to share. The goals of the individual are to (i) preserve the privacy of her sensitive data (considering the correlations between the sensitive and non-sensitive parts), (ii) preserve the privacy of interdependent data (data that belongs to other individuals and is correlated with her data), and (iii) share as much non-sensitive data as possible to maximize the utility of data sharing. As opposed to traditional differential privacy-based data sharing schemes, the proposed scheme does not intentionally add noise to data; it is based on selective sharing of data points. We observe that the traditional differential privacy concept does not capture sharing data in such a setting, and hence we first introduce a privacy notion, ϵ-indirect privacy, that addresses data sharing in such settings. We show that the proposed framework does not provide sensitive information to the attacker while it provides a high data sharing utility. We also compare the proposed technique with previous ones and show its advantage in terms of both privacy and data sharing utility.
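The idea of sharing selectively rather than adding noise can be illustrated with a deliberately simplified rule: withhold every sensitive SNP and every SNP that is strongly correlated with one. This is not the ϵ-indirect privacy mechanism of the paper; the function name, SNP identifiers, correlation values, and cutoff are all hypothetical.

```python
def select_shareable(snps, sensitive, correlation, max_corr):
    """Return SNP ids that are neither sensitive nor strongly correlated with a sensitive SNP.

    correlation maps frozenset({snp_a, snp_b}) to a value in [0, 1];
    missing pairs are treated as uncorrelated.
    """
    shareable = []
    for snp in snps:
        if snp in sensitive:
            continue
        leak = max(
            (correlation.get(frozenset((snp, s)), 0.0) for s in sensitive),
            default=0.0,
        )
        if leak <= max_corr:
            shareable.append(snp)
    return shareable


snps = ["snp_a", "snp_b", "snp_c", "snp_d"]
sensitive = {"snp_b"}
correlation = {
    frozenset(("snp_a", "snp_b")): 0.9,  # strong linkage: sharing snp_a would leak snp_b
    frozenset(("snp_c", "snp_b")): 0.2,
}
print(select_shareable(snps, sensitive, correlation, max_corr=0.5))
```

A hard cutoff like this trades utility for privacy in a crude way; a formal notion would bound what an attacker can infer from everything that is shared jointly, not from each SNP in isolation.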
Affiliation(s)
| | | | | | - Pan Li
- Case Western Reserve University
|
45
|
Chang C, Deng Y, Jiang X, Long Q. Multiple imputation for analysis of incomplete data in distributed health data networks. Nat Commun 2020; 11:5467. [PMID: 33122624 PMCID: PMC7596726 DOI: 10.1038/s41467-020-19270-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2020] [Accepted: 10/02/2020] [Indexed: 11/25/2022] Open
Abstract
Distributed health data networks (DHDNs) leverage data from multiple sources or sites, such as electronic health records (EHRs) from multiple healthcare systems, and have drawn increasing interest in recent years, as they do not require sharing of subject-level data and hence considerably lower the hurdles for collaboration between institutions. However, DHDNs face a number of challenges in data analysis, particularly in the presence of missing data. The current state-of-the-art methods for handling incomplete data require pooling data into a central repository before analysis, which is not feasible in DHDNs. In this paper, we address the missing data problem in distributed environments such as DHDNs, which has not been investigated previously. We develop communication-efficient distributed multiple imputation methods for incomplete data that are horizontally partitioned. Since subject-level data are not shared or transferred outside of each site in the proposed methods, they enhance protection of patient privacy and have the potential to strengthen public trust in analysis of sensitive health data. We investigate, through extensive simulation studies, the performance of these methods. Our methods are applied to the analysis of an acute stroke dataset collected from multiple hospitals, mimicking a DHDN where health data are horizontally partitioned across hospitals and subject-level data cannot be shared or sent to a central data repository.
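One way to picture analysis without pooling subject-level data is for each site to send only aggregate sums, from which a coordinator fits an imputation model that each site then applies locally. The sketch below does this for a one-predictor least-squares model. It is an assumption-laden simplification: the function names and data are hypothetical, and it fills in a single fitted value, whereas multiple imputation would draw several values with added randomness.

```python
def site_summary(rows):
    """Per-site sums over complete cases; no subject-level rows leave the site."""
    complete = [(x, y) for x, y in rows if y is not None]
    return {
        "n": len(complete),
        "sx": sum(x for x, _ in complete),
        "sy": sum(y for _, y in complete),
        "sxx": sum(x * x for x, _ in complete),
        "sxy": sum(x * y for x, y in complete),
    }


def pooled_fit(summaries):
    """Coordinator: combine site summaries into a least-squares slope and intercept."""
    total = {key: sum(s[key] for s in summaries) for key in ("n", "sx", "sy", "sxx", "sxy")}
    denominator = total["n"] * total["sxx"] - total["sx"] ** 2
    slope = (total["n"] * total["sxy"] - total["sx"] * total["sy"]) / denominator
    intercept = (total["sy"] - slope * total["sx"]) / total["n"]
    return slope, intercept


def impute_locally(rows, slope, intercept):
    """Each site fills its own missing outcomes using the shared model."""
    return [(x, y if y is not None else intercept + slope * x) for x, y in rows]


# Two hypothetical hospitals, each with one missing outcome (None).
site_a = [(1.0, 3.0), (2.0, 5.0), (3.0, None)]
site_b = [(4.0, 9.0), (5.0, 11.0), (6.0, None)]

slope, intercept = pooled_fit([site_summary(site_a), site_summary(site_b)])
print(slope, intercept)
print(impute_locally(site_a, slope, intercept))
```

Because the sums are additive across sites, the pooled fit matches what a central analysis of the complete cases would give, while each site transmits only five numbers.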
Affiliation(s)
| | - Yi Deng
- Emory University, Atlanta, GA, USA
| | - Xiaoqian Jiang
- University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Qi Long
- University of Pennsylvania, Philadelphia, PA, USA.
|
46
|
Kweon S, Lee JH, Lee Y, Park YR. Personal Health Information Inference Using Machine Learning on RNA Expression Data from Patients With Cancer: Algorithm Validation Study. J Med Internet Res 2020; 22:e18387. [PMID: 32773372 PMCID: PMC7445622 DOI: 10.2196/18387] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2020] [Revised: 03/25/2020] [Accepted: 07/06/2020] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND As the need for sharing genomic data grows, privacy issues and concerns, such as the ethics surrounding data sharing and disclosure of personal information, are raised. OBJECTIVE The main purpose of this study was to verify whether genomic data is sufficient to predict a patient's personal information. METHODS RNA expression data and matched patient personal information were collected from 9538 patients in The Cancer Genome Atlas program. Five personal information variables (age, gender, race, cancer type, and cancer stage) were recorded for each patient. Four different machine learning algorithms (support vector machine, decision tree, random forest, and artificial neural network) were used to determine whether a patient's personal information could be accurately predicted from RNA expression data. Performance measurement of the prediction models was based on the accuracy and area under the receiver operating characteristic curve. We selected five cancer types (breast carcinoma, kidney renal clear cell carcinoma, head and neck squamous cell carcinoma, low-grade glioma, and lung adenocarcinoma) with large sample sizes to verify whether predictive accuracy would differ between them. We also validated the efficacy of our four machine learning models in analyzing normal samples from 593 cancer patients. RESULTS In most samples, personal information with high genetic relevance, such as gender and cancer type, could be predicted from RNA expression data alone. The prediction accuracies for gender and cancer type, which were the best models, were 0.93-0.99 and 0.78-0.94, respectively. Other aspects of personal information, such as age, race, and cancer stage, were difficult to predict from RNA expression data, with accuracies ranging from 0.0026-0.29, 0.76-0.96, and 0.45-0.79, respectively. Among the tested machine learning methods, the highest predictive accuracy was obtained using the support vector machine algorithm (mean accuracy 0.77), while the lowest accuracy was obtained using the random forest method (mean accuracy 0.65). Gender and race were predicted more accurately than other variables in the samples. On average, the accuracy of cancer stage prediction ranged between 0.71-0.67, while the age prediction accuracy ranged between 0.18-0.23 for the five cancer types. CONCLUSIONS We attempted to predict patient information using RNA expression data. We found that some identifiers could be predicted, but most others could not. This study showed that personal information available from RNA expression data is limited and this information cannot be used to identify specific patients.
Affiliation(s)
- Solbi Kweon
- Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
- Department of Medical Engineering, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
| | - Jeong Hoon Lee
- Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
| | - Younghee Lee
- Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, UT, United States
| | - Yu Rang Park
- Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
|
47
|
Abstract
BACKGROUND Genomic data have been collected by different institutions and companies and need to be shared for broader use. In a cross-site genomic data sharing system, a secure and transparent access control audit module plays an essential role in ensuring accountability. A centralized access log audit system is vulnerable to a single point of attack and also lacks transparency, since the log could be tampered with by a malicious system administrator or internal adversaries. Several studies have proposed blockchain-based access audit to solve this problem, but without considering the efficiency of the audit queries. The first track of the 2018 iDASH competition provided us with an opportunity to design an efficient logging and querying system for cross-site genomic dataset access audit. We designed a blockchain-based log system that can provide a lightweight and widely compatible module for existing blockchain platforms. The submitted solution won third place in the competition. In this paper, we report the technical details of our system. METHODS We present two methods: a baseline method and an enhanced method. We started with the baseline method and then adjusted our implementation based on the competition evaluation criteria and the characteristics of the log system. To overcome the obstacles of indexing on the immutable blockchain system, we designed a hierarchical timestamp structure that supports efficient range queries on the timestamp field. RESULTS We implemented our methods in Python 3, tested the scalability, and compared the performance using the test data supplied by the competition organizer. We successfully boosted the log retrieval speed for complex AND queries that contain multiple predicates. For the range query, we boosted the speed by at least one order of magnitude. Storage usage is reduced by 25%. CONCLUSION We demonstrate that blockchain can be used to build a time- and space-efficient logging and querying system for a genomic dataset audit trail. Therefore, it provides a promising solution for sharing genomic data with accountability requirements across multiple sites.
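A hierarchical timestamp index of the kind mentioned in the METHODS can be sketched as follows: each log entry is appended under one bucket per granularity, and a range query is covered with the coarsest buckets that fit entirely inside the range. The abstract gives no details of the actual structure, so the bucket widths, class name, and in-memory dictionary standing in for an append-only key-value store are all assumptions.

```python
from collections import defaultdict

WIDTHS = (3600, 60, 1)  # hypothetical bucket widths in seconds: hour, minute, second


class HierarchicalTimeIndex:
    def __init__(self):
        # Stand-in for an append-only key-value store: values are only ever appended to.
        self.store = defaultdict(list)

    def insert(self, timestamp, record_id):
        """Append the record under one bucket per level (timestamp: non-negative int)."""
        for width in WIDTHS:
            self.store[(width, timestamp // width)].append(record_id)

    def range_query(self, start, end):
        """Return ids of records with start <= timestamp < end."""
        results = []
        self._cover(start, end, 0, results)
        return results

    def _cover(self, start, end, level, results):
        if start >= end:
            return
        width = WIDTHS[level]
        first = -(-start // width)  # first bucket lying entirely at or after start
        last = end // width         # one past the last bucket lying entirely before end
        if first >= last:
            # No whole bucket fits at this level, so descend to a finer one.
            self._cover(start, end, level + 1, results)
            return
        self._cover(start, first * width, level + 1, results)  # ragged left edge
        for bucket in range(first, last):
            results.extend(self.store.get((width, bucket), []))
        self._cover(last * width, end, level + 1, results)      # ragged right edge


index = HierarchicalTimeIndex()
for ts, rid in [(10, "a"), (70, "b"), (3600, "c"), (7200, "e"), (7300, "d")]:
    index.insert(ts, rid)
print(index.range_query(60, 7260))
```

Writing each entry at several granularities costs extra storage per record, but a long range then touches a few coarse buckets instead of one key per second.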
Affiliation(s)
- Shuaicheng Ma
- Department of Computer Science, Emory University, 400 Dowman Dr, Atlanta, GA USA
| | - Yang Cao
- Department of Social Informatics, Kyoto University, Kyoto, Japan
| | - Li Xiong
- Department of Computer Science, Emory University, 400 Dowman Dr, Atlanta, GA USA
|
48
|
Bonomi L, Huang Y, Ohno-Machado L. Privacy challenges and research opportunities for genomic data sharing. Nat Genet 2020; 52:646-654. [PMID: 32601475 PMCID: PMC7761157 DOI: 10.1038/s41588-020-0651-0] [Citation(s) in RCA: 70] [Impact Index Per Article: 17.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2019] [Accepted: 05/22/2020] [Indexed: 12/17/2022]
Abstract
The sharing of genomic data holds great promise in advancing precision medicine and providing personalized treatments and other types of interventions. However, these opportunities come with privacy concerns, and data misuse could potentially lead to privacy infringement for individuals and their blood relatives. With the rapid growth and increased availability of genomic datasets, understanding the current genome privacy landscape and identifying the challenges in developing effective privacy-protecting solutions are imperative. In this work, we provide an overview of major privacy threats identified by the research community and examine the privacy challenges in the context of emerging direct-to-consumer genetic-testing applications. We additionally present general privacy-protection techniques for genomic data sharing and their potential applications in direct-to-consumer genomic testing and forensic analyses. Finally, we discuss limitations in current privacy-protection methods, highlight possible mitigation strategies and suggest future research opportunities for advancing genomic data sharing.
Affiliation(s)
- Luca Bonomi
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA.
| | - Yingxiang Huang
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
| | - Lucila Ohno-Machado
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Division of Health Services Research & Development, VA San Diego Healthcare System, San Diego, La Jolla, CA, USA
|
49
|
Almadhoun N, Ayday E, Ulusoy Ö. Inference attacks against differentially private query results from genomic datasets including dependent tuples. Bioinformatics 2020; 36:i136-i145. [PMID: 32657411 PMCID: PMC7355303 DOI: 10.1093/bioinformatics/btaa475] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open
Abstract
MOTIVATION The rapid decrease in sequencing technology costs leads to a revolution in medical research and clinical care. Today, researchers have access to large genomic datasets to study associations between variants and complex traits. However, availability of such genomic datasets also results in new privacy concerns about personal information of the participants in genomic studies. Differential privacy (DP) is one of the rigorous privacy concepts, which received widespread interest for sharing summary statistics from genomic datasets while protecting the privacy of participants against inference attacks. However, DP has a known drawback as it does not consider the correlation between dataset tuples. Therefore, privacy guarantees of DP-based mechanisms may degrade if the dataset includes dependent tuples, which is a common situation for genomic datasets due to the inherent correlations between genomes of family members. RESULTS In this article, using two real-life genomic datasets, we show that exploiting the correlation between the dataset participants results in significant information leak from differentially private results of complex queries. We formulate this as an attribute inference attack and show the privacy loss in minor allele frequency (MAF) and chi-square queries. Our results show that using the results of differentially private MAF queries and utilizing the dependency between tuples, an adversary can reveal up to 50% more sensitive information about the genome of a target (compared to original privacy guarantees of standard DP-based mechanisms), while differentially private chi-square queries can reveal up to 40% more sensitive information. Furthermore, we show that the adversary can use the inferred genomic data obtained from the attribute inference attack to infer the membership of a target in another genomic dataset (e.g. associated with a sensitive trait). Using a log-likelihood-ratio test, our results also show that the inference power of the adversary can be significantly high in such an attack even using inferred (and hence partially incorrect) genomes. AVAILABILITY AND IMPLEMENTATION https://github.com/nourmadhoun/Inference-Attacks-Differential-Privacy.
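For readers unfamiliar with the kind of query being attacked, the sketch below answers a MAF query with the standard Laplace mechanism. It is a generic textbook construction, not code from the linked repository; the function names, genotypes, and epsilon values are hypothetical, and the sensitivity of 1/n assumes a fixed cohort size with one participant replaced.

```python
import random


def minor_allele_frequency(genotypes):
    """Genotypes are minor-allele counts (0, 1, or 2), one per participant."""
    return sum(genotypes) / (2 * len(genotypes))


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_maf(genotypes, epsilon, rng):
    """Release a noisy MAF; replacing one participant shifts the true MAF by at most 1/n."""
    sensitivity = 1.0 / len(genotypes)
    noisy = minor_allele_frequency(genotypes) + laplace_noise(sensitivity / epsilon, rng)
    return min(1.0, max(0.0, noisy))  # clamping is post-processing and keeps the guarantee


rng = random.Random(7)
genotypes = [0, 1, 2, 1, 0, 0, 1, 0]  # hypothetical cohort of eight participants
print(minor_allele_frequency(genotypes))
print(private_maf(genotypes, epsilon=0.5, rng=rng))
```

The noise scale here is calibrated as if participants were independent. When relatives are present, one person's genome also shifts the expected genotypes of others, which is the gap between the nominal and the effective guarantee that the abstract reports.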
Affiliation(s)
- Nour Almadhoun
- Computer Engineering Department, Bilkent University, Bilkent, Ankara 06800, Turkey
| | - Erman Ayday
- Computer Engineering Department, Bilkent University, Bilkent, Ankara 06800, Turkey
- Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, USA
| | - Özgür Ulusoy
- Computer Engineering Department, Bilkent University, Bilkent, Ankara 06800, Turkey
|
50
|
Jones K, Daniels H, Heys S, Lacey A, Ford DV. Toward a Risk-Utility Data Governance Framework for Research Using Genomic and Phenotypic Data in Safe Havens: Multifaceted Review. J Med Internet Res 2020; 22:e16346. [PMID: 32412420 PMCID: PMC7260661 DOI: 10.2196/16346] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2019] [Revised: 01/13/2020] [Accepted: 01/30/2020] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Research using genomic data opens up new insights into health and disease. Being able to use the data in association with health and administrative record data held in safe havens can multiply the benefits. However, there is much discussion about the use of genomic data with perceptions of particular challenges in doing so safely and effectively. OBJECTIVE This study aimed to work toward a risk-utility data governance framework for research using genomic and phenotypic data in an anonymized form for research in safe havens. METHODS We carried out a multifaceted review drawing upon data governance arrangements in published research, case studies of organizations working with genomic and phenotypic data, public views and expectations, and example studies using genomic and phenotypic data in combination. The findings were contextualized against a backdrop of legislative and regulatory requirements and used to create recommendations. RESULTS We proposed recommendations toward a risk-utility model with a flexible suite of controls to safeguard privacy and retain data utility for research. These were presented as overarching principles aligned to the core elements in the data sharing framework produced by the Global Alliance for Genomics and Health and as practical control measures distilled from published literature and case studies of operational safe havens to be applied as required at a project-specific level. CONCLUSIONS The recommendations presented can be used to contribute toward a proportionate data governance framework to promote the safe, socially acceptable use of genomic and phenotypic data in safe havens. They do not purport to eradicate risk but propose case-by-case assessment with transparency and accountability. If the risks are adequately understood and mitigated, there should be no reason that linked genomic and phenotypic data should not be used in an anonymized form for research in safe havens.
Affiliation(s)
- Kerina Jones
- Population Data Science, Swansea University Medical School, Swansea University, Swansea, United Kingdom
| | - Helen Daniels
- Population Data Science, Swansea University Medical School, Swansea University, Swansea, United Kingdom
| | - Sharon Heys
- Population Data Science, Swansea University Medical School, Swansea University, Swansea, United Kingdom
| | - Arron Lacey
- Population Data Science, Swansea University Medical School, Swansea University, Swansea, United Kingdom
| | - David V Ford
- Population Data Science, Swansea University Medical School, Swansea University, Swansea, United Kingdom
|