1
Jiang Y, Shang T, Liu J. Secure Counting Query Protocol for Genomic Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1457-1468. [PMID: 35666798 DOI: 10.1109/tcbb.2022.3178446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Statistical analysis of genomic data can explore the relationship between gene sequence and phenotype. In particular, counting the samples that carry a genomic mutation and associating them with related phenotypes can annotate variation sites and help to diagnose genetic variation. Enlarging the variation sample data helps to increase the accuracy of statistical analysis. It is feasible to securely share data from genomic databases on cloud platforms. In this paper, we design a secure counting query protocol that can securely share genomic data on cloud platforms. Our protocol supports statistical analysis of the genomic data in VCF (Variant Call Format) files by counting queries. There are three participants: the data owner, the cloud platform and the query party. First, the genomic data is preprocessed to reduce the data size. Second, Paillier homomorphic encryption is used so that genomic data can be securely shared and computed on the cloud platform. Finally, the decrypted results are used to implement the counting function of the protocol. Experimental results show that the protocol can implement the counting query function after homomorphic encryption. The query time is less than 1 s, which provides a feasible solution for sharing genomic data securely on cloud platforms for statistical analysis.
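The additive property that makes a Paillier-based counting query possible can be illustrated with a minimal Python sketch. This is not the protocol from the paper: the function names, the toy key size (two Mersenne primes), the 0/1 encoding of carrier status and the choice of who holds the private key are all assumptions made for illustration only.

```python
import math
import secrets

# Toy Paillier cryptosystem (illustrative assumption, not the paper's code).
# The primes are far too small for real use; they only keep the demo fast.

def keygen(p=2**31 - 1, q=2**61 - 1):
    """Return (public modulus n, private key (lam, mu)), using generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lam; sufficient when g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public modulus n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    # With g = n + 1, g^m mod n^2 equals 1 + m*n mod n^2.
    return (1 + m * n) % n2 * pow(r, n, n2) % n2

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)

def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n

# Hypothetical data: 1 if a sample carries the queried variant, else 0.
carriers = [1, 0, 1, 1, 0]

n, priv = keygen()
ciphertexts = [encrypt(n, bit) for bit in carriers]  # done by the data owner

# The cloud platform combines ciphertexts without seeing any plaintext bit.
total = ciphertexts[0]
for c in ciphertexts[1:]:
    total = add_encrypted(n, total, c)

count = decrypt(n, priv, total)  # done by whoever holds the private key
```

In this sketch the cloud only ever handles ciphertexts, and the decrypted `count` equals the number of carriers in the hypothetical list. A real deployment would need primes of at least 1024 bits and a defined key-distribution step.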
2
van Egmond MB, Spini G, van der Galien O, IJpma A, Veugen T, Kraaij W, Sangers A, Rooijakkers T, Langenkamp P, Kamphorst B, van de L'Isle N, Kooij-Janic M. Privacy-preserving dataset combination and Lasso regression for healthcare predictions. BMC Med Inform Decis Mak 2021; 21:266. [PMID: 34530824 PMCID: PMC8445286 DOI: 10.1186/s12911-021-01582-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/05/2021] [Accepted: 06/29/2021] [Indexed: 11/12/2022] Open
Abstract
Background: Recent developments in machine learning have shown its potential impact for clinical use such as risk prediction, prognosis, and treatment selection. However, relevant data are often scattered across different stakeholders and their use is regulated, e.g. by GDPR or HIPAA. As a concrete use-case, hospital Erasmus MC and health insurance company Achmea have data on individuals in the city of Rotterdam, which would in theory enable them to train a regression model in order to identify high-impact lifestyle factors for heart failure. However, privacy and confidentiality concerns make it unfeasible to exchange these data. Methods: This article describes a solution where vertically-partitioned synthetic data of Achmea and of Erasmus MC are combined using Secure Multi-Party Computation. First, a secure inner join protocol takes place to securely determine the identifiers of the patients that are represented in both datasets. Then, a secure Lasso Regression model is trained on the securely combined data. The involved parties thus obtain the prediction model but no further information on the input data of the other parties. Results: We implement our secure solution and describe its performance and scalability: we can train a prediction model on two datasets with 5000 records each and a total of 30 features in less than one hour, with a minimal difference from the results of standard (non-secure) methods. Conclusions: This article shows that it is possible to combine datasets and train a Lasso regression model on this combination in a secure way. Such a solution thus further expands the potential of privacy-preserving data analysis in the medical domain.
Affiliation(s)
- Marie Beth van Egmond
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands
- Gabriele Spini
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands
- Thijs Veugen
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands; Cryptology Research Group, Centrum Wiskunde and Informatica (CWI), Amsterdam, The Netherlands
- Wessel Kraaij
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands; Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands
- Alex Sangers
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands
- Thomas Rooijakkers
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands
- Peter Langenkamp
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands
- Bart Kamphorst
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands
- Milena Kooij-Janic
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands
3
Implementing Privacy-Preserving Genotype Analysis with Consideration for Population Stratification. CRYPTOGRAPHY 2021. [DOI: 10.3390/cryptography5030021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/16/2022]
Abstract
In bioinformatics, genome-wide association studies (GWAS) are used to detect associations between single-nucleotide polymorphisms (SNPs) and phenotypic traits such as diseases. Significant differences in SNP counts between case and control groups can signal association between variants and phenotypic traits. Most traits are affected by multiple genetic locations. To detect these subtle associations, bioinformaticians need access to more heterogeneous data. Regulatory restrictions in cross-border health data exchange have created a surge in research on privacy-preserving solutions, including secure computing techniques. However, in studies of such scale, one must account for population stratification, as under- and over-representation of sub-populations can lead to spurious associations. We improve on the state of the art of privacy-preserving GWAS methods by showing how to adapt principal component analysis (PCA) with stratification control (EIGENSTRAT), FastPCA, EMMAX and the genomic control algorithm for secure computing. We implement these methods using secure computing techniques—secure multi-party computation (MPC) and trusted execution environments (TEE). Our algorithms are the most complex ones at this scale implemented with MPC. We present performance benchmarks and a security and feasibility trade-off discussion for both techniques.
4
B A, S S. A survey on genomic data by privacy-preserving techniques perspective. Comput Biol Chem 2021; 93:107538. [PMID: 34246892 DOI: 10.1016/j.compbiolchem.2021.107538] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/23/2021] [Revised: 06/15/2021] [Accepted: 06/26/2021] [Indexed: 11/27/2022]
Abstract
Human genomics is increasingly applied to health-related problems and to achieving time- and cost-efficient healthcare. As genomics research advances, privacy protections must also develop for querying, accessing, storing and computing on genomic data. While genomic data is widely accessible, privacy issues may emerge because an untrusted third party (adversaries/researchers) may reveal information or strategy plans regarding an individual's genome data when it is requested for research purposes. To mitigate this problem, many privacy-preserving techniques that are used along with cryptographic methods are briefly discussed. Furthermore, efficiency and accuracy in secure and private genomic data computation need to be researched in future.
Affiliation(s)
- Abinaya B
- Kalaignarkarunanidhi Institute of Technology, Coimbatore, India
- Santhi S
- Kalaignarkarunanidhi Institute of Technology, Coimbatore, India
5
Wang X, Jiang X, Vaidya J. Efficient verification for outsourced genome-wide association studies. J Biomed Inform 2021; 117:103714. [PMID: 33711538 DOI: 10.1016/j.jbi.2021.103714] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/18/2019] [Revised: 02/09/2021] [Accepted: 02/10/2021] [Indexed: 11/17/2022]
Abstract
With cloud computing being widely adopted for conducting genome-wide association studies (GWAS), verifying the integrity of outsourced GWAS computation remains to be accomplished. Here, we propose two novel algorithms to generate synthetic SNPs that are indistinguishable from real SNPs. The first method creates synthetic SNPs based on the phenotype vector, while the second approach creates synthetic SNPs based on real SNPs that are most similar to the phenotype vector. The time complexity of the first approach and the second approach is O(m) and O(m log n^2), respectively, where m is the number of subjects while n is the number of SNPs. Furthermore, through a game theoretic analysis, we demonstrate that it is possible to incentivize honest behavior by the server by coupling appropriate payoffs with randomized verification. We conduct extensive experiments on our proposed methods, and the results show that beyond a formal adversarial model, when only a few synthetic SNPs are generated and mixed into the real data they cannot be distinguished from the real SNPs even by a variety of predictive machine learning models. We demonstrate that the proposed approach can ensure that logistic regression for GWAS can be outsourced in an efficient and trustworthy way.
Affiliation(s)
- Xiaoqian Jiang
- University of Texas Health Science Center at Houston, TX, USA
6
Abstract
Genome-Wide Association Studies (GWAS) identify the genomic variations that are statistically associated with a particular phenotype (e.g., a disease). The confidence in GWAS results increases with the number of genomes analyzed, which encourages federated computations where biocenters would periodically share the genomes they have sequenced. However, for economical and legal reasons, this collaboration will only happen if biocenters cannot learn each others’ data. In addition, GWAS releases should not jeopardize the privacy of the individuals whose genomes are used. We introduce DyPS, a novel framework to conduct dynamic privacy-preserving federated GWAS. DyPS leverages a Trusted Execution Environment to secure dynamic GWAS computations. Moreover, DyPS uses a scaling mechanism to speed up the releases of GWAS results according to the evolving number of genomes used in the study, even if individuals retract their participation consent. Lastly, DyPS also tolerates up to all-but-one colluding biocenters without privacy leaks. We implemented and extensively evaluated DyPS through several scenarios involving more than 6 million simulated genomes and up to 35,000 real genomes. Our evaluation shows that DyPS updates test statistics with a reasonable additional request processing delay (11% longer) compared to an approach that would update them with minimal delay but would lead to 8% of the genomes not being protected. In addition, DyPS can result in the same amount of aggregate statistics as a static release (i.e., at the end of the study), but can produce up to 2.6 times more statistics information during earlier dynamic releases. Besides, we show that DyPS can support a larger number of genomes and SNP positions without any significant performance penalty.
7
Karimi S, Jiang X, Dolin RH, Kim M, Boxwala A. A secure system for genomics clinical decision support. J Biomed Inform 2020; 112:103602. [PMID: 33080397 PMCID: PMC8577277 DOI: 10.1016/j.jbi.2020.103602] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2020] [Revised: 09/07/2020] [Accepted: 10/12/2020] [Indexed: 11/26/2022]
Abstract
We developed a prototype genomic archiving and communications system to securely store genome data and provide clinical decision support (CDS). This system operates on a client-server model. The client encrypts the data, and the server stores data and performs the computations necessary for CDS. Computations are directly performed on encrypted data, and the client decrypts results. The server cannot decrypt inputs or outputs, which provides strong guarantees of security. We have validated our system with three genomics-based CDS applications. The results demonstrate that it is possible to resolve a long-standing dilemma in genomic data privacy and accessibility, by using a principled cryptographical framework and a mathematical representation of genome data and CDS questions.
Affiliation(s)
- Xiaoqian Jiang
- UT Health School of Biomedical Informatics, Houston, TX, United States
- Miran Kim
- UT Health School of Biomedical Informatics, Houston, TX, United States
- Aziz Boxwala
- Elimu Informatics Inc., Richmond, CA, United States
8
Bonomi L, Huang Y, Ohno-Machado L. Privacy challenges and research opportunities for genomic data sharing. Nat Genet 2020; 52:646-654. [PMID: 32601475 PMCID: PMC7761157 DOI: 10.1038/s41588-020-0651-0] [Citation(s) in RCA: 70] [Impact Index Per Article: 17.5] [Received: 12/11/2019] [Accepted: 05/22/2020] [Indexed: 12/17/2022]
Abstract
The sharing of genomic data holds great promise in advancing precision medicine and providing personalized treatments and other types of interventions. However, these opportunities come with privacy concerns, and data misuse could potentially lead to privacy infringement for individuals and their blood relatives. With the rapid growth and increased availability of genomic datasets, understanding the current genome privacy landscape and identifying the challenges in developing effective privacy-protecting solutions are imperative. In this work, we provide an overview of major privacy threats identified by the research community and examine the privacy challenges in the context of emerging direct-to-consumer genetic-testing applications. We additionally present general privacy-protection techniques for genomic data sharing and their potential applications in direct-to-consumer genomic testing and forensic analyses. Finally, we discuss limitations in current privacy-protection methods, highlight possible mitigation strategies and suggest future research opportunities for advancing genomic data sharing.
Affiliation(s)
- Luca Bonomi
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Yingxiang Huang
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Lucila Ohno-Machado
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Division of Health Services Research & Development, VA San Diego Healthcare System, San Diego, La Jolla, CA, USA
9
Wang S, Bonomi L, Dai W, Chen F, Cheung C, Bloss CS, Cheng S, Jiang X. Big Data Privacy in Biomedical Research. IEEE TRANSACTIONS ON BIG DATA 2020; 6:296-308. [PMID: 32478127 PMCID: PMC7258042 DOI: 10.1109/tbdata.2016.2608848] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 06/11/2023]
Abstract
Biomedical research often involves studying patient data that contain personal information. Inappropriate use of these data might lead to leakage of sensitive information, which can put patient privacy at risk. The problem of preserving patient privacy has received increasing attention in the era of big data. Many privacy methods have been developed to protect against various attack models. This paper reviews relevant topics in the context of biomedical research. We discuss privacy-preserving technologies related to (1) record linkage, (2) synthetic data generation, and (3) genomic data privacy. We also discuss the ethical implications of big data privacy in biomedicine and present challenges in future research directions for improving data privacy in biomedical research.
Affiliation(s)
- Shuang Wang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093
- Luca Bonomi
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093
- Wenrui Dai
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093
- Feng Chen
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093
- Cynthia Cheung
- Department of Psychiatry, University of California San Diego, La Jolla, CA, 92093
- Cinnamon S Bloss
- Department of Psychiatry, University of California San Diego, La Jolla, CA, 92093
- Samuel Cheng
- School of Electrical and Computer Engineering, University of Oklahoma, Tulsa, OK, 74135
- Xiaoqian Jiang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093
10
Aziz MMA, Sadat MN, Alhadidi D, Wang S, Jiang X, Brown CL, Mohammed N. Privacy-preserving techniques of genomic data-a survey. Brief Bioinform 2019; 20:887-895. [PMID: 29121240 PMCID: PMC6585383 DOI: 10.1093/bib/bbx139] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 06/23/2017] [Revised: 09/30/2017] [Indexed: 01/10/2023] Open
Abstract
Genomic data hold salient information about the characteristics of a living organism. Throughout the past decade, pinnacle developments have given us more accurate and inexpensive methods to retrieve genome sequences of humans. However, with the advancement of genomic research, there is a growing privacy concern regarding the collection, storage and analysis of such sensitive human data. Recent results show that given some background information, it is possible for an adversary to reidentify an individual from a specific genomic data set. This can reveal the current association or future susceptibility of some diseases for that individual (and sometimes the kinship between individuals) resulting in a privacy violation. Regardless of these risks, our genomic data hold much importance in analyzing the well-being of us and the future generation. Thus, in this article, we discuss the different privacy and security-related problems revolving around human genomic data. In addition, we will explore some of the cardinal cryptographic concepts, which can bring efficacy in secure and private genomic data computation. This article will relate the gaps between these two research areas-Cryptography and Genomics.
Affiliation(s)
- Md Momin Al Aziz
- Department of Computer Science at the University of Manitoba, Winnipeg, Canada
- Md Nazmus Sadat
- Department of Computer Science at the University of Manitoba, Winnipeg, Canada
- Dima Alhadidi
- Faculty of Computer Science at the University of New Brunswick, Fredericton, Canada
- Shuang Wang
- Department of Biomedical Informatics at the University of California in San Diego, La Jolla, CA, USA
- Xiaoqian Jiang
- Department of Biomedical Informatics at the University of California in San Diego, La Jolla, CA, USA
- Cheryl L Brown
- Department of Political Science and Public Administration at the University of North Carolina at Charlotte, NC, USA
- Noman Mohammed
- Department of Computer Science at the University of Manitoba, Winnipeg, Canada
11
Park S, Kim M, Seo S, Hong S, Han K, Lee K, Cheon JH, Kim S. A secure SNP panel scheme using homomorphically encrypted K-mers without SNP calling on the user side. BMC Genomics 2019; 20:188. [PMID: 30967116 PMCID: PMC6456943 DOI: 10.1186/s12864-019-5473-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/30/2023] Open
Abstract
BACKGROUND: Single Nucleotide Polymorphism (SNP) in the genome has become crucial information for clinical use. For example, targeted cancer therapy is primarily based on information about which clinically important SNPs are detectable in the tumor. Many hospitals have developed their own panels that include clinically important SNPs. The genome information exchange between the patient and the hospital has become more popular. However, the genome sequence information is innate and irreversible and thus its leakage has serious consequences. Therefore, protecting one's genome information is critical. On the other side, hospitals may need to protect their own panels. There is no known secure SNP panel scheme to protect both. RESULTS: In this paper, we propose a secure SNP panel scheme using homomorphically encrypted K-mers without requiring SNP calling on the user side and without revealing the panel information to the user. Use of the powerful homomorphic encryption technique is desirable, but there is no known algorithm to efficiently align two homomorphically encrypted sequences. Thus, we designed and implemented a novel secure SNP panel scheme utilizing the computationally feasible equality test on two homomorphically encrypted K-mers. To make the scheme work correctly, in addition to SNPs in the panel, sequence variations at the population level should be addressed. We designed a concept of Point Deviation Tolerance (PDT) level to address the false positives and false negatives. Using the TCGA BRCA dataset, we demonstrated that our scheme works at the level of over a hundred thousand somatic mutations. In addition, we provide a computational guideline for the panel design, including the size of K-mer and the number of SNPs. CONCLUSIONS: The proposed method is the first of its kind to protect both the user's sequence and the hospital's panel information using the powerful homomorphic encryption scheme. We demonstrated that the scheme works with a simulated dataset and the TCGA BRCA dataset. In this study, we have shown only the feasibility of the proposed scheme, and much more effort is needed to make the scheme usable in clinical practice.
Affiliation(s)
- Sungjoon Park
- Department of Computer Science and Engineering, Seoul National University, Seoul, Republic of Korea
- Minsu Kim
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Seungwan Hong
- Department of Mathematical Sciences, Seoul National University, Seoul, Republic of Korea
- Kyoohyung Han
- Department of Mathematical Sciences, Seoul National University, Seoul, Republic of Korea
- Keewoo Lee
- Department of Mathematical Sciences, Seoul National University, Seoul, Republic of Korea
- Jung Hee Cheon
- Department of Mathematical Sciences, Seoul National University, Seoul, Republic of Korea
- Sun Kim
- Department of Computer Science and Engineering, Seoul National University, Seoul, Republic of Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea; Bioinformatics Institute, Seoul National University, Seoul, Republic of Korea
12
Sadat MN, Aziz MMA, Mohammed N, Chen F, Jiang X, Wang S. SAFETY: Secure gwAs in Federated Environment through a hYbrid Solution. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:93-102. [PMID: 29993695 PMCID: PMC6411680 DOI: 10.1109/tcbb.2018.2829760] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 05/04/2023]
Abstract
Recent studies demonstrate that effective healthcare can benefit from using the human genomic information. Consequently, many institutions are using statistical analysis of genomic data, which are mostly based on genome-wide association studies (GWAS). GWAS analyze genome sequence variations in order to identify genetic risk factors for diseases. These studies often require pooling data from different sources together in order to unravel statistical patterns, and relationships between genetic variants and diseases. Here, the primary challenge is to fulfill one major objective: accessing multiple genomic data repositories for collaborative research in a privacy-preserving manner. Due to the privacy concerns regarding the genomic data, multi-jurisdictional laws and policies of cross-border genomic data sharing are enforced among different countries. In this article, we present SAFETY, a hybrid framework, which can securely perform GWAS on federated genomic datasets using homomorphic encryption and recently introduced secure hardware component of Intel Software Guard Extensions to ensure high efficiency and privacy at the same time. Different experimental settings show the efficacy and applicability of such hybrid framework in secure conduction of GWAS. To the best of our knowledge, this hybrid use of homomorphic encryption along with Intel SGX is not proposed to this date. SAFETY is up to 4.82 times faster than the best existing secure computation technique.
Affiliation(s)
- Md Nazmus Sadat
- Department of Computer Science, University of Manitoba, Winnipeg, MB, R3T 2N2, Canada
- Md Momin Al Aziz
- Department of Computer Science, University of Manitoba, Winnipeg, MB, R3T 2N2, Canada
- Noman Mohammed
- Department of Computer Science, University of Manitoba, Winnipeg, MB, R3T 2N2, Canada
- Feng Chen
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA
- Xiaoqian Jiang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA
- Shuang Wang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA
13
Systematizing Genome Privacy Research: A Privacy-Enhancing Technologies Perspective. PROCEEDINGS ON PRIVACY ENHANCING TECHNOLOGIES 2018. [DOI: 10.2478/popets-2019-0006] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Indexed: 11/20/2022]
Abstract
Rapid advances in human genomics are enabling researchers to gain a better understanding of the role of the genome in our health and well-being, stimulating hope for more effective and cost efficient healthcare. However, this also prompts a number of security and privacy concerns stemming from the distinctive characteristics of genomic data. To address them, a new research community has emerged and produced a large number of publications and initiatives. In this paper, we rely on a structured methodology to contextualize and provide a critical analysis of the current knowledge on privacy-enhancing technologies used for testing, storing, and sharing genomic data, using a representative sample of the work published in the past decade. We identify and discuss limitations, technical challenges, and issues faced by the community, focusing in particular on those that are inherently tied to the nature of the problem and are harder for the community alone to address. Finally, we report on the importance and difficulty of the identified challenges based on an online survey of genome data privacy experts.
14
Bonte C, Makri E, Ardeshirdavani A, Simm J, Moreau Y, Vercauteren F. Towards practical privacy-preserving genome-wide association study. BMC Bioinformatics 2018; 19:537. [PMID: 30572817 PMCID: PMC6302495 DOI: 10.1186/s12859-018-2541-3] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 04/04/2018] [Accepted: 11/22/2018] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND: The deployment of Genome-wide association studies (GWASs) requires genomic information of a large population to produce reliable results. This raises significant privacy concerns, making people hesitate to contribute their genetic information to such studies. RESULTS: We propose two provably secure solutions to address this challenge: (1) a somewhat homomorphic encryption (HE) approach, and (2) a secure multiparty computation (MPC) approach. Unlike previous work, our approach does not rely on adding noise to the input data, nor does it reveal any information about the patients. Our protocols aim to prevent data breaches by calculating the χ² statistic in a privacy-preserving manner, without revealing any information other than whether the statistic is significant or not. Specifically, our protocols compute the χ² statistic, but only return a yes/no answer, indicating significance. By not revealing the statistic value itself but only the significance, our approach thwarts attacks exploiting statistic values. We significantly increased the efficiency of our HE protocols by introducing a new masking technique to perform the secure comparison that is necessary for determining significance. CONCLUSIONS: We show that full-scale privacy-preserving GWAS is practical, as long as the statistics can be computed by low degree polynomials. Our implementations demonstrated that both approaches are efficient. The secure multiparty computation technique completes its execution in approximately 2 ms for data contributed by one million subjects.
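To make the "yes/no significance" idea concrete, the following plaintext Python sketch computes the Pearson χ² statistic for a 2x2 allele-count table and decides significance without any division, using only additions, multiplications and one comparison. It is not the HE or MPC protocol from the paper and provides no privacy by itself. The function names, the example counts and the 3.841 threshold (the standard 0.05 critical value for one degree of freedom) are illustrative assumptions.

```python
def chi2_parts(a, b, c, d):
    """Numerator and denominator of the chi-square statistic for [[a, b], [c, d]].

    chi2 = n * (a*d - b*c)**2 / ((a+b) * (c+d) * (a+c) * (b+d)), with n = a+b+c+d.
    Both parts are low-degree polynomials in the four counts.
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator, denominator

def is_significant(a, b, c, d, threshold_num=3841, threshold_den=1000):
    """Return True if chi2 >= threshold_num / threshold_den (default 3.841).

    The test is rearranged to avoid division:
    chi2 >= t  <=>  numerator * t_den >= t_num * denominator,
    which assumes all row and column totals are non-zero.
    """
    numerator, denominator = chi2_parts(a, b, c, d)
    return numerator * threshold_den >= threshold_num * denominator

# Hypothetical counts: alternate/reference alleles in cases and controls.
case_alt, case_ref = 30, 70
control_alt, control_ref = 50, 50

num, den = chi2_parts(case_alt, case_ref, control_alt, control_ref)
chi2_value = num / den  # shown only for inspection; a private protocol would hide it
significant = is_significant(case_alt, case_ref, control_alt, control_ref)
```

Only `significant` corresponds to the single bit that the abstract says is revealed. The division-free form is one way to see why a statistic built from low-degree polynomials suits secure computation, where division is comparatively expensive.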
Affiliation(s)
- Charlotte Bonte
- imec-COSIC, Department of Electrical Engineering, KU Leuven, Leuven, Belgium
- Eleftheria Makri
- imec-COSIC, Department of Electrical Engineering, KU Leuven, Leuven, Belgium
- ABRR, Saxion University of Applied Sciences, Enschede, The Netherlands
15
Jiang Y, Wang C, Wu Z, Du X, Wang S. Privacy-preserving biomedical data dissemination via a hybrid approach. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2018; 2018:1176-1185. [PMID: 30815160 PMCID: PMC6371369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Sharing medical data can benefit many aspects of biomedical research studies. However, medical data usually contain sensitive patient information, which cannot be shared directly. Summary statistics, such as histograms, are widely used in medical research and serve as a sanitized synopsis of a raw health dataset such as Electronic Health Records (EHR). Such a synopsized representation is then used to support advanced operations over the health dataset, such as counting queries and learning-based tasks. Meanwhile, privacy is becoming an increasingly important issue for generating and publishing histograms based on health data. Previous solutions show promise in securely generating histograms via differential privacy; however, such methods only consider a centralized solution, and accuracy is still a limitation for real-world applications. In this paper, we propose a novel hybrid solution that combines two rigorous theoretical models (homomorphic encryption and differential privacy) for securely generating synthetic V-optimal histograms over distributed datasets. Our results demonstrate an accuracy improvement over previous studies on real medical datasets.
Affiliation(s)
- Yichen Jiang
- Dept. of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Chenghong Wang
- Dept. of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Zhixuan Wu
- Dept. of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Dept. of Computer Science, Syracuse University, Syracuse, NY, USA (both authors contributed equally to this paper)
- Xin Du
- Dept. of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Dept. of Computer Science, Syracuse University, Syracuse, NY, USA (both authors contributed equally to this paper)
- Shuang Wang
- Dept. of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
|
16
|
Azencott CA. Machine learning and genomics: precision medicine versus patient privacy. PHILOSOPHICAL TRANSACTIONS. SERIES A, MATHEMATICAL, PHYSICAL, AND ENGINEERING SCIENCES 2018; 376:rsta.2017.0350. [PMID: 30082298 DOI: 10.1098/rsta.2017.0350] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Accepted: 06/07/2018] [Indexed: 06/08/2023]
Abstract
Machine learning can have a major societal impact in computational biology applications. In particular, it plays a central role in the development of precision medicine, whereby treatment is tailored to the clinical or genetic features of the patient. However, these advances require collecting and sharing among researchers large amounts of genomic data, which generates much concern about privacy. Researchers, study participants and governing bodies should be aware of the ways in which the privacy of participants might be compromised, as well as of the large body of research on technical solutions to these issues. We review how breaches in patient privacy can occur, present recent developments in computational data protection and discuss how they can be combined with legal and ethical perspectives to provide secure frameworks for genomic data sharing. This article is part of a discussion meeting issue 'The growing ubiquity of algorithms in society: implications, impacts and innovations'.
Affiliation(s)
- C-A Azencott
- MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, 75006 Paris, France
- Institut Curie, PSL Research University, 75005 Paris, France
- INSERM, U900, 75005 Paris, France
|
17
|
Wang M, Ji Z, Wang S, Kim J, Yang H, Jiang X, Ohno-Machado L. Mechanisms to protect the privacy of families when using the transmission disequilibrium test in genome-wide association studies. Bioinformatics 2017; 33:3716-3725. [PMID: 29036461 PMCID: PMC5860319 DOI: 10.1093/bioinformatics/btx470] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2016] [Revised: 05/29/2017] [Accepted: 07/20/2017] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Inappropriate disclosure of human genomes may put the privacy of study subjects and of their family members at risk. Existing privacy-preserving mechanisms for Genome-Wide Association Studies (GWAS) mainly focus on protecting individual information in case-control studies. Protecting privacy in family-based studies is more difficult. The transmission disequilibrium test (TDT) is a powerful family-based association test employed in many rare disease studies. It gathers information about families (most frequently involving parents, affected children and their siblings). It is important to develop privacy-preserving approaches to disclose TDT statistics with a guarantee that the risk of family 're-identification' stays below a pre-specified risk threshold. 'Re-identification' in this context means that an attacker can infer the presence of a family in a study. METHODS In the context of protecting family-level privacy, we developed and evaluated a suite of differentially private (DP) mechanisms for TDT. They include Laplace mechanisms based on the TDT test statistic, P-values, projected P-values and exponential mechanisms based on the TDT test statistic and the shortest Hamming distance (SHD) score. RESULTS Using simulation studies with a small cohort and a large one, we showed that the exponential mechanism based on the SHD score preserves the highest utility and privacy among all proposed DP methods. We provide a guideline on applying our DP TDT in a real dataset in analyzing Kawasaki disease with 187 families and 906 SNPs. There are some limitations, including: (1) the performance of our implementation is slow for real-time results generation and (2) handling missing data is still challenging. AVAILABILITY AND IMPLEMENTATION The software dpTDT is available in https://github.com/mwgrassgreen/dpTDT. CONTACT mengw1@stanford.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
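For readers unfamiliar with the Laplace mechanism named in the abstract above, here is a minimal generic sketch. It is not the dpTDT implementation: the function name, the sensitivity and epsilon values, and the choice to sample Laplace noise as the difference of two exponential draws are assumptions made for illustration only.

```python
import random


def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float,
                      rng: random.Random) -> float:
    """Release true_value plus Laplace noise with scale sensitivity / epsilon.

    sensitivity is the largest change in the statistic caused by adding or
    removing one record (here, hypothetically, one family); epsilon is the
    privacy budget. Both must be supplied by the analyst.
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise


if __name__ == "__main__":
    rng = random.Random(2017)  # fixed seed only to make the demo repeatable
    print(laplace_mechanism(12.5, sensitivity=1.0, epsilon=0.5, rng=rng))
```

A smaller epsilon gives a larger noise scale and therefore stronger privacy at the cost of utility, which is the trade-off the abstract's utility comparison is about.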
Affiliation(s)
- Meng Wang
- Department of Genetics, Stanford University, Stanford, CA, USA
- Zhanglong Ji
- Department of Biomedical Informatics, UC San Diego, La Jolla, CA, USA
- Shuang Wang
- Department of Biomedical Informatics, UC San Diego, La Jolla, CA, USA
- Jihoon Kim
- Department of Biomedical Informatics, UC San Diego, La Jolla, CA, USA
- Hai Yang
- Department of Biomedical Informatics, UC San Diego, La Jolla, CA, USA
- Xiaoqian Jiang
- Department of Biomedical Informatics, UC San Diego, La Jolla, CA, USA
|
18
|
Wang S, Jiang X, Tang H, Wang X, Bu D, Carey K, Dyke SO, Fox D, Jiang C, Lauter K, Malin B, Sofia H, Telenti A, Wang L, Wang W, Ohno-Machado L. A community effort to protect genomic data sharing, collaboration and outsourcing. NPJ Genom Med 2017; 2:33. [PMID: 29263842 PMCID: PMC5677972 DOI: 10.1038/s41525-017-0036-1] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2017] [Revised: 07/10/2017] [Accepted: 10/10/2017] [Indexed: 12/13/2022] Open
Abstract
The human genome can reveal sensitive information and is potentially re-identifiable, which raises privacy and security concerns about sharing such data on wide scales. In 2016, we organized the third Critical Assessment of Data Privacy and Protection competition as a community effort to bring together biomedical informaticists, computer privacy and security researchers, and scholars in ethical, legal, and social implications (ELSI) to assess the latest advances on privacy-preserving techniques for protecting human genomic data. Teams were asked to develop novel protection methods for emerging genome privacy challenges in three scenarios: Track (1) data sharing through the Beacon service of the Global Alliance for Genomics and Health; Track (2) collaborative discovery of similar genomes between two institutions; and Track (3) data outsourcing to public cloud services. The latter two tracks represent continuing themes from our 2015 competition, while the first was new and a response to a recently established vulnerability. The winning strategy for Track 1 mitigated the privacy risk by hiding approximately 11% of the variation in the database while permitting around 160,000 queries, a significant improvement over the baseline. The winning strategies in Tracks 2 and 3 showed significant progress over the previous competition by achieving multiple orders of magnitude performance improvement in terms of computational runtime and memory requirements. The outcomes suggest that applying highly optimized privacy-preserving and secure computation techniques to safeguard genomic data sharing and analysis is useful. However, the results also indicate that further efforts are needed to refine these techniques into practical solutions.
Affiliation(s)
- Shuang Wang
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093 USA
- Xiaoqian Jiang
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093 USA
- Haixu Tang
- Computer Science and Informatics, Indiana University, Bloomington, IN 47408 USA
- Xiaofeng Wang
- Computer Science and Informatics, Indiana University, Bloomington, IN 47408 USA
- Diyue Bu
- Computer Science and Informatics, Indiana University, Bloomington, IN 47408 USA
- Knox Carey
- GeneCloud, Intertrust, Sunnyvale, CA 94085 USA
- Stephanie Om Dyke
- Centre of Genomics and Policy, Department of Human Genetics, McGill University, Montreal, QC H3A 0G4 Canada
- Dov Fox
- School of Law, University of San Diego, San Diego, CA 92110 USA
- Chao Jiang
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093 USA
- Kristin Lauter
- Cryptography Group, Microsoft Research, San Diego, CA 92122 USA
- Bradley Malin
- Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN 37203 USA
- Heidi Sofia
- National Human Genome Research Institute, Rockville, MD 20894 USA
- Lei Wang
- Computer Science and Informatics, Indiana University, Bloomington, IN 47408 USA
- Wenhao Wang
- Computer Science and Informatics, Indiana University, Bloomington, IN 47408 USA
- Lucila Ohno-Machado
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093 USA
|
19
|
Ziegeldorf JH, Pennekamp J, Hellmanns D, Schwinger F, Kunze I, Henze M, Hiller J, Matzutt R, Wehrle K. BLOOM: BLoom filter based oblivious outsourced matchings. BMC Med Genomics 2017; 10:44. [PMID: 28786361 PMCID: PMC5547447 DOI: 10.1186/s12920-017-0277-y] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Whole genome sequencing has become fast, accurate, and cheap, paving the way towards the large-scale collection and processing of human genome data. Unfortunately, this dawning genome era does not only promise tremendous advances in biomedical research but also causes unprecedented privacy risks for the many. Handling storage and processing of large genome datasets through cloud services greatly aggravates these concerns. Current research efforts thus investigate the use of strong cryptographic methods and protocols to implement privacy-preserving genomic computations. METHODS We propose FHE-BLOOM and PHE-BLOOM, two efficient approaches for genetic disease testing using homomorphically encrypted Bloom filters. Both approaches allow the data owner to securely outsource storage and computation to an untrusted cloud. FHE-BLOOM is fully secure in the semi-honest model while PHE-BLOOM slightly relaxes security guarantees in a trade-off for highly improved performance. RESULTS We implement and evaluate both approaches on a large dataset of up to 50 patient genomes each with up to 1000000 variations (single nucleotide polymorphisms). For both implementations, overheads scale linearly in the number of patients and variations, while PHE-BLOOM is faster by at least three orders of magnitude. For example, testing disease susceptibility of 50 patients with 100000 variations requires only a total of 308.31 s (σ=8.73 s) with our first approach and a mere 0.07 s (σ=0.00 s) with the second. We additionally discuss security guarantees of both approaches and their limitations as well as possible extensions towards more complex query types, e.g., fuzzy or range queries. CONCLUSIONS Both approaches handle practical problem sizes efficiently and are easily parallelized to scale with the elastic resources available in the cloud. 
The fully homomorphic scheme, FHE-BLOOM, realizes a comprehensive outsourcing to the cloud, while the partially homomorphic scheme, PHE-BLOOM, trades a slight relaxation of security guarantees against performance improvements by at least three orders of magnitude.
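The data structure underlying both BLOOM variants is a Bloom filter. The sketch below shows a plaintext Bloom filter over variant identifiers, with no homomorphic encryption layer at all, so it illustrates only the membership-test idea and not the paper's protocols. The class name, the filter size, the number of hash functions, the salted SHA-256 hashing, and the variant strings are all assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Plain (unencrypted) Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # Reported present only if every derived position is set.
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    patient = BloomFilter()
    # Hypothetical variant identifiers for one patient.
    for variant in ["chr1:12345:A>G", "chr7:55555:C>T", "chrX:999:G>A"]:
        patient.add(variant)
    print("chr7:55555:C>T" in patient)  # an inserted variant is always found
```

Because a membership test reduces to reading a fixed number of bit positions, the per-query work is small and independent of how many variants were inserted, which is the property that makes the structure attractive for outsourced matching.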
Affiliation(s)
- Jan Henrik Ziegeldorf
- Communication and Distributed Systems (COMSYS), RWTH Aachen University, Ahornstrasse 55, Aachen, 52074 Germany
- Jan Pennekamp
- Communication and Distributed Systems (COMSYS), RWTH Aachen University, Ahornstrasse 55, Aachen, 52074 Germany
- David Hellmanns
- Communication and Distributed Systems (COMSYS), RWTH Aachen University, Ahornstrasse 55, Aachen, 52074 Germany
- Felix Schwinger
- Communication and Distributed Systems (COMSYS), RWTH Aachen University, Ahornstrasse 55, Aachen, 52074 Germany
- Ike Kunze
- Communication and Distributed Systems (COMSYS), RWTH Aachen University, Ahornstrasse 55, Aachen, 52074 Germany
- Martin Henze
- Communication and Distributed Systems (COMSYS), RWTH Aachen University, Ahornstrasse 55, Aachen, 52074 Germany
- Jens Hiller
- Communication and Distributed Systems (COMSYS), RWTH Aachen University, Ahornstrasse 55, Aachen, 52074 Germany
- Roman Matzutt
- Communication and Distributed Systems (COMSYS), RWTH Aachen University, Ahornstrasse 55, Aachen, 52074 Germany
- Klaus Wehrle
- Communication and Distributed Systems (COMSYS), RWTH Aachen University, Ahornstrasse 55, Aachen, 52074 Germany
|
20
|
Chen F, Wang C, Dai W, Jiang X, Mohammed N, Al Aziz MM, Sadat MN, Sahinalp C, Lauter K, Wang S. PRESAGE: PRivacy-preserving gEnetic testing via SoftwAre Guard Extension. BMC Med Genomics 2017; 10:48. [PMID: 28786365 PMCID: PMC5547453 DOI: 10.1186/s12920-017-0281-2] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/04/2022] Open
Abstract
Background Advances in DNA sequencing technologies have prompted a wide range of genomic applications to improve healthcare and facilitate biomedical research. However, privacy and security concerns have emerged as a challenge for utilizing cloud computing to handle sensitive genomic data. Methods We present one of the first implementations of a Software Guard Extension (SGX) based framework for securely outsourced genetic testing, which leverages multiple cryptographic protocols and a minimal perfect hash scheme to enable efficient and secure data storage and computation outsourcing. Results We compared the performance of the proposed PRESAGE framework with the state-of-the-art homomorphic encryption scheme, as well as the plaintext implementation. The experimental results demonstrated a significant performance improvement over the homomorphic encryption methods and a small computational overhead in comparison to the plaintext implementation. Conclusions The proposed PRESAGE provides an alternative solution for secure and efficient genomic data outsourcing in an untrusted cloud by using a hybrid framework that combines secure hardware and multiple crypto protocols.
Affiliation(s)
- Feng Chen
- Department of Biomedical Informatics, University of California San Diego, La Jolla, 92093, CA, USA
- Chenghong Wang
- Department of Computer Science, Syracuse University, Syracuse, 13244, NY, USA
- Wenrui Dai
- Department of Biomedical Informatics, University of California San Diego, La Jolla, 92093, CA, USA
- Xiaoqian Jiang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, 92093, CA, USA
- Noman Mohammed
- Department of Computer Science, University of Manitoba, Winnipeg, R3T 2N2, MB, Canada
- Md Momin Al Aziz
- Department of Computer Science, University of Manitoba, Winnipeg, R3T 2N2, MB, Canada
- Md Nazmus Sadat
- Department of Computer Science, University of Manitoba, Winnipeg, R3T 2N2, MB, Canada
- Cenk Sahinalp
- Department of Computer Science and Informatics, Indiana University, Bloomington, 47408, IN, USA
- Kristin Lauter
- Cryptography Group, Microsoft Research, San Diego, 92122, CA, USA
- Shuang Wang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, 92093, CA, USA
|
21
|
Takai-Igarashi T, Kinoshita K, Nagasaki M, Ogishima S, Nakamura N, Nagase S, Nagaie S, Saito T, Nagami F, Minegishi N, Suzuki Y, Suzuki K, Hashizume H, Kuriyama S, Hozawa A, Yaegashi N, Kure S, Tamiya G, Kawaguchi Y, Tanaka H, Yamamoto M. Security controls in an integrated Biobank to protect privacy in data sharing: rationale and study design. BMC Med Inform Decis Mak 2017; 17:100. [PMID: 28683736 PMCID: PMC5501115 DOI: 10.1186/s12911-017-0494-5] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2016] [Accepted: 06/27/2017] [Indexed: 01/08/2023] Open
Abstract
Background With the goal of realizing genome-based personalized healthcare, we have developed a biobank that integrates personal health, genome, and omics data along with biospecimens donated by 150,000 volunteers. Such large-scale data integration involves obvious risks of privacy violation. The research use of personal genome and health information is a topic of global discussion with regard to the protection of privacy while promoting scientific advancement. The present paper reports on our plans, current attempts, and accomplishments in addressing security problems involved in data sharing to ensure donor privacy while promoting scientific advancement. Methods Biospecimens and data have been collected in prospective cohort studies with comprehensive agreement. The sample size of 150,000 participants was required for multiple research studies, including genome-wide screening of gene-by-environment interactions, haplotype phasing, and parametric linkage analysis. Results We established the Tohoku Medical Megabank (TMM) data sharing policy: a privacy protection rule that requires physical, personnel, and technological safeguards against privacy violation regarding the use and sharing of data. The proposed policy refers to that of NCBI and that of the Sanger Institute. The proposed policy classifies shared data according to the strength of re-identification risks. Local committees organized by TMM evaluate re-identification risk and assign a security category to a dataset. Every dataset is stored in an assigned segment of a supercomputer in accordance with its security category. A security manager should be designated to handle all security problems at individual data use locations. The proposed policy requires closed networks and IP-VPN remote connections. Conclusion The mission of the biobank is to distribute biological resources most productively. This mission motivated us to collect biospecimens and health data and simultaneously analyze genome/omics data in-house. The biobank also has the mission of improving the quality and quantity of the contents of the biobank. This motivated us to request users to share the results of their research as feedback to the biobank. The TMM data sharing policy has tackled every security problem originating with the missions. We believe our current implementation to be the best way to protect privacy in data sharing. Electronic supplementary material The online version of this article (doi:10.1186/s12911-017-0494-5) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Takako Takai-Igarashi
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan
- Kengo Kinoshita
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan; Graduate School of Information Sciences, Tohoku University, Sendai, Japan
- Masao Nagasaki
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan
- Soichi Ogishima
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan
- Naoki Nakamura
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan; Graduate School of Medicine, Tohoku University, Sendai, Japan; Tohoku University Hospital, Tohoku University, Sendai, Japan
- Sachiko Nagase
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan; Graduate School of Medicine, Tohoku University, Sendai, Japan
- Satoshi Nagaie
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan
- Tomo Saito
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan
- Fuji Nagami
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan; Graduate School of Medicine, Tohoku University, Sendai, Japan; Tohoku University Hospital, Tohoku University, Sendai, Japan
- Naoko Minegishi
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan; Graduate School of Medicine, Tohoku University, Sendai, Japan
- Yoichi Suzuki
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan; Graduate School of Medicine, Tohoku University, Sendai, Japan
- Kichiya Suzuki
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan; Graduate School of Medicine, Tohoku University, Sendai, Japan; Tohoku University Hospital, Tohoku University, Sendai, Japan
- Hiroaki Hashizume
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan
- Shinichi Kuriyama
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan; Graduate School of Medicine, Tohoku University, Sendai, Japan; International Research Institute of Disaster Science, Tohoku University, Sendai, Japan
- Atsushi Hozawa
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan; Graduate School of Medicine, Tohoku University, Sendai, Japan
- Nobuo Yaegashi
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan; Graduate School of Medicine, Tohoku University, Sendai, Japan; Tohoku University Hospital, Tohoku University, Sendai, Japan
- Shigeo Kure
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan; Graduate School of Medicine, Tohoku University, Sendai, Japan; Tohoku University Hospital, Tohoku University, Sendai, Japan
- Gen Tamiya
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan
- Yoshio Kawaguchi
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan
- Hiroshi Tanaka
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan
- Masayuki Yamamoto
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan; Graduate School of Medicine, Tohoku University, Sendai, Japan
|
22
|
Humbert M, Ayday E, Hubaux JP, Telenti A. Quantifying Interdependent Risks in Genomic Privacy. ACM TRANSACTIONS ON PRIVACY AND SECURITY 2017. [DOI: 10.1145/3035538] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
Abstract
The rapid progress in human-genome sequencing is leading to a high availability of genomic data. These data are notoriously sensitive and stable in time, and highly correlated among relatives. In this article, we study the implications of these familial correlations on kin genomic privacy. We formalize the problem and detail efficient reconstruction attacks based on graphical models and belief propagation. With our approach, an attacker can infer the genomes of the relatives of an individual whose genome or phenotype is observed by notably relying on Mendel’s Laws, statistical relationships between the genomic variants, and between the genome and the phenotype. We evaluate the effect of these dependencies on privacy with respect to the amount of observed variants and the relatives sharing them. We also study how the algorithmic performance evolves when we take these various relationships into account. Furthermore, to quantify the level of genomic privacy as a result of the proposed inference attack, we discuss possible definitions of genomic privacy metrics, and compare their values and evolution. Genomic data reveals Mendelian disorders and the likelihood of developing severe diseases, such as Alzheimer’s. We also introduce the quantification of health privacy, specifically, the measure of how well the predisposition to a disease is concealed from an attacker. We evaluate our approach on actual genomic data from a pedigree and show the threat extent by combining data gathered from a genome-sharing website as well as an online social network.
|
23
|
Wang S, Jiang X, Singh S, Marmor R, Bonomi L, Fox D, Dow M, Ohno-Machado L. Genome privacy: challenges, technical approaches to mitigate risk, and ethical considerations in the United States. Ann N Y Acad Sci 2017; 1387:73-83. [PMID: 27681358 PMCID: PMC5266631 DOI: 10.1111/nyas.13259] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2016] [Revised: 08/18/2016] [Accepted: 08/22/2016] [Indexed: 12/28/2022]
Abstract
Accessing and integrating human genomic data with phenotypes are important for biomedical research. Making genomic data accessible for research purposes, however, must be handled carefully to avoid leakage of sensitive individual information to unauthorized parties and improper use of data. In this article, we focus on data sharing within the scope of data accessibility for research. Current common practices to gain biomedical data access are strictly rule based, without a clear and quantitative measurement of the risk of privacy breaches. In addition, several types of studies require privacy-preserving linkage of genotype and phenotype information across different locations (e.g., genotypes stored in a sequencing facility and phenotypes stored in an electronic health record) to accelerate discoveries. The computer science community has developed a spectrum of techniques for data privacy and confidentiality protection, many of which have yet to be tested on real-world problems. In this article, we discuss clinical, technical, and ethical aspects of genome data privacy and confidentiality in the United States, as well as potential solutions for privacy-preserving genotype-phenotype linkage in biomedical research.
Affiliation(s)
- Shuang Wang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California
- Xiaoqian Jiang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California
- Siddharth Singh
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California
- Rebecca Marmor
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California
- Luca Bonomi
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California
- Dov Fox
- School of Law, University of San Diego, San Diego, California
- Michelle Dow
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California
- Lucila Ohno-Machado
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California
|
24
|
Tang H, Jiang X, Wang X, Wang S, Sofia H, Fox D, Lauter K, Malin B, Telenti A, Xiong L, Ohno-Machado L. Protecting genomic data analytics in the cloud: state of the art and opportunities. BMC Med Genomics 2016; 9:63. [PMID: 27733153 PMCID: PMC5062944 DOI: 10.1186/s12920-016-0224-3] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2016] [Accepted: 09/28/2016] [Indexed: 11/17/2022] Open
Abstract
The outsourcing of genomic data into public cloud computing settings raises concerns over privacy and security. Significant advancements in secure computation methods have emerged over the past several years, but such techniques need to be rigorously evaluated for their ability to support the analysis of human genomic data in an efficient and cost-effective manner. With respect to public cloud environments, there are concerns about the inadvertent exposure of human genomic data to unauthorized users. In analyses involving multiple institutions, there is additional concern about data being used beyond the agreed research scope and being processed in untrusted computational environments, which may not satisfy institutional policies. To systematically investigate these issues, the NIH-funded National Center for Biomedical Computing iDASH (integrating Data for Analysis, 'anonymization' and SHaring) hosted the second Critical Assessment of Data Privacy and Protection competition to assess the capacity of cryptographic technologies for protecting computation over human genomes in the cloud and promoting cross-institutional collaboration. Data scientists were challenged to design and engineer practical algorithms for secure outsourcing of genome computation tasks in working software, whereby analyses are performed only on encrypted data. They were also challenged to develop approaches to enable secure collaboration on data from genomic studies generated by multiple organizations (e.g., medical centers) to jointly compute aggregate statistics without sharing individual-level records. The results of the competition indicated that secure computation techniques can enable comparative analysis of human genomes, but greater efficiency (in terms of compute time and memory utilization) is needed before they are sufficiently practical for real-world environments.
Affiliation(s)
- Haixu Tang
- School of Informatics and Computing, Indiana University, Bloomington, IN, USA
- Xiaoqian Jiang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Xiaofeng Wang
- School of Informatics and Computing, Indiana University, Bloomington, IN, USA
- Shuang Wang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Heidi Sofia
- National Human Genome Research Institute, Rockville, MD, USA
- Dov Fox
- School of Law, University of San Diego, San Diego, CA, USA
- Bradley Malin
- Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN, USA
- Li Xiong
- Department of Mathematics and Computer Science, Emory University, Atlanta, GA, USA
- Lucila Ohno-Machado
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
|
25
|
Shi H, Jiang C, Dai W, Jiang X, Tang Y, Ohno-Machado L, Wang S. Secure Multi-pArty Computation Grid LOgistic REgression (SMAC-GLORE). BMC Med Inform Decis Mak 2016; 16 Suppl 3:89. [PMID: 27454168 PMCID: PMC4959358 DOI: 10.1186/s12911-016-0316-1] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background In biomedical research, data sharing and information exchange are very important for improving quality of care, accelerating discovery, and promoting the meaningful secondary use of clinical data. A big concern in biomedical data sharing is the protection of patient privacy because inappropriate information leakage can put patient privacy at risk. Methods In this study, we deployed a grid logistic regression framework based on Secure Multi-party Computation (SMAC-GLORE). Unlike our previous work in GLORE, SMAC-GLORE protects not only patient-level data, but also all the intermediary information exchanged during the model-learning phase. Results The experimental results demonstrate the feasibility of secure distributed logistic regression across multiple institutions without sharing patient-level data. Conclusions In this study, we developed a circuit-based SMAC-GLORE framework. The proposed framework provides a practical solution for secure distributed logistic regression model learning.
Affiliation(s)
- Haoyi Shi: Department of Biomedical Informatics, University of California, San Diego, CA, 92093, USA; Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY, 13210, USA
- Chao Jiang: Department of Biomedical Informatics, University of California, San Diego, CA, 92093, USA; School of Electrical and Computer Engineering, University of Oklahoma, Tulsa, OK, 74135, USA
- Wenrui Dai: Department of Biomedical Informatics, University of California, San Diego, CA, 92093, USA
- Xiaoqian Jiang: Department of Biomedical Informatics, University of California, San Diego, CA, 92093, USA
- Yuzhe Tang: Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY, 13210, USA
- Lucila Ohno-Machado: Department of Biomedical Informatics, University of California, San Diego, CA, 92093, USA
- Shuang Wang: Department of Biomedical Informatics, University of California, San Diego, CA, 92093, USA
|
26
|
Constable SD, Tang Y, Wang S, Jiang X, Chapin S. Privacy-preserving GWAS analysis on federated genomic datasets. BMC Med Inform Decis Mak 2015; 15 Suppl 5:S2. [PMID: 26733045 PMCID: PMC4699163 DOI: 10.1186/1472-6947-15-s5-s2] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND The biomedical community benefits from the increasing availability of genomic data to support meaningful scientific research, e.g., Genome-Wide Association Studies (GWAS). However, high-quality GWAS usually requires a large number of samples, which can grow beyond the capability of a single institution. Federated genomic data analysis holds the promise of enabling cross-institution collaboration for effective GWAS, but it raises concerns about patient privacy and medical information confidentiality (as data are exchanged across institutional boundaries), which become an inhibiting factor for practical use. METHODS We present a privacy-preserving GWAS framework on federated genomic datasets. Our method layers the GWAS computations on top of secure multi-party computation (MPC) systems. This approach allows two parties in a distributed system to jointly perform secure GWAS computations without exposing their private data to each other. RESULTS We demonstrate our technique by implementing a framework for minor allele frequency counting and χ² statistics calculation, two typical computations used in GWAS. For efficient prototyping, we use a state-of-the-art MPC framework, i.e., Portable Circuit Format (PCF) 1. Our experimental results show promise in realizing both efficient and secure cross-institution GWAS computations.
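As a rough illustration of the two statistics named in this abstract, the following plaintext sketch pools per-institution allele counts and computes a minor allele frequency and an allelic χ² statistic for a 2x2 case/control table. It is not the authors' MPC implementation: in the framework described, the equivalent arithmetic would run inside a secure computation so that neither party sees the other's counts. The names `SiteCounts`, `pool`, `minor_allele_frequency`, and `allelic_chi_square` and the example counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SiteCounts:
    """Allele counts held by one institution for a single SNP (hypothetical layout)."""
    case_minor: int
    case_major: int
    ctrl_minor: int
    ctrl_major: int


def pool(sites: Iterable[SiteCounts]) -> SiteCounts:
    """Sum allele counts across institutions (done in the clear here, unlike MPC)."""
    sites = list(sites)
    return SiteCounts(
        case_minor=sum(s.case_minor for s in sites),
        case_major=sum(s.case_major for s in sites),
        ctrl_minor=sum(s.ctrl_minor for s in sites),
        ctrl_major=sum(s.ctrl_major for s in sites),
    )


def minor_allele_frequency(c: SiteCounts) -> float:
    """Fraction of all observed alleles that are the minor allele."""
    minor = c.case_minor + c.ctrl_minor
    total = minor + c.case_major + c.ctrl_major
    return minor / total


def allelic_chi_square(c: SiteCounts) -> float:
    """Pearson chi-square statistic for the 2x2 table (case/control x minor/major)."""
    a, b, cc, d = c.case_minor, c.case_major, c.ctrl_minor, c.ctrl_major
    n = a + b + cc + d
    denom = (a + b) * (cc + d) * (a + cc) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: an empty row or column
    return n * (a * d - b * cc) ** 2 / denom


# Two institutions with made-up counts for one SNP.
pooled = pool([SiteCounts(30, 70, 20, 80), SiteCounts(20, 80, 10, 90)])
maf = minor_allele_frequency(pooled)
chi2 = allelic_chi_square(pooled)
print(f"MAF = {maf:.3f}, chi-square = {chi2:.3f}")
```

Only the pooled counts enter either statistic, which is why aggregate-level secure computation is enough for this kind of analysis.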
Affiliation(s)
- Scott D Constable: Department of EECS, Syracuse University, South Crouse Avenue, 13244 Syracuse, NY USA
- Yuzhe Tang: Department of EECS, Syracuse University, South Crouse Avenue, 13244 Syracuse, NY USA
- Shuang Wang: Department of Biomedical Informatics, University of California, San Diego, 9500 Gilman Drive, MC 0728, 92093 La Jolla, CA USA
- Xiaoqian Jiang: Department of Biomedical Informatics, University of California, San Diego, 9500 Gilman Drive, MC 0728, 92093 La Jolla, CA USA
- Steve Chapin: Department of EECS, Syracuse University, South Crouse Avenue, 13244 Syracuse, NY USA
|