1
Wan Z, Hazel JW, Clayton EW, Vorobeychik Y, Kantarcioglu M, Malin BA. Sociotechnical safeguards for genomic data privacy. Nat Rev Genet 2022; 23:429-445. [PMID: 35246669 PMCID: PMC8896074 DOI: 10.1038/s41576-022-00455-y] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Accepted: 01/24/2022] [Indexed: 12/21/2022]
Abstract
Recent developments in a variety of sectors, including health care, research and the direct-to-consumer industry, have led to a dramatic increase in the amount of genomic data that are collected, used and shared. This state of affairs raises new and challenging concerns for personal privacy, both legally and technically. This Review appraises existing and emerging threats to genomic data privacy and discusses how well current legal frameworks and technical safeguards mitigate these concerns. It concludes with a discussion of remaining and emerging challenges and illustrates possible solutions that can balance protecting privacy and realizing the benefits that result from the sharing of genetic information. In this Review, the authors describe technical and legal protection mechanisms for mitigating vulnerabilities in genomic data privacy. They also discuss how these protections are dependent on the context of data use such as in research, health care, direct-to-consumer testing or forensic investigations.
Affiliation(s)
- Zhiyu Wan
- Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Computer Science, Vanderbilt University, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- James W Hazel
- Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA; Center for Biomedical Ethics and Society, Vanderbilt University, Nashville, TN, USA
- Ellen Wright Clayton
- Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA; Center for Biomedical Ethics and Society, Vanderbilt University, Nashville, TN, USA; Vanderbilt University Law School, Nashville, TN, USA
- Yevgeniy Vorobeychik
- Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO, USA
- Murat Kantarcioglu
- Department of Computer Science, University of Texas at Dallas, Richardson, TX, USA
- Bradley A Malin
- Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Computer Science, Vanderbilt University, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
2
Jafarbeiki S, Sakzad A, Kasra Kermanshahi S, Gaire R, Steinfeld R, Lai S, Abraham G, Thapa C. PrivGenDB: Efficient and privacy-preserving query executions over encrypted SNP-Phenotype database. Informatics in Medicine Unlocked 2022. [DOI: 10.1016/j.imu.2022.100988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022] [Open access]
3
Hekel R, Budis J, Kucharik M, Radvanszky J, Pös Z, Szemes T. Privacy-preserving storage of sequenced genomic data. BMC Genomics 2021; 22:712. [PMID: 34600465 PMCID: PMC8487550 DOI: 10.1186/s12864-021-07996-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/22/2021] [Accepted: 09/10/2021] [Indexed: 11/23/2022] [Open access]
Abstract
Background: The current and future applications of genomic data may raise ethical and privacy concerns. Processing and storing these data introduce a risk of abuse by potential offenders, since the human genome contains sensitive personal information. For this reason, we have developed a privacy-preserving method, named Varlock, that provides secure storage of sequenced genomic data. We used a public set of population allele frequencies to mask the personal alleles detected in genomic reads. Each personal allele described by the public set is masked by a randomly selected population allele with respect to its frequency. Masked alleles are preserved in an encrypted confidential file that can be shared in whole or in part using public-key cryptography. Results: Our method masked the personal variants and introduced new variants detected in a personal masked genome. Alternative alleles with lower population frequency were masked and introduced more often. We performed a joint PCA analysis of personal and masked VCFs, showing that the VCFs between the two groups cannot be trivially mapped. Moreover, the method is reversible, and personal alleles in specific genomic regions can be unmasked on demand. Conclusion: Our method masks personal alleles within genomic reads while preserving valuable non-sensitive properties of sequenced DNA fragments for further research. Personal alleles in the desired genomic regions may be restored and shared with patients, clinics, and researchers. We suggest that the method can provide an additional security layer for storing and sharing the raw aligned reads. Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-021-07996-2.
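The masking step described in this abstract can be illustrated with a minimal Python sketch. It is not the Varlock implementation: the function names, the dictionary layout keyed by (chromosome, position), and the plain in-memory "confidential" record are all assumptions made for illustration, and the encryption of that record with public-key cryptography is omitted entirely.

```python
import random


def mask_alleles(personal, freq_table, seed=None):
    """Mask personal alleles with population alleles drawn by frequency.

    personal:   {(chrom, pos): allele}              (hypothetical layout)
    freq_table: {(chrom, pos): {allele: frequency}} (hypothetical public set)
    Returns (masked, confidential); `confidential` holds the original alleles
    and would be encrypted in a real system (not done in this sketch).
    """
    rng = random.Random(seed)
    masked, confidential = {}, {}
    for site, allele in personal.items():
        freqs = freq_table.get(site)
        if freqs is None or allele not in freqs:
            # Allele not described by the public set: left unmasked here.
            masked[site] = allele
            continue
        alleles = list(freqs)
        weights = [freqs[a] for a in alleles]
        # Draw the replacement allele in proportion to its population frequency.
        masked[site] = rng.choices(alleles, weights=weights, k=1)[0]
        confidential[site] = allele
    return masked, confidential


def unmask(masked, confidential, sites):
    """Restore the original alleles, but only at the requested sites."""
    restored = dict(masked)
    for site in sites:
        if site in confidential:
            restored[site] = confidential[site]
    return restored
```

Calling `unmask` with a subset of sites mirrors the abstract's point that personal alleles in specific genomic regions can be restored on demand while the rest stay masked.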
Affiliation(s)
- Rastislav Hekel
- Geneton s.r.o, Bratislava, Slovakia; Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia; Slovak Centre of Scientific and Technical Information, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia
- Jaroslav Budis
- Geneton s.r.o, Bratislava, Slovakia; Slovak Centre of Scientific and Technical Information, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia
- Marcel Kucharik
- Geneton s.r.o, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia
- Jan Radvanszky
- Geneton s.r.o, Bratislava, Slovakia; Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia; Biomedical Research Centre, Institute of Clinical and Translational Research, Slovak Academy of Sciences, Bratislava, Slovakia
- Zuzana Pös
- Geneton s.r.o, Bratislava, Slovakia; Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia; Biomedical Research Centre, Institute of Clinical and Translational Research, Slovak Academy of Sciences, Bratislava, Slovakia
- Tomas Szemes
- Geneton s.r.o, Bratislava, Slovakia; Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia
4
Bonomi L, Huang Y, Ohno-Machado L. Privacy challenges and research opportunities for genomic data sharing. Nat Genet 2020; 52:646-654. [PMID: 32601475 PMCID: PMC7761157 DOI: 10.1038/s41588-020-0651-0] [Citation(s) in RCA: 70] [Impact Index Per Article: 17.5] [Received: 12/11/2019] [Accepted: 05/22/2020] [Indexed: 12/17/2022]
Abstract
The sharing of genomic data holds great promise in advancing precision medicine and providing personalized treatments and other types of interventions. However, these opportunities come with privacy concerns, and data misuse could potentially lead to privacy infringement for individuals and their blood relatives. With the rapid growth and increased availability of genomic datasets, understanding the current genome privacy landscape and identifying the challenges in developing effective privacy-protecting solutions are imperative. In this work, we provide an overview of major privacy threats identified by the research community and examine the privacy challenges in the context of emerging direct-to-consumer genetic-testing applications. We additionally present general privacy-protection techniques for genomic data sharing and their potential applications in direct-to-consumer genomic testing and forensic analyses. Finally, we discuss limitations in current privacy-protection methods, highlight possible mitigation strategies and suggest future research opportunities for advancing genomic data sharing.
Affiliation(s)
- Luca Bonomi
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA.
- Yingxiang Huang
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Lucila Ohno-Machado
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Division of Health Services Research & Development, VA San Diego Healthcare System, San Diego, La Jolla, CA, USA
5
Kockan C, Zhu K, Dokmai N, Karpov N, Kulekci MO, Woodruff DP, Sahinalp SC. Sketching algorithms for genomic data analysis and querying in a secure enclave. Nat Methods 2020; 17:295-301. [PMID: 32132732 DOI: 10.1038/s41592-020-0761-8] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 04/17/2019] [Accepted: 01/22/2020] [Indexed: 11/09/2022]
Abstract
Genome-wide association studies (GWAS), especially on rare diseases, may necessitate exchange of sensitive genomic data between multiple institutions. Since genomic data sharing is often infeasible due to privacy concerns, cryptographic methods, such as secure multiparty computation (SMC) protocols, have been developed with the aim of offering privacy-preserving collaborative GWAS. Unfortunately, the computational overhead of these methods remains prohibitive for human-genome-scale data. Here we introduce SkSES (https://github.com/ndokmai/sgx-genome-variants-search), a hardware-software hybrid approach for privacy-preserving collaborative GWAS, which improves the running time of the most advanced cryptographic protocols by two orders of magnitude. The SkSES approach is based on trusted execution environments (TEEs) offered by current-generation microprocessors, in particular Intel's SGX. To overcome the severe memory limitation of the TEEs, SkSES employs novel 'sketching' algorithms that maintain essential statistical information on genomic variants in input VCF files. By additionally incorporating efficient data compression and population stratification reduction methods, SkSES identifies the top k genomic variants in a cohort quickly, accurately and in a privacy-preserving manner.
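To give a feel for how a sketch keeps approximate per-variant statistics in a small, fixed amount of memory, the following Python example uses a generic count-min sketch. This is a swapped-in textbook technique, not the SkSES algorithm: the class, the variant identifiers, and the `top_k` helper are assumptions for illustration, and the statistics, compression and enclave handling of the actual tool are not reproduced.

```python
import hashlib
import heapq


class CountMinSketch:
    """Fixed-size table that over-estimates counts, never under-estimates."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One salted hash per row, so rows behave like independent hash functions.
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        """Record `count` (non-negative) observations of `key`."""
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        """Minimum over rows: the least-collided counter is the best estimate."""
        return min(
            self.table[row][self._index(row, key)] for row in range(self.depth)
        )


def top_k(sketch, candidates, k):
    """Return the k candidate variants with the largest estimated counts."""
    return heapq.nlargest(k, candidates, key=sketch.estimate)
```

The memory used depends only on `width` and `depth`, not on the number of distinct variants seen, which is the property that makes sketches attractive when memory is tightly limited.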
Affiliation(s)
- Can Kockan
- Department of Computer Science, Indiana University, Bloomington, IN, USA; Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Kaiyuan Zhu
- Department of Computer Science, Indiana University, Bloomington, IN, USA; Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Natnatee Dokmai
- Department of Computer Science, Indiana University, Bloomington, IN, USA
- Nikolai Karpov
- Department of Computer Science, Indiana University, Bloomington, IN, USA
- M Oguzhan Kulekci
- Informatics Institute, Istanbul Technical University, Istanbul, Turkey
- David P Woodruff
- Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- S Cenk Sahinalp
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
6
Setup-Free Secure Search on Encrypted Data: Faster and Post-Processing Free. Proceedings on Privacy Enhancing Technologies 2019. [DOI: 10.2478/popets-2019-0038] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/21/2022]
Abstract
We present a novel secure search protocol on data and queries encrypted with Fully Homomorphic Encryption (FHE). Our protocol enables organizations (the client) to (1) securely upload an unsorted data array x = (x[1], . . . , x[n]) to an untrusted honest-but-curious server, where data may be uploaded over time and from multiple data sources; and (2) securely issue repeated search queries q for retrieving the first element (i*, x[i*]) satisfying an agreed matching criterion i* = min { i ∈ [n] | IsMatch(x[i], q) = 1 }, as well as fetching the next matching elements with further interaction. For security, the client encrypts the data and queries with FHE prior to uploading, and the server processes the ciphertexts to produce the result ciphertext for the client to decrypt. Our secure search protocol improves over the prior state-of-the-art for secure search on FHE encrypted data (Akavia, Feldman, Shaul (AFS), CCS’2018) in achieving:
– Post-processing free protocol where the server produces a ciphertext for the correct search outcome with overwhelming success probability. This is in contrast to returning a list of candidates for the client to postprocess, or suffering from a noticeable error probability, in AFS. Our post-processing freeness enables the server to use secure search as a sub-component in a larger computation without interaction with the client.
– Faster protocol: (a) Client time and communication bandwidth are improved by a log² n / log log n factor. (b) The server evaluates a polynomial of degree linear in log n (compared to cubic in AFS), and the overall number of multiplications is improved by up to a log n factor. (c) Only GF(2) computations are employed (compared to GF(p) for p ≫ 2 in AFS), to gain both further speedup and compatibility with all current FHE candidates.
– Order of magnitude speedup exhibited by extensive benchmarks we executed on identical hardware for implementations of our protocol versus that of AFS.
Additionally, like other FHE-based solutions, our solution is setup-free: to outsource elements from the client to the server, no additional actions are performed on x except for encrypting it element by element (each element bit by bit) and uploading the resulting ciphertexts to the server.
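The search functionality defined above, i* = min { i ∈ [n] | IsMatch(x[i], q) = 1 }, can be made concrete with a plaintext Python sketch. Nothing here is encrypted and nothing here is the protocol from the paper: the equality-based `is_match` is a hypothetical matching criterion, and the bitwise version is a naive formulation whose multiplicative degree grows linearly in n, whereas the abstract reports a polynomial of degree linear in log n. It only shows that the first-match index can be written with AND and XOR alone, which are multiplication and addition over GF(2).

```python
def is_match(x, q):
    """Hypothetical matching criterion: plain equality, returned as a bit."""
    return 1 if x == q else 0


def first_match_reference(xs, q):
    """Straightforward reference: 1-based index of the first match, 0 if none."""
    for i, x in enumerate(xs, start=1):
        if is_match(x, q):
            return i
    return 0


def first_match_gf2(xs, q):
    """Same result using only AND (product mod 2) and XOR (sum mod 2) on bits."""
    n_bits = max(1, len(xs).bit_length())
    index_bits = [0] * n_bits
    none_before = 1  # product over j < i of (1 + m[j]) in GF(2)
    for i, x in enumerate(xs, start=1):
        m = is_match(x, q)
        first_here = m & none_before  # 1 only at the first matching position
        for k in range(n_bits):
            # XOR in bit k of i, gated by the first-match indicator.
            index_bits[k] ^= ((i >> k) & 1) & first_here
        none_before &= 1 ^ m
    return sum(bit << k for k, bit in enumerate(index_bits))
```

Because at most one position has `first_here == 1`, the XOR accumulation never mixes two indices, so the assembled bits equal i* (or 0 when nothing matches).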
7
Systematizing Genome Privacy Research: A Privacy-Enhancing Technologies Perspective. Proceedings on Privacy Enhancing Technologies 2018. [DOI: 10.2478/popets-2019-0006] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Indexed: 11/20/2022]
Abstract
Rapid advances in human genomics are enabling researchers to gain a better understanding of the role of the genome in our health and well-being, stimulating hope for more effective and cost efficient healthcare. However, this also prompts a number of security and privacy concerns stemming from the distinctive characteristics of genomic data. To address them, a new research community has emerged and produced a large number of publications and initiatives. In this paper, we rely on a structured methodology to contextualize and provide a critical analysis of the current knowledge on privacy-enhancing technologies used for testing, storing, and sharing genomic data, using a representative sample of the work published in the past decade. We identify and discuss limitations, technical challenges, and issues faced by the community, focusing in particular on those that are inherently tied to the nature of the problem and are harder for the community alone to address. Finally, we report on the importance and difficulty of the identified challenges based on an online survey of genome data privacy experts.
8
Azencott CA. Machine learning and genomics: precision medicine versus patient privacy. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 2018; 376:rsta.2017.0350. [PMID: 30082298 DOI: 10.1098/rsta.2017.0350] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Accepted: 06/07/2018] [Indexed: 06/08/2023]
Abstract
Machine learning can have a major societal impact in computational biology applications. In particular, it plays a central role in the development of precision medicine, whereby treatment is tailored to the clinical or genetic features of the patient. However, these advances require collecting large amounts of genomic data and sharing them among researchers, which generates much concern about privacy. Researchers, study participants and governing bodies should be aware of the ways in which the privacy of participants might be compromised, as well as of the large body of research on technical solutions to these issues. We review how breaches in patient privacy can occur, present recent developments in computational data protection and discuss how they can be combined with legal and ethical perspectives to provide secure frameworks for genomic data sharing. This article is part of a discussion meeting issue 'The growing ubiquity of algorithms in society: implications, impacts and innovations'.
Affiliation(s)
- C-A Azencott
- MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, 75006 Paris, France
- Institut Curie, PSL Research University, 75005 Paris, France
- INSERM, U900, 75005 Paris, France
9
Sousa JS, Lefebvre C, Huang Z, Raisaro JL, Aguilar-Melchor C, Killijian MO, Hubaux JP. Efficient and secure outsourcing of genomic data storage. BMC Med Genomics 2017; 10:46. [PMID: 28786363 PMCID: PMC5547444 DOI: 10.1186/s12920-017-0275-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 12/22/2022] [Open access]
Abstract
Background: Cloud computing is becoming the preferred solution for efficiently dealing with the increasing amount of genomic data. Yet, outsourcing the storage and processing of sensitive information, such as genomic data, comes with important concerns related to privacy and security. This calls for new sophisticated techniques that ensure data protection from untrusted cloud providers and that still enable researchers to obtain useful information. Methods: We present a novel privacy-preserving algorithm for fully outsourcing the storage of large genomic data files to a public cloud and enabling researchers to efficiently search for variants of interest. In order to protect data and query confidentiality from possible leakage, our solution exploits optimal encoding for genomic variants and combines it with homomorphic encryption and private information retrieval. Our proposed algorithm is implemented in C++ and was evaluated on real data as part of the 2016 iDash Genome Privacy-Protection Challenge. Results: Our solution outperforms the state-of-the-art solutions and enables researchers to search over millions of encrypted variants in a few seconds. Conclusions: Contrary to prior beliefs that sophisticated privacy-enhancing technologies (PETs) are impractical for real operational settings, our solution demonstrates that, in the case of genomic data, PETs are very efficient enablers.
Affiliation(s)
- João Sá Sousa
- Laboratory for Communications and Applications - LCA 1, École Polytechnique Fédérale de Lausanne, Route Cantonale, Lausanne, 1015, Switzerland
- Cédric Lefebvre
- Laboratory for Analysis and Architecture of Systems - LAAS-CNRS, Université Toulouse, 7 Avenue du Colonel Roche, Toulouse, 31400, France
- Zhicong Huang
- Laboratory for Communications and Applications - LCA 1, École Polytechnique Fédérale de Lausanne, Route Cantonale, Lausanne, 1015, Switzerland
- Jean Louis Raisaro
- Laboratory for Communications and Applications - LCA 1, École Polytechnique Fédérale de Lausanne, Route Cantonale, Lausanne, 1015, Switzerland
- Carlos Aguilar-Melchor
- Toulouse Institute of Computer Science Research - IRIT, Université Toulouse, 118 Route de Narbonne, Toulouse, F-31062, France
- Marc-Olivier Killijian
- Laboratory for Analysis and Architecture of Systems - LAAS-CNRS, Université Toulouse, 7 Avenue du Colonel Roche, Toulouse, 31400, France
- Jean-Pierre Hubaux
- Laboratory for Communications and Applications - LCA 1, École Polytechnique Fédérale de Lausanne, Route Cantonale, Lausanne, 1015, Switzerland
10
Abstract
Background: As genome sequencing technology develops rapidly, there has lately been an increasing need to keep genomic data secure even when they are stored in the cloud and still used for research. We are interested in designing a protocol for the secure outsourced matching problem on encrypted data. Method: We propose an efficient method to securely search for a position matching the query data and to extract some information at that position. After decryption, only a small number of comparisons with the query information need to be performed in plaintext. We apply this method to find a set of biomarkers in encrypted genomes. The important feature of our method is that it encodes a genomic database as a single element of a polynomial ring. Result: Since our method requires a single homomorphic multiplication of a hybrid scheme for query computation, it has an advantage over previous methods in parameter size, computational complexity, and communication cost. In particular, the extraction procedure not only prevents leakage of database information that has not been queried by the user but also reduces the communication cost by half. We evaluate the performance of our method and verify that computation on large-scale personal data can be securely and practically outsourced to a cloud environment during data analysis. It takes about 3.9 s to search and extract the reference and alternate sequences at the queried position in a database of size 4M. Conclusion: Our solution for finding a set of biomarkers in DNA sequences shows that cryptographic techniques have progressed to the point where they can support real-world genome data analysis in a cloud environment.
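The idea of packing a database into one polynomial-ring element and reading out a position with a single multiplication can be shown in plaintext. The sketch below is an assumption-laden illustration, not the authors' encoding or their hybrid encryption scheme: it works in Z_t[X]/(X^n + 1), stores entry j as the coefficient of X^j, and multiplies by the monomial X^(-j) = -X^(n-j) so that the queried entry lands in the constant term. All function names and the parameters n and t are hypothetical.

```python
def negacyclic_mul(a, b, n, t):
    """Multiply coefficient lists a and b in Z_t[X]/(X^n + 1)."""
    out = [0] * n
    for i, ai in enumerate(a):
        if ai == 0:
            continue
        for j, bj in enumerate(b):
            if bj == 0:
                continue
            k = i + j
            if k < n:
                out[k] = (out[k] + ai * bj) % t
            else:
                # X^n = -1, so terms that wrap around change sign.
                out[k - n] = (out[k - n] - ai * bj) % t
    return out


def encode_database(values, n, t):
    """Store values[j] as the coefficient of X^j, padding with zeros."""
    if len(values) > n:
        raise ValueError("database does not fit in one ring element")
    return [v % t for v in values] + [0] * (n - len(values))


def query_monomial(j, n, t):
    """Coefficients of X^(-j): 1 when j == 0, otherwise -X^(n-j)."""
    q = [0] * n
    if j == 0:
        q[0] = 1
    else:
        q[n - j] = t - 1  # -1 modulo t
    return q


def extract(db_poly, j, n, t):
    """One ring multiplication; the constant term is the entry at position j."""
    return negacyclic_mul(db_poly, query_monomial(j, n, t), n, t)[0]
```

In an encrypted setting the same multiplication would be carried out homomorphically on ciphertexts, and only the constant term would be of interest to the querying party.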
Affiliation(s)
- Miran Kim
- Division of Biomedical Informatics, University of California- San Diego, San Diego, CA, 92093, USA.
- Yongsoo Song
- Department of Mathematical Sciences, Seoul National University, GwanAkRo 1, Seoul, 08826, Republic of Korea
- Jung Hee Cheon
- Department of Mathematical Sciences, Seoul National University, GwanAkRo 1, Seoul, 08826, Republic of Korea
11
Wagner J, Paulson JN, Wang X, Bhattacharjee B, Corrada Bravo H. Privacy-preserving microbiome analysis using secure computation. Bioinformatics 2016; 32:1873-9. [PMID: 26873931 PMCID: PMC4908319 DOI: 10.1093/bioinformatics/btw073] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 07/23/2015] [Revised: 12/04/2015] [Accepted: 01/31/2016] [Indexed: 01/10/2023] [Open access]
Abstract
Motivation: Developing targeted therapeutics and identifying biomarkers rely on large amounts of research participant data. Beyond human DNA, scientists now investigate the DNA of micro-organisms inhabiting the human body. Recent work shows that an individual's collection of microbial DNA consistently identifies that person and could be used to link a real-world identity to a sensitive attribute in a research dataset. Unfortunately, the current suite of DNA-specific privacy-preserving analysis tools does not meet the requirements for microbiome sequencing studies. Results: To address privacy concerns around microbiome sequencing, we implement metagenomic analyses using secure computation. Our implementation allows comparative analysis over combined data without revealing the feature counts for any individual sample. We focus on three analyses and perform an evaluation on datasets currently used by the microbiome research community. We use our implementation to simulate sharing data between four policy domains. Additionally, we describe an application of our implementation in which patients combine their data, allowing drug developers to query the combined data and compensate patients for the analysis. Availability and implementation: The software is freely available for download at http://cbcb.umd.edu/~hcorrada/projects/secureseq.html. Supplementary information: Supplementary data are available at Bioinformatics online. Contact: hcorrada@umiacs.umd.edu.
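The goal stated in this abstract, computing over combined data without revealing any individual sample's feature counts, can be illustrated with additive secret sharing. This is a swapped-in, generic technique chosen for brevity: the abstract does not say which secure computation protocol the authors use, and the modulus, function names and four-party example below are assumptions for illustration only.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus, comfortably larger than any realistic count


def share(value, n_parties):
    """Split `value` into n_parties random shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recombine shares; any proper subset of them reveals nothing useful."""
    return sum(shares) % MODULUS


def secure_sum(private_counts):
    """Total of all parties' counts, computed only from exchanged shares."""
    n = len(private_counts)
    # share_matrix[i][j] is the share that party i sends to party j.
    share_matrix = [share(count, n) for count in private_counts]
    # Each party j publishes only the sum of the shares it received.
    partial_sums = [
        sum(share_matrix[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    return reconstruct(partial_sums)
```

For example, four sites holding counts 12, 0, 7 and 30 for one microbial feature would obtain `secure_sum([12, 0, 7, 30]) == 49` while each site sees only uniformly random-looking shares from the others.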
Affiliation(s)
- Justin Wagner
- Center for Bioinformatics and Computational Biology and
- Xiao Wang
- Maryland Cybersecurity Center, Department of Computer Science, University of Maryland, College Park, MD USA
- Bobby Bhattacharjee
- Maryland Cybersecurity Center, Department of Computer Science, University of Maryland, College Park, MD USA