1
Blindenbach J, Kang J, Hong S, Karam C, Lehner T, Gürsoy G. Ultra-secure storage and analysis of genetic data for the advancement of precision medicine. bioRxiv: the preprint server for biology 2024:2024.04.16.589793. [PMID: 38695012 PMCID: PMC11061874 DOI: 10.1101/2024.04.16.589793] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2024]
Abstract
Cloud computing provides the opportunity to store the ever-growing genotype-phenotype data sets needed to achieve the full potential of precision medicine. However, due to the sensitive nature of this data and the patchwork of data privacy laws across states and countries, additional security protections are proving necessary to ensure data privacy and security. Here we present SQUiD, a secure queryable database for storing and analyzing genotype-phenotype data. With SQUiD, genotype-phenotype data can be stored in a low-security, low-cost public cloud in the encrypted form, which researchers can securely query without the public cloud ever being able to decrypt the data. We demonstrate the usability of SQUiD by replicating various commonly used calculations such as polygenic risk scores, cohort creation for GWAS, MAF filtering, and patient similarity analysis both on synthetic and UK Biobank data. Our work represents a new and scalable platform enabling the realization of precision medicine without security and privacy concerns.
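Two of the calculations named in this abstract, MAF filtering and polygenic risk scores, can be illustrated on plaintext data. The following is a minimal Python sketch on unencrypted toy values; it is not SQUiD's implementation and involves no encryption. The function names, the 0/1/2 alternate-allele genotype encoding, the variant identifiers, and the weights are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def minor_allele_frequency(genotypes: Sequence[int]) -> float:
    """MAF for one variant, given 0/1/2 alternate-allele counts per diploid sample."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_freq, 1.0 - alt_freq)


def maf_filter(variants: Dict[str, List[int]], threshold: float) -> List[str]:
    """Return the IDs of variants whose MAF is at least `threshold`."""
    return [
        variant_id
        for variant_id, genotypes in variants.items()
        if minor_allele_frequency(genotypes) >= threshold
    ]


def polygenic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of one individual's allele dosages (hypothetical effect weights)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Toy data (hypothetical): four samples per variant.
toy_variants = {
    "variant_a": [0, 1, 2, 0],  # alternate-allele frequency 3/8
    "variant_b": [0, 0, 0, 0],  # monomorphic, MAF 0
    "variant_c": [2, 2, 1, 2],  # alternate-allele frequency 7/8, MAF 1/8
}
common = maf_filter(toy_variants, threshold=0.2)
score = polygenic_risk_score([2, 1, 0], [0.5, 0.25, 0.125])
```

In an encrypted setting such as the one the abstract describes, the same sums and comparisons would have to be evaluated over ciphertexts; how that is done is not specified here.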
Affiliation(s)
- Jacob Blindenbach
  - Department of Computer Science, Columbia University
  - Department of Biomedical Informatics, Columbia University
  - New York Genome Center
  - These authors contributed equally
- Jiayi Kang
  - COSIC, KU Leuven
  - These authors contributed equally
- Seungwan Hong
  - Department of Biomedical Informatics, Columbia University
  - New York Genome Center
  - These authors contributed equally
- Gamze Gürsoy
  - Department of Computer Science, Columbia University
  - Department of Biomedical Informatics, Columbia University
  - New York Genome Center
2
Bataa M, Song S, Park K, Kim M, Cheon JH, Kim S. Finding Highly Similar Regions of Genomic Sequences Through Homomorphic Encryption. J Comput Biol 2024; 31:197-212. [PMID: 38531050 DOI: 10.1089/cmb.2023.0050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/28/2024]
Abstract
Finding highly similar regions of genomic sequences is a basic computation of genomic analysis. Genomic analyses on a large amount of data are efficiently processed in cloud environments, but outsourcing them to a cloud raises concerns over the privacy and security issues. Homomorphic encryption (HE) is a powerful cryptographic primitive that preserves privacy of genomic data in various analyses processed in an untrusted cloud environment. We introduce an efficient algorithm for finding highly similar regions of two homomorphically encrypted sequences, and describe how to implement it using the bit-wise and word-wise HE schemes. In the experiment, our algorithm outperforms an existing algorithm by up to two orders of magnitude in terms of elapsed time. Overall, it finds highly similar regions of the sequences in real data sets in a feasible time.
Affiliation(s)
- Magsarjav Bataa
  - Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
  - Department of Information and Computer Sciences, National University of Mongolia, Ulaanbaatar, Mongolia
- Siwoo Song
  - Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
- Kunsoo Park
  - Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
- Miran Kim
  - Department of Mathematics, Hanyang University, Seoul, South Korea
- Jung Hee Cheon
  - Department of Mathematical Sciences, Seoul National University, Seoul, South Korea
- Sun Kim
  - Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
3
Woods A, Kramer ST, Xu D, Jiang W. Secure Comparisons of Single Nucleotide Polymorphisms Using Secure Multiparty Computation: Method Development. JMIR Bioinformatics and Biotechnology 2023; 4:e44700. [PMID: 38935952 PMCID: PMC11135223 DOI: 10.2196/44700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2022] [Revised: 05/21/2023] [Accepted: 06/09/2023] [Indexed: 06/29/2024]
Abstract
BACKGROUND While genomic variations can provide valuable information for health care and ancestry, the privacy of individual genomic data must be protected. Thus, a secure environment is desirable for a human DNA database such that the total data are queryable but not directly accessible to involved parties (eg, data hosts and hospitals) and that the query results are learned only by the user or authorized party. OBJECTIVE In this study, we provide efficient and secure computations on panels of single nucleotide polymorphisms (SNPs) from genomic sequences as computed under the following set operations: union, intersection, set difference, and symmetric difference. METHODS Using these operations, we can compute similarity metrics, such as the Jaccard similarity, which could allow querying a DNA database to find the same person and genetic relatives securely. We analyzed various security paradigms and show metrics for the protocols under several security assumptions, such as semihonest, malicious with honest majority, and malicious with a malicious majority. RESULTS We show that our methods can be used practically on realistically sized data. Specifically, we can compute the Jaccard similarity of two genomes when considering sets of SNPs, each with 400,000 SNPs, in 2.16 seconds with the assumption of a malicious adversary in an honest majority and 0.36 seconds under a semihonest model. CONCLUSIONS Our methods may help adopt trusted environments for hosting individual genomic data with end-to-end data security.
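The set operations and the Jaccard similarity named in this abstract can be sketched in plaintext as follows. This is only an illustration of the quantities being computed, not the secure multiparty protocol the paper develops; the function names and the SNP identifiers are hypothetical.

```python
from typing import Dict, Set


def snp_set_operations(a: Set[str], b: Set[str]) -> Dict[str, Set[str]]:
    """The four set operations listed in the abstract, on two SNP panels."""
    return {
        "union": a | b,
        "intersection": a & b,
        "difference": a - b,
        "symmetric_difference": a ^ b,
    }


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / |A ∪ B|; two empty panels are treated as identical (an assumption)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical SNP identifiers for two individuals.
panel_a = {"snp_1", "snp_2", "snp_3"}
panel_b = {"snp_2", "snp_3", "snp_4"}

operations = snp_set_operations(panel_a, panel_b)
similarity = jaccard_similarity(panel_a, panel_b)  # 2 shared SNPs out of 4 distinct
```

A similarity close to 1 would suggest the same person or a close relative, which is the querying use case the abstract mentions; any decision threshold is left unspecified here.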
Affiliation(s)
- Andrew Woods
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Skyler T Kramer
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO, United States
- Dong Xu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO, United States
- Wei Jiang
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
4
Kuo TT, Jiang X, Tang H, Wang X, Harmanci A, Kim M, Post K, Bu D, Bath T, Kim J, Liu W, Chen H, Ohno-Machado L. The evolving privacy and security concerns for genomic data analysis and sharing as observed from the iDASH competition. J Am Med Inform Assoc 2022; 29:2182-2190. [PMID: 36164820 PMCID: PMC9667175 DOI: 10.1093/jamia/ocac165] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/18/2022] [Revised: 08/25/2022] [Accepted: 09/13/2022] [Indexed: 01/11/2023]
Abstract
Concerns regarding inappropriate leakage of sensitive personal information as well as unauthorized data use are increasing with the growth of genomic data repositories. Therefore, privacy and security of genomic data have become increasingly important and need to be studied. With many proposed protection techniques, their applicability in support of biomedical research should be well understood. For this purpose, we have organized a community effort in the past 8 years through the integrating data for analysis, anonymization and sharing consortium to address this practical challenge. In this article, we summarize our experience from these competitions, report lessons learned from the events in 2020/2021 as examples, and discuss potential future research directions in this emerging field.
Affiliation(s)
- Tsung-Ting Kuo
  - Corresponding Author: Tsung-Ting Kuo, PhD, UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093, USA
- Arif Harmanci
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Miran Kim
  - Department of Mathematics, Hanyang University, Seoul, Republic of Korea
  - Department of Computer Science, Hanyang University, Seoul, Republic of Korea
- Kai Post
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
- Diyue Bu
  - Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, Indiana, USA
- Tyler Bath
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
- Jihoon Kim
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
- Weijie Liu
  - Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, Indiana, USA
- Hongbo Chen
  - Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, Indiana, USA
- Lucila Ohno-Machado
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
  - Division of Health Services Research & Development, Veteran Affairs San Diego Healthcare System, San Diego, California, USA
5
Carter AB. Considerations for Genomic Data Privacy and Security when Working in the Cloud. J Mol Diagn 2019; 21:542-552. [DOI: 10.1016/j.jmoldx.2018.07.009] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 12/05/2017] [Revised: 05/16/2018] [Accepted: 07/02/2018] [Indexed: 01/21/2023]
6
Aziz MMA, Sadat MN, Alhadidi D, Wang S, Jiang X, Brown CL, Mohammed N. Privacy-preserving techniques of genomic data-a survey. Brief Bioinform 2019; 20:887-895. [PMID: 29121240 PMCID: PMC6585383 DOI: 10.1093/bib/bbx139] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 06/23/2017] [Revised: 09/30/2017] [Indexed: 01/10/2023]
Abstract
Genomic data hold salient information about the characteristics of a living organism. Throughout the past decade, pinnacle developments have given us more accurate and inexpensive methods to retrieve genome sequences of humans. However, with the advancement of genomic research, there is a growing privacy concern regarding the collection, storage and analysis of such sensitive human data. Recent results show that given some background information, it is possible for an adversary to reidentify an individual from a specific genomic data set. This can reveal the current association or future susceptibility of some diseases for that individual (and sometimes the kinship between individuals) resulting in a privacy violation. Regardless of these risks, our genomic data hold much importance in analyzing the well-being of us and the future generation. Thus, in this article, we discuss the different privacy and security-related problems revolving around human genomic data. In addition, we will explore some of the cardinal cryptographic concepts, which can bring efficacy in secure and private genomic data computation. This article will relate the gaps between these two research areas: cryptography and genomics.
Affiliation(s)
- Md Momin Al Aziz
  - Department of Computer Science at the University of Manitoba, Winnipeg, Canada
- Md Nazmus Sadat
  - Department of Computer Science at the University of Manitoba, Winnipeg, Canada
- Dima Alhadidi
  - Faculty of Computer Science at the University of New Brunswick, Fredericton, Canada
- Shuang Wang
  - Department of Biomedical Informatics at the University of California in San Diego, La Jolla, CA, USA
- Xiaoqian Jiang
  - Department of Biomedical Informatics at the University of California in San Diego, La Jolla, CA, USA
- Cheryl L Brown
  - Department of Political Science and Public Administration at the University of North Carolina at Charlotte, NC, USA
- Noman Mohammed
  - Department of Computer Science at the University of Manitoba, Winnipeg, Canada
7
Abstract
OBJECTIVE To summarize notable research contributions published in 2017 on data sharing and privacy issues in medical informatics. METHODS An extensive search of PubMed/Medline, Web of Science, ACM Digital Library, IEEE Xplore, and AAAI Digital Library was conducted to uncover the scientific contributions published in 2017 that addressed issues of biomedical data sharing, with a focus on data access and privacy. The selection process was based on three steps: (i) a selection of candidate best papers, (ii) the review of the candidate best papers by a team of international experts with respect to six predefined criteria, and (iii) the selection of the best papers by the editorial board of the Yearbook. RESULTS Five best papers were selected. They cover the lifecycle of biomedical data collection, use, and sharing. The papers introduce 1) consenting strategies for emerging environments, 2) software for searching and retrieving datasets in organizationally distributed environments, 3) approaches to measure the privacy risks of sharing new data increasingly utilized in research and the clinical setting (e.g., genomic), 4) new cryptographic techniques for querying clinical data for cohort discovery, and 5) novel game theoretic strategies for publishing summary information about genome-phenome studies that balance the utility of the data with potential privacy risks to the participants of such studies. CONCLUSION The papers illustrated that there is no one-size-fits-all solution to privacy while working with biomedical data. At the same time, the papers show that there are opportunities for leveraging newly emerging technologies to enable data use while minimizing privacy risks.
Affiliation(s)
- Bradley Malin
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, USA
  - Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, Tennessee, USA
- Kenneth Goodman
  - Institute for Bioethics and Health Policy, University of Miami, Miami, Florida, USA
8
Wang S, Jiang X, Tang H, Wang X, Bu D, Carey K, Dyke SO, Fox D, Jiang C, Lauter K, Malin B, Sofia H, Telenti A, Wang L, Wang W, Ohno-Machado L. A community effort to protect genomic data sharing, collaboration and outsourcing. NPJ Genom Med 2017; 2:33. [PMID: 29263842 PMCID: PMC5677972 DOI: 10.1038/s41525-017-0036-1] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 05/30/2017] [Revised: 07/10/2017] [Accepted: 10/10/2017] [Indexed: 12/13/2022]
Abstract
The human genome can reveal sensitive information and is potentially re-identifiable, which raises privacy and security concerns about sharing such data on wide scales. In 2016, we organized the third Critical Assessment of Data Privacy and Protection competition as a community effort to bring together biomedical informaticists, computer privacy and security researchers, and scholars in ethical, legal, and social implications (ELSI) to assess the latest advances on privacy-preserving techniques for protecting human genomic data. Teams were asked to develop novel protection methods for emerging genome privacy challenges in three scenarios: Track (1) data sharing through the Beacon service of the Global Alliance for Genomics and Health; Track (2) collaborative discovery of similar genomes between two institutions; and Track (3) data outsourcing to public cloud services. The latter two tracks represent continuing themes from our 2015 competition, while the former was new and a response to a recently established vulnerability. The winning strategy for Track 1 mitigated the privacy risk by hiding approximately 11% of the variation in the database while permitting around 160,000 queries, a significant improvement over the baseline. The winning strategies in Tracks 2 and 3 showed significant progress over the previous competition by achieving multiple orders of magnitude performance improvement in terms of computational runtime and memory requirements. The outcomes suggest that applying highly optimized privacy-preserving and secure computation techniques to safeguard genomic data sharing and analysis is useful. However, the results also indicate that further efforts are needed to refine these techniques into practical solutions.
Affiliation(s)
- Shuang Wang
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093 USA
- Xiaoqian Jiang
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093 USA
- Haixu Tang
  - Computer Science and Informatics, Indiana University, Bloomington, IN 47408 USA
- Xiaofeng Wang
  - Computer Science and Informatics, Indiana University, Bloomington, IN 47408 USA
- Diyue Bu
  - Computer Science and Informatics, Indiana University, Bloomington, IN 47408 USA
- Knox Carey
  - GeneCloud, Intertrust, CA, Sunnyvale, CA 94085 USA
- Stephanie Om Dyke
  - Centre of Genomics and Policy, Department of Human Genetics, McGill University, Montreal, QC H3A 0G4 Canada
- Dov Fox
  - School of Law, University of San Diego, San Diego, CA 92110 USA
- Chao Jiang
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093 USA
- Kristin Lauter
  - Cryptography Group, Microsoft Research, San Diego, CA 92122 USA
- Bradley Malin
  - Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN 37203 USA
- Heidi Sofia
  - National Human Genome Research Institute, Rockville, MD 20894 USA
- Lei Wang
  - Computer Science and Informatics, Indiana University, Bloomington, IN 47408 USA
- Wenhao Wang
  - Computer Science and Informatics, Indiana University, Bloomington, IN 47408 USA
- Lucila Ohno-Machado
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093 USA