1
Mosca MJ, Cho H. Reconstruction of private genomes through reference-based genotype imputation. Genome Biol 2023; 24:271. [PMID: 38053191 PMCID: PMC10698978 DOI: 10.1186/s13059-023-03105-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2023] [Accepted: 11/06/2023] [Indexed: 12/07/2023] Open
Abstract
BACKGROUND Genotype imputation is an essential step in genetic studies to improve data quality and statistical power. Public imputation servers are widely used by researchers to impute their data using otherwise access-controlled reference panels of high-fidelity genomes held by these servers. RESULTS We report evidence against the prevailing assumption that providing access to panels only indirectly via imputation servers poses a negligible privacy risk to individuals in the panels. To this end, we present algorithmic strategies for adaptively constructing artificial input samples and interpreting their imputation results that lead to the accurate reconstruction of reference panel haplotypes. We illustrate this possibility on three reference panels of real genomes for a range of imputation tools and output settings. Moreover, we demonstrate that reconstructed haplotypes from the same individual could be linked via their genetic relatives using our Bayesian linking algorithm, which allows a substantial portion of the individual's diploid genome to be reassembled. We also provide population genetic estimates of the proportion of a panel that could be linked when an adversary holds a varying number of genomes from the same population. CONCLUSIONS Our results show that genomes in imputation server reference panels can be vulnerable to reconstruction, implying that additional safeguards may need to be considered. We suggest possible mitigation measures based on our findings. Our work illustrates the value of adversarial algorithms in uncovering new privacy risks to help inform the genomics community towards secure data sharing practices.
Affiliation(s)
- Hyunghoon Cho
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Section of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, USA.
2
Smajlović H, Shajii A, Berger B, Cho H, Numanagić I. Sequre: a high-performance framework for secure multiparty computation enables biomedical data sharing. Genome Biol 2023; 24:5. [PMID: 36631897 DOI: 10.1186/s13059-022-02841-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2022] [Accepted: 12/21/2022] [Indexed: 01/12/2023] Open
Abstract
Secure multiparty computation (MPC) is a cryptographic tool that allows computation on sensitive biomedical data without revealing private information to the involved entities. Here, we introduce Sequre, an easy-to-use, high-performance framework for developing MPC applications. Sequre offers a set of automatic compile-time optimizations that significantly improve the performance of MPC applications and incorporates the syntax of the Python programming language to facilitate rapid application development. We demonstrate its usability and performance on various bioinformatics tasks, showing up to 3-4 times increased speed over the existing pipelines with 7-fold reductions in codebase sizes.
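To make the MPC idea in this abstract concrete, the following is a minimal sketch of additive secret sharing, one common building block of MPC. It is not Sequre's API or syntax; the modulus `PRIME`, the three-party default, and the function names `share`, `reconstruct`, and `add_shared` are all assumptions chosen for illustration.

```python
import secrets

# Field modulus for the sketch (an assumption, not taken from Sequre).
PRIME = 2**61 - 1


def share(value, n_parties=3):
    """Split an integer into n additive shares modulo PRIME.

    Any subset of fewer than n shares is uniformly random and
    reveals nothing about the value.
    """
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine all shares to recover the original value."""
    return sum(shares) % PRIME


def add_shared(a_shares, b_shares):
    """Add two shared values: each party adds its own two shares locally."""
    return [(a + b) % PRIME for a, b in zip(a_shares, b_shares)]


if __name__ == "__main__":
    x = share(10)
    y = share(32)
    print(reconstruct(add_shared(x, y)))
```

Addition needs no communication between parties, which is why it is cheap in this setting; multiplication of shared values requires extra protocol steps that the sketch leaves out.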
3
Kim M, Wang S, Jiang X, Harmanci A. SVAT: Secure outsourcing of variant annotation and genotype aggregation. BMC Bioinformatics 2022; 23:409. [PMID: 36182914 PMCID: PMC9526274 DOI: 10.1186/s12859-022-04959-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2021] [Accepted: 09/20/2022] [Indexed: 11/10/2022] Open
Abstract
Background Sequencing of thousands of samples provides genetic variants with allele frequencies spanning a very large spectrum and gives invaluable insight into genetic determinants of diseases. Protecting the genetic privacy of participants is challenging as only a few rare variants can easily re-identify an individual among millions. In certain cases, there are policy barriers against sharing genetic data from indigenous populations and stigmatizing conditions. Results We present SVAT, a method for secure outsourcing of variant annotation and aggregation, which are two basic steps in variant interpretation and detection of causal variants. SVAT uses homomorphic encryption to encrypt the data at the client-side. The data always stays encrypted while it is stored, in-transit, and most importantly while it is analyzed. SVAT makes use of a vectorized data representation to convert annotation and aggregation into efficient vectorized operations in a single framework. Also, SVAT utilizes a secure re-encryption approach so that multiple disparate genotype datasets can be combined for federated aggregation and secure computation of allele frequencies on the aggregated dataset. Conclusions Overall, SVAT provides a secure, flexible, and practical framework for privacy-aware outsourcing of annotation, filtering, and aggregation of genetic variants. SVAT is publicly available for download from https://github.com/harmancilab/SVAT. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04959-6.
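The federated aggregation step described in this abstract can be pictured in plaintext as follows. This sketch only shows the arithmetic of pooling allele counts across sites; it performs no encryption and is not SVAT's code. It assumes diploid samples with genotypes coded as alternate-allele dosages (0, 1, or 2), and the function names are hypothetical.

```python
def allele_counts(genotypes):
    """Summarize one site's data.

    genotypes: list of samples, each a list of alt-allele dosages
    (0, 1, or 2) with one entry per variant.
    Returns (per-variant alt-allele counts, number of samples).
    """
    n_variants = len(genotypes[0])
    counts = [0] * n_variants
    for sample in genotypes:
        for j, dosage in enumerate(sample):
            counts[j] += dosage
    return counts, len(genotypes)


def pooled_allele_frequencies(site_summaries):
    """Combine per-site summaries into allele frequencies.

    Assumes diploid samples, so each sample contributes two alleles.
    """
    count_vectors = [counts for counts, _ in site_summaries]
    total_counts = [sum(column) for column in zip(*count_vectors)]
    total_samples = sum(n for _, n in site_summaries)
    return [c / (2 * total_samples) for c in total_counts]


if __name__ == "__main__":
    site_a = [[0, 1, 2], [1, 1, 0]]  # two samples, three variants
    site_b = [[2, 0, 0]]             # one sample, three variants
    summaries = [allele_counts(site_a), allele_counts(site_b)]
    print(pooled_allele_frequencies(summaries))
```

In a scheme like the one the abstract describes, the per-site count vectors would be encrypted before leaving each site and summed under encryption; the element-wise sums above are the kind of vectorized operation that maps naturally onto that setting.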
Affiliation(s)
- Miran Kim
- Department of Mathematics, Hanyang University, Seoul, 04763, Republic of Korea
- Su Wang
- Center for Precision Health, School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
- Xiaoqian Jiang
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
- Arif Harmanci
- Center for Precision Health, School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA.
4
Hekel R, Budis J, Kucharik M, Radvanszky J, Pös Z, Szemes T. Privacy-preserving storage of sequenced genomic data. BMC Genomics 2021; 22:712. [PMID: 34600465 PMCID: PMC8487550 DOI: 10.1186/s12864-021-07996-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/22/2021] [Accepted: 09/10/2021] [Indexed: 11/23/2022] Open
Abstract
Background The current and future applications of genomic data may raise ethical and privacy concerns. Processing and storing of this data introduce a risk of abuse by potential offenders since the human genome contains sensitive personal information. For this reason, we have developed a privacy-preserving method, named Varlock providing secure storage of sequenced genomic data. We used a public set of population allele frequencies to mask the personal alleles detected in genomic reads. Each personal allele described by the public set is masked by a randomly selected population allele with respect to its frequency. Masked alleles are preserved in an encrypted confidential file that can be shared in whole or in part using public-key cryptography. Results Our method masked the personal variants and introduced new variants detected in a personal masked genome. Alternative alleles with lower population frequency were masked and introduced more often. We performed a joint PCA analysis of personal and masked VCFs, showing that the VCFs between the two groups cannot be trivially mapped. Moreover, the method is reversible and personal alleles in specific genomic regions can be unmasked on demand. Conclusion Our method masks personal alleles within genomic reads while preserving valuable non-sensitive properties of sequenced DNA fragments for further research. Personal alleles in the desired genomic regions may be restored and shared with patients, clinics, and researchers. We suggest that the method can provide an additional security layer for storing and sharing of the raw aligned reads. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07996-2.
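The masking step described in this abstract (replace each personal allele covered by the public set with a population allele drawn according to its frequency, and keep the original aside so the change can be reversed) can be sketched as below. This is an illustration under simplifying assumptions, not Varlock's implementation: it works on a position-to-allele mapping rather than on reads, it keeps the confidential record in memory instead of encrypting it, and the names `mask_alleles` and `unmask` are hypothetical.

```python
import random


def mask_alleles(personal, population_freqs, rng):
    """Mask personal alleles with frequency-weighted population alleles.

    personal: dict mapping position -> personal allele.
    population_freqs: dict mapping position -> {allele: frequency}.
    Returns (masked, secret), where secret holds the original alleles
    and would be stored encrypted in a real system.
    """
    masked, secret = {}, {}
    for pos, allele in personal.items():
        freqs = population_freqs.get(pos)
        if freqs is None:
            # Position not described by the public set: left unchanged.
            masked[pos] = allele
            continue
        alleles = list(freqs)
        weights = [freqs[a] for a in alleles]
        masked[pos] = rng.choices(alleles, weights=weights, k=1)[0]
        secret[pos] = allele
    return masked, secret


def unmask(masked, secret, positions=None):
    """Restore original alleles, optionally only at selected positions."""
    restored = dict(masked)
    targets = secret if positions is None else positions
    for pos in targets:
        if pos in secret:
            restored[pos] = secret[pos]
    return restored


if __name__ == "__main__":
    rng = random.Random(0)
    personal = {100: "A", 200: "G", 300: "T"}
    freqs = {100: {"A": 0.7, "C": 0.3}, 200: {"G": 0.5, "T": 0.5}}
    masked, secret = mask_alleles(personal, freqs, rng)
    print(masked)
    print(unmask(masked, secret) == personal)
```

Passing a subset of positions to `unmask` mirrors the abstract's point that personal alleles in specific genomic regions can be restored on demand while the rest stay masked.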
Affiliation(s)
- Rastislav Hekel
- Geneton s.r.o, Bratislava, Slovakia; Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia; Slovak Centre of Scientific and Technical Information, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia
- Jaroslav Budis
- Geneton s.r.o, Bratislava, Slovakia; Slovak Centre of Scientific and Technical Information, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia
- Marcel Kucharik
- Geneton s.r.o, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia
- Jan Radvanszky
- Geneton s.r.o, Bratislava, Slovakia; Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia; Biomedical Research Centre, Institute of Clinical and Translational Research, Slovak Academy of Sciences, Bratislava, Slovakia
- Zuzana Pös
- Geneton s.r.o, Bratislava, Slovakia; Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia; Biomedical Research Centre, Institute of Clinical and Translational Research, Slovak Academy of Sciences, Bratislava, Slovakia
- Tomas Szemes
- Geneton s.r.o, Bratislava, Slovakia; Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia; Comenius University Science Park, Bratislava, Slovakia
5
Meyerson W, Leisman J, Navarro FCP, Gerstein M. Origins and characterization of variants shared between databases of somatic and germline human mutations. BMC Bioinformatics 2020; 21:227. [PMID: 32498674 PMCID: PMC7273669 DOI: 10.1186/s12859-020-3508-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 02/24/2020] [Accepted: 04/20/2020] [Indexed: 01/26/2023] Open
Abstract
Background Mutations arise in the human genome in two major settings: the germline and the soma. These settings involve different inheritance patterns, time scales, chromatin structures, and environmental exposures, all of which impact the resulting distribution of substitutions. Nonetheless, many of the same single nucleotide variants (SNVs) are shared between germline and somatic mutation databases, such as between the gnomAD database of 120,000 germline exomes and the TCGA database of 10,000 somatic exomes. Here, we sought to explain this overlap. Results After strict filtering to exclude common germline polymorphisms and sites with poor coverage or mappability, we found 336,987 variants shared between the somatic and germline databases. A uniform statistical model explains 34% of these shared variants; a model that incorporates the varying mutation rates of the basic mutation types explains another 50% of shared variants; and a model that includes extended nucleotide contexts (e.g. surrounding 3 bases on either side) explains an additional 4% of shared variants. Analysis of read depth finds mixed evidence that up to 4% of the shared variants may represent germline variants leaked into somatic call sets. 9% of the shared variants are not explained by any model. Sequencing errors and convergent evolution did not account for these. We surveyed other factors as well: Cancers driven by endogenous mutational processes share a greater fraction of variants with the germline, and recently derived germline variants were more likely to be somatically shared than were ancient germline ones. Conclusions Overall, we find that shared variants largely represent bona fide biological occurrences of the same variant in the germline and somatic setting and arise primarily because DNA has some of the same basic chemical vulnerabilities in either setting. 
Moreover, we find mixed evidence that somatic call-sets leak appreciable numbers of germline variants, which is relevant to genomic privacy regulations. In future studies, the similar chemical vulnerability of DNA between the somatic and germline settings might be used to help identify disease-related genes by guiding the development of background-mutation models that are informed by both somatic and germline patterns of variation.
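The percentages reported in this abstract can be tallied to see how much of the overlap each successive model accounts for. The short sketch below uses only the figures quoted in the abstract; the variable names are hypothetical, and the per-model variant counts it derives are rough approximations, since the abstract reports rounded percentages rather than counts.

```python
TOTAL_SHARED = 336_987  # shared variants after filtering, from the abstract

# Additional share of variants explained by each successive model (percent).
explained_pct = [
    ("uniform model", 34),
    ("mutation-type rates", 50),
    ("extended nucleotide context", 4),
]

rows = []
cumulative_pct = 0
for name, pct in explained_pct:
    cumulative_pct += pct
    approx_variants = round(TOTAL_SHARED * pct / 100)
    rows.append((name, pct, cumulative_pct, approx_variants))

if __name__ == "__main__":
    for name, pct, cum, approx in rows:
        print(f"{name}: +{pct}% (cumulative {cum}%), roughly {approx} variants")
```

The three models together account for 88% of shared variants. The abstract separately reports up to 4% as possible germline leakage and 9% as unexplained; because these are rounded figures and the leakage value is an upper bound, they need not sum to exactly 100%.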
Affiliation(s)
- William Meyerson
- Computational Biology & Bioinformatics, Yale University, New Haven, CT, 06511, USA; Yale School of Medicine, Yale University, New Haven, CT, 06510, USA
- John Leisman
- Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT, 06510, USA
- Fabio C P Navarro
- Computational Biology & Bioinformatics, Yale University, New Haven, CT, 06511, USA; Molecular Biophysics & Biochemistry, Yale University, New Haven, CT, 06511, USA
- Mark Gerstein
- Computational Biology & Bioinformatics, Yale University, New Haven, CT, 06511, USA; Yale School of Medicine, Yale University, New Haven, CT, 06510, USA; Molecular Biophysics & Biochemistry, Yale University, New Haven, CT, 06511, USA; Department of Computer Science, Yale University, New Haven, CT, 06511, USA
6
Park S, Kim M, Seo S, Hong S, Han K, Lee K, Cheon JH, Kim S. A secure SNP panel scheme using homomorphically encrypted K-mers without SNP calling on the user side. BMC Genomics 2019; 20:188. [PMID: 30967116 PMCID: PMC6456943 DOI: 10.1186/s12864-019-5473-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/30/2023] Open
Abstract
BACKGROUND Single Nucleotide Polymorphism (SNP) in the genome has become crucial information for clinical use. For example, the targeted cancer therapy is primarily based on the information which clinically important SNPs are detectable from the tumor. Many hospitals have developed their own panels that include clinically important SNPs. The genome information exchange between the patient and the hospital has become more popular. However, the genome sequence information is innate and irreversible and thus its leakage has serious consequences. Therefore, protecting one's genome information is critical. On the other side, hospitals may need to protect their own panels. There is no known secure SNP panel scheme to protect both. RESULTS In this paper, we propose a secure SNP panel scheme using homomorphically encrypted K-mers without requiring SNP calling on the user side and without revealing the panel information to the user. Use of the powerful homomorphic encryption technique is desirable, but there is no known algorithm to efficiently align two homomorphically encrypted sequences. Thus, we designed and implemented a novel secure SNP panel scheme utilizing the computationally feasible equality test on two homomorphically encrypted K-mers. To make the scheme work correctly, in addition to SNPs in the panel, sequence variations at the population level should be addressed. We designed a concept of Point Deviation Tolerance (PDT) level to address the false positives and false negatives. Using the TCGA BRCA dataset, we demonstrated that our scheme works at the level of over a hundred thousand somatic mutations. In addition, we provide a computational guideline for the panel design, including the size of K-mer and the number of SNPs. CONCLUSIONS The proposed method is the first of its kind to protect both the user's sequence and the hospital's panel information using the powerful homomorphic encryption scheme. 
We demonstrated that the scheme works with a simulated dataset and the TCGA BRCA dataset. In this study, we have shown only the feasibility of the proposed scheme, and much more effort is needed to make the scheme usable in clinical practice.
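The core matching idea in this abstract, testing whether any K-mer from the user's sequence equals a K-mer that carries a panel SNP, can be shown in plaintext. The sketch below is not the authors' scheme: it performs no encryption, it does not model the Point Deviation Tolerance level (which the abstract does not define), and the helper names, 0-based coordinates, and centered K-mer layout are assumptions.

```python
def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def snp_kmer(reference, pos, alt, k):
    """Build the k-mer centered on a 0-based position, carrying the alt allele.

    Assumes k is odd and the window lies fully inside the reference.
    """
    start = pos - k // 2
    return reference[start:pos] + alt + reference[pos + 1:start + k]


def detect_snps(read, panel_kmers, k):
    """Report panel SNPs whose k-mer exactly matches some k-mer of the read.

    In the encrypted setting, this set lookup would be replaced by
    equality tests on homomorphically encrypted k-mers.
    """
    read_kmers = set(kmers(read, k))
    return [name for name, kmer in panel_kmers.items() if kmer in read_kmers]


if __name__ == "__main__":
    reference = "ACGTACGTACGT"
    k = 5
    panel = {"snp_C5T": snp_kmer(reference, 5, "T", k)}
    print(detect_snps("GTATGTAC", panel, k))  # read carrying the alt allele
    print(detect_snps("GTACGTAC", panel, k))  # read matching the reference
```

Because matching relies on exact equality, any other variation inside the K-mer window would hide a true SNP, which is the kind of population-level sequence variation the abstract says the scheme has to address.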
Affiliation(s)
- Sungjoon Park
- Department of Computer Science and Engineering, Seoul National University, Seoul, Republic of Korea
- Minsu Kim
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Seungwan Hong
- Department of Mathematical Sciences, Seoul National University, Seoul, Republic of Korea
- Kyoohyung Han
- Department of Mathematical Sciences, Seoul National University, Seoul, Republic of Korea
- Keewoo Lee
- Department of Mathematical Sciences, Seoul National University, Seoul, Republic of Korea
- Jung Hee Cheon
- Department of Mathematical Sciences, Seoul National University, Seoul, Republic of Korea
- Sun Kim
- Department of Computer Science and Engineering, Seoul National University, Seoul, Republic of Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea; Bioinformatics Institute, Seoul National University, Seoul, Republic of Korea
7
Abstract
With the rapid advancement of high-throughput DNA sequencing technologies, genomics has become a big data discipline in which large-scale genetic information about human individuals can be obtained efficiently and at low cost. However, such a massive amount of personal genomic data creates tremendous challenges for privacy, especially given the emergence of the direct-to-consumer (DTC) industry that provides genetic testing services. Here we review recent developments in genomic big data and their implications for privacy. We also discuss the current dilemmas and future challenges of genomic privacy.
8
Raisaro JL, McLaren PJ, Fellay J, Cavassini M, Klersy C, Hubaux JP. Are privacy-enhancing technologies for genomic data ready for the clinic? A survey of medical experts of the Swiss HIV Cohort Study. J Biomed Inform 2018; 79:1-6. [PMID: 29331453 DOI: 10.1016/j.jbi.2017.12.013] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/12/2017] [Revised: 12/21/2017] [Accepted: 12/23/2017] [Indexed: 12/23/2022]
Abstract
PURPOSE Protecting patient privacy is a major obstacle for the implementation of genomic-based medicine. Emerging privacy-enhancing technologies can become key enablers for managing sensitive genetic data. We studied physicians' attitude toward this kind of technology in order to derive insights that might foster their future adoption for clinical care. METHODS We conducted a questionnaire-based survey among 55 physicians of the Swiss HIV Cohort Study who tested the first implementation of a privacy-preserving model for delivering genomic test results. We evaluated their feedback on three different aspects of our model: clinical utility, ability to address privacy concerns and system usability. RESULTS 38/55 (69%) physicians participated in the study. Two thirds of them acknowledged genetic privacy as a key aspect that needs to be protected to help building patient trust and deploy new-generation medical information systems. All of them successfully used the tool for evaluating their patients' pharmacogenomics risk and 90% were happy with the user experience and the efficiency of the tool. Only 8% of physicians were unsatisfied with the level of information and wanted to have access to the patient's actual DNA sequence. CONCLUSION This survey, although limited in size, represents the first evaluation of privacy-preserving models for genomic-based medicine. It has allowed us to derive unique insights that will improve the design of these new systems in the future. In particular, we have observed that a clinical information system that uses homomorphic encryption to provide clinicians with risk information based on sensitive genetic test results can offer information that clinicians feel sufficient for their needs and appropriately respectful of patients' privacy. 
The ability of this kind of system to ensure strong security and privacy guarantees and to provide some analytics on encrypted data has been assessed as a key enabler for the management of sensitive medical information in the near future. Providing clinically relevant information to physicians while protecting patients' privacy in order to comply with regulations is crucial for the widespread use of these new technologies.
Affiliation(s)
- Jean-Louis Raisaro
- School of Computer Communications Sciences, École Polytechnique Fédérale de Lausanne, Switzerland
- Paul J McLaren
- J.C. Wilt Infectious Diseases Research Centre, National Microbiology Laboratories, Public Health Agency of Canada, Winnipeg, Canada; Department of Medical Microbiology and Infectious Diseases, University of Manitoba, Winnipeg, Canada
- Jacques Fellay
- School of Life Sciences, École Polytechnique Fédérale de Lausanne, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Matthias Cavassini
- Division of Infectious Diseases, Lausanne University Hospital, Switzerland
- Catherine Klersy
- Service of Biometry and Clinical Epidemiology, Fondazione IRCCS Policlinico San Matteo, Pavia, Italy
- Jean-Pierre Hubaux
- School of Computer Communications Sciences, École Polytechnique Fédérale de Lausanne, Switzerland.