1. Al Badawi A, Faizal Bin Yusof M. Private pathological assessment via machine learning and homomorphic encryption. BioData Min 2024; 17:33. [PMID: 39252108 PMCID: PMC11385496 DOI: 10.1186/s13040-024-00379-9] [Received: 03/23/2024] [Accepted: 08/08/2024] [Indexed: 09/11/2024] [Open access]
Abstract
PURPOSE: This research explores the applicability of machine learning and fully homomorphic encryption (FHE) to private pathological assessment, focusing on the inference phase of support vector machines (SVM) for classifying confidential medical data. METHODS: A framework is introduced that uses the Cheon-Kim-Kim-Song (CKKS) FHE scheme to execute SVM inference on encrypted datasets. The framework preserves the privacy of patient data and removes the need for decryption during analysis. An efficient feature extraction technique is also presented for transforming medical images into vector representations. RESULTS: Evaluation across various datasets supports the system's practicality and efficacy. The proposed method delivers classification accuracy and performance on par with traditional, non-encrypted SVM inference while maintaining a 128-bit security level against known cryptographic attacks on the CKKS scheme. Secure inference completes within seconds. CONCLUSION: The findings underscore the viability of FHE for improving the security and efficiency of bioinformatics analyses, potentially benefiting fields such as cardiology, oncology, and medical imaging. The implications are significant for the future of privacy-preserving machine learning, promoting progress in diagnostic procedures, tailored medical treatments, and clinical investigations.
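As a rough illustration of why SVM inference is a natural fit for CKKS, the sketch below evaluates an SVM decision function in plaintext using only additions and multiplications, the operations a CKKS ciphertext supports natively. The linear kernel, the toy weights, and the function name `svm_score` are assumptions made for illustration, not details taken from the paper, and no encryption is performed here.

```python
def svm_score(weights, bias, features):
    """Linear SVM decision value w.x + b, built only from + and *."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    score = bias
    for w, x in zip(weights, features):
        score += w * x
    return score


# Hypothetical model and feature vector (illustrative values only).
weights = [0.5, -1.0, 2.0]
bias = -0.25
features = [1.0, 0.5, 0.25]

score = svm_score(weights, bias, features)
# The sign is taken in the clear here; comparison is not a native CKKS operation.
label = 1 if score >= 0 else -1
print(score, label)
```

In an encrypted deployment, one plausible arrangement is for the server to return the encrypted score and for the data owner to take the sign after decryption.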
Affiliation(s)
- Ahmad Al Badawi
  - Department of Homeland Security, Rabdan Academy, Dhafeer St, Al Sa'adah, 22401, Abu Dhabi, United Arab Emirates
- Mohd Faizal Bin Yusof
  - Department of Homeland Security, Rabdan Academy, Dhafeer St, Al Sa'adah, 22401, Abu Dhabi, United Arab Emirates
2. Cho H, Froelicher D, Dokmai N, Nandi A, Sadhuka S, Hong MM, Berger B. Privacy-Enhancing Technologies in Biomedical Data Science. Annu Rev Biomed Data Sci 2024; 7:317-343. [PMID: 39178425 PMCID: PMC11346580 DOI: 10.1146/annurev-biodatasci-120423-120107] [Indexed: 08/25/2024]
Abstract
The rapidly growing scale and variety of biomedical data repositories raise important privacy concerns. Conventional frameworks for collecting and sharing human subject data offer limited privacy protection, often necessitating the creation of data silos. Privacy-enhancing technologies (PETs) promise to safeguard these data and broaden their usage by providing means to share and analyze sensitive data while protecting privacy. Here, we review prominent PETs and illustrate their role in advancing biomedicine. We describe key use cases of PETs and their latest technical advances and highlight recent applications of PETs in a range of biomedical domains. We conclude by discussing outstanding challenges and social considerations that need to be addressed to facilitate a broader adoption of PETs in biomedical data science.
Affiliation(s)
- Hyunghoon Cho
  - Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, Connecticut, USA
- David Froelicher
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Natnatee Dokmai
  - Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, Connecticut, USA
- Anupama Nandi
  - Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, Connecticut, USA
- Shuvom Sadhuka
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Matthew M Hong
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Bonnie Berger
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
3. Hong S, Choi YA, Joo DS, Gürsoy G. Privacy-preserving model evaluation for logistic and linear regression using homomorphically encrypted genotype data. J Biomed Inform 2024; 156:104678. [PMID: 38936565 PMCID: PMC11272436 DOI: 10.1016/j.jbi.2024.104678] [Received: 04/16/2024] [Revised: 05/29/2024] [Accepted: 06/19/2024] [Indexed: 06/29/2024]
Abstract
OBJECTIVE: Linear and logistic regression are widely used statistical techniques in population genetics for analyzing genetic data and uncovering patterns and associations in large genetic datasets, such as identifying genetic variations linked to specific diseases or traits. However, obtaining statistically significant results from these studies requires large amounts of sensitive genotype and phenotype information from thousands of patients, which raises privacy concerns. Although cryptographic techniques such as homomorphic encryption offer a potential solution by allowing computations on encrypted data, previous methods leveraging homomorphic encryption have not addressed the confidentiality of shared models, which can leak information about the training data. METHODS: In this work, we present a secure model evaluation method for linear and logistic regression using homomorphic encryption for six prediction tasks, where input genotypes, output phenotypes, and model parameters are all encrypted. RESULTS: Our method ensures no private information leakage during inference and achieves high accuracy (≥93% for all outcomes), with each inference taking less than ten seconds for ∼200 genomes. CONCLUSION: Our study demonstrates that it is possible to perform linear and logistic regression model evaluation while protecting patient confidentiality with theoretical security guarantees. Our implementation and test data are available at https://github.com/G2Lab/privateML/.
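Logistic regression needs a sigmoid, which cannot be computed directly from the additions and multiplications that homomorphic encryption provides. A common workaround is a low-degree polynomial stand-in. The plaintext sketch below uses the degree-3 Taylor polynomial of the sigmoid around zero; the choice of polynomial, the toy weights, and the names `sigmoid_taylor3` and `predict_probability` are assumptions for illustration and are not taken from the paper or its repository.

```python
import math


def sigmoid(x):
    """Exact logistic function, used here only as a plaintext reference."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_taylor3(x):
    """Degree-3 Taylor polynomial of the sigmoid around 0: 1/2 + x/4 - x^3/48."""
    return 0.5 + x / 4.0 - x ** 3 / 48.0


def predict_probability(weights, bias, genotypes):
    """Logistic model evaluation built only from + and * on the inputs."""
    z = bias + sum(w * g for w, g in zip(weights, genotypes))
    return sigmoid_taylor3(z)


# Hypothetical model and one individual's 0/1/2 genotype dosages.
weights = [0.2, -0.1, 0.05]
bias = 0.1
genotypes = [0, 1, 2]

approx = predict_probability(weights, bias, genotypes)
print(approx)
```

The Taylor polynomial is accurate only near zero, so a practical system would likely bound the linear predictor or fit a polynomial over a wider interval.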
Affiliation(s)
- Seungwan Hong
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA
- Yoolim A Choi
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA
- Daniel S Joo
  - New York Genome Center, New York, NY 10013, USA; Department of Computer Science, Columbia University, New York, NY 10032, USA
- Gamze Gürsoy
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA; Department of Computer Science, Columbia University, New York, NY 10032, USA
4. Aherrahrou N, Tairi H, Aherrahrou Z. Genomic privacy preservation in genome-wide association studies: taxonomy, limitations, challenges, and vision. Brief Bioinform 2024; 25:bbae356. [PMID: 39073827 DOI: 10.1093/bib/bbae356] [Received: 01/24/2024] [Revised: 06/19/2024] [Accepted: 07/12/2024] [Indexed: 07/30/2024] [Open access]
Abstract
Genome-wide association studies (GWAS) serve as a crucial tool for identifying genetic factors associated with specific traits. However, ethical constraints prevent the direct exchange of genetic information, prompting the need for privacy preservation solutions. To address these issues, earlier works rely on cryptographic mechanisms such as homomorphic encryption, secure multi-party computing, and differential privacy. Very recently, federated learning has emerged as a promising solution for enabling secure and collaborative GWAS computations. This work provides an extensive overview of existing methods for privacy preservation in GWAS, with the main focus on collaborative and distributed approaches. The survey analyzes the challenges faced by existing methods and their limitations, and offers insights into designing efficient solutions.
Affiliation(s)
- Noura Aherrahrou
  - LISAC, Department of Computer Science, Faculty of Sciences Dhar El Mahraz, University Sidi Mohamed Ben Abdellah, B.P. 1796 - Atlas, 30003, Fez, Morocco
- Hamid Tairi
  - LISAC, Department of Computer Science, Faculty of Sciences Dhar El Mahraz, University Sidi Mohamed Ben Abdellah, B.P. 1796 - Atlas, 30003, Fez, Morocco
- Zouhair Aherrahrou
  - Institute for Cardiogenetics, Universität zu Lübeck, D-23562 Lübeck, Germany
  - DZHK (German Centre for Cardiovascular Research), Partner Site Hamburg/Kiel/Lübeck, Germany
  - University Heart Centre Lübeck, D-23562 Lübeck, Germany
5. Brauneck A, Schmalhorst L, Weiss S, Baumbach L, Völker U, Ellinghaus D, Baumbach J, Buchholtz G. Legal aspects of privacy-enhancing technologies in genome-wide association studies and their impact on performance and feasibility. Genome Biol 2024; 25:154. [PMID: 38872191 PMCID: PMC11170858 DOI: 10.1186/s13059-024-03296-6] [Received: 05/16/2023] [Accepted: 06/03/2024] [Indexed: 06/15/2024] [Open access]
Abstract
Genomic data holds huge potential for medical progress but requires strict safety measures due to its sensitive nature to comply with data protection laws. This conflict is especially pronounced in genome-wide association studies (GWAS) which rely on vast amounts of genomic data to improve medical diagnoses. To ensure both their benefits and sufficient data security, we propose a federated approach in combination with privacy-enhancing technologies utilising the findings from a systematic review on federated learning and legal regulations in general and applying these to GWAS.
Affiliation(s)
- Alissa Brauneck
  - Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany
- Louisa Schmalhorst
  - Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany
- Stefan Weiss
  - Interfaculty Institute of Genetics and Functional Genomics, Department of Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- Linda Baumbach
  - Department of Health Economics and Health Services Research, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Uwe Völker
  - Interfaculty Institute of Genetics and Functional Genomics, Department of Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- David Ellinghaus
  - Institute of Clinical Molecular Biology (IKMB), Kiel University and University Medical Center Schleswig-Holstein, Kiel, Germany
- Jan Baumbach
  - Institute for Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Gabriele Buchholtz
  - Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany
6. Rujano MA, Boiten JW, Ohmann C, Canham S, Contrino S, David R, Ewbank J, Filippone C, Connellan C, Custers I, van Nuland R, Mayrhofer MT, Holub P, Álvarez EG, Bacry E, Hughes N, Freeberg MA, Schaffhauser B, Wagener H, Sánchez-Pla A, Bertolini G, Panagiotopoulou M. Sharing sensitive data in life sciences: an overview of centralized and federated approaches. Brief Bioinform 2024; 25:bbae262. [PMID: 38836701 DOI: 10.1093/bib/bbae262] [Received: 02/06/2024] [Revised: 04/19/2024] [Indexed: 06/06/2024] [Open access]
Abstract
Biomedical data are generated and collected from various sources, including medical imaging, laboratory tests and genome sequencing. Sharing these data for research can help address unmet health needs, contribute to scientific breakthroughs, accelerate the development of more effective treatments and inform public health policy. Due to the potential sensitivity of such data, however, privacy concerns have led to policies that restrict data sharing. In addition, sharing sensitive data requires a secure and robust infrastructure with appropriate storage solutions. Here, we examine and compare the centralized and federated data sharing models through the prism of five large-scale and real-world use cases of strategic significance within the European data sharing landscape: the French Health Data Hub, the BBMRI-ERIC Colorectal Cancer Cohort, the federated European Genome-phenome Archive, the Observational Medical Outcomes Partnership/OHDSI network and the EBRAINS Medical Informatics Platform. Our analysis indicates that centralized models facilitate data linkage, harmonization and interoperability, while federated models facilitate scaling up and legal compliance, as the data typically reside on the data generator's premises, allowing for better control of how data are shared. This comparative study thus offers guidance on the selection of the most appropriate sharing strategy for sensitive datasets and provides key insights for informed decision-making in data sharing efforts.
Affiliation(s)
- Maria A Rujano
  - European Clinical Research Infrastructure Network (ECRIN), Boulevard Saint Jacques 30, 75014, Paris, France
- Jan-Willem Boiten
  - Foundation Lygature, Jaarbeursplein 6, 3521 AL, Utrecht, The Netherlands
- Christian Ohmann
  - European Clinical Research Infrastructure Network (ECRIN), Boulevard Saint Jacques 30, 75014, Paris, France
- Steve Canham
  - European Clinical Research Infrastructure Network (ECRIN), Boulevard Saint Jacques 30, 75014, Paris, France
- Sergio Contrino
  - European Clinical Research Infrastructure Network (ECRIN), Boulevard Saint Jacques 30, 75014, Paris, France
- Romain David
  - European Research Infrastructure on Highly Pathogenic Agents (ERINHA AISBL), rue du Trône 98/Boîte 4B, 1050, Brussels, Belgium
- Jonathan Ewbank
  - European Research Infrastructure on Highly Pathogenic Agents (ERINHA AISBL), rue du Trône 98/Boîte 4B, 1050, Brussels, Belgium
- Claudia Filippone
  - European Research Infrastructure on Highly Pathogenic Agents (ERINHA AISBL), rue du Trône 98/Boîte 4B, 1050, Brussels, Belgium
- Claire Connellan
  - European Research Infrastructure on Highly Pathogenic Agents (ERINHA AISBL), rue du Trône 98/Boîte 4B, 1050, Brussels, Belgium
- Ilse Custers
  - Foundation Lygature, Jaarbeursplein 6, 3521 AL, Utrecht, The Netherlands
- Rick van Nuland
  - Foundation Lygature, Jaarbeursplein 6, 3521 AL, Utrecht, The Netherlands
- Michaela Th Mayrhofer
  - Biobanking and Biomolecular Resources Research Infrastructure (BBMRI-ERIC), Neue Stiftingtalstrasse 2/B/6, 8010, Graz, Austria
- Petr Holub
  - Biobanking and Biomolecular Resources Research Infrastructure (BBMRI-ERIC), Neue Stiftingtalstrasse 2/B/6, 8010, Graz, Austria
- Eva García Álvarez
  - Biobanking and Biomolecular Resources Research Infrastructure (BBMRI-ERIC), Neue Stiftingtalstrasse 2/B/6, 8010, Graz, Austria
- Emmanuel Bacry
  - Health Data Hub (HDH), rue Georges Pitard 9, 75015, Paris, France
- Nigel Hughes
  - Janssen Research and Development, Antwerpseweg 15, 2340, Beerse, Belgium
- Mallory A Freeberg
  - European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, CB10 1SD, Hinxton, Cambridgeshire, United Kingdom
- Birgit Schaffhauser
  - Department of Clinical Neurosciences, Centre Hospitalier Universitaire Vaudois (CHUV), Rue du Bugnon 21, 1011, Lausanne, Switzerland
- Harald Wagener
  - Center for Digital Health, BIH@Charité University Medicine, Anna-Louisa-Karsch-Straße 2, 10178, Berlin, Germany
- Alex Sánchez-Pla
  - Department of Genetics, Microbiology and Statistics, Universitat de Barcelona, Diagonal 643, 08028, Barcelona, Spain
- Guido Bertolini
  - Laboratory of Clinical Epidemiology, Istituto di Ricerche Farmacologiche Mario Negri IRCCS, Via GB Camozzi 3, 24020, Ranica (Bergamo), Italy
- Maria Panagiotopoulou
  - European Clinical Research Infrastructure Network (ECRIN), Boulevard Saint Jacques 30, 75014, Paris, France
7. Blindenbach J, Kang J, Hong S, Karam C, Lehner T, Gürsoy G. Ultra-secure storage and analysis of genetic data for the advancement of precision medicine. bioRxiv 2024:2024.04.16.589793. [PMID: 38695012 PMCID: PMC11061874 DOI: 10.1101/2024.04.16.589793] [Indexed: 05/04/2024]
Abstract
Cloud computing provides the opportunity to store the ever-growing genotype-phenotype data sets needed to achieve the full potential of precision medicine. However, due to the sensitive nature of these data and the patchwork of data privacy laws across states and countries, additional security protections are proving necessary to ensure data privacy and security. Here we present SQUiD, a secure queryable database for storing and analyzing genotype-phenotype data. With SQUiD, genotype-phenotype data can be stored in encrypted form in a low-security, low-cost public cloud, which researchers can securely query without the public cloud ever being able to decrypt the data. We demonstrate the usability of SQUiD by replicating various commonly used calculations, such as polygenic risk scores, cohort creation for GWAS, MAF filtering, and patient similarity analysis, on both synthetic and UK Biobank data. Our work represents a new and scalable platform enabling the realization of precision medicine without security and privacy concerns.
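For readers unfamiliar with the calculations named in this abstract, the plaintext sketch below shows what minor allele frequency (MAF) filtering and a polygenic risk score compute. It says nothing about how SQUiD evaluates them over encrypted data; the toy cohort, the effect sizes, the 0.2 threshold, and the function names are all hypothetical.

```python
def minor_allele_frequency(dosages):
    """MAF of one variant from 0/1/2 alternate-allele dosages."""
    alt_freq = sum(dosages) / (2.0 * len(dosages))
    return min(alt_freq, 1.0 - alt_freq)


def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of one individual's dosages across variants."""
    return sum(d * w for d, w in zip(dosages, effect_sizes))


# Rows are individuals, columns are variants (hypothetical toy data).
cohort = [
    [0, 1, 2],
    [1, 0, 2],
    [0, 0, 2],
    [1, 1, 1],
]
effects = [0.5, -0.25, 0.125]

# Transpose so that each tuple holds one variant's dosages across the cohort.
variant_columns = list(zip(*cohort))
mafs = [minor_allele_frequency(col) for col in variant_columns]

# Keep the indices of variants whose MAF reaches the (hypothetical) threshold.
kept = [j for j, maf in enumerate(mafs) if maf >= 0.2]

scores = [polygenic_risk_score(row, effects) for row in cohort]
print(mafs, kept, scores)
```

Both computations reduce to sums and products of dosages, apart from the `min` and the threshold comparison, which is the kind of step an encrypted implementation would have to handle differently.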
Affiliation(s)
- Jacob Blindenbach
  - Department of Computer Science, Columbia University
  - Department of Biomedical Informatics, Columbia University
  - New York Genome Center
  - These authors contributed equally
- Jiayi Kang
  - COSIC, KU Leuven
  - These authors contributed equally
- Seungwan Hong
  - Department of Biomedical Informatics, Columbia University
  - New York Genome Center
  - These authors contributed equally
- Gamze Gürsoy
  - Department of Computer Science, Columbia University
  - Department of Biomedical Informatics, Columbia University
  - New York Genome Center
8. Zhong Z, Li G, Xu Z, Zeng H, Teng J, Feng X, Diao S, Gao Y, Li J, Zhang Z. Evaluating three strategies of genome-wide association analysis for integrating data from multiple populations. Anim Genet 2024; 55:265-276. [PMID: 38185881 DOI: 10.1111/age.13394] [Received: 07/28/2023] [Revised: 11/24/2023] [Accepted: 12/21/2023] [Indexed: 01/09/2024]
Abstract
In livestock, genome-wide association studies (GWAS) are usually conducted in a single population (single-GWAS) with limited sample size and detection power. To enhance the detection power of GWAS, meta-analysis of GWAS (meta-GWAS) and mega-analysis of GWAS (mega-GWAS) have been proposed to integrate data from multiple populations at the level of summary statistics or individual data, respectively. However, these strategies have rarely been compared, which makes it difficult to establish best practice for GWAS that integrate data from multiple study populations. To compare the association analysis strategies across multiple populations, we conducted single-GWAS, meta-GWAS, and mega-GWAS for backfat thickness at 100 kg (BFT_100) and days to 100 kg (DAYS_100) within each of three commercial pig breeds (Duroc, Yorkshire, and Landrace). After controlling the genomic inflation factor to one, we calculated corrected p-values (pC). In Yorkshire, which had the largest sample size, mega-GWAS, meta-GWAS, and single-GWAS detected 149, 38, and 20 significant SNPs (pC < 1E-5) associated with BFT_100, as well as 26, 4, and 1 QTL, respectively. Among them, pC of SNPs from mega-GWAS was the lowest, followed by meta-GWAS and single-GWAS. The correlation of pC among the three GWAS strategies ranged from 0.60 to 0.75, and the correlation of SNP effect values between meta-GWAS and mega-GWAS was 0.74, all showing good agreement. Collectively, even though there are differences in the integration of individual data or summary statistics, integrating data from multiple populations is an effective means of genetic argument for complex traits, especially mega-GWAS versus single-GWAS.
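Two of the steps mentioned in this abstract can be sketched with the standard library alone: combining per-population SNP effects at the summary-statistic level, and rescaling test statistics so that the genomic inflation factor equals one. The sketch assumes fixed-effect inverse-variance weighting and the usual median-based genomic control on 1-df chi-square statistics; the abstract does not state which exact procedures the authors used, and all function names and example values are illustrative.

```python
import math
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()


def inverse_variance_meta(betas, std_errors):
    """Fixed-effect meta-analysis of per-population effects for one SNP."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, math.sqrt(1.0 / total)


def p_to_chi2(p):
    """Convert a two-sided p-value in (0, 1] to a 1-df chi-square statistic."""
    return _STD_NORMAL.inv_cdf(p / 2.0) ** 2


def chi2_to_p(chi2):
    """Upper-tail probability of a 1-df chi-square statistic."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def inflation_factor(p_values):
    """Median observed chi-square divided by its expected median (about 0.455)."""
    expected_median = _STD_NORMAL.inv_cdf(0.75) ** 2
    return median(p_to_chi2(p) for p in p_values) / expected_median


def genomic_control(p_values):
    """Rescale statistics so that the inflation factor of the output is one."""
    lam = inflation_factor(p_values)
    return lam, [chi2_to_p(p_to_chi2(p) / lam) for p in p_values]


# Hypothetical summary statistics for one SNP in two populations.
beta, se = inverse_variance_meta([0.2, 0.4], [0.1, 0.1])

# Hypothetical raw p-values for a handful of SNPs.
raw_p = [0.5, 0.2, 0.04, 0.01, 0.001, 0.3, 0.7]
lam, corrected_p = genomic_control(raw_p)
print(beta, se, lam)
```

A mega-analysis, by contrast, would pool individual-level genotypes and phenotypes before fitting a single model, so it has no counterpart in this summary-statistic sketch.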
Affiliation(s)
- Zhanming Zhong
  - National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Guangzhen Li
  - National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Zhiting Xu
  - National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Haonan Zeng
  - National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Jinyan Teng
  - National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Xueyan Feng
  - National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Shuqi Diao
  - National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Yahui Gao
  - National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Jiaqi Li
  - National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Zhe Zhang
  - National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
9. Zhao T, Wang F, Mott R, Dekkers J, Cheng H. Using encrypted genotypes and phenotypes for collaborative genomic analyses to maintain data confidentiality. Genetics 2024; 226:iyad210. [PMID: 38085098 PMCID: PMC11090459 DOI: 10.1093/genetics/iyad210] [Received: 10/09/2023] [Accepted: 11/13/2023] [Indexed: 03/08/2024] [Open access]
Abstract
To adhere to and capitalize on the benefits of the FAIR (findable, accessible, interoperable, and reusable) principles in agricultural genome-to-phenome studies, it is crucial to address privacy and intellectual property issues that prevent sharing and reuse of data in research and industry. Direct sharing of genotype and phenotype data is often prohibited due to intellectual property and privacy concerns. Thus, there is a pressing need for encryption methods that obscure confidential aspects of the data, without affecting the outcomes of certain statistical analyses. A homomorphic encryption method for genotypes and phenotypes (HEGP) has been proposed for single-marker regression in genome-wide association studies (GWAS) using linear mixed models with Gaussian errors. This methodology permits frequentist likelihood-based parameter estimation and inference. In this paper, we extend HEGP to broader applications in genome-to-phenome analyses. We show that HEGP is suited to commonly used linear mixed models for genetic analyses of quantitative traits including genomic best linear unbiased prediction (GBLUP) and ridge-regression best linear unbiased prediction (RR-BLUP), as well as Bayesian variable selection methods (e.g. those in Bayesian Alphabet), for genetic parameter estimation, genomic prediction, and GWAS. By advancing the capabilities of HEGP, we offer researchers and industry professionals a secure and efficient approach for collaborative genomic analyses while preserving data confidentiality.
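The abstract describes encryption that hides the data "without affecting the outcomes of certain statistical analyses" but does not spell out the mechanism. One way such invariance can arise is sketched below under a stated assumption: the key is an orthogonal transformation applied to the intercept, genotype, and phenotype columns alike, which leaves least-squares estimates unchanged. A single Householder reflection stands in for the key here purely for brevity; it is not a secure key, and the data, names, and seed are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reflect(key, vec):
    """Apply the Householder reflection I - 2*k*k^T/(k^T k) to vec."""
    scale = 2.0 * dot(key, vec) / dot(key, key)
    return [v - scale * k for v, k in zip(vec, key)]


def fit_line(intercept_col, genotype_col, phenotype):
    """Least-squares fit of phenotype ~ b0*intercept_col + b1*genotype_col."""
    a = dot(intercept_col, intercept_col)
    b = dot(intercept_col, genotype_col)
    d = dot(genotype_col, genotype_col)
    r0 = dot(intercept_col, phenotype)
    r1 = dot(genotype_col, phenotype)
    det = a * d - b * b  # determinant of the 2x2 normal-equation matrix
    return (d * r0 - b * r1) / det, (a * r1 - b * r0) / det


# Hypothetical single-marker data: dosages and a quantitative trait.
genotypes = [0.0, 1.0, 2.0, 1.0, 0.0, 2.0]
phenotypes = [1.1, 1.9, 3.2, 2.1, 0.8, 2.9]
ones = [1.0] * len(genotypes)

# Secret key: a random direction defining the reflection (illustration only).
rng = random.Random(42)
key = [rng.gauss(0.0, 1.0) for _ in genotypes]

plain = fit_line(ones, genotypes, phenotypes)
masked = fit_line(
    reflect(key, ones), reflect(key, genotypes), reflect(key, phenotypes)
)
print(plain, masked)
```

The two fits agree up to floating-point error because an orthogonal transformation preserves every inner product that enters the normal equations, while the individual transformed values no longer resemble the original dosages.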
Affiliation(s)
- Tianjing Zhao
  - Department of Animal Science, University of California, Davis, CA 95616, USA
  - Department of Animal Science, University of Nebraska-Lincoln, Lincoln, NE 68583, USA
- Fangyi Wang
  - Department of Plant Sciences, University of California, Davis, CA 95616, USA
- Richard Mott
  - Genetics Institute, University College London, London, WC1E 6BT, UK
- Jack Dekkers
  - Department of Animal Science, Iowa State University, Ames, IA 50011, USA
- Hao Cheng
  - Department of Animal Science, University of California, Davis, CA 95616, USA
10. Bataa M, Song S, Park K, Kim M, Cheon JH, Kim S. Finding Highly Similar Regions of Genomic Sequences Through Homomorphic Encryption. J Comput Biol 2024; 31:197-212. [PMID: 38531050 DOI: 10.1089/cmb.2023.0050] [Indexed: 03/28/2024] [Open access]
Abstract
Finding highly similar regions of genomic sequences is a basic computation in genomic analysis. Genomic analyses on large amounts of data are efficiently processed in cloud environments, but outsourcing them to a cloud raises privacy and security concerns. Homomorphic encryption (HE) is a powerful cryptographic primitive that preserves the privacy of genomic data in various analyses processed in an untrusted cloud environment. We introduce an efficient algorithm for finding highly similar regions of two homomorphically encrypted sequences, and describe how to implement it using bit-wise and word-wise HE schemes. In our experiments, the algorithm outperforms an existing algorithm by up to two orders of magnitude in elapsed time. Overall, it finds highly similar regions of sequences in real data sets in a feasible time.
Affiliation(s)
- Magsarjav Bataa
  - Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
  - Department of Information and Computer Sciences, National University of Mongolia, Ulaanbaatar, Mongolia
- Siwoo Song
  - Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
- Kunsoo Park
  - Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
- Miran Kim
  - Department of Mathematics, Hanyang University, Seoul, South Korea
- Jung Hee Cheon
  - Department of Mathematical Sciences, Seoul National University, Seoul, South Korea
- Sun Kim
  - Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
11. Dong X, Lu Y, Guo L, Li C, Ni Q, Wu B, Wang H, Yang L, Wu S, Sun Q, Zheng H, Zhou W, Wang S. PICOTEES: a privacy-preserving online service of phenotype exploration for genetic-diagnostic variants from Chinese children cohorts. J Genet Genomics 2024; 51:243-251. [PMID: 37714454 DOI: 10.1016/j.jgg.2023.09.003] [Received: 04/06/2023] [Revised: 08/31/2023] [Accepted: 09/03/2023] [Indexed: 09/17/2023]
Abstract
The growth in biomedical data resources has raised potential privacy concerns and risks of genetic information leakage. For instance, exome sequencing aids clinical decisions by comparing data through web services, but it requires significant trust between users and providers. To alleviate privacy concerns, the most commonly used strategy is to anonymize sensitive data. Unfortunately, studies have shown that anonymization is insufficient to protect against reidentification attacks. Recently, privacy-preserving technologies have been applied to preserve application utility while protecting the privacy of biomedical data. We present the PICOTEES framework, a privacy-preserving online service of phenotype exploration for genetic-diagnostic variants (https://birthdefectlab.cn:3000/). PICOTEES enables privacy-preserving queries of the phenotype spectrum for a single variant by utilizing trusted execution environment technology, which can protect the privacy of the user's query information, backend models, and data, as well as the final results. We demonstrate the utility and performance of PICOTEES by exploring a bioinformatics dataset. The dataset is from a cohort containing 20,909 genetic testing patients with 3,152,508 variants from the Children's Hospital of Fudan University in China, dominated by the Chinese Han population (>99.9%). Our query results yield a large number of unreported diagnostic variants and previously reported pathogenicity.
Collapse
Affiliation(s)
- Xinran Dong
- Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China; Key Laboratory of Birth Defects, Children's Hospital of Fudan University, Shanghai 201102, China
| | - Yulan Lu
- Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China; Key Laboratory of Birth Defects, Children's Hospital of Fudan University, Shanghai 201102, China
| | - Lanting Guo
- Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, Zhejiang 310000, China
| | - Chuan Li
- Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China
| | - Qi Ni
- Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China; Key Laboratory of Birth Defects, Children's Hospital of Fudan University, Shanghai 201102, China
| | - Bingbing Wu
- Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China; Key Laboratory of Birth Defects, Children's Hospital of Fudan University, Shanghai 201102, China
| | - Huijun Wang
- Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China; Key Laboratory of Birth Defects, Children's Hospital of Fudan University, Shanghai 201102, China
| | - Lin Yang
- Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China; Key Laboratory of Birth Defects, Children's Hospital of Fudan University, Shanghai 201102, China
| | - Songyang Wu
- The Third Research Institute of the Ministry of Public Security, Shanghai 200031, China
| | - Qi Sun
- Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, Zhejiang 310000, China
| | - Hao Zheng
- Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, Zhejiang 310000, China
| | - Wenhao Zhou
- Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China; Xiamen Campus of Children's Hospital of Fudan University, Xiamen, Fujian 361006, China.
| | - Shuang Wang
- Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, Zhejiang 310000, China; Institutes for Systems Genetics, West China Hospital, Chengdu, Sichuan 610041, China; Shanghai Putuo People's Hospital, Tongji University, Shanghai 200060, China.
| |
|
12
|
Oliva A, Kaphle A, Reguant R, Sng LMF, Twine NA, Malakar Y, Wickramarachchi A, Keller M, Ranbaduge T, Chan EKF, Breen J, Buckberry S, Guennewig B, Haas M, Brown A, Cowley MJ, Thorne N, Jain Y, Bauer DC. Future-proofing genomic data and consent management: a comprehensive review of technology innovations. Gigascience 2024; 13:giae021. [PMID: 38837943 PMCID: PMC11152178 DOI: 10.1093/gigascience/giae021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Revised: 01/15/2024] [Accepted: 04/09/2024] [Indexed: 06/07/2024] Open
Abstract
Genomic information is increasingly used to inform medical treatments and manage future disease risks. However, any personal and societal gains must be carefully balanced against the risk to individuals contributing their genomic data. Expanding our understanding of actionable genomic insights requires researchers to access large global datasets to capture the complexity of genomic contribution to diseases. Similarly, clinicians need efficient access to a patient's genome as well as population-representative historical records for evidence-based decisions. Both researchers and clinicians hence rely on participants to consent to the use of their genomic data, which in turn requires trust in the professional and ethical handling of this information. Here, we review existing and emerging solutions for secure and effective genomic information management, including storage, encryption, consent, and authorization that are needed to build participant trust. We discuss recent innovations in cloud computing, quantum-computing-proof encryption, and self-sovereign identity. These innovations can augment key developments from within the genomics community, notably GA4GH Passports and the Crypt4GH file container standard. We also explore how decentralized storage as well as the digital consenting process can offer culturally acceptable processes to encourage data contributions from ethnic minorities. We conclude that the individual and their right to self-determination need to be put at the center of any genomics framework, because only at the individual level can the received benefits be accurately balanced against the risk of exposing private information.
Affiliation(s)
- Adrien Oliva
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
| | - Anubhav Kaphle
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
| | - Roc Reguant
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
| | - Letitia M F Sng
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
| | - Natalie A Twine
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
| | - Yuwan Malakar
- Responsible Innovation Future Science Platform, Commonwealth Scientific and Industrial Research Organisation, Brisbane, 41 Boggo Rd, Dutton Park QLD 4102, Australia
| | - Anuradha Wickramarachchi
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
| | - Marcel Keller
- Data61, Commonwealth Scientific and Industrial Research Organisation, Level 5/13 Garden St, Eveleigh NSW 2015, Australia
| | - Thilina Ranbaduge
- Data61, Commonwealth Scientific and Industrial Research Organisation, Building 101, Clunies Ross St, Black Mountain, Canberra, ACT 2601, Australia
| | - Eva K F Chan
- NSW Health Pathology, Sydney, 1 Reserve Road, St Leonards NSW 2065, Australia
| | - James Breen
- Telethon Kids Institute, Perth, WA 6009, Australia
- National Centre for Indigenous Genomics, The John Curtin School of Medical Research, Australian National University, Canberra, ACT 2601, Australia
| | - Sam Buckberry
- Telethon Kids Institute, Perth, WA 6009, Australia
- National Centre for Indigenous Genomics, The John Curtin School of Medical Research, Australian National University, Canberra, ACT 2601, Australia
| | - Boris Guennewig
- Sydney Medical School, Brain and Mind Centre, The University of Sydney, Sydney, 94 Mallett St, Camperdown NSW 2050, Australia
| | - Matilda Haas
- Australian Genomics, Parkville, VIC 3052, Australia
- Murdoch Children’s Research Institute, Parkville, Victoria 3052, Australia
| | - Alex Brown
- Telethon Kids Institute, Perth, WA 6009, Australia
- National Centre for Indigenous Genomics, The John Curtin School of Medical Research, Australian National University, Canberra, ACT 2601, Australia
| | - Mark J Cowley
- Children’s Cancer Institute, Lowy Cancer Research Centre, Level 4, Lowy Cancer Research Centre Corner Botany & High Streets UNSW Kensington Campus UNSW Sydney, Kensington NSW 2052, Australia
- School of Clinical Medicine, UNSW Medicine & Health, Wallace Wurth Building (C27), Cnr High St & Botany St, UNSW Sydney, Kensington NSW 2052, Australia
| | - Natalie Thorne
- University of Melbourne, Melbourne, Parkville VIC 3052, Australia
- Melbourne Genomics Health Alliance, Melbourne 1G, Walter and Eliza Hall Institute/1G Royal Parade, Parkville VIC 3052, Australia
- Walter and Eliza Hall Institute, Melbourne, 1G, Walter and Eliza Hall Institute/1G Royal Parade, Parkville VIC 3052, Australia
| | - Yatish Jain
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
- Applied BioSciences, Faculty of Science and Engineering, Macquarie University, Applied BioSciences 205B Culloden Rd Macquarie University, NSW 2109, Australia
| | - Denis C Bauer
- Applied BioSciences, Faculty of Science and Engineering, Macquarie University, Applied BioSciences 205B Culloden Rd Macquarie University, NSW 2109, Australia
- Department of Biomedical Sciences, MQ Health General Practice - Macquarie University, Suite 305, Level 3/2 Technology Pl, Macquarie Park NSW 2109, Australia
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Gate 13, Kintore Avenue University of Adelaide, Adelaide SA 5000, Australia
| |
|
13
|
Zhang QX, Liu T, Guo X, Zhen J, Yang MY, Khederzadeh S, Zhou F, Han X, Zheng Q, Jia P, Ding X, He M, Zou X, Liao JK, Zhang H, He J, Zhu X, Lu D, Chen H, Zeng C, Liu F, Zheng HF, Liu S, Xu HM, Chen GB. Searching across-cohort relatives in 54,092 GWAS samples via encrypted genotype regression. PLoS Genet 2024; 20:e1011037. [PMID: 38206971 PMCID: PMC10783776 DOI: 10.1371/journal.pgen.1011037] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/08/2023] [Accepted: 12/13/2023] [Indexed: 01/13/2024] Open
Abstract
Explicitly sharing individual-level data in genomics studies has many merits compared with sharing summary statistics, including stricter QC, common statistical analyses, relative identification, and improved statistical power in GWAS, but it is hampered by privacy or ethical constraints. In this study, we developed encG-reg, a regression approach that can detect relatives of various degrees based on encrypted genomic data, which is free of these ethical constraints. The encryption properties of encG-reg are based on random matrix theory: the original genotype matrix is masked without sacrificing the precision of individual-level genotype data. We established a connection between the dimension of the random matrix that masks the genotype matrices and the required precision of a study for encrypted genotype data. encG-reg has false positive and false negative rates equivalent to those of sharing the original individual-level data, and is computationally efficient when searching for relatives. We split the UK Biobank into its respective centers and then encrypted the genotype data. We observed that the relatives estimated using encG-reg were as accurate as those estimated by KING, which is widely used software but requires the original genotype data. In a more complex application, we launched a carefully designed multi-center collaboration across 5 research institutes in China, covering 9 cohorts of 54,092 GWAS samples. encG-reg again identified true relatives across the cohorts, even with different ethnic backgrounds and genotyping qualities. Our study demonstrates that encrypted genomic data can be shared without loss of information or data-sharing barriers.
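The masking idea in this abstract can be illustrated with a generic random projection. The following is a minimal sketch, not the authors' encG-reg implementation: it assumes a shared Gaussian random matrix with entries of variance 1/k, a fixed allele frequency of 0.5, and hypothetical helper names (`standardize`, `mask`, `relatedness`, `demo`). It shows only that inner products between standardized genotype vectors, and hence a simple relatedness score, are approximately preserved after masking.

```python
import math
import random


def standardize(genotypes, freq=0.5):
    """Scale 0/1/2 allele counts to mean 0 and variance 1 for allele frequency `freq`."""
    sd = math.sqrt(2 * freq * (1 - freq))
    return [(g - 2 * freq) / sd for g in genotypes]


def mask(vector, random_matrix):
    """Multiply an m-vector by a shared m x k random matrix (the masking step)."""
    k = len(random_matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, random_matrix)) for j in range(k)]


def relatedness(u, v, n_markers):
    """Inner product scaled by the number of markers: near 1 for identical samples, near 0 for unrelated ones."""
    return sum(a * b for a, b in zip(u, v)) / n_markers


def demo(seed=0, n_markers=100, k=1000):
    rng = random.Random(seed)

    def draw():
        # Simulated genotypes: sum of two Bernoulli(0.5) alleles per marker.
        return standardize([rng.randint(0, 1) + rng.randint(0, 1) for _ in range(n_markers)])

    a, b = draw(), draw()   # two unrelated individuals
    a_dup = list(a)         # the same individual appearing in a second cohort
    # Shared random matrix with N(0, 1/k) entries, so inner products are preserved in expectation.
    s = [[rng.gauss(0, 1 / math.sqrt(k)) for _ in range(k)] for _ in range(n_markers)]
    plain = (relatedness(a, a_dup, n_markers), relatedness(a, b, n_markers))
    masked = (relatedness(mask(a, s), mask(a_dup, s), n_markers),
              relatedness(mask(a, s), mask(b, s), n_markers))
    return plain, masked
```

Calling `demo()` returns the plain and masked scores for the duplicate pair and the unrelated pair. Increasing `k` tightens the agreement between the two, which is one way to read the abstract's link between random-matrix dimension and required precision.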
Affiliation(s)
- Qi-Xin Zhang
- Institute of Bioinformatics, Zhejiang University, Hangzhou, Zhejiang, China
- Center for Reproductive Medicine, Department of Genetic and Genomic Medicine, and Clinical Research Institute, Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, Hangzhou, Zhejiang, China
| | - Tianzi Liu
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
- CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
| | - Xinxin Guo
- School of Public Health (Shenzhen), Sun Yat-sen University, Shenzhen, Guangdong, China
| | - Jianxin Zhen
- Central Laboratory, Shenzhen Baoan Women’s and Children’s Hospital, Shenzhen, Guangdong, China
| | - Meng-yuan Yang
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
| | - Saber Khederzadeh
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
| | - Fang Zhou
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
| | - Xiaotong Han
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, Guangdong, China
| | - Qiwen Zheng
- CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
| | - Peilin Jia
- CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
| | - Xiaohu Ding
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, Guangdong, China
| | - Mingguang He
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, Guangdong, China
- Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, Melbourne, Victoria, Australia
- Ophthalmology, Department of Surgery, University of Melbourne, Melbourne, Victoria, Australia
| | - Xin Zou
- State Key Laboratory of CAD & GC, Zhejiang University, Hangzhou, Zhejiang, China
| | - Jia-Kai Liao
- School of Mathematics and Statistics and Research Institute of Mathematical Sciences (RIMS), Jiangsu Provincial Key Laboratory of Educational Big Data Science and Engineering, Jiangsu Normal University, Xuzhou, Jiangsu, China
- Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo, Zhejiang, China
| | - Hongxin Zhang
- State Key Laboratory of CAD & GC, Zhejiang University, Hangzhou, Zhejiang, China
| | - Ji He
- Department of Neurology, Peking University Third Hospital, Beijing, China
| | - Xiaofeng Zhu
- Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, Ohio, United States of America
| | - Daru Lu
- State Key Laboratory of Genetic Engineering and MOE Engineering Research Center of Gene Technology, School of Life Sciences and Zhongshan Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Birth Defects and Reproductive Health, Chongqing Population and Family Planning Science and Technology Research Institute, Chongqing, China
| | - Hongyan Chen
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
| | - Changqing Zeng
- CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Henan Academy of Sciences, Zhengzhou, Henan, China
| | - Fan Liu
- CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Department of Forensic Sciences, College of Criminal Justice, Naif Arab University of Security Sciences, Riyadh, Kingdom of Saudi Arabia
| | - Hou-Feng Zheng
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
| | - Siyang Liu
- School of Public Health (Shenzhen), Sun Yat-sen University, Shenzhen, Guangdong, China
| | - Hai-Ming Xu
- Institute of Bioinformatics, Zhejiang University, Hangzhou, Zhejiang, China
| | - Guo-Bo Chen
- Center for Reproductive Medicine, Department of Genetic and Genomic Medicine, and Clinical Research Institute, Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, Hangzhou, Zhejiang, China
- Key Laboratory of Endocrine Gland Diseases of Zhejiang Province, Hangzhou, Zhejiang, China
| |
|
14
|
Wang X, Dervishi L, Li W, Ayday E, Jiang X, Vaidya J. Privacy-preserving federated genome-wide association studies via dynamic sampling. Bioinformatics 2023; 39:btad639. [PMID: 37856329 PMCID: PMC10612407 DOI: 10.1093/bioinformatics/btad639] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 09/15/2023] [Accepted: 10/18/2023] [Indexed: 10/21/2023] Open
Abstract
MOTIVATION Genome-wide association studies (GWAS) benefit from the increasing availability of genomic data and cross-institution collaborations. However, sharing data across institutional boundaries jeopardizes medical data confidentiality and patient privacy. While modern cryptographic techniques provide formal security guarantees, the substantial communication and computational overheads hinder the practical application of large-scale collaborative GWAS. RESULTS This work introduces an efficient framework for conducting collaborative GWAS on distributed datasets, maintaining data privacy without compromising the accuracy of the results. We propose a novel two-step strategy aimed at reducing communication and computational overheads, and we employ iterative and sampling techniques to ensure accurate results. We instantiate our approach using logistic regression, a commonly used statistical method for identifying associations between genetic markers and the phenotype of interest. We evaluate our proposed methods using two real genomic datasets and demonstrate their robustness in the presence of between-study heterogeneity and skewed phenotype distributions using a variety of experimental settings. The empirical results show the efficiency and applicability of the proposed method and its promise for large-scale collaborative GWAS. AVAILABILITY AND IMPLEMENTATION The source code and data are available at https://github.com/amioamo/TDS.
Affiliation(s)
- Xinyue Wang
- Management Science and Information Systems Department, Rutgers University, New Brunswick, NJ 07102, United States
| | - Leonard Dervishi
- Department of Computer and Data Sciences, Cleveland, OH 44106, United States
| | - Wentao Li
- Department of Health Data Science and Artificial Intelligence, Houston, TX 77030, United States
| | - Erman Ayday
- Department of Computer and Data Sciences, Cleveland, OH 44106, United States
| | - Xiaoqian Jiang
- Department of Health Data Science and Artificial Intelligence, Houston, TX 77030, United States
| | - Jaideep Vaidya
- Management Science and Information Systems Department, Rutgers University, New Brunswick, NJ 07102, United States
| |
|
15
|
Li W, Kim M, Zhang K, Chen H, Jiang X, Harmanci A. COLLAGENE enables privacy-aware federated and collaborative genomic data analysis. Genome Biol 2023; 24:204. [PMID: 37697426 PMCID: PMC10496350 DOI: 10.1186/s13059-023-03039-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2022] [Accepted: 08/16/2023] [Indexed: 09/13/2023] Open
Abstract
Growing regulatory requirements set barriers around genetic data sharing and collaborations. Moreover, existing privacy-aware paradigms are challenging to deploy in collaborative settings. We present COLLAGENE, a tool base for building secure collaborative genomic data analysis methods. COLLAGENE protects data using shared-key homomorphic encryption and combines encryption with multiparty strategies for efficient privacy-aware collaborative method development. COLLAGENE provides ready-to-run tools for encryption/decryption, matrix processing, and network transfers, which can be immediately integrated into existing pipelines. We demonstrate the usage of COLLAGENE by building a practical federated GWAS protocol for binary phenotypes and a secure meta-analysis protocol. COLLAGENE is available at https://zenodo.org/record/8125935 .
Affiliation(s)
- Wentao Li
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
| | - Miran Kim
- Department of Mathematics, Department of Computer Science, Hanyang University, Seoul, 04763, Republic of Korea
- Research Institute for Convergence of Basic Science, Hanyang University, Seoul, 04763, Republic of Korea
- Bio-BigData Center, Hanyang Institute of Bioscience and Biotechnology, Hanyang University, Seoul, 04763, Republic of Korea
| | - Kai Zhang
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
| | - Han Chen
- Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Center for Precision Health, D. Bradley McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
| | - Xiaoqian Jiang
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
| | - Arif Harmanci
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA.
- Center for Precision Health, D. Bradley McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.
| |
|
16
|
Casaletto J, Bernier A, McDougall R, Cline MS. Federated Analysis for Privacy-Preserving Data Sharing: A Technical and Legal Primer. Annu Rev Genomics Hum Genet 2023; 24:347-368. [PMID: 37253596 PMCID: PMC10846631 DOI: 10.1146/annurev-genom-110122-084756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Continued advances in precision medicine rely on the widespread sharing of data that relate human genetic variation to disease. However, data sharing is severely limited by legal, regulatory, and ethical restrictions that safeguard patient privacy. Federated analysis addresses this problem by transferring the code to the data, providing the technical and legal capability to analyze the data within their secure home environment rather than transferring the data to another institution for analysis. This allows researchers to gain new insights from data that cannot be moved, while respecting patient privacy and the data stewards' legal obligations. Because federated analysis is a technical solution to the legal challenges inherent in data sharing, the technology and policy implications must be evaluated together. Here, we summarize the technical approaches to federated analysis and provide a legal analysis of their policy implications.
Affiliation(s)
- James Casaletto
- Genomics Institute, University of California, Santa Cruz, California, USA
| | - Alexander Bernier
- Centre of Genomics and Policy, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
| | - Robyn McDougall
- Centre of Genomics and Policy, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
| | - Melissa S Cline
- Genomics Institute, University of California, Santa Cruz, California, USA
| |
|
17
|
Li W, Chen H, Jiang X, Harmanci A. Federated generalized linear mixed models for collaborative genome-wide association studies. iScience 2023; 26:107227. [PMID: 37529100 PMCID: PMC10387571 DOI: 10.1016/j.isci.2023.107227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2022] [Revised: 01/28/2023] [Accepted: 06/23/2023] [Indexed: 08/03/2023] Open
Abstract
Federated association testing is a powerful approach to conduct large-scale association studies where sites share intermediate statistics through a central server. There are, however, several standing challenges. Confounding factors like population stratification should be carefully modeled across sites. In addition, it is crucial to consider disease etiology using flexible models to prevent biases. Privacy protections for participants pose another significant challenge. Here, we propose distributed Mixed Effects Genome-wide Association study (dMEGA), a method that enables federated generalized linear mixed model-based association testing across multiple sites without explicitly sharing genotype and phenotype data. dMEGA employs a reference projection to correct for population-stratification and utilizes efficient local-gradient updates among sites, incorporating both fixed and random effects. The accuracy and efficiency of dMEGA are demonstrated through simulated and real datasets. dMEGA is publicly available at https://github.com/Li-Wentao/dMEGA.
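The pattern described here, in which sites share intermediate statistics through a central server, can be sketched with plain logistic regression. This is a simplified illustration and not dMEGA itself: it omits random effects and the reference projection, and the function names `local_gradient` and `federated_step` are hypothetical. Each site computes a gradient on its own rows, and only that gradient vector is sent to the server.

```python
import math


def local_gradient(weights, rows, labels):
    """Gradient of the logistic log-loss on one site's data; only this vector leaves the site."""
    grad = [0.0] * len(weights)
    for x, y in zip(rows, labels):
        p = 1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(weights, x))))
        for j, xi in enumerate(x):
            grad[j] += (p - y) * xi
    return grad


def federated_step(weights, sites, lr=0.1):
    """Server side: sum the per-site gradients and take one gradient-descent step."""
    total = [0.0] * len(weights)
    n = 0
    for rows, labels in sites:
        g = local_gradient(weights, rows, labels)
        total = [t + gj for t, gj in zip(total, g)]
        n += len(rows)
    return [w - lr * t / n for w, t in zip(weights, total)]


# Two sites, each with an intercept column and one genotype-like covariate.
site_a = ([[1.0, 0.0], [1.0, 1.0]], [0, 1])
site_b = ([[1.0, 2.0], [1.0, 1.0], [1.0, 0.0]], [1, 1, 0])
updated = federated_step([0.0, 0.0], [site_a, site_b])
```

Because the log-loss gradient is a sum over samples, the federated update matches the update that would be computed on the pooled data, up to floating-point rounding. Note that sharing raw gradients is not by itself a formal privacy guarantee.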
Affiliation(s)
- Wentao Li
- School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX 77030, USA
| | - Han Chen
- School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX 77030, USA
- School of Public Health, University of Texas Health Science Center, Houston, TX 77030, USA
| | - Xiaoqian Jiang
- School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX 77030, USA
| | - Arif Harmanci
- School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX 77030, USA
| |
|
18
|
Geva R, Gusev A, Polyakov Y, Liram L, Rosolio O, Alexandru A, Genise N, Blatt M, Duchin Z, Waissengrin B, Mirelman D, Bukstein F, Blumenthal DT, Wolf I, Pelles-Avraham S, Schaffer T, Lavi LA, Micciancio D, Vaikuntanathan V, Badawi AA, Goldwasser S. Collaborative privacy-preserving analysis of oncological data using multiparty homomorphic encryption. Proc Natl Acad Sci U S A 2023; 120:e2304415120. [PMID: 37549296 PMCID: PMC10437415 DOI: 10.1073/pnas.2304415120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2023] [Accepted: 06/09/2023] [Indexed: 08/09/2023] Open
Abstract
Real-world healthcare data sharing is instrumental in constructing broader-based and larger clinical datasets that may improve clinical decision-making research and outcomes. Stakeholders are frequently reluctant to share their data without guaranteed patient privacy, proper protection of their datasets, and control over the usage of their data. Fully homomorphic encryption (FHE) is a cryptographic capability that can address these issues by enabling computation on encrypted data without intermediate decryptions, so the analytics results are obtained without revealing the raw data. This work presents a toolset for collaborative privacy-preserving analysis of oncological data using multiparty FHE. Our toolset supports survival analysis, logistic regression training, and several common descriptive statistics. We demonstrate using oncological datasets that the toolset achieves high accuracy and practical performance, which scales well to larger datasets. As part of this work, we propose a cryptographic protocol for interactive bootstrapping in multiparty FHE, which is of independent interest. The toolset we develop is general-purpose and can be applied to other collaborative medical and healthcare application domains.
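The core property mentioned above, computing on encrypted data without intermediate decryption, can be shown on a much smaller scale with the Paillier cryptosystem. This sketch rests on several assumptions: Paillier is only additively homomorphic (it is not the multiparty FHE used in the paper), the primes are toy-sized and insecure, and the function names are illustrative only.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy key pair from two small primes (insecure; for illustration only)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    mu = pow(phi, -1, n)  # modular inverse; requires Python 3.8+
    return n, (phi, mu)


def encrypt(n, message, rng=random):
    """Paillier encryption with generator g = n + 1 and fresh randomness r."""
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    n2 = n * n
    return ((1 + message * n) % n2) * pow(r, n, n2) % n2


def decrypt(n, private_key, ciphertext):
    """Recover the plaintext (mod n) from a ciphertext."""
    phi, mu = private_key
    l_value = (pow(ciphertext, phi, n * n) - 1) // n
    return l_value * mu % n


def add_encrypted(n, c1, c2):
    """Multiplying two ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)


n, private_key = keygen()
encrypted_sum = add_encrypted(n, encrypt(n, 17), encrypt(n, 25))
total = decrypt(n, private_key, encrypted_sum)  # 42
```

The party that calls `add_encrypted` never sees 17 or 25. A fully homomorphic scheme extends this idea to multiplication as well, which is what makes tasks such as logistic regression training possible on ciphertexts.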
Affiliation(s)
- Ravit Geva
- Tel Aviv Sorasky Medical Center, Tel Aviv 64239, Israel
| | - Alexander Gusev
- Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02215
| | | | - Lior Liram
- Duality Technologies, Inc., Hoboken, NJ 07103
| | | | | | | | | | | | | | - Dan Mirelman
- Tel Aviv Sorasky Medical Center, Tel Aviv 64239, Israel
| | | | | | - Ido Wolf
- Tel Aviv Sorasky Medical Center, Tel Aviv 64239, Israel
| | | | - Tali Schaffer
- Tel Aviv Sorasky Medical Center, Tel Aviv 64239, Israel
| | - Lee A. Lavi
- Tel Aviv Sorasky Medical Center, Tel Aviv 64239, Israel
| | - Daniele Micciancio
- Duality Technologies, Inc., Hoboken, NJ 07103
- University of California, San Diego, CA 92093
| | - Vinod Vaikuntanathan
- Duality Technologies, Inc., Hoboken, NJ 07103
- Massachusetts Institute of Technology, Cambridge, MA 02139
| | | | - Shafi Goldwasser
- Duality Technologies, Inc., Hoboken, NJ 07103
- Simons Institute for the Theory of Computing, University of California, Berkeley, CA 94720
| |
|
19
|
Aneja S, Avesta A, Xu H, Machado LO. Clinical Informatics Approaches to Facilitate Cancer Data Sharing. Yearb Med Inform 2023; 32:104-110. [PMID: 37414028 PMCID: PMC10751108 DOI: 10.1055/s-0043-1768721] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 07/08/2023] Open
Abstract
OBJECTIVES Despite growing enthusiasm surrounding the utility of clinical informatics to improve cancer outcomes, data availability remains a persistent bottleneck to progress. Difficulty combining data with protected health information often limits our ability to aggregate larger, more representative datasets for analysis. With the rise of machine learning techniques that require increasing amounts of clinical data, these barriers have been magnified. Here, we review recent efforts within clinical informatics to address issues related to safely sharing cancer data. METHODS We carried out a narrative review of clinical informatics studies related to sharing protected health data within cancer studies published from 2018 to 2022, with a focus on domains such as decentralized analytics, homomorphic encryption, and common data models. RESULTS Clinical informatics studies that investigated cancer data sharing were identified. The search focused on studies of decentralized analytics, homomorphic encryption, and common data models. Decentralized analytics has been prototyped across genomic, imaging, and clinical data with the most advances in diagnostic image analysis. Homomorphic encryption was most often employed on genomic data and less on imaging and clinical data. Common data models primarily involve clinical data from the electronic health record. Although all methods have robust research, there are limited studies showing wide-scale implementation. CONCLUSIONS Decentralized analytics, homomorphic encryption, and common data models represent promising solutions to improve cancer data sharing. Promising results thus far have been limited to smaller settings. Future studies should be focused on evaluating the scalability and efficacy of these methods across clinical settings of varying resources and expertise.
Affiliation(s)
- Sanjay Aneja
- Department of Therapeutic Radiology, Yale School of Medicine, New Haven, CT, USA
- Center for Outcomes Research and Evaluation at Yale, New Haven, CT, USA
- Department of Bioinformatics and Data Science, Yale School of Medicine, New Haven, CT, USA
| | - Arman Avesta
- Department of Therapeutic Radiology, Yale School of Medicine, New Haven, CT, USA
- Center for Outcomes Research and Evaluation at Yale, New Haven, CT, USA
| | - Hua Xu
- Department of Bioinformatics and Data Science, Yale School of Medicine, New Haven, CT, USA
| | - Lucila Ohno Machado
- Department of Bioinformatics and Data Science, Yale School of Medicine, New Haven, CT, USA
| |
|
20
|
Mendelsohn S, Froelicher D, Loginov D, Bernick D, Berger B, Cho H. sfkit: a web-based toolkit for secure and federated genomic analysis. Nucleic Acids Res 2023; 51:W535-W541. [PMID: 37246709 PMCID: PMC10320181 DOI: 10.1093/nar/gkad464] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/27/2023] [Revised: 05/03/2023] [Accepted: 05/14/2023] [Indexed: 05/30/2023] Open
Abstract
Advances in genomics increasingly depend on the ability to analyze large and diverse genomic data collections, which are often difficult to amass due to privacy concerns. Recent works have shown that it is possible to jointly analyze datasets held by multiple parties, while provably preserving the privacy of each party's dataset using cryptographic techniques. However, these tools have been challenging to use in practice due to the complexities of the required setup and coordination among the parties. We present sfkit, a secure and federated toolkit for collaborative genomic studies, to allow groups of collaborators to easily perform joint analyses of their datasets without compromising privacy. sfkit consists of a web server and a command-line interface, which together support a range of use cases including both auto-configured and user-supplied computational environments. sfkit provides collaborative workflows for the essential tasks of genome-wide association study (GWAS) and principal component analysis (PCA). We envision sfkit becoming a one-stop server for secure collaborative tools for a broad range of genomic analyses. sfkit is open-source and available at: https://sfkit.org.
Affiliation(s)
- David Froelicher
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Computer Science and AI Laboratory, MIT, Cambridge, MA, USA
- Denis Loginov
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- David Bernick
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Bonnie Berger
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Computer Science and AI Laboratory, MIT, Cambridge, MA, USA
- Department of Mathematics, MIT, Cambridge, MA, USA
- Hyunghoon Cho
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
21
Dervishi L, Wang X, Li W, Halimi A, Vaidya J, Jiang X, Ayday E. Facilitating Federated Genomic Data Analysis by Identifying Record Correlations while Ensuring Privacy. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2023; 2022:395-404. [PMID: 37128365 PMCID: PMC10148342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2023]
Abstract
With the reduction of sequencing costs and the pervasiveness of computing devices, genomic data collection is continually growing. However, data collection is highly fragmented and the data is still siloed across different repositories. Analyzing all of this data would be transformative for genomics research. However, the data is sensitive, and therefore cannot be easily centralized. Furthermore, there may be correlations in the data, which if not detected, can impact the analysis. In this paper, we take the first step towards identifying correlated records across multiple data repositories in a privacy-preserving manner. The proposed framework, based on random shuffling, synthetic record generation, and local differential privacy, allows a trade-off of accuracy and computational efficiency. An extensive evaluation on real genomic data from the OpenSNP dataset shows that the proposed solution is efficient and effective.
Affiliation(s)
- Erman Ayday
- Case Western Reserve University, Cleveland, OH
22
Yamamoto A, Shibuya T. Privacy-Preserving Statistical Analysis of Genomic Data Using Compressive Mechanism with Haar Wavelet Transform. J Comput Biol 2023; 30:176-188. [PMID: 36374238 DOI: 10.1089/cmb.2022.0246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
To promote the use of personal genome information in medicine, it is important to analyze the relationship between diseases and the human genomes. Therefore, statistical analysis using genomic data is often conducted, but there is a privacy concern with respect to releasing the statistics as they are. Existing methods to address this problem using the concept of differential privacy cannot provide accurate outputs under strong privacy guarantees, making them less practical. In this study, for the first time, we investigate the application of a compressive mechanism to genomic statistical data and propose two approaches. The first is to apply the normal compressive mechanism to the statistics vector along with an algorithm to determine the number of nonzero entries in a sparse representation. The second is to alter the mechanism based on the data, aiming to release significant single nucleotide polymorphisms with a high probability. In this algorithm, we apply the compressive mechanism with the input as a sparse vector for significant data and the Laplace mechanism for nonsignificant data. By using the Haar wavelet transform for the compressive mechanism, we can determine the number of nonzero elements and the amount of noise. In addition, we give theoretical guarantees that our proposed methods achieve ϵ-differential privacy. We evaluated our methods in terms of accuracy and rank error compared with the Laplace and exponential mechanisms. The results show that our second method in particular can guarantee high privacy assurance as well as utility.
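The Laplace mechanism, which the abstract above uses both as a baseline and for the nonsignificant data, can be sketched as follows. This is a minimal illustration of the generic mechanism only, not the authors' compressive mechanism. The function names, the sensitivity value, and the example statistics are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with location 0 and that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(stats, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to each statistic.

    A smaller epsilon (stronger privacy) gives a larger noise scale.
    `sensitivity` must bound the L1 change of the whole vector `stats`
    when one individual's record changes; that bound is assumed here.
    """
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon
    return [s + laplace_noise(scale, rng) for s in stats]


# Hypothetical per-SNP statistics (made-up values, not from the paper).
example_stats = [12.4, 0.8, 3.1, 25.9]
noisy_stats = laplace_mechanism(example_stats, sensitivity=1.0, epsilon=0.5,
                                rng=random.Random(0))
```

With `sensitivity=1.0` and `epsilon=0.5` the noise scale is 2.0, which shows why plain noise addition loses accuracy under strong privacy guarantees, the limitation the abstract sets out to address.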
Affiliation(s)
- Akito Yamamoto
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Tetsuo Shibuya
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
23
Wang Y, Namba S, Lopera E, Kerminen S, Tsuo K, Läll K, Kanai M, Zhou W, Wu KH, Favé MJ, Bhatta L, Awadalla P, Brumpton B, Deelen P, Hveem K, Lo Faro V, Mägi R, Murakami Y, Sanna S, Smoller JW, Uzunovic J, Wolford BN, Willer C, Gamazon ER, Cox NJ, Surakka I, Okada Y, Martin AR, Hirbo J. Global Biobank analyses provide lessons for developing polygenic risk scores across diverse cohorts. CELL GENOMICS 2023; 3:100241. [PMID: 36777179 PMCID: PMC9903818 DOI: 10.1016/j.xgen.2022.100241] [Citation(s) in RCA: 28] [Impact Index Per Article: 28.0] [Received: 12/01/2021] [Revised: 08/28/2022] [Accepted: 12/03/2022] [Indexed: 01/06/2023]
Abstract
Polygenic risk scores (PRSs) have been widely explored in precision medicine. However, few studies have thoroughly investigated their best practices in global populations across different diseases. We here utilized data from Global Biobank Meta-analysis Initiative (GBMI) to explore methodological considerations and PRS performance in 9 different biobanks for 14 disease endpoints. Specifically, we constructed PRSs using pruning and thresholding (P + T) and PRS-continuous shrinkage (CS). For both methods, using a European-based linkage disequilibrium (LD) reference panel resulted in comparable or higher prediction accuracy compared with several other non-European-based panels. PRS-CS overall outperformed the classic P + T method, especially for endpoints with higher SNP-based heritability. Notably, prediction accuracy is heterogeneous across endpoints, biobanks, and ancestries, especially for asthma, which has known variation in disease prevalence across populations. Overall, we provide lessons for PRS construction, evaluation, and interpretation using GBMI resources and highlight the importance of best practices for PRS in the biobank-scale genomics era.
Affiliation(s)
- Ying Wang
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA
- Stanley Center for Psychiatric Research and Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Shinichi Namba
- Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita 565-0871, Japan
- Esteban Lopera
- Department of Genetics, UMCG, University of Groningen, Groningen, the Netherlands
- Sini Kerminen
- Institute for Molecular Medicine Finland, FIMM, HiLIFE, University of Helsinki, Helsinki, Finland
- Kristin Tsuo
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA
- Stanley Center for Psychiatric Research and Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Kristi Läll
- Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia
- Masahiro Kanai
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA
- Stanley Center for Psychiatric Research and Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita 565-0871, Japan
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Wei Zhou
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA
- Stanley Center for Psychiatric Research and Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Kuan-Han Wu
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48103, USA
- Laxmi Bhatta
- K.G. Jebsen Center for Genetic Epidemiology, Department of Public Health and Nursing, NTNU, Norwegian University of Science and Technology, 7030 Trondheim, Norway
- Philip Awadalla
- Ontario Institute for Cancer Research, Toronto, ON, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Ben Brumpton
- K.G. Jebsen Center for Genetic Epidemiology, Department of Public Health and Nursing, NTNU, Norwegian University of Science and Technology, 7030 Trondheim, Norway
- HUNT Research Centre, Department of Public Health and Nursing, NTNU, Norwegian University of Science and Technology, 7600 Levanger, Norway
- Clinic of Medicine, St. Olav’s Hospital, Trondheim University Hospital, 7030 Trondheim, Norway
- Patrick Deelen
- Department of Genetics, UMCG, University of Groningen, Groningen, the Netherlands
- Oncode Institute, Utrecht, the Netherlands
- Kristian Hveem
- K.G. Jebsen Center for Genetic Epidemiology, Department of Public Health and Nursing, NTNU, Norwegian University of Science and Technology, 7030 Trondheim, Norway
- HUNT Research Centre, Department of Public Health and Nursing, NTNU, Norwegian University of Science and Technology, 7600 Levanger, Norway
- Valeria Lo Faro
- Department of Ophthalmology, University Medical Center Groningen, University of Groningen, Groningen, the Netherlands
- Department of Clinical Genetics, Amsterdam University Medical Center (AMC), Amsterdam, the Netherlands
- Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Reedik Mägi
- Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia
- Yoshinori Murakami
- Division of Molecular Pathology, Institute of Medical Science, the University of Tokyo, Tokyo, Japan
- Serena Sanna
- Department of Genetics, UMCG, University of Groningen, Groningen, the Netherlands
- Institute for Genetics and Biomedical Research (IRGB), National Research Council (CNR), 09100 Cagliari, Italy
- Jordan W. Smoller
- Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- Brooke N. Wolford
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48103, USA
- K.G. Jebsen Center for Genetic Epidemiology, Department of Public Health and Nursing, NTNU, Norwegian University of Science and Technology, 7030 Trondheim, Norway
- Cristen Willer
- K.G. Jebsen Center for Genetic Epidemiology, Department of Public Health and Nursing, NTNU, Norwegian University of Science and Technology, 7030 Trondheim, Norway
- Department of Internal Medicine, University of Michigan, Ann Arbor, MI 48109, USA
- Department of Biostatistics and Center for Statistical Genetics, and Department of Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA
- Eric R. Gamazon
- Department of Medicine, Division of Genetic Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
- MRC Epidemiology Unit, University of Cambridge, Cambridge, UK
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
- Nancy J. Cox
- Department of Medicine, Division of Genetic Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
- Ida Surakka
- Department of Internal Medicine, University of Michigan, Ann Arbor, MI 48109, USA
- Yukinori Okada
- Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita 565-0871, Japan
- Laboratory for Systems Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Laboratory of Statistical Immunology, Immunology Frontier Research Center (WPI-IFReC) and Center for Infectious Disease Education and Research (CiDER), Osaka University, Suita 565-0871, Japan
- Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo, Tokyo 113-0033, Japan
- Alicia R. Martin
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA
- Stanley Center for Psychiatric Research and Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Jibril Hirbo
- Department of Medicine, Division of Genetic Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
24
Sequre: a high-performance framework for secure multiparty computation enables biomedical data sharing. Genome Biol 2023; 24:5. [PMID: 36631897 PMCID: PMC9832703 DOI: 10.1186/s13059-022-02841-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2022] [Accepted: 12/21/2022] [Indexed: 01/12/2023] Open
Abstract
Secure multiparty computation (MPC) is a cryptographic tool that allows computation on top of sensitive biomedical data without revealing private information to the involved entities. Here, we introduce Sequre, an easy-to-use, high-performance framework for developing performant MPC applications. Sequre offers a set of automatic compile-time optimizations that significantly improve the performance of MPC applications and incorporates the syntax of Python programming language to facilitate rapid application development. We demonstrate its usability and performance on various bioinformatics tasks showing up to 3-4 times increased speed over the existing pipelines with 7-fold reductions in codebase sizes.
25
Fujiwara M, Hashimoto H, Doi K, Kujiraoka M, Tanizawa Y, Ishida Y, Sasaki M, Nagasaki M. Secure secondary utilization system of genomic data using quantum secure cloud. Sci Rep 2022; 12:18530. [PMID: 36323706 PMCID: PMC9630297 DOI: 10.1038/s41598-022-22804-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2022] [Accepted: 10/19/2022] [Indexed: 12/05/2022] Open
Abstract
Secure storage and secondary use of individual human genome data is increasingly important for genome research and personalized medicine. Currently, it is necessary to store the whole genome sequencing information (FASTQ data), which enables detection of de novo mutations and structural variations in the analysis of hereditary diseases and cancer. Furthermore, bioinformatics tools to analyze FASTQ data are frequently updated to improve the precision and recall of detected variants. However, existing approaches to secure secondary use of data, such as multi-party computation or homomorphic encryption, can handle only a limited set of algorithms and usually require huge computational resources. Here, we developed a high-performance one-stop system for large-scale genome data analysis with secure secondary use of the data by the data owner and multiple users with different levels of data access control. Our quantum secure cloud system is a distributed secure genomic data analysis system (DSGD) with a "trusted server" built on a quantum secure cloud, the information-theoretically secure Tokyo QKD Network. The trusted server will be capable of deploying and running a variety of sequencing analysis hardware, such as GPUs and FPGAs, as well as CPU-based software. We demonstrated that DSGD achieved comparable throughput with and without encryption on the trusted server. Therefore, our system is ready to be installed at research institutes and hospitals that make diagnoses based on whole genome sequencing on a daily basis.
Affiliation(s)
- Mikio Fujiwara
- National Institute of Information and Communications Technology (NICT), 4-2-1 Nukui-Kita, Koganei, Tokyo 184-8795 Japan (GRID grid.28312.3a, ISNI 0000 0001 0590 0962)
- Hiroki Hashimoto
- Human Biosciences Unit for the Top Global Course Center for the Promotion of Interdisciplinary Education and Research, Center for Genomic Medicine, Graduate School of Medicine, Kyoto University, Kyoto, 606-8507 Japan (GRID grid.258799.8, ISNI 0000 0004 0372 2033)
- Kazuaki Doi
- Corporate Research and Development Center, Toshiba Corporation, 1, Komukai Toshiba-Cho, Saiwai-Ku, Kawasaki-Shi, 212-8582 Japan (GRID grid.410825.a, ISNI 0000 0004 1770 8232)
- Mamiko Kujiraoka
- Corporate Research and Development Center, Toshiba Corporation, 1, Komukai Toshiba-Cho, Saiwai-Ku, Kawasaki-Shi, 212-8582 Japan (GRID grid.410825.a, ISNI 0000 0004 1770 8232)
- Yoshimichi Tanizawa
- Corporate Research and Development Center, Toshiba Corporation, 1, Komukai Toshiba-Cho, Saiwai-Ku, Kawasaki-Shi, 212-8582 Japan (GRID grid.410825.a, ISNI 0000 0004 1770 8232)
- Yusuke Ishida
- ZenmuTech, Inc., THE HUB Ginza, OCT 804, 8-17-5 Ginza Chuo-Ku, Tokyo, 104-0061 Japan
- Masahide Sasaki
- National Institute of Information and Communications Technology (NICT), 4-2-1 Nukui-Kita, Koganei, Tokyo 184-8795 Japan (GRID grid.28312.3a, ISNI 0000 0001 0590 0962)
- Masao Nagasaki
- Human Biosciences Unit for the Top Global Course Center for the Promotion of Interdisciplinary Education and Research, Center for Genomic Medicine, Graduate School of Medicine, Kyoto University, Kyoto, 606-8507 Japan (GRID grid.258799.8, ISNI 0000 0004 0372 2033)
26
TrustGWAS: A full-process workflow for encrypted GWAS using multi-key homomorphic encryption and pseudorandom number perturbation. Cell Syst 2022; 13:752-767.e6. [PMID: 36041458 DOI: 10.1016/j.cels.2022.08.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/19/2021] [Revised: 04/21/2022] [Accepted: 08/04/2022] [Indexed: 01/26/2023]
Abstract
The statistical power of genome-wide association studies (GWASs) is affected by the effective sample size. However, the privacy and security concerns associated with individual-level genotype data pose great challenges for cross-institutional cooperation. The full-process cryptographic solutions are in demand but have not been covered, especially the essential principal-component analysis (PCA). Here, we present TrustGWAS, a complete solution for secure, large-scale GWAS, recapitulating gold standard results against PLINK without compromising privacy and supporting basic PLINK steps including quality control, linkage disequilibrium pruning, PCA, chi-square test, Cochran-Armitage trend test, covariate-supported logistic regression and linear regression, and their sequential combinations. TrustGWAS leverages pseudorandom number perturbations for PCA and multiparty scheme of multi-key homomorphic encryption for all other modules. TrustGWAS can evaluate 100,000 individuals with 1 million variants and complete QC-LD-PCA-regression workflow within 50 h. We further successfully discover gene loci associated with fasting blood glucose, consistent with the findings of the ChinaMAP project.
27
Artificial Intelligence in Medicine and Privacy Preservation. Artif Intell Med 2022. [DOI: 10.1007/978-3-030-64573-1_261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
28
Zolotareva O, Nasirigerdeh R, Matschinske J, Torkzadehmahani R, Bakhtiari M, Frisch T, Späth J, Blumenthal DB, Abbasinejad A, Tieri P, Kaissis G, Rückert D, Wenke NK, List M, Baumbach J. Flimma: a federated and privacy-aware tool for differential gene expression analysis. Genome Biol 2021; 22:338. [PMID: 34906207 PMCID: PMC8670124 DOI: 10.1186/s13059-021-02553-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 11/23/2020] [Accepted: 11/22/2021] [Indexed: 12/13/2022] Open
Abstract
Aggregating transcriptomics data across hospitals can increase sensitivity and robustness of differential expression analyses, yielding deeper clinical insights. As data exchange is often restricted by privacy legislation, meta-analyses are frequently employed to pool local results. However, the accuracy might drop if class labels are inhomogeneously distributed among cohorts. Flimma (https://exbio.wzw.tum.de/flimma/) addresses this issue by implementing the state-of-the-art workflow limma voom in a federated manner, i.e., patient data never leaves its source site. Flimma results are identical to those generated by limma voom on aggregated datasets even in imbalanced scenarios where meta-analysis approaches fail.
Affiliation(s)
- Olga Zolotareva
- Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Freising, Germany; Institute for Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Reza Nasirigerdeh
- AI in Medicine and Healthcare, Technical University of Munich, Munich, Germany; Klinikum rechts der Isar, Technical University of Munich, Munich, Germany
- Julian Matschinske
- Institute for Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Mohammad Bakhtiari
- Institute for Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Tobias Frisch
- Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
- Julian Späth
- Institute for Computational Systems Biology, University of Hamburg, Hamburg, Germany
- David B Blumenthal
- Department Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany
- Amir Abbasinejad
- Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Freising, Germany; Sapienza University of Rome, Rome, Italy
- Paolo Tieri
- CNR National Research Council, IAC Institute for Applied Computing, Rome, Italy; Sapienza University of Rome, Rome, Italy
- Georgios Kaissis
- AI in Medicine and Healthcare, Technical University of Munich, Munich, Germany; Klinikum rechts der Isar, Technical University of Munich, Munich, Germany; Biomedical Image Analysis Group, Imperial College London, London, UK; OpenMined, Oxford, UK
- Daniel Rückert
- AI in Medicine and Healthcare, Technical University of Munich, Munich, Germany; Klinikum rechts der Isar, Technical University of Munich, Munich, Germany; Biomedical Image Analysis Group, Imperial College London, London, UK
- Nina K Wenke
- Institute for Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Markus List
- Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Jan Baumbach
- Institute for Computational Systems Biology, University of Hamburg, Hamburg, Germany; Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
29
Truly privacy-preserving federated analytics for precision medicine with multiparty homomorphic encryption. Nat Commun 2021; 12:5910. [PMID: 34635645 PMCID: PMC8505638 DOI: 10.1038/s41467-021-25972-y] [Citation(s) in RCA: 48] [Impact Index Per Article: 16.0] [Received: 03/12/2021] [Accepted: 09/01/2021] [Indexed: 01/10/2023] Open
Abstract
Using real-world evidence in biomedical research, an indispensable complement to clinical trials, requires access to large quantities of patient data that are typically held separately by multiple healthcare institutions. We propose FAMHE, a novel federated analytics system that, based on multiparty homomorphic encryption (MHE), enables privacy-preserving analyses of distributed datasets by yielding highly accurate results without revealing any intermediate data. We demonstrate the applicability of FAMHE to essential biomedical analysis tasks, including Kaplan-Meier survival analysis in oncology and genome-wide association studies in medical genetics. Using our system, we accurately and efficiently reproduce two published centralized studies in a federated setting, enabling biomedical insights that are not possible from individual institutions alone. Our work represents a necessary key step towards overcoming the privacy hurdle in enabling multi-centric scientific collaborations. Existing approaches to sharing of distributed medical data either provide only limited protection of patients’ privacy or sacrifice the accuracy of results. Here, the authors propose a federated analytics system, based on multiparty homomorphic encryption (MHE), to overcome these issues.
30
Oestreich M, Chen D, Schultze JL, Fritz M, Becker M. Privacy considerations for sharing genomics data. EXCLI JOURNAL 2021; 20:1243-1260. [PMID: 34345236 PMCID: PMC8326502 DOI: 10.17179/excli2021-4002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/19/2021] [Accepted: 07/07/2021] [Indexed: 01/23/2023]
Abstract
An increasing amount of attention has been geared towards understanding the privacy risks that arise from sharing genomic data of human origin. Most of these efforts have focused on issues in the context of genomic sequence data, but the popularity of techniques for collecting other types of genome-related data has prompted researchers to investigate privacy concerns in a broader genomic context. In this review, we give an overview of different types of genome-associated data, their individual ways of revealing sensitive information, the motivation to share them as well as established and upcoming methods to minimize information leakage. We further discuss the concise threats that are being posed, who is at risk, and how the risk level compares to potential benefits, all while addressing the topic in the context of modern technology, methodology, and information sharing culture. Additionally, we will discuss the current legal situation regarding the sharing of genomic data in a selection of countries, evaluating the scope of their applicability as well as their limitations. We will finalize this review by evaluating the development that is required in the scientific field in the near future in order to improve and develop privacy-preserving data sharing techniques for the genomic context.
Affiliation(s)
- Marie Oestreich
- Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE), Venusberg-Campus 1/99, 53127 Bonn, Germany
- Dingfan Chen
- CISPA Helmholtz Center for Information Security, Saarbrücken, Germany, Stuhlsatzenhaus 5, 66123 Saarbrücken, Germany
- Joachim L. Schultze
- Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE), Venusberg-Campus 1/99, 53127 Bonn, Germany
- Genomics and Immunoregulation, Life & Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany, Carl-Troll-Straße 31, 53115 Bonn, Germany
- PRECISE Platform for Single Cell Genomics and Epigenomics at Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE) and the University of Bonn, Germany, Venusberg-Campus 1/99, 53127 Bonn, Germany
- Mario Fritz
- CISPA Helmholtz Center for Information Security, Saarbrücken, Germany, Stuhlsatzenhaus 5, 66123 Saarbrücken, Germany
- Matthias Becker
- Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE), Venusberg-Campus 1/99, 53127 Bonn, Germany
31
B A, S S. A survey on genomic data by privacy-preserving techniques perspective. Comput Biol Chem 2021; 93:107538. [PMID: 34246892 DOI: 10.1016/j.compbiolchem.2021.107538] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/23/2021] [Revised: 06/15/2021] [Accepted: 06/26/2021] [Indexed: 11/27/2022]
Abstract
Nowadays, human genomics is increasingly applied to health-related problems and to achieving time- and cost-efficient healthcare. As genomics and its research advance, privacy protections are needed for querying, accessing, storing, and computing on genomic data. Although genomic data is widely accessible, privacy issues may emerge because an untrusted third party (adversaries or researchers) that requests an individual's genome data for research purposes may reveal information or strategy plans regarding that data. The privacy-preserving techniques used to mitigate this problem, along with cryptographic methods, are briefly discussed. Furthermore, efficiency and accuracy in secure and private genomic data computation need to be researched in the future.
Affiliation(s)
- Abinaya B
- Kalaignarkarunanidhi Institute of Technology, Coimbatore, India.
- Santhi S
- Kalaignarkarunanidhi Institute of Technology, Coimbatore, India.
32
Abstract
Genome-Wide Association Studies (GWAS) identify the genomic variations that are statistically associated with a particular phenotype (e.g., a disease). The confidence in GWAS results increases with the number of genomes analyzed, which encourages federated computations where biocenters would periodically share the genomes they have sequenced. However, for economical and legal reasons, this collaboration will only happen if biocenters cannot learn each others’ data. In addition, GWAS releases should not jeopardize the privacy of the individuals whose genomes are used. We introduce DyPS, a novel framework to conduct dynamic privacy-preserving federated GWAS. DyPS leverages a Trusted Execution Environment to secure dynamic GWAS computations. Moreover, DyPS uses a scaling mechanism to speed up the releases of GWAS results according to the evolving number of genomes used in the study, even if individuals retract their participation consent. Lastly, DyPS also tolerates up to all-but-one colluding biocenters without privacy leaks. We implemented and extensively evaluated DyPS through several scenarios involving more than 6 million simulated genomes and up to 35,000 real genomes. Our evaluation shows that DyPS updates test statistics with a reasonable additional request processing delay (11% longer) compared to an approach that would update them with minimal delay but would lead to 8% of the genomes not being protected. In addition, DyPS can result in the same amount of aggregate statistics as a static release (i.e., at the end of the study), but can produce up to 2.6 times more statistics information during earlier dynamic releases. Besides, we show that DyPS can support a larger number of genomes and SNP positions without any significant performance penalty.
33
Artificial Intelligence in Medicine and Privacy Preservation. Artif Intell Med 2021. [DOI: 10.1007/978-3-030-58080-3_261-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
34
Blatt M, Gusev A, Polyakov Y, Goldwasser S. Secure large-scale genome-wide association studies using homomorphic encryption. Proc Natl Acad Sci U S A 2020; 117:11608-11613. [PMID: 32398369 PMCID: PMC7261120 DOI: 10.1073/pnas.1918257117] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Indexed: 01/12/2023] Open
Abstract
Genome-wide association studies (GWASs) seek to identify genetic variants associated with a trait, and have been a powerful approach for understanding complex diseases. A critical challenge for GWASs has been the dependence on individual-level data that typically have strict privacy requirements, creating an urgent need for methods that preserve the individual-level privacy of participants. Here, we present a privacy-preserving framework based on several advances in homomorphic encryption and demonstrate that it can perform an accurate GWAS analysis for a real dataset of more than 25,000 individuals, keeping all individual data encrypted and requiring no user interactions. Our extrapolations show that it can evaluate GWASs of 100,000 individuals and 500,000 single-nucleotide polymorphisms (SNPs) in 5.6 h on a single server node (or in 11 min on 31 server nodes running in parallel). Our performance results are more than one order of magnitude faster than prior state-of-the-art results using secure multiparty computation, which requires continuous user interactions, with the accuracy of both solutions being similar. Our homomorphic encryption advances can also be applied to other domains where large-scale statistical analyses over encrypted data are needed.
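The two runtime figures quoted in the abstract above (5.6 h on one node, 11 min on 31 nodes) are consistent with near-linear scaling across nodes, as a quick calculation shows. The sketch below assumes ideal linear speedup, which the abstract does not state, and the function name is hypothetical.

```python
def ideal_parallel_minutes(single_node_hours, nodes):
    """Runtime in minutes if the work divides evenly across `nodes` (assumed)."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return single_node_hours * 60.0 / nodes


# Figures taken from the abstract: 5.6 h on a single node, 31 parallel nodes.
minutes = ideal_parallel_minutes(5.6, 31)
print(f"{minutes:.1f} min")  # 10.8 min, which rounds to the quoted 11 min
```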
Affiliation(s)
- Alexander Gusev
- Duality Technologies, Inc., Newark, NJ 07103
- Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02215
- Shafi Goldwasser
- Duality Technologies, Inc., Newark, NJ 07103
- Simons Institute for the Theory of Computing, University of California, Berkeley, CA 94720