1
Aherrahrou N, Tairi H, Aherrahrou Z. Genomic privacy preservation in genome-wide association studies: taxonomy, limitations, challenges, and vision. Brief Bioinform 2024; 25:bbae356. [PMID: 39073827 DOI: 10.1093/bib/bbae356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 06/19/2024] [Accepted: 07/12/2024] [Indexed: 07/30/2024] Open
Abstract
Genome-wide association studies (GWAS) serve as a crucial tool for identifying genetic factors associated with specific traits. However, ethical constraints prevent the direct exchange of genetic information, prompting the need for privacy-preserving solutions. To address these issues, earlier works relied on cryptographic mechanisms such as homomorphic encryption, secure multi-party computation, and differential privacy. Very recently, federated learning has emerged as a promising solution for enabling secure and collaborative GWAS computations. This work provides an extensive overview of existing methods for privacy preservation in GWAS, with the main focus on collaborative and distributed approaches. The survey analyzes the challenges faced by existing methods and their limitations, and offers insights into designing efficient solutions.
Affiliation(s)
- Noura Aherrahrou
  - LISAC, Department of Computer Science, Faculty of Sciences Dhar El Mahraz, University Sidi Mohamed Ben Abdellah, B.P. 1796 - Atlas, 30003, Fez, Morocco
- Hamid Tairi
  - LISAC, Department of Computer Science, Faculty of Sciences Dhar El Mahraz, University Sidi Mohamed Ben Abdellah, B.P. 1796 - Atlas, 30003, Fez, Morocco
- Zouhair Aherrahrou
  - Institute for Cardiogenetics, Universität zu Lübeck, D-23562 Lübeck, Germany
  - DZHK (German Centre for Cardiovascular Research), Partner Site Hamburg/Kiel/Lübeck, Germany
  - University Heart Centre Lübeck, D-23562 Lübeck, Germany
2
Dong X, Lu Y, Guo L, Li C, Ni Q, Wu B, Wang H, Yang L, Wu S, Sun Q, Zheng H, Zhou W, Wang S. PICOTEES: a privacy-preserving online service of phenotype exploration for genetic-diagnostic variants from Chinese children cohorts. J Genet Genomics 2024; 51:243-251. [PMID: 37714454 DOI: 10.1016/j.jgg.2023.09.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2023] [Revised: 08/31/2023] [Accepted: 09/03/2023] [Indexed: 09/17/2023]
Abstract
The growth in biomedical data resources has raised potential privacy concerns and risks of genetic information leakage. For instance, exome sequencing aids clinical decisions by comparing data through web services, but it requires significant trust between users and providers. To alleviate privacy concerns, the most commonly used strategy is to anonymize sensitive data. Unfortunately, studies have shown that anonymization is insufficient to protect against reidentification attacks. Recently, privacy-preserving technologies have been applied to preserve application utility while protecting the privacy of biomedical data. We present the PICOTEES framework, a privacy-preserving online service of phenotype exploration for genetic-diagnostic variants (https://birthdefectlab.cn:3000/). PICOTEES enables privacy-preserving queries of the phenotype spectrum for a single variant by utilizing trusted execution environment technology, which can protect the privacy of the user's query information, backend models, and data, as well as the final results. We demonstrate the utility and performance of PICOTEES by exploring a bioinformatics dataset. The dataset is from a cohort containing 20,909 genetic testing patients with 3,152,508 variants from the Children's Hospital of Fudan University in China, dominated by the Chinese Han population (>99.9%). Our query results yield a large number of unreported diagnostic variants and previously reported pathogenicity.
Affiliation(s)
- Xinran Dong
  - Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China; Key Laboratory of Birth Defects, Children's Hospital of Fudan University, Shanghai 201102, China
- Yulan Lu
  - Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China; Key Laboratory of Birth Defects, Children's Hospital of Fudan University, Shanghai 201102, China
- Lanting Guo
  - Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, Zhejiang 310000, China
- Chuan Li
  - Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China
- Qi Ni
  - Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China; Key Laboratory of Birth Defects, Children's Hospital of Fudan University, Shanghai 201102, China
- Bingbing Wu
  - Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China; Key Laboratory of Birth Defects, Children's Hospital of Fudan University, Shanghai 201102, China
- Huijun Wang
  - Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China; Key Laboratory of Birth Defects, Children's Hospital of Fudan University, Shanghai 201102, China
- Lin Yang
  - Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China; Key Laboratory of Birth Defects, Children's Hospital of Fudan University, Shanghai 201102, China
- Songyang Wu
  - The Third Research Institute of the Ministry of Public Security, Shanghai 200031, China
- Qi Sun
  - Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, Zhejiang 310000, China
- Hao Zheng
  - Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, Zhejiang 310000, China
- Wenhao Zhou
  - Center for Molecular Medicine, Children's Hospital of Fudan University, Shanghai 201102, China; Xiamen Campus of Children's Hospital of Fudan University, Xiamen, Fujian 361006, China
- Shuang Wang
  - Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, Zhejiang 310000, China; Institutes for Systems Genetics, West China Hospital, Chengdu, Sichuan 610041, China; Shanghai Putuo People's Hospital, Tongji University, Shanghai 200060, China
3
Wang X, Dervishi L, Li W, Ayday E, Jiang X, Vaidya J. Privacy-preserving federated genome-wide association studies via dynamic sampling. Bioinformatics 2023; 39:btad639. [PMID: 37856329 PMCID: PMC10612407 DOI: 10.1093/bioinformatics/btad639] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 09/15/2023] [Accepted: 10/18/2023] [Indexed: 10/21/2023] Open
Abstract
MOTIVATION Genome-wide association studies (GWAS) benefit from the increasing availability of genomic data and cross-institution collaborations. However, sharing data across institutional boundaries jeopardizes medical data confidentiality and patient privacy. While modern cryptographic techniques provide formal security guarantees, their substantial communication and computational overheads hinder the practical application of large-scale collaborative GWAS. RESULTS This work introduces an efficient framework for conducting collaborative GWAS on distributed datasets, maintaining data privacy without compromising the accuracy of the results. We propose a novel two-step strategy aimed at reducing communication and computational overheads, and we employ iterative and sampling techniques to ensure accurate results. We instantiate our approach using logistic regression, a commonly used statistical method for identifying associations between genetic markers and the phenotype of interest. We evaluate our proposed methods using two real genomic datasets and demonstrate their robustness in the presence of between-study heterogeneity and skewed phenotype distributions using a variety of experimental settings. The empirical results show the efficiency and applicability of the proposed method and its promise for large-scale collaborative GWAS. AVAILABILITY AND IMPLEMENTATION The source code and data are available at https://github.com/amioamo/TDS.
Affiliation(s)
- Xinyue Wang
  - Management Science and Information Systems Department, Rutgers University, New Brunswick, NJ 07102, United States
- Leonard Dervishi
  - Department of Computer and Data Sciences, Cleveland, OH 44106, United States
- Wentao Li
  - Department of Health Data Science and Artificial Intelligence, Houston, TX 77030, United States
- Erman Ayday
  - Department of Computer and Data Sciences, Cleveland, OH 44106, United States
- Xiaoqian Jiang
  - Department of Health Data Science and Artificial Intelligence, Houston, TX 77030, United States
- Jaideep Vaidya
  - Management Science and Information Systems Department, Rutgers University, New Brunswick, NJ 07102, United States
4
Bon JJ, Bretherton A, Buchhorn K, Cramb S, Drovandi C, Hassan C, Jenner AL, Mayfield HJ, McGree JM, Mengersen K, Price A, Salomone R, Santos-Fernandez E, Vercelloni J, Wang X. Being Bayesian in the 2020s: opportunities and challenges in the practice of modern applied Bayesian statistics. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 2023; 381:20220156. [PMID: 36970822 PMCID: PMC10041356 DOI: 10.1098/rsta.2022.0156] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/22/2022] [Accepted: 01/06/2023] [Indexed: 06/18/2023]
Abstract
Building on a strong foundation of philosophy, theory, methods and computation over the past three decades, Bayesian approaches are now an integral part of the toolkit for most statisticians and data scientists. Whether they are dedicated Bayesians or opportunistic users, applied professionals can now reap many of the benefits afforded by the Bayesian paradigm. In this paper, we touch on six modern opportunities and challenges in applied Bayesian statistics: intelligent data collection, new data sources, federated analysis, inference for implicit models, model transfer and purposeful software products. This article is part of the theme issue 'Bayesian inference: challenges, perspectives, and prospects'.
Affiliation(s)
- Joshua J. Bon
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
- Adam Bretherton
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
- Katie Buchhorn
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
- Susanna Cramb
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Public Health and Social Work, Queensland University of Technology, Brisbane, Queensland, Australia
- Christopher Drovandi
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
- Conor Hassan
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
- Adrianne L. Jenner
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
- Helen J. Mayfield
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Public Health, The University of Queensland, Saint Lucia, Queensland, Australia
- James M. McGree
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
- Kerrie Mengersen
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
- Aiden Price
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
- Robert Salomone
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Computer Science, Queensland University of Technology, Brisbane, Queensland, Australia
- Edgar Santos-Fernandez
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
- Julie Vercelloni
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
- Xiaoyu Wang
  - Centre for Data Science, Queensland University of Technology, Brisbane, Queensland, Australia
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
5
Wirth FN, Kussel T, Müller A, Hamacher K, Prasser F. EasySMPC: a simple but powerful no-code tool for practical secure multiparty computation. BMC Bioinformatics 2022; 23:531. [PMID: 36494612 PMCID: PMC9733077 DOI: 10.1186/s12859-022-05044-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/31/2022] [Accepted: 11/08/2022] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Modern biomedical research is data-driven and relies heavily on the re-use and sharing of data. Biomedical data, however, is subject to strict data protection requirements. Due to the complexity of the data required and the scale of data use, obtaining informed consent is often infeasible. Other methods, such as anonymization or federation, in turn have their own limitations. Secure multi-party computation (SMPC) is a cryptographic technology for distributed calculations, which brings formally provable security and privacy guarantees and can be used to implement a wide range of analytical approaches. As a relatively new technology, SMPC is still rarely used in real-world biomedical data sharing activities due to several barriers, including its technical complexity and lack of usability. RESULTS To overcome these barriers, we have developed the tool EasySMPC, which is implemented in Java as a cross-platform, stand-alone desktop application provided as open-source software. The tool makes use of the SMPC method Arithmetic Secret Sharing, which allows pre-defined sets of variables to be securely summed among different parties in two rounds of communication (input sharing and output reconstruction), and integrates this method into a graphical user interface. No additional software services need to be set up or configured, as EasySMPC uses the most widespread digital communication channel available: e-mails. No cryptographic keys need to be exchanged between the parties and e-mails are exchanged automatically by the software. To demonstrate the practicability of our solution, we evaluated its performance in a wide range of data sharing scenarios. The results of our evaluation show that our approach is scalable (summing up 10,000 variables between 20 parties takes less than 300 s) and that the number of participants is the essential factor.
CONCLUSIONS We have developed an easy-to-use "no-code solution" for performing secure joint calculations on biomedical data using SMPC protocols, which is suitable for use by scientists without IT expertise and which has no special infrastructure requirements. We believe that innovative approaches to data sharing with SMPC are needed to foster the translation of complex protocols into practice.
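The two-round summation described in the abstract above (input sharing, then output reconstruction) can be illustrated with a minimal additive secret sharing sketch. This is not EasySMPC's code: the modulus `PRIME` and the function names `share` and `secure_sum` are hypothetical, all parties are simulated in one process, and the e-mail transport is omitted.

```python
import secrets

# Hypothetical modulus for this sketch; any modulus larger than the true sum works.
PRIME = 2**61 - 1


def share(value, n_parties, modulus=PRIME):
    """Split `value` into n additive shares that sum to it modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def secure_sum(inputs, modulus=PRIME):
    """Simulate a two-round additive secret sharing sum among len(inputs) parties."""
    n = len(inputs)
    # Round 1 (input sharing): party i sends its j-th share to party j.
    distributed = [share(v, n, modulus) for v in inputs]
    # Each party j locally adds up the shares it received.
    partial = [sum(distributed[i][j] for i in range(n)) % modulus for j in range(n)]
    # Round 2 (output reconstruction): the partial sums are combined.
    return sum(partial) % modulus


if __name__ == "__main__":
    print(secure_sum([12, 30, 7]))
```

Each individual share is uniformly random, so a single party's view reveals nothing about another party's input; only the combined partial sums reveal the total.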
Affiliation(s)
- Felix Nikolaus Wirth
  - Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Medical Informatics Group, Charitéplatz 1, 10117 Berlin, Germany
- Tobias Kussel
  - Computational Biology and Simulation, TU Darmstadt, Darmstadt, Germany
- Armin Müller
  - Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Medical Informatics Group, Charitéplatz 1, 10117 Berlin, Germany
- Kay Hamacher
  - Computational Biology and Simulation, TU Darmstadt, Darmstadt, Germany
- Fabian Prasser
  - Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Medical Informatics Group, Charitéplatz 1, 10117 Berlin, Germany
6
Torkzadehmahani R, Nasirigerdeh R, Blumenthal DB, Kacprowski T, List M, Matschinske J, Spaeth J, Wenke NK, Baumbach J. Privacy-Preserving Artificial Intelligence Techniques in Biomedicine. Methods Inf Med 2022; 61:e12-e27. [PMID: 35062032 PMCID: PMC9246509 DOI: 10.1055/s-0041-1740630] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 03/22/2021] [Accepted: 09/18/2021] [Indexed: 12/15/2022]
Abstract
BACKGROUND Artificial intelligence (AI) has been successfully applied in numerous scientific domains. In biomedicine, AI has already shown tremendous potential, e.g., in the interpretation of next-generation sequencing data and in the design of clinical decision support systems. OBJECTIVES However, training an AI model on sensitive data raises concerns about the privacy of individual participants. For example, summary statistics of a genome-wide association study can be used to determine the presence or absence of an individual in a given dataset. This considerable privacy risk has led to restrictions in accessing genomic and other biomedical data, which is detrimental for collaborative research and impedes scientific progress. Hence, there has been a substantial effort to develop AI methods that can learn from sensitive data while protecting individuals' privacy. METHOD This paper provides a structured overview of recent advances in privacy-preserving AI techniques in biomedicine. It places the most important state-of-the-art approaches within a unified taxonomy and discusses their strengths, limitations, and open problems. CONCLUSION As the most promising direction, we suggest combining federated machine learning, as a more scalable approach, with additional privacy-preserving techniques. This would allow their advantages to be merged to provide privacy guarantees in a distributed way for biomedical applications. Nonetheless, more research is necessary as hybrid approaches pose new challenges such as additional network or computation overhead.
Affiliation(s)
- Reihaneh Torkzadehmahani
  - Institute for Artificial Intelligence in Medicine and Healthcare, Technical University of Munich, Munich, Germany
- Reza Nasirigerdeh
  - Institute for Artificial Intelligence in Medicine and Healthcare, Technical University of Munich, Munich, Germany
  - Klinikum Rechts der Isar, Technical University of Munich, Munich, Germany
- David B. Blumenthal
  - Department of Artificial Intelligence in Biomedical Engineering (AIBE), Friedrich-Alexander University Erlangen-Nürnberg (FAU), Erlangen, Germany
- Tim Kacprowski
  - Division of Data Science in Biomedicine, Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Medical School Hannover, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), TU Braunschweig, Braunschweig, Germany
- Markus List
  - Chair of Experimental Bioinformatics, Technical University of Munich, Munich, Germany
- Julian Matschinske
  - E.U. Horizon2020 FeatureCloud Project Consortium
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Julian Spaeth
  - E.U. Horizon2020 FeatureCloud Project Consortium
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Nina Kerstin Wenke
  - E.U. Horizon2020 FeatureCloud Project Consortium
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Jan Baumbach
  - E.U. Horizon2020 FeatureCloud Project Consortium
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
  - Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
7
Ghavamipour AR, Turkmen F, Jiang X. Privacy-preserving logistic regression with secret sharing. BMC Med Inform Decis Mak 2022; 22:89. [PMID: 35366870 PMCID: PMC8977014 DOI: 10.1186/s12911-022-01811-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/26/2021] [Accepted: 02/22/2022] [Indexed: 11/10/2022] Open
Abstract
Background
Logistic regression (LR) is a widely used classification method for modeling binary outcomes in many medical data classification tasks. Researchers that collect and combine datasets from various data custodians and jurisdictions can greatly benefit from the increased statistical power to support their analysis goals. However, combining data from different sources creates serious privacy concerns that need to be addressed.
Methods
In this paper, we propose two privacy-preserving protocols for performing logistic regression with the Newton–Raphson method in the estimation of parameters. Our proposals are based on secure Multi-Party Computation (MPC) and tailored to the honest majority and dishonest majority security settings.
Results
The proposed protocols are evaluated against both synthetic and real-world datasets in terms of efficiency and accuracy, and a comparison is made with the ordinary logistic regression. The experimental results demonstrate that the proposed protocols are highly efficient and accurate.
Conclusions
Our work introduces two iterative algorithms to enable the distributed training of a logistic regression model in a privacy-preserving manner. The implementation results show that our algorithms can handle large datasets from multiple sources.
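For orientation, the Newton–Raphson iteration that such protocols evaluate can be sketched in the clear. The snippet below is a plaintext, single-covariate illustration only; it contains no secret sharing and is not the authors' protocol. The function names and the toy data are assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_newton(xs, ys, iterations=25):
    """Fit y ~ sigmoid(b0 + b1 * x) with plain (non-private) Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # entries of X^T W X (negative Hessian)
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + (X^T W X)^{-1} g, using the 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


if __name__ == "__main__":
    # Toy, non-separable data (hypothetical).
    xs = [0, 1, 2, 3, 4, 5]
    ys = [0, 0, 1, 0, 1, 1]
    print(fit_logistic_newton(xs, ys))
```

In a secure multi-party setting, the sums that form the gradient and Hessian, and the matrix inversion, are the quantities that would have to be computed on shared values rather than in the clear.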
8
Kamphorst B, Rooijakkers T, Veugen T, Cellamare M, Knoors D. Accurate training of the Cox proportional hazards model on vertically-partitioned data while preserving privacy. BMC Med Inform Decis Mak 2022; 22:49. [PMID: 35209883 PMCID: PMC8867891 DOI: 10.1186/s12911-022-01771-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/08/2021] [Accepted: 01/20/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Analysing distributed medical data is challenging because of data sensitivity and various regulations to access and combine data. Some privacy-preserving methods are known for analyzing horizontally-partitioned data, where different organisations have similar data on disjoint sets of people. Technically more challenging is the case of vertically-partitioned data, dealing with data on overlapping sets of people. We use an emerging technology based on cryptographic techniques called secure multi-party computation (MPC), and apply it to perform privacy-preserving survival analysis on vertically-distributed data by means of the Cox proportional hazards (CPH) model. Both MPC and CPH are explained. METHODS We use a Newton-Raphson solver to securely train the CPH model with MPC, jointly with all data holders, without revealing any sensitive data. In order to securely compute the log-partial likelihood in each iteration, we run into several technical challenges to preserve the efficiency and security of our solution. To tackle these technical challenges, we generalize a cryptographic protocol for securely computing the inverse of the Hessian matrix and develop a new method for securely computing exponentiations. A theoretical complexity estimate is given to get insight into the computational and communication effort that is needed. RESULTS Our secure solution is implemented in a setting with three different machines, each representing a different data holder, which can communicate through the internet. The MPyC platform is used for implementing this privacy-preserving solution to obtain the CPH model. We test the accuracy and computation time of our methods on three standard benchmark survival datasets. We identify future work to make our solution more efficient. CONCLUSIONS Our secure solution is comparable with the standard, non-secure solver in terms of accuracy and convergence speed. The computation time is considerably larger, although the theoretical complexity is still cubic in the number of covariates and quadratic in the number of subjects. We conclude that this is a promising way of performing parametric survival analysis on vertically-distributed medical data, while realising a high level of security and privacy.
Affiliation(s)
- Bart Kamphorst
  - Cyber Security and Robustness, Netherlands Organisation for Applied Scientific Research, The Hague, The Netherlands
- Thomas Rooijakkers
  - Cyber Security and Robustness, Netherlands Organisation for Applied Scientific Research, The Hague, The Netherlands
- Thijs Veugen
  - Cyber Security and Robustness, Netherlands Organisation for Applied Scientific Research, The Hague, The Netherlands
  - Cryptology, Centrum Wiskunde and Informatica, Amsterdam, The Netherlands
- Matteo Cellamare
  - Research and Development, Netherlands Comprehensive Cancer Organisation, Eindhoven, The Netherlands
- Daan Knoors
  - Research and Development, Netherlands Comprehensive Cancer Organisation, Eindhoven, The Netherlands
9
Nasirigerdeh R, Torkzadehmahani R, Matschinske J, Frisch T, List M, Späth J, Weiss S, Völker U, Pitkänen E, Heider D, Wenke NK, Kaissis G, Rueckert D, Kacprowski T, Baumbach J. sPLINK: a hybrid federated tool as a robust alternative to meta-analysis in genome-wide association studies. Genome Biol 2022; 23:32. [PMID: 35073941 PMCID: PMC8785575 DOI: 10.1186/s13059-021-02562-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/25/2020] [Accepted: 12/02/2021] [Indexed: 11/10/2022] Open
Abstract
Meta-analysis has been established as an effective approach to combining summary statistics of several genome-wide association studies (GWAS). However, the accuracy of meta-analysis can be attenuated in the presence of cross-study heterogeneity. We present sPLINK, a hybrid federated and user-friendly tool, which performs privacy-aware GWAS on distributed datasets while preserving the accuracy of the results. sPLINK is robust against heterogeneous distributions of data across cohorts while meta-analysis considerably loses accuracy in such scenarios. sPLINK achieves practical runtime and acceptable network usage for chi-square and linear/logistic regression tests. sPLINK is available at https://exbio.wzw.tum.de/splink.
Affiliation(s)
- Reza Nasirigerdeh
  - AI in Medicine and Healthcare, Technical University of Munich, Munich, Germany
  - Klinikum rechts der Isar, Munich, Germany
- Julian Matschinske
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Tobias Frisch
  - Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
- Markus List
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Munich, Germany
- Julian Späth
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Stefan Weiss
  - Department of Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- Uwe Völker
  - Department of Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- Esa Pitkänen
  - Institute for Molecular Medicine Finland (FIMM), Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland
  - Applied Tumor Genomics Research Program, Research Programs Unit, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Dominik Heider
  - Department of Mathematics and Computer Science, University of Marburg, Marburg, Germany
- Nina Kerstin Wenke
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Georgios Kaissis
  - AI in Medicine and Healthcare, Technical University of Munich, Munich, Germany
  - Klinikum rechts der Isar, Munich, Germany
  - Biomedical Image Analysis Group, Imperial College London, London, UK
  - OpenMined, Oxford, UK
- Daniel Rueckert
  - AI in Medicine and Healthcare, Technical University of Munich, Munich, Germany
  - Klinikum rechts der Isar, Munich, Germany
  - Biomedical Image Analysis Group, Imperial College London, London, UK
- Tim Kacprowski
  - Division Data Science in Biomedicine, Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Brunswick, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Brunswick, Germany
- Jan Baumbach
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
  - Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
10
Spini G, Mancini E, Attema T, Abspoel M, de Gier J, Fehr S, Veugen T, van Heesch M, Worm D, De Luca A, Cramer R, Sloot PM. New Approach to Privacy-Preserving Clinical Decision Support Systems for HIV Treatment. J Med Syst 2022; 46:84. [PMID: 36261621 PMCID: PMC9581834 DOI: 10.1007/s10916-022-01851-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2020] [Revised: 08/09/2022] [Accepted: 08/16/2022] [Indexed: 01/04/2023]
Abstract
BACKGROUND HIV treatment prescription is a complex process. Clinical decision support systems (CDSS) are a category of health information technologies that can assist clinicians in choosing optimal treatments based on clinical trials and expert knowledge. The usability of some CDSSs for HIV treatment would be significantly improved by using the knowledge obtained by treating other patients. This knowledge, however, is mainly contained in patient records, whose usage is restricted due to privacy and confidentiality constraints. METHODS A treatment effectiveness measure, containing valuable information for HIV treatment prescription, was defined and a method to extract this measure from patient records was developed. This method uses an advanced cryptographic technology, known as secure Multiparty Computation (henceforth referred to as MPC), to preserve the privacy of the patient records and the confidentiality of the clinicians' decisions. FINDINGS Our solution makes it possible to compute an effectiveness measure of an HIV treatment, the average time-to-treatment-failure, while preserving privacy. Experimental results show that our solution, although at proof-of-concept stage, has good efficiency and provides a result to a query within 24 min for a dataset of realistic size. INTERPRETATION This paper presents a novel and efficient approach to HIV clinical decision support systems that harnesses the potential and insights acquired from treatment data, while preserving the privacy of patient records and the confidentiality of clinician decisions.
Affiliation(s)
- Gabriele Spini
- Applied Cryptography and Quantum Algorithms, TNO, Postbus 96800, 2509 JE The Hague, The Netherlands
| | - Emiliano Mancini
- Institute for Advanced Study, University of Amsterdam, Oude Turfmarkt 147, 1012 GC Amsterdam, The Netherlands; Department of Global Health, Amsterdam UMC, Location AMC, 1105 AZ Amsterdam, The Netherlands; Data Science Institute, Hasselt University, Diepenbeek, Belgium
| | - Thomas Attema
- Applied Cryptography and Quantum Algorithms, TNO, Postbus 96800, 2509 JE The Hague, The Netherlands; Cryptology Group, CWI, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands; Mathematical Institute, Leiden University, P.O. Box 9512, 2300 RA Leiden, The Netherlands
| | - Mark Abspoel
- Cryptology Group, CWI, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands; Philips Research, High Tech Campus 34, 5656 AE Eindhoven, The Netherlands
| | - Jan de Gier
- Applied Cryptography and Quantum Algorithms, TNO, Postbus 96800, 2509 JE The Hague, The Netherlands
| | - Serge Fehr
- Cryptology Group, CWI, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands; Mathematical Institute, Leiden University, P.O. Box 9512, 2300 RA Leiden, The Netherlands
| | - Thijs Veugen
- Applied Cryptography and Quantum Algorithms, TNO, Postbus 96800, 2509 JE The Hague, The Netherlands; Cryptology Group, CWI, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
| | - Maran van Heesch
- Applied Cryptography and Quantum Algorithms, TNO, Postbus 96800, 2509 JE The Hague, The Netherlands
| | - Daniël Worm
- Applied Cryptography and Quantum Algorithms, TNO, Postbus 96800, 2509 JE The Hague, The Netherlands
| | - Andrea De Luca
- Department of Medical Biotechnologies, University of Siena and Siena University Hospital, Viale Mario Bracci 16, 53100 Siena, Italy
| | - Ronald Cramer
- Cryptology Group, CWI, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands; Mathematical Institute, Leiden University, P.O. Box 9512, 2300 RA Leiden, The Netherlands
| | - Peter M.A. Sloot
- Institute for Advanced Study, University of Amsterdam, Oude Turfmarkt 147, 1012 GC Amsterdam, The Netherlands; Complexity Institute, Nanyang Technological University, Academic Building North, Level 1 Section B Unit No. 7 (ABN-01B-07), 61 Nanyang Drive, 637335 Singapore, Singapore; Advanced Computing, ITMO University, Lomonosova street 9, 191002 Saint Petersburg, Russia
|
11
|
van Egmond MB, Spini G, van der Galien O, IJpma A, Veugen T, Kraaij W, Sangers A, Rooijakkers T, Langenkamp P, Kamphorst B, van de L'Isle N, Kooij-Janic M. Privacy-preserving dataset combination and Lasso regression for healthcare predictions. BMC Med Inform Decis Mak 2021; 21:266. [PMID: 34530824 PMCID: PMC8445286 DOI: 10.1186/s12911-021-01582-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 03/05/2021] [Accepted: 06/29/2021] [Indexed: 11/12/2022] Open
Abstract
Background Recent developments in machine learning have shown its potential impact for clinical use such as risk prediction, prognosis, and treatment selection. However, relevant data are often scattered across different stakeholders and their use is regulated, e.g. by GDPR or HIPAA. As a concrete use-case, hospital Erasmus MC and health insurance company Achmea have data on individuals in the city of Rotterdam, which would in theory enable them to train a regression model in order to identify high-impact lifestyle factors for heart failure. However, privacy and confidentiality concerns make it unfeasible to exchange these data. Methods This article describes a solution where vertically-partitioned synthetic data of Achmea and of Erasmus MC are combined using Secure Multi-Party Computation. First, a secure inner join protocol takes place to securely determine the identifiers of the patients that are represented in both datasets. Then, a secure Lasso Regression model is trained on the securely combined data. The involved parties thus obtain the prediction model but no further information on the input data of the other parties. Results We implement our secure solution and describe its performance and scalability: we can train a prediction model on two datasets with 5000 records each and a total of 30 features in less than one hour, with a minimal difference from the results of standard (non-secure) methods. Conclusions This article shows that it is possible to combine datasets and train a Lasso regression model on this combination in a secure way. Such a solution thus further expands the potential of privacy-preserving data analysis in the medical domain.
Affiliation(s)
- Marie Beth van Egmond
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands.
| | - Gabriele Spini
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands
| | | | | | - Thijs Veugen
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands; Cryptology Research Group, Centrum Wiskunde and Informatica (CWI), Amsterdam, The Netherlands
| | - Wessel Kraaij
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands; Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands
| | - Alex Sangers
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands
| | - Thomas Rooijakkers
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands
| | - Peter Langenkamp
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands
| | - Bart Kamphorst
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands
| | | | - Milena Kooij-Janic
- Unit ICT, TNO (Dutch Organization for Applied Scientific Research), The Hague, The Netherlands
|
12
|
Multi-Party Privacy-Preserving Logistic Regression with Poor Quality Data Filtering for IoT Contributors. Electronics 2021. [DOI: 10.3390/electronics10172049] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/17/2022]
Abstract
Nowadays, the internet of things (IoT) is used to generate data in several application domains. Logistic regression, a standard machine learning algorithm with a wide application range, is often built on such data. Nevertheless, building a powerful and effective logistic regression model requires large amounts of data. Thus, collaboration between multiple IoT participants has often been the go-to approach. However, privacy concerns and poor data quality are two challenges that threaten the success of such a setting. Several studies have proposed different methods to address the privacy concern but, to the best of our knowledge, little attention has been paid to addressing poor data quality in the multi-party logistic regression model. Thus, in this study, we propose a multi-party privacy-preserving logistic regression framework with poor quality data filtering for IoT data contributors to address both problems. Specifically, we propose a new metric, gradient similarity, in a distributed setting, which we employ to filter out parameters from data contributors with poor quality data. To solve the privacy challenge, we employ homomorphic encryption. Theoretical analysis and experimental evaluations using real-world datasets demonstrate that our proposed framework is privacy-preserving and robust against poor quality data.
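The abstract names a gradient similarity metric without defining it. As a rough illustration only, the sketch below assumes cosine similarity against a coordinate-wise median gradient and a fixed cut-off; the reference vector, the `threshold` value, and all function names are hypothetical. It also works on plaintext gradients, whereas the paper applies homomorphic encryption.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_gradients(gradients, threshold=0.5):
    """Keep gradients that point in roughly the same direction as the
    coordinate-wise median of all contributed gradients (assumed reference)."""
    reference = [median(column) for column in zip(*gradients)]
    return [g for g in gradients if cosine(g, reference) >= threshold]


def aggregate(gradients, threshold=0.5):
    """Average the gradients that survive the similarity filter."""
    kept = filter_gradients(gradients, threshold)
    if not kept:
        raise ValueError("no gradient passed the similarity filter")
    return [sum(column) / len(kept) for column in zip(*kept)]


# Three consistent contributors and one whose update points the opposite way.
contributed = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [-1.0, -2.0]]
update = aggregate(contributed)
```

A median reference is used here because a single badly behaved contributor shifts it less than it would shift a mean; other reference choices are equally plausible.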
|
13
|
Wirth FN, Meurers T, Johns M, Prasser F. Privacy-preserving data sharing infrastructures for medical research: systematization and comparison. BMC Med Inform Decis Mak 2021; 21:242. [PMID: 34384406 PMCID: PMC8359765 DOI: 10.1186/s12911-021-01602-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 01/07/2021] [Accepted: 07/31/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Data sharing is considered a crucial part of modern medical research. Unfortunately, despite its advantages, it often faces obstacles, especially data privacy challenges. As a result, various approaches and infrastructures have been developed that aim to ensure that patients and research participants remain anonymous when data is shared. However, privacy protection typically comes at a cost, e.g. restrictions regarding the types of analyses that can be performed on shared data. What is lacking is a systematization making the trade-offs taken by different approaches transparent. The aim of the work described in this paper was to develop a systematization for the degree of privacy protection provided and the trade-offs taken by different data sharing methods. Based on this contribution, we categorized popular data sharing approaches and identified research gaps by analyzing combinations of promising properties and features that are not yet supported by existing approaches. METHODS The systematization consists of different axes. Three axes relate to privacy protection aspects and were adopted from the popular Five Safes Framework: (1) safe data, addressing privacy at the input level, (2) safe settings, addressing privacy during shared processing, and (3) safe outputs, addressing privacy protection of analysis results. Three additional axes address the usefulness of approaches: (4) support for de-duplication, to enable the reconciliation of data belonging to the same individuals, (5) flexibility, to be able to adapt to different data analysis requirements, and (6) scalability, to maintain performance with increasing complexity of shared data or common analysis processes. 
RESULTS Using the systematization, we identified three different categories of approaches: distributed data analyses, which exchange anonymous aggregated data, secure multi-party computation protocols, which exchange encrypted data, and data enclaves, which store pooled individual-level data in secure environments for access for analysis purposes. We identified important research gaps, including a lack of approaches enabling the de-duplication of horizontally distributed data or providing a high degree of flexibility. CONCLUSIONS There are fundamental differences between different data sharing approaches and several gaps in their functionality that may be interesting to investigate in future work. Our systematization can make the properties of privacy-preserving data sharing infrastructures more transparent and support decision makers and regulatory authorities with a better understanding of the trade-offs taken.
Affiliation(s)
- Felix Nikolaus Wirth
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany.
| | - Thierry Meurers
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
| | - Marco Johns
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
| | - Fabian Prasser
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
|
14
|
Dong X, Randolph DA, Weng C, Kho AN, Rogers JM, Wang X. Developing High Performance Secure Multi-Party Computation Protocols in Healthcare: A Case Study of Patient Risk Stratification. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2021; 2021:200-209. [PMID: 34457134 PMCID: PMC8378657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
We demonstrate that secure multi-party computation (MPC) using garbled circuits is a viable technology for solving clinical use cases that require cross-institution data exchange and collaboration. We describe two MPC protocols, based on Yao's garbled circuits and tested using large and realistically synthesized datasets. Linking records using private set intersection (PSI), we compute two metrics often used in patient risk stratification: high utilizer identification (PSI-HU) and comorbidity index calculation (PSI-CI). Cuckoo hashing enables our protocols to achieve extremely fast run times, with answers to clinically meaningful questions produced in minutes instead of hours. Also, our protocols are provably secure against any computationally bounded adversary in a semi-honest setting, the de facto mode for cross-institution data analytics. Finally, these protocols eliminate the need for an implicitly trusted third-party "honest broker" to mediate the information linkage and exchange.
Affiliation(s)
- Xiao Dong
- Center for Clinical and Translational Science, University of Illinois College of Medicine, Chicago, Illinois, USA
| | - David A Randolph
- Center for Clinical and Translational Science, University of Illinois College of Medicine, Chicago, Illinois, USA
| | - Chenkai Weng
- Department of Computer Science, Northwestern University, Evanston, Illinois, USA
| | - Abel N Kho
- Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
| | - Jennie M Rogers
- Department of Computer Science, Northwestern University, Evanston, Illinois, USA
| | - Xiao Wang
- Department of Computer Science, Northwestern University, Evanston, Illinois, USA
|
15
|
Park JA, Sung MD, Kim HH, Park YR. Weight-Based Framework for Predictive Modeling of Multiple Databases With Noniterative Communication Without Data Sharing: Privacy-Protecting Analytic Method for Multi-Institutional Studies. JMIR Med Inform 2021; 9:e21043. [PMID: 33818396 PMCID: PMC8056295 DOI: 10.2196/21043] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/07/2020] [Revised: 11/16/2020] [Accepted: 03/03/2021] [Indexed: 01/22/2023] Open
Abstract
Background Securing the representativeness of study populations is crucial in biomedical research to ensure high generalizability. In this regard, using multi-institutional data has advantages in medicine. However, combining data physically is difficult as the confidential nature of biomedical data causes privacy issues. Therefore, a methodological approach is necessary when using multi-institution medical data for research to develop a model without sharing data between institutions. Objective This study aims to develop a weight-based integrated predictive model of multi-institutional data, which does not require iterative communication between institutions, to improve average predictive performance by increasing the generalizability of the model under privacy-preserving conditions without sharing patient-level data. Methods The weight-based integrated model generates a weight for each institutional model and builds an integrated model for multi-institutional data based on these weights. We performed 3 simulations to show the weight characteristics and to determine the number of repetitions of the weight required to obtain stable values. We also conducted an experiment using real multi-institutional data to verify the developed weight-based integrated model. We selected 10 hospitals (2845 intensive care unit [ICU] stays in total) from the electronic intensive care unit Collaborative Research Database to predict ICU mortality with 11 features. To evaluate the validity of our model, compared with a centralized model, which was developed by combining all the data of 10 hospitals, we used proportional overlap (ie, 0.5 or less indicates a significant difference at a level of .05; and 2 indicates 2 CIs overlapping completely). Standard and Firth logistic regression models were applied for the 2 simulations and the experiment.
Results The results of these simulations indicate that the weight of each institution is determined by 2 factors (ie, the data size of each institution and how well each institutional model fits into the overall institutional data) and that repeatedly generating 200 weights is necessary per institution. In the experiment, the estimated area under the receiver operating characteristic curve (AUC) and 95% CIs were 81.36% (79.37%-83.36%) and 81.95% (80.03%-83.87%) in the centralized model and weight-based integrated model, respectively. The proportional overlap of the CIs for AUC between the weight-based integrated model and the centralized model was approximately 1.70, and the proportional overlap for the 11 estimated odds ratios was over 1, except for 1 case. Conclusions In the experiment where real multi-institutional data were used, our model showed similar results to the centralized model without iterative communication between institutions. In addition, our weight-based integrated model provided a weighted average model by integrating 10 models that were overfitted or underfitted compared with the centralized model. The proposed weight-based integrated model is expected to provide an efficient distributed research approach as it increases the generalizability of the model and does not require iterative communication.
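The integration step described above can be pictured as a weighted average of per-institution coefficient vectors. The sketch below assumes the weights have already been computed by each site; how the paper derives them from data size and model fit is not given in the abstract. The function name and the numbers are hypothetical.

```python
def integrate_models(coefficients, weights):
    """Combine per-institution coefficient vectors into one model.

    coefficients: one list of regression coefficients per institution,
                  all of the same length.
    weights:      one non-negative weight per institution (assumed given).
    """
    if len(coefficients) != len(weights):
        raise ValueError("need exactly one weight per institution")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in weights]
    n_features = len(coefficients[0])
    return [
        sum(normalized[k] * coefficients[k][j] for k in range(len(coefficients)))
        for j in range(n_features)
    ]


# Two hypothetical hospitals, two coefficients each; the second hospital
# carries three times the weight of the first.
integrated = integrate_models([[0.2, -1.0], [0.4, -0.5]], [1, 3])
```

Only the fitted coefficients and one weight per site cross institutional boundaries in this picture, which is what makes a single, non-iterative round of communication sufficient.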
Affiliation(s)
- Ji Ae Park
- Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
| | - Min Dong Sung
- Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
| | - Ho Heon Kim
- Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
| | - Yu Rang Park
- Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
|
16
|
Kuo TT, Gabriel RA, Cidambi KR, Ohno-Machado L. EXpectation Propagation LOgistic REgRession on permissioned blockCHAIN (ExplorerChain): decentralized online healthcare/genomics predictive model learning. J Am Med Inform Assoc 2021; 27:747-756. [PMID: 32364235 PMCID: PMC7309256 DOI: 10.1093/jamia/ocaa023] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 08/23/2019] [Revised: 02/11/2020] [Accepted: 02/24/2020] [Indexed: 11/19/2022] Open
Abstract
Objective Predicting patient outcomes using healthcare/genomics data is an increasingly popular/important area. However, some diseases are rare and require data from multiple institutions to construct generalizable models. To address institutional data protection policies, many distributed methods keep the data locally but rely on a central server for coordination, which introduces risks such as a single point of failure. We focus on providing an alternative based on a decentralized approach. We introduce the idea of using blockchain technology for this purpose, with a brief description of its own potential advantages/disadvantages. Materials and Methods We explain how our proposed EXpectation Propagation LOgistic REgRession on Permissioned blockCHAIN (ExplorerChain) can achieve the same results when compared to a distributed model that uses a central server on 3 healthcare/genomic datasets, and what trade-offs need to be considered when using centralized/decentralized methods. We explain how the use of blockchain technology can help decrease some of the problems encountered in decentralized methods. Results We showed that the discrimination power of ExplorerChain can be statistically similar to its counterpart central server-based algorithm. While ExplorerChain inherited some benefits of blockchain, it had a slightly increased running time. Discussion ExplorerChain has the same prerequisites as a distributed model with a centralized server for coordination. In a manner similar to secure multi-party computation strategies, it assumes that participating institutions are honest, but “curious.” Conclusion When evaluated on relatively small datasets, results suggest that ExplorerChain, which combines artificial intelligence and blockchain technologies, performs as well as a central server-based method, and may avoid some risks at the cost of efficiency.
Affiliation(s)
- Tsung-Ting Kuo
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
| | - Rodney A Gabriel
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA; Department of Anesthesiology, University of California San Diego, San Diego, California, USA
| | - Krishna R Cidambi
- Department of Orthopaedic Surgery, University of California at San Diego, San Diego, California, USA
| | - Lucila Ohno-Machado
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA; Division of Health Services Research & Development, VA San Diego Healthcare System, San Diego, California, USA
|
17
|
Kuo TT. The anatomy of a distributed predictive modeling framework: online learning, blockchain network, and consensus algorithm. JAMIA Open 2020; 3:201-208. [PMID: 32734160 PMCID: PMC7382618 DOI: 10.1093/jamiaopen/ooaa017] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/30/2020] [Revised: 04/21/2020] [Accepted: 04/29/2020] [Indexed: 11/23/2022] Open
Abstract
Objective Cross-institutional distributed healthcare/genomic predictive modeling is an emerging technology that fulfills both the need to build a more generalizable model and the need to protect patient data, by exchanging only the models and not the patient data. In this article, the implementation details are presented for one specific blockchain-based approach, ExplorerChain, from a software development perspective. The healthcare/genomic use cases of myocardial infarction, cancer biomarker, and length of hospitalization after surgery are also described. Materials and Methods ExplorerChain’s 3 main technical components, including online machine learning, metadata of transaction, and the Proof-of-Information-Timed (PoINT) algorithm, are introduced in this study. Specifically, the 3 algorithms (ie, core, new network, and new site/data) are described in detail. Results ExplorerChain was implemented and its design details were illustrated, especially the development configurations in a practical setting. Also, the system architecture and programming languages are introduced. The code was also released in an open source repository available at https://github.com/tsungtingkuo/explorerchain. Discussion The design considerations of the semi-trust assumption, data format normalization, and non-determinism were discussed. The limitations of the implementation include a fixed number of participating sites, limited join-or-leave capability during initialization, advanced privacy technology yet to be included, and the need for further investigation of ethical, legal, and social implications. Conclusion This study can serve as a reference for researchers who would like to implement and even deploy blockchain technology. Furthermore, the off-the-shelf software can also serve as a cornerstone to accelerate the development and investigation of future healthcare/genomic blockchain studies.
Affiliation(s)
- Tsung-Ting Kuo
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
|
18
|
Wu X, Zheng H, Dou Z, Chen F, Deng J, Chen X, Xu S, Gao G, Li M, Wang Z, Xiao Y, Xie K, Wang S, Xu H. A novel privacy-preserving federated genome-wide association study framework and its application in identifying potential risk variants in ankylosing spondylitis. Brief Bioinform 2020; 22:5860679. [PMID: 32591779 DOI: 10.1093/bib/bbaa090] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 01/26/2020] [Revised: 04/05/2020] [Accepted: 04/24/2020] [Indexed: 11/13/2022] Open
Abstract
Genome-wide association studies (GWAS) have been widely used for identifying potential risk variants in various diseases. A statistically meaningful GWAS typically requires a large sample size to detect disease-associated single nucleotide polymorphisms (SNPs). However, a single institution usually only possesses a limited number of samples. Therefore, cross-institutional partnerships are required to increase sample size and statistical power. However, cross-institutional partnerships pose significant challenges, a major one being data privacy. For example, people's privacy awareness, the impact of data privacy leakages and privacy-related risks are becoming increasingly important, while there is no de-identification standard available to safeguard genomic data sharing. In this paper, we introduce a novel privacy-preserving federated GWAS framework (iPRIVATES). Equipped with privacy-preserving federated analysis, iPRIVATES enables multiple institutions to jointly perform GWAS analysis without leaking patient-level genotyping data. Only aggregated local statistics are exchanged within the study network. In addition, we evaluate the performance of iPRIVATES through both simulated data and a real-world application for identifying potential risk variants in ankylosing spondylitis (AS). The experimental results showed that the strongest signals of AS-associated SNPs reside mostly around the human leukocyte antigen (HLA) regions. The proposed iPRIVATES framework achieved results equivalent to those of a traditional centralized implementation, demonstrating its great potential in driving collaborative genomic research for different diseases while preserving data privacy.
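To make "only aggregated local statistics are exchanged" concrete, the following sketch shows one simple way a federated association test can work: each site reports a 2x2 allele-count table for a SNP, and the coordinator runs an allelic chi-squared test on the summed table. Which statistics iPRIVATES actually exchanges is not stated in the abstract, so the test choice, the function names, and the counts below are assumptions.

```python
def aggregate_counts(site_tables):
    """Sum per-site allele counts for a single SNP.

    Each table is (case_alt, case_ref, control_alt, control_ref); no
    individual-level genotype leaves a site.
    """
    return tuple(sum(column) for column in zip(*site_tables))


def allelic_chi_squared(table):
    """Pearson chi-squared statistic (1 degree of freedom) for a 2x2 table."""
    a, b, c, d = table
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("table has an empty row or column")
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical allele counts reported by two institutions for one SNP.
site_tables = [(30, 70, 20, 80), (25, 75, 15, 85)]
pooled = aggregate_counts(site_tables)
statistic = allelic_chi_squared(pooled)
```

Because the summed table is identical to the table a centralized analysis would build from the pooled genotypes, the federated statistic matches the centralized one exactly for this particular test.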
Affiliation(s)
- Xin Wu
- Department of Rheumatology and Immunology, Shanghai Changzheng Hospital, Second Military Medical University, Shanghai, China
| | | | - Zuochao Dou
- Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, China
| | - Feng Chen
- Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, China
| | - Jieren Deng
- Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, China
| | - Xiang Chen
- Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, China
| | | | | | | | - Zhen Wang
- Department of Rheumatology and Immunology, Shanghai Changzheng Hospital, Second Military Medical University, China
| | - Yuhui Xiao
- Department of Bioinformatics, Hangzhou Nuowei Information Technology, Hangzhou, China
| | - Kang Xie
- Key Lab of Information Network Security of the Ministry of Public Security
| | - Shuang Wang
- Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, China
| | - Huji Xu
- Department of Rheumatology and Immunology, Shanghai Changzheng Hospital
|
19
|
Scott ER, Wallsten RL. A Look to the Future. Pharmacogenomics 2019. [DOI: 10.1016/b978-0-12-812626-4.00010-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022] Open
|
20
|
Jiang Y, Hamer J, Wang C, Jiang X, Kim M, Song Y, Xia Y, Mohammed N, Sadat MN, Wang S. SecureLR: Secure Logistic Regression Model via a Hybrid Cryptographic Protocol. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:113-123. [PMID: 29994005 DOI: 10.1109/tcbb.2018.2833463] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 06/08/2023]
Abstract
Machine learning applications are intensively utilized in various science fields, and increasingly in the biomedical and healthcare sector. Applying predictive modeling to biomedical data introduces privacy and security concerns requiring additional protection to prevent accidental disclosure or leakage of sensitive patient information. Significant advancements in secure computing methods have emerged in recent years; however, many of them require substantial computational and/or communication overheads, which might hinder their adoption in biomedical applications. In this work, we propose SecureLR, a novel framework allowing researchers to leverage both the computational and storage capacity of Public Cloud Servers to conduct learning and predictions on biomedical data without compromising data security or efficiency. Our model builds upon homomorphic encryption methodologies with hardware-based security reinforcement through Software Guard Extensions (SGX), and our implementation demonstrates a practical hybrid cryptographic solution to address important concerns in conducting machine learning with public clouds.
|
21
|
Abstract
Background One of the 3 tracks of iDASH Privacy & Security Workshop 2017 competition was to execute a whole genome variants search on private genomic data. Particularly, the search application was to find the top most significant SNPs (Single-Nucleotide Polymorphisms) in a database of genome records labeled with control or case. In this paper we discuss the solution submitted by our team to this competition. Methods Privacy and confidentiality of genome data had to be ensured using Intel SGX enclaves. The typical use-case of this application is the multi-party computation (each party possessing one or several genome records) of the SNPs which statistically differentiate control and case genome datasets. Results Our solution consists of two applications: (i) compress and encrypt genome files and (ii) perform genome processing (top most important SNPs search). We have opted for a horizontal treatment of genome records and heavily used parallel processing. Rust programming language was employed to develop both applications. Conclusions Execution performance of the processing applications scales well and very good performance metrics are obtained. Contest organizers selected it as the best submission amongst other received competition entries and our team was awarded the first prize on this track.
|
22
|
Gibson JE, Ander EL, Cave M, Bath-Hextall F, Musah A, Leonardi-Bee J. Linkage of national soil quality measurements to primary care medical records in England and Wales: a new resource for investigating environmental impacts on human health. Popul Health Metr 2018; 16:12. [PMID: 30012161 PMCID: PMC6048879 DOI: 10.1186/s12963-018-0168-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2016] [Accepted: 06/19/2018] [Indexed: 12/02/2022] Open
Abstract
Background Long-term, low-level exposure to toxic elements in soil may be harmful to human health but large longitudinal cohort studies with sufficient follow-up time to study these effects are cost-prohibitive and impractical. Linkage of routinely collected medical outcome data to systematic surveys of soil quality may offer a viable alternative. Methods We used the Geochemical Baseline Survey of the Environment (G-BASE), a systematic X-ray fluorescence survey of soil inorganic chemistry throughout England and Wales to obtain estimates of the concentrations of 15 elements in the soil contained within each English and Welsh postcode area. We linked these data to the residential postcodes of individuals enrolled in The Health Improvement Network (THIN), a large database of UK primary care medical records, to provide estimates of exposure. Observed exposure levels among the THIN population were compared with expectations based on UK population estimates to assess representativeness. Results Three hundred seventy-seven of three hundred ninety-five English and Welsh THIN practices agreed to participate in the linkage, providing complete residential soil metal estimates for 6,243,363 individuals (92% of all current and former patients) with a mean period of prospective computerised medical data collection (follow-up) of 6.75 years. Overall agreement between the THIN population and expectations was excellent; however, the number of participating practices in the Yorkshire & Humber strategic health authority was low, leading to restricted ranges of measurements for some elements relative to the known variations in geochemical concentrations in this area. Conclusions The linked database provides unprecedented population size and statistical power to study the effects of elements in soil on human health. With appropriate adjustment, results should be generalizable to and representative of the wider English and Welsh population. 
Electronic supplementary material The online version of this article (10.1186/s12963-018-0168-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jack E Gibson
- Division of Epidemiology & Public Health, School of Medicine, University of Nottingham, Clinical Sciences Building Phase II, City Hospital, Hucknall Road, Nottingham, NG5 1PB, UK.
| | - E Louise Ander
- Centre for Environmental Geochemistry, British Geological Survey, Nicker Hill, Keyworth, Nottingham, NG12 5GG, UK
| | - Mark Cave
- Environmental Geochemistry Baselines Group, British Geological Survey, Nicker Hill, Keyworth, Nottingham, NG12 5GG, UK
| | - Fiona Bath-Hextall
- Centre for Evidence Based Health Care, School of Health Sciences, University of Nottingham, Queen's Medical Centre, Nottingham, NG7 2HA, UK
| | - Anwar Musah
- Division of Epidemiology & Public Health, School of Medicine, University of Nottingham, Clinical Sciences Building Phase II, City Hospital, Hucknall Road, Nottingham, NG5 1PB, UK
| | - Jo Leonardi-Bee
- Division of Epidemiology & Public Health, School of Medicine, University of Nottingham, Clinical Sciences Building Phase II, City Hospital, Hucknall Road, Nottingham, NG5 1PB, UK
| |
|
23
|
Chenghong W, Jiang Y, Mohammed N, Chen F, Jiang X, Al Aziz MM, Sadat MN, Wang S. SCOTCH: Secure Counting Of encrypTed genomiC data using a Hybrid approach. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2018; 2017:1744-1753. [PMID: 29854245 PMCID: PMC5977689] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
As genomic data are usually at large scale and highly sensitive, it is essential to enable both efficient and secure analysis, by which the data owner can securely delegate both computation and storage on untrusted public cloud. Counting query of genotypes is a basic function for many downstream applications in biomedical research (e.g., computing allele frequency, calculating chi-squared statistics, etc.). Previous solutions show promise on secure counting of outsourced data but the efficiency is still a big limitation for real world applications. In this paper, we propose a novel hybrid solution to combine a rigorous theoretical model (homomorphic encryption) and the latest hardware-based infrastructure (i.e., Software Guard Extensions) to speed up the computation while preserving the privacy of both data owners and data users. Our results demonstrated efficiency by using the real data from the personal genome project.
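The counting query described in this abstract can be illustrated in plaintext. The sketch below is only an assumption about what such a query looks like (a conjunction of SNP/genotype predicates evaluated over records); it does not show the homomorphic encryption or SGX components of SCOTCH, and the function name, record layout, and SNP identifiers are hypothetical.

```python
from typing import Iterable, Mapping


def count_matching(records: Iterable[Mapping[str, str]],
                   query: Mapping[str, str]) -> int:
    """Count records whose genotype equals the queried genotype at every queried SNP.

    Each record maps a SNP identifier to a genotype string (hypothetical layout).
    An empty query matches every record.
    """
    return sum(
        1
        for record in records
        if all(record.get(snp) == genotype for snp, genotype in query.items())
    )


# Hypothetical toy data; the SNP identifiers are placeholders, not real variants.
records = [
    {"snp_a": "AA", "snp_b": "CT"},
    {"snp_a": "AG", "snp_b": "CT"},
    {"snp_a": "AA", "snp_b": "CC"},
]
single_snp_count = count_matching(records, {"snp_a": "AA"})
two_snp_count = count_matching(records, {"snp_a": "AA", "snp_b": "CT"})
```

Counts of this kind are the inputs to downstream statistics such as allele frequencies, which is why the abstract treats the counting query as a basic building block.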
Affiliation(s)
- Wang Chenghong
- Dept. of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Dept. of Computer Science, Syracuse University, Syracuse, NY, USA
| | - Yichen Jiang
- Dept. of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Dept. of Computer Science, Syracuse University, Syracuse, NY, USA
| | - Noman Mohammed
- Dept. of Computer Science, University of Manitoba, Winnipeg, MB R3T 2N2, Canada
| | - Feng Chen
- Dept. of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
| | - Xiaoqian Jiang
- Dept. of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
| | - Md Momin Al Aziz
- Dept. of Computer Science, University of Manitoba, Winnipeg, MB R3T 2N2, Canada
| | - Md Nazmus Sadat
- Dept. of Computer Science, University of Manitoba, Winnipeg, MB R3T 2N2, Canada
| | - Shuang Wang
- Dept. of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
| |
|
24
|
Lee J, Sun J, Wang F, Wang S, Jun CH, Jiang X. Privacy-Preserving Patient Similarity Learning in a Federated Environment: Development and Analysis. JMIR Med Inform 2018; 6:e20. [PMID: 29653917 PMCID: PMC5924379 DOI: 10.2196/medinform.7744] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Received: 03/29/2017] [Revised: 09/12/2017] [Accepted: 01/06/2018] [Indexed: 12/14/2022] Open
Abstract
Background: There is an urgent need for the development of global analytic frameworks that can perform analyses in a privacy-preserving federated environment across multiple institutions without privacy leakage. A few studies on the topic of federated medical analysis have been conducted recently with the focus on several algorithms. However, none of them have solved similar patient matching, which is useful for applications such as cohort construction for cross-institution observational studies, disease surveillance, and clinical trials recruitment. Objective: The aim of this study was to present a privacy-preserving platform in a federated setting for patient similarity learning across institutions. Without sharing patient-level information, our model can find similar patients from one hospital to another. Methods: We proposed a federated patient hashing framework and developed a novel algorithm to learn context-specific hash codes to represent patients across institutions. The similarities between patients can be efficiently computed using the resulting hash codes of corresponding patients. To avoid security attack from reverse engineering on the model, we applied homomorphic encryption to patient similarity search in a federated setting. Results: We used sequential medical events extracted from the Multiparameter Intelligent Monitoring in Intensive Care-III database to evaluate the proposed algorithm in predicting the incidence of five diseases independently. Our algorithm achieved averaged area under the curves of 0.9154 and 0.8012 with balanced and imbalanced data, respectively, in k-nearest neighbor with k=3. We also confirmed privacy preservation in similarity search by using homomorphic encryption. Conclusions: The proposed algorithm can help search similar patients across institutions effectively to support federated data analysis in a privacy-preserving manner.
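To make the similarity step in this abstract concrete, the following is a minimal plaintext sketch of nearest-neighbor prediction over binary hash codes. It assumes the hash codes have already been learned and are represented as integers compared by Hamming distance; that representation, the function names, and the toy data are assumptions for illustration, and neither the hash-learning algorithm nor the homomorphic encryption layer is shown.

```python
from collections import Counter
from typing import Sequence


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integer-encoded hash codes differ."""
    return bin(a ^ b).count("1")


def knn_predict(query: int, codes: Sequence[int], labels: Sequence[int], k: int = 3) -> int:
    """Majority label among the k stored codes closest to the query in Hamming distance."""
    if len(codes) != len(labels):
        raise ValueError("codes and labels must have the same length")
    if not 0 < k <= len(codes):
        raise ValueError("k must be between 1 and the number of stored codes")
    # Sort by distance, breaking ties by index so the result is deterministic.
    ranked = sorted(range(len(codes)), key=lambda i: (hamming_distance(query, codes[i]), i))
    nearest_labels = [labels[i] for i in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 4-bit codes with binary outcome labels.
codes = [0b1010, 0b1011, 0b0101, 0b0100]
labels = [1, 1, 0, 0]
predicted = knn_predict(0b1110, codes, labels, k=3)
```

With k=3 and binary labels a majority always exists, which matches the k=3 setting reported in the abstract.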
Affiliation(s)
- Junghye Lee
- School of Management Engineering, Ulsan National Institute of Science and Technology, Ulsan, Republic Of Korea
- Department of Biomedical Informatics, University of California San Diego, San Diego, CA, United States
- Department of Industrial and Management Engineering, Pohang University of Science and Technology, Pohang, Republic Of Korea
| | - Jimeng Sun
- College of Computing, Georgia Institute of Technology, Atlanta, GA, United States
| | - Fei Wang
- Division of Health Informatics, Department of Healthcare Policy and Research, Weill Cornell Medical College, Cornell University, New York City, NY, United States
| | - Shuang Wang
- Department of Biomedical Informatics, University of California San Diego, San Diego, CA, United States
| | - Chi-Hyuck Jun
- Department of Industrial and Management Engineering, Pohang University of Science and Technology, Pohang, Republic Of Korea
| | - Xiaoqian Jiang
- Department of Biomedical Informatics, University of California San Diego, San Diego, CA, United States
| |
|
25
|
Sadat MN, Jiang X, Aziz MMA, Wang S, Mohammed N. Secure and Efficient Regression Analysis Using a Hybrid Cryptographic Framework: Development and Evaluation. JMIR Med Inform 2018; 6:e14. [PMID: 29506966 PMCID: PMC5859787 DOI: 10.2196/medinform.8286] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 06/23/2017] [Revised: 10/25/2017] [Accepted: 01/03/2018] [Indexed: 11/25/2022] Open
Abstract
Background: Machine learning is an effective data-driven tool that is being widely used to extract valuable patterns and insights from data. Specifically, predictive machine learning models are very important in health care for clinical data analysis. The machine learning algorithms that generate predictive models often require pooling data from different sources to discover statistical patterns or correlations among different attributes of the input data. The primary challenge is to fulfill one major objective: preserving the privacy of individuals while discovering knowledge from data. Objective: Our objective was to develop a hybrid cryptographic framework for performing regression analysis over distributed data in a secure and efficient way. Methods: Existing secure computation schemes are not suitable for processing the large-scale data that are used in cutting-edge machine learning applications. We designed, developed, and evaluated a hybrid cryptographic framework, which can securely perform regression analysis, a fundamental machine learning algorithm using somewhat homomorphic encryption and a newly introduced secure hardware component of Intel Software Guard Extensions (Intel SGX) to ensure both privacy and efficiency at the same time. Results: Experimental results demonstrate that our proposed method provides a better trade-off in terms of security and efficiency than solely secure hardware-based methods. Besides, there is no approximation error. Computed model parameters are exactly similar to plaintext results. Conclusions: To the best of our knowledge, this kind of secure computation model using a hybrid cryptographic framework, which leverages both somewhat homomorphic encryption and Intel SGX, is not proposed or evaluated to this date. Our proposed framework ensures data security and computational efficiency at the same time.
Affiliation(s)
- Md Nazmus Sadat
- Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada
| | - Xiaoqian Jiang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, United States
| | - Md Momin Al Aziz
- Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada
| | - Shuang Wang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, United States
| | - Noman Mohammed
- Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada
| |
|
26
|
Chen F, Wang C, Dai W, Jiang X, Mohammed N, Al Aziz MM, Sadat MN, Sahinalp C, Lauter K, Wang S. PRESAGE: PRivacy-preserving gEnetic testing via SoftwAre Guard Extension. BMC Med Genomics 2017; 10:48. [PMID: 28786365 PMCID: PMC5547453 DOI: 10.1186/s12920-017-0281-2] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Indexed: 12/04/2022] Open
Abstract
Background: Advances in DNA sequencing technologies have prompted a wide range of genomic applications to improve healthcare and facilitate biomedical research. However, privacy and security concerns have emerged as a challenge for utilizing cloud computing to handle sensitive genomic data. Methods: We present one of the first implementations of Software Guard Extension (SGX) based securely outsourced genetic testing framework, which leverages multiple cryptographic protocols and minimal perfect hash scheme to enable efficient and secure data storage and computation outsourcing. Results: We compared the performance of the proposed PRESAGE framework with the state-of-the-art homomorphic encryption scheme, as well as the plaintext implementation. The experimental results demonstrated significant performance over the homomorphic encryption methods and a small computational overhead in comparison to plaintext implementation. Conclusions: The proposed PRESAGE provides an alternative solution for secure and efficient genomic data outsourcing in an untrusted cloud by using a hybrid framework that combines secure hardware and multiple crypto protocols.
Affiliation(s)
- Feng Chen
- Department of Biomedical Informatics, University of California San Diego, La Jolla, 92093, CA, USA.
| | - Chenghong Wang
- Department of Computer Science, Syracuse University, Syracuse, 13244, NY, USA
| | - Wenrui Dai
- Department of Biomedical Informatics, University of California San Diego, La Jolla, 92093, CA, USA
| | - Xiaoqian Jiang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, 92093, CA, USA
| | - Noman Mohammed
- Department of Computer Science, University of Manitoba, Winnipeg, R3T 2N2, MB, Canada
| | - Md Momin Al Aziz
- Department of Computer Science, University of Manitoba, Winnipeg, R3T 2N2, MB, Canada
| | - Md Nazmus Sadat
- Department of Computer Science, University of Manitoba, Winnipeg, R3T 2N2, MB, Canada
| | - Cenk Sahinalp
- Department of Computer Science and Informatics, Indiana University, Bloomington, 47408, IN, USA
| | - Kristin Lauter
- Cryptography Group, Microsoft Research, San Diego, 92122, CA, USA
| | - Shuang Wang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, 92093, CA, USA
| |
|
27
|
|
28
|
Wang S, Jiang X, Singh S, Marmor R, Bonomi L, Fox D, Dow M, Ohno-Machado L. Genome privacy: challenges, technical approaches to mitigate risk, and ethical considerations in the United States. Ann N Y Acad Sci 2017; 1387:73-83. [PMID: 27681358 PMCID: PMC5266631 DOI: 10.1111/nyas.13259] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Received: 06/01/2016] [Revised: 08/18/2016] [Accepted: 08/22/2016] [Indexed: 12/28/2022]
Abstract
Accessing and integrating human genomic data with phenotypes are important for biomedical research. Making genomic data accessible for research purposes, however, must be handled carefully to avoid leakage of sensitive individual information to unauthorized parties and improper use of data. In this article, we focus on data sharing within the scope of data accessibility for research. Current common practices to gain biomedical data access are strictly rule based, without a clear and quantitative measurement of the risk of privacy breaches. In addition, several types of studies require privacy-preserving linkage of genotype and phenotype information across different locations (e.g., genotypes stored in a sequencing facility and phenotypes stored in an electronic health record) to accelerate discoveries. The computer science community has developed a spectrum of techniques for data privacy and confidentiality protection, many of which have yet to be tested on real-world problems. In this article, we discuss clinical, technical, and ethical aspects of genome data privacy and confidentiality in the United States, as well as potential solutions for privacy-preserving genotype-phenotype linkage in biomedical research.
Affiliation(s)
- Shuang Wang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| | - Xiaoqian Jiang
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| | - Siddharth Singh
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| | - Rebecca Marmor
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| | - Luca Bonomi
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| | - Dov Fox
- School of Law, University of San Diego, San Diego, California
| | - Michelle Dow
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| | - Lucila Ohno-Machado
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| |
|
29
|
Constable SD, Tang Y, Wang S, Jiang X, Chapin S. Privacy-preserving GWAS analysis on federated genomic datasets. BMC Med Inform Decis Mak 2015; 15 Suppl 5:S2. [PMID: 26733045 PMCID: PMC4699163 DOI: 10.1186/1472-6947-15-s5-s2] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND: The biomedical community benefits from the increasing availability of genomic data to support meaningful scientific research, e.g., Genome-Wide Association Studies (GWAS). However, high quality GWAS usually requires a large amount of samples, which can grow beyond the capability of a single institution. Federated genomic data analysis holds the promise of enabling cross-institution collaboration for effective GWAS, but it raises concerns about patient privacy and medical information confidentiality (as data are being exchanged across institutional boundaries), which becomes an inhibiting factor for the practical use. METHODS: We present a privacy-preserving GWAS framework on federated genomic datasets. Our method is to layer the GWAS computations on top of secure multi-party computation (MPC) systems. This approach allows two parties in a distributed system to mutually perform secure GWAS computations, but without exposing their private data outside. RESULTS: We demonstrate our technique by implementing a framework for minor allele frequency counting and χ² statistics calculation, one of typical computations used in GWAS. For efficient prototyping, we use a state-of-the-art MPC framework, i.e., Portable Circuit Format (PCF) 1. Our experimental results show promise in realizing both efficient and secure cross-institution GWAS computations.
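The two statistics named in this abstract can be written down in plaintext as a reference for what a secure protocol would need to reproduce. The sketch below is an assumption-laden illustration, not the authors' implementation: it codes genotypes as 0, 1, or 2 copies of the alternate allele, uses the standard Pearson chi-squared formula for a 2x2 allele-count table without continuity correction, and omits the MPC layer (PCF) entirely. All function names are hypothetical.

```python
from typing import Sequence


def minor_allele_frequency(genotypes: Sequence[int]) -> float:
    """Minor allele frequency for genotypes coded as 0, 1, or 2 alternate-allele copies."""
    n_alleles = 2 * len(genotypes)
    if n_alleles == 0:
        raise ValueError("no genotypes supplied")
    alt_frequency = sum(genotypes) / n_alleles
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_frequency, 1.0 - alt_frequency)


def allelic_chi_squared(case_alt: int, case_ref: int, ctrl_alt: int, ctrl_ref: int) -> float:
    """Pearson chi-squared statistic (1 degree of freedom) for a 2x2 allele-count table."""
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    # Product of the two row totals and the two column totals.
    margins = ((case_alt + case_ref) * (ctrl_alt + ctrl_ref)
               * (case_alt + ctrl_alt) * (case_ref + ctrl_ref))
    if margins == 0:
        raise ValueError("a row or column total is zero; the statistic is undefined")
    return total * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / margins


# Allele counts are additive across sites, so two parties can pool counts
# (here in the clear, purely for illustration) before computing the statistic.
site_one = (12, 8, 5, 15)   # hypothetical (case_alt, case_ref, ctrl_alt, ctrl_ref)
site_two = (8, 12, 5, 15)
pooled = tuple(a + b for a, b in zip(site_one, site_two))
pooled_statistic = allelic_chi_squared(*pooled)
```

Because both statistics depend only on sums of per-site counts, the private inputs in a federated setting reduce to a handful of integers per SNP, which is what makes layering the computation on an MPC system practical.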
Affiliation(s)
- Scott D Constable
- Department of EECS, Syracuse University, South Crouse Avenue, 13244 Syracuse, NY USA
| | - Yuzhe Tang
- Department of EECS, Syracuse University, South Crouse Avenue, 13244 Syracuse, NY USA
| | - Shuang Wang
- Department of Biomedical Informatics, University of California, San Diego, 9500 Gilman Drive, MC 0728, 92093 La Jolla, CA USA
| | - Xiaoqian Jiang
- Department of Biomedical Informatics, University of California, San Diego, 9500 Gilman Drive, MC 0728, 92093 La Jolla, CA USA
| | - Steve Chapin
- Department of EECS, Syracuse University, South Crouse Avenue, 13244 Syracuse, NY USA
| |
|