1
|
Torkzadehmahani R, Nasirigerdeh R, Blumenthal DB, Kacprowski T, List M, Matschinske J, Spaeth J, Wenke NK, Baumbach J. Privacy-Preserving Artificial Intelligence Techniques in Biomedicine. Methods Inf Med 2022; 61:e12-e27. [PMID: 35062032 PMCID: PMC9246509 DOI: 10.1055/s-0041-1740630] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 03/22/2021] [Accepted: 09/18/2021] [Indexed: 12/15/2022]
Abstract
BACKGROUND Artificial intelligence (AI) has been successfully applied in numerous scientific domains. In biomedicine, AI has already shown tremendous potential, e.g., in the interpretation of next-generation sequencing data and in the design of clinical decision support systems. OBJECTIVES However, training an AI model on sensitive data raises concerns about the privacy of individual participants. For example, summary statistics of a genome-wide association study can be used to determine the presence or absence of an individual in a given dataset. This considerable privacy risk has led to restrictions on access to genomic and other biomedical data, which is detrimental to collaborative research and impedes scientific progress. Hence, there has been a substantial effort to develop AI methods that can learn from sensitive data while protecting individuals' privacy. METHOD This paper provides a structured overview of recent advances in privacy-preserving AI techniques in biomedicine. It places the most important state-of-the-art approaches within a unified taxonomy and discusses their strengths, limitations, and open problems. CONCLUSION As the most promising direction, we suggest combining federated machine learning, as a more scalable approach, with additional privacy-preserving techniques. This would combine their advantages and provide privacy guarantees in a distributed way for biomedical applications. Nonetheless, more research is necessary, as hybrid approaches pose new challenges such as additional network or computation overhead.
Affiliation(s)
- Reihaneh Torkzadehmahani
  - Institute for Artificial Intelligence in Medicine and Healthcare, Technical University of Munich, Munich, Germany
- Reza Nasirigerdeh
  - Institute for Artificial Intelligence in Medicine and Healthcare, Technical University of Munich, Munich, Germany
  - Klinikum Rechts der Isar, Technical University of Munich, Munich, Germany
- David B. Blumenthal
  - Department of Artificial Intelligence in Biomedical Engineering (AIBE), Friedrich-Alexander University Erlangen-Nürnberg (FAU), Erlangen, Germany
- Tim Kacprowski
  - Division of Data Science in Biomedicine, Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Medical School Hannover, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), TU Braunschweig, Braunschweig, Germany
- Markus List
  - Chair of Experimental Bioinformatics, Technical University of Munich, Munich, Germany
- Julian Matschinske
  - E.U. Horizon2020 FeatureCloud Project Consortium
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Julian Spaeth
  - E.U. Horizon2020 FeatureCloud Project Consortium
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Nina Kerstin Wenke
  - E.U. Horizon2020 FeatureCloud Project Consortium
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Jan Baumbach
  - E.U. Horizon2020 FeatureCloud Project Consortium
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
  - Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
|
2
|
Su J, Cao Y, Chen Y, Liu Y, Song J. Privacy protection of medical data in social network. BMC Med Inform Decis Mak 2021; 21:286. [PMID: 34663276 PMCID: PMC8524799 DOI: 10.1186/s12911-021-01645-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2021] [Accepted: 09/14/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND Protecting private data published in the health care field is an important research area. The Health Insurance Portability and Accountability Act (HIPAA) in the USA is the current legislation for privacy protection. However, the Institute of Medicine Committee on Health Research and the Privacy of Health Information recently concluded that HIPAA cannot adequately safeguard privacy, while at the same time researchers cannot use the medical data for effective research. Therefore, more effective privacy protection methods are urgently needed to ensure the security of released medical data. METHODS Clustering-based privacy protection methods are algorithms that ensure published data remain both useful and protected. In this paper, we first analyzed the importance of the key attributes of medical data in the social network. According to the attribute function and the main objective of privacy protection, the attribute information was divided into three categories. We then proposed an algorithm based on greedy clustering to group the data points according to the attributes and the connective information of the nodes in the published social network. Finally, we analyzed the loss of information during clustering, and evaluated the proposed approach with respect to classification accuracy and information loss rates on a medical dataset. RESULTS The associated social network of a medical dataset was analyzed for privacy preservation. We evaluated the values of generalization loss and structure loss for different values of k and a, i.e., k = {3, 6, 9, 12, 15, 18, 21, 24, 27, 30}, a = {0, 0.2, 0.4, 0.6, 0.8, 1}. The experimental results of our proposed approach showed that the generalization loss approached optimal when a = 1 and k = 21, and structure loss approached optimal when a = 0.4 and k = 3. CONCLUSION We showed the importance of the attributes and the structure of the released health data in privacy preservation. Our method achieved better privacy preservation in the social network by optimizing generalization loss and structure loss. The proposed method to evaluate loss obtained a balance between data availability and the risk of privacy leakage.
Affiliation(s)
- Jie Su
  - School of Information Science and Engineering, University of Jinan, Jinan, 250022, China
  - Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan, 250022, China
- Yi Cao
  - School of Information Science and Engineering, University of Jinan, Jinan, 250022, China
  - Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan, 250022, China
- Yuehui Chen
  - School of Information Science and Engineering, University of Jinan, Jinan, 250022, China
  - Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan, 250022, China
- Yahui Liu
  - School of Information Management, Beijing Information Science & Technology University, Beijing, China
- Jinming Song
  - Department of Hematopathology and Lab Medicines, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
|
3
|
Amiri-Zarandi M, Dara RA, Fraser E. A survey of machine learning-based solutions to protect privacy in the Internet of Things. Comput Secur 2020. [DOI: 10.1016/j.cose.2020.101921] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 10/24/2022]
|
4
|
Kuo TT. The anatomy of a distributed predictive modeling framework: online learning, blockchain network, and consensus algorithm. JAMIA Open 2020; 3:201-208. [PMID: 32734160 PMCID: PMC7382618 DOI: 10.1093/jamiaopen/ooaa017] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/30/2020] [Revised: 04/21/2020] [Accepted: 04/29/2020] [Indexed: 11/23/2022]
Abstract
Objective Cross-institutional distributed healthcare/genomic predictive modeling is an emerging technology that meets both the need to build more generalizable models and the need to protect patient data, by exchanging only the models and not the patient data. In this article, the implementation details are presented for one specific blockchain-based approach, ExplorerChain, from a software development perspective. The healthcare/genomic use cases of myocardial infarction, cancer biomarker, and length of hospitalization after surgery are also described. Materials and Methods ExplorerChain’s 3 main technical components, including online machine learning, metadata of transaction, and the Proof-of-Information-Timed (PoINT) algorithm, are introduced in this study. Specifically, the 3 algorithms (ie, core, new network, and new site/data) are described in detail. Results ExplorerChain was implemented and its design details were illustrated, especially the development configurations in a practical setting. Also, the system architecture and programming languages are introduced. The code was also released in an open source repository available at https://github.com/tsungtingkuo/explorerchain. Discussion The design considerations of the semi-trust assumption, data format normalization, and non-determinism were discussed. The limitations of the implementation include a fixed number of participating sites, limited join-or-leave capability during initialization, advanced privacy technology yet to be included, and the need for further investigation of ethical, legal, and social implications. Conclusion This study can serve as a reference for researchers who would like to implement and even deploy blockchain technology. Furthermore, the off-the-shelf software can also serve as a cornerstone to accelerate the development and investigation of future healthcare/genomic blockchain studies.
Affiliation(s)
- Tsung-Ting Kuo
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
|
5
|
Bonomi L, Jiang X, Ohno-Machado L. Protecting patient privacy in survival analyses. J Am Med Inform Assoc 2020; 27:366-375. [PMID: 31750926 PMCID: PMC7025359 DOI: 10.1093/jamia/ocz195] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/15/2019] [Revised: 09/09/2019] [Accepted: 10/18/2019] [Indexed: 11/13/2022]
Abstract
OBJECTIVE Survival analysis is the cornerstone of many healthcare applications in which the "survival" probability (eg, time free from a certain disease, time to death) of a group of patients is computed to guide clinical decisions. It is widely used in biomedical research and healthcare applications. However, frequent sharing of exact survival curves may reveal information about the individual patients, as an adversary may infer the presence of a person of interest as a participant of a study or of a particular group. Therefore, it is imperative to develop methods to protect patient privacy in survival analysis. MATERIALS AND METHODS We develop a framework based on the formal model of differential privacy, which provides provable privacy protection against a knowledgeable adversary. We show the performance of privacy-protecting solutions for the widely used Kaplan-Meier nonparametric survival model. RESULTS We empirically evaluated the usefulness of our privacy-protecting framework and the reduced privacy risk for a popular epidemiology dataset and a synthetic dataset. Results show that our methods significantly reduce the privacy risk when compared with their nonprivate counterparts, while retaining the utility of the survival curves. DISCUSSION The proposed framework demonstrates the feasibility of conducting privacy-protecting survival analyses. We discuss future research directions to further enhance the usefulness of our proposed solutions in biomedical research applications. CONCLUSION The results suggest that our proposed privacy-protection methods provide strong privacy protections while preserving the usefulness of survival analyses.
Affiliation(s)
- Luca Bonomi
  - Department of Biomedical Informatics, UC San Diego Health, University of California, San Diego, La Jolla, California, USA
- Xiaoqian Jiang
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- Lucila Ohno-Machado
  - Department of Biomedical Informatics, UC San Diego Health, University of California, San Diego, La Jolla, California, USA
  - Division of Health Services Research and Development, VA San Diego Healthcare System, La Jolla, California, USA
|
6
|
Zerka F, Barakat S, Walsh S, Bogowicz M, Leijenaar RTH, Jochems A, Miraglio B, Townend D, Lambin P. Systematic Review of Privacy-Preserving Distributed Machine Learning From Federated Databases in Health Care. JCO Clin Cancer Inform 2020; 4:184-200. [PMID: 32134684 PMCID: PMC7113079 DOI: 10.1200/cci.19.00047] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Accepted: 01/16/2020] [Indexed: 02/06/2023]
Abstract
Big data for health care is one of the potential solutions to deal with the numerous challenges of health care, such as rising cost, aging population, precision medicine, universal health coverage, and the increase of noncommunicable diseases. However, data centralization for big data raises privacy and regulatory concerns. Covered topics include (1) an introduction to privacy of patient data and distributed learning as a potential solution to preserving these data, a description of the legal context for patient data research, and a definition of machine/deep learning concepts; (2) a presentation of the adopted review protocol; (3) a presentation of the search results; and (4) a discussion of the findings, limitations of the review, and future perspectives. Distributed learning from federated databases makes data centralization unnecessary. Distributed algorithms iteratively analyze separate databases, essentially sharing research questions and answers between databases instead of sharing the data. In other words, one can learn from separate and isolated datasets without patient data ever leaving the individual clinical institutes. Distributed learning promises great potential to facilitate big data for medical application, in particular for international consortiums. Our purpose is to review the major implementations of distributed learning in health care.
Affiliation(s)
- Fadila Zerka
  - The D-Lab, Department of Precision Medicine, GROW School for Oncology and Developmental Biology, Maastricht University Medical Centre, Maastricht, The Netherlands
  - Oncoradiomics, Liège, Belgium
- Samir Barakat
  - The D-Lab, Department of Precision Medicine, GROW School for Oncology and Developmental Biology, Maastricht University Medical Centre, Maastricht, The Netherlands
  - Oncoradiomics, Liège, Belgium
- Sean Walsh
  - The D-Lab, Department of Precision Medicine, GROW School for Oncology and Developmental Biology, Maastricht University Medical Centre, Maastricht, The Netherlands
  - Oncoradiomics, Liège, Belgium
- Marta Bogowicz
  - The D-Lab, Department of Precision Medicine, GROW School for Oncology and Developmental Biology, Maastricht University Medical Centre, Maastricht, The Netherlands
  - Department of Radiation Oncology, University Hospital Zurich and University of Zurich, Zurich, Switzerland
- Ralph T. H. Leijenaar
  - The D-Lab, Department of Precision Medicine, GROW School for Oncology and Developmental Biology, Maastricht University Medical Centre, Maastricht, The Netherlands
  - Oncoradiomics, Liège, Belgium
- Arthur Jochems
  - The D-Lab, Department of Precision Medicine, GROW School for Oncology and Developmental Biology, Maastricht University Medical Centre, Maastricht, The Netherlands
- David Townend
  - Department of Health, Ethics, and Society, CAPHRI (Care and Public Health Research Institute), Maastricht University, Maastricht, The Netherlands
- Philippe Lambin
  - The D-Lab, Department of Precision Medicine, GROW School for Oncology and Developmental Biology, Maastricht University Medical Centre, Maastricht, The Netherlands
|
7
|
A multicenter random forest model for effective prognosis prediction in collaborative clinical research network. Artif Intell Med 2020; 103:101814. [PMID: 32143809 DOI: 10.1016/j.artmed.2020.101814] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 09/16/2019] [Revised: 02/04/2020] [Accepted: 02/04/2020] [Indexed: 12/17/2022]
Abstract
BACKGROUND The accuracy of a prognostic prediction model has become an essential aspect of the quality and reliability of the health-related decisions made by clinicians in modern medicine. Unfortunately, individual institutions often lack sufficient samples, so their models may lack statistical power. One mitigation is to expand data collection from a single institution to multiple centers to collectively increase the sample size. However, sharing sensitive biomedical data for research involves complicated issues. Machine learning models such as random forests (RF), though commonly used and well performing for prognostic prediction, usually perform worse under multicenter privacy-preserving data mining scenarios than a centrally trained version. METHODS AND MATERIALS In this study, a multicenter random forest prognosis prediction model is proposed that enables federated clinical data mining from horizontally partitioned datasets. By using a novel data enhancement approach based on a differentially private generative adversarial network customized to clinical prognosis data, the proposed model is able to provide a multicenter RF model with performance on par with, or even better than, centrally trained RF but without the need to aggregate the raw data. Moreover, our model also incorporates an importance ranking step designed for feature selection without sharing patient-level information. RESULT The proposed model was evaluated on colorectal cancer datasets from the US and China. Two groups of datasets with different levels of heterogeneity within the collaborative research network were selected. First, we compare the performance of the distributed random forest model under different privacy parameters with different percentages of enhancement datasets and validate the effectiveness and plausibility of our approach. Then, we compare the discrimination and calibration ability of the proposed multicenter random forest with a centrally trained random forest model and other tree-based classifiers as well as some commonly used machine learning methods. The results show that the proposed model can provide better prediction performance in terms of discrimination and calibration ability than the centrally trained RF model or the other candidate models while following the privacy-preserving rules in both groups. Additionally, good discrimination and calibration ability are shown on the simplified model based on the feature importance ranking in the proposed approach. CONCLUSION The proposed random forest model exhibits ideal prediction capability using multicenter clinical data and overcomes the performance limitation arising from privacy guarantees. It can also provide feature importance ranking across institutions without pooling the data at a central site. This study offers a practical solution for building a prognosis prediction model in the collaborative clinical research network and solves practical issues in real-world applications of medical artificial intelligence.
|
8
|
Liu Y, Huang J, Urbanowicz RJ, Chen K, Manduchi E, Greene CS, Moore JH, Scheet P, Chen Y. Embracing study heterogeneity for finding genetic interactions in large-scale research consortia. Genet Epidemiol 2019; 44:52-66. [PMID: 31583758 DOI: 10.1002/gepi.22262] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/13/2018] [Revised: 08/02/2019] [Accepted: 08/09/2019] [Indexed: 11/12/2022]
Abstract
Genetic interactions have been recognized as a potentially important contributor to the heritability of complex diseases. Nevertheless, due to small effect sizes and stringent multiple-testing correction, identifying genetic interactions in complex diseases is particularly challenging. To address the above challenges, many genomic research initiatives collaborate to form large-scale consortia and develop open access to enable sharing of genome-wide association study (GWAS) data. Despite the perceived benefits of data sharing from large consortia, a number of practical issues have arisen, such as privacy concerns on individual genomic information and heterogeneous data sources from distributed GWAS databases. In the context of large consortia, we demonstrate that the heterogeneously appearing marginal effects over distributed GWAS databases can offer new insights into genetic interactions for which conventional methods have had limited success. In this paper, we develop a novel two-stage testing procedure, named phylogenY-based effect-size tests for interactions using first 2 moments (YETI2), to detect genetic interactions through both pooled marginal effects, in terms of averaging site-specific marginal effects, and heterogeneity in marginal effects across sites, using a meta-analytic framework. YETI2 can not only be applied to large consortia without shared personal information but also can be used to leverage underlying heterogeneity in marginal effects to prioritize potential genetic interactions. We investigate the performance of YETI2 through simulation studies and apply YETI2 to bladder cancer data from dbGaP.
Affiliation(s)
- Yulun Liu
  - Department of Population and Data Sciences, The University of Texas Southwestern Medical Center, Dallas, Texas
- Jing Huang
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
- Ryan J Urbanowicz
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
- Kun Chen
  - Department of Statistics, University of Connecticut, Storrs, Connecticut
- Elisabetta Manduchi
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
- Casey S Greene
  - Department of Pharmacology, University of Pennsylvania, Philadelphia, Pennsylvania
- Jason H Moore
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
- Paul Scheet
  - Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, Texas
- Yong Chen
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
|
9
|
Arellano AM, Dai W, Wang S, Jiang X, Ohno-Machado L. Privacy Policy and Technology in Biomedical Data Science. Annu Rev Biomed Data Sci 2018; 1:115-129. [PMID: 31058261 PMCID: PMC6497413 DOI: 10.1146/annurev-biodatasci-080917-013416] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 02/04/2023]
Abstract
Privacy is an important consideration when sharing clinical data, which often contain sensitive information. Adequate protection to safeguard patient privacy and to increase public trust in biomedical research is paramount. This review covers topics in policy and technology in the context of clinical data sharing. We review policy articles related to (a) the Common Rule, HIPAA privacy and security rules, and governance; (b) patients' viewpoints and consent practices; and (c) research ethics. We identify key features of the revised Common Rule and the most notable changes since its previous version. We address data governance for research in addition to the increasing emphasis on ethical and social implications. Research ethics topics include data sharing best practices, use of data from populations of low socioeconomic status (SES), recent updates to institutional review board (IRB) processes to protect human subjects' data, and important concerns about the limitations of current policies to address data deidentification. In terms of technology, we focus on articles that have applicability in real world health care applications: deidentification methods that comply with HIPAA, data anonymization approaches to satisfy well-acknowledged issues in deidentified data, encryption methods to safeguard data analyses, and privacy-preserving predictive modeling. The first two technology topics are mostly relevant to methodologies that attempt to sanitize structured or unstructured data. The third topic includes analysis on encrypted data. The last topic includes various mechanisms to build statistical models without sharing raw data.
Affiliation(s)
- April Moreno Arellano
  - Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA
- Wenrui Dai
  - Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA
- Shuang Wang
  - Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA
- Xiaoqian Jiang
  - Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA
- Lucila Ohno-Machado
  - Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA
|
10
|
Wang M, Ji Z, Kim HE, Wang S, Xiong L, Jiang X. Selecting Optimal Subset to release under Differentially Private M-estimators from Hybrid Datasets. IEEE Transactions on Knowledge and Data Engineering 2018; 30:573-584. [PMID: 30034201 PMCID: PMC6051552 DOI: 10.1109/tkde.2017.2773545] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/08/2023]
Abstract
Privacy concerns in data sharing, especially for health data, are receiving increasing attention. Some patients now agree to open their information for research use, which gives rise to a new question: how to use the public information effectively to better understand the private dataset without breaching privacy. In this paper, we specialize this question as selecting an optimal subset of the public dataset for M-estimators in the framework of differential privacy (DP) in [1]. From the perspective of non-interactive learning, we first construct the weighted private density estimation from the hybrid datasets under DP. Along the same lines as [2], we analyze the accuracy of the DP M-estimators based on the hybrid datasets. Our main contributions are (i) we find that the bias-variance tradeoff in the performance of our M-estimators can be characterized in terms of the sample size of the released dataset; (ii) based on this finding, we develop an algorithm to select the optimal subset of the public dataset to release under DP. Our simulation studies and application to real datasets confirm our findings and provide a guideline for real applications.
Affiliation(s)
- Meng Wang
  - Department of Biomedical Informatics, University of California at San Diego, CA 92093, U.S.; now with the Department of Genetics, Stanford University, CA 94305, U.S.
- Zhanglong Ji
  - Department of Biomedical Informatics, University of California at San Diego, CA 92093, U.S.
- Hyeon-Eui Kim
  - Department of Biomedical Informatics, University of California at San Diego, CA 92093, U.S.
- Shuang Wang
  - Department of Biomedical Informatics, University of California at San Diego, CA 92093, U.S.
- Li Xiong
  - Department of Computer Science, Emory University, GA 30322, U.S.
- Xiaoqian Jiang
  - Department of Biomedical Informatics, University of California at San Diego, CA 92093, U.S.
|
11
|
Honkela A, Das M, Nieminen A, Dikmen O, Kaski S. Efficient differentially private learning improves drug sensitivity prediction. Biol Direct 2018; 13:1. [PMID: 29409513 PMCID: PMC5801888 DOI: 10.1186/s13062-017-0203-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 07/06/2017] [Accepted: 12/21/2017] [Indexed: 11/10/2022]
Abstract
BACKGROUND Users of a personalised recommendation system face a dilemma: recommendations can be improved by learning from data, but only if other users are willing to share their private information. Good personalised predictions are vitally important in precision medicine, but genomic information on which the predictions are based is also particularly sensitive, as it directly identifies the patients and hence cannot easily be anonymised. Differential privacy has emerged as a potentially promising solution: privacy is considered sufficient if the presence of individual patients cannot be distinguished. However, differentially private learning with current methods does not improve predictions with feasible data sizes and dimensionalities. RESULTS We show that useful predictors can be learned under powerful differential privacy guarantees, and even from moderately-sized data sets, by demonstrating significant improvements in the accuracy of private drug sensitivity prediction with a new robust private regression method. Our method matches the predictive accuracy of the state-of-the-art non-private lasso regression using only 4x more samples under relatively strong differential privacy guarantees. Good performance with limited data is achieved by limiting the sharing of private information: decreasing the dimensionality and projecting outliers to fit tighter bounds, so that less noise needs to be added for equal privacy. CONCLUSIONS The proposed differentially private regression method combines theoretical appeal and asymptotic efficiency with good prediction accuracy even with moderate-sized data. As even this simple-to-implement method shows promise on challenging genomic data, we anticipate rapid progress towards practical applications in many fields. REVIEWERS This article was reviewed by Zoltan Gaspari and David Kreil.
Collapse
Affiliation(s)
- Antti Honkela
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
  - Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
  - Department of Public Health, University of Helsinki, Helsinki, Finland
- Mrinal Das
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Helsinki, Finland
- Arttu Nieminen
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
- Onur Dikmen
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
- Samuel Kaski
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Helsinki, Finland
|
12
|
Plis SM, Sarwate AD, Wood D, Dieringer C, Landis D, Reed C, Panta SR, Turner JA, Shoemaker JM, Carter KW, Thompson P, Hutchison K, Calhoun VD. COINSTAC: A Privacy Enabled Model and Prototype for Leveraging and Processing Decentralized Brain Imaging Data. Front Neurosci 2016; 10:365. [PMID: 27594820 PMCID: PMC4990563 DOI: 10.3389/fnins.2016.00365] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Received: 04/20/2016] [Accepted: 07/22/2016] [Indexed: 01/17/2023]
Abstract
The field of neuroimaging has embraced the need for sharing and collaboration. Data sharing mandates from public funding agencies and major journal publishers have spurred the development of data repositories and neuroinformatics consortia. However, efficient and effective data sharing still faces several hurdles. For example, open data sharing is on the rise but is not suitable for sensitive data that are not easily shared, such as genetics. Current approaches can be cumbersome (such as negotiating multiple data sharing agreements). There are also significant data transfer, organization and computational challenges. Centralized repositories only partially address the issues. We propose a dynamic, decentralized platform for large scale analyses called the Collaborative Informatics and Neuroimaging Suite Toolkit for Anonymous Computation (COINSTAC). The COINSTAC solution can include data missing from central repositories, allows pooling of both open and "closed" repositories by developing privacy-preserving versions of widely-used algorithms, and incorporates the tools within an easy-to-use platform enabling distributed computation. We present an initial prototype system which we demonstrate on two multi-site data sets, without aggregating the data. In addition, by iterating across sites, the COINSTAC model enables meta-analytic solutions to converge to "pooled-data" solutions (i.e., as if the entire data were in hand). More advanced approaches such as feature generation, matrix factorization models, and preprocessing can be incorporated into such a model. In sum, COINSTAC enables access to the many currently unavailable data sets, a user friendly privacy enabled interface for decentralized analysis, and a powerful solution that complements existing data sharing solutions.
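The claim that iterating across sites can reach the "pooled-data" solution can be illustrated with a minimal sketch. It assumes a linear regression model and plain gradient descent, and it is not the COINSTAC API: the functions `local_gradient` and `decentralized_fit` and the synthetic three-site data are hypothetical. Each site shares only a gradient, never its rows, and the summed gradients equal the gradient on the pooled data.

```python
import numpy as np

def local_gradient(w, X, y):
    """Unnormalised squared-error gradient computed on one site's own data."""
    return X.T @ (X @ w - y)

def decentralized_fit(sites, dim, lr=0.1, steps=500):
    """Gradient descent in which an aggregator only ever sees site gradients."""
    n_total = sum(len(y) for _, y in sites)
    w = np.zeros(dim)
    for _ in range(steps):
        # The sum of per-site gradients is exactly the pooled-data gradient.
        grad = sum(local_gradient(w, X, y) for X, y in sites) / n_total
        w = w - lr * grad
    return w

# Synthetic data held at three sites of different sizes.
rng = np.random.default_rng(1)
true_w = np.array([1.5, -2.0, 0.5])
sites = []
for n in (40, 60, 25):
    X = rng.normal(size=(n, 3))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    sites.append((X, y))

w_decentralized = decentralized_fit(sites, dim=3)

# Reference: least squares on the pooled data, which the sites never form.
X_all = np.vstack([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
w_pooled, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)
```

Sharing raw gradients can still leak information about local data, which is why the abstract speaks of privacy-preserving versions of algorithms; this sketch shows only the convergence argument.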
Affiliation(s)
- Sergey M. Plis
  - The Mind Research Network, Lovelace Biomedical and Environmental Research Institute, Albuquerque, NM, USA
- Anand D. Sarwate
  - Department of Electrical and Computer Engineering, Rutgers, The State University of New Jersey, Piscataway, NJ, USA
- Dylan Wood
  - The Mind Research Network, Lovelace Biomedical and Environmental Research Institute, Albuquerque, NM, USA
- Christopher Dieringer
  - The Mind Research Network, Lovelace Biomedical and Environmental Research Institute, Albuquerque, NM, USA
- Drew Landis
  - The Mind Research Network, Lovelace Biomedical and Environmental Research Institute, Albuquerque, NM, USA
- Cory Reed
  - The Mind Research Network, Lovelace Biomedical and Environmental Research Institute, Albuquerque, NM, USA
- Sandeep R. Panta
  - The Mind Research Network, Lovelace Biomedical and Environmental Research Institute, Albuquerque, NM, USA
- Jessica A. Turner
  - The Mind Research Network, Lovelace Biomedical and Environmental Research Institute, Albuquerque, NM, USA
  - Department of Psychology and Neuroscience Institute, Georgia State University, Atlanta, GA, USA
- Jody M. Shoemaker
  - The Mind Research Network, Lovelace Biomedical and Environmental Research Institute, Albuquerque, NM, USA
- Kim W. Carter
  - Telethon Kids Institute, The University of Western Australia, Subiaco, WA, Australia
- Paul Thompson
  - Departments of Neurology, Psychiatry, Engineering, Radiology, and Pediatrics, Imaging Genetics Center, Enhancing Neuroimaging and Genetics through Meta-Analysis Center for Worldwide Medicine, Imaging, and Genomics, University of Southern California, Marina del Rey, CA, USA
- Kent Hutchison
  - Department of Psychology and Neuroscience, University of Colorado Boulder, Boulder, CO, USA
- Vince D. Calhoun
  - The Mind Research Network, Lovelace Biomedical and Environmental Research Institute, Albuquerque, NM, USA
  - Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM, USA
|
13
|
Shi H, Jiang C, Dai W, Jiang X, Tang Y, Ohno-Machado L, Wang S. Secure Multi-pArty Computation Grid LOgistic REgression (SMAC-GLORE). BMC Med Inform Decis Mak 2016; 16 Suppl 3:89. [PMID: 27454168 PMCID: PMC4959358 DOI: 10.1186/s12911-016-0316-1] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Indexed: 11/10/2022]
Abstract
Background In biomedical research, data sharing and information exchange are very important for improving quality of care, accelerating discovery, and promoting the meaningful secondary use of clinical data. A big concern in biomedical data sharing is the protection of patient privacy because inappropriate information leakage can put patient privacy at risk. Methods In this study, we deployed a grid logistic regression framework based on Secure Multi-party Computation (SMAC-GLORE). Unlike our previous work in GLORE, SMAC-GLORE protects not only patient-level data, but also all the intermediary information exchanged during the model-learning phase. Results The experimental results demonstrate the feasibility of secure distributed logistic regression across multiple institutions without sharing patient-level data. Conclusions In this study, we developed a circuit-based SMAC-GLORE framework. The proposed framework provides a practical solution for secure distributed logistic regression model learning.
Affiliation(s)
- Haoyi Shi
  - Department of Biomedical Informatics, University of California, San Diego, CA, 92093, USA
  - Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY, 13210, USA
- Chao Jiang
  - Department of Biomedical Informatics, University of California, San Diego, CA, 92093, USA
  - School of Electrical and Computer Engineering, University of Oklahoma, Tulsa, OK, 74135, USA
- Wenrui Dai
  - Department of Biomedical Informatics, University of California, San Diego, CA, 92093, USA
- Xiaoqian Jiang
  - Department of Biomedical Informatics, University of California, San Diego, CA, 92093, USA
- Yuzhe Tang
  - Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY, 13210, USA
- Lucila Ohno-Machado
  - Department of Biomedical Informatics, University of California, San Diego, CA, 92093, USA
- Shuang Wang
  - Department of Biomedical Informatics, University of California, San Diego, CA, 92093, USA
|
14
|
|