1
|
Bataa M, Song S, Park K, Kim M, Cheon JH, Kim S. Finding Highly Similar Regions of Genomic Sequences Through Homomorphic Encryption. J Comput Biol 2024; 31:197-212. [PMID: 38531050 DOI: 10.1089/cmb.2023.0050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/28/2024]
Abstract
Finding highly similar regions of genomic sequences is a basic computation in genomic analysis. Genomic analyses on large amounts of data are processed efficiently in cloud environments, but outsourcing them to a cloud raises privacy and security concerns. Homomorphic encryption (HE) is a powerful cryptographic primitive that preserves the privacy of genomic data in various analyses processed in an untrusted cloud environment. We introduce an efficient algorithm for finding highly similar regions of two homomorphically encrypted sequences, and describe how to implement it using bit-wise and word-wise HE schemes. In our experiments, the algorithm outperforms an existing algorithm by up to two orders of magnitude in elapsed time. Overall, it finds highly similar regions of sequences in real data sets in a feasible time.
Affiliation(s)
- Magsarjav Bataa
  - Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
  - Department of Information and Computer Sciences, National University of Mongolia, Ulaanbaatar, Mongolia
- Siwoo Song
  - Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
- Kunsoo Park
  - Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
- Miran Kim
  - Department of Mathematics, Hanyang University, Seoul, South Korea
- Jung Hee Cheon
  - Department of Mathematical Sciences, Seoul National University, Seoul, South Korea
- Sun Kim
  - Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
|
2
|
Ueda A, Tussie C, Kim S, Kuwajima Y, Matsumoto S, Kim G, Satoh K, Nagai S. Classification of Maxillofacial Morphology by Artificial Intelligence Using Cephalometric Analysis Measurements. Diagnostics (Basel) 2023; 13:2134. [PMID: 37443528 DOI: 10.3390/diagnostics13132134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Revised: 06/08/2023] [Accepted: 06/15/2023] [Indexed: 07/15/2023]
Abstract
The characteristics of maxillofacial morphology play a major role in orthodontic diagnosis and treatment planning. While Sassouni's classification scheme outlines different categories of maxillofacial morphology, there is no standardized approach to assigning these classifications to patients. This study aimed to create an artificial intelligence (AI) model that uses cephalometric analysis measurements to accurately classify maxillofacial morphology, allowing for the standardization of maxillofacial morphology classification. This study used the initial cephalograms of 220 patients aged 18 years or older. Three orthodontists classified the maxillofacial morphologies of the 220 patients using eight measurements, and their classifications served as the reference labels. Using these eight cephalometric measurements and the subject's gender as input features, a random forest classifier from the Python scikit-learn package was trained and tested with five-fold cross-validation to determine orthodontic classification; distinct models were created for horizontal-only, vertical-only, and combined maxillofacial morphology classification. The accuracy of the combined facial classification was 0.823 ± 0.060; for anteroposterior-only classification, the accuracy was 0.986 ± 0.011; and for the vertical-only classification, the accuracy was 0.850 ± 0.037. ANB angle had the greatest feature importance at 0.3519. The AI model created in this study accurately classified maxillofacial morphology, but it can be further improved with more learning data input.
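The evaluation setup described in this abstract (a random forest scored with a five-fold split, plus feature importances) can be sketched roughly as follows. This is only an illustration on synthetic data: the feature matrix, the three hypothetical classes, the stratified shuffling, and the hyperparameters are assumptions, not the study's data or settings.

```python
# Sketch only: synthetic data stands in for the cephalometric measurements.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 220 synthetic "patients", 9 features (8 measurements + gender), 3 hypothetical classes.
X, y = make_classification(
    n_samples=220, n_features=9, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Five folds; stratification and shuffling are assumptions made for this sketch.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# Feature importances come from a model fitted on all samples.
clf.fit(X, y)
importances = clf.feature_importances_
print("most important feature index:", int(np.argmax(importances)))
```

Reporting the mean and standard deviation over folds mirrors the "0.823 ± 0.060" style of the abstract, although the abstract does not say which spread statistic was used.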
Affiliation(s)
- Akane Ueda
  - Division of Orthodontics, Department of Developmental Oral Health Science, School of Dentistry, Iwate Medical University, 1-3-27 Chuo-dori, Morioka 020-8505, Iwate, Japan
  - Department of Restorative Dentistry and Biomaterial Sciences, Harvard School of Dental Medicine, 188 Longwood Avenue, Boston, MA 02115, USA
- Cami Tussie
  - DMD Candidate Class of 2025, Harvard School of Dental Medicine, 188 Longwood Avenue, Boston, MA 02115, USA
- Sophie Kim
  - DMD Candidate Class of 2025, Harvard School of Dental Medicine, 188 Longwood Avenue, Boston, MA 02115, USA
- Yukinori Kuwajima
  - Division of Orthodontics, Department of Developmental Oral Health Science, School of Dentistry, Iwate Medical University, 1-3-27 Chuo-dori, Morioka 020-8505, Iwate, Japan
- Shikino Matsumoto
  - Division of Orthodontics, Department of Developmental Oral Health Science, School of Dentistry, Iwate Medical University, 1-3-27 Chuo-dori, Morioka 020-8505, Iwate, Japan
- Grace Kim
  - Department of Developmental Biology, Harvard School of Dental Medicine, 188 Longwood Avenue, Boston, MA 02115, USA
- Kazuro Satoh
  - Division of Orthodontics, Department of Developmental Oral Health Science, School of Dentistry, Iwate Medical University, 1-3-27 Chuo-dori, Morioka 020-8505, Iwate, Japan
- Shigemi Nagai
  - Department of Restorative Dentistry and Biomaterial Sciences, Harvard School of Dental Medicine, 188 Longwood Avenue, Boston, MA 02115, USA
|
3
|
Kuo TT, Jiang X, Tang H, Wang X, Harmanci A, Kim M, Post K, Bu D, Bath T, Kim J, Liu W, Chen H, Ohno-Machado L. The evolving privacy and security concerns for genomic data analysis and sharing as observed from the iDASH competition. J Am Med Inform Assoc 2022; 29:2182-2190. [PMID: 36164820 PMCID: PMC9667175 DOI: 10.1093/jamia/ocac165] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/18/2022] [Revised: 08/25/2022] [Accepted: 09/13/2022] [Indexed: 01/11/2023]
Abstract
Concerns regarding inappropriate leakage of sensitive personal information as well as unauthorized data use are increasing with the growth of genomic data repositories. Therefore, privacy and security of genomic data have become increasingly important and need to be studied. With many proposed protection techniques, their applicability in support of biomedical research should be well understood. For this purpose, we have organized a community effort over the past 8 years through the Integrating Data for Analysis, Anonymization, and SHaring (iDASH) consortium to address this practical challenge. In this article, we summarize our experience from these competitions, report lessons learned from the events in 2020/2021 as examples, and discuss potential future research directions in this emerging field.
Affiliation(s)
- Tsung-Ting Kuo
  - Corresponding Author: Tsung-Ting Kuo, PhD, UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093, USA
- Arif Harmanci
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Miran Kim
  - Department of Mathematics, Hanyang University, Seoul, Republic of Korea
  - Department of Computer Science, Hanyang University, Seoul, Republic of Korea
- Kai Post
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
- Diyue Bu
  - Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, Indiana, USA
- Tyler Bath
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
- Jihoon Kim
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
- Weijie Liu
  - Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, Indiana, USA
- Hongbo Chen
  - Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, Indiana, USA
- Lucila Ohno-Machado
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
  - Division of Health Services Research & Development, Veteran Affairs San Diego Healthcare System, San Diego, California, USA
|
4
|
Islam MM, Mohammed N, Wang Y, Hu P. Differential Private Deep Learning Models for Analyzing Breast Cancer Omics Data. Front Oncol 2022; 12:879607. [PMID: 35814415 PMCID: PMC9259987 DOI: 10.3389/fonc.2022.879607] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/21/2022] [Accepted: 05/20/2022] [Indexed: 12/24/2022]
Abstract
Proper analysis of high-dimensional human genomic data is necessary to increase human knowledge about fundamental biological questions such as disease associations and drug sensitivity. However, such data contain sensitive private information about individuals and can be used to uniquely identify an individual (i.e., a privacy violation). Therefore, raw genomic datasets cannot be publicly published or shared with researchers. The recent success of deep learning (DL) in diverse problems proved its suitability for analyzing the high volume of high-dimensional genomic data. Still, DL-based models leak information about the training samples. To overcome this challenge, we can incorporate differential privacy mechanisms into the DL analysis framework, as differential privacy can protect individuals’ privacy. We proposed a differential privacy based DL framework to solve two biological problems: breast cancer status (BCS) and cancer type (CT) classification, and drug sensitivity prediction. To predict BCS and CT using genomic data, we built a differentially private (DP) deep autoencoder (dpAE) using private gene expression datasets that performs low-dimensional data representation learning. We used dpAE features to build multiple DP binary classifiers to predict BCS and CT in any individual. To predict drug sensitivity, we used the Genomics of Drug Sensitivity in Cancer (GDSC) dataset. We extracted GDSC’s dpAE features to build our DP drug sensitivity prediction model for 265 drugs. Evaluation of our proposed DP framework shows that it achieves better performance in predicting BCS, CT, and drug sensitivity than the previously published DP work.
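The abstract does not state which differential privacy mechanism was used. As a rough illustration of how privacy noise is commonly added when training DL models, the sketch below clips each per-example gradient and adds Gaussian noise before averaging (the usual DP-SGD recipe). The function names, `max_norm`, and `noise_multiplier` are hypothetical choices for this example, and no privacy accounting is shown.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm=1.0,
                          noise_multiplier=1.1, seed=None):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    max_norm and noise_multiplier are illustrative values, not tuned settings.
    """
    rng = random.Random(seed)
    dim = len(per_example_grads[0])
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


grads = [[3.0, 4.0], [0.1, -0.2], [-0.5, 0.5]]
print(private_mean_gradient(grads, seed=0))
```

Clipping bounds how much any single sample can move the update, which is what lets the added noise mask an individual's contribution.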
Affiliation(s)
- Noman Mohammed
  - Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, MB, Canada
- Yang Wang
  - Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, MB, Canada
- Pingzhao Hu
  - Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada
  - Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, MB, Canada
  - Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, MB, Canada
  - Research Institute for Oncology and Hematology, CancerCare Manitoba, Winnipeg, MB, Canada
  - *Correspondence: Pingzhao Hu,
|
5
|
Towards the Sign Function Best Approximation for Secure Outsourced Computations and Control. MATHEMATICS 2022. [DOI: 10.3390/math10122006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Homomorphic encryption with the ability to compute over encrypted data without access to the secret key provides benefits for the secure and powerful computation, storage, and communication of resources in the cloud. One of its important applications is fast-growing robot control systems for building lightweight, low-cost, smarter robots with intelligent brains consisting of data centers, knowledge bases, task planners, deep learning, information processing, environment models, communication support, synchronous map construction and positioning, etc. It enables robots to be endowed with secure, powerful capabilities while reducing sizes and costs. Processing encrypted information using homomorphic ciphers uses the sign function polynomial approximation, which is a widely studied research field with many practical results. State-of-the-art works are mainly focused on finding the polynomial of best approximation of the sign function (PBAS) with improved errors on the union of the intervals [−1,−ϵ]∪[ϵ,1]. However, even though the existence of the single PBAS with the minimum deviation is well known, its construction method on the complete interval [−1,1] is still an open problem. In this paper, we provide the PBAS construction method on the interval [−1,1], using as a norm the area between the sign function and the polynomial and showing that for a polynomial degree n≥1, there is (1) a unique PBAS of the odd sign function, (2) no PBAS of the general form sign function if n is odd, and (3) an uncountable set of PBAS, if n is even.
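To make the idea of a polynomial stand-in for the sign function concrete, the sketch below iterates the odd cubic f(x) = (3x − x³)/2, a simple composite approximation that is not the PBAS construction from this paper. It then estimates the area between sign(x) and the composed polynomial on [−1, 1], which is the norm the abstract refers to. The function names, the composition depth, and the midpoint-rule integration are choices made only for this illustration.

```python
def f(x: float) -> float:
    """Odd cubic f(x) = (3x - x^3) / 2, which maps [-1, 1] into itself."""
    return (3.0 * x - x ** 3) / 2.0


def approx_sign(x: float, depth: int = 6) -> float:
    """Compose f with itself `depth` times; values move toward -1 or +1."""
    for _ in range(depth):
        x = f(x)
    return x


def area_error(depth: int, samples: int = 20000) -> float:
    """Midpoint-rule estimate of the area between sign(x) and the
    composed polynomial on [-1, 1]."""
    h = 2.0 / samples
    total = 0.0
    for i in range(samples):
        x = -1.0 + (i + 0.5) * h  # midpoints never hit x = 0 when samples is even
        s = 1.0 if x > 0 else -1.0
        total += abs(s - approx_sign(x, depth)) * h
    return total


for depth in (1, 2, 4, 6):
    print(depth, round(area_error(depth), 4))
```

Only additions and multiplications appear in `f`, which is why polynomials of this kind can be evaluated on homomorphically encrypted inputs; the area shrinks as the composition gets deeper, at the cost of a higher polynomial degree.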
|
6
|
Privacy-preserving federated neural network learning for disease-associated cell classification. PATTERNS 2022; 3:100487. [PMID: 35607628 PMCID: PMC9122966 DOI: 10.1016/j.patter.2022.100487] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/18/2021] [Revised: 02/14/2022] [Accepted: 03/14/2022] [Indexed: 11/21/2022]
Abstract
Training accurate and robust machine learning models requires a large amount of data that is usually scattered across data silos. Sharing or centralizing the data of different healthcare institutions is, however, unfeasible or prohibitively difficult due to privacy regulations. In this work, we address this problem by using a privacy-preserving federated learning-based approach, PriCell, for complex models such as convolutional neural networks. PriCell relies on multiparty homomorphic encryption and enables the collaborative training of encrypted neural networks with multiple healthcare institutions. We preserve the confidentiality of each institution's input data, of any intermediate values, and of the trained model parameters. We efficiently replicate the training of a published state-of-the-art convolutional neural network architecture in a decentralized and privacy-preserving manner. Our solution achieves an accuracy comparable with the one obtained with the centralized non-secure solution. PriCell guarantees patient privacy and ensures data utility for efficient multi-center studies involving complex healthcare data.
- We enable collaborative and privacy-preserving model training between institutions
- Training under encryption does not degrade the utility of the data
- We apply our solution to the single-cell analysis in a federated setting
- Our method is generalizable to other machine learning tasks in the healthcare domain
High-quality medical machine learning models will benefit greatly from collaboration between health care institutions. Yet, it is usually difficult to transfer data between these institutions due to strict privacy regulations. In this study, we propose a solution, PriCell, that relies on multiparty homomorphic encryption to enable privacy-preserving collaborative machine learning while protecting via encryption the institutions' input data, the model, and any value exchanged between the institutions. We show the maturity of our solution by training a published state-of-the-art convolutional neural network in a decentralized and privacy-preserving manner. We compare the accuracy achieved by PriCell with the centralized and non-secure solutions and show that PriCell guarantees privacy without reducing the utility of the data. The benefits of PriCell constitute an important landmark for real-world applications of collaborative training while preserving privacy.
|
7
|
Yan X, Zhao W, Wei J, Yao Y, Sun G, Wang L, Zhang W, Chen S, Zhou W, Zhao H, Li X, Xiao Y, Li Y. A serum lipidomics study for the identification of specific biomarkers for endometrial polyps to distinguish them from endometrial cancer or hyperplasia. Int J Cancer 2022; 150:1549-1559. [PMID: 35076938 DOI: 10.1002/ijc.33943] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 10/04/2021] [Revised: 01/09/2022] [Accepted: 01/13/2022] [Indexed: 11/10/2022]
Affiliation(s)
- Xingxu Yan
  - School of Chinese Materia Medica, Tianjin University of Traditional Chinese Medicine, Tianjin, China
- Wen Zhao
  - Department of Gynaecology and Obstetrics, People's Hospital of Guangrao County, 257300 Dongying, Shandong, China
- Jinxia Wei
  - School of Chinese Materia Medica, Tianjin University of Traditional Chinese Medicine, Tianjin, China
- Yaqi Yao
  - School of Chinese Materia Medica, Tianjin University of Traditional Chinese Medicine, Tianjin, China
- Guijiang Sun
  - Department of Kidney Disease and Blood Purification, The Second Hospital of Tianjin Medical University, Tianjin, China
- Lei Wang
  - Department of Oncology, Tianjin Institute of Urology, The Second Hospital of Tianjin Medical University, Tianjin, China
- Wenqing Zhang
  - School of Chinese Materia Medica, Tianjin University of Traditional Chinese Medicine, Tianjin, China
- Siyu Chen
  - School of Chinese Materia Medica, Tianjin University of Traditional Chinese Medicine, Tianjin, China
- Wenjie Zhou
  - School of Chinese Materia Medica, Tianjin University of Traditional Chinese Medicine, Tianjin, China
- Huan Zhao
  - School of Chinese Materia Medica, Tianjin University of Traditional Chinese Medicine, Tianjin, China
- Xiaomeng Li
  - School of Chinese Materia Medica, Tianjin University of Traditional Chinese Medicine, Tianjin, China
- Yu Xiao
  - Hysteroscopic Center, FuXing Hospital, Capital Medical University, Beijing, China
- Yubo Li
  - School of Chinese Materia Medica, Tianjin University of Traditional Chinese Medicine, Tianjin, China
|
8
|
Liu Y, Wang Z, Zhao L. A Potential Three-Gene-Based Diagnostic Signature for Hypertension in Pregnancy. Int J Gen Med 2021; 14:6847-6856. [PMID: 34703289 PMCID: PMC8526516 DOI: 10.2147/ijgm.s331573] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2021] [Accepted: 09/28/2021] [Indexed: 11/23/2022]
Abstract
Background: Hypertensive disorders of pregnancy affect approximately 5–10% of all pregnancies, and this study aims to identify potential diagnostic signatures.
Methods: We downloaded the mRNA profiles of GSE75010 (placenta samples) and GSE48424 (blood samples) datasets with or without hypertension in pregnancy from the Gene Expression Omnibus database. Differential expression analysis was performed on the placenta samples using limma package of R language. GO terms and KEGG pathways enrichment analyses were performed on the placenta samples by the clusterProfiler package of R language. Infiltrating immune cell proportion of the placenta samples was evaluated using CIBERSORT software. The key genes involved in hypertension in pregnancy were screened from protein–protein interaction (PPI) network constructed based on the differentially expressed genes (DEGs). The logistic regression model was constructed by the glm package of R language, and receiver operating characteristic (ROC) curve was plotted to determine the accuracy of the model.
Results: For the placenta samples, a total of 104 DEGs were identified, and 39 GO terms and 7 KEGG pathways were significantly enriched based on these 104 genes. Furthermore, the analysis of infiltrating immune cells indicated that the difference in the amount of immune cells might be the potential cause of hypertension in pregnancy. The logistic regression model was constructed based on three optimal genes (LEP, PRL and IGFBP1) screened from PPI network and could efficiently separate patients with hypertension in pregnancy from healthy subjects.
Conclusion: A predictive model based on three potential genes LEP, PRL and IGFBP1 was obtained, suggesting that these genes might be potential diagnostic signatures for hypertension in pregnancy.
Affiliation(s)
- Yan Liu
  - Department of Obstetrics, Tianjin First Central Hospital, Nankai University, Tianjin, 300192, People's Republic of China
- Zhenglu Wang
  - Biobank, Tianjin First Central Hospital, Nankai University, Tianjin, 300192, People's Republic of China
- Lin Zhao
  - Department of Obstetrics, Tianjin First Central Hospital, Nankai University, Tianjin, 300192, People's Republic of China
|
9
|
Multi-Party Privacy-Preserving Logistic Regression with Poor Quality Data Filtering for IoT Contributors. ELECTRONICS 2021. [DOI: 10.3390/electronics10172049] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/17/2022]
Abstract
Nowadays, the internet of things (IoT) is used to generate data in several application domains. Logistic regression, a standard machine learning algorithm with a wide application range, is often built on such data. Nevertheless, building a powerful and effective logistic regression model requires large amounts of data. Thus, collaboration between multiple IoT participants has often been the go-to approach. However, privacy concerns and poor data quality are two challenges that threaten the success of such a setting. Several studies have proposed different methods to address the privacy concern but, to the best of our knowledge, little attention has been paid to the poor data quality problem in the multi-party logistic regression model. Thus, in this study, we propose a multi-party privacy-preserving logistic regression framework with poor quality data filtering for IoT data contributors to address both problems. Specifically, we propose a new metric, gradient similarity, in a distributed setting that we employ to filter out parameters from data contributors with poor quality data. To solve the privacy challenge, we employ homomorphic encryption. Theoretical analysis and experimental evaluations using real-world datasets demonstrate that our proposed framework is privacy-preserving and robust against poor quality data.
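The abstract does not define its gradient similarity metric. One plausible reading, sketched here purely as an assumption, is to compare each contributor's gradient with a robust reference (the coordinate-wise median) using cosine similarity and drop contributors below a threshold. The contributor names, the threshold of 0.5, and the use of the median are all hypothetical, and the sketch runs on plaintext values rather than under homomorphic encryption.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def filter_contributors(gradients, threshold=0.5):
    """Keep contributors whose gradient points roughly the same way as the
    coordinate-wise median gradient (an assumed reference, see lead-in)."""
    dim = len(next(iter(gradients.values())))
    reference = [median(g[i] for g in gradients.values()) for i in range(dim)]
    return {name: g for name, g in gradients.items()
            if cosine(g, reference) >= threshold}


grads = {
    "device_a": [0.9, -0.4, 0.2],
    "device_b": [1.0, -0.5, 0.1],
    "device_c": [0.8, -0.3, 0.3],
    "device_noisy": [-0.7, 0.6, -0.2],  # points the opposite way
}
kept = filter_contributors(grads)
print(sorted(kept))
```

A median reference is used so that a single poor-quality contributor cannot drag the reference toward itself, which a plain mean would allow.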
|
10
|
Abstract
Tuberculosis (TB) is an airborne infectious disease caused by organisms in the Mycobacterium tuberculosis (Mtb) complex. In many low and middle-income countries, TB remains a major cause of morbidity and mortality. Once a patient has been diagnosed with TB, it is critical that healthcare workers make the most appropriate treatment decision given the individual conditions of the patient and the likely course of the disease based on medical experience. Depending on the prognosis, delayed or inappropriate treatment can result in unsatisfactory results including the exacerbation of clinical symptoms, poor quality of life, and increased risk of death. This work benchmarks machine learning models to aid TB prognosis using a Brazilian health database of confirmed cases and deaths related to TB in the State of Amazonas. The goal is to predict the probability of death by TB, thus aiding the prognosis of TB and the associated treatment decision-making process. In its original form, the data set comprised 36,228 records and 130 fields but suffered from missing, incomplete, or incorrect data. Following data cleaning and preprocessing, a revised data set was generated comprising 24,015 records and 38 fields, including 22,876 reported cured TB patients and 1139 deaths by TB. To explore how the data imbalance impacts model performance, two controlled experiments were designed using (1) imbalanced and (2) balanced data sets. The best result is achieved by the Gradient Boosting (GB) model using the balanced data set to predict TB-mortality, and the ensemble model composed of the Random Forest (RF), GB and Multi-Layer Perceptron (MLP) models is the best model to predict the cure class.
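The abstract does not say how the balanced data set was produced. One common option, shown here only as an assumed illustration, is random undersampling of the majority class. The record layout and the `outcome` field are hypothetical; only the class counts (22,876 cured, 1139 deaths) come from the abstract.

```python
import random
from collections import Counter


def undersample(records, label_key="outcome", seed=0):
    """Randomly keep as many records per class as the smallest class has."""
    rng = random.Random(seed)
    by_label = {}
    for record in records:
        by_label.setdefault(record[label_key], []).append(record)
    n_min = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous runs
    return balanced


# Placeholder records with the class sizes reported in the abstract.
records = (
    [{"id": i, "outcome": "cured"} for i in range(22876)]
    + [{"id": 22876 + i, "outcome": "death"} for i in range(1139)]
)
balanced = undersample(records)
print(Counter(r["outcome"] for r in balanced))
```

Undersampling discards most of the majority class, so oversampling or class weighting are alternatives worth comparing when data are scarce.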
|
11
|
Chen X, Liu G, Wang S, Zhang H, Xue P. Machine learning analysis of gene expression profile reveals a novel diagnostic signature for osteoporosis. J Orthop Surg Res 2021; 16:189. [PMID: 33722258 PMCID: PMC7958453 DOI: 10.1186/s13018-021-02329-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/23/2020] [Accepted: 03/01/2021] [Indexed: 01/25/2023]
Abstract
Background: Osteoporosis (OP) is increasingly prevalent with the aging of the world population. It is urgent to identify efficient diagnostic signatures for clinical application.
Method: We downloaded the mRNA profile of 90 peripheral blood samples with or without OP from the GEO database (Number: GSE152073). Weighted gene co-expression network analysis (WGCNA) was used to reveal the correlation among genes in all samples. GO term and KEGG pathway enrichment analysis was performed via the clusterProfiler R package. The STRING database was applied to screen the interaction pairs among proteins. The protein–protein interaction (PPI) network was visualized in Cytoscape, and the key genes were screened using the cytoHubba plug-in. The diagnostic model based on these key genes was constructed, and 5-fold cross-validation was applied to evaluate its reliability.
Results: A gene module consisting of 176 genes predicted to be associated with the occurrence of OP was identified. A total of 16 significantly enriched GO terms and 1 significantly enriched KEGG pathway were obtained based on the 176 genes. The top 50 key genes in the PPI network were identified. Then 22 genes were screened from the 50 key genes by stepwise regression analysis. Of these, 9 genes were further screened out by multivariate regression analysis with a significance threshold of P value < 0.01. The diagnostic model was established based on the optimal 9 key genes, which efficiently separated the normal samples and OP samples.
Conclusion: A diagnostic model established based on nine key genes could reliably separate OP patients from healthy subjects, which provides novel insights for the diagnostic research of OP.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13018-021-02329-1.
Affiliation(s)
- Xinlei Chen
  - Department of Orthopedics, Zibo Central Hospital, Zibo, 255000, Shandong, China
- Guangping Liu
  - Department of Orthopedics, Zibo Central Hospital, Zibo, 255000, Shandong, China
- Shuxiang Wang
  - Department of Orthopedics, Zibo Central Hospital, Zibo, 255000, Shandong, China
- Haiyang Zhang
  - Department of Orthopedics, Zibo Central Hospital, Zibo, 255000, Shandong, China
- Peng Xue
  - Department of Orthopedics, Zibo Central Hospital, Zibo, 255000, Shandong, China
|
12
|
Females and Males Show Differences in Early-Stage Transcriptomic Biomarkers of Lung Adenocarcinoma and Lung Squamous Cell Carcinoma. Diagnostics (Basel) 2021; 11:diagnostics11020347. [PMID: 33669819 PMCID: PMC7922551 DOI: 10.3390/diagnostics11020347] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/20/2021] [Revised: 02/15/2021] [Accepted: 02/17/2021] [Indexed: 12/25/2022]
Abstract
The incidence and mortality rates of lung cancers differ between females and males. Therefore, sex information should be an important part of how a diagnostic model is trained and optimized. However, most existing studies do not fully utilize this information. This study carried out a comparative investigation between sex-specific models and sex-independent models. Three feature selection algorithms and five classifiers were utilized to evaluate the contribution of sex information to the detection of early-stage lung cancers. For both lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), the sex-specific models outperformed the sex-independent models in detecting early-stage lung cancers. The Venn plots suggested that females and males shared only a few transcriptomic biomarkers of early-stage lung cancers. Our experimental data suggested that sex information should be included in optimizing disease diagnosis models.
|
13
|
Scalable Privacy-Preserving Distributed Learning. PROCEEDINGS ON PRIVACY ENHANCING TECHNOLOGIES 2021. [DOI: 10.2478/popets-2021-0030] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 11/20/2022]
Abstract
In this paper, we address the problem of privacy-preserving distributed learning and the evaluation of machine-learning models by analyzing it in the widespread MapReduce abstraction that we extend with privacy constraints. We design SPINDLE (Scalable Privacy-preservINg Distributed LEarning), the first distributed and privacy-preserving system that covers the complete ML workflow by enabling the execution of a cooperative gradient-descent and the evaluation of the obtained model and by preserving data and model confidentiality in a passive-adversary model with up to N − 1 colluding parties. SPINDLE uses multiparty homomorphic encryption to execute parallel high-depth computations on encrypted data without significant overhead. We instantiate SPINDLE for the training and evaluation of generalized linear models on distributed datasets and show that it is able to accurately (on par with non-secure centrally-trained models) and efficiently (due to a multi-level parallelization of the computations) train models that require a high number of iterations on large input data with thousands of features, distributed among hundreds of data providers. For instance, it trains a logistic-regression model on a dataset of one million samples with 32 features distributed among 160 data providers in less than three minutes.
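The cooperative gradient descent described above can be pictured, without any encryption, as a map step in which each data provider computes a gradient on its own rows and a reduce step that combines them. The sketch below is a plaintext toy for logistic regression; the function names, learning rate, iteration count, and data are assumptions for illustration and do not reproduce the system's protocol or its homomorphic encryption layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def local_gradient(weights, rows):
    """MAP: one provider's summed logistic-loss gradient over its own rows."""
    grad = [0.0] * len(weights)
    for features, label in rows:
        err = sigmoid(sum(w * x for w, x in zip(weights, features))) - label
        for j, x in enumerate(features):
            grad[j] += err * x
    return grad, len(rows)


def combine(results):
    """REDUCE: sum gradients and row counts across providers, then average."""
    dim = len(results[0][0])
    total = [sum(g[j] for g, _ in results) for j in range(dim)]
    count = sum(n for _, n in results)
    return [t / count for t in total]


def train(providers, dim, lr=0.5, iterations=200):
    """Repeat map and reduce; only gradients, never raw rows, are combined."""
    weights = [0.0] * dim
    for _ in range(iterations):
        grad = combine([local_gradient(weights, rows) for rows in providers])
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Three toy providers; each row is ((bias, x), label).
providers = [
    [((1.0, -2.0), 0), ((1.0, -1.0), 0)],
    [((1.0, 1.0), 1), ((1.0, 2.0), 1)],
    [((1.0, -1.5), 0), ((1.0, 1.5), 1)],
]
weights = train(providers, dim=2)
print([round(w, 3) for w in weights])
```

Because the combined gradient equals the gradient over the pooled data, the distributed run follows the same trajectory as centralized training, which is consistent with the abstract's claim of accuracy on par with centrally trained models.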
|
14
|
De Cock M, Dowsley R, Nascimento ACA, Railsback D, Shen J, Todoki A. High performance logistic regression for privacy-preserving genome analysis. BMC Med Genomics 2021; 14:23. [PMID: 33472626 PMCID: PMC7818577 DOI: 10.1186/s12920-020-00869-9] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 03/02/2020] [Accepted: 12/30/2020] [Indexed: 11/30/2022]
Abstract
BACKGROUND In biomedical applications, valuable data is often split between owners who cannot openly share it because of privacy regulations and concerns. Training machine learning models on the joint data without violating privacy is a major technology challenge that can be addressed by combining techniques from machine learning and cryptography. When machine learning models are trained collaboratively with the cryptographic technique known as secure multi-party computation, the price paid for keeping the owners' data private is an increase in computational cost and runtime. A careful choice of machine learning techniques, together with algorithmic and implementation optimizations, is necessary to enable practical secure machine learning over distributed data sets. Such optimizations can be tailored to the kind of data and machine learning problem at hand.
METHODS Our setup involves secure two-party computation protocols, along with a trusted initializer that distributes correlated randomness to the two computing parties. We use a gradient descent-based algorithm for training a logistic regression-like model with a clipped ReLU activation function, and we break down the algorithm into corresponding cryptographic protocols. Our main contributions are a new protocol for computing the activation function that requires neither secure comparison protocols nor Yao's garbled circuits, and a series of cryptographic engineering optimizations to improve the performance.
RESULTS For our largest gene expression data set, we train a model that requires over 7 billion secure multiplications; the training completes in about 26.90 s in a local area network. The implementation in this work is a further optimized version of the implementation with which we won first place in Track 4 of the iDASH 2019 secure genome analysis competition.
CONCLUSIONS In this paper, we present a secure logistic regression training protocol and its implementation, with a new subprotocol to securely compute the activation function. To the best of our knowledge, we present the fastest existing secure multi-party computation implementation for training logistic regression models on high dimensional genome data distributed across a local area network.
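The abstract above describes two-party computation with a trusted initializer that distributes correlated randomness. One standard use of such randomness is multiplication of secret-shared values with precomputed triples (often called Beaver triples). The sketch below illustrates that general idea in plain Python. The ring size `MOD`, the function names, and the simulation of both parties in one process are assumptions made for illustration; this is not the authors' protocol or implementation.

```python
import secrets

MOD = 2 ** 64  # ring Z_{2^64}; an illustrative choice, not taken from the paper


def share(value):
    """Split value into two additive shares modulo MOD."""
    s0 = secrets.randbelow(MOD)
    return s0, (value - s0) % MOD


def reconstruct(s0, s1):
    """Recombine two additive shares."""
    return (s0 + s1) % MOD


def trusted_initializer():
    """Deal shares of a random triple (u, v, w) with w = u * v mod MOD."""
    u = secrets.randbelow(MOD)
    v = secrets.randbelow(MOD)
    return share(u), share(v), share(u * v % MOD)


def secure_multiply(x_shares, y_shares):
    """Return shares of x * y, given shares of x and y (both parties simulated)."""
    u_sh, v_sh, w_sh = trusted_initializer()
    # Each party masks its shares locally; the masked values d = x - u and
    # e = y - v are then opened. They reveal nothing because u and v are random.
    d = reconstruct(*[(x_shares[i] - u_sh[i]) % MOD for i in range(2)])
    e = reconstruct(*[(y_shares[i] - v_sh[i]) % MOD for i in range(2)])
    # x*y = w + d*v + e*u + d*e; the public term d*e is added by party 0 only.
    z = []
    for i in range(2):
        zi = (w_sh[i] + d * v_sh[i] + e * u_sh[i]) % MOD
        if i == 0:
            zi = (zi + d * e) % MOD
        z.append(zi)
    return tuple(z)


product_shares = secure_multiply(share(6), share(7))
print(reconstruct(*product_shares))
```

Each multiplication consumes one fresh triple, which is why a workload of billions of secure multiplications depends heavily on how cheaply the correlated randomness can be produced and distributed.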
Affiliation(s)
- Martine De Cock
  - School of Engineering and Technology, University of Washington Tacoma, Tacoma, WA 98402 USA
- Rafael Dowsley
  - Faculty of Information Technology, Monash University, Clayton, 3800 Australia
- Davis Railsback
  - School of Engineering and Technology, University of Washington Tacoma, Tacoma, WA 98402 USA
- Jianwei Shen
  - School of Engineering and Technology, University of Washington Tacoma, Tacoma, WA 98402 USA
- Ariel Todoki
  - School of Engineering and Technology, University of Washington Tacoma, Tacoma, WA 98402 USA
15
Lu Y, Zhou T, Tian Y, Zhu S, Li J. Web-Based Privacy-Preserving Multicenter Medical Data Analysis Tools Via Threshold Homomorphic Encryption: Design and Development Study. J Med Internet Res 2020; 22:e22555. [PMID: 33289676 PMCID: PMC7755539 DOI: 10.2196/22555] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/16/2020] [Revised: 10/02/2020] [Accepted: 11/06/2020] [Indexed: 11/22/2022]
Abstract
Background Data sharing in multicenter medical research can improve the generalizability of research, accelerate progress, enhance collaborations among institutions, and lead to new discoveries from data pooled from multiple sources. Despite these benefits, many medical institutions are unwilling to share their data, as sharing may cause sensitive information to be leaked to researchers, other institutions, and unauthorized users. Great progress has been made in recent years in the development of secure machine learning frameworks based on homomorphic encryption; however, nearly all such frameworks use a single secret key and lack a description of how to securely evaluate the trained model, which makes them impractical for multicenter medical applications.
Objective The aim of this study is to provide a privacy-preserving machine learning protocol (eg, for logistic regression) for multiple data providers and researchers. This protocol allows researchers to train models and then evaluate them on medical data from multiple sources while providing privacy protection for both the sensitive data and the learned model.
Methods We adapted a novel threshold homomorphic encryption scheme to guarantee privacy requirements. We devised new relinearization key generation techniques for greater scalability and multiplicative depth, and new model training strategies for simultaneously training multiple models through x-fold cross-validation.
Results Using a client-server architecture, we evaluated the performance of our protocol. The experimental results demonstrated that, with 10-fold cross-validation, our privacy-preserving logistic regression model training and evaluation over 10 attributes in a data set of 49,152 samples took approximately 7 minutes and 20 minutes, respectively.
Conclusions We present the first privacy-preserving multiparty logistic regression model training and evaluation protocol based on threshold homomorphic encryption. Our protocol is practical for real-world use and may promote multicenter medical research to some extent.
Affiliation(s)
- Yao Lu
  - Engineering Research Center of EMR and Intelligent Expert System, Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
- Tianshu Zhou
  - Engineering Research Center of EMR and Intelligent Expert System, Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
- Yu Tian
  - Engineering Research Center of EMR and Intelligent Expert System, Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
- Jingsong Li
  - Engineering Research Center of EMR and Intelligent Expert System, Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China; Zhejiang Lab, Hangzhou, China
16
Kim D, Son Y, Kim D, Kim A, Hong S, Cheon JH. Privacy-preserving approximate GWAS computation based on homomorphic encryption. BMC Med Genomics 2020; 13:77. [PMID: 32693801 PMCID: PMC7372890 DOI: 10.1186/s12920-020-0722-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 11/22/2022]
Abstract
Background One of three tasks in a secure genome analysis competition called iDASH 2018 was to develop a solution for privacy-preserving GWAS computation based on homomorphic encryption. The scenario is that a data holder encrypts a number of individual records, each of which consists of several phenotype and genotype data, and provides the encrypted data to an untrusted server. The server then performs a GWAS algorithm based on homomorphic encryption without the decryption key and outputs the result in an encrypted state, so that no information about the sensitive data leaks to the server.
Methods We develop a privacy-preserving semi-parallel GWAS algorithm by applying the approximate homomorphic encryption scheme HEAAN. The Fisher scoring and semi-parallel GWAS algorithms are modified to be efficiently computed over homomorphically encrypted data using several optimizations: replacing matrix inversion with an adjoint matrix, avoiding the computation of a superfluous matrix of super-large size, and transforming the algorithm into an approximate version.
Results Our modified semi-parallel GWAS algorithm based on homomorphic encryption, which achieves 128-bit security, takes 30–40 minutes for 245 samples containing 10,000–15,000 SNPs. Measured against the true p-values from the original semi-parallel GWAS algorithm, the F1 score of our p-value results is over 0.99.
Conclusions Privacy-preserving semi-parallel GWAS computation can be done efficiently based on homomorphic encryption, with sufficiently high accuracy compared to semi-parallel GWAS computation in an unencrypted state.
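The abstract above mentions substituting matrix inversion with an adjoint matrix. The underlying identity is A · adj(A) = det(A) · I, so the adjugate and the determinant can both be computed with additions and multiplications only, and the division by det(A) can be deferred or handled separately. The plaintext sketch below checks that identity on a small integer matrix. The function names and the example matrix are hypothetical, and the sketch says nothing about how the authors arrange the computation over ciphertexts.

```python
def minor(m, i, j):
    """Return m with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]


def det(m):
    """Determinant by cofactor expansion along the first row (division-free)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det(minor(m, 0, j)) for j in range(len(m)))


def adjugate(m):
    """Adjugate (classical adjoint): the transpose of the cofactor matrix."""
    n = len(m)
    if n == 1:
        return [[1]]
    # Entry (i, j) of the adjugate is the (j, i) cofactor of m.
    return [[(-1) ** (i + j) * det(minor(m, j, i)) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


A = [[2, 1, 0],
     [1, 3, 1],
     [0, 1, 4]]  # hypothetical example matrix
d = det(A)
product = matmul(A, adjugate(A))
print(d, product)  # the product equals d times the identity matrix
```

Cofactor expansion is exponential in the matrix size, so this form is only reasonable for the small matrices that arise when the number of covariates is small.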
Affiliation(s)
- Duhyeong Kim
  - Department of Mathematical Sciences, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea
- Yongha Son
  - Department of Mathematical Sciences, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea
- Dongwoo Kim
  - Department of Mathematical Sciences, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea
- Andrey Kim
  - Department of Mathematical Sciences, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea
- Seungwan Hong
  - Department of Mathematical Sciences, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea
- Jung Hee Cheon
  - Department of Mathematical Sciences, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea
17
Carpov S, Gama N, Georgieva M, Troncoso-Pastoriza JR. Privacy-preserving semi-parallel logistic regression training with fully homomorphic encryption. BMC Med Genomics 2020; 13:88. [PMID: 32693814 PMCID: PMC7372765 DOI: 10.1186/s12920-020-0723-0] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 11/16/2022]
Abstract
Background Privacy-preserving computation on genomic data, and more generally on medical data, is a critical-path technology for innovative, life-saving research to positively and equally impact the global population. It enables medical research algorithms to be securely deployed in the cloud because operations on encrypted genomic databases are conducted without revealing any individual genomes. Methods for secure computation have shown significant performance improvements over the last several years. However, it is still challenging to apply them to large biomedical datasets.
Methods The HE Track of the iDASH 2018 competition focused on solving an important problem in practical machine learning scenarios, where a data analyst who has trained a regression model (either linear or logistic) with a certain set of features attempts to find all features in an encrypted database that will improve the quality of the model. Our solution is based on the hybrid framework Chimera, which allows for switching between different families of fully homomorphic schemes, namely TFHE and HEAAN.
Results Our solution is one of the finalists of Track 2 of the iDASH 2018 competition. Among the submitted solutions, ours is the only bootstrapped approach that can be applied for different sets of parameters without re-encrypting the genomic database, making it practical for real-world applications.
Conclusions This is the first step towards the more general feature selection problem across large encrypted databases.
Affiliation(s)
- Sergiu Carpov
  - CEA, LIST, Point Courier 172, Gif-sur-Yvette cedex, 91191, France; Inpher, Innovation Park A, Lausanne, CH-1015, Switzerland
- Nicolas Gama
  - Inpher, Innovation Park A, Lausanne, CH-1015, Switzerland
- Mariya Georgieva
  - Inpher, Innovation Park A, Lausanne, CH-1015, Switzerland; EPFL, Route Cantonal, Lausanne, CH-1015, Switzerland
18
Wang X, Tang H, Wang S, Jiang X, Wang W, Bu D, Wang L, Jiang Y, Wang C. iDASH secure genome analysis competition 2017. BMC Med Genomics 2018; 11:85. [PMID: 30309344 PMCID: PMC6180354 DOI: 10.1186/s12920-018-0396-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 01/16/2023]
Affiliation(s)
- XiaoFeng Wang
  - School of Informatics, Computing and Engineering, Indiana University, Bloomington, IN, 47408, USA
- Haixu Tang
  - School of Informatics, Computing and Engineering, Indiana University, Bloomington, IN, 47408, USA
- Shuang Wang
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA
- Xiaoqian Jiang
  - School of Biomedical Informatics, The University of Texas Health Science Center, Houston, TX, 77030, USA
- Wenhao Wang
  - School of Informatics, Computing and Engineering, Indiana University, Bloomington, IN, 47408, USA
- Diyue Bu
  - School of Informatics, Computing and Engineering, Indiana University, Bloomington, IN, 47408, USA
- Lei Wang
  - School of Informatics, Computing and Engineering, Indiana University, Bloomington, IN, 47408, USA
- Yicheng Jiang
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA
- Chenghong Wang
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA