1
|
Ai N, Yang Z, Yuan H, Ouyang D, Miao R, Ji Y, Liang Y. A distributed sparse logistic regression with $$L_{1/2}$$ regularization for microarray biomarker discovery in cancer classification. Soft comput 2022. [DOI: 10.1007/s00500-022-07551-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]
|
2
|
Waqas Khan P, Byun YC. Multi-Fault Detection and Classification of Wind Turbines Using Stacking Classifier. SENSORS (BASEL, SWITZERLAND) 2022; 22:6955. [PMID: 36146299 PMCID: PMC9505315 DOI: 10.3390/s22186955] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/21/2022] [Revised: 09/09/2022] [Accepted: 09/13/2022] [Indexed: 06/16/2023]
Abstract
Wind turbines are widely used worldwide to generate clean, renewable energy. The biggest issue with a wind turbine is reducing failures and downtime, which lowers costs associated with operations and maintenance. Wind turbines' consistency and timely maintenance can enhance their performance and dependability. Still, the traditional routine configuration makes detecting faults of wind turbines difficult. Supervisory control and data acquisition (SCADA) produces reliable and affordable quality data for the health condition of wind turbine operations. For wind power to be sufficiently reliable, it is crucial to retrieve useful information from SCADA successfully. This article proposes a new AdaBoost, K-nearest neighbors, and logistic regression-based stacking ensemble (AKL-SE) classifier to classify the faults of the wind turbine condition monitoring system. A stacking ensemble classifier integrates different classification models to enhance the model's accuracy. We have used three classifiers, AdaBoost, K-nearest neighbors, and logistic regression, as base models to make output. The output of these three classifiers is used as input in the logistic regression classifier's meta-model. To improve the data validity, SCADA data are first preprocessed by cleaning and removing any abnormal data. Next, the Pearson correlation coefficient was used to choose the input variables. The Stacking Ensemble classifier was trained using these parameters. The analysis demonstrates that the suggested method successfully identifies faults in wind turbines when applied to local 3 MW wind turbines. The proposed approach shows the potential for effective wind energy use, which could encourage the use of clean energy.
Collapse
|
3
|
Qayyum A, Ijaz A, Usama M, Iqbal W, Qadir J, Elkhatib Y, Al-Fuqaha A. Securing Machine Learning in the Cloud: A Systematic Review of Cloud Machine Learning Security. Front Big Data 2021; 3:587139. [PMID: 33693420 PMCID: PMC7931962 DOI: 10.3389/fdata.2020.587139] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2020] [Accepted: 10/08/2020] [Indexed: 11/13/2022] Open
Abstract
With the advances in machine learning (ML) and deep learning (DL) techniques, and the potency of cloud computing in offering services efficiently and cost-effectively, Machine Learning as a Service (MLaaS) cloud platforms have become popular. In addition, there is increasing adoption of third-party cloud services for outsourcing training of DL models, which requires substantial costly computational resources (e.g., high-performance graphics processing units (GPUs)). Such widespread usage of cloud-hosted ML/DL services opens a wide range of attack surfaces for adversaries to exploit the ML/DL system to achieve malicious goals. In this article, we conduct a systematic evaluation of literature of cloud-hosted ML/DL models along both the important dimensions—attacks and defenses—related to their security. Our systematic review identified a total of 31 related articles out of which 19 focused on attack, six focused on defense, and six focused on both attack and defense. Our evaluation reveals that there is an increasing interest from the research community on the perspective of attacking and defending different attacks on Machine Learning as a Service platforms. In addition, we identify the limitations and pitfalls of the analyzed articles and highlight open research issues that require further investigation.
Collapse
Affiliation(s)
- Adnan Qayyum
- Information Technology University (ITU), Lahore, Pakistan
| | - Aneeqa Ijaz
- AI4Networks Research Center, University of Oklahoma, Norman, OK, United States
| | - Muhammad Usama
- Information Technology University (ITU), Lahore, Pakistan
| | - Waleed Iqbal
- Social Data Science (SDS) Lab, Queen Mary University of London, London, United Kingdom
| | - Junaid Qadir
- Information Technology University (ITU), Lahore, Pakistan
| | - Yehia Elkhatib
- School of Computing and Communications, Lancaster University, Lancaster, United Kingdom
| | | |
Collapse
|
4
|
Scalable Privacy-Preserving Distributed Learning. PROCEEDINGS ON PRIVACY ENHANCING TECHNOLOGIES 2021. [DOI: 10.2478/popets-2021-0030] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
Abstract
Abstract
In this paper, we address the problem of privacy-preserving distributed learning and the evaluation of machine-learning models by analyzing it in the widespread MapReduce abstraction that we extend with privacy constraints. We design spindle (Scalable Privacy-preservINg Distributed LEarning), the first distributed and privacy-preserving system that covers the complete ML workflow by enabling the execution of a cooperative gradient-descent and the evaluation of the obtained model and by preserving data and model confidentiality in a passive-adversary model with up to N −1 colluding parties. spindle uses multiparty homomorphic encryption to execute parallel high-depth computations on encrypted data without significant overhead. We instantiate spindle for the training and evaluation of generalized linear models on distributed datasets and show that it is able to accurately (on par with non-secure centrally-trained models) and efficiently (due to a multi-level parallelization of the computations) train models that require a high number of iterations on large input data with thousands of features, distributed among hundreds of data providers. For instance, it trains a logistic-regression model on a dataset of one million samples with 32 features distributed among 160 data providers in less than three minutes.
Collapse
|
5
|
Lu Y, Zhou T, Tian Y, Zhu S, Li J. Web-Based Privacy-Preserving Multicenter Medical Data Analysis Tools Via Threshold Homomorphic Encryption: Design and Development Study. J Med Internet Res 2020; 22:e22555. [PMID: 33289676 PMCID: PMC7755539 DOI: 10.2196/22555] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2020] [Revised: 10/02/2020] [Accepted: 11/06/2020] [Indexed: 11/22/2022] Open
Abstract
Background Data sharing in multicenter medical research can improve the generalizability of research, accelerate progress, enhance collaborations among institutions, and lead to new discoveries from data pooled from multiple sources. Despite these benefits, many medical institutions are unwilling to share their data, as sharing may cause sensitive information to be leaked to researchers, other institutions, and unauthorized users. Great progress has been made in the development of secure machine learning frameworks based on homomorphic encryption in recent years; however, nearly all such frameworks use a single secret key and lack a description of how to securely evaluate the trained model, which makes them impractical for multicenter medical applications. Objective The aim of this study is to provide a privacy-preserving machine learning protocol for multiple data providers and researchers (eg, logistic regression). This protocol allows researchers to train models and then evaluate them on medical data from multiple sources while providing privacy protection for both the sensitive data and the learned model. Methods We adapted a novel threshold homomorphic encryption scheme to guarantee privacy requirements. We devised new relinearization key generation techniques for greater scalability and multiplicative depth and new model training strategies for simultaneously training multiple models through x-fold cross-validation. Results Using a client-server architecture, we evaluated the performance of our protocol. The experimental results demonstrated that, with 10-fold cross-validation, our privacy-preserving logistic regression model training and evaluation over 10 attributes in a data set of 49,152 samples took approximately 7 minutes and 20 minutes, respectively. Conclusions We present the first privacy-preserving multiparty logistic regression model training and evaluation protocol based on threshold homomorphic encryption. Our protocol is practical for real-world use and may promote multicenter medical research to some extent.
Collapse
Affiliation(s)
- Yao Lu
- Engineering Research Center of EMR and Intelligent Expert System, Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
| | - Tianshu Zhou
- Engineering Research Center of EMR and Intelligent Expert System, Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
| | - Yu Tian
- Engineering Research Center of EMR and Intelligent Expert System, Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
| | | | - Jingsong Li
- Engineering Research Center of EMR and Intelligent Expert System, Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China.,Zhejiang Lab, Hangzhou, China
| |
Collapse
|
6
|
Anxie T, Bing L. Application of deep learning and artificial intelligence in the psychological mechanism of language activity. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2020. [DOI: 10.3233/jifs-179806] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Affiliation(s)
- Tuo Anxie
- Cognitive Linguistic Psychology, College of Medical Humanities, Guizhou Medical University, Guiyang, China
| | - Li Bing
- Psycholinguistics, College of Foreign Languages, Guizhou University, Guiyang, China
| |
Collapse
|
7
|
Gong M, Wang S, Wang L, Liu C, Wang J, Guo Q, Zheng H, Xie K, Wang C, Hui Z. Evaluation of Privacy Risks of Patients' Data in China: Case Study. JMIR Med Inform 2020; 8:e13046. [PMID: 32022691 PMCID: PMC7055805 DOI: 10.2196/13046] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2018] [Revised: 06/07/2019] [Accepted: 09/26/2019] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Patient privacy is a ubiquitous problem around the world. Many existing studies have demonstrated the potential privacy risks associated with sharing of biomedical data. Owing to the increasing need for data sharing and analysis, health care data privacy is drawing more attention. However, to better protect biomedical data privacy, it is essential to assess the privacy risk in the first place. OBJECTIVE In China, there is no clear regulation for health systems to deidentify data. It is also not known whether a mechanism such as the Health Insurance Portability and Accountability Act (HIPAA) safe harbor policy will achieve sufficient protection. This study aimed to conduct a pilot study using patient data from Chinese hospitals to understand and quantify the privacy risks of Chinese patients. METHODS We used g-distinct analysis to evaluate the reidentification risks with regard to the HIPAA safe harbor approach when applied to Chinese patients' data. More specifically, we estimated the risks based on the HIPAA safe harbor and limited dataset policies by assuming an attacker has background knowledge of the patient from the public domain. RESULTS The experiments were conducted on 0.83 million patients (with data field of date of birth, gender, and surrogate ZIP codes generated based on home address) across 33 provincial-level administrative divisions in China. Under the Limited Dataset policy, 19.58% (163,262/833,235) of the population could be uniquely identifiable under the g-distinct metric (ie, 1-distinct). In contrast, the Safe Harbor policy is able to significantly reduce privacy risk, where only 0.072% (601/833,235) of individuals are uniquely identifiable, and the majority of the population is 3000 indistinguishable (ie the population is expected to share common attributes with 3000 or less people). CONCLUSIONS Through the experiments based on real-world patient data, this work illustrates that the results of g-distinct analysis about Chinese patient privacy risk are similar to those from a previous US study, in which data from different organizations/regions might be vulnerable to different reidentification risks under different policies. This work provides reference to Chinese health care entities for estimating patients' privacy risk during data sharing, which laid the foundation of privacy risk study about Chinese patients' data in the future.
Collapse
Affiliation(s)
- Mengchun Gong
- Digital China Health Technologies Corporation Limited, Beijing, China
| | - Shuang Wang
- Shanghai Putuo People's Hospital, Tongji University, Shanghai, China
| | - Lezi Wang
- Digital China Health Technologies Corporation Limited, Beijing, China
| | - Chao Liu
- Digital China Health Technologies Corporation Limited, Beijing, China
| | - Jianyang Wang
- Department of Radiation Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Qiang Guo
- Big Data Center, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Hao Zheng
- Department of Bioinformatics, Hangzhou Nuowei Information Technology, Hangzhou, China
| | - Kang Xie
- The Third Research Institute of Ministry of Public Security, Key Lab of Information Network Security, Ministry of Public Security, Shanghai, China
| | - Chenghong Wang
- Department of Radiation Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Zhouguang Hui
- Department of Radiation Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China.,Department of VIP Medical Services, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| |
Collapse
|
8
|
A multicenter random forest model for effective prognosis prediction in collaborative clinical research network. Artif Intell Med 2020; 103:101814. [PMID: 32143809 DOI: 10.1016/j.artmed.2020.101814] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2019] [Revised: 02/04/2020] [Accepted: 02/04/2020] [Indexed: 12/17/2022]
Abstract
BACKGROUND The accuracy of a prognostic prediction model has become an essential aspect of the quality and reliability of the health-related decisions made by clinicians in modern medicine. Unfortunately, individual institutions often lack sufficient samples, which might not provide sufficient statistical power for models. One mitigation is to expand data collection from a single institution to multiple centers to collectively increase the sample size. However, sharing sensitive biomedical data for research involves complicated issues. Machine learning models such as random forests (RF), though they are commonly used and achieve good performances for prognostic prediction, usually suffer worse performance under multicenter privacy-preserving data mining scenarios compared to a centrally trained version. METHODS AND MATERIALS In this study, a multicenter random forest prognosis prediction model is proposed that enables federated clinical data mining from horizontally partitioned datasets. By using a novel data enhancement approach based on a differentially private generative adversarial network customized to clinical prognosis data, the proposed model is able to provide a multicenter RF model with performances on par with-or even better than-centrally trained RF but without the need to aggregate the raw data. Moreover, our model also incorporates an importance ranking step designed for feature selection without sharing patient-level information. RESULT The proposed model was evaluated on colorectal cancer datasets from the US and China. Two groups of datasets with different levels of heterogeneity within the collaborative research network were selected. First, we compare the performance of the distributed random forest model under different privacy parameters with different percentages of enhancement datasets and validate the effectiveness and plausibility of our approach. Then, we compare the discrimination and calibration ability of the proposed multicenter random forest with a centrally trained random forest model and other tree-based classifiers as well as some commonly used machine learning methods. The results show that the proposed model can provide better prediction performance in terms of discrimination and calibration ability than the centrally trained RF model or the other candidate models while following the privacy-preserving rules in both groups. Additionally, good discrimination and calibration ability are shown on the simplified model based on the feature importance ranking in the proposed approach. CONCLUSION The proposed random forest model exhibits ideal prediction capability using multicenter clinical data and overcomes the performance limitation arising from privacy guarantees. It can also provide feature importance ranking across institutions without pooling the data at a central site. This study offers a practical solution for building a prognosis prediction model in the collaborative clinical research network and solves practical issues in real-world applications of medical artificial intelligence.
Collapse
|
9
|
Li T, Sun J, Zhang X, Wang L, Zhu P, Wang N. Competition prediction and fitness behavior based on GA-SVM algorithm and PCA model. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2019. [DOI: 10.3233/jifs-179202] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Affiliation(s)
- Tuojian Li
- Shandong University, Jinan, Shandong, China
| | - Jinhai Sun
- Shandong University, Jinan, Shandong, China
| | | | - Lei Wang
- Shandong University, Jinan, Shandong, China
| | | | - Ning Wang
- Shandong University, Jinan, Shandong, China
| |
Collapse
|
10
|
A Two-Stage Big Data Analytics Framework with Real World Applications Using Spark Machine Learning and Long Short-Term Memory Network. Symmetry (Basel) 2018. [DOI: 10.3390/sym10100485] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
Every day we experience unprecedented data growth from numerous sources, which contribute to big data in terms of volume, velocity, and variability. These datasets again impose great challenges to analytics framework and computational resources, making the overall analysis difficult for extracting meaningful information in a timely manner. Thus, to harness these kinds of challenges, developing an efficient big data analytics framework is an important research topic. Consequently, to address these challenges by exploiting non-linear relationships from very large and high-dimensional datasets, machine learning (ML) and deep learning (DL) algorithms are being used in analytics frameworks. Apache Spark has been in use as the fastest big data processing arsenal, which helps to solve iterative ML tasks, using distributed ML library called Spark MLlib. Considering real-world research problems, DL architectures such as Long Short-Term Memory (LSTM) is an effective approach to overcoming practical issues such as reduced accuracy, long-term sequence dependency, and vanishing and exploding gradient in conventional deep architectures. In this paper, we propose an efficient analytics framework, which is technically a progressive machine learning technique merged with Spark-based linear models, Multilayer Perceptron (MLP) and LSTM, using a two-stage cascade structure in order to enhance the predictive accuracy. Our proposed architecture enables us to organize big data analytics in a scalable and efficient way. To show the effectiveness of our framework, we applied the cascading structure to two different real-life datasets to solve a multiclass and a binary classification problem, respectively. Experimental results show that our analytical framework outperforms state-of-the-art approaches with a high-level of classification accuracy.
Collapse
|