1
Kondylakis H, Kalokyri V, Sfakianakis S, Marias K, Tsiknakis M, Jimenez-Pastor A, Camacho-Ramos E, Blanquer I, Segrelles JD, López-Huguet S, Barelle C, Kogut-Czarkowska M, Tsakou G, Siopis N, Sakellariou Z, Bizopoulos P, Drossou V, Lalas A, Votis K, Mallol P, Marti-Bonmati L, Alberich LC, Seymour K, Boucher S, Ciarrocchi E, Fromont L, Rambla J, Harms A, Gutierrez A, Starmans MPA, Prior F, Gelpi JL, Lekadir K. Data infrastructures for AI in medical imaging: a report on the experiences of five EU projects. Eur Radiol Exp 2023; 7:20. [PMID: 37150779; PMCID: PMC10164664; DOI: 10.1186/s41747-023-00336-x]
Abstract
Artificial intelligence (AI) is transforming the field of medical imaging and has the potential to bring medicine from the era of 'sick-care' to the era of healthcare and prevention. The development of AI requires access to large, complete, and harmonized real-world datasets, representative of population and disease diversity. However, to date, efforts are fragmented, based on single-institution, size-limited, and annotation-limited datasets. Available public datasets (e.g., The Cancer Imaging Archive, TCIA, USA) are limited in scope, making model generalizability difficult to achieve. In this direction, five European Union projects are currently working on the development of big data infrastructures that will enable European, ethically and General Data Protection Regulation-compliant, quality-controlled, cancer-related medical imaging platforms, in which both large-scale data and AI algorithms will coexist. The vision is to create sustainable AI cloud-based platforms for the development, implementation, verification, and validation of trustworthy, usable, and reliable AI models that address specific unmet needs in cancer care provision. In this paper, we present an overview of these development efforts, highlighting the challenges encountered and the approaches selected, to provide valuable feedback for future attempts in the area.
Key points
• Artificial intelligence models for health imaging require access to large amounts of harmonized imaging data and metadata.
• The main infrastructures adopted either collect anonymized data centrally or enable access to pseudonymized distributed data.
• Developing a common data model for storing all relevant information is a challenge.
• The trust of data providers in data-sharing initiatives is essential.
• An online European Union meta-tool repository is needed to minimize effort duplication across projects in the area.
Affiliation(s)
- Kostas Marias
- FORTH-ICS, N. Plastira 100, Heraklion, Crete, Greece
- Gianna Tsakou
- MAGGIOLI S.P.A., Research and Development Lab, Marousi, Greece
- Nikolaos Siopis
- Centre of Research & Technology - Hellas, Information Technologies Institute, Thermi - Thessaloniki, Greece
- Zisis Sakellariou
- Centre of Research & Technology - Hellas, Information Technologies Institute, Thermi - Thessaloniki, Greece
- Paschalis Bizopoulos
- Centre of Research & Technology - Hellas, Information Technologies Institute, Thermi - Thessaloniki, Greece
- Vicky Drossou
- Centre of Research & Technology - Hellas, Information Technologies Institute, Thermi - Thessaloniki, Greece
- Antonios Lalas
- Centre of Research & Technology - Hellas, Information Technologies Institute, Thermi - Thessaloniki, Greece
- Konstantinos Votis
- Centre of Research & Technology - Hellas, Information Technologies Institute, Thermi - Thessaloniki, Greece
- Pedro Mallol
- La Fe Health Research Institute, Valencia, Spain
- Lauren Fromont
- European Genome-Phenome Archive, Centre for Genomic Regulation, Barcelona, Spain
- Jordi Rambla
- European Genome-Phenome Archive, Centre for Genomic Regulation, Barcelona, Spain
- Fred Prior
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
2
Cabezón Ruiz S, Morilla Romero de la Osa R. [Big Data in health: a new paradigm to regulate, a challenge for social justice]. Rev Esp Salud Publica 2021; 95:e202110150. [PMID: 34617519]
Abstract
In addition to the opportunities posed by the use of Big Data in health, it also generates important challenges in the field of research, especially from the point of view of its management and ethical considerations. The European Union has been promoting different initiatives that allow the exploitation of these data in the context of the knowledge economy. The UNESCO Ethics Committee has identified three ethical principles to take into account regarding the application of Big Data in health: independence, privacy, and justice. The protection of privacy and patient safety is questionable in a context where cybersecurity is far from complete. In addition, an imbalance in the exploitation of these data by the public and private sectors could generate inequalities that would represent a significant problem of social justice. This article follows a qualitative methodology based on the documentary analysis of current legislative texts, especially the recently approved General Data Protection Regulation (GDPR), as well as non-legislative documents from projects and parliamentary communications throughout the last two legislatures, with the aim of analyzing them and evaluating how they conform to the principles outlined by UNESCO, especially the principle of social justice. The most representative national projects that have started to be adopted are also reviewed.
Affiliation(s)
- Rubén Morilla Romero de la Osa
- Hospital Universitario Virgen del Rocío, Sevilla, Spain
- Departamento de Enfermería de la Universidad de Sevilla, Sevilla, Spain
- Instituto de Biomedicina de Sevilla, Sevilla, Spain
- Consejo Superior de Investigaciones Científicas (CSIC) - Universidad de Sevilla, Sevilla, Spain
3
Lee H, Chung YD. Differentially private release of medical microdata: an efficient and practical approach for preserving informative attribute values. BMC Med Inform Decis Mak 2020; 20:155. [PMID: 32641043; PMCID: PMC7346516; DOI: 10.1186/s12911-020-01171-5]
Abstract
Background Various methods based on k-anonymity have been proposed for publishing medical data while preserving privacy. However, the k-anonymity property assumes that adversaries possess fixed background knowledge. Although differential privacy overcomes this limitation, it is specialized for aggregated results, making it difficult to obtain high-quality microdata. To address this issue, we propose a differentially private medical microdata release method featuring high utility. Methods We propose a method of anonymizing medical data under differential privacy. To improve data utility, especially by preserving informative attribute values, the proposed method adopts three data perturbation approaches: (1) generalization, (2) suppression, and (3) insertion. The proposed method produces an anonymized dataset that is nearly optimal with regard to utility while preserving privacy. Results The proposed method achieves lower information loss than existing methods. In a real-world case study, we show that the results of data analyses using the original dataset and those obtained using a dataset anonymized via the proposed method are highly similar. Conclusions We propose a novel differentially private anonymization method that preserves informative values for the release of medical data. Through experiments, we show that the utility of medical data anonymized via the proposed method is significantly better than that of existing methods.
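The differential-privacy guarantee behind methods like this one is usually introduced via the Laplace mechanism for counting queries. Note that the paper's own algorithm perturbs microdata through generalization, suppression, and insertion rather than additive noise; the sketch below only illustrates the underlying mechanism, and the function names are ours, not the authors':

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample from Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count under epsilon-differential privacy. A counting
    query has sensitivity 1, so the noise scale is 1 / epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Smaller epsilon (stronger privacy) gives noisier answers.
noisy = dp_count(120, epsilon=0.5, rng=random.Random(42))
```

Because the noise is calibrated to the query's sensitivity, any single patient's presence or absence changes the output distribution by at most a factor of e^epsilon.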
Affiliation(s)
- Hyukki Lee
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Yon Dohn Chung
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
4
Hauswaldt J, Demmer I, Heinemann S, Himmel W, Hummers E, Pung J, Schlegelmilch F, Drepper J. [The risk of re-identification when analyzing electronic health records: a critical appraisal and possible solutions]. Z Evid Fortbild Qual Gesundhwes 2019; 149:22-31. [PMID: 32165110; DOI: 10.1016/j.zefq.2020.01.002]
Abstract
BACKGROUND AND OBJECTIVES The use of primary care data gathered from electronic health records in local practices could be an important building block for the future of health services research. However, the risks and reservations associated with using these data for research purposes should not be underestimated. We show the data protection and privacy problems that may arise through secondary analysis of routine primary care data and describe the technical solutions that are available to address these concerns as a trust-building measure. METHODS We screened 40 variables that are deemed important for documentation in the electronic health records of primary care physicians and rated the risk of patient re-identification when using these records for research purposes. The criteria used to rate the risk of re-identification were "expert perception" (inferences of a professional observer from the phenotypical characteristics documented in the 40 variables), "researchable additional knowledge" (knowledge of characteristics of a person obtained through publicly available information and social media networks), and "statistical frequency" according to diagnosis and medication statistics. RESULTS Diagnoses and reasons for contacting a general practitioner can contain particularly identifiable characteristics such as "obesity" (ICD-10 E66) and "nicotine dependence" (F17). About half of all ICD codes documented in primary care fall below a critical threshold value in their absolute frequency; this is all the more problematic if diagnoses allow for re-identification due to phenotypical characteristics. Medication information carries little risk of re-identifying a person, although the way a medication is applied could be a source of re-identification, e.g., self-injection of insulin or use of inhalers. Information about times and dates is especially sensitive for the re-identification of a person. Sex and age of a patient generally pose no problems, except in the case of very young or very old individuals when these age groups are seldom represented in the practice. DISCUSSION Routine health data are, in principle, sensitive data. Knowledge of the variables in primary care data gathered from electronic health records in local practices, and the evaluation of these data, allow us to more accurately estimate the risk of re-identification for the persons concerned. In particular, chronic diagnoses and/or free-text diagnoses, calendar dates of patient contacts, and therapies carry a high risk of re-identification. Technical measures such as removing data, masking values, and coding should make re-identification considerably more difficult. There will always be a residual risk of re-identification, which should be openly discussed to counteract concerns about a lack of data protection or a sweeping critique of digitization in healthcare.
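The critical-frequency idea in the abstract above can be sketched in a few lines: any code whose absolute count in a dataset falls below a chosen threshold is flagged as a re-identification risk. The field name, example codes, and threshold here are illustrative, not taken from the study:

```python
from collections import Counter

def flag_rare_codes(records, field="icd10", threshold=5):
    """Return the set of codes occurring fewer than `threshold` times,
    which therefore carry an elevated re-identification risk."""
    counts = Counter(r[field] for r in records)
    return {code for code, n in counts.items() if n < threshold}

records = (
    [{"icd10": "E66"}] * 40      # obesity: frequent, low risk
    + [{"icd10": "F17"}] * 12    # nicotine dependence: frequent
    + [{"icd10": "E75.2"}] * 2   # rare code: below threshold, flagged
)
rare = flag_rare_codes(records)  # → {"E75.2"}
```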
5
Eicher J, Bild R, Spengler H, Kuhn KA, Prasser F. A comprehensive tool for creating and evaluating privacy-preserving biomedical prediction models. BMC Med Inform Decis Mak 2020; 20:29. [PMID: 32046701; PMCID: PMC7014648; DOI: 10.1186/s12911-020-1041-3]
Abstract
Background Modern data-driven medical research promises to provide new insights into the development and course of disease and to enable novel methods of clinical decision support. To realize this, machine learning models can be trained to make predictions from clinical, paraclinical and biomolecular data. In this process, privacy protection and regulatory requirements need careful consideration, as the resulting models may leak sensitive personal information. To counter this threat, a wide range of methods for integrating machine learning with formal methods of privacy protection have been proposed. However, there is a significant lack of practical tools to create and evaluate such privacy-preserving models. In this software article, we report on our ongoing efforts to bridge this gap. Results We have extended the well-known ARX anonymization tool for biomedical data with machine learning techniques to support the creation of privacy-preserving prediction models. Our methods are particularly well suited for applications in biomedicine, as they preserve the truthfulness of data (e.g., no noise is added) and they are intuitive and relatively easy to explain to non-experts. Moreover, our implementation is highly versatile, as it supports binomial and multinomial target variables, different types of prediction models and a wide range of privacy protection techniques. All methods have been integrated into a sound framework that supports the creation, evaluation and refinement of models through intuitive graphical user interfaces. To demonstrate the broad applicability of our solution, we present three case studies in which we created and evaluated different types of privacy-preserving prediction models for breast cancer diagnosis, diagnosis of acute inflammation of the urinary system and prediction of the contraceptive method used by women.
In this process, we also used a wide range of different privacy models (k-anonymity, differential privacy and a game-theoretic approach) as well as different data transformation techniques. Conclusions With the tool presented in this article, accurate prediction models can be created that preserve the privacy of individuals represented in the training set in a variety of threat scenarios. Our implementation is available as open source software.
Affiliation(s)
- Johanna Eicher
- School of Medicine, Technical University of Munich, Ismaninger Str. 22, Munich, 81675, Germany.
- Raffael Bild
- School of Medicine, Technical University of Munich, Ismaninger Str. 22, Munich, 81675, Germany
- Helmut Spengler
- School of Medicine, Technical University of Munich, Ismaninger Str. 22, Munich, 81675, Germany
- Klaus A Kuhn
- School of Medicine, Technical University of Munich, Ismaninger Str. 22, Munich, 81675, Germany
- Fabian Prasser
- Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Straße 2, Berlin, 10178, Germany
- Charité - Universitätsmedizin Berlin, Charitéplatz 1, Berlin, 10117, Germany
6
Abstract
The irreproducibility of research is a major concern in academia, affecting all study designs regardless of scientific field. Without testing reproducibility and replicability, it is almost impossible to repeat a piece of research and obtain the same or similar results. In addition, irreproducibility limits the translation of research findings into practice, where the same results are expected. To find solutions, the Interacademy Partnership for Health gathered academics from established networks of science, medicine, and engineering to introduce seven strategies that can enhance reproducibility: pre-registration, open methods, open data, collaboration, automation, reporting guidelines, and post-publication reviews. The current editorial discusses the generalisability and practicality of these strategies for systematic reviews and argues that systematic reviews have an even greater potential than other research designs to lead the movement toward reproducible research. I also discuss the potential of reproducibility, in turn, to upgrade the systematic review from review to research. Finally, I point to successful and ongoing practices from collaborative efforts around the world to encourage systematic reviewers, journal editors and publishers, organizations linked to evidence synthesis, and funders and policy makers to facilitate this movement and to gain public trust in research.
Affiliation(s)
- Farhad Shokraneh
- Division of Psychiatry and Applied Psychology, Institute of Mental Health, School of Medicine, University of Nottingham, Nottingham NG7 2TU, United Kingdom
7
Abstract
BACKGROUND Publishing raw electronic health records (EHRs) may be considered a breach of the privacy of individuals because they usually contain sensitive information. A common practice in privacy-preserving data publishing is to anonymize the data before publishing so as to satisfy privacy models such as k-anonymity. Among the various anonymization techniques, generalization is the most commonly used in medical/health data processing. Generalization inevitably causes information loss, and various methods have been proposed to reduce it. However, existing generalization-based data anonymization methods cannot avoid excessive information loss and thus fail to preserve data utility. METHODS We propose a utility-preserving anonymization method for privacy-preserving data publishing (PPDP). To preserve data utility, the proposed method comprises three parts: (1) a utility-preserving model, (2) counterfeit record insertion, and (3) a catalog of the counterfeit records. We also propose an anonymization algorithm using the proposed method, which applies a full-domain generalization algorithm. We evaluate our method against an existing method on two aspects: information loss, measured through various quality metrics, and the error rate of analysis results. RESULTS For all types of quality metrics, our proposed method shows lower information loss than the existing method. In an analysis of real-world EHRs, the results show only a small error between the data anonymized through the proposed method and the original data. CONCLUSIONS We propose a new utility-preserving anonymization method and an anonymization algorithm using it. Through experiments on various datasets, we show that the utility of EHRs anonymized by the proposed method is significantly better than that of those anonymized by previous approaches.
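The k-anonymity property that generalization-based methods like the one above target can be sketched as follows: after generalization, every combination of quasi-identifier values must be shared by at least k records. The generalization hierarchy (decade age bands, truncated ZIP codes) and the data are illustrative, not the paper's:

```python
from collections import Counter

QUASI_IDENTIFIERS = ("age_group", "zip_prefix")

def is_k_anonymous(records, k):
    """True if every quasi-identifier combination appears >= k times."""
    groups = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in records)
    return all(n >= k for n in groups.values())

def generalize(record):
    """Full-domain generalization: coarsen age to a decade band and
    truncate the ZIP code to its first three digits."""
    decade = record["age"] // 10 * 10
    return {
        "age_group": f"{decade}-{decade + 9}",
        "zip_prefix": record["zip"][:3],
        "diagnosis": record["diagnosis"],  # sensitive value, kept as-is
    }

raw = [
    {"age": 31, "zip": "02841", "diagnosis": "E66"},
    {"age": 34, "zip": "02833", "diagnosis": "F17"},
    {"age": 38, "zip": "02812", "diagnosis": "E66"},
]
anon = [generalize(r) for r in raw]  # all three fall into one group
```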
Affiliation(s)
- Hyukki Lee
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Republic of Korea
- Soohyung Kim
- Department of IT Convergence, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Republic of Korea
- Jong Wook Kim
- Department of Media Software, Seoul, 20-Gil, Hongji-dong, Seongbuk-gu, 03016 Republic of Korea
- Yon Dohn Chung
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Republic of Korea
8
Eicher J, Kuhn KA, Prasser F. An Experimental Comparison of Quality Models for Health Data De-Identification. Stud Health Technol Inform 2017; 245:704-708. [PMID: 29295189]
Abstract
When individual-level health data are shared in biomedical research, the privacy of patients must be protected. This is typically achieved by data de-identification methods, which transform data in such a way that formal privacy requirements are met. In the process, it is important to minimize the loss of information to maintain data quality. Although several models have been proposed for measuring this aspect, it remains unclear which model is best suited for which application. We have therefore performed an extensive experimental comparison. We first implemented several common quality models in the ARX de-identification tool for biomedical data. We then used each model to de-identify a patient discharge dataset covering almost 4 million cases, and the outputs were analyzed to measure the impact of different quality models on real-world applications. Our results show that different models are best suited for specific applications, but that one model (Non-Uniform Entropy) is particularly well suited for general-purpose use.
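Non-Uniform Entropy, the quality model singled out above, charges each generalized cell with the information (in bits) needed to recover the original value within its generalized group. This is a simplified single-attribute sketch of one common formulation, not ARX's implementation:

```python
import math
from collections import Counter

def non_uniform_entropy(original, generalized):
    """Information loss in bits: for each cell, -log2 of the probability
    of the original value given its generalized group."""
    assert len(original) == len(generalized)
    pair_counts = Counter(zip(original, generalized))
    gen_counts = Counter(generalized)
    loss = 0.0
    for o, g in zip(original, generalized):
        loss += -math.log2(pair_counts[(o, g)] / gen_counts[g])
    return loss

# Three distinct ages collapsed into one band lose log2(3) bits each;
# the lone 60-69 value is fully determined by its band and loses nothing.
ages = [31, 34, 38, 62]
bands = ["30-39", "30-39", "30-39", "60-69"]
```

A coarser generalization hierarchy makes the groups larger and the per-cell probabilities smaller, so the measured loss grows; unmodified data scores zero.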
Affiliation(s)
- Johanna Eicher
- Institute of Medical Statistics and Epidemiology, University Hospital rechts der Isar, Technical University of Munich, Germany
- Klaus A Kuhn
- Institute of Medical Statistics and Epidemiology, University Hospital rechts der Isar, Technical University of Munich, Germany
- Fabian Prasser
- Institute of Medical Statistics and Epidemiology, University Hospital rechts der Isar, Technical University of Munich, Germany
9
Abstract
Preserving privacy and utility during data publishing and data mining is essential for individuals, data providers, and researchers. However, studies in this area typically assume that each individual has only one record in a dataset, which is unrealistic in many applications. Having multiple records per individual leads to new privacy leakages; we call such a dataset a 1:M dataset. In this paper, we propose a novel privacy model called (k, l)-diversity that addresses disclosure risks in 1:M data publishing. Based on this model, we develop an efficient algorithm named 1:M-Generalization to preserve privacy and data utility, and compare it with alternative approaches. Extensive experiments on real-world data show that our approach outperforms the state-of-the-art technique in terms of data utility and computational cost.
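The exact definition of (k, l)-diversity is given in the paper; as an illustration only, one plausible reading for 1:M data is that every quasi-identifier group must cover at least k distinct individuals and contain at least l distinct sensitive values. All names and data below are hypothetical:

```python
from collections import defaultdict

def satisfies_kl_diversity(records, k, l):
    """Illustrative check for 1:M data: group records by their
    quasi-identifier tuple; each group must contain >= k distinct
    individuals and >= l distinct sensitive values."""
    groups = defaultdict(list)
    for r in records:
        groups[r["qi"]].append(r)
    for rows in groups.values():
        if len({r["pid"] for r in rows}) < k:
            return False
        if len({r["sensitive"] for r in rows}) < l:
            return False
    return True

data = [
    {"pid": 1, "qi": ("30-39", "028"), "sensitive": "E66"},
    {"pid": 1, "qi": ("30-39", "028"), "sensitive": "F17"},  # same person
    {"pid": 2, "qi": ("30-39", "028"), "sensitive": "J45"},
]
```

The point the paper makes is visible here: a plain group-size count would see three records, but only two individuals are actually hidden in the group.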
Affiliation(s)
- Ming Yang
- Southeast University, Nanjing, China
- Weiwei Ni
- Southeast University, Nanjing, China
- Xiao-Bai Li
- University of Massachusetts Lowell, Massachusetts, USA
10
Abstract
Background To facilitate long-term safety surveillance of marketed drugs, many spontaneous reporting systems (SRSs) for ADR events have been established worldwide. Since the data collected by SRSs contain sensitive personal health information that should be protected to prevent the identification of individuals, this raises the issue of privacy-preserving data publishing (PPDP), that is, how to sanitize (anonymize) raw data before publishing. Although much work has been done on PPDP, very few studies have focused on protecting the privacy of SRS data, and none of the existing anonymization methods is well suited to SRS datasets, because they contain characteristics such as rare events, multiple records per individual, and multi-valued sensitive attributes. Methods We propose a new privacy model called MS(k, θ*)-bounding for protecting published spontaneous ADE reporting data from privacy attacks. Our model has the flexibility of varying privacy thresholds, i.e., θ*, for different sensitive values and takes the characteristics of SRS data into consideration. We also propose an anonymization algorithm for sanitizing the raw data to meet the requirements specified through the proposed model. Our algorithm adopts a greedy clustering strategy to group the records into clusters, conforming to an innovative anonymization metric that aims to minimize the privacy risk while maintaining the data utility for ADR detection. An empirical study was conducted using the FAERS dataset from 2004Q1 to 2011Q4. We compared our model with four prevailing methods, including k-anonymity, (X, Y)-anonymity, multi-sensitive l-diversity, and (α, k)-anonymity, evaluated via two measures, Danger Ratio (DR) and Information Loss (IL), and considered three different scenarios of threshold setting for θ*: uniform, level-wise, and frequency-based. We also conducted experiments to inspect the impact of anonymized data on the strengths of discovered ADR signals.
Results With all three threshold settings for sensitive values, our method successfully prevents the disclosure of sensitive values (nearly all observed DRs are zero) without sacrificing too much data utility. With non-uniform threshold settings, level-wise or frequency-based, our MS(k, θ*)-bounding exhibits the best data utility and the least privacy risk among all the models. The experiments conducted on selected ADR signals from MedWatch show only very small differences in signal strength (PRR or ROR). These results show that our method can effectively prevent the disclosure of patients' sensitive information without sacrificing data utility for ADR signal detection. Conclusions We propose a new privacy model for protecting SRS data, which possess characteristics overlooked by contemporary models, and an anonymization algorithm to sanitize SRS data in accordance with the proposed model. Empirical evaluation on a real SRS dataset, FAERS, shows that our method can effectively solve the privacy problem in SRS data without influencing ADR signal strength.
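PRR and ROR, the signal-strength measures mentioned above, are standard disproportionality statistics computed from a 2×2 table of report counts; the counts in the example are made up:

```python
def prr(a, b, c, d):
    """Proportional reporting ratio from a 2x2 report table:
    a = target drug & target event, b = target drug & other events,
    c = other drugs & target event, d = other drugs & other events."""
    return (a / (a + b)) / (c / (c + d))

def ror(a, b, c, d):
    """Reporting odds ratio from the same 2x2 table."""
    return (a * d) / (b * c)

# Event reported in 20 of 100 reports for the drug vs 10 of 900 otherwise.
signal_prr = prr(20, 80, 10, 890)  # 18x disproportionate reporting
signal_ror = ror(20, 80, 10, 890)
```

An anonymization method "preserves signal strength" exactly when these ratios computed on the sanitized table stay close to the ratios on the raw table.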