1
Kovačević A, Bašaragin B, Milošević N, Nenadić G. De-identification of clinical free text using natural language processing: A systematic review of current approaches. Artif Intell Med 2024; 151:102845. [PMID: 38555848 DOI: 10.1016/j.artmed.2024.102845] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Revised: 03/13/2024] [Accepted: 03/18/2024] [Indexed: 04/02/2024]
Abstract
BACKGROUND Electronic health records (EHRs) are a valuable resource for data-driven medical research. However, the presence of protected health information (PHI) makes EHRs unsuitable to be shared for research purposes. De-identification, i.e., the process of removing PHI, is a critical step in making EHR data accessible. Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process.
OBJECTIVES Our study aims to provide systematic evidence on how the de-identification of clinical free text written in English has evolved in the last thirteen years, and to report on the performances and limitations of the current state-of-the-art systems for the English language. In addition, we aim to identify challenges and potential research opportunities in this field.
METHODS A systematic search in PubMed, Web of Science, and the DBLP was conducted for studies published between January 2010 and February 2023. Titles and abstracts were examined to identify the relevant studies. Selected studies were then analysed in-depth, and information was collected on de-identification methodologies, data sources, and measured performance.
RESULTS A total of 2125 publications were identified for the title and abstract screening. 69 studies were found to be relevant. Machine learning (37 studies) and hybrid (26 studies) approaches are predominant, while six studies relied only on rules. The majority of the approaches were trained and evaluated on public corpora. The 2014 i2b2/UTHealth corpus is the most frequently used (36 studies), followed by the 2006 i2b2 (18 studies) and 2016 CEGS N-GRID (10 studies) corpora.
CONCLUSION Earlier de-identification approaches aimed at English were mainly rule and machine learning hybrids with extensive feature engineering and post-processing, while more recent performance improvements are due to feature-inferring recurrent neural networks. Current leading performance is achieved using attention-based neural models. Recent studies report state-of-the-art F1-scores (over 98%) when evaluated in the manner usually adopted by the clinical natural language processing community. However, their performance needs to be more thoroughly assessed with different measures to judge their reliability to safely de-identify data in a real-world setting. Without additional manually labeled training data, state-of-the-art systems fail to generalise well across a wide range of clinical sub-domains.
Affiliation(s)
- Aleksandar Kovačević
- The University of Novi Sad, Faculty of Technical Sciences, Trg Dositeja Obradovića 6, 21002 Novi Sad, Serbia
- Bojana Bašaragin
- The Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, 21000 Novi Sad, Serbia
- Nikola Milošević
- The Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, 21000 Novi Sad, Serbia; Bayer A.G., Research and Development, Mullerstrasse 173, Berlin 13342, Germany
- Goran Nenadić
- The University of Manchester, Department of Computer Science, Manchester, United Kingdom
2
Chen F, Bokhari SMA, Cato K, Gürsoy G, Rossetti S. Examining the Generalizability of Pretrained De-identification Transformer Models on Narrative Nursing Notes. Appl Clin Inform 2024; 15:357-367. [PMID: 38447965 PMCID: PMC11078567 DOI: 10.1055/a-2282-4340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Accepted: 02/15/2024] [Indexed: 03/08/2024]
Abstract
BACKGROUND Narrative nursing notes are a valuable resource in informatics research with unique predictive signals about patient care. The open sharing of these data, however, is appropriately constrained by rigorous regulations set by the Health Insurance Portability and Accountability Act (HIPAA) for the protection of privacy. Several models have been developed and evaluated on the open-source i2b2 dataset. A focus on the generalizability of these models with respect to nursing notes remains understudied.
OBJECTIVES The study aims to understand the generalizability of pretrained transformer models and investigate the variability of personal protected health information (PHI) distribution patterns between discharge summaries and nursing notes with a goal to inform the future design for model evaluation schema.
METHODS Two pretrained transformer models (RoBERTa, ClinicalBERT) fine-tuned on i2b2 2014 discharge summaries were evaluated on our inpatient nursing notes data and compared with the baseline performance. Statistical testing was deployed to assess differences in PHI distribution across discharge summaries and nursing notes.
RESULTS RoBERTa achieved the optimal performance when tested on an external source of data, with an F1 score of 0.887 across PHI categories and 0.932 in the PHI binary task. Overall, discharge summaries contained a higher number of PHI instances and categories of PHI compared with inpatient nursing notes.
CONCLUSION The study investigated the applicability of two pretrained transformers on inpatient nursing notes and examined the distinctions between nursing notes and discharge summaries concerning the utilization of personal PHI. Discharge summaries presented a greater quantity of PHI instances and types when compared with narrative nursing notes, but narrative nursing notes exhibited more diversity in the types of PHI present, with some pertaining to the patient's personal life. The insights obtained from the research help improve the design and selection of algorithms, as well as contribute to the development of suitable performance thresholds for PHI.
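The two F1 figures reported in the results above (across PHI categories and for the binary PHI task) differ in how predicted and gold labels are matched. The following is a minimal sketch of that distinction, assuming token-level labels with "O" for non-PHI tokens and micro-averaging; the function name `token_prf`, the label names, and the example data are illustrative assumptions, not taken from the study, whose exact scoring scheme is not described in the abstract.

```python
from typing import List, Tuple

def token_prf(gold: List[str], pred: List[str], binary: bool = False) -> Tuple[float, float, float]:
    """Micro-averaged token-level precision, recall and F1; "O" marks non-PHI tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if binary:
        # Collapse every PHI category into a single "PHI" label.
        gold = ["O" if g == "O" else "PHI" for g in gold]
        pred = ["O" if p == "O" else "PHI" for p in pred]
    # A true positive needs a PHI token whose predicted label matches exactly.
    tp = sum(1 for g, p in zip(gold, pred) if g != "O" and g == p)
    fp = sum(1 for g, p in zip(gold, pred) if p != "O" and g != p)
    fn = sum(1 for g, p in zip(gold, pred) if g != "O" and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels for seven tokens (not data from the study).
gold = ["O", "NAME", "NAME", "O", "DATE", "O", "LOCATION"]
pred = ["O", "NAME", "DATE", "O", "DATE", "O", "O"]

print(token_prf(gold, pred))               # category-level scores
print(token_prf(gold, pred, binary=True))  # binary PHI vs. non-PHI scores
```

In this toy case the NAME token mislabelled as DATE counts as an error at the category level but as a correct PHI detection in the binary task, so the binary F1 (6/7) is higher than the category-level F1 (4/7).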
Affiliation(s)
- Fangyi Chen
- Department of Biomedical Informatics, Columbia University, New York, New York, United States
- Kenrick Cato
- School of Nursing, University of Pennsylvania, Philadelphia, Pennsylvania, United States
- School of Nursing, Columbia University, New York, New York, United States
- Gamze Gürsoy
- Department of Biomedical Informatics, Columbia University, New York, New York, United States
- Sarah Rossetti
- Department of Biomedical Informatics, Columbia University, New York, New York, United States
- School of Nursing, Columbia University, New York, New York, United States
3
Spithoff S, Grundy Q. Commercializing Personal Health Information: A Critical Qualitative Content Analysis of Documents Describing Proprietary Primary Care Databases in Canada. Int J Health Policy Manag 2023; 12:6938. [PMID: 37579404 PMCID: PMC10461871 DOI: 10.34172/ijhpm.2023.6938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2021] [Accepted: 04/03/2023] [Indexed: 08/16/2023]
Abstract
BACKGROUND Commercial data brokers have amassed large collections of primary care patient data in proprietary databases. Our study objective was to critically analyze how entities involved in the collection and use of these records construct the value of these proprietary databases. We also discuss the implications of the collection and use of these databases.
METHODS We conducted a critical qualitative content analysis using publicly available documents describing the creation and use of proprietary databases containing Canadian primary care patient data. We identified relevant commercial data brokers, as well as entities involved in collecting data or in using data from these databases. We sampled documents associated with these entities that described any aspect of the collection, processing, and use of the proprietary databases. We extracted data from each document using a structured data tool. We conducted an interpretive thematic content analysis by inductively coding documents and the extracted data.
RESULTS We analyzed 25 documents produced between 2013 and 2021. These documents were largely directed at the pharmaceutical industry, as well as shareholders, academics, and governments. The documents constructed the value of the proprietary databases by describing extensive, intimate, detailed patient-level data holdings. They provided examples of how the databases could be used by pharmaceutical companies for regulatory approval, marketing and understanding physician behaviour. The documents constructed the value of these data more broadly by claiming to improve health for patients, while also addressing risks to privacy. Some documents referred to the trade-offs between patient privacy and data utility, which suggests these considerations may be in tension.
CONCLUSION Documents in our analysis positioned the proprietary databases as socially legitimate and valuable, particularly to pharmaceutical companies. The databases, however, may pose risks to patient privacy and contribute to problematic drug promotion. Solutions include expanding public data repositories with appropriate governance and external regulatory oversight.
Affiliation(s)
- Sheryl Spithoff
- Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada
- Department of Family and Community Medicine, Women’s College Hospital, Toronto, ON, Canada
- Women’s College Research Institute, Women’s College Hospital, Toronto, ON, Canada
- Quinn Grundy
- Lawrence S. Bloomberg Faculty of Nursing, University of Toronto, Toronto, ON, Canada
4
Kotevski DP, Smee RI, Field M, Nemes YN, Broadley K, Vajdic CM. Evaluation of an automated Presidio anonymisation model for unstructured radiation oncology electronic medical records in an Australian setting. Int J Med Inform 2022; 168:104880. [DOI: 10.1016/j.ijmedinf.2022.104880] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Revised: 09/13/2022] [Accepted: 09/27/2022] [Indexed: 11/08/2022]
5
Jordan S, Fontaine C, Hendricks-Sturrup R. Selecting Privacy-Enhancing Technologies for Managing Health Data Use. Front Public Health 2022; 10:814163. [PMID: 35372185 PMCID: PMC8967420 DOI: 10.3389/fpubh.2022.814163] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 11/12/2021] [Accepted: 02/14/2022] [Indexed: 11/29/2022]
Abstract
Privacy protection for health data is more than simply stripping datasets of specific identifiers. Privacy protection increasingly means the application of privacy-enhancing technologies (PETs), also known as privacy engineering. Demands for the application of PETs are not yet met with ease of use or even understanding. This paper provides a scope of the current peer-reviewed evidence regarding the practical use or adoption of various PETs for managing health data privacy. We describe the state of knowledge of PETs for the use and exchange of health data specifically and build a practical perspective on the steps needed to improve the standardization of the application of PETs for diverse uses of health data.
Affiliation(s)
- Sara Jordan
- Future of Privacy Forum, Washington, DC, United States
- Clara Fontaine
- Centre for Quantum Technologies at the National University of Singapore, Singapore, Singapore
6
Shin SY, Kim HS. Data Pseudonymization in a Range That Does Not Affect Data Quality: Correlation with the Degree of Participation of Clinicians. J Korean Med Sci 2021; 36:e299. [PMID: 34783216 PMCID: PMC8593412 DOI: 10.3346/jkms.2021.36.e299] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/04/2021] [Accepted: 10/18/2021] [Indexed: 12/28/2022]
Abstract
Personal medical information is an essential resource for research; however, there are laws that regulate its use, and it typically has to be pseudonymized or anonymized. When data are anonymized, the quantity and quality of extractable information decrease significantly. From the perspective of a clinical researcher, a method of achieving pseudonymized data without degrading data quality while also preventing data loss is proposed herein. As the level of pseudonymization varies according to the research purpose, the pseudonymization method applied should be carefully chosen. Therefore, the active participation of clinicians is crucial to transform the data according to the research purpose. This can contribute to data security by simply transforming the data through secondary data processing. Case studies demonstrated that, compared with the initial baseline data, there was a clinically significant difference in the number of datapoints added with the participation of a clinician (from 267,979 to 280,127 points, P < 0.001). Thus, depending on the degree of clinician participation, data anonymization may not affect data quality and quantity, and proper data quality management along with data security are emphasized. Although the pseudonymization level and clinical use of data have a trade-off relationship, it is possible to create pseudonymized data while maintaining the data quality required for a given research purpose. Therefore, rather than relying solely on security guidelines, the active participation of clinicians is important.
Affiliation(s)
- Soo-Yong Shin
- Department of Digital Health, Samsung Advanced Institute for Health Sciences & Technology (SAIHST), Sungkyunkwan University, Seoul, Korea
- Center for Research Resource Standardization, Samsung Medical Center, Seoul, Korea
- Hun-Sung Kim
- Department of Medical Informatics, College of Medicine, The Catholic University of Korea, Seoul, Korea
- Division of Endocrinology and Metabolism, Department of Internal Medicine, Seoul St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Korea
7
Jonnagaddala J, Chen A, Batongbacal S, Nekkantti C. The OpenDeID corpus for patient de-identification. Sci Rep 2021; 11:19973. [PMID: 34620985 PMCID: PMC8497517 DOI: 10.1038/s41598-021-99554-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/24/2021] [Accepted: 09/28/2021] [Indexed: 11/18/2022]
Abstract
For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings. The quality of the annotations was evaluated for each setting. Specifically, we employed serial annotations, parallel annotations, and pre-annotations. Our results suggest that the pre-annotations approach is not reliable in terms of quality when compared to the serial annotations but can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter-annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers.
Affiliation(s)
- Aipeng Chen
- School of Computer Science and Engineering, UNSW Sydney, Sydney, Australia
- Sean Batongbacal
- School of Computer Science and Engineering, UNSW Sydney, Sydney, Australia
8
Liao S, Kiros J, Chen J, Zhang Z, Chen T. Improving domain adaptation in de-identification of electronic health records through self-training. J Am Med Inform Assoc 2021; 28:2093-2100. [PMID: 34363664 PMCID: PMC8449604 DOI: 10.1093/jamia/ocab128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2021] [Revised: 07/01/2021] [Accepted: 07/04/2021] [Indexed: 11/13/2022]
Abstract
OBJECTIVE De-identification is a fundamental task in electronic health records to remove protected health information entities. Deep learning models have proven to be promising tools to automate de-identification processes. However, when the target domain (where the model is applied) is different from the source domain (where the model is trained), the model often suffers a significant performance drop, commonly referred to as domain adaptation issue. In de-identification, domain adaptation issues can make the model vulnerable for deployment. In this work, we aim to close the domain gap by leveraging unlabeled data from the target domain.
MATERIALS AND METHODS We introduce a self-training framework to address the domain adaptation issue by leveraging unlabeled data from the target domain. We validate the effectiveness on 4 standard de-identification datasets. In each experiment, we use a pair of datasets: labeled data from the source domain and unlabeled data from the target domain. We compare the proposed self-training framework with supervised learning that directly deploys the model trained on the source domain.
RESULTS In summary, our proposed framework improves the F1-score by 5.38 (on average) when compared with direct deployment. For example, using i2b2-2014 as the training dataset and i2b2-2006 as the test, the proposed framework increases the F1-score from 76.61 to 85.41 (+8.8). The method also increases the F1-score by 10.86 for mimic-radiology and mimic-discharge.
CONCLUSION Our work demonstrates an effective self-training framework to boost the domain adaptation performance for the de-identification task for electronic health records.
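The general idea of self-training with labeled source data and unlabeled target data can be sketched as a pseudo-labelling loop. This is a minimal illustration and not the authors' framework: the `ToyTagger` model, the feature strings, the confidence threshold, and the number of rounds are all hypothetical choices made here so that the example runs with the standard library alone.

```python
from collections import Counter, defaultdict
from typing import List, Sequence, Tuple

Features = Tuple[str, ...]  # hypothetical per-token features

class ToyTagger:
    """Hypothetical stand-in for a de-identification model: per-feature label counts."""

    def fit(self, X: Sequence[Features], y: Sequence[str]) -> "ToyTagger":
        self.counts = defaultdict(Counter)
        for feats, label in zip(X, y):
            for f in feats:
                self.counts[f][label] += 1
        return self

    def predict(self, feats: Features) -> Tuple[str, float]:
        """Return (label, confidence); unseen features fall back to ("O", 0.0)."""
        votes = Counter()
        for f in feats:
            votes.update(self.counts.get(f, {}))
        if not votes:
            return "O", 0.0
        label, n = votes.most_common(1)[0]
        return label, n / sum(votes.values())

def self_train(X_src: List[Features], y_src: List[str], X_tgt: List[Features],
               threshold: float = 0.9, rounds: int = 3) -> ToyTagger:
    """Train on source labels, then repeatedly add confident target pseudo-labels."""
    model = ToyTagger().fit(X_src, y_src)
    for _ in range(rounds):
        pseudo = [(x, *model.predict(x)) for x in X_tgt]
        confident = [(x, label) for x, label, conf in pseudo if conf >= threshold]
        if not confident:
            break
        # Retrain from scratch on source data plus the current pseudo-labels.
        X_aug = list(X_src) + [x for x, _ in confident]
        y_aug = list(y_src) + [label for _, label in confident]
        model = ToyTagger().fit(X_aug, y_aug)
    return model

# Hypothetical data: the source domain marks names with a preceding "dr";
# the target domain also uses a trailing "rn" cue never seen in the source.
X_src = [("prev=dr", "cap"), ("prev=dr", "cap"), ("prev=the", "lower"), ("prev=the", "lower")]
y_src = ["NAME", "NAME", "O", "O"]
X_tgt = [("prev=dr", "cap", "next=rn"), ("prev=the", "lower"), ("next=rn",)]

baseline = ToyTagger().fit(X_src, y_src)
adapted = self_train(X_src, y_src, X_tgt)
print(baseline.predict(("next=rn",)))  # source-only model has no evidence for this cue
print(adapted.predict(("next=rn",)))   # self-trained model has picked up the target cue
```

The baseline corresponds to direct deployment of the source-trained model; the adapted model differs only in having seen unlabeled target tokens, which is the comparison the abstract describes.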
Affiliation(s)
- Shun Liao
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Ontario, Canada
- Zhaolei Zhang
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Ontario, Canada
9
Zhou H, Ruan D. Technical Note: An embedding-based medical note de-identification approach with sparse annotation. Med Phys 2021; 48:1341-1348. [PMID: 33340113 DOI: 10.1002/mp.14664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2020] [Revised: 11/04/2020] [Accepted: 11/25/2020] [Indexed: 11/12/2022]
Abstract
PURPOSE Medical note de-identification is critical for the protection of private information and the security of data sharing in collaborative research. The task demands the complete removal of all patient names and other sensitive information such as addresses and phone numbers from medical records. Accomplishing this goal is challenging, with many variations in the medical note formats and string representations. Existing de-identification approaches include pattern matching, where extensive dictionary lists are constructed a priori, and entity tagging, which trains on a large word-wise annotated corpus. This motivates us to study an alternative to the existing approaches with a reduced annotation burden.
METHODS In this work, we propose a novel approach that implicitly accounts for the language territory of sensitive information. Specifically, our approach incorporates a contextualized word embedding module and a multilayer perceptron to simultaneously infer the similarity of sensitive and non-sensitive vocabularies to a constructed landmark set, providing an overall sparsely supervised classification. To demonstrate the rationale, we present the principle and work pipeline with the task of name removal, but the proposed method applies to other strings as well.
RESULTS On a large cohort of hybrid clinical reports, including various forms of consulting, on-treatment-visit, and follow-up notes, we achieved >0.99 accuracies in our constructed training, validation, and testing sets. The sensitivity and specificity were 1.0 and 0.9973, respectively, for two randomly selected reports, comparing favorably to the benchmark Stanford NER tagger, which achieved 0.8529 and 0.9969. The F1 score was 0.889 ± 0.046 and 0.822 ± 0.103 across six randomly selected reports for the proposed method and the Stanford NER, respectively, and the result was significant under a one-sided t-test with alpha = 0.1.
CONCLUSION Our qualitative and quantitative analysis shows that our method achieved better results than the pretrained 3-class Stanford NER toolbox.
Affiliation(s)
- Hanyue Zhou
- Department of Bioengineering, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Dan Ruan
- Department of Bioengineering, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Department of Radiation Oncology, University of California, Los Angeles, Los Angeles, CA, 90095, USA
10
Johnson AEW, Bulgarelli L, Pollard TJ. Deidentification of free-text medical records using pre-trained bidirectional transformers. Proceedings of the ACM Conference on Health, Inference, and Learning 2020; 2020:214-221. [PMID: 34350426 PMCID: PMC8330601 DOI: 10.1145/3368555.3384455] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/30/2022]
Abstract
The ability of caregivers and investigators to share patient data is fundamental to many areas of clinical practice and biomedical research. Prior to sharing, it is often necessary to remove identifiers such as names, contact details, and dates in order to protect patient privacy. Deidentification, the process of removing identifiers, is challenging, however. High-quality annotated data for developing models is scarce; many target identifiers are highly heterogeneous (for example, there are uncountable variations of patient names); and in practice anything less than perfect sensitivity may be considered a failure. As a result, patient data is often withheld when sharing would be beneficial, and identifiable patient data is often divulged when a deidentified version would suffice. In recent years, advances in machine learning methods have led to rapid performance improvements in natural language processing tasks, in particular with the advent of large-scale pretrained language models. In this paper we develop and evaluate an approach for deidentification of clinical notes based on a bidirectional transformer model. We propose human-interpretable evaluation measures and demonstrate state-of-the-art performance against modern baseline models. Finally, we highlight current challenges in deidentification, including the absence of clear annotation guidelines, lack of portability of models, and paucity of training data. Code to develop our model is open source, allowing for broad reuse.
Affiliation(s)
- Tom J Pollard
- Massachusetts Institute of Technology, Cambridge, MA, USA