Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Hartman T, Howell MD, Dean J, Hoory S, Slyper R, Laish I, Gilon O, Vainstein D, Corrado G, Chou K, Po MJ, Williams J, Ellis S, Bee G, Hassidim A, Amira R, Beryozkin G, Szpektor I, Matias Y. Customization scenarios for de-identification of clinical notes. BMC Med Inform Decis Mak 2020;20:14. [PMID: 32000770 PMCID: PMC6993314 DOI: 10.1186/s12911-020-1026-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/11/2019] [Accepted: 01/14/2020] [Indexed: 11/10/2022] Open

Cited by Other Article(s)

Kovačević A, Bašaragin B, Milošević N, Nenadić G. De-identification of clinical free text using natural language processing: A systematic review of current approaches. Artif Intell Med 2024;151:102845. [PMID: 38555848 DOI: 10.1016/j.artmed.2024.102845] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Revised: 03/13/2024] [Accepted: 03/18/2024] [Indexed: 04/02/2024]

Abstract

BACKGROUND

Electronic health records (EHRs) are a valuable resource for data-driven medical research. However, the presence of protected health information (PHI) makes EHRs unsuitable to be shared for research purposes. De-identification, i.e., the process of removing PHI, is a critical step in making EHR data accessible. Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process.

OBJECTIVES

Our study aims to provide systematic evidence on how the de-identification of clinical free text written in English has evolved in the last thirteen years, and to report on the performances and limitations of the current state-of-the-art systems for the English language. In addition, we aim to identify challenges and potential research opportunities in this field.

METHODS

A systematic search in PubMed, Web of Science, and the DBLP was conducted for studies published between January 2010 and February 2023. Titles and abstracts were examined to identify the relevant studies. Selected studies were then analysed in-depth, and information was collected on de-identification methodologies, data sources, and measured performance.

RESULTS

A total of 2125 publications were identified for the title and abstract screening. 69 studies were found to be relevant. Machine learning (37 studies) and hybrid (26 studies) approaches are predominant, while six studies relied only on rules. The majority of the approaches were trained and evaluated on public corpora. The 2014 i2b2/UTHealth corpus is the most frequently used (36 studies), followed by the 2006 i2b2 (18 studies) and 2016 CEGS N-GRID (10 studies) corpora.

CONCLUSION

Earlier de-identification approaches aimed at English were mainly rule and machine learning hybrids with extensive feature engineering and post-processing, while more recent performance improvements are due to feature-inferring recurrent neural networks. Current leading performance is achieved using attention-based neural models. Recent studies report state-of-the-art F1-scores (over 98 %) when evaluated in the manner usually adopted by the clinical natural language processing community. However, their performance needs to be more thoroughly assessed with different measures to judge their reliability to safely de-identify data in a real-world setting. Without additional manually labeled training data, state-of-the-art systems fail to generalise well across a wide range of clinical sub-domains.

Chen F, Bokhari SMA, Cato K, Gürsoy G, Rossetti S. Examining the Generalizability of Pretrained De-identification Transformer Models on Narrative Nursing Notes. Appl Clin Inform 2024;15:357-367. [PMID: 38447965 PMCID: PMC11078567 DOI: 10.1055/a-2282-4340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Accepted: 02/15/2024] [Indexed: 03/08/2024] Open

Abstract

BACKGROUND

Narrative nursing notes are a valuable resource in informatics research with unique predictive signals about patient care. The open sharing of these data, however, is appropriately constrained by rigorous regulations set by the Health Insurance Portability and Accountability Act (HIPAA) for the protection of privacy. Several models have been developed and evaluated on the open-source i2b2 dataset. A focus on the generalizability of these models with respect to nursing notes remains understudied.

OBJECTIVES

The study aims to understand the generalizability of pretrained transformer models and to investigate the variability of personal protected health information (PHI) distribution patterns between discharge summaries and nursing notes, with the goal of informing the future design of model evaluation schemas.

METHODS

Two pretrained transformer models (RoBERTa, ClinicalBERT) fine-tuned on i2b2 2014 discharge summaries were evaluated on our dataset of inpatient nursing notes and compared with the baseline performance. Statistical testing was used to assess differences in PHI distribution across discharge summaries and nursing notes.

RESULTS

RoBERTa achieved the optimal performance when tested on an external source of data, with an F1 score of 0.887 across PHI categories and 0.932 in the PHI binary task. Overall, discharge summaries contained a higher number of PHI instances and categories of PHI compared with inpatient nursing notes.

CONCLUSION

The study investigated the applicability of two pretrained transformers to inpatient nursing notes and examined the distinctions between nursing notes and discharge summaries concerning the use of personal PHI. Discharge summaries presented a greater quantity of PHI instances and types when compared with narrative nursing notes, but narrative nursing notes exhibited more diversity in the types of PHI present, with some pertaining to patients' personal lives. The insights obtained from the research help improve the design and selection of algorithms, as well as contribute to the development of suitable performance thresholds for PHI.

Spithoff S, Grundy Q. Commercializing Personal Health Information: A Critical Qualitative Content Analysis of Documents Describing Proprietary Primary Care Databases in Canada. Int J Health Policy Manag 2023;12:6938. [PMID: 37579404 PMCID: PMC10461871 DOI: 10.34172/ijhpm.2023.6938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2021] [Accepted: 04/03/2023] [Indexed: 08/16/2023] Open

Abstract

BACKGROUND

Commercial data brokers have amassed large collections of primary care patient data in proprietary databases. Our study objective was to critically analyze how entities involved in the collection and use of these records construct the value of these proprietary databases. We also discuss the implications of the collection and use of these databases.

METHODS

We conducted a critical qualitative content analysis using publicly available documents describing the creation and use of proprietary databases containing Canadian primary care patient data. We identified relevant commercial data brokers, as well as entities involved in collecting data or in using data from these databases. We sampled documents associated with these entities that described any aspect of the collection, processing, and use of the proprietary databases. We extracted data from each document using a structured data tool. We conducted an interpretive thematic content analysis by inductively coding documents and the extracted data.

RESULTS

We analyzed 25 documents produced between 2013 and 2021. These documents were largely directed at the pharmaceutical industry, as well as shareholders, academics, and governments. The documents constructed the value of the proprietary databases by describing extensive, intimate, detailed patient-level data holdings. They provided examples of how the databases could be used by pharmaceutical companies for regulatory approval, marketing and understanding physician behaviour. The documents constructed the value of these data more broadly by claiming to improve health for patients, while also addressing risks to privacy. Some documents referred to the trade-offs between patient privacy and data utility, which suggests these considerations may be in tension.

CONCLUSION

Documents in our analysis positioned the proprietary databases as socially legitimate and valuable, particularly to pharmaceutical companies. The databases, however, may pose risks to patient privacy and contribute to problematic drug promotion. Solutions include expanding public data repositories with appropriate governance and external regulatory oversight.

Kotevski DP, Smee RI, Field M, Nemes YN, Broadley K, Vajdic CM. Evaluation of an automated Presidio anonymisation model for unstructured radiation oncology electronic medical records in an Australian setting. Int J Med Inform 2022;168:104880. [DOI: 10.1016/j.ijmedinf.2022.104880] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Revised: 09/13/2022] [Accepted: 09/27/2022] [Indexed: 11/08/2022]

Jordan S, Fontaine C, Hendricks-Sturrup R. Selecting Privacy-Enhancing Technologies for Managing Health Data Use. Front Public Health 2022;10:814163. [PMID: 35372185 PMCID: PMC8967420 DOI: 10.3389/fpubh.2022.814163] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 11/12/2021] [Accepted: 02/14/2022] [Indexed: 11/29/2022] Open

Shin SY, Kim HS. Data Pseudonymization in a Range That Does Not Affect Data Quality: Correlation with the Degree of Participation of Clinicians. J Korean Med Sci 2021;36:e299. [PMID: 34783216 PMCID: PMC8593412 DOI: 10.3346/jkms.2021.36.e299] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/04/2021] [Accepted: 10/18/2021] [Indexed: 12/28/2022] Open

Jonnagaddala J, Chen A, Batongbacal S, Nekkantti C. The OpenDeID corpus for patient de-identification. Sci Rep 2021;11:19973. [PMID: 34620985 PMCID: PMC8497517 DOI: 10.1038/s41598-021-99554-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/24/2021] [Accepted: 09/28/2021] [Indexed: 11/18/2022] Open

Liao S, Kiros J, Chen J, Zhang Z, Chen T. Improving domain adaptation in de-identification of electronic health records through self-training. J Am Med Inform Assoc 2021;28:2093-2100. [PMID: 34363664 PMCID: PMC8449604 DOI: 10.1093/jamia/ocab128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2021] [Revised: 07/01/2021] [Accepted: 07/04/2021] [Indexed: 11/13/2022] Open

Zhou H, Ruan D. Technical Note: An embedding-based medical note de-identification approach with sparse annotation. Med Phys 2021;48:1341-1348. [PMID: 33340113 DOI: 10.1002/mp.14664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2020] [Revised: 11/04/2020] [Accepted: 11/25/2020] [Indexed: 11/12/2022] Open

Abstract

PURPOSE

Medical note de-identification is critical for the protection of private information and the security of data sharing in collaborative research. The task demands the complete removal of all patient names and other sensitive information such as addresses and phone numbers from medical records. Accomplishing this goal is challenging, with many variations in the medical note formats and string representations. Existing de-identification approaches include pattern matching, where extensive dictionary lists are constructed a priori, and entity tagging, which trains on a large word-wise annotated corpus. This motivates us to study an alternative to the existing approaches with a reduced annotation burden.

METHODS

In this work, we propose a novel approach that implicitly accounts for the language territory of sensitive information. Specifically, our approach incorporates a contextualized word embedding module and a multilayer perceptron to simultaneously infer the similarity of sensitive and non-sensitive vocabularies to a constructed landmark set, providing an overall sparsely supervised classification. To demonstrate the rationale, we present the principle and work pipeline with the task of name removal, but the proposed method applies to other strings as well.

RESULTS

On a large cohort of hybrid clinical reports, including various forms of consulting, on-treatment-visit, and follow-up notes, we achieved >0.99 accuracies in our constructed training, validation, and testing sets. The sensitivity and specificity were 1.0 and 0.9973, respectively, for two randomly selected reports, comparing favorably to the benchmark Stanford NER tagger, which achieved 0.8529 and 0.9969. The F1 score was 0.889 ± 0.046 and 0.822 ± 0.103 across six randomly selected reports for the proposed method and the Stanford NER, respectively, and the result was significant under a one-sided t-test with alpha = 0.1.

CONCLUSION

Our qualitative and quantitative analysis shows that our method achieved better results than the pretrained 3-class Stanford NER toolbox.

Johnson AEW, Bulgarelli L, Pollard TJ. Deidentification of free-text medical records using pre-trained bidirectional transformers. Proceedings of the ACM Conference on Health, Inference, and Learning 2020;2020:214-221. [PMID: 34350426 PMCID: PMC8330601 DOI: 10.1145/3368555.3384455] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/30/2022]