1
|
Lee HJ, Zhang Y, Roberts K, Xu H. Leveraging existing corpora for de-identification of psychiatric notes using domain adaptation. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2018; 2017:1070-1079. [PMID: 29854175 PMCID: PMC5977650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
De-identification of clinical notes is a special case of named entity recognition. Supervised machine-learning (ML) algorithms have achieved promising results for this task. However, ML-based de-identification systems often require annotating a large number of clinical notes of interest, which is costly. Domain adaptation (DA) is a technology that enables learning from annotated datasets from different sources, thereby reducing annotation cost required for ML training in the target domain. In this study, we investigate the use of DA methods for deidentification of psychiatric notes. Three state-of-the-art DA methods: instance pruning, instance weighting, and feature augmentation are applied to three source corpora of annotated hospital discharge summaries, outpatient notes, and a mixture of different note types written for diabetic patients. Our results show that DA can increase deidentification performance over the baselines, indicating that it can effectively reduce annotation cost for the target psychiatric notes. Feature augmentation is shown to increase performance the most among the three DA methods. Performance variation among the different types of clinical notes is also observed, showing that a mixture of different types of notes brings the biggest increase in performance.
Collapse
Affiliation(s)
- Hee-Jin Lee
- University of Texas Health Science Center at Houston, Houston, TX
| | - Yaoyun Zhang
- University of Texas Health Science Center at Houston, Houston, TX
| | - Kirk Roberts
- University of Texas Health Science Center at Houston, Houston, TX
| | - Hua Xu
- University of Texas Health Science Center at Houston, Houston, TX
| |
Collapse
|
2
|
Lee HJ, Wu Y, Zhang Y, Xu J, Xu H, Roberts K. A hybrid approach to automatic de-identification of psychiatric notes. J Biomed Inform 2017; 75S:S19-S27. [PMID: 28602904 PMCID: PMC5705430 DOI: 10.1016/j.jbi.2017.06.006] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2017] [Revised: 06/02/2017] [Accepted: 06/05/2017] [Indexed: 11/17/2022]
Abstract
De-identification, or identifying and removing protected health information (PHI) from clinical data, is a critical step in making clinical data available for clinical applications and research. This paper presents a natural language processing system for automatic de-identification of psychiatric notes, which was designed to participate in the 2016 CEGS N-GRID shared task Track 1. The system has a hybrid structure that combines machine leaning techniques and rule-based approaches. The rule-based components exploit the structure of the psychiatric notes as well as characteristic surface patterns of PHI mentions. The machine learning components utilize supervised learning with rich features. In addition, the system performance was boosted with integration of additional data to the training set through domain adaptation. The hybrid system showed overall micro-averaged F-score 90.74 on the test set, second-best among all the participants of the CEGS N-GRID task.
Collapse
Affiliation(s)
- Hee-Jin Lee
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
| | - Yonghui Wu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
| | - Yaoyun Zhang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
| | - Jun Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
| | - Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States.
| | - Kirk Roberts
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States.
| |
Collapse
|
3
|
Li B, Vorobeychik Y, Li M, Malin B. Scalable Iterative Classification for Sanitizing Large-Scale Datasets. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 2017; 29:698-711. [PMID: 28943741 PMCID: PMC5607782 DOI: 10.1109/tkde.2016.2628180] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data exhibits weak structure (e.g., text), such that machine learning approaches have been developed to detect and remove identifiers from it. While learning is never perfect, and relying on such approaches to sanitize data can leak sensitive information, a small risk is often acceptable. Our goal is to balance the value of published data and the risk of an adversary discovering leaked identifiers. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted as non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked identifying information. We introduce a fast iterative greedy algorithm for the publisher that ensures a low utility for a resource-limited adversary. Moreover, using five text data sets we illustrate that our algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations.
Collapse
|
4
|
Yang H, Garibaldi JM. Automatic detection of protected health information from clinic narratives. J Biomed Inform 2015; 58 Suppl:S30-S38. [PMID: 26231070 DOI: 10.1016/j.jbi.2015.06.015] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2015] [Revised: 06/22/2015] [Accepted: 06/23/2015] [Indexed: 11/29/2022]
Abstract
This paper presents a natural language processing (NLP) system that was designed to participate in the 2014 i2b2 de-identification challenge. The challenge task aims to identify and classify seven main Protected Health Information (PHI) categories and 25 associated sub-categories. A hybrid model was proposed which combines machine learning techniques with keyword-based and rule-based approaches to deal with the complexity inherent in PHI categories. Our proposed approaches exploit a rich set of linguistic features, both syntactic and word surface-oriented, which are further enriched by task-specific features and regular expression template patterns to characterize the semantics of various PHI categories. Our system achieved promising accuracy on the challenge test data with an overall micro-averaged F-measure of 93.6%, which was the winner of this de-identification challenge.
Collapse
Affiliation(s)
- Hui Yang
- School of Computer Science, University of Nottingham, Nottingham, UK; Advanced Data Analysis Centre, University of Nottingham, Nottingham, UK.
| | - Jonathan M Garibaldi
- School of Computer Science, University of Nottingham, Nottingham, UK; Advanced Data Analysis Centre, University of Nottingham, Nottingham, UK
| |
Collapse
|
5
|
Kayaalp M, Browne AC, Dodd ZA, Sagan P, McDonald CJ. De-identification of Address, Date, and Alphanumeric Identifiers in Narrative Clinical Reports. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2014; 2014:767-776. [PMID: 25954383 PMCID: PMC4419982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Abstract
INTRODUCTION The Privacy Rule of Health Insurance Portability and Accountability Act requires that clinical documents be stripped of personally identifying information before they can be released to researchers and others. We have been developing a software application, NLM Scrubber, to de-identify narrative clinical reports. METHODS We compared NLM Scrubber with MIT's and MITRE's de-identification systems on 3,093 clinical reports about 1,636 patients. The performance of each system was analyzed on address, date, and alphanumeric identifier recognition separately. Their overall performance on de-identification and on conservation of the remaining clinical text was analyzed as well. RESULTS NLM Scrubber's sensitivity on de-identifying these identifiers was 99%. It's specificity on conserving the text with no personal identifiers was 99% as well. CONCLUSION The current version of the system recognizes and redacts patient names, alphanumeric identifiers, addresses and dates. We plan to make the system available prior to the AMIA Annual Symposium in 2014.
Collapse
Affiliation(s)
- Mehmet Kayaalp
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
| | - Allen C Browne
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
| | - Zeyno A Dodd
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
| | - Pamela Sagan
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
| | - Clement J McDonald
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
| |
Collapse
|
6
|
Deleger L, Lingren T, Ni Y, Kaiser M, Stoutenborough L, Marsolo K, Kouril M, Molnar K, Solti I. Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research. J Biomed Inform 2014; 50:173-183. [PMID: 24556292 DOI: 10.1016/j.jbi.2014.01.014] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2013] [Revised: 01/28/2014] [Accepted: 01/30/2014] [Indexed: 10/25/2022]
Abstract
OBJECTIVE The current study aims to fill the gap in available healthcare de-identification resources by creating a new sharable dataset with realistic Protected Health Information (PHI) without reducing the value of the data for de-identification research. By releasing the annotated gold standard corpus with Data Use Agreement we would like to encourage other Computational Linguists to experiment with our data and develop new machine learning models for de-identification. This paper describes: (1) the modifications required by the Institutional Review Board before sharing the de-identification gold standard corpus; (2) our efforts to keep the PHI as realistic as possible; (3) and the tests to show the effectiveness of these efforts in preserving the value of the modified data set for machine learning model development. MATERIALS AND METHODS In a previous study we built an original de-identification gold standard corpus annotated with true Protected Health Information (PHI) from 3503 randomly selected clinical notes for the 22 most frequent clinical note types of our institution. In the current study we modified the original gold standard corpus to make it suitable for external sharing by replacing HIPAA-specified PHI with newly generated realistic PHI. Finally, we evaluated the research value of this new dataset by comparing the performance of an existing published in-house de-identification system, when trained on the new de-identification gold standard corpus, with the performance of the same system, when trained on the original corpus. We assessed the potential benefits of using the new de-identification gold standard corpus to identify PHI in the i2b2 and PhysioNet datasets that were released by other groups for de-identification research. We also measured the effectiveness of the i2b2 and PhysioNet de-identification gold standard corpora in identifying PHI in our original clinical notes. RESULTS Performance of the de-identification system using the new gold standard corpus as a training set was very close to training on the original corpus (92.56 vs. 93.48 overall F-measures). Best i2b2/PhysioNet/CCHMC cross-training performances were obtained when training on the new shared CCHMC gold standard corpus, although performances were still lower than corpus-specific trainings. DISCUSSION AND CONCLUSION We successfully modified a de-identification dataset for external sharing while preserving the de-identification research value of the modified gold standard corpus with limited drop in machine learning de-identification performance.
Collapse
Affiliation(s)
- Louise Deleger
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| | - Todd Lingren
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| | - Yizhao Ni
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| | - Megan Kaiser
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| | - Laura Stoutenborough
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| | - Keith Marsolo
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| | - Michal Kouril
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| | - Katalin Molnar
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| | - Imre Solti
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| |
Collapse
|
7
|
Hill S, Merchant R, Ungar L. LESSONS LEARNED ABOUT PUBLIC HEALTH FROM ONLINE CROWD SURVEILLANCE. BIG DATA 2013; 1:160-167. [PMID: 25045598 PMCID: PMC4102381 DOI: 10.1089/big.2013.0020] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
The Internet has forever changed the way people access information and make decisions about their healthcare needs. Patients now share information about their health at unprecedented rates on social networking sites such as Twitter and Facebook and on medical discussion boards. In addition to explicitly shared information about health conditions through posts, patients reveal data on their inner fears and desires about health when searching for health-related keywords on search engines. Data are also generated by the use of mobile phone applications that track users' health behaviors (e.g., eating and exercise habits) as well as give medical advice. The data generated through these applications are mined and repackaged by surveillance systems developed by academics, companies, and governments alike to provide insight to patients and healthcare providers for medical decisions. Until recently, most Internet research in public health has been surveillance focused or monitoring health behaviors. Only recently have researchers used and interacted with the crowd to ask questions and collect health-related data. In the future, we expect to move from this surveillance focus to the "ideal" of Internet-based patient-level interventions where healthcare providers help patients change their health behaviors. In this article, we highlight the results of our prior research on crowd surveillance and make suggestions for the future.
Collapse
Affiliation(s)
- Shawndra Hill
- Operations and Information Management Department, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania
| | - Raina Merchant
- Department of Emergency Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
| | - Lyle Ungar
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania
| |
Collapse
|
8
|
Fernandes AC, Cloete D, Broadbent MTM, Hayes RD, Chang CK, Jackson RG, Roberts A, Tsang J, Soncul M, Liebscher J, Stewart R, Callard F. Development and evaluation of a de-identification procedure for a case register sourced from mental health electronic records. BMC Med Inform Decis Mak 2013; 13:71. [PMID: 23842533 PMCID: PMC3751474 DOI: 10.1186/1472-6947-13-71] [Citation(s) in RCA: 122] [Impact Index Per Article: 11.1] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2012] [Accepted: 04/18/2013] [Indexed: 11/26/2022] Open
Abstract
Background Electronic health records (EHRs) provide enormous potential for health research but also present data governance challenges. Ensuring de-identification is a pre-requisite for use of EHR data without prior consent. The South London and Maudsley NHS Trust (SLaM), one of the largest secondary mental healthcare providers in Europe, has developed, from its EHRs, a de-identified psychiatric case register, the Clinical Record Interactive Search (CRIS), for secondary research. Methods We describe development, implementation and evaluation of a bespoke de-identification algorithm used to create the register. It is designed to create dictionaries using patient identifiers (PIs) entered into dedicated source fields and then identify, match and mask them (with ZZZZZ) when they appear in medical texts. We deemed this approach would be effective, given high coverage of PI in the dedicated fields and the effectiveness of the masking combined with elements of a security model. We conducted two separate performance tests i) to test performance of the algorithm in masking individual true PIs entered in dedicated fields and then found in text (using 500 patient notes) and ii) to compare the performance of the CRIS pattern matching algorithm with a machine learning algorithm, called the MITRE Identification Scrubber Toolkit – MIST (using 70 patient notes – 50 notes to train, 20 notes to test on). We also report any incidences of potential breaches, defined by occurrences of 3 or more true or apparent PIs in the same patient’s notes (and in an additional set of longitudinal notes for 50 patients); and we consider the possibility of inferring information despite de-identification. Results True PIs were masked with 98.8% precision and 97.6% recall. As anticipated, potential PIs did appear, owing to misspellings entered within the EHRs. We found one potential breach. In a separate performance test, with a different set of notes, CRIS yielded 100% precision and 88.5% recall, while MIST yielded a 95.1% and 78.1%, respectively. We discuss how we overcome the realistic possibility – albeit of low probability – of potential breaches through implementation of the security model. Conclusion CRIS is a de-identified psychiatric database sourced from EHRs, which protects patient anonymity and maximises data available for research. CRIS demonstrates the advantage of combining an effective de-identification algorithm with a carefully designed security model. The paper advances much needed discussion of EHR de-identification – particularly in relation to criteria to assess de-identification, and considering the contexts of de-identified research databases when assessing the risk of breaches of confidential patient information.
Collapse
|
9
|
Deleger L, Li Q, Lingren T, Kaiser M, Molnar K, Stoutenborough L, Kouril M, Marsolo K, Solti I. Building gold standard corpora for medical natural language processing tasks. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2012; 2012:144-153. [PMID: 23304283 PMCID: PMC3540456] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/01/2023]
Abstract
We present the construction of three annotated corpora to serve as gold standards for medical natural language processing (NLP) tasks. Clinical notes from the medical record, clinical trial announcements, and FDA drug labels are annotated. We report high inter-annotator agreements (overall F-measures between 0.8467 and 0.9176) for the annotation of Personal Health Information (PHI) elements for a de-identification task and of medications, diseases/disorders, and signs/symptoms for information extraction (IE) task. The annotated corpora of clinical trials and FDA labels will be publicly released and to facilitate translational NLP tasks that require cross-corpora interoperability (e.g. clinical trial eligibility screening) their annotation schemas are aligned with a large scale, NIH-funded clinical text annotation project.
Collapse
Affiliation(s)
- Louise Deleger
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH
| | - Qi Li
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH
| | - Todd Lingren
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH
| | - Megan Kaiser
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH
| | - Katalin Molnar
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH
| | - Laura Stoutenborough
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH
| | - Michal Kouril
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH
| | - Keith Marsolo
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH
| | - Imre Solti
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH
| |
Collapse
|
10
|
Ferrández O, South BR, Shen S, Friedlin FJ, Samore MH, Meystre SM. BoB, a best-of-breed automated text de-identification system for VHA clinical documents. J Am Med Inform Assoc 2012; 20:77-83. [PMID: 22947391 DOI: 10.1136/amiajnl-2012-001020] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/04/2022] Open
Abstract
OBJECTIVE De-identification allows faster and more collaborative clinical research while protecting patient confidentiality. Clinical narrative de-identification is a tedious process that can be alleviated by automated natural language processing methods. The goal of this research is the development of an automated text de-identification system for Veterans Health Administration (VHA) clinical documents. MATERIALS AND METHODS We devised a novel stepwise hybrid approach designed to improve the current strategies used for text de-identification. The proposed system is based on a previous study on the best de-identification methods for VHA documents. This best-of-breed automated clinical text de-identification system (aka BoB) tackles the problem as two separate tasks: (1) maximize patient confidentiality by redacting as much protected health information (PHI) as possible; and (2) leave de-identified documents in a usable state preserving as much clinical information as possible. RESULTS We evaluated BoB with a manually annotated corpus of a variety of VHA clinical notes, as well as with the 2006 i2b2 de-identification challenge corpus. We present evaluations at the instance- and token-level, with detailed results for BoB's main components. Moreover, an existing text de-identification system was also included in our evaluation. DISCUSSION BoB's design efficiently takes advantage of the methods implemented in its pipeline, resulting in high sensitivity values (especially for sensitive PHI categories) and a limited number of false positives. CONCLUSIONS Our system successfully addressed VHA clinical document de-identification, and its hybrid stepwise design demonstrates robustness and efficiency, prioritizing patient confidentiality while leaving most clinical information intact.
Collapse
Affiliation(s)
- Oscar Ferrández
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT 84112, USA.
| | | | | | | | | | | |
Collapse
|
11
|
Deleger L, Molnar K, Savova G, Xia F, Lingren T, Li Q, Marsolo K, Jegga A, Kaiser M, Stoutenborough L, Solti I. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. J Am Med Inform Assoc 2012; 20:84-94. [PMID: 22859645 PMCID: PMC3555323 DOI: 10.1136/amiajnl-2012-001012] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Objective (1) To evaluate a state-of-the-art natural language processing (NLP)-based approach to automatically de-identify a large set of diverse clinical notes. (2) To measure the impact of de-identification on the performance of information extraction algorithms on the de-identified documents. Material and methods A cross-sectional study that included 3503 stratified, randomly selected clinical notes (over 22 note types) from five million documents produced at one of the largest US pediatric hospitals. Sensitivity, precision, F value of two automated de-identification systems for removing all 18 HIPAA-defined protected health information elements were computed. Performance was assessed against a manually generated ‘gold standard’. Statistical significance was tested. The automated de-identification performance was also compared with that of two humans on a 10% subsample of the gold standard. The effect of de-identification on the performance of subsequent medication extraction was measured. Results The gold standard included 30 815 protected health information elements and more than one million tokens. The most accurate NLP method had 91.92% sensitivity (R) and 95.08% precision (P) overall. The performance of the system was indistinguishable from that of human annotators (annotators' performance was 92.15%(R)/93.95%(P) and 94.55%(R)/88.45%(P) overall while the best system obtained 92.91%(R)/95.73%(P) on same text). The impact of automated de-identification was minimal on the utility of the narrative notes for subsequent information extraction as measured by the sensitivity and precision of medication name extraction. Discussion and conclusion NLP-based de-identification shows excellent performance that rivals the performance of human annotators. Furthermore, unlike manual de-identification, the automated approach scales up to millions of documents quickly and inexpensively.
Collapse
Affiliation(s)
- Louise Deleger
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229-3039, USA
| | | | | | | | | | | | | | | | | | | | | |
Collapse
|
12
|
Benton A, Holmes JH, Hill S, Chung A, Ungar L. medpie: an information extraction package for medical message board posts. Bioinformatics 2012; 28:743-4. [PMID: 22262673 DOI: 10.1093/bioinformatics/bts030] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
SUMMARY We have developed medpie, a software package for preparing medical message board corpora and extracting patient mentions and statistics for drugs, herbs and adverse effects experienced from them. The package is divided into web-crawling, HTML-cleaning, de-identification and information extraction modules. It also includes a sample controlled vocabulary of drugs, herbs and adverse effect terms. AVAILABILITY http://www.cis.upenn.edu/~ungar/medpie.zip. DEPENDENCIES Python 2.6 or 2.7.
Collapse
Affiliation(s)
- A Benton
- University of Pennsylvania School of Medicine, Center for Clinical Epidemiology and Biostatistics, University of Pennsylvania, PA 19104, USA
| | | | | | | | | |
Collapse
|
13
|
Islamaj Doğan R, Yeganova L. Topics in machine learning for biomedical literature analysis and text retrieval. BMC Bioinformatics 2011; 12 Suppl 3:I1. [PMID: 21658287 PMCID: PMC3111586 DOI: 10.1186/1471-2105-12-s3-i1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022] Open
|