1
Tang A, Deléger L, Bossy R, Zweigenbaum P, Nédellec C. Do syntactic trees enhance Bidirectional Encoder Representations from Transformers (BERT) models for chemical–drug relation extraction? Database (Oxford) 2022; 2022:6675625. [PMID: 36006843 PMCID: PMC9408061 DOI: 10.1093/database/baac070]
Abstract
Collecting relations between chemicals and drugs is crucial in biomedical research. Pre-trained transformer models such as Bidirectional Encoder Representations from Transformers (BERT) have shown limitations on biomedical texts; more specifically, the scarcity of annotated data makes relation extraction (RE) from biomedical texts very challenging. In this paper, we hypothesize that enriching a pre-trained transformer model with syntactic information may improve its performance on chemical–drug RE tasks. For this purpose, we propose three syntax-enhanced models based on the domain-specific BioBERT model: Chunking-Enhanced-BioBERT and Constituency-Tree-BioBERT, in which constituency information is integrated, and Multi-Task-Syntactic (MTS)-BioBERT, a multi-task learning framework in which syntactic information is injected implicitly by adding syntax-related tasks as training objectives. In addition, we test an existing model, Late-Fusion, which is enhanced with syntactic dependency information, and we build ensemble systems combining syntax-enhanced and non-syntax-enhanced models. Experiments are conducted on the BioCreative VII DrugProt corpus, a manually annotated corpus for the development and evaluation of RE systems. Our results reveal that syntax-enhanced models generally degrade the performance of BioBERT on biomedical RE, but improve it when the subject–object distance of a candidate semantic relation is long. We also explore the impact of the quality of dependency parses. Our code is available at https://github.com/Maple177/syntax-enhanced-RE/tree/drugprot (MTS-BioBERT only) and https://github.com/Maple177/drugprot-relation-extraction (all other experiments). Database URL: https://github.com/Maple177/drugprot-relation-extraction
Affiliation(s)
- Anfu Tang
- INRAE, MaIAGE, Université Paris-Saclay, Domaine de Vilvert, Jouy-en-Josas 78352, France
- CNRS, Laboratoire interdisciplinaire des sciences du numérique, Université Paris-Saclay, Campus universitaire bât 507, Rue du Belvedère, Orsay 91405, France
- Louise Deléger
- INRAE, MaIAGE, Université Paris-Saclay, Domaine de Vilvert, Jouy-en-Josas 78352, France
- Robert Bossy
- INRAE, MaIAGE, Université Paris-Saclay, Domaine de Vilvert, Jouy-en-Josas 78352, France
- Pierre Zweigenbaum
- CNRS, Laboratoire interdisciplinaire des sciences du numérique, Université Paris-Saclay, Campus universitaire bât 507, Rue du Belvedère, Orsay 91405, France
- Claire Nédellec
- INRAE, MaIAGE, Université Paris-Saclay, Domaine de Vilvert, Jouy-en-Josas 78352, France
2
Campillos-Llanos L, Thomas C, Bilinski É, Neuraz A, Rosset S, Zweigenbaum P. Lessons Learned from the Usability Evaluation of a Simulated Patient Dialogue System. J Med Syst 2021; 45:69. [PMID: 33999302 DOI: 10.1007/s10916-021-01737-4]
Abstract
Simulated consultations through virtual patients allow medical students to practice history-taking skills. Ideally, applications should provide interactions in natural language and be multi-case, multi-specialty. Nevertheless, few systems handle or are tested on a large variety of cases. We present a virtual patient dialogue system in which a medical trainer types new cases and these are processed without human intervention. To develop it, we designed a patient record model, a knowledge model for the history-taking task, and a termino-ontological model for term variation and out-of-vocabulary words. We evaluated whether this system provided quality dialogue across medical specialities (n = 18), and with unseen cases (n = 29) compared to the cases used for development (n = 6). Medical evaluators (students, residents, practitioners, and researchers) conducted simulated history-taking with the system and assessed its performance through Likert-scale questionnaires. We analysed interaction logs and evaluated system correctness. The mean user evaluation score for the 29 unseen cases was 4.06 out of 5 (very good). The evaluation of correctness determined that, on average, 74.3% (sd = 9.5) of replies were correct, 14.9% (sd = 6.3) incorrect, and in 10.7% the system behaved cautiously by deferring a reply. In the user evaluation, all aspects scored higher in the 29 unseen cases than in the 6 seen cases. Although such a multi-case system has its limits, the evaluation showed that creating it is feasible; that it performs adequately; and that it is judged usable. We discuss some lessons learned and pivotal design choices affecting its performance and the end-users, who are primarily medical students.
Affiliation(s)
- Leonardo Campillos-Llanos
- Université Paris-Saclay, CNRS, LISN, Orsay, France
- ILLA, Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain
3
Zheng Y, Meng X, Zweigenbaum P, Chen L, Xia J. Hybrid phenotype mining method for investigating off-target protein and underlying side effects of anti-tumor immunotherapy. BMC Med Inform Decis Mak 2020; 20:133. [PMID: 32646421 PMCID: PMC7346346 DOI: 10.1186/s12911-020-1105-4]
Abstract
BACKGROUND It is of utmost importance to investigate novel therapies for cancer, a major cause of death. In recent years, immunotherapies, especially those directed against immune checkpoints, have brought significant improvement to cancer management. However, immune checkpoint blockade (ICB) by monoclonal antibodies may cause common and severe adverse drug reactions (ADRs), the cause of which remains largely undetermined. We hypothesize that ICB agents may induce adverse reactions through off-target protein interactions, similar to the ADR-causing off-target effects of small molecules. In this study, we propose a hybrid phenotype mining approach that integrates molecular-level information and provides new mechanistic insights into ICB-associated ADRs. METHODS We trained a conditional random fields model on the TAC 2017 benchmark training data, then used it to extract all drug-centric phenotypes for the five anti-PD-1/PD-L1 drugs from the drug labels of the DailyMed database. Proteins structurally similar to the drugs were obtained using BlastP, and the gene targets of the drugs were obtained from the STRING database. The target-centric phenotypes were extracted from the Human Phenotype Ontology database. Finally, a screening module was designed to investigate off-target proteins, making use of gene ontology analysis and pathway analysis. RESULTS Through cross-analysis of the drug and target-gene phenotypes, an off-target effect caused by mutation of the gene BTK was found, and the candidate side-effect off-target site was analyzed. CONCLUSIONS This research provides a hybrid method combining biomedical natural language processing and bioinformatics to investigate the off-target-based mechanism of ICB treatment. The method can also be applied to the investigation of ADRs related to other large-molecule drugs.
Affiliation(s)
- Yuyu Zheng
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Xiangyu Meng
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
- Institut Curie, CNRS, UMR144, Molecular Oncology Team, PSL Research University, Paris, France
- Lingling Chen
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Jingbo Xia
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
4
Baghdadi Y, Bourrée A, Robert A, Rey G, Gallay A, Zweigenbaum P, Grouin C, Fouillet A. A New Approach to Compare the Performance of Two Classification Methods of Causes of Death for Timely Surveillance in France. Stud Health Technol Inform 2019; 264:925-929. [PMID: 31438059 DOI: 10.3233/shti190359]
Abstract
Timely mortality surveillance in France is based on the monitoring of electronic death certificates to provide information to health authorities. This study analyzes the performance of a rule-based method and a supervised machine learning method for classifying medical causes of death into 60 mortality syndromic groups (MSGs). Performance was first measured on a test set. We then compared the trends of the monthly numbers of deaths classified into MSGs from 2012 to 2016 using both methods. Among the 60 MSGs, 31 achieved recall and precision over 0.95 for at least one of the two methods on the test set. On the whole dataset, the correlation coefficients between the monthly numbers of deaths obtained by the two methods were close to 1 for 21 of the 31 MSGs. This approach is useful for analyzing a large number of categories or when annotated resources are limited.
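The trend comparison used in this study (correlating the monthly counts produced by the two classification methods) can be sketched with a minimal, self-contained example; the monthly counts below are hypothetical, not the study's data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical monthly numbers of deaths assigned to one MSG
# by the rule-based method and by the supervised classifier.
rule_based = [120, 95, 80, 60, 55, 70, 90, 110, 150, 210, 260, 240]
supervised = [118, 97, 78, 63, 54, 72, 88, 112, 148, 205, 255, 238]

r = pearson_r(rule_based, supervised)
print(round(r, 3))  # close to 1 => the two methods track the same trend
```

A coefficient near 1 for an MSG indicates that either method could be used interchangeably for trend monitoring of that group.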
Affiliation(s)
- Alix Bourrée
- Santé publique France, Saint-Maurice, France
- LIMSI, CNRS, Université Paris-Saclay, F-91405 Orsay, France
- Aude Robert
- Epidemiology Center on Medical Causes of Death (Inserm-CépiDc), Kremlin-Bicêtre, France
- Grégoire Rey
- Epidemiology Center on Medical Causes of Death (Inserm-CépiDc), Kremlin-Bicêtre, France
- Anne Gallay
- Santé publique France, Saint-Maurice, France
- Cyril Grouin
- LIMSI, CNRS, Université Paris-Saclay, F-91405 Orsay, France
5
Campillos-Llanos L, Grouin C, Lillo-Le Louët A, Zweigenbaum P. Initial Experiments for Pharmacovigilance Analysis in Social Media Using Summaries of Product Characteristics. Stud Health Technol Inform 2019; 264:60-64. [PMID: 31437885 DOI: 10.3233/shti190183]
Abstract
We report initial experiments on analyzing social media with an NLP annotation tool, applied to web posts about medications of current interest (baclofen, levothyroxine and vaccines) and to summaries of product characteristics (SPCs). We conducted supervised experiments on a subset of messages annotated by experts as positive or negative for misuse; F-scores ranged from 0.62 to 0.91. We also annotated both SPCs and another set of posts to compare MedDRA annotations in each source. A pharmacovigilance expert checked the output and confirmed that entities not found in SPCs might express drug misuse or unknown ADRs.
Affiliation(s)
- Cyril Grouin
- LIMSI, CNRS, Université Paris-Saclay, F-91405 Orsay, France
- Agnès Lillo-Le Louët
- CRPV Paris-HEGP, Hôpital Européen Georges Pompidou, 20 rue Leblanc, 75015 Paris, France
6
Baghdadi Y, Bourrée A, Robert A, Rey G, Gallay A, Zweigenbaum P, Grouin C, Fouillet A. Automatic classification of free-text medical causes from death certificates for reactive mortality surveillance in France. Int J Med Inform 2019; 131:103915. [PMID: 31522022 DOI: 10.1016/j.ijmedinf.2019.06.022]
Abstract
BACKGROUND Mortality surveillance is of fundamental importance to public health surveillance. The real-time recording of death certificates through the Electronic Death Registration System (EDRS) provides valuable data for reactive mortality surveillance based on medical causes of death in free-text format. Reactive mortality surveillance is based on the monitoring of mortality syndromic groups (MSGs). An MSG is a cluster of medical causes of death (pathologies, syndromes or symptoms) that meets the objectives of early detection and impact assessment of public health events. The aim of this study is to implement and measure the performance of a rule-based method and two supervised models for automatic classification of free-text causes of death from death certificates, in order to use them for routine surveillance. METHOD A rule-based method was implemented using four processing steps: standardization rules, splitting causes of death using delimiters, spelling correction, and dictionary projection. A supervised machine learning method using a linear Support Vector Machine (SVM) classifier was also implemented. Two models were produced using different features: SVM1, based solely on surface features, and SVM2, combining surface features with the MSGs assigned by the rule-based method as feature vectors. The evaluation was conducted using an annotated subset of electronic death certificates received between 2012 and 2016. Classification performance was evaluated on seven MSGs (Influenza, Low respiratory diseases, Asphyxia/abnormal respiration, Acute respiratory disease, Sepsis, Chronic digestive diseases, and Chronic endocrine diseases). RESULTS The rule-based method and the SVM2 model displayed high performance, with F-measures over 0.94 for all MSGs. Precision and recall were slightly higher for the rule-based method and the SVM2 model. An error analysis showed that errors were not specific to any single MSG.
CONCLUSION The high performance of the rule-based method and the SVM2 model will allow us to set up a reactive mortality surveillance system based on free-text death certificates. This surveillance will be an added value for public health decision making.
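The four-step rule-based pipeline described in METHOD (standardization, delimiter splitting, spelling correction, dictionary projection) can be sketched as follows; the dictionary, spelling table, and delimiter set are illustrative stand-ins, not the study's actual resources:

```python
import re
import unicodedata

# Toy dictionary projecting normalized cause strings onto MSGs
# (terms and labels are illustrative, not the study's resources).
MSG_DICT = {
    "influenza": "Influenza",
    "flu": "Influenza",
    "sepsis": "Sepsis",
    "septic shock": "Sepsis",
    "copd": "Low respiratory diseases",
}

# Minimal spelling-variant table standing in for the spelling corrector.
SPELLING = {"influensa": "influenza", "sepsys": "sepsis"}

def standardize(text):
    """Step 1: lowercase, strip accents and normalize whitespace."""
    text = unicodedata.normalize("NFD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", text).strip()

def split_causes(line):
    """Step 2: split a free-text line into candidate causes on delimiters."""
    return [p.strip() for p in re.split(r"[;,/]| and ", line) if p.strip()]

def classify(line):
    """Steps 3-4: correct spelling, then project each cause onto an MSG."""
    msgs = set()
    for cause in split_causes(standardize(line)):
        cause = " ".join(SPELLING.get(tok, tok) for tok in cause.split())
        if cause in MSG_DICT:
            msgs.add(MSG_DICT[cause])
    return sorted(msgs)

print(classify("Septic shock, influensa"))  # ['Influenza', 'Sepsis']
```

In the SVM2 model, MSGs produced by a pipeline like this one are appended to the surface features as additional feature-vector components.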
Affiliation(s)
- Yasmine Baghdadi
- Santé publique France, Division for Data Science, Saint-Maurice, France.
- Alix Bourrée
- Santé publique France, Division for Data Science, Saint-Maurice, France
- Aude Robert
- CépiDc-Inserm, Epidemiology Center on Medical Causes of Death, Kremlin-Bicêtre, France
- Grégoire Rey
- CépiDc-Inserm, Epidemiology Center on Medical Causes of Death, Kremlin-Bicêtre, France
- Anne Gallay
- Santé publique France, Division of Non-communicable Diseases and Injuries, Saint-Maurice, France
- Cyril Grouin
- LIMSI, CNRS, Université Paris-Saclay, Orsay, France
- Anne Fouillet
- Santé publique France, Division for Data Science, Saint-Maurice, France
7
Névéol A, Zweigenbaum P. Expanding the Diversity of Texts and Applications: Findings from the Section on Clinical Natural Language Processing of the International Medical Informatics Association Yearbook. Yearb Med Inform 2018; 27:193-198. [PMID: 30157523 PMCID: PMC6115241 DOI: 10.1055/s-0038-1667080]
Abstract
Objectives:
To summarize recent research and present a selection of the best papers published in 2017 in the field of clinical Natural Language Processing (NLP).
Methods:
A survey of the literature was performed by the two editors of the NLP section of the International Medical Informatics Association (IMIA) Yearbook. Bibliographic databases PubMed and Association of Computational Linguistics (ACL) Anthology were searched for papers with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. A total of 709 papers were automatically ranked and then manually reviewed based on title and abstract. A shortlist of 15 candidate best papers was selected by the section editors and peer-reviewed by independent external reviewers to come to the three best clinical NLP papers for 2017.
Results:
Clinical NLP best papers provide a contribution that ranges from methodological studies to the application of research results to practical clinical settings. They draw from text genres as diverse as clinical narratives across hospitals and languages or social media.
Conclusions:
Clinical NLP continued to thrive in 2017, with an increasing number of contributions towards applications compared to fundamental methods. Methodological work explores deep learning and system adaptation across language variants. Research results continue to translate into freely available tools and corpora, mainly for the English language.
8
Paris N, Mendis M, Daniel C, Murphy S, Tannier X, Zweigenbaum P. i2b2 implemented over SMART-on-FHIR. AMIA Jt Summits Transl Sci Proc 2018; 2017:369-378. [PMID: 29888095 PMCID: PMC5961782]
Abstract
Integrating Biology and the Bedside (i2b2) is the de facto open-source medical tool for cohort discovery. Fast Healthcare Interoperability Resources (FHIR) is a new standard for exchanging health care information electronically. Substitutable Modular third-party Applications (SMART) defines the SMART-on-FHIR specification, which describes how applications interface with Electronic Health Records (EHRs) through FHIR. Related work made it possible to produce FHIR from an i2b2 instance or made i2b2 able to store FHIR datasets. In this paper, we extend i2b2 to search remotely into one or multiple SMART-on-FHIR Application Programming Interfaces (APIs). This enables the federation of queries, security, and terminology mapping, and also bridges the gap between i2b2 and modern big-data technologies.
Affiliation(s)
- Nicolas Paris
- WIND-DSI, AP-HP, Paris, France
- LIMSI, CNRS, Université Paris-Saclay, Orsay, France
- INSERM, UMR S 1142, LIMICS, Paris, France
- Christel Daniel
- WIND-DSI, AP-HP, Paris, France
- INSERM, UMR S 1142, LIMICS, Paris, France
- Xavier Tannier
- INSERM, UMR S 1142, LIMICS, Paris, France
- Sorbonne Universités, UPMC Univ Paris 06, France
9
Cohen KB, Xia J, Zweigenbaum P, Callahan TJ, Hargraves O, Goss F, Ide N, Névéol A, Grouin C, Hunter LE. Three Dimensions of Reproducibility in Natural Language Processing. LREC Int Conf Lang Resour Eval 2018; 2018:156-165. [PMID: 29911205 PMCID: PMC5998676]
Abstract
Despite considerable recent attention to problems with reproducibility of scientific research, there is a striking lack of agreement about the definition of the term. That is a problem, because the lack of a consensus definition makes it difficult to compare studies of reproducibility, and thus to have even a broad overview of the state of the issue in natural language processing. This paper proposes an ontology of reproducibility in that field. Its goal is to enhance both future research and communication about the topic, and retrospective meta-analyses. We show that three dimensions of reproducibility, corresponding to three kinds of claims in natural language processing papers, can account for a variety of types of research reports. These dimensions are reproducibility of a conclusion, of a finding, and of a value. Three biomedical natural language processing papers by the authors of this paper are analyzed with respect to these dimensions.
Affiliation(s)
- K Bretonnel Cohen
- Computational Bioscience Program, University of Colorado School of Medicine
- LIMSI, CNRS, Université Paris-Saclay
- Tiffany J Callahan
- Computational Bioscience Program, University of Colorado School of Medicine
- Foster Goss
- Department of Emergency Medicine, University of Colorado
- Lawrence E Hunter
- Computational Bioscience Program, University of Colorado School of Medicine
10
Névéol A, Dalianis H, Velupillai S, Savova G, Zweigenbaum P. Clinical Natural Language Processing in languages other than English: opportunities and challenges. J Biomed Semantics 2018; 9:12. [PMID: 29602312 PMCID: PMC5877394 DOI: 10.1186/s13326-018-0179-8]
Abstract
Background Natural language processing applied to clinical text or aimed at a clinical outcome has been thriving in recent years. This paper offers the first broad overview of clinical Natural Language Processing (NLP) for languages other than English. Recent studies are summarized to offer insights and outline opportunities in this area. Main Body We envision three groups of intended readers: (1) NLP researchers leveraging experience gained in other languages, (2) NLP researchers faced with establishing clinical text processing in a language other than English, and (3) clinical informatics researchers and practitioners looking for resources in their languages in order to apply NLP techniques and tools to clinical practice and/or investigation. We review work in clinical NLP in languages other than English. We classify these studies into three groups: (i) studies describing the development of new NLP systems or components de novo, (ii) studies describing the adaptation of NLP architectures developed for English to another language, and (iii) studies focusing on a particular clinical application. Conclusion We show the advantages and drawbacks of each method, and highlight the appropriate application context. Finally, we identify major challenges and opportunities that will affect the impact of NLP on clinical practice and public health studies in a context that encompasses English as well as other languages.
Affiliation(s)
- Aurélie Névéol
- LIMSI, CNRS, Université Paris Saclay, Rue John von Neumann, F-91405 Orsay, France
- Sumithra Velupillai
- School of Computer Science and Communication, KTH, Stockholm, Sweden
- Institute of Psychiatry, Psychology and Neuroscience, King's College, London, UK
- Guergana Savova
- Children's Hospital Boston and Harvard Medical School, Boston, Massachusetts, USA
- Pierre Zweigenbaum
- LIMSI, CNRS, Université Paris Saclay, Rue John von Neumann, F-91405 Orsay, France
11
Bouaud J, Bachimont B, Charlet J, Séroussi B, Boisvieux JF, Zweigenbaum P. From Text to Knowledge: a Unifying Document-Centered View of Analyzed Medical Language. Methods Inf Med 2018. [DOI: 10.1055/s-0038-1634559]
Abstract
Although medical language processing (MLP) has achieved some success, the actual use and dissemination of data extracted from free text by MLP systems is still very limited. We claim that the adoption of an 'enriched document' paradigm (or 'document-centered' view) can help address this issue. We present this paradigm and explain how it can be implemented, then discuss its expected benefits both for end-users and for MLP researchers.
12
Bachimont B, Bouaud J, Charlet J, Boisvieux JF, Zweigenbaum P. Issues in the Structuring and Acquisition of an Ontology for Medical Language Understanding. Methods Inf Med 2018. [DOI: 10.1055/s-0038-1634577]
Abstract
Medical natural language understanding basically aims at representing the contents of medical texts in a formal, conceptual representation. The understanding process itself increasingly relies on a body of domain knowledge, generally expressed in the same conceptual formalism. The design of such a conceptual representation is a key knowledge-acquisition issue. When representing knowledge, the most important point is to ensure that the formal exploitation of the knowledge representation conforms to its meaning in the domain. We examined some methodological and theoretical principles to enforce this conformity. These principles result from our experience in MENELAS, a medical language understanding project.
13
Névéol A, Zweigenbaum P. Making Sense of Big Textual Data for Health Care: Findings from the Section on Clinical Natural Language Processing. Yearb Med Inform 2017; 26:228-234. [PMID: 29063569 PMCID: PMC6239234 DOI: 10.15265/iy-2017-027]
Abstract
Objectives: To summarize recent research and present a selection of the best papers published in 2016 in the field of clinical Natural Language Processing (NLP). Method: A survey of the literature was performed by the two section editors of the IMIA Yearbook NLP section. Bibliographic databases were searched for papers with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. Papers were automatically ranked and then manually reviewed based on titles and abstracts. A shortlist of candidate best papers was first selected by the section editors before being peer-reviewed by independent external reviewers. Results: The five clinical NLP best papers provide contributions that range from emerging original foundational methods to transitioning solid established research results to a practical clinical setting. They offer a framework for abbreviation disambiguation and coreference resolution, a classification method to identify clinically useful sentences, an analysis of counseling conversations to improve support to patients with mental disorders, and the grounding of gradable adjectives. Conclusions: Clinical NLP continued to thrive in 2016, with an increasing number of contributions towards applications compared to fundamental methods. Fundamental work addresses increasingly complex problems such as lexical semantics, coreference resolution, and discourse analysis. Research results translate into freely available tools, mainly for English.
Affiliation(s)
- A. Névéol
- LIMSI, CNRS, Université Paris Saclay, Orsay, France
14
Abstract
OBJECTIVE To summarize recent research and present a selection of the best papers published in 2014 in the field of clinical Natural Language Processing (NLP). METHOD A systematic review of the literature was performed by the two section editors of the IMIA Yearbook NLP section by searching bibliographic databases with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. A shortlist of candidate best papers was first selected by the section editors before being peer-reviewed by independent external reviewers. RESULTS The clinical NLP best paper selection shows that the field is tackling text analysis methods of increasing depth. The full review process highlighted five papers addressing foundational methods in clinical NLP using clinically relevant texts from online forums or encyclopedias, clinical texts from Electronic Health Records, and included studies specifically aiming at a practical clinical outcome. The increased access to clinical data that was made possible with the recent progress of de-identification paved the way for the scientific community to address complex NLP problems such as word sense disambiguation, negation, temporal analysis and specific information nugget extraction. These advances in turn allowed for efficient application of NLP to clinical problems such as cancer patient triage. Another line of research investigates online clinically relevant texts and brings interesting insight on communication strategies to convey health-related information. CONCLUSIONS The field of clinical NLP is thriving through the contributions of both NLP researchers and healthcare professionals interested in applying NLP techniques for concrete healthcare purposes. Clinical NLP is becoming mature for practical applications with a significant clinical impact.
Affiliation(s)
- A Névéol
- Aurélie Névéol, LIMSI CNRS UPR 3251, Rue John von Neumann, Campus Universitaire d'Orsay, 91405 Orsay cedex, France, E-mail: {neveol,pz}@limsi.fr
15
Cohen KB, Goss FR, Zweigenbaum P, Hunter LE. Translational Morphosyntax: Distribution of Negation in Clinical Records and Biomedical Journal Articles. Stud Health Technol Inform 2017; 245:346-350. [PMID: 29295113]
Abstract
Prior knowledge of the distributional characteristics of linguistic phenomena can be useful for a variety of language processing tasks. This paper describes the distribution of negation in two types of biomedical texts: scientific journal articles and progress notes. Two types of negation are examined: explicit negation at the syntactic level and affixal negation at the sub-word level. The data show that the distribution of negation is significantly different in the two document types, with explicit negation more frequent in the clinical documents than in the scientific publications, and affixal negation more frequent in the journal articles at both the type and token levels. All code is available on GitHub: https://github.com/KevinBretonnelCohen/NegationDistribution.
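The two negation types contrasted in this paper can be illustrated with a minimal counting sketch; the marker list and prefix heuristic below are crude illustrative approximations, not the paper's actual lexicons or method:

```python
import re
from collections import Counter

# Illustrative approximations of the two negation types:
# explicit negation as a small closed word list, affixal negation
# as a crude negative-prefix pattern (will over-match in general).
EXPLICIT = {"no", "not", "without", "denies", "negative"}
AFFIXAL = re.compile(r"(?:un|in|im|non)[a-z]{4,}")  # e.g. "unremarkable"

def negation_profile(text):
    """Count explicit (word-level) and affixal (sub-word) negation."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    counts["explicit"] = sum(t in EXPLICIT for t in tokens)
    counts["affixal"] = sum(bool(AFFIXAL.fullmatch(t)) for t in tokens)
    return counts

# Clinical-style text: explicit negation dominates.
print(negation_profile("Patient denies chest pain; no fever."))
# Journal-style text: affixal negation dominates.
print(negation_profile("An unselected cohort showed insignificant differences."))
```

Comparing such profiles across a clinical corpus and a journal corpus would reproduce, in miniature, the kind of distributional contrast the paper reports.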
Affiliation(s)
- K Bretonnel Cohen
- Computational Bioscience Program, University of Colorado School of Medicine, Aurora, CO, USA
- Foster R Goss
- University of Colorado School of Medicine, Department of Emergency Medicine, Aurora, CO, USA
- Lawrence E Hunter
- Computational Bioscience Program, University of Colorado School of Medicine, Aurora, CO, USA
16.
Abstract
OBJECTIVE To summarize recent research and present a selection of the best papers published in 2015 in the field of clinical Natural Language Processing (NLP). METHOD A systematic review of the literature was performed by the two section editors of the IMIA Yearbook NLP section by searching bibliographic databases with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. Section editors first selected a shortlist of candidate best papers that were then peer-reviewed by independent external reviewers. RESULTS The clinical NLP best paper selection shows that clinical NLP is making use of a variety of texts of clinical interest to contribute to the analysis of clinical information and the building of a body of clinical knowledge. The full review process highlighted five papers analyzing patient-authored texts or seeking to connect and aggregate multiple sources of information. They provide a contribution to the development of methods, resources, applications, and sometimes a combination of these aspects. CONCLUSIONS The field of clinical NLP continues to thrive through the contributions of both NLP researchers and healthcare professionals interested in applying NLP techniques to impact clinical practice. Foundational progress in the field makes it possible to leverage a larger variety of texts of clinical interest for healthcare purposes.
Affiliation(s)
- Aurélie Névéol, LIMSI CNRS UPR 3251, Université Paris Saclay, Rue John von Neumann, 91400 Orsay, France
- Pierre Zweigenbaum, LIMSI CNRS UPR 3251, Université Paris Saclay, Rue John von Neumann, 91400 Orsay, France
17. Névéol A, Cohen KB, Grouin C, Hamon T, Lavergne T, Kelly L, Goeuriot L, Rey G, Robert A, Tannier X, Zweigenbaum P. Clinical Information Extraction at the CLEF eHealth Evaluation lab 2016. CEUR Workshop Proc 2016; 1609:28-42. PMID: 29308065; PMCID: PMC5756095.
Abstract
This paper reports on Task 2 of the 2016 CLEF eHealth evaluation lab which extended the previous information extraction tasks of ShARe/CLEF eHealth evaluation labs. The task continued with named entity recognition and normalization in French narratives, as offered in CLEF eHealth 2015. Named entity recognition involved ten types of entities including disorders that were defined according to Semantic Groups in the Unified Medical Language System® (UMLS®), which was also used for normalizing the entities. In addition, we introduced a large-scale classification task in French death certificates, which consisted of extracting causes of death as coded in the International Classification of Diseases, tenth revision (ICD10). Participant systems were evaluated against a blind reference standard of 832 titles of scientific articles indexed in MEDLINE, 4 drug monographs published by the European Medicines Agency (EMEA) and 27,850 death certificates using Precision, Recall and F-measure. In total, seven teams participated, including five in the entity recognition and normalization task, and five in the death certificate coding task. Three teams submitted their systems to our newly offered reproducibility track. For entity recognition, the highest performance was achieved on the EMEA corpus, with an overall F-measure of 0.702 for plain entities recognition and 0.529 for normalized entity recognition. For entity normalization, the highest performance was achieved on the MEDLINE corpus, with an overall F-measure of 0.552. For death certificate coding, the highest performance was 0.848 F-measure.
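The Precision, Recall and F-measure used to score systems against the reference standard are the standard set-overlap metrics; a minimal sketch follows. The entity spans below are toy data, not lab annotations.

```python
def precision_recall_f1(gold, predicted):
    """Standard P/R/F1 over sets of annotations (e.g. entity spans)."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # true positives: exact matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy annotations: (start offset, end offset, entity type).
gold = {(0, 8, "DISO"), (15, 22, "CHEM"), (30, 41, "ANAT")}
pred = {(0, 8, "DISO"), (15, 22, "PROC"), (30, 41, "ANAT")}
p, r, f = precision_recall_f1(gold, pred)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

With the entity type included in the tuple, this computes the "normalized entity recognition" view; dropping the type from the tuples gives the "plain entity recognition" view.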
Affiliation(s)
- K. Bretonnel Cohen, LIMSI, CNRS, Université Paris-Saclay, Orsay, France; University of Colorado, USA
- Cyril Grouin, LIMSI, CNRS, Université Paris-Saclay, Orsay, France
- Thierry Hamon, LIMSI, CNRS, Université Paris-Saclay, Orsay, France; Université Paris Nord, Villetaneuse, France
- Thomas Lavergne, LIMSI, CNRS, Université Paris-Saclay, Orsay, France; Univ. Paris-Sud, Orsay, France
- Liadh Kelly, ADAPT Centre, Trinity College, Dublin, Ireland
- Xavier Tannier, LIMSI, CNRS, Université Paris-Saclay, Orsay, France; Univ. Paris-Sud, Orsay, France
18. Abbe A, Grouin C, Zweigenbaum P, Falissard B. Text mining applications in psychiatry: a systematic literature review. Int J Methods Psychiatr Res 2016; 25:86-100. PMID: 26184780; PMCID: PMC6877250; DOI: 10.1002/mpr.1481.
Abstract
The expansion of biomedical literature is creating the need for efficient tools to keep pace with increasing volumes of information. Text mining (TM) approaches are becoming essential to facilitate the automated extraction of useful biomedical information from unstructured text. We reviewed the applications of TM in psychiatry, and explored its advantages and limitations. A systematic review of the literature was carried out using the CINAHL, Medline, EMBASE, PsycINFO and Cochrane databases. In this review, 1103 papers were screened, and 38 were included as applications of TM in psychiatric research. Using TM and content analysis, we identified four major areas of application: (1) psychopathology (i.e. observational studies focusing on mental illnesses), (2) the patient perspective (i.e. patients' thoughts and opinions), (3) medical records (i.e. safety issues, quality of care and description of treatments), and (4) the medical literature (i.e. identification of new scientific information in the literature). The information sources were qualitative studies, Internet postings, medical records and biomedical literature. Our work demonstrates that TM can contribute to complex research tasks in psychiatry. We discuss the benefits, limits, and further applications of this tool in the future.
Affiliation(s)
- Adeline Abbe, Inserm U669, Paris, France; University Paris-Sud and University Paris Descartes, UMR-S0669, Paris, France
- Bruno Falissard, Inserm U669, Paris, France; University Paris-Sud and University Paris Descartes, UMR-S0669, Paris, France
19. Rosier A, Mabo P, Temal L, Van Hille P, Dameron O, Deléger L, Grouin C, Zweigenbaum P, Jacques J, Chazard E, Laporte L, Henry C, Burgun A. Remote Monitoring of Cardiac Implantable Devices: Ontology Driven Classification of the Alerts. Stud Health Technol Inform 2016; 221:59-63. PMID: 27071877.
Abstract
The number of patients that benefit from remote monitoring of cardiac implantable electronic devices, such as pacemakers and defibrillators, is growing rapidly. Consequently, the huge number of alerts that are generated and transmitted to the physicians represents a challenge to handle. We have developed a system based on a formal ontology that integrates the alert information and the patient data extracted from the electronic health record in order to better classify the importance of alerts. A pilot study was conducted on atrial fibrillation alerts. We show some examples of alert processing. The results suggest that this approach has the potential to significantly reduce the alert burden in telecardiology. The methods may be extended to other types of connected devices.
Affiliation(s)
- Lynda Temal, LTSI (Inserm UMR1099), Université de Rennes 1, Rennes, France
- Olivier Dameron, University of Rennes 1, UMR 6074 IRISA, 35042 Rennes, France
- Anita Burgun, INSERM UMR_S 1138 Eq 22, Paris Descartes University, France
20. Rosier A, Mabo P, Temal L, Van Hille P, Dameron O, Deléger L, Grouin C, Zweigenbaum P, Jacques J, Chazard E, Laporte L, Henry C, Burgun A. Personalized and automated remote monitoring of atrial fibrillation. Europace 2015; 18:347-52. DOI: 10.1093/europace/euv234.
21. Ben Abacha A, Chowdhury MFM, Karanasiou A, Mrabet Y, Lavelli A, Zweigenbaum P. Text mining for pharmacovigilance: Using machine learning for drug name recognition and drug-drug interaction extraction and classification. J Biomed Inform 2015; 58:122-132. PMID: 26432353; DOI: 10.1016/j.jbi.2015.09.015.
Abstract
Pharmacovigilance (PV) is defined by the World Health Organization as the science and activities related to the detection, assessment, understanding and prevention of adverse effects or any other drug-related problem. An essential aspect in PV is to acquire knowledge about Drug-Drug Interactions (DDIs). The shared tasks on DDI-Extraction organized in 2011 and 2013 have pointed out the importance of this issue and provided benchmarks for: Drug Name Recognition, DDI extraction and DDI classification. In this paper, we present our text mining systems for these tasks and evaluate their results on the DDI-Extraction benchmarks. Our systems rely on machine learning techniques using both feature-based and kernel-based methods. The obtained results for drug name recognition are encouraging. For DDI-Extraction, our hybrid system combining a feature-based method and a kernel-based method was ranked second in the DDI-Extraction-2011 challenge, and our two-step system for DDI detection and classification was ranked first in the DDI-Extraction-2013 task at SemEval. We discuss our methods and results and give pointers to future work.
Affiliation(s)
- Yassine Mrabet, Luxembourg Institute of Science and Technology, Luxembourg
22. Lavergne T, Grouin C, Zweigenbaum P. The contribution of co-reference resolution to supervised relation detection between bacteria and biotopes entities. BMC Bioinformatics 2015. PMID: 26201352; PMCID: PMC4511182; DOI: 10.1186/1471-2105-16-s10-s6.
Abstract
Background The acquisition of knowledge about relations between bacteria and their locations (habitats and geographical locations) in short texts about bacteria, as defined in the BioNLP-ST 2013 Bacteria Biotope task, depends on the detection of co-reference links between mentions of entities of each of these three types. To our knowledge, no participant in this task has investigated this aspect of the situation. The present work specifically addresses issues raised by this situation: (i) how to detect these co-reference links and associated co-reference chains; (ii) how to use them to prepare positive and negative examples to train a supervised system for the detection of relations between entity mentions; (iii) what context around which entity mentions contributes to relation detection when co-reference chains are provided. Results We present experiments and results obtained both with gold entity mentions (task 2 of BioNLP-ST 2013) and with automatically detected entity mentions (end-to-end system, in task 3 of BioNLP-ST 2013). Our supervised mention detection system uses a linear chain Conditional Random Fields classifier, and our relation detection system relies on a Logistic Regression (aka Maximum Entropy) classifier. They use a set of morphological, morphosyntactic and semantic features. To minimize false inferences, co-reference resolution applies a set of heuristic rules designed to optimize precision. They take into account the types of the detected entity mentions, and take advantage of the didactic nature of the texts of the corpus, where a large proportion of bacteria naming is fairly explicit (although natural referring expressions such as "the bacteria" are common). The resulting system achieved a 0.495 F-measure on the official test set when taking as input the gold entity mentions, and a 0.351 F-measure when taking as input entity mentions predicted by our CRF system, both of which are above the best BioNLP-ST 2013 participant system. 
Conclusions We show that co-reference resolution substantially improves over a baseline system which does not use co-reference information: about 3.5 F-measure points on the test corpus for the end-to-end system (5.5 points on the development corpus) and 7 F-measure points on both development and test corpora when gold mentions are used. While this outperforms the best published system on the BioNLP-ST 2013 Bacteria Biotope dataset, we consider that it provides mostly a stronger baseline from which more work can be started. We also emphasize the importance and difficulty of designing a comprehensive gold standard co-reference annotation, which we explain is a key point to further progress on the task.
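Point (ii) of this abstract, preparing training examples with the help of co-reference chains, can be sketched as follows. The propagation step below (expanding each gold relation to all co-referent mentions so that chain-linked pairs are not wrongly labelled negative) is an illustrative reconstruction, and the entity IDs, chains, and relations are hypothetical toy data.

```python
from itertools import product

def expand_relations(gold_relations, chains):
    """Propagate each gold (bacterium, location) relation to every
    co-referent mention of its arguments."""
    # Map each mention to the full set of mentions in its chain.
    coref = {m: set(chain) for chain in chains for m in chain}
    expanded = set()
    for bact, loc in gold_relations:
        for b, l in product(coref.get(bact, {bact}), coref.get(loc, {loc})):
            expanded.add((b, l))
    return expanded

chains = [["B1", "B2"]]   # e.g. "B. subtilis" ... "the bacterium"
gold = {("B1", "L1")}     # relation annotated on the first mention only
print(expand_relations(gold, chains))  # contains ('B1','L1') and ('B2','L1')
```

Any mention pair absent from the expanded set can then safely be emitted as a negative example for the logistic regression classifier.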
23. Grouin C, Moriceau V, Zweigenbaum P. Combining glass box and black box evaluations in the identification of heart disease risk factors and their temporal relations from clinical records. J Biomed Inform 2015; 58 Suppl:S133-S142. PMID: 26142870; DOI: 10.1016/j.jbi.2015.06.014.
Abstract
BACKGROUND The determination of risk factors and their temporal relations in natural language patient records is a complex task which has been addressed in the i2b2/UTHealth 2014 shared task. In this context, in most systems it was broadly decomposed into two sub-tasks implemented by two components: entity detection, and temporal relation determination. Task-level ("black box") evaluation is relevant for the final clinical application, whereas component-level evaluation ("glass box") is important for system development and progress monitoring. Unfortunately, because of the interaction between entity representation and temporal relation representation, glass box and black box evaluation cannot be managed straightforwardly at the same time in the setting of the i2b2/UTHealth 2014 task, making it difficult to assess reliably the relative performance and contribution of the individual components to the overall task. OBJECTIVE To identify obstacles and propose methods to cope with this difficulty, and illustrate them through experiments on the i2b2/UTHealth 2014 dataset. METHODS We outline several solutions to this problem and examine their requirements in terms of adequacy for component-level and task-level evaluation and of changes to the task framework. We select the solution which requires the least modifications to the i2b2 evaluation framework and illustrate it with our system. This system identifies risk factor mentions with a CRF system complemented by hand-designed patterns, identifies and normalizes temporal expressions through a tailored version of the Heideltime tool, and determines temporal relations of each risk factor with a One Rule classifier. RESULTS Giving a fixed value to the temporal attribute in risk factor identification proved to be the simplest way to evaluate the risk factor detection component independently. This evaluation method enabled us to identify the risk factor detection component as most contributing to the false negatives and false positives of the global system. This led us to redirect further effort to this component, focusing on medication detection, with gains of 7 to 20 recall points and of 3 to 6 F-measure points depending on the corpus and evaluation. CONCLUSION We proposed a method to achieve a clearer glass box evaluation of risk factor detection and temporal relation detection in clinical texts, which can provide an example to help system development in similar tasks. This glass box evaluation was instrumental in refocusing our efforts and obtaining substantial improvements in risk factor detection.
24. Zweigenbaum P, Lavergne T, Grabar N, Hamon T, Rosset S, Grouin C. Combining an expert-based medical entity recognizer to a machine-learning system: methods and a case study. Biomed Inform Insights 2013; 6:51-62. PMID: 24052691; PMCID: PMC3776026; DOI: 10.4137/bii.s11770.
Abstract
Medical entity recognition is currently generally performed by data-driven methods based on supervised machine learning. Expert-based systems, where linguistic and domain expertise are directly provided to the system, are often combined with data-driven systems. We present here a case study where an existing expert-based medical entity recognition system, Ogmios, is combined with a data-driven system, Caramba, based on a linear-chain Conditional Random Field (CRF) classifier. Our case study specifically highlights the risk of overfitting incurred by an expert-based system. We observe that it prevents the combination of the two systems from obtaining improvements in precision, recall, or F-measure, and analyze the underlying mechanisms through a post-hoc feature-level analysis. Wrapping the expert-based system alone as attributes input to a CRF classifier does boost its F-measure from 0.603 to 0.710, bringing it on par with the data-driven system. The generalization of this method remains to be further investigated.
25.
Abstract
OBJECTIVE To identify the temporal relations between clinical events and temporal expressions in clinical reports, as defined in the i2b2/VA 2012 challenge. DESIGN To detect clinical events, we used rules and Conditional Random Fields. We built Random Forest models to identify event modality and polarity. To identify temporal expressions we built on the HeidelTime system. To detect temporal relations, we systematically studied their breakdown into distinct situations; we designed an oracle method to determine the most prominent situations and the most suitable associated classifiers, and combined their results. RESULTS We achieved F-measures of 0.8307 for event identification, based on rules, and 0.8385 for temporal expression identification. In the temporal relation task, we identified nine main situations in three groups, experimentally confirming shared intuitions: within-sentence relations, section-related time, and across-sentence relations. Logistic regression and Naïve Bayes performed best on the first and third groups, and decision trees on the second. We reached a 0.6231 global F-measure, improving by 7.5 points our official submission. CONCLUSIONS Carefully hand-crafted rules obtained good results for the detection of events and temporal expressions, while a combination of classifiers improved temporal link prediction. The characterization of the oracle recall of situations allowed us to point at directions where further work would be most useful for temporal relation detection: within-sentence relations and linking History of Present Illness events to the admission date. We suggest that the systematic situation breakdown proposed in this paper could also help improve other systems addressing this task.
26. Grouin C, Zweigenbaum P. Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches. Stud Health Technol Inform 2013; 192:476-480. PMID: 23920600.
Abstract
In this paper, we present a comparison of two approaches to automatically de-identify medical records written in French: a rule-based system and a machine-learning based system using a conditional random fields (CRF) formalism. Both systems have been designed to process nine identifiers in a corpus of medical records in cardiology. We performed two evaluations: first, on 62 documents in cardiology, and second, on 10 documents in foetopathology, produced by optical character recognition (OCR), to evaluate the robustness of our systems. We achieved a 0.843 (rule-based) and 0.883 (machine-learning) exact-match overall F-measure in cardiology. While the rule-based system allowed us to achieve good results on nominative (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems had not been designed for it and despite OCR character recognition errors, we obtained promising results: a 0.681 (rule-based) and 0.638 (machine-learning) exact-match overall F-measure. This demonstrates that existing tools can be applied to process new documents of lower quality.
27. Grouin C, Deléger L, Rosier A, Temal L, Dameron O, Van Hille P, Burgun A, Zweigenbaum P. Automatic computation of CHA2DS2-VASc score: information extraction from clinical texts for thromboembolism risk assessment. AMIA Annu Symp Proc 2011; 2011:501-10. PMID: 22195104; PMCID: PMC3243195.
Abstract
The CHA2DS2-VASc score is a 10-point scale which allows cardiologists to easily identify potential stroke risk for patients with non-valvular atrial fibrillation. In this article, we present a system based on natural language processing (lexicon and linguistic modules), including negation and speculation handling, which extracts medical concepts from French clinical records and uses them as criteria to compute the CHA2DS2-VASc score. We evaluate this system by comparing its computed criteria with those obtained by human reading of the same clinical texts, and by assessing the impact of the observed differences on the resulting CHA2DS2-VASc scores. Given 21 patient records, 168 instances of criteria were computed, with an accuracy of 97.6%, and the accuracy of the 21 CHA2DS2-VASc scores was 85.7%. All differences in scores trigger the same alert, which means that system performance on this test set yields similar results to human reading of the texts.
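Once the NLP pipeline has extracted the criteria from a record, the final scoring step is a simple weighted sum. The sketch below follows the standard clinical definition of CHA2DS2-VASc (yielding values 0 to 9), not the paper's code; the dictionary keys and the example patient are hypothetical.

```python
def cha2ds2_vasc(criteria):
    """Compute CHA2DS2-VASc from extracted criteria.
    criteria: dict of booleans plus the patient's age in years."""
    score = 0
    score += 1 if criteria["congestive_heart_failure"] else 0
    score += 1 if criteria["hypertension"] else 0
    age = criteria["age"]
    score += 2 if age >= 75 else (1 if age >= 65 else 0)  # A2 / A
    score += 1 if criteria["diabetes"] else 0
    score += 2 if criteria["stroke_or_tia"] else 0        # S2
    score += 1 if criteria["vascular_disease"] else 0
    score += 1 if criteria["female"] else 0               # Sc
    return score

patient = {"congestive_heart_failure": False, "hypertension": True,
           "age": 71, "diabetes": True, "stroke_or_tia": False,
           "vascular_disease": False, "female": True}
print(cha2ds2_vasc(patient))  # 4 (hypertension + age 65-74 + diabetes + female)
```

The difficulty the paper addresses is upstream of this function: reliably filling the `criteria` dictionary from free-text French records, including negated and speculative mentions.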
28.
Abstract
Background Information extraction is a complex task which is necessary to develop high-precision information retrieval tools. In this paper, we present the platform MeTAE (Medical Texts Annotation and Exploration). MeTAE allows (i) to extract and annotate medical entities and relationships from medical texts and (ii) to explore semantically the produced RDF annotations. Results Our annotation approach relies on linguistic patterns and domain knowledge and consists in two steps: (i) recognition of medical entities and (ii) identification of the correct semantic relation between each pair of entities. The first step is achieved by an enhanced use of MetaMap which improves the precision obtained by MetaMap by 19.59% in our evaluation. The second step relies on linguistic patterns which are built semi-automatically from a corpus selected according to semantic criteria. We evaluate our system’s ability to identify medical entities of 16 types. We also evaluate the extraction of treatment relations between a treatment (e.g. medication) and a problem (e.g. disease): we obtain 75.72% precision and 60.46% recall. Conclusions According to our experiments, using an external sentence segmenter and noun phrase chunker may improve the precision of MetaMap-based medical entity recognition. Our pattern-based relation extraction method obtains good precision and recall w.r.t related works. A more precise comparison with related approaches remains difficult however given the differences in corpora and in the exact nature of the extracted relations. The selection of MEDLINE articles through queries related to known drug-disease pairs enabled us to obtain a more focused corpus of relevant examples of treatment relations than a more general MEDLINE query.
29. Minard AL, Ligozat AL, Ben Abacha A, Bernhard D, Cartoni B, Deléger L, Grau B, Rosset S, Zweigenbaum P, Grouin C. Hybrid methods for improving information access in clinical documents: concept, assertion, and relation identification. J Am Med Inform Assoc 2011; 18:588-93. PMID: 21597105; DOI: 10.1136/amiajnl-2011-000154.
Abstract
OBJECTIVE This paper describes the approaches the authors developed while participating in the i2b2/VA 2010 challenge to automatically extract medical concepts and annotate assertions on concepts and relations between concepts. DESIGN The authors' approaches rely on both rule-based and machine-learning methods. Natural language processing is used to extract features from the input texts; these features are then used in the authors' machine-learning approaches. The authors used Conditional Random Fields for concept extraction, and Support Vector Machines for assertion and relation annotation. Depending on the task, the authors tested various combinations of rule-based and machine-learning methods. RESULTS The authors' assertion annotation system obtained an F-measure of 0.931, ranking fifth out of 21 participants at the i2b2/VA 2010 challenge. The authors' relation annotation system ranked third out of 16 participants with a 0.709 F-measure. The 0.773 F-measure the authors obtained on concept extraction did not make it to the top 10. CONCLUSION On the one hand, the authors confirm that the use of only machine-learning methods is highly dependent on the annotated training data, and thus obtained better results for well-represented classes. On the other hand, the use of only a rule-based method was not sufficient to deal with new types of data. Finally, the use of hybrid approaches combining machine-learning and rule-based approaches yielded higher scores.
30.
31. Deléger L, Grouin C, Zweigenbaum P. Extracting medical information from narrative patient records: the case of medication-related information. J Am Med Inform Assoc 2010; 17:555-8. PMID: 20819863; DOI: 10.1136/jamia.2010.003962.
Abstract
OBJECTIVE While essential for patient care, information related to medication is often written as free text in clinical records and, therefore, difficult to use in computerized systems. This paper describes an approach to automatically extract medication information from clinical records, which was developed to participate in the i2b2 2009 challenge, as well as different strategies to improve the extraction. DESIGN Our approach relies on a semantic lexicon and extraction rules as a two-phase strategy: first, drug names are recognized and, then, the context of these names is explored to extract drug-related information (mode, dosage, etc) according to rules capturing the document structure and the syntax of each kind of information. Different configurations are tested to improve this baseline system along several dimensions, particularly drug name recognition-this step being a determining factor to extract drug-related information. Changes were tested at the level of the lexicons and of the extraction rules. RESULTS The initial system participating in i2b2 achieved good results (global F-measure of 77%). Further testing of different configurations substantially improved the system (global F-measure of 81%), performing well for all types of information (eg, 84% for drug names and 88% for modes), except for durations and reasons, which remain problematic. CONCLUSION This study demonstrates that a simple rule-based system can achieve good performance on the medication extraction task. We also showed that controlled modifications (lexicon filtering and rule refinement) were the improvements that best raised the performance.
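The two-phase strategy described in this abstract (recognize a drug name from a lexicon, then explore its context with rules) can be sketched minimally as follows. This is an illustrative toy, not the i2b2 system: the three-entry lexicon, the dosage and mode patterns, and the sentence-level context window are hypothetical simplifications.

```python
import re

# Hypothetical three-entry lexicon; the real system uses a semantic lexicon.
DRUG_LEXICON = {"aspirin", "metformin", "lisinopril"}
DOSAGE_RE = re.compile(r"\b(\d+(?:\.\d+)?\s?(?:mg|g|ml))\b", re.IGNORECASE)
MODE_RE = re.compile(r"\b(orally|po|iv|intravenous(?:ly)?)\b", re.IGNORECASE)

def extract_medications(sentence):
    """Phase 1: spot drug names; phase 2: scan the sentence for
    related information (here just dosage and mode)."""
    results = []
    for tok in sentence.lower().split():
        name = tok.strip(".,;")
        if name in DRUG_LEXICON:
            # Simplification: take the first dosage/mode in the sentence,
            # rather than the rule-guided context exploration of the paper.
            dosage = DOSAGE_RE.search(sentence)
            mode = MODE_RE.search(sentence)
            results.append({"drug": name,
                            "dosage": dosage.group(1) if dosage else None,
                            "mode": mode.group(1) if mode else None})
    return results

print(extract_medications("Metformin 500 mg orally twice daily."))
```

As the abstract notes, the quality of phase 1 (name recognition) bounds everything downstream, which is why lexicon filtering was one of the most effective improvements.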
32. Delbecque T, Zweigenbaum P. Using Co-Authoring and Cross-Referencing Information for MEDLINE Indexing. AMIA Annu Symp Proc 2010; 2010:147-151. PMID: 21346958; PMCID: PMC3041281.
Abstract
Due to the large number of new papers regularly entering the MEDLINE database, there is an ongoing effort to design tools that help index this new material. Here we investigate the hypothesis that past indexing information coming from referencing and authoring links can be used for this purpose. Using a JAMA-based subset of MEDLINE, we designed ranking scores which rely on this information; given a new article, the aim of these scores is to build an ordered list of MeSH terms that should be used to index this article. Evaluation measures on an independent, 1000-document data set are given. Comparison with equivalent works shows benefits in recall, F-measure and mean average precision. Moreover, cited articles and authors' past articles contribute to seven of the top ten ranking features, supporting our hypothesis. Further improvements and extensions to this work are presented in the conclusion.
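The core intuition of the abstract, ranking candidate MeSH terms by how often they index the new article's cited articles and its authors' past articles, can be sketched as below. This is a hypothetical reconstruction, not the paper's scoring features: the weights, PMIDs, and indexings are toy assumptions.

```python
from collections import Counter

def rank_mesh_terms(cited_pmids, past_pmids, indexing,
                    w_cited=1.0, w_past=0.5):
    """Score each MeSH term by weighted frequency over the indexings
    of cited articles and the authors' past articles, then rank."""
    scores = Counter()
    for pmid in cited_pmids:
        for term in indexing.get(pmid, []):
            scores[term] += w_cited
    for pmid in past_pmids:
        for term in indexing.get(pmid, []):
            scores[term] += w_past
    return [term for term, _ in scores.most_common()]

# Toy past-indexing data: PMID -> MeSH terms.
indexing = {"p1": ["Humans", "Neoplasms"],
            "p2": ["Neoplasms", "Risk Factors"],
            "p3": ["Humans"]}
print(rank_mesh_terms(cited_pmids=["p1", "p2"], past_pmids=["p3"],
                      indexing=indexing))
```

The paper's actual contribution is learning which such link-derived features matter; the weighted sum here only illustrates how an ordered MeSH list emerges from past indexing.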
33. Deléger L, Merabti T, Lecrocq T, Joubert M, Zweigenbaum P, Darmoni S. A twofold strategy for translating a medical terminology into French. AMIA Annu Symp Proc 2010; 2010:152-156. PMID: 21346959; PMCID: PMC3041288.
Abstract
OBJECTIVE The goal of this study is to assist the translation of a medical terminology (MedlinePlus) into French. METHODS We combined two types of approaches to acquire French translations of English MedlinePlus terms. The first is knowledge-based and relies on the conceptual information of the UMLS metathesaurus. The second method is a corpus-based NLP technique using a bilingual parallel corpus. RESULTS The knowledge-based method brought translations for 611 terms, among which 67.6% were considered valid. The corpus-based approach provided translations for 143 terms of which 71.3% were considered valid. We thus acquired a total of 435 translated terms (51.3%). CONCLUSION Combining two approaches allowed us to semi-automatically translate more than half of the terminology, while focusing on only one would have provided a more partial translation. From an applicative viewpoint, this French version is now integrated in the catalogue of online health resources CISMeF.
34
Deléger L, Grouin C, Zweigenbaum P. Extracting medication information from French clinical texts. Stud Health Technol Inform 2010; 160:949-953. [PMID: 20841824]
Abstract
Much more Natural Language Processing (NLP) work has been performed on the English language than on any other. This general observation is also true of medical NLP, although clinical language processing needs are as strong in other languages as they are in English. In specific subdomains, such as drug prescription, the expression of information can be closely related across different languages, which should help transfer systems from English to other languages. We report here the implementation of a medication extraction system which extracts drugs and related information from French clinical texts, on the basis of an approach initially designed for English within the framework of the i2b2 2009 challenge. The system relies on specialized lexicons and a set of extraction rules. A first evaluation on 50 annotated texts obtains 86.7% F-measure, a level higher than the original English system and close to related work. This shows that the same rule-based approach can be applied to English and French languages, with a similar level of performance. We further discuss directions for improving both systems.
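As a rough illustration of such a lexicon-and-rules approach, a tiny fragment might look like the following; the lexicon entries and the dosage pattern are invented placeholders, not the system's actual resources.

```python
import re

# Toy fragment of a rule-based medication extractor: a drug lexicon
# plus a dosage pattern. The real system uses much larger specialized
# lexicons and a richer rule set.
DRUG_LEXICON = {"paracetamol", "amoxicilline", "aspirine"}
DOSE_RE = re.compile(r"\b(\d+(?:[.,]\d+)?)\s*(mg|g|ml)\b", re.IGNORECASE)

def extract_medications(text):
    """Return (drug, dose) pairs found in a French clinical sentence."""
    words = re.findall(r"\w+", text.lower())
    drugs = [w for w in words if w in DRUG_LEXICON]
    doses = [f"{num} {unit}" for num, unit in DOSE_RE.findall(text)]
    # naive pairing: i-th drug with i-th dose (real rules are contextual)
    return list(zip(drugs, doses))

pairs = extract_medications("Paracetamol 500 mg trois fois par jour")
```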
35
Deléger L, Merkel M, Zweigenbaum P. Translating medical terminologies through word alignment in parallel text corpora. J Biomed Inform 2009; 42:692-701. [PMID: 19275946 DOI: 10.1016/j.jbi.2009.03.002]
Abstract
Developing international multilingual terminologies is a time-consuming process. We present a methodology which aims to ease this process by automatically acquiring new translations of medical terms based on word alignment in parallel text corpora, and test it on English and French. After collecting a parallel, English-French corpus, we detected French translations of English terms from three terminologies: MeSH, SNOMED CT and the MedlinePlus Health Topics. We obtained respectively for each terminology 74.8%, 77.8% and 76.3% of linguistically correct new translations. A sample of the MeSH translations was submitted to expert review and 61.5% were deemed desirable additions to the French MeSH. In conclusion, we successfully obtained good quality new translations, which underlines the suitability of using alignment in text corpora to help translate terminologies. Our method may be applied to different European languages and provides a methodological framework that may be used with different processing tools.
Affiliation(s)
- Louise Deléger
- INSERM, UMR_S 872, Eq. 20, Centre des cordeliers, Paris F-75006, France.
36
Grouin C, Rosier A, Dameron O, Zweigenbaum P. Testing tactics to localize de-identification. Stud Health Technol Inform 2009; 150:735-739. [PMID: 19745408]
Abstract
Recent renewed interest in de-identification (also known as "anonymisation") has led to the development of a series of systems in the United States with very good performance on challenge test sets. De-identification, however, needs to be tuned to the local documents and their specificities. We address here two issues raised in this context. First, tuning is generally performed by language engineers who should not have to work on identified text. We therefore perform a first gross de-identification step in the hospital. Second, to set up a de-identification system for new documents in a language different from English, here French patient reports, we tested two methods: the first attempts to adapt an existing US de-identifier for English, the second re-develops a new system which applies the same methods. The first method involved localizing patterns designed for English, which proved cumbersome and did not quickly obtain good performance. With a similar effort, the latter method obtained much better results. Evaluated on a set of 23 randomly selected texts from a corpus of 21,749 clinical texts, it obtained 83% recall and 92% precision.
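A first gross rule-based de-identification pass of the kind described can be sketched with a few regular expressions; the patterns below are illustrative placeholders for French reports, not the system's actual rules.

```python
import re

# Minimal first-pass de-identifier sketch: dates, titled person names
# and phone numbers are replaced by category tags. Real systems add
# many more patterns plus trigger-word lexicons.
PATTERNS = [
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "<DATE>"),
    (re.compile(r"\b(?:Dr|Pr|M\.|Mme)\s+[A-ZÉÈ][\w-]+"), "<NAME>"),
    (re.compile(r"\b\d{2}(?:\s?\d{2}){4}\b"), "<PHONE>"),
]

def deidentify(text):
    for regex, tag in PATTERNS:
        text = regex.sub(tag, text)
    return text

out = deidentify("Vu par Dr Martin le 03/05/2008.")
```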
Affiliation(s)
- Cyril Grouin
- Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur, Centre National de la Recherche Scientifique (LIMSI-CNRS), Orsay, France
37
Deléger L, Zweigenbaum P. Paraphrase acquisition from comparable medical corpora of specialized and lay texts. AMIA Annu Symp Proc 2008; 2008:146-150. [PMID: 18999095 PMCID: PMC2656025]
Abstract
Nowadays a large amount of health information is available to the public, but medical language is often difficult for lay people to understand. Developing means to make medical information more comprehensible is therefore a real need. In this regard, a useful resource would be a corpus of specialized and lay paraphrases. To this end we built comparable corpora of specialized and lay texts on which we applied paraphrasing patterns based on anchors of deverbal noun and verb pairs. The results show that the paraphrases were of good quality (71.4% to 94.2% precision) and that this type of paraphrase was relevant in the context of studying the differences between specialized and lay language. This study also demonstrates that simple paraphrase acquisition methods can work on texts with a rather small degree of similarity, once similar text segments are detected.
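The anchor idea, pairing a deverbal noun in specialized text with its base verb in lay text, might be sketched as follows; the noun/verb pairs and the single "NOUN of X" ~ "VERB X" pattern are invented simplifications of the paper's patterns.

```python
# Illustrative deverbal noun -> base verb anchors; the real resource
# would be much larger and language-specific.
DEVERBAL_PAIRS = {"treatment": "treat", "removal": "remove", "injection": "inject"}

def match_paraphrase(spec_tokens, lay_tokens):
    """Return (noun, verb, object) if the two token lists instantiate
    the pattern 'NOUN of X' ~ 'VERB X' around a deverbal anchor."""
    for noun, verb in DEVERBAL_PAIRS.items():
        if noun in spec_tokens and verb in lay_tokens:
            i = spec_tokens.index(noun)
            if spec_tokens[i + 1:i + 2] == ["of"] and len(spec_tokens) > i + 2:
                obj = spec_tokens[i + 2]
                if obj in lay_tokens:
                    return (noun, verb, obj)
    return None

hit = match_paraphrase(["removal", "of", "polyps"], ["doctors", "remove", "polyps"])
```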
38
Cormont S, Buemi A, Horeau T, Zweigenbaum P, Lepage E. Construction of a dictionary of laboratory tests mapped to LOINC at AP-HP. AMIA Annu Symp Proc 2008:1200. [PMID: 18999107]
Abstract
We report on the ongoing process implemented at Assistance Publique-Hôpitaux de Paris (AP-HP), the largest hospital system in Europe, to build a common reference for laboratory tests in French with LOINC mappings. At the time of writing, it contained 24,000 tests, covering all fields of biology, in use in 19 AP-HP hospitals, 30% of which had a mapping to LOINC with a peak of over 60% in biochemistry.
Affiliation(s)
- Sylvie Cormont
- Assistance Publique-Hôpitaux de Paris, Si-DoPa, F-75683 Paris, France
39
Deléger L, Namer F, Zweigenbaum P. Morphosemantic parsing of medical compound words: transferring a French analyzer to English. Int J Med Inform 2008; 78 Suppl 1:S48-55. [PMID: 18801700 DOI: 10.1016/j.ijmedinf.2008.07.016]
Abstract
PURPOSE Medical language, like many technical languages, is rich in morphologically complex words, many of which take their roots in Greek and Latin, in which case they are called neoclassical compounds. Morphosemantic analysis can help generate definitions of such words. The similarity of structure of these compounds across several European languages has also been observed, which suggests that the same linguistic analysis could be applied to neoclassical compounds from different languages with minor modifications. METHODS This paper reports work on the adaptation of a morphosemantic analyzer dedicated to French (DériF) to analyze English medical neoclassical compounds. It presents the principles of this transposition and its current performance. RESULTS The analyzer was tested on a set of 1299 compounds extracted from the WHO-ART terminology. 859 could be decomposed and defined, 675 of them successfully. CONCLUSION An advantage of this process is that complex linguistic analyses designed for French could be successfully transposed to the analysis of English medical neoclassical compounds, which confirmed our hypothesis of transferability. The fact that the method was successfully applied to a Germanic language such as English suggests that performance would be at least as high for Romance languages such as Spanish. Finally, the resulting system can produce more complete analyses of English medical compounds than existing systems, including a hierarchical decomposition and semantic gloss of each word.
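A toy sketch of the decomposition step: DériF's actual analysis is far richer (hierarchical structure, full semantic glosses), and the root table below is an invented fragment.

```python
# Greek/Latin roots with semantic glosses; "o" is the linking vowel
# of neoclassical compounds and carries no meaning of its own.
ROOTS = {"gastr": "stomach", "hepat": "liver",
         "algia": "pain", "itis": "inflammation", "o": None}

def decompose(word):
    """Greedy longest-first, left-to-right split into known roots."""
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in ROOTS:
                parts.append(word[i:j])
                i = j
                break
        else:
            return None  # unanalyzable residue: give up on this word
    # keep only meaning-bearing roots, paired with their glosses
    return [(p, ROOTS[p]) for p in parts if ROOTS[p]]

analysis = decompose("gastralgia")
```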
Affiliation(s)
- Louise Deléger
- INSERM U872, Eq. 20, 15 rue de l'Ecole de Médecine, Paris F-75006, France.
40
Deleger L, Zweigenbaum P. Aligning lay and specialized passages in comparable medical corpora. Stud Health Technol Inform 2008; 136:89-94. [PMID: 18487713]
Abstract
While the public has increasing access to medical information, specialized medical language is often difficult for non-experts to understand, and there is a need to bridge the gap between specialized language and lay language. As a first step towards this end, we describe here a method to build a comparable corpus of expert and non-expert medical French documents and to identify similar text segments of lay and specialized language. Among the top 400 pairs of text segments retrieved with this method, 59% were actually similar and 37% were deemed exploitable for further processing. This is encouraging evidence for the target task of finding equivalent expressions between these two varieties of language.
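One simple way to score candidate segment pairs is plain lexical overlap; the paper's actual similarity measure is not given here, so treat this Jaccard baseline as illustrative only.

```python
def jaccard(seg_a, seg_b):
    """Jaccard similarity between the word sets of two text segments."""
    a, b = set(seg_a.lower().split()), set(seg_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

# Two French segments sharing part of their vocabulary
score = jaccard(
    "l'hypertension artérielle est fréquente",
    "l'hypertension est une maladie fréquente",
)
```

Pairs scoring above a threshold would then be passed to a human or to further paraphrase-extraction steps.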
41
Abstract
It is now almost 15 years since the publication of the first paper on text mining in the genomics domain, and decades since the first paper on text mining in the medical domain. Enormous progress has been made in the areas of information retrieval, evaluation methodologies and resource construction. Some problems, such as abbreviation-handling, can essentially be considered solved problems, and others, such as identification of gene mentions in text, seem likely to be solved soon. However, a number of problems at the frontiers of biomedical text mining continue to present interesting challenges and opportunities for great improvements and interesting research. In this article we review the current state of the art in biomedical text mining or 'BioNLP' in general, focusing primarily on papers published within the past year.
42
Deléger L, Namer F, Zweigenbaum P. Defining medical words: transposing morphosemantic analysis from French to English. Stud Health Technol Inform 2007; 129:535-9. [PMID: 17911774]
Abstract
Medical language, like many technical languages, is rich in morphologically complex words, many of which take their roots in Greek and Latin, in which case they are called neoclassical compounds. Morphosemantic analysis can help generate definitions of such words. This paper reports work on the adaptation of a morphosemantic analyzer dedicated to French (DériF) to analyze English medical neoclassical compounds. It presents the principles of this transposition and its current performance. The analyzer was tested on a set of 1,299 compounds extracted from the WHO-ART terminology. 859 could be decomposed and defined, 675 of them successfully. An advantage of this process is that complex linguistic analyses designed for French could be successfully transferred to the analysis of English medical neoclassical compounds. Moreover, the resulting system can produce more complete analyses of English medical compounds than existing ones, including a hierarchical decomposition and semantic gloss of each word.
Affiliation(s)
- Louise Deléger
- INSERM, UMR _S 872, Eq. 20, Les Cordeliers, Paris, F-75006 France.
43
Nyström M, Merkel M, Ahrenberg L, Zweigenbaum P, Petersson H, Åhlfeldt H. Creating a medical English-Swedish dictionary using interactive word alignment. BMC Med Inform Decis Mak 2006; 6:35. [PMID: 17034649 PMCID: PMC1624822 DOI: 10.1186/1472-6947-6-35]
Abstract
BACKGROUND This paper reports on a parallel collection of rubrics from the medical terminology systems ICD-10, ICF, MeSH, NCSP and KSH97-P and its use for semi-automatic creation of an English-Swedish dictionary of medical terminology. The methods presented are relevant for many West European language pairs other than English-Swedish. METHODS The medical terminology systems were collected in electronic format in both English and Swedish and the rubrics were extracted in parallel language pairs. Initially, interactive word alignment was used to create training data from a sample. The training data were then used in automatic word alignment to generate candidate term pairs. The last step was manual verification of the term pair candidates. RESULTS A dictionary of 31,000 verified entries was created in less than three man-weeks, with considerably less time and effort than a manual approach would require, and without compromising quality. As a side effect of our work we found 40 different translation problems in the terminology systems; these results indicate the power of the method for finding inconsistencies in terminology translations. We also report on some factors that may contribute to making the process of dictionary creation with similar tools even more expedient. Finally, the contribution is discussed in relation to other ongoing efforts in constructing medical lexicons for non-English languages. CONCLUSION In three man-weeks we were able to produce a medical English-Swedish dictionary consisting of 31,000 entries and also found hidden translation errors in the medical terminology systems used.
Affiliation(s)
- Mikael Nyström
- Department of Biomedical Engineering, Linköpings universitet, SE-58185 Linköping, Sweden
- Magnus Merkel
- Department of Computer and Information Science, Linköpings universitet, SE-58183 Linköping, Sweden
- Lars Ahrenberg
- Department of Computer and Information Science, Linköpings universitet, SE-58183 Linköping, Sweden
- Pierre Zweigenbaum
- Assistance Publique-Hôpitaux de Paris, F-75683 Paris Cedex 14, France
- Inserm, U729, F-75270 Paris Cedex 06, France
- Inalco, CRIM, F-75343 Paris Cedex 07, France
- Håkan Petersson
- Department of Biomedical Engineering, Linköpings universitet, SE-58185 Linköping, Sweden
- Hans Åhlfeldt
- Department of Biomedical Engineering, Linköpings universitet, SE-58185 Linköping, Sweden
44
Marko K, Baud R, Zweigenbaum P, Borin L, Merkel M, Schulz S. Towards a multilingual medical lexicon. AMIA Annu Symp Proc 2006; 2006:534-8. [PMID: 17238398 PMCID: PMC1839525]
Abstract
We present results of the collaboration of a multinational team of researchers from (computational) linguistics, medicine, and medical informatics with the goal of building a multilingual medical lexicon with high coverage and complete morpho-syntactic information. Monolingual lexical resources were collected and subsequently mapped between languages using a morpho-semantic term normalization engine, which captures intra- as well as interlingual synonymy relationships on the level of subwords.
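The subword-level normalization idea can be caricatured as mapping each term to a bag of interlingual concept codes; the subword table and the codes below are invented, not the engine's actual resources.

```python
# Subwords from different languages mapped to shared concept codes, so
# that translational equivalents normalize to the same code sequence.
SUBWORD_TO_MID = {
    "gastr": "#stomach", "magen": "#stomach",
    "itis": "#inflammation", "entzuendung": "#inflammation",
}

def normalize(term, subwords):
    """Map a term to its sorted sequence of interlingual concept codes."""
    mids, rest = [], term.lower()
    for sw in subwords:
        if sw in rest:
            mids.append(SUBWORD_TO_MID[sw])
            rest = rest.replace(sw, "", 1)
    return sorted(mids)

# English "gastritis" and German "Magenentzuendung" normalize alike
same = (normalize("gastritis", ["gastr", "itis"])
        == normalize("Magenentzuendung", ["magen", "entzuendung"]))
```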
Affiliation(s)
- Kornél Marko
- Freiburg University Hospital, Department of Medical Informatics, Freiburg, Germany
45
Deleger L, Merkel M, Zweigenbaum P. Enriching medical terminologies: an approach based on aligned corpora. Stud Health Technol Inform 2006; 124:747-52. [PMID: 17108604]
Abstract
Medical terminologies such as those in the UMLS are never exhaustive and there is a constant need to enrich them, especially in terms of multilinguality. We present a methodology to acquire new French translations of English medical terms based on word alignment in a parallel corpus - i.e. pairing of corresponding words. We automatically collected a 27.7-million-word parallel, English-French corpus. Based on a first 1.3-million-word extract of this corpus, we detected 10,171 candidate French translations of English medical terms from MeSH and SNOMED, among which 3,807 are new translations of English MeSH terms.
46
Deléger L, Merkel M, Zweigenbaum P. Contribution to terminology internationalization by word alignment in parallel corpora. AMIA Annu Symp Proc 2006; 2006:185-9. [PMID: 17238328 PMCID: PMC1839560]
Abstract
BACKGROUND AND OBJECTIVES Creating a complete translation of a large vocabulary is a time-consuming task, which requires skilled and knowledgeable medical translators. Our goal is to examine to what extent such a task can be alleviated by a specific natural language processing technique, word alignment in parallel corpora. We experiment with translation from English to French. METHODS Build a large corpus of parallel, English-French documents, and automatically align it at the document, sentence and word levels using state-of-the-art alignment methods and tools. Then project English terms from existing controlled vocabularies to the aligned word pairs, and examine the number and quality of the putative French translations obtained thereby. We considered three American vocabularies present in the UMLS with three different translation statuses: the MeSH, SNOMED CT, and the MedlinePlus Health Topics. RESULTS We obtained several thousand new translations of our input terms, this number being closely linked to the number of terms in the input vocabularies. CONCLUSION Our study shows that alignment methods can extract a number of new term translations from large bodies of text with a moderate human reviewing effort, and thus contribute to help a human translator obtain better translation coverage of an input vocabulary. Short-term perspectives include their application to a corpus 20 times larger than that used here, together with more focused methods for term extraction.
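The projection step, following word-alignment links from an English term occurrence to its French counterpart, might look like this in outline; the dict-of-lists alignment format is a simplification of what alignment tools actually output.

```python
def project_term(term, en_tokens, fr_tokens, alignment):
    """Return the French tokens aligned with an English term occurrence.

    alignment maps English token positions to lists of French token
    positions in one aligned sentence pair.
    """
    words = term.lower().split()
    for start in range(len(en_tokens) - len(words) + 1):
        if [t.lower() for t in en_tokens[start:start + len(words)]] == words:
            fr_positions = sorted(
                {j for i in range(start, start + len(words))
                 for j in alignment.get(i, [])})
            return " ".join(fr_tokens[j] for j in fr_positions)
    return None  # term does not occur in this sentence

fr = project_term(
    "heart failure",
    ["chronic", "heart", "failure"],
    ["insuffisance", "cardiaque", "chronique"],
    {0: [2], 1: [1], 2: [0]},
)
```

Each projected string is only a translation candidate; as the abstract notes, a human reviewing pass remains necessary.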
47
Zweigenbaum P, Baud R, Burgun A, Namer F, Jarrousse E, Grabar N, Ruch P, Le Duff F, Forget JF, Douyère M, Darmoni S. UMLF: a unified medical lexicon for French. Int J Med Inform 2005; 74:119-24. [PMID: 15694616 DOI: 10.1016/j.ijmedinf.2004.03.010]
Abstract
Medical Informatics has a constant need for basic medical language processing tasks, e.g. for coding into controlled vocabularies, free text indexing and information retrieval. Most of these tasks involve term matching and rely on lexical resources: lists of words with attached information, including inflected forms and derived words, etc. Such resources are publicly available for the English language with the UMLS Specialist Lexicon, but not for other languages. For the French language, several teams have worked on the subject and built local lexical resources. The goal of the present work is to pool and unify these resources and to add extensively to them by exploiting medical terminologies and corpora, resulting in a unified medical lexicon for French (UMLF). This paper presents the issues raised by such an objective, describes the methods on which the project relies and illustrates them with experimental results.
Affiliation(s)
- Pierre Zweigenbaum
- STIM/DSI/Assistance Publique-Hôpitaux de Paris, 91, boulevard de l'Hôpital, 75634 Paris Cedex 13, France.
48
Delbecque T, Jacquemart P, Zweigenbaum P. Indexing UMLS Semantic Types for Medical Question-Answering. Stud Health Technol Inform 2005; 116:805-10. [PMID: 16160357]
Abstract
Open-domain Question-Answering (QA) systems heavily rely on named entities, a set of general-purpose semantic types which generally cover names of persons, organizations and locations, dates and amounts, etc. If we are to build medical QA systems, a set of medically relevant named entities must be used. In this paper, we explore the use of the UMLS (Unified Medical Language System) Semantic Network semantic types for this purpose. We present an experiment where the French part of the UMLS Metathesaurus, together with the associated semantic types, is used as a resource for a medically-specific named entity tagger. We also explore the detection of Semantic Network relations for answering specific types of medical questions. We present results and evaluations on a corpus of French-language medical documents that was used in the EQueR Question-Answering evaluation forum. We show, using statistical studies, that strategies for using these new tags in a QA context must take into account the individual origin of documents.
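A dictionary-lookup tagger over Metathesaurus strings can be sketched as a longest-match search; the term-to-TUI entries below are illustrative, not actual Metathesaurus content.

```python
# French Metathesaurus strings mapped to UMLS semantic types (TUIs).
TERM_TO_TUI = {
    "insuffisance cardiaque": "T047",  # Disease or Syndrome
    "aspirine": "T121",                # Pharmacologic Substance
}

def tag_entities(text):
    """Return (term, TUI) pairs for longest dictionary matches in text."""
    found, low = [], text.lower()
    for term in sorted(TERM_TO_TUI, key=len, reverse=True):
        if term in low:
            found.append((term, TERM_TO_TUI[term]))
            low = low.replace(term, " " * len(term))  # block nested re-matches
    return found

tags = tag_entities("Aspirine prescrite pour insuffisance cardiaque.")
```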
49
Baud RH, Nyström M, Borin L, Evans R, Schulz S, Zweigenbaum P. Interchanging lexical information for a multilingual dictionary. AMIA Annu Symp Proc 2005; 2005:31-5. [PMID: 16778996 PMCID: PMC1560452]
Abstract
OBJECTIVE To facilitate the interchange of lexical information for multiple languages in the medical domain. To pave the way for the emergence of a generally available truly multilingual electronic dictionary in the medical domain. METHODS An interchange format has to be neutral relative to the target languages. It has to be consistent with current needs of lexicon authors, present and future. An active interaction between six potential authors aimed to determine a common denominator striking the right balance between richness of content and ease of use for lexicon providers. RESULTS A simple list of relevant attributes has been established and published. The format has the potential for collecting relevant parts of a future multilingual dictionary. An XML version is available. CONCLUSION This effort makes feasible the exchange of lexical information between research groups. Interchange files are made available in a public repository. This procedure opens the door to a true multilingual dictionary, in the awareness that the exchange of lexical information is (only) a necessary first step, before structuring the corresponding entries in different languages.
Affiliation(s)
- R H Baud
- Service of Medical Informatics, University Hospitals of Geneva, Switzerland
50
Namer F, Zweigenbaum P. Acquiring meaning for French medical terminology: contribution of morphosemantics. Stud Health Technol Inform 2004; 107:535-9. [PMID: 15360870]
Abstract
Morphologically complex words, and particularly neoclassical compounds, form more than 60% of the neologisms in the biomedical field. Guessing their definitions and grouping them into semantic classes by means of lexical relations are thus two crucial improvements for handling these words, e.g., for information retrieval, indexing and text understanding applications. This paper describes a morphosemantic linguistic-based parser called DériF, currently developed in the framework of two projects, UMLF and VUMeF, and its application to French biomedical derived and compound words. It shows how the resulting morphologically tagged lexicon is enriched by semantic relations leading both to the synthesis of pseudo-definitions and to the constitution of classes of synonyms, hypo- and hypernyms.
Affiliation(s)
- Fiammetta Namer
- ATILF, Université Nancy 2, CLSH-BP 3397, 54015 Nancy Cédex, France.