26. Grouin C, Zweigenbaum P. Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches. Stud Health Technol Inform 2013;192:476-480. PMID: 23920600.
Abstract
In this paper, we present a comparison of two approaches to automatically de-identify medical records written in French: a rule-based system and a machine-learning system using the conditional random fields (CRF) formalism. Both systems were designed to process nine types of identifiers in a corpus of medical records in cardiology. We performed two evaluations: first on 62 documents in cardiology, then on 10 documents in foetopathology, produced by optical character recognition (OCR), to evaluate the robustness of our systems. We achieved a 0.843 (rule-based) and 0.883 (machine-learning) exact-match overall F-measure in cardiology. While the rule-based system achieved good results on nominative (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems had not been designed for it and despite OCR errors, we obtained promising results: a 0.681 (rule-based) and 0.638 (machine-learning) exact-match overall F-measure. This demonstrates that existing tools can be applied to new documents of lower quality.
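The rule-based half of such a system can be sketched with a handful of regular expressions covering some of the numerical identifier categories mentioned above (dates, phone numbers, zip codes); the patterns and placeholder tags below are illustrative assumptions, not the authors' actual rules, and real systems add lexicons for names, addresses and hospitals:

```python
import re

# Illustrative patterns for three numerical identifier categories;
# the placeholder tag names (<DATE>, <PHONE>, <ZIP>) are our own convention.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b0\d(?:[ .]\d{2}){4}\b")),  # French phone layout
    ("ZIP", re.compile(r"\b\d{5}\b")),
]

def deidentify(text):
    """Replace each matched identifier with its category placeholder."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"<{label}>", text)
    return text
```

Note that pattern order matters: phone numbers are rewritten before the five-digit zip pattern runs, so a zip-like group inside a phone number cannot be tagged twice.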
27. Grouin C, Deléger L, Rosier A, Temal L, Dameron O, Van Hille P, Burgun A, Zweigenbaum P. Automatic computation of CHA2DS2-VASc score: information extraction from clinical texts for thromboembolism risk assessment. AMIA Annu Symp Proc 2011;2011:501-10. PMID: 22195104; PMCID: PMC3243195.
Abstract
The CHA2DS2-VASc score is a 10-point scale which allows cardiologists to easily identify potential stroke risk for patients with non-valvular atrial fibrillation. In this article, we present a system based on natural language processing (lexicon and linguistic modules), including negation and speculation handling, which extracts medical concepts from French clinical records and uses them as criteria to compute the CHA2DS2-VASc score. We evaluate this system by comparing its computed criteria with those obtained by human reading of the same clinical texts, and by assessing the impact of the observed differences on the resulting CHA2DS2-VASc scores. Given 21 patient records, 168 instances of criteria were computed, with an accuracy of 97.6%, and the accuracy of the 21 CHA2DS2-VASc scores was 85.7%. All differences in scores triggered the same alert, which means that system performance on this test set yields results similar to human reading of the texts.
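Once the criteria have been extracted from text, the scoring step itself is a simple weighted sum. A minimal sketch (the weights follow the published CHA2DS2-VASc definition; the dictionary interface and its key names are our own convention, not the paper's):

```python
def cha2ds2_vasc(criteria):
    """Compute the CHA2DS2-VASc score (range 0-9) from extracted criteria.

    `criteria` maps criterion names to values, e.g. as produced by an NLP
    extraction step; missing keys count as absent.
    """
    score = 0
    score += 1 if criteria.get("congestive_heart_failure") else 0
    score += 1 if criteria.get("hypertension") else 0
    age = criteria.get("age", 0)
    if age >= 75:
        score += 2          # A2: age 75 or older
    elif age >= 65:
        score += 1          # A: age 65-74
    score += 1 if criteria.get("diabetes") else 0
    score += 2 if criteria.get("stroke_or_tia") else 0  # S2: prior stroke/TIA
    score += 1 if criteria.get("vascular_disease") else 0
    score += 1 if criteria.get("sex") == "F" else 0     # Sc: sex category
    return score
```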
28. Ben Abacha A, Zweigenbaum P. Automatic extraction of semantic relations between medical entities: a rule based approach. J Biomed Semantics 2011;2 Suppl 5:S4. PMID: 22166723; PMCID: PMC3239304; DOI: 10.1186/2041-1480-2-s5-s4.
Abstract
Background Information extraction is a complex task which is necessary to develop high-precision information retrieval tools. In this paper, we present the MeTAE platform (Medical Texts Annotation and Exploration). MeTAE allows (i) extracting and annotating medical entities and relationships from medical texts and (ii) semantically exploring the produced RDF annotations. Results Our annotation approach relies on linguistic patterns and domain knowledge and consists of two steps: (i) recognition of medical entities and (ii) identification of the correct semantic relation between each pair of entities. The first step is achieved by an enhanced use of MetaMap which improves the precision obtained by MetaMap by 19.59% in our evaluation. The second step relies on linguistic patterns which are built semi-automatically from a corpus selected according to semantic criteria. We evaluate our system's ability to identify medical entities of 16 types. We also evaluate the extraction of treatment relations between a treatment (e.g. medication) and a problem (e.g. disease): we obtain 75.72% precision and 60.46% recall. Conclusions According to our experiments, using an external sentence segmenter and noun phrase chunker may improve the precision of MetaMap-based medical entity recognition. Our pattern-based relation extraction method obtains good precision and recall with respect to related work. A more precise comparison with related approaches remains difficult, however, given the differences in corpora and in the exact nature of the extracted relations. Selecting MEDLINE articles through queries related to known drug-disease pairs enabled us to obtain a more focused corpus of relevant examples of treatment relations than a more general MEDLINE query would.
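The second step, pattern-based relation extraction, can be illustrated as follows. MeTAE builds its patterns semi-automatically from corpora, so these hand-written patterns are only an assumption about their general shape:

```python
import re

# Illustrative lexical patterns for the "treats" relation between a
# treatment and a problem; single-word arguments keep the sketch simple,
# whereas a real system matches full entity mentions.
TREAT_PATTERNS = [
    re.compile(r"(?P<treatment>\w+) is used to treat (?P<problem>\w+)"),
    re.compile(r"(?P<treatment>\w+) (?:is|are) effective against (?P<problem>\w+)"),
    re.compile(r"treatment of (?P<problem>\w+) with (?P<treatment>\w+)"),
]

def extract_treats(sentence):
    """Return (treatment, problem) pairs matched by any pattern."""
    pairs = []
    for pattern in TREAT_PATTERNS:
        for m in pattern.finditer(sentence):
            pairs.append((m.group("treatment"), m.group("problem")))
    return pairs
```

Named groups let one relation be expressed by patterns in which the arguments appear in either order, as the third pattern shows.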
29. Minard AL, Ligozat AL, Ben Abacha A, Bernhard D, Cartoni B, Deléger L, Grau B, Rosset S, Zweigenbaum P, Grouin C. Hybrid methods for improving information access in clinical documents: concept, assertion, and relation identification. J Am Med Inform Assoc 2011;18:588-93. PMID: 21597105; DOI: 10.1136/amiajnl-2011-000154.
Abstract
OBJECTIVE This paper describes the approaches the authors developed while participating in the i2b2/VA 2010 challenge to automatically extract medical concepts and annotate assertions on concepts and relations between concepts. DESIGN The authors' approaches rely on both rule-based and machine-learning methods. Natural language processing is used to extract features from the input texts; these features are then used in the authors' machine-learning approaches. The authors used Conditional Random Fields for concept extraction, and Support Vector Machines for assertion and relation annotation. Depending on the task, the authors tested various combinations of rule-based and machine-learning methods. RESULTS The authors' assertion annotation system obtained an F-measure of 0.931, ranking fifth out of 21 participants at the i2b2/VA 2010 challenge. The authors' relation annotation system ranked third out of 16 participants with a 0.709 F-measure. The 0.773 F-measure the authors obtained on concept extraction did not reach the top 10. CONCLUSION On the one hand, the authors confirm that machine-learning methods alone are highly dependent on the annotated training data, and thus obtain better results for well-represented classes. On the other hand, a rule-based method alone was not sufficient to deal with new types of data. Finally, hybrid approaches combining machine-learning and rule-based methods yielded higher scores.
30. Ben Abacha A, Zweigenbaum P. A Hybrid Approach for the Extraction of Semantic Relations from MEDLINE Abstracts. 2011. DOI: 10.1007/978-3-642-19437-5_11.
31. Deléger L, Grouin C, Zweigenbaum P. Extracting medical information from narrative patient records: the case of medication-related information. J Am Med Inform Assoc 2010;17:555-8. PMID: 20819863; DOI: 10.1136/jamia.2010.003962.
Abstract
OBJECTIVE While essential for patient care, information related to medication is often written as free text in clinical records and is therefore difficult to use in computerized systems. This paper describes an approach to automatically extract medication information from clinical records, developed to participate in the i2b2 2009 challenge, as well as different strategies to improve the extraction. DESIGN Our approach relies on a semantic lexicon and extraction rules organized as a two-phase strategy: first, drug names are recognized; then, the context of these names is explored to extract drug-related information (mode, dosage, etc) according to rules capturing the document structure and the syntax of each kind of information. Different configurations were tested to improve this baseline system along several dimensions, particularly drug name recognition, this step being a determining factor in extracting drug-related information. Changes were tested at the level of the lexicons and of the extraction rules. RESULTS The initial system participating in i2b2 achieved good results (global F-measure of 77%). Further testing of different configurations substantially improved the system (global F-measure of 81%), which performed well for all types of information (eg, 84% for drug names and 88% for modes), except for durations and reasons, which remain problematic. CONCLUSION This study demonstrates that a simple rule-based system can achieve good performance on the medication extraction task. We also showed that controlled modifications (lexicon filtering and rule refinement) yielded the largest performance gains.
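The two-phase strategy described above can be sketched as follows; the mini drug lexicon, the 40-character context window, and the dosage/mode patterns are illustrative assumptions, not the system's actual lexicons or rules:

```python
import re

# Phase 1 resource: a tiny stand-in for a semantic drug lexicon.
DRUG_LEXICON = {"aspirin", "metformin", "amoxicillin"}
# Phase 2 resources: context patterns for drug-related information.
DOSAGE = re.compile(r"\b\d+\s?mg\b")
MODE = re.compile(r"\b(oral|iv|intravenous|topical)\b", re.I)

def extract_medications(text):
    """Recognize drug names, then mine their local context for dosage/mode."""
    records = []
    for m in re.finditer(r"\w+", text):
        if m.group().lower() in DRUG_LEXICON:
            window = text[m.end():m.end() + 40]  # context after the name
            dosage = DOSAGE.search(window)
            mode = MODE.search(window)
            records.append({
                "drug": m.group(),
                "dosage": dosage.group() if dosage else None,
                "mode": mode.group().lower() if mode else None,
            })
    return records
```

The sketch makes the paper's point concrete: drug name recognition gates everything else, since dosage and mode are only searched for around a recognized name.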
32. Delbecque T, Zweigenbaum P. Using Co-Authoring and Cross-Referencing Information for MEDLINE Indexing. AMIA Annu Symp Proc 2010;2010:147-151. PMID: 21346958; PMCID: PMC3041281.
Abstract
Due to the large number of new papers regularly entering the MEDLINE database, there is an ongoing effort to design tools that help index this new material. Here we investigate the hypothesis that past indexing information coming from referencing and authoring links can be used for this purpose. Using a JAMA-based subset of MEDLINE, we designed ranking scores which rely on this information; given a new article, the aim of these scores is to build an ordered list of MeSH terms that should be used to index this article. Evaluation measures on an independent, 1000-document data set are given. Comparison with equivalent work shows benefits in recall, F-measure and mean average precision. Moreover, cited articles and authors' past articles contribute to seven of the top ten ranking features, supporting our hypothesis. Further improvements and extensions to this work are discussed in the conclusion.
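The intuition behind such ranking scores can be sketched simply: collect the MeSH terms that index the new article's cited articles and its authors' past articles, and rank them by how many of those articles they index. The equal weighting of the two sources below is an illustrative assumption, not the paper's learned feature combination:

```python
from collections import Counter

def rank_mesh_candidates(cited_mesh_lists, author_past_mesh_lists):
    """Rank candidate MeSH terms for a new article.

    Both arguments are lists of MeSH-term lists, one list per linked
    article (cited article or author's past article).
    """
    counts = Counter()
    for terms in cited_mesh_lists + author_past_mesh_lists:
        counts.update(set(terms))  # one vote per linked article
    return [term for term, _ in counts.most_common()]
```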
33. Deléger L, Merabti T, Lecroq T, Joubert M, Zweigenbaum P, Darmoni S. A twofold strategy for translating a medical terminology into French. AMIA Annu Symp Proc 2010;2010:152-156. PMID: 21346959; PMCID: PMC3041288.
Abstract
OBJECTIVE The goal of this study is to assist the translation of a medical terminology (MedlinePlus) into French. METHODS We combined two types of approaches to acquire French translations of English MedlinePlus terms. The first is knowledge-based and relies on the conceptual information of the UMLS metathesaurus. The second method is a corpus-based NLP technique using a bilingual parallel corpus. RESULTS The knowledge-based method brought translations for 611 terms, among which 67.6% were considered valid. The corpus-based approach provided translations for 143 terms of which 71.3% were considered valid. We thus acquired a total of 435 translated terms (51.3%). CONCLUSION Combining two approaches allowed us to semi-automatically translate more than half of the terminology, while focusing on only one would have provided a more partial translation. From an applicative viewpoint, this French version is now integrated in the catalogue of online health resources CISMeF.
34. Deléger L, Grouin C, Zweigenbaum P. Extracting medication information from French clinical texts. Stud Health Technol Inform 2010;160:949-953. PMID: 20841824.
Abstract
Much more Natural Language Processing (NLP) work has been performed on the English language than on any other. This general observation is also true of medical NLP, although clinical language processing needs are as strong in other languages as they are in English. In specific subdomains, such as drug prescription, the expression of information can be closely related across different languages, which should help transfer systems from English to other languages. We report here the implementation of a medication extraction system which extracts drugs and related information from French clinical texts, on the basis of an approach initially designed for English within the framework of the i2b2 2009 challenge. The system relies on specialized lexicons and a set of extraction rules. A first evaluation on 50 annotated texts obtains 86.7% F-measure, a level higher than the original English system and close to related work. This shows that the same rule-based approach can be applied to English and French languages, with a similar level of performance. We further discuss directions for improving both systems.
35. Deléger L, Merkel M, Zweigenbaum P. Translating medical terminologies through word alignment in parallel text corpora. J Biomed Inform 2009;42:692-701. PMID: 19275946; DOI: 10.1016/j.jbi.2009.03.002.
Abstract
Developing international multilingual terminologies is a time-consuming process. We present a methodology which aims to ease this process by automatically acquiring new translations of medical terms based on word alignment in parallel text corpora, and test it on English and French. After collecting a parallel, English-French corpus, we detected French translations of English terms from three terminologies: MeSH, SNOMED CT and the MedlinePlus Health Topics. We obtained, respectively for each terminology, 74.8%, 77.8% and 76.3% linguistically correct new translations. A sample of the MeSH translations was submitted to expert review and 61.5% were deemed desirable additions to the French MeSH. In conclusion, we successfully obtained good quality new translations, which underlines the suitability of word alignment in text corpora to help translate terminologies. Our method may be applied to different European languages and provides a methodological framework that may be used with different processing tools.
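At its core, corpus-based translation detection relies on an association measure between source and target words across aligned sentence pairs. A bare-bones sketch using the Dice coefficient; production alignment tools use much richer statistical models, so this only conveys the principle:

```python
from collections import Counter

def dice_translations(sentence_pairs):
    """Score (English word, French word) pairs by the Dice coefficient
    of their sentence-level co-occurrence in a parallel corpus."""
    en_count, fr_count, pair_count = Counter(), Counter(), Counter()
    for en_sent, fr_sent in sentence_pairs:
        en_words, fr_words = set(en_sent.split()), set(fr_sent.split())
        en_count.update(en_words)
        fr_count.update(fr_words)
        pair_count.update((e, f) for e in en_words for f in fr_words)
    # Dice = 2 * joint count / (sum of marginal counts)
    return {
        pair: 2 * n / (en_count[pair[0]] + fr_count[pair[1]])
        for pair, n in pair_count.items()
    }
```

Word pairs that consistently co-occur across the corpus (e.g. "liver" / "foie") score close to 1 and become candidate term translations for human review.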
36. Grouin C, Rosier A, Dameron O, Zweigenbaum P. Testing tactics to localize de-identification. Stud Health Technol Inform 2009;150:735-739. PMID: 19745408.
Abstract
Recent renewed interest in de-identification (also known as "anonymisation") has led to the development of a series of systems in the United States with very good performance on challenge test sets. De-identification, however, needs to be tuned to the local documents and their specificities. We address here two issues raised in this context. First, tuning is generally performed by language engineers, who should not have to work on identified text. We therefore perform a first gross de-identification step in the hospital. Second, to set up a de-identification system for new documents in a language other than English, here French patient reports, we tested two methods: the first attempts to adapt an existing US de-identifier for English; the second re-develops a new system which applies the same methods. The first method involved localizing patterns designed for English, which proved cumbersome and did not quickly reach good performance. With a similar effort, the latter method obtained much better results. Evaluated on a set of 23 randomly selected texts from a corpus of 21,749 clinical texts, it obtained 83% recall and 92% precision.
37. Deléger L, Zweigenbaum P. Paraphrase acquisition from comparable medical corpora of specialized and lay texts. AMIA Annu Symp Proc 2008;2008:146-150. PMID: 18999095; PMCID: PMC2656025.
Abstract
Nowadays a large amount of health information is available to the public, but medical language is often difficult for lay people to understand. Developing means to make medical information more comprehensible is therefore a real need. In this regard, a useful resource would be a corpus of specialized and lay paraphrases. To this end we built comparable corpora of specialized and lay texts on which we applied paraphrasing patterns based on anchors of deverbal noun and verb pairs. The results show that the paraphrases were of good quality (71.4% to 94.2% precision) and that this type of paraphrases was relevant in the context of studying the differences between specialized and lay language. This study also demonstrates that simple paraphrase acquisition methods can also work on texts with a rather small degree of similarity, once similar text segments are detected.
38. Cormont S, Buemi A, Horeau T, Zweigenbaum P, Lepage E. Construction of a dictionary of laboratory tests mapped to LOINC at AP-HP. AMIA Annu Symp Proc 2008:1200. PMID: 18999107.
Abstract
We report on the ongoing process implemented at Assistance Publique-Hôpitaux de Paris (AP-HP), the largest hospital system in Europe, to build a common reference for laboratory tests in French with LOINC mappings. At the time of writing, it contained 24,000 tests, covering all fields of biology, in use in 19 AP-HP hospitals, 30% of which had a mapping to LOINC with a peak of over 60% in biochemistry.
39. Deléger L, Namer F, Zweigenbaum P. Morphosemantic parsing of medical compound words: transferring a French analyzer to English. Int J Med Inform 2008;78 Suppl 1:S48-55. PMID: 18801700; DOI: 10.1016/j.ijmedinf.2008.07.016.
Abstract
PURPOSE Medical language, like many technical languages, is rich in morphologically complex words, many of which take their roots from Greek and Latin, in which case they are called neoclassical compounds. Morphosemantic analysis can help generate definitions of such words. The similarity of structure of these compounds across several European languages has also been observed, which suggests that the same linguistic analysis could be applied to neoclassical compounds from different languages with minor modifications. METHODS This paper reports work on the adaptation of a morphosemantic analyzer dedicated to French (DériF) to analyze English medical neoclassical compounds. It presents the principles of this transposition and its current performance. RESULTS The analyzer was tested on a set of 1299 compounds extracted from the WHO-ART terminology. 859 could be decomposed and defined, 675 of them successfully. CONCLUSION An advantage of this process is that complex linguistic analyses designed for French could be successfully transposed to the analysis of English medical neoclassical compounds, which confirmed our hypothesis of transferability. The fact that the method was successfully applied to a Germanic language such as English suggests that performance would be at least as high on Romance languages such as Spanish. Finally, the resulting system can produce more complete analyses of English medical compounds than existing systems, including a hierarchical decomposition and semantic gloss of each word.
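The decomposition step at the heart of such an analyzer can be illustrated with a toy greedy splitter over a small table of Greek/Latin combining forms; DériF performs a full hierarchical linguistic analysis, so this sketch only conveys the idea:

```python
# A tiny, illustrative table of combining forms and their glosses.
ROOTS = {
    "gastr": "stomach",
    "enter": "intestine",
    "cardi": "heart",
    "itis": "inflammation",
    "logy": "study",
}

def decompose(word):
    """Greedily split a compound into known roots, longest match first."""
    glosses, i = [], 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):
            root = word[i:i + length]
            if root in ROOTS:
                glosses.append(ROOTS[root])
                i += length
                break
        else:
            i += 1  # skip unknown letters such as linking vowels ("o")
    return glosses
```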
40. Deleger L, Zweigenbaum P. Aligning lay and specialized passages in comparable medical corpora. Stud Health Technol Inform 2008;136:89-94. PMID: 18487713.
Abstract
While the public has increasing access to medical information, specialized medical language is often difficult for non-experts to understand, and there is a need to bridge the gap between specialized language and lay language. As a first step towards this end, we describe here a method to build a comparable corpus of expert and non-expert medical French documents and to identify similar text segments of lay and specialized language. Among the top 400 pairs of text segments retrieved with this method, 59% were actually similar and 37% were deemed exploitable for further processing. This is encouraging evidence for the target task of finding equivalent expressions between these two varieties of language.
41. Zweigenbaum P, Demner-Fushman D, Yu H, Cohen KB. Frontiers of biomedical text mining: current progress. Brief Bioinform 2007;8:358-75. PMID: 17977867; PMCID: PMC2516302; DOI: 10.1093/bib/bbm045.
Abstract
It is now almost 15 years since the publication of the first paper on text mining in the genomics domain, and decades since the first paper on text mining in the medical domain. Enormous progress has been made in the areas of information retrieval, evaluation methodologies and resource construction. Some problems, such as abbreviation-handling, can essentially be considered solved problems, and others, such as identification of gene mentions in text, seem likely to be solved soon. However, a number of problems at the frontiers of biomedical text mining continue to present interesting challenges and opportunities for great improvements and interesting research. In this article we review the current state of the art in biomedical text mining or 'BioNLP' in general, focusing primarily on papers published within the past year.
42. Deléger L, Namer F, Zweigenbaum P. Defining medical words: transposing morphosemantic analysis from French to English. Stud Health Technol Inform 2007;129:535-9. PMID: 17911774.
Abstract
Medical language, like many technical languages, is rich in morphologically complex words, many of which take their roots from Greek and Latin, in which case they are called neoclassical compounds. Morphosemantic analysis can help generate definitions of such words. This paper reports work on the adaptation of a morphosemantic analyzer dedicated to French (DériF) to analyze English medical neoclassical compounds. It presents the principles of this transposition and its current performance. The analyzer was tested on a set of 1,299 compounds extracted from the WHO-ART terminology. 859 could be decomposed and defined, 675 of them successfully. An advantage of this process is that complex linguistic analyses designed for French could be successfully transferred to the analysis of English medical neoclassical compounds. Moreover, the resulting system can produce more complete analyses of English medical compounds than existing ones, including a hierarchical decomposition and semantic gloss of each word.
43. Nyström M, Merkel M, Ahrenberg L, Zweigenbaum P, Petersson H, Åhlfeldt H. Creating a medical English-Swedish dictionary using interactive word alignment. BMC Med Inform Decis Mak 2006;6:35. PMID: 17034649; PMCID: PMC1624822; DOI: 10.1186/1472-6947-6-35.
Abstract
BACKGROUND This paper reports on a parallel collection of rubrics from the medical terminology systems ICD-10, ICF, MeSH, NCSP and KSH97-P and its use for the semi-automatic creation of an English-Swedish dictionary of medical terminology. The methods presented are relevant for many West European language pairs other than English-Swedish. METHODS The medical terminology systems were collected in electronic format in both English and Swedish and the rubrics were extracted in parallel language pairs. Initially, interactive word alignment was used to create training data from a sample. The training data were then used in automatic word alignment to generate candidate term pairs. The last step was manual verification of the term pair candidates. RESULTS A dictionary of 31,000 verified entries was created in less than three man-weeks, with considerably less time and effort than a manual approach would require, and without compromising quality. As a side effect of our work we found 40 different translation problems in the terminology systems; these results indicate the power of the method for finding inconsistencies in terminology translations. We also report on some factors that may make the process of dictionary creation with similar tools even more efficient. Finally, the contribution is discussed in relation to other ongoing efforts to construct medical lexicons for non-English languages. CONCLUSION In three man-weeks we were able to produce a medical English-Swedish dictionary consisting of 31,000 entries, and we also found hidden translation errors in the medical terminology systems used.
44. Marko K, Baud R, Zweigenbaum P, Borin L, Merkel M, Schulz S. Towards a multilingual medical lexicon. AMIA Annu Symp Proc 2006;2006:534-8. PMID: 17238398; PMCID: PMC1839525.
Abstract
We present results of the collaboration of a multinational team of researchers from (computational) linguistics, medicine, and medical informatics with the goal of building a multilingual medical lexicon with high coverage and complete morpho-syntactic information. Monolingual lexical resources were collected and subsequently mapped between languages using a morpho-semantic term normalization engine, which captures intra- as well as interlingual synonymy relationships on the level of subwords.
45. Deleger L, Merkel M, Zweigenbaum P. Enriching medical terminologies: an approach based on aligned corpora. Stud Health Technol Inform 2006;124:747-52. PMID: 17108604.
Abstract
Medical terminologies such as those in the UMLS are never exhaustive and there is a constant need to enrich them, especially in terms of multilinguality. We present a methodology to acquire new French translations of English medical terms based on word alignment in a parallel corpus - i.e. pairing of corresponding words. We automatically collected a 27.7-million-word parallel, English-French corpus. Based on a first 1.3-million-word extract of this corpus, we detected 10,171 candidate French translations of English medical terms from MeSH and SNOMED, among which 3,807 are new translations of English MeSH terms.
46. Deléger L, Merkel M, Zweigenbaum P. Contribution to terminology internationalization by word alignment in parallel corpora. AMIA Annu Symp Proc 2006;2006:185-9. PMID: 17238328; PMCID: PMC1839560.
Abstract
BACKGROUND AND OBJECTIVES Creating a complete translation of a large vocabulary is a time-consuming task which requires skilled and knowledgeable medical translators. Our goal is to examine to what extent such a task can be alleviated by a specific natural language processing technique, word alignment in parallel corpora. We experiment with translation from English to French. METHODS We build a large corpus of parallel, English-French documents and automatically align it at the document, sentence and word levels using state-of-the-art alignment methods and tools. We then project English terms from existing controlled vocabularies onto the aligned word pairs, and examine the number and quality of the putative French translations obtained thereby. We considered three American vocabularies present in the UMLS with three different translation statuses: MeSH, SNOMED CT, and the MedlinePlus Health Topics. RESULTS We obtained several thousand new translations of our input terms, this number being closely linked to the number of terms in the input vocabularies. CONCLUSION Our study shows that alignment methods can extract a number of new term translations from large bodies of text with a moderate human reviewing effort, and can thus help a human translator obtain better translation coverage of an input vocabulary. Short-term perspectives include their application to a corpus 20 times larger than the one used here, together with more focused methods for term extraction.
47. Zweigenbaum P, Baud R, Burgun A, Namer F, Jarrousse E, Grabar N, Ruch P, Le Duff F, Forget JF, Douyère M, Darmoni S. UMLF: a unified medical lexicon for French. Int J Med Inform 2005;74:119-24. PMID: 15694616; DOI: 10.1016/j.ijmedinf.2004.03.010.
Abstract
Medical informatics has a constant need for basic medical language processing tasks, e.g. for coding into controlled vocabularies, free-text indexing and information retrieval. Most of these tasks involve term matching and rely on lexical resources: lists of words with attached information, including inflected forms, derived words, etc. Such resources are publicly available for the English language with the UMLS Specialist Lexicon, but not for other languages. For the French language, several teams have worked on the subject and built local lexical resources. The goal of the present work is to pool and unify these resources and to add to them extensively by exploiting medical terminologies and corpora, resulting in a unified medical lexicon for French (UMLF). This paper presents the issues raised by such an objective, describes the methods on which the project relies and illustrates them with experimental results.
48
Delbecque T, Jacquemart P, Zweigenbaum P. Indexing UMLS Semantic Types for Medical Question-Answering. Stud Health Technol Inform 2005; 116:805-10. [PMID: 16160357]
Abstract
Open-domain Question-Answering (QA) systems rely heavily on named entities, a set of general-purpose semantic types which generally cover names of persons, organizations and locations, dates, amounts, etc. If we are to build medical QA systems, a set of medically relevant named entities must be used. In this paper, we explore the use of the UMLS (Unified Medical Language System) Semantic Network semantic types for this purpose. We present an experiment where the French part of the UMLS Metathesaurus, together with the associated semantic types, is used as a resource for a medically specific named-entity tagger. We also explore the detection of Semantic Network relations for answering specific types of medical questions. We present results and evaluations on a corpus of French-language medical documents that was used in the EQueR Question-Answering evaluation forum. We show, through statistical analyses, that strategies for using these new tags in a QA context must take into account the individual origin of the documents.
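A tagger of the kind described can be sketched as a dictionary lookup: Metathesaurus terms are mapped to their semantic types and matched longest-first in the text. The term/type table below is a tiny invented stand-in for the French UMLS subset, not the actual resource.

```python
# Minimal sketch of a dictionary-based named-entity tagger: terms from a
# (tiny, hypothetical) French UMLS subset map to Semantic Network types
# and are matched longest-first so multi-word terms win over sub-terms.

TERM_TYPES = {  # term -> UMLS semantic type (illustrative entries)
    "infarctus du myocarde": "Disease or Syndrome",
    "aspirine": "Pharmacologic Substance",
}

def tag(text):
    tokens = text.lower().split()
    tags, i = [], 0
    while i < len(tokens):
        # Try the longest span starting at i first.
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in TERM_TYPES:
                tags.append((span, TERM_TYPES[span]))
                i = j
                break
        else:
            i += 1
    return tags

print(tag("traitement par aspirine après infarctus du myocarde"))
```

A production tagger would add lemmatization and ambiguity resolution on top of the lookup, but the longest-match core is the same.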
49
Baud RH, Nyström M, Borin L, Evans R, Schulz S, Zweigenbaum P. Interchanging lexical information for a multilingual dictionary. AMIA Annu Symp Proc 2005; 2005:31-5. [PMID: 16778996] [PMCID: PMC1560452]
Abstract
OBJECTIVE To facilitate the interchange of lexical information for multiple languages in the medical domain, and to pave the way for the emergence of a generally available, truly multilingual electronic dictionary in the medical domain. METHODS An interchange format has to be neutral with respect to the target languages. It has to be consistent with the current and future needs of lexicon authors. Active interaction among six potential authors aimed to determine a common denominator striking the right balance between richness of content and ease of use for lexicon providers. RESULTS A simple list of relevant attributes has been established and published. The format has the potential for collecting relevant parts of a future multilingual dictionary. An XML version is available. CONCLUSION This effort makes feasible the exchange of lexical information between research groups. Interchange files are made available in a public repository. This procedure opens the door to a true multilingual dictionary, with the awareness that the exchange of lexical information is (only) a necessary first step before structuring the corresponding entries in different languages.
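An XML interchange entry in the spirit described might carry a lemma, language tag, part of speech, and inflected forms. The element and attribute names below are invented for illustration; they are not the published format's actual schema.

```xml
<!-- Hypothetical lexical-interchange entry; element names are illustrative,
     not the format defined in the paper. -->
<lexicalEntry lang="fr">
  <lemma>cardiaque</lemma>
  <partOfSpeech>adjective</partOfSpeech>
  <inflectedForm number="singular">cardiaque</inflectedForm>
  <inflectedForm number="plural">cardiaques</inflectedForm>
  <translation lang="en">cardiac</translation>
</lexicalEntry>
```

Keeping such entries language-neutral in structure, with language-specific content only in the values, is what lets multiple research groups exchange files without agreeing on each other's internal lexicon models.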
50
Namer F, Zweigenbaum P. Acquiring meaning for French medical terminology: contribution of morphosemantics. Stud Health Technol Inform 2004; 107:535-9. [PMID: 15360870]
Abstract
Morphologically complex words, and particularly neoclassical compounds, form more than 60% of the neologisms in the biomedical field. Guessing their definitions and grouping them into semantic classes by means of lexical relations are thus two crucial improvements for handling these words, e.g., for information retrieval, indexing and text-understanding applications. This paper describes a linguistically based morphosemantic parser called DériF, currently developed in the framework of two projects, UMLF and VUMeF, and its application to French biomedical derived and compound words. It shows how the resulting morphologically tagged lexicon is enriched with semantic relations, leading both to the synthesis of pseudo-definitions and to the constitution of classes of synonyms, hyponyms and hypernyms.
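The idea of synthesizing a pseudo-definition from a compound's components can be sketched as follows. The combining-form table and glosses are toy entries invented for illustration; they are not DériF's actual lexicon or rules.

```python
# Hedged sketch in the spirit of morphosemantic analysis: a neoclassical
# compound is segmented into two combining forms from a small (toy) table,
# and a pseudo-definition is composed from their glosses.

COMBINING_FORMS = {  # form -> gloss (illustrative entries only)
    "gastr": "stomach",
    "hepat": "liver",
    "algie": "pain",
    "ite": "inflammation",
}

def pseudo_define(word):
    word = word.lower()
    for prefix, gloss1 in COMBINING_FORMS.items():
        for suffix, gloss2 in COMBINING_FORMS.items():
            # Allow an optional linking 'o' between the two components.
            for link in ("", "o"):
                if word == prefix + link + suffix:
                    return f"{gloss2} of the {gloss1}"
    return None

print(pseudo_define("gastrite"))    # inflammation of the stomach
print(pseudo_define("hepatalgie"))  # pain of the liver
```

Grouping words that share a component (here, everything ending in "-ite") is also what yields the semantic classes of synonyms, hyponyms and hypernyms mentioned in the abstract.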