1
|
Ning W, Yu M, Zhang R. A hierarchical method to automatically encode Chinese diagnoses through semantic similarity estimation. BMC Med Inform Decis Mak 2016; 16:30. [PMID: 26940992 PMCID: PMC4778321 DOI: 10.1186/s12911-016-0269-4] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 08/25/2015] [Accepted: 02/26/2016] [Indexed: 12/31/2022] Open
Abstract
Background The accumulation of medical documents in China has rapidly increased in the past years. We focus on developing a method that automatically performs ICD-10 code assignment to Chinese diagnoses from the electronic medical records to support the medical coding process in Chinese hospitals. Methods We propose two encoding methods: one that directly determines the desired code (flat method), and one that hierarchically determines the most suitable code until the desired code is obtained (hierarchical method). Both methods are based on instances from the standard diagnostic library, a gold standard dataset in China. For the first time, semantic similarity estimation between Chinese words is applied in the biomedical domain with the successful implementation of knowledge-based and distributional approaches. Characteristics of the Chinese language are considered in implementing distributional semantics. We test our methods against 16,330 coding instances from our partner hospital. Results The hierarchical method outperforms the flat method in terms of accuracy and time complexity. Representing distributional semantics using Chinese characters can achieve comparable performance to the use of Chinese words. The diagnoses in the test set can be encoded automatically with micro-averaged precision of 92.57%, recall of 89.63%, and F-score of 91.08%. A sharp decrease in encoding performance is observed without semantic similarity estimation. Conclusion The hierarchical nature of ICD-10 codes can enhance the performance of the automated code assignment. Semantic similarity estimation is shown to be indispensable in dealing with Chinese medical text. The proposed method can greatly reduce the workload and improve the efficiency of the code assignment process in Chinese hospitals. Electronic supplementary material The online version of this article (doi:10.1186/s12911-016-0269-4) contains supplementary material, which is available to authorized users.
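The difference between the flat and hierarchical methods in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the four-entry `LIBRARY`, the English diagnosis strings, the word-overlap `similarity` function (a stand-in for the paper's semantic similarity estimation), and the use of the leading code letter as the top hierarchy level are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical toy "standard library": ICD-10-style codes with example diagnosis texts.
LIBRARY: Dict[str, List[str]] = {
    "I10": ["essential hypertension"],
    "I21": ["acute myocardial infarction"],
    "J18": ["pneumonia unspecified organism"],
    "J45": ["bronchial asthma"],
}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; an assumed stand-in for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def code_score(diagnosis: str, code: str) -> float:
    """Best similarity between the diagnosis and any library instance of the code."""
    return max(similarity(diagnosis, text) for text in LIBRARY[code])


def flat_encode(diagnosis: str) -> str:
    """Flat method: compare against every code in the library at once."""
    return max(LIBRARY, key=lambda code: code_score(diagnosis, code))


def hierarchical_encode(diagnosis: str) -> str:
    """Hierarchical method: pick the best top-level group, then the best code in it."""
    groups: Dict[str, List[str]] = {}
    for code in LIBRARY:
        groups.setdefault(code[0], []).append(code)  # leading letter = top level (assumption)
    best_group = max(
        groups, key=lambda g: max(code_score(diagnosis, c) for c in groups[g])
    )
    return max(groups[best_group], key=lambda code: code_score(diagnosis, code))


print(flat_encode("acute myocardial infarction of anterior wall"))
print(hierarchical_encode("bronchial asthma with acute attack"))
```

In this toy version both functions return the same code; a deeper code tree or cheaper group-level scoring, neither of which is modelled here, would be needed for the two strategies to diverge.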
Affiliation(s)
- Wenxin Ning
  - Health Care Services Research Center, Department of Industrial Engineering, Tsinghua University, Beijing, 100084, PR China.
- Ming Yu
  - Health Care Services Research Center, Department of Industrial Engineering, Tsinghua University, Beijing, 100084, PR China.
- Runtong Zhang
  - Department of Information Management, School of Economics and Management, Beijing Jiaotong University, Beijing, 100084, PR China.
|
2
|
Zeng J, Wu Y, Bailey A, Johnson A, Holla V, Bernstam EV, Xu H, Meric-Bernstam F. Adapting a natural language processing tool to facilitate clinical trial curation for personalized cancer therapy. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2014; 2014:126-31. [PMID: 25717412 PMCID: PMC4333699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]
Abstract
The design of personalized cancer therapy based upon patients' molecular profile requires an enormous amount of effort to review, analyze and integrate molecular, pharmacological, clinical and patient-specific information. The vast size, rapid expansion and non-standardized formats of the relevant information sources make it difficult for oncologists to gather pertinent information that can support routine personalized treatment. In this paper, we introduce informatics tools that assist the retrieval and curation of cancer-related clinical trials involving targeted therapies. Particularly, we adapted and extended an existing natural language processing tool, and explored its applicability in facilitating our annotation efforts. The system was evaluated using a gold standard of 539 curated clinical trials, demonstrating promising performance and good generalizability (81% accuracy in predicting genotype-selected trials and an average recall of 0.85 in predicting specific selection criteria).
Affiliation(s)
- Jia Zeng
  - Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX
- Yonghui Wu
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX
- Ann Bailey
  - Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX
- Amber Johnson
  - Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX
- Vijaykumar Holla
  - Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX
- Elmer V. Bernstam
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX
- Hua Xu
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX
- Funda Meric-Bernstam
  - Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX
|
3
|
Wu Y, Levy MA, Micheel CM, Yeh P, Tang B, Cantrell MJ, Cooreman SM, Xu H. Identifying the status of genetic lesions in cancer clinical trial documents using machine learning. BMC Genomics 2012; 13 Suppl 8:S21. [PMID: 23282337 PMCID: PMC3535695 DOI: 10.1186/1471-2164-13-s8-s21] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Many cancer clinical trials now specify the particular status of a genetic lesion in a patient's tumor in the inclusion or exclusion criteria for trial enrollment. To facilitate search and identification of gene-associated clinical trials by potential participants and clinicians, it is important to develop automated methods to identify genetic information from narrative trial documents. METHODS We developed a two-stage classification method to identify genes and genetic lesion statuses in clinical trial documents extracted from the National Cancer Institute's (NCI's) Physician Data Query (PDQ) cancer clinical trial database. The method consists of two steps: 1) to distinguish gene entities from non-gene entities such as English words; and 2) to determine whether and which genetic lesion status is associated with an identified gene entity. We developed and evaluated the performance of the method using a manually annotated data set containing 1,143 instances of the eight most frequently mentioned genes in cancer clinical trials. In addition, we applied the classifier to a real-world task of cancer trial annotation and evaluated its performance using a larger sample size (4,013 instances from 249 distinct human gene symbols detected from 250 trials). RESULTS Our evaluation using a manually annotated data set showed that the two-stage classifier outperformed the single-stage classifier and achieved the best average accuracy of 83.7% for the eight most frequently mentioned genes when optimized feature sets were used. It also showed better generalizability when we applied the two-stage classifier trained on one set of genes to another independent gene. When a gene-neutral, two-stage classifier was applied to the real-world task of cancer trial annotation, it achieved a highest accuracy of 89.8%, demonstrating the feasibility of developing a gene-neutral classifier for this task. 
CONCLUSIONS We presented a machine learning-based approach to detect gene entities and the genetic lesion statuses from clinical trial documents and demonstrated its use in cancer trial annotation. Such methods would be valuable for building information retrieval tools targeting gene-associated clinical trials.
Affiliation(s)
- Yonghui Wu
  - Department of Biomedical Informatics, Vanderbilt University, School of Medicine, 2209 Garland Ave, Nashville, TN 37232, USA
- Mia A Levy
  - Department of Biomedical Informatics, Vanderbilt University, School of Medicine, 2209 Garland Ave, Nashville, TN 37232, USA
  - Department of Medicine, Division of Hematology and Oncology, Vanderbilt University, School of Medicine, USA
  - Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, USA
- Paul Yeh
  - Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, USA
- Buzhou Tang
  - Department of Biomedical Informatics, Vanderbilt University, School of Medicine, 2209 Garland Ave, Nashville, TN 37232, USA
- Michael J Cantrell
  - Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, USA
- Stacy M Cooreman
  - Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, USA
- Hua Xu
  - Department of Biomedical Informatics, Vanderbilt University, School of Medicine, 2209 Garland Ave, Nashville, TN 37232, USA
|
4
|
Jonnalagadda S, Topham P. NEMO: Extraction and normalization of organization names from PubMed affiliations. JOURNAL OF BIOMEDICAL DISCOVERY AND COLLABORATION 2010; 5:50-75. [PMID: 20922666 PMCID: PMC2990275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2010] [Revised: 09/23/2010] [Accepted: 09/25/2010] [Indexed: 11/16/2022]
Abstract
BACKGROUND Today, there are more than 18 million articles related to biomedical research indexed in MEDLINE, and information derived from them could be used effectively to save the great amount of time and resources spent by government agencies in understanding the scientific landscape, including key opinion leaders and centers of excellence. Associating biomedical articles with organization names could significantly benefit the pharmaceutical marketing industry, health care funding agencies and public health officials and be useful for other scientists in normalizing author names, automatically creating citations, indexing articles and identifying potential resources or collaborators. The large amount of extracted information helps in disambiguating organization names using machine-learning algorithms. RESULTS We propose NEMO, a system for extracting organization names in the affiliation and normalizing them to a canonical organization name. Our parsing process involves multi-layered rule matching with multiple dictionaries. The system achieves more than 98% f-score in extracting organization names. Our normalization process involves clustering based on local sequence alignment metrics and local learning based on finding connected components. A high precision was also observed in normalization. CONCLUSION NEMO is the missing link in associating each biomedical paper and its authors to an organization name in its canonical form and the geopolitical location of the organization. This research could potentially help in analyzing large social networks of organizations for landscaping a particular topic, improving performance of author disambiguation, adding weak links in the co-author network of authors, augmenting NLM's MARS system for correcting errors in OCR output of affiliation field, and automatically indexing the PubMed citations with the normalized organization name and country.
Our system is provided as a graphical user interface, available for download along with this paper.
Affiliation(s)
- Philip Topham
  - Lnx Research LLC, 750 The City Drive Suite 490, Orange, CA 92868, United States
|
5
|
Harmston N, Filsell W, Stumpf MPH. What the papers say: text mining for genomics and systems biology. Hum Genomics 2010; 5:17-29. [PMID: 21106487 PMCID: PMC3500154 DOI: 10.1186/1479-7364-5-1-17] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.8] [Received: 08/06/2010] [Accepted: 08/06/2010] [Indexed: 12/11/2022] Open
Abstract
Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining - the automated extraction of information from (electronically) published sources - could potentially fulfil an important role - but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a survey of present challenges; (ii) to enable researchers to decide how and when to apply text mining tools in their own research; and (iii) to highlight how the research communities in genomics and systems biology can help to make text mining from biomedical abstracts and texts more straightforward.
Affiliation(s)
- Nathan Harmston
  - Division of Molecular Biosciences, Centre for Bioinformatics, Imperial College London, 303, Wolfson Building, South Kensington Campus, London, SW7 2AZ, UK
- Wendy Filsell
  - Unilever R&D, Colworth Science Park, Sharnbrook, Bedford MK44 1LQ, UK
- Michael PH Stumpf
  - Division of Molecular Biosciences, Centre for Bioinformatics, Imperial College London, 303, Wolfson Building, South Kensington Campus, London, SW7 2AZ, UK
|
6
|
Alexopoulou D, Andreopoulos B, Dietze H, Doms A, Gandon F, Hakenberg J, Khelif K, Schroeder M, Wächter T. Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy. BMC Bioinformatics 2009; 10:28. [PMID: 19159460 PMCID: PMC2663782 DOI: 10.1186/1471-2105-10-28] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 09/16/2008] [Accepted: 01/21/2009] [Indexed: 11/24/2022] Open
Abstract
Background Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge to automated methods, which identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Another useful source of information for disambiguation is metadata. Here, we systematically compare three approaches to word sense disambiguation, which use ontologies and metadata, respectively. Results The 'Closest Sense' method assumes that the ontology defines multiple senses of the term. It computes the shortest path of co-occurring terms in the document to one of these senses. The 'Term Cooc' method defines a log-odds ratio for co-occurring terms including co-occurrences inferred from the ontology structure. The 'MetaData' approach trains a classifier on metadata. It does not require any ontology, but requires training data, which the other methods do not. To evaluate these approaches we defined a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. All approaches over all conditions achieve 80% success rate on average. The 'MetaData' approach performed best with 96%, when trained on high-quality data. Its performance deteriorates as quality of the training data decreases. The 'Term Cooc' approach performs better on Gene Ontology (92% success) than on MeSH (73% success) as MeSH is not a strict is-a/part-of, but rather a loose is-related-to hierarchy. The 'Closest Sense' approach achieves on average 80% success rate. Conclusion Metadata is valuable for disambiguation, but requires high quality training data. Closest Sense requires no training, but a large, consistently modelled ontology, which are two opposing conditions. Term Cooc achieves greater than 90% success given a consistently modelled ontology. Overall, the results show that well structured ontologies can play a very important role to improve disambiguation. Availability The three benchmark datasets created for the purpose of disambiguation are available in Additional file 1.
Affiliation(s)
- Dimitra Alexopoulou
  - Biotechnology Center (BIOTEC), Technische Universität Dresden, 01062, Dresden, Germany.
|
7
|
Kastrin A, Hristovski D. A fast document classification algorithm for gene symbol disambiguation in the BITOLA literature-based discovery support system. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2008; 2008:358-362. [PMID: 18998999 PMCID: PMC2655979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2008] [Revised: 07/10/2008] [Indexed: 05/27/2023]
Abstract
Gene symbol disambiguation is an important problem for biomedical text mining systems. When detecting gene symbols in MEDLINE citations one of the biggest challenges is the fact that many gene symbols also denote other, more general biomedical concepts (e.g. CT, MR). Our approach to this problem is first to classify the citations into genetic and non-genetic domains and then to detect gene symbols only in the genetic domain. We used ontological information provided by Medical Subject Headings (MeSH) for this classification task. The proposed algorithm is fast and is able to process the full MEDLINE distribution in a few hours. It achieves predictive accuracy of 0.91. The algorithm is currently implemented in the BITOLA literature-based discovery support system (http://www.mf.uni-lj.si/bitola/).
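The two-step idea in the abstract above (first decide whether a citation belongs to the genetic domain, then look for gene symbols only there) can be sketched as follows. This is an illustrative simplification, not the BITOLA algorithm: the rule-based `is_genetic` check stands in for the paper's MeSH-based classifier, and the `GENETIC_MESH` and `GENE_SYMBOLS` sets are hypothetical.

```python
import re
from typing import Iterable, List

# Hypothetical MeSH headings treated as markers of the genetic domain.
GENETIC_MESH = {"Genes", "Mutation", "Polymorphism, Genetic"}

# Hypothetical symbol lexicon; CT and MR are the ambiguous examples from the abstract.
GENE_SYMBOLS = {"CT", "MR", "BRCA1"}


def is_genetic(mesh_headings: Iterable[str]) -> bool:
    """Step 1: assign the citation to the genetic domain if any marker heading is present."""
    return bool(GENETIC_MESH & set(mesh_headings))


def detect_gene_symbols(text: str, mesh_headings: Iterable[str]) -> List[str]:
    """Step 2: report lexicon matches only for citations in the genetic domain."""
    if not is_genetic(mesh_headings):
        return []
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [token for token in tokens if token in GENE_SYMBOLS]


# "CT" in a radiology citation is ignored; in a genetics citation it is reported.
print(detect_gene_symbols("CT scan of the chest", ["Tomography, X-Ray Computed"]))
print(detect_gene_symbols("Mutation in CT and BRCA1", ["Mutation"]))
```

Filtering by domain before symbol matching keeps the second step a cheap dictionary lookup, which is consistent with the abstract's emphasis on speed, though the actual classifier design is not described there.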
Affiliation(s)
- Andrej Kastrin
  - Institute of Medical Genetics, University Medical Centre, Ljubljana, Slovenia
|
8
|
Hakenberg J, Plake C, Leaman R, Schroeder M, Gonzalez G. Inter-species normalization of gene mentions with GNAT. Bioinformatics 2008; 24:i126-132. [PMID: 18689813 DOI: 10.1093/bioinformatics/btn299] [Citation(s) in RCA: 80] [Impact Index Per Article: 5.0] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Text mining in the biomedical domain aims at helping researchers to access information contained in scientific publications in a faster, easier and more complete way. One step towards this aim is the recognition of named entities and their subsequent normalization to database identifiers. Normalization helps to link objects of potential interest, such as genes, to detailed information not contained in a publication; it is also key for integrating different knowledge sources. From an information retrieval perspective, normalization facilitates indexing and querying. Gene mention normalization (GN) is particularly challenging given the high ambiguity of gene names: they refer to orthologous or entirely different genes, are named after phenotypes and other biomedical terms, or they resemble common English words. RESULTS We present the first publicly available system, GNAT, reported to handle inter-species GN. Our method uses extensive background knowledge on genes to resolve ambiguous names to EntrezGene identifiers. It performs comparably to single-species approaches proposed by us and others. On a benchmark set derived from BioCreative 1 and 2 data that contains genes from 13 species, GNAT achieves an F-measure of 81.4% (90.8% precision at 73.8% recall). For the single-species task, we report an F-measure of 85.4% on human genes. AVAILABILITY A web-frontend is available at http://cbioc.eas.asu.edu/gnat/. GNAT will also be available within the BioCreativeMetaService project, see http://bcms.bioinfo.cnio.es. SUPPLEMENTARY INFORMATION The test data set, lexica, and links to external data are available at http://cbioc.eas.asu.edu/gnat/
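The reported F-measure follows from the stated precision and recall if it is taken to be the balanced F1 score (the harmonic mean of the two), which is an assumption here since the abstract does not name the variant. A short check, with `f_measure` as an illustrative helper name:

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for GNAT on the 13-species benchmark.
gnat_f = f_measure(0.908, 0.738)
print(round(100 * gnat_f, 1))
```

The result rounds to the 81.4% given in the abstract.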
Affiliation(s)
- Jörg Hakenberg
  - Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA.
|