1
Szekér S, Fogarassy G, Vathy-Fogarassy Á. A general text mining method to extract echocardiography measurement results from echocardiography documents. Artif Intell Med 2023; 143:102584. [PMID: 37673570 DOI: 10.1016/j.artmed.2023.102584] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2022] [Revised: 03/08/2023] [Accepted: 05/16/2023] [Indexed: 09/08/2023]
Abstract
BACKGROUND In everyday medical practice, the results of cardiac ultrasound examinations are generally recorded in unstructured text, from which extracting relevant information is an important and challenging task. This paper presents a generally applicable, language- and corpus-independent text mining method for extracting and structuring numerical measurement results and their descriptions from echocardiography reports. METHOD The developed method is based on generally applicable text mining preprocessing activities. It automatically identifies and standardizes the descriptions of the cardiac ultrasound measures, and it stores the extracted and standardized measurement descriptions with their measurement results in a structured form for later use. The method does not contain any regular expression-based search and does not rely on information about the structure of the document. RESULTS The method has been tested on a document set containing more than 20,000 echocardiographic reports by examining the efficiency of extracting 12 echocardiography parameters considered important by experts. The method extracted and structured the echocardiography parameters under study with good sensitivity (lowest value: 0.775, highest value: 1.0, average: 0.904) and excellent specificity (1.0 in all cases). The F1 score ranged between 0.873 and 1.0, and its average value was 0.948. CONCLUSION The presented case study has shown that the proposed method can extract measurement results from echocardiography documents with high confidence without performing a direct search or having detailed information about the data recording habits. Furthermore, it effectively handles spelling errors, abbreviations and the highly varied terminology used in descriptions. As it does not rely on any information related to the structure or the language of the documents or data recording habits, it can be applied to processing any free-text medical documents.
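The reported figures can be cross-checked with a small sketch. It assumes, as an inference not stated in the abstract, that a specificity of 1.0 means no false positives, so precision is also 1.0 whenever at least one true positive exists. Under that assumption F1 depends only on sensitivity (recall). The function name `f1_score` is illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumption: specificity 1.0 implies zero false positives, hence precision 1.0.
assumed_precision = 1.0

# Lowest and highest sensitivity values quoted in the abstract.
lowest_f1 = f1_score(assumed_precision, 0.775)
highest_f1 = f1_score(assumed_precision, 1.0)

print(round(lowest_f1, 3), round(highest_f1, 3))
```

Under this assumption the lowest sensitivity (0.775) gives an F1 of about 0.873, which agrees with the lower bound of the F1 range quoted above.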
Affiliation(s)
- Szabolcs Szekér
  - Department of Computer Science and Systems Technology, University of Pannonia, Veszprém, Hungary
- György Fogarassy
  - 1st Department of Cardiology, State Hospital for Cardiology, Balatonfüred, Hungary
- Ágnes Vathy-Fogarassy
  - Department of Computer Science and Systems Technology, University of Pannonia, Veszprém, Hungary
2
Gill SK, Karwath A, Uh HW, Cardoso VR, Gu Z, Barsky A, Slater L, Acharjee A, Duan J, Dall'Olio L, el Bouhaddani S, Chernbumroong S, Stanbury M, Haynes S, Asselbergs FW, Grobbee DE, Eijkemans MJC, Gkoutos GV, Kotecha D. Artificial intelligence to enhance clinical value across the spectrum of cardiovascular healthcare. Eur Heart J 2023; 44:713-725. [PMID: 36629285 PMCID: PMC9976986 DOI: 10.1093/eurheartj/ehac758] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2022] [Revised: 11/22/2022] [Accepted: 12/05/2022] [Indexed: 01/12/2023] Open
Abstract
Artificial intelligence (AI) is increasingly being utilized in healthcare. This article provides clinicians and researchers with a step-wise foundation for high-value AI that can be applied to a variety of different data modalities. The aim is to improve the transparency and application of AI methods, with the potential to benefit patients in routine cardiovascular care. Following a clear research hypothesis, an AI-based workflow begins with data selection and pre-processing prior to analysis, with the type of data (structured, semi-structured, or unstructured) determining what type of pre-processing steps and machine-learning algorithms are required. Algorithmic and data validation should be performed to ensure the robustness of the chosen methodology, followed by an objective evaluation of performance. Seven case studies are provided to highlight the wide variety of data modalities and clinical questions that can benefit from modern AI techniques, with a focus on applying them to cardiovascular disease management. Despite the growing use of AI, further education for healthcare workers, researchers, and the public is needed to aid understanding of how AI works and to close the existing gap in knowledge. In addition, issues regarding data access, sharing, and security must be addressed to ensure full engagement by patients and the public. The application of AI within healthcare provides an opportunity for clinicians to deliver a more personalized approach to medical care by accounting for confounders, interactions, and the rising prevalence of multi-morbidity.
Affiliation(s)
- Simrat K Gill
  - Institute of Cardiovascular Sciences, University of Birmingham, Vincent Drive, B15 2TT Birmingham, UK
  - Health Data Research UK Midlands, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Andreas Karwath
  - Health Data Research UK Midlands, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
  - Institute of Cancer and Genomic Sciences, University of Birmingham, Vincent Drive, B15 2TT Birmingham, UK
- Hae-Won Uh
  - Julius Center for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht, The Netherlands
- Victor Roth Cardoso
  - Institute of Cardiovascular Sciences, University of Birmingham, Vincent Drive, B15 2TT Birmingham, UK
  - Health Data Research UK Midlands, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
  - Institute of Cancer and Genomic Sciences, University of Birmingham, Vincent Drive, B15 2TT Birmingham, UK
- Zhujie Gu
  - Julius Center for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht, The Netherlands
- Andrey Barsky
  - Health Data Research UK Midlands, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
  - Institute of Cancer and Genomic Sciences, University of Birmingham, Vincent Drive, B15 2TT Birmingham, UK
- Luke Slater
  - Health Data Research UK Midlands, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
  - Institute of Cancer and Genomic Sciences, University of Birmingham, Vincent Drive, B15 2TT Birmingham, UK
- Animesh Acharjee
  - Health Data Research UK Midlands, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
  - Institute of Cancer and Genomic Sciences, University of Birmingham, Vincent Drive, B15 2TT Birmingham, UK
- Jinming Duan
  - School of Computer Science, University of Birmingham, Birmingham, UK
  - Alan Turing Institute, London, UK
- Lorenzo Dall'Olio
  - Department of Physics and Astronomy, University of Bologna, Bologna, Italy
- Said el Bouhaddani
  - Julius Center for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht, The Netherlands
- Saisakul Chernbumroong
  - Health Data Research UK Midlands, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
  - Institute of Cancer and Genomic Sciences, University of Birmingham, Vincent Drive, B15 2TT Birmingham, UK
- Folkert W Asselbergs
  - Amsterdam University Medical Center, Department of Cardiology, University of Amsterdam, Amsterdam, The Netherlands
  - Health Data Research UK and Institute of Health Informatics, University College London, London, UK
- Diederick E Grobbee
  - Julius Center for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht, The Netherlands
- Marinus J C Eijkemans
  - Julius Center for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht, The Netherlands
- Georgios V Gkoutos
  - Health Data Research UK Midlands, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
  - Institute of Cancer and Genomic Sciences, University of Birmingham, Vincent Drive, B15 2TT Birmingham, UK
- Dipak Kotecha
  - Institute of Cardiovascular Sciences, University of Birmingham, Vincent Drive, B15 2TT Birmingham, UK
  - Health Data Research UK Midlands, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
  - Department of Cardiology, Division Heart and Lungs, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
3
Yew ANJ, Schraagen M, Otte WM, van Diessen E. Transforming epilepsy research: A systematic review on natural language processing applications. Epilepsia 2023; 64:292-305. [PMID: 36462150 PMCID: PMC10108221 DOI: 10.1111/epi.17474] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 07/11/2022] [Revised: 11/23/2022] [Accepted: 12/01/2022] [Indexed: 12/05/2022]
Abstract
Despite improved ancillary investigations in epilepsy care, patients' narratives remain indispensable for diagnosis and treatment monitoring. This wealth of information is typically stored in electronic health records and accumulated in medical journals in an unstructured manner, thereby restricting complete utilization in clinical decision-making. To this end, clinical researchers increasingly apply natural language processing (NLP), a branch of artificial intelligence, as it removes ambiguity, derives context, and imbues standardized meaning from free-narrative clinical texts. This systematic review presents an overview of the current NLP applications in epilepsy and discusses the opportunities and drawbacks of NLP alongside its future implications. We searched the PubMed and Embase databases with a "natural language processing" and "epilepsy" query (March 4, 2022) and included original research articles describing the application of NLP techniques for textual analysis in epilepsy. Twenty-six studies were included. Fifty-eight percent of these studies used NLP to classify clinical records into predefined categories, improving patient identification and treatment decisions. Other applications of NLP included structured clinical information retrieval from electronic health records, scientific papers, and online posts of patients. Challenges and opportunities of NLP applications for enhancing epilepsy care and research are discussed. The field could further benefit from NLP by replicating successes in other health care domains, such as NLP-aided quality evaluation for clinical decision-making, outcome prediction, and clinical record summarization.
Affiliation(s)
- Arister N J Yew
  - University College Utrecht, Utrecht University, Utrecht, The Netherlands
- Marijn Schraagen
  - Department of Information and Computing Sciences, Faculty of Science, Utrecht University, Utrecht, The Netherlands
- Willem M Otte
  - Department of Child Neurology, Brain Center, University Medical Center Utrecht and Utrecht University, Utrecht, The Netherlands
- Eric van Diessen
  - Department of Child Neurology, Brain Center, University Medical Center Utrecht and Utrecht University, Utrecht, The Netherlands
4
van Es B, Reteig LC, Tan SC, Schraagen M, Hemker MM, Arends SRS, Rios MAR, Haitjema S. Negation detection in Dutch clinical texts: an evaluation of rule-based and machine learning methods. BMC Bioinformatics 2023; 24:10. [PMID: 36624385 PMCID: PMC9830789 DOI: 10.1186/s12859-022-05130-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2022] [Accepted: 12/30/2022] [Indexed: 01/11/2023] Open
Abstract
When developing models for clinical information retrieval and decision support systems, the discrete outcomes required for training are often missing. These labels need to be extracted from free text in electronic health records. For this extraction process, one of the most important contextual properties in clinical text is negation, which indicates the absence of findings. We aimed to improve large-scale extraction of labels by comparing three methods for negation detection in Dutch clinical notes. We used the Erasmus Medical Center Dutch Clinical Corpus to compare a rule-based method based on ContextD, a biLSTM model using MedCAT, and (finetuned) RoBERTa-based models. We found that both the biLSTM and RoBERTa models consistently outperform the rule-based model in terms of F1 score, precision and recall. In addition, we systematically categorized the classification errors for each model, which can be used to further improve model performance in particular applications. Combining the three models naively was not beneficial in terms of performance. We conclude that the biLSTM and RoBERTa-based models in particular are highly accurate in detecting clinical negations, but that ultimately all three approaches can be viable depending on the use case at hand.
Affiliation(s)
- Bram van Es
  - Central Diagnostic Laboratory, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands; MedxAI, Amsterdam, The Netherlands
- Leon C. Reteig
  - Center for Translational Immunology, University Medical Center Utrecht, Utrecht, The Netherlands
- Sander C. Tan
  - Department for Research & Data Technology, University Medical Center Utrecht, Utrecht, The Netherlands
- Marijn Schraagen
  - Institute for Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands
- Myrthe M. Hemker
  - Utrecht Institute of Linguistics OTS & Department of Languages, Literature and Communication, Utrecht University, Utrecht, The Netherlands
- Sebastiaan R. S. Arends
  - Department of Medical Informatics, University of Amsterdam, Amsterdam, The Netherlands
- Miguel A. R. Rios
  - Centre for Translation Studies, University of Vienna, Vienna, Austria
- Saskia Haitjema
  - Central Diagnostic Laboratory, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
5
Pezanowski S, Mitra P, MacEachren AM. Exploring Descriptions of Movement Through Geovisual Analytics. KN - Journal of Cartography and Geographic Information 2022; 72:5-27. [PMID: 35229072 PMCID: PMC8866112 DOI: 10.1007/s42489-022-00098-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2021] [Accepted: 01/31/2022] [Indexed: 11/26/2022]
Abstract
Sensemaking using automatically extracted information from text is a challenging problem. In this paper, we address a specific type of information extraction, namely extracting information related to descriptions of movement. Aggregating and understanding information related to descriptions of movement and lack of movement specified in text can lead to an improved understanding and sensemaking of movement phenomena of various types, e.g., migration of people and animals, impediments to travel due to COVID-19, etc. We present GeoMovement, a system that is based on combining machine learning and rule-based extraction of movement-related information with state-of-the-art visualization techniques. Along with the depiction of movement, our tool can extract and present a lack of movement. Very little prior work exists on automatically extracting descriptions of movement, especially negation and movement. Apart from addressing these, GeoMovement also provides a novel integrated framework for combining these extraction modules with visualization. We include two systematic case studies of GeoMovement that show how humans can derive meaningful geographic movement information. GeoMovement can complement precise movement data, e.g., obtained using sensors, or be used by itself when precise data is unavailable.
Affiliation(s)
- Scott Pezanowski
  - Information Sciences and Technology, The Pennsylvania State University, Westgate Building, University Park, PA 16802 USA
- Prasenjit Mitra
  - Information Sciences and Technology, The Pennsylvania State University, Westgate Building, University Park, PA 16802 USA
- Alan M. MacEachren
  - Information Sciences and Technology, The Pennsylvania State University, Westgate Building, University Park, PA 16802 USA
  - Department of Geography, The Pennsylvania State University, Walker Building, University Park, PA 16802 USA
6
Slater LT, Karwath A, Hoehndorf R, Gkoutos GV. Effects of Negation and Uncertainty Stratification on Text-Derived Patient Profile Similarity. Front Digit Health 2021; 3:781227. [PMID: 34939069 PMCID: PMC8685209 DOI: 10.3389/fdgth.2021.781227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2021] [Accepted: 11/12/2021] [Indexed: 11/13/2022] Open
Abstract
Semantic similarity is a useful approach for comparing patient phenotypes, and has the potential to be an effective method for exploiting text-derived phenotypes for differential diagnosis, text and document classification, and outcome prediction. While approaches for context disambiguation are commonly used in text mining applications, forming a standard component of information extraction pipelines, their effects on semantic similarity calculations have not been widely explored. In this work, we evaluate how inclusion and exclusion of negated and uncertain mentions of concepts from text-derived phenotypes affect similarity of patients, and the use of those profiles to predict diagnosis. We report on the effectiveness of these approaches and report a very small, yet significant, improvement in performance when classifying primary diagnosis over MIMIC-III patient visits.
Affiliation(s)
- Luke T Slater
  - Centre for Computational Biology, College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, United Kingdom; Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, Birmingham, United Kingdom; University Hospitals Birmingham National Health Service Foundation Trust, Birmingham, United Kingdom; MRC Health Data Research UK (HDR UK) Midlands, Birmingham, United Kingdom
- Andreas Karwath
  - Centre for Computational Biology, College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, United Kingdom; Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, Birmingham, United Kingdom; University Hospitals Birmingham National Health Service Foundation Trust, Birmingham, United Kingdom; MRC Health Data Research UK (HDR UK) Midlands, Birmingham, United Kingdom
- Robert Hoehndorf
  - Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Georgios V Gkoutos
  - Centre for Computational Biology, College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, United Kingdom; Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, Birmingham, United Kingdom; University Hospitals Birmingham National Health Service Foundation Trust, Birmingham, United Kingdom; MRC Health Data Research UK (HDR UK) Midlands, Birmingham, United Kingdom; National Institute for Health Research Experimental Cancer Medicine Centre, Birmingham, United Kingdom; National Institute for Health Research Surgical Reconstruction and Microbiology Research Centre, Birmingham, United Kingdom; National Institute for Health Research Biomedical Research Centre, Birmingham, United Kingdom
7
Slater K, Williams JA, Karwath A, Fanning H, Ball S, Schofield PN, Hoehndorf R, Gkoutos GV. Multi-faceted semantic clustering with text-derived phenotypes. Comput Biol Med 2021; 138:104904. [PMID: 34600327 PMCID: PMC8573608 DOI: 10.1016/j.compbiomed.2021.104904] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/23/2021] [Revised: 09/22/2021] [Accepted: 09/23/2021] [Indexed: 02/03/2023]
Abstract
Identification of ontology concepts in clinical narrative text enables the creation of phenotype profiles that can be associated with clinical entities, such as patients or drugs. Constructing patient phenotype profiles using formal ontologies enables their analysis via semantic similarity, in turn enabling the use of background knowledge in clustering or classification analyses. However, traditional semantic similarity approaches collapse complex relationships between patient phenotypes into a unitary similarity score for each pair of patients. Moreover, single scores may be based only on matching terms with the greatest information content (IC), ignoring other dimensions of patient similarity. This process necessarily leads to a loss of information in the resulting representation of patient similarity, and is especially apparent when using very large text-derived and highly multi-morbid phenotype profiles. Moreover, it renders finding a biological explanation for similarity very difficult: the black box problem. In this article, we explore the generation of multiple semantic similarity scores for patients based on different facets of their phenotypic manifestation, which we define through different sub-graphs in the Human Phenotype Ontology. We further present a new methodology for deriving sets of qualitative class descriptions for groups of entities described by ontology terms. Leveraging this strategy to obtain meaningful explanations for our semantic clusters alongside other evaluation techniques, we show that semantic clustering with ontology-derived facets enables the representation, and thus identification, of clinically relevant phenotype relationships not easily recoverable using overall clustering alone. In this way, we demonstrate the potential of faceted semantic clustering for gaining a deeper and more nuanced understanding of text-derived patient phenotypes.
Affiliation(s)
- Karin Slater
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, UK; Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, UK; MRC Health Data Research UK (HDR UK) Midlands, UK; University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
- John A Williams
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, UK; Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, UK; University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
- Andreas Karwath
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, UK; Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, UK; MRC Health Data Research UK (HDR UK) Midlands, UK; University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
- Hilary Fanning
  - Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, UK; University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
- Simon Ball
  - Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, UK; University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
- Paul N Schofield
  - Dept of Physiology, Development, and Neuroscience, University of Cambridge, UK
- Robert Hoehndorf
  - Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Saudi Arabia
- Georgios V Gkoutos
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, UK; Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, UK; NIHR Experimental Cancer Medicine Centre, UK; NIHR Surgical Reconstruction and Microbiology Research Centre, UK; NIHR Biomedical Research Centre, UK; MRC Health Data Research UK (HDR UK) Midlands, UK; University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
8
Slater LT, Bradlow W, Ball S, Hoehndorf R, Gkoutos GV. Improved characterisation of clinical text through ontology-based vocabulary expansion. J Biomed Semantics 2021; 12:7. [PMID: 33845909 PMCID: PMC8042947 DOI: 10.1186/s13326-021-00241-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/17/2020] [Accepted: 03/18/2021] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND Biomedical ontologies contain a wealth of metadata that constitutes a fundamental infrastructural resource for text mining. For several reasons, redundancies exist in the ontology ecosystem, which lead to the same entities being described by several concepts in the same or similar contexts across several ontologies. While these concepts describe the same entities, they contain different sets of complementary metadata. Linking these definitions to make use of their combined metadata could lead to improved performance in ontology-based information retrieval, extraction, and analysis tasks. RESULTS We develop and present an algorithm that expands the set of labels associated with an ontology class using a combination of strict lexical matching and cross-ontology reasoner-enabled equivalency queries. Across all disease terms in the Disease Ontology, the approach found 51,362 additional labels, more than tripling the number defined by the ontology itself. Manual validation by a clinical expert on a random sample of expanded synonyms over the Human Phenotype Ontology yielded a precision of 0.912. Furthermore, we found that annotating patient visits in MIMIC-III with an extended set of Disease Ontology labels led to the semantic similarity score derived from those labels being a significantly better predictor of matching first diagnosis, with a mean average precision of 0.88 for the unexpanded set of annotations, and 0.913 for the expanded set. CONCLUSIONS Inter-ontology synonym expansion can lead to a vast increase in the scale of vocabulary available for text mining applications. While the accuracy of the extended vocabulary is not perfect, it nevertheless led to a significantly improved ontology-based characterisation of patients from text in one setting. Furthermore, where run-on error is not acceptable, the technique can be used to provide candidate synonyms which can be checked by a domain expert.
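The lexical-matching half of the idea can be illustrated with a toy sketch. It is not the authors' algorithm: it omits the reasoner-enabled equivalency queries entirely, performs a single pass with no transitive closure, and uses invented class identifiers (`T:1`, `O:9`) and a hypothetical `expand_labels` helper. Ontologies are modelled simply as dictionaries from class ID to a set of labels.

```python
from typing import Dict, List, Set

Ontology = Dict[str, Set[str]]  # class ID -> labels (toy representation)


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so that matching stays strict but case-insensitive."""
    return " ".join(label.lower().split())


def expand_labels(target: Ontology, others: List[Ontology]) -> Ontology:
    """Add labels from classes in other ontologies that share a normalized label."""
    # Index every class in the other ontologies by each of its normalized labels.
    index: Dict[str, List[Set[str]]] = {}
    for ontology in others:
        for labels in ontology.values():
            for label in labels:
                index.setdefault(normalize(label), []).append(labels)

    expanded: Ontology = {}
    for class_id, labels in target.items():
        merged = {normalize(label) for label in labels}
        for label in labels:
            for matched in index.get(normalize(label), []):
                merged.update(normalize(m) for m in matched)
        expanded[class_id] = merged
    return expanded


# Hypothetical toy data, not taken from the Disease Ontology.
target = {"T:1": {"Myocardial infarction"}}
other = {"O:9": {"myocardial infarction", "heart attack"}, "O:10": {"asthma"}}
result = expand_labels(target, [other])
print(sorted(result["T:1"]))
```

In this toy case the target class gains "heart attack" as a candidate synonym because both classes share the label "myocardial infarction". As the abstract notes, such candidates may still need checking by a domain expert.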
Affiliation(s)
- Luke T. Slater
  - Institute of Cancer and Genomic Sciences, College of Medical and Dental Sciences, University of Birmingham, Birmingham, B15 2TT UK
  - University Hospitals Birmingham NHS Foundation Trust, University of Birmingham, Birmingham, B15 2TT UK
- William Bradlow
  - Institute of Cancer and Genomic Sciences, College of Medical and Dental Sciences, University of Birmingham, Birmingham, B15 2TT UK
  - University Hospitals Birmingham NHS Foundation Trust, University of Birmingham, Birmingham, B15 2TT UK
- Simon Ball
  - Institute of Cancer and Genomic Sciences, College of Medical and Dental Sciences, University of Birmingham, Birmingham, B15 2TT UK
  - University Hospitals Birmingham NHS Foundation Trust, University of Birmingham, Birmingham, B15 2TT UK
- Robert Hoehndorf
  - Computational Bioscience Research Centre, KAUST, Thuwal, Saudi Arabia
- Georgios V Gkoutos
  - Institute of Cancer and Genomic Sciences, College of Medical and Dental Sciences, University of Birmingham, Birmingham, B15 2TT UK
  - University Hospitals Birmingham NHS Foundation Trust, University of Birmingham, Birmingham, B15 2TT UK
  - NIHR Experimental Cancer Medicine Centre, University of Birmingham, Birmingham, B15 2TT UK
  - NIHR Surgical Reconstruction and Microbiology Research Centre, University of Birmingham, Birmingham, B15 2TT UK
  - NIHR Biomedical Research Centre, University of Birmingham, Birmingham, B15 2TT UK
  - MRC Health Data Research (HDR), Birmingham, UK