1
Wu H, Wang M, Wu J, Francis F, Chang YH, Shavick A, Dong H, Poon MTC, Fitzpatrick N, Levine AP, Slater LT, Handy A, Karwath A, Gkoutos GV, Chelala C, Shah AD, Stewart R, Collier N, Alex B, Whiteley W, Sudlow C, Roberts A, Dobson RJB. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. NPJ Digit Med 2022; 5:186. [PMID: 36544046] [PMCID: PMC9770568] [DOI: 10.1038/s41746-022-00730-6]
Abstract
Much of the knowledge and information needed for enabling high-quality clinical research is stored in free-text format. Natural language processing (NLP) has been used to extract information from these sources at scale for several decades. This paper aims to present a comprehensive review of clinical NLP for the past 15 years in the UK to identify the community, depict its evolution, analyse methodologies and applications, and identify the main barriers. We collect a dataset of clinical NLP projects (n = 94; total funding £41.97 m) funded by UK funders or the European Union's funding programmes. Additionally, we extract details on 9 funders, 137 organisations, 139 persons and 431 research papers. Networks are created from timestamped data interlinking all entities, and network analysis is subsequently applied to generate insights. 431 publications are identified as part of a literature review, of which 107 are eligible for final analysis. Results show, not surprisingly, that clinical NLP in the UK has increased substantially in the last 15 years: the total budget in the period of 2019-2022 was 80 times that of 2007-2010. However, further effort is required to deepen areas such as disease (sub-)phenotyping and broaden application domains. There is also a need to improve links between academia and industry and enable deployments in real-world settings for the realisation of clinical NLP's great potential in care delivery. The major barriers include research and development access to hospital data, lack of capable computational resources in the right places, the scarcity of labelled data and barriers to sharing of pretrained models.
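The network-building step described in this abstract lends itself to a small illustration. Below is a minimal sketch, assuming a simple list of timestamped project-organisation links extracted from a funding dataset; the field names and example values are hypothetical and not taken from the survey itself.

```python
# Minimal sketch of the survey's network-analysis step: build a graph that
# interlinks funded projects with the organisations involved, then read off
# simple structural statistics. Field names and values are hypothetical.
import networkx as nx

links = [
    {"project": "P001", "org": "University College London", "year": 2019},
    {"project": "P001", "org": "King's College London", "year": 2019},
    {"project": "P002", "org": "University of Edinburgh", "year": 2021},
]

G = nx.Graph()
for link in links:
    G.add_node(link["project"], kind="project", year=link["year"])
    G.add_node(link["org"], kind="organisation")
    G.add_edge(link["project"], link["org"], year=link["year"])

# Example insights: which organisations sit on the most project links,
# and how connected the collaboration network is overall.
degree_by_org = {n: d for n, d in G.degree() if G.nodes[n]["kind"] == "organisation"}
print(sorted(degree_by_org.items(), key=lambda kv: kv[1], reverse=True))
print(nx.number_connected_components(G))
```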
Affiliation(s)
- Honghan Wu
- Institute of Health Informatics, University College London, London, UK.
- Minhong Wang
- Institute of Health Informatics, University College London, London, UK
- Jinge Wu
- Institute of Health Informatics, University College London, London, UK
- Usher Institute, University of Edinburgh, Edinburgh, UK
- Farah Francis
- Usher Institute, University of Edinburgh, Edinburgh, UK
- Yun-Hsuan Chang
- Institute of Health Informatics, University College London, London, UK
- Alex Shavick
- Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
- Hang Dong
- Usher Institute, University of Edinburgh, Edinburgh, UK
- Department of Computer Science, University of Oxford, Oxford, UK
- Adam P Levine
- Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
- Luke T Slater
- Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
- Alex Handy
- Institute of Health Informatics, University College London, London, UK
- University College London Hospitals NHS Trust, London, UK
- Andreas Karwath
- Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
- Georgios V Gkoutos
- Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
- Claude Chelala
- Centre for Tumour Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
- Anoop Dinesh Shah
- Institute of Health Informatics, University College London, London, UK
- Robert Stewart
- Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), King's College London, London, UK
- South London and Maudsley NHS Foundation Trust, London, UK
- Nigel Collier
- Theoretical and Applied Linguistics, Faculty of Modern & Medieval Languages & Linguistics, University of Cambridge, Cambridge, UK
- Beatrice Alex
- Edinburgh Futures Institute, University of Edinburgh, Edinburgh, UK
- Cathie Sudlow
- Usher Institute, University of Edinburgh, Edinburgh, UK
- Angus Roberts
- Department of Biostatistics & Health Informatics, King's College London, London, UK
- Richard J B Dobson
- Institute of Health Informatics, University College London, London, UK
- Department of Biostatistics & Health Informatics, King's College London, London, UK
4
Automating Identification of Multiple Chronic Conditions in Clinical Practice Guidelines. AMIA Jt Summits Transl Sci Proc 2015; 2015:456-60. [PMID: 26306285] [PMCID: PMC4525235]
Abstract
Many clinical practice guidelines (CPGs) are intended to provide evidence-based guidance to clinicians on a single disease, and are frequently considered inadequate when caring for patients with multiple chronic conditions (MCC), i.e. two or more chronic conditions. It is unclear to what degree disease-specific CPGs provide guidance about MCC. In this study, we develop a method for extracting knowledge from single-disease chronic condition CPGs to determine how frequently they mention commonly co-occurring chronic diseases. We focus on 15 highly prevalent chronic conditions. We use publicly available resources, including a repository of guideline summaries from the National Guideline Clearinghouse to build a text corpus, a data dictionary of ICD-9 codes from the Medicare Chronic Conditions Data Warehouse (CCW) to construct an initial list of disease terms, and disease synonyms from the National Center for Biomedical Ontology to enhance the list of disease terms. First, for each disease guideline, we determined the frequency of comorbid condition mentions (a disease-comorbidity pair) by exactly matching disease synonyms in the text corpus. Then, we developed an annotated reference standard using a sample subset of guidelines. We used this reference standard to evaluate our approach. Then, we compared the co-prevalence of common pairs of chronic conditions from Medicare CCW data to the frequency of disease-comorbidity pairs in CPGs. Our results show that some disease-comorbidity pairs occur more frequently in CPGs than others. Sixty-one (29.0%) of 210 possible disease-comorbidity pairs occurred zero times; for example, no guideline on chronic kidney disease mentioned depression, while heart failure guidelines mentioned ischemic heart disease most frequently. Our method adequately identifies comorbid chronic conditions in CPG recommendations with precision 0.82, recall 0.75, and F-measure 0.78. Our work identifies knowledge currently embedded in the free text of clinical practice guideline recommendations and provides an initial view of the extent to which CPGs mention common comorbid conditions. Knowledge extracted from CPG text in this way may be useful to inform gaps in guideline recommendations regarding MCC and therefore identify potential opportunities for guideline improvement.
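The matching and evaluation steps described in this abstract are simple to sketch. The following is a minimal illustration, assuming small hand-made synonym lists and guideline snippets (all values are invented): it exact-matches synonyms on word boundaries to count disease-comorbidity pairs and scores the predicted pairs against a reference set, analogous to the study's annotated standard.

```python
# Minimal sketch: count comorbid-condition mentions in guideline text by exact
# synonym matching, then compute precision/recall/F1 against a reference set.
# Synonym lists, guideline snippets and the reference set are illustrative only.
import re

synonyms = {
    "depression": ["depression", "major depressive disorder"],
    "ischemic heart disease": ["ischemic heart disease", "coronary artery disease"],
}

guidelines = {
    "heart failure": "Assess for coronary artery disease and depression at follow-up.",
    "chronic kidney disease": "Monitor eGFR and blood pressure regularly.",
}

def pair_counts(guidelines, synonyms):
    """Count case-insensitive, word-boundary synonym matches per guideline."""
    counts = {}
    for disease, text in guidelines.items():
        for comorbidity, terms in synonyms.items():
            n = sum(len(re.findall(r"\b" + re.escape(t) + r"\b", text, re.I)) for t in terms)
            counts[(disease, comorbidity)] = n
    return counts

def precision_recall_f1(predicted, reference):
    """Evaluate predicted disease-comorbidity pairs against a reference set."""
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

counts = pair_counts(guidelines, synonyms)
predicted = {pair for pair, n in counts.items() if n > 0}
reference = {("heart failure", "ischemic heart disease"), ("heart failure", "depression")}
print(counts)
print(precision_recall_f1(predicted, reference))
```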
7
Korhonen A, Séaghdha DO, Silins I, Sun L, Högberg J, Stenius U. Text mining for literature review and knowledge discovery in cancer risk assessment and research. PLoS One 2012; 7:e33427. [PMID: 22511921] [PMCID: PMC3325219] [DOI: 10.1371/journal.pone.0033427]
Abstract
Research in biomedical text mining is starting to produce technology which can make information in biomedical literature more accessible for bio-scientists. One of the current challenges is to integrate and refine this technology to support real-life scientific tasks in biomedicine, and to evaluate its usefulness in the context of such tasks. We describe CRAB - a fully integrated text mining tool designed to support chemical health risk assessment. This task is complex and time-consuming, requiring a thorough review of existing scientific data on a particular chemical. Covering human, animal, cellular and other mechanistic data from various fields of biomedicine, this is highly varied and therefore difficult to harvest from literature databases via manual means. Our tool automates the process by extracting relevant scientific data in published literature and classifying it according to multiple qualitative dimensions. Developed in close collaboration with risk assessors, the tool allows navigating the classified dataset in various ways and sharing the data with other users. We present a direct and user-based evaluation which shows that the technology integrated in the tool is highly accurate, and report a number of case studies which demonstrate how the tool can be used to support scientific discovery in cancer risk assessment and research. Our work demonstrates the usefulness of a text mining pipeline in facilitating complex research tasks in biomedicine. We discuss further development and application of our technology to other types of chemical risk assessment in the future.
Affiliation(s)
- Anna Korhonen
- Computer Laboratory, University of Cambridge, Cambridge, United Kingdom.
8
Liakata M, Saha S, Dobnik S, Batchelor C, Rebholz-Schuhmann D. Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics 2012; 28:991-1000. [PMID: 22321698] [PMCID: PMC3315721] [DOI: 10.1093/bioinformatics/bts071]
Abstract
Motivation: Scholarly biomedical publications report on the findings of a research investigation. Scientists use a well-established discourse structure to relate their work to the state of the art, express their own motivation and hypotheses and report on their methods, results and conclusions. In previous work, we have proposed ways to explicitly annotate the structure of scientific investigations in scholarly publications. Here we present the means to facilitate automatic access to the scientific discourse of articles by automating the recognition of 11 categories at the sentence level, which we call Core Scientific Concepts (CoreSCs). These include: Hypothesis, Motivation, Goal, Object, Background, Method, Experiment, Model, Observation, Result and Conclusion. CoreSCs provide the structure and context to all statements and relations within an article and their automatic recognition can greatly facilitate biomedical information extraction by characterizing the different types of facts, hypotheses and evidence available in a scientific publication. Results: We have trained and compared machine learning classifiers (support vector machines and conditional random fields) on a corpus of 265 full articles in biochemistry and chemistry to automatically recognize CoreSCs. We have evaluated our automatic classifications against a manually annotated gold standard, and have achieved promising accuracies with ‘Experiment’, ‘Background’ and ‘Model’ being the categories with the highest F1-scores (76%, 62% and 53%, respectively). We have analysed the task of CoreSC annotation both from a sentence classification as well as sequence labelling perspective and we present a detailed feature evaluation. The most discriminative features are local sentence features such as unigrams, bigrams and grammatical dependencies while features encoding the document structure, such as section headings, also play an important role for some of the categories. We discuss the usefulness of automatically generated CoreSCs in two biomedical applications as well as work in progress. Availability: A web-based tool for the automatic annotation of articles with CoreSCs and corresponding documentation is available online at http://www.sapientaproject.com/software. http://www.sapientaproject.com also contains detailed information pertaining to CoreSC annotation and links to annotation guidelines as well as a corpus of manually annotated articles, which served as our training data. Contact: liakata@ebi.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
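As a rough illustration of the sentence-classification view of CoreSC recognition, the sketch below trains a linear SVM over unigram and bigram features, two of the feature types the paper reports as most discriminative. It uses scikit-learn with invented example sentences and labels, and is not the authors' pipeline (no CRF, no dependency or section-heading features).

```python
# Simplified sketch of CoreSC recognition framed as sentence classification:
# a linear SVM over unigram/bigram counts. Training sentences and labels are
# invented for illustration; the real system uses many more features and data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = [
    "We hypothesised that the compound inhibits the kinase.",
    "Samples were incubated for 24 hours at 37 degrees.",
    "The assay showed a two-fold increase in activity.",
    "These findings suggest a role for the pathway in repair.",
]
labels = ["Hypothesis", "Experiment", "Result", "Conclusion"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),  # unigram + bigram features
    LinearSVC(),
)
model.fit(sentences, labels)
print(model.predict(["Cells were treated with the inhibitor for one hour."]))
```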
Affiliation(s)
- Maria Liakata
- Department of Computer Science, Aberystwyth University, Aberystwyth, Ceredigion, SY23 3DB, UK.
9
Liakata M, Kim JH, Saha S, Hastings J, Rebholz-Schuhmann D. Three hybrid classifiers for the detection of emotions in suicide notes. Biomed Inform Insights 2012; 5:175-84. [PMID: 22879774] [PMCID: PMC3409474] [DOI: 10.4137/bii.s8967]
Abstract
We describe our approach for creating a system able to detect emotions in suicide notes. Motivated by the sparse and imbalanced data as well as the complex annotation scheme, we have considered three hybrid approaches for distinguishing between the different categories. Each of the three approaches combines machine learning with manually derived rules, where the latter target very sparse emotion categories. The first approach considers the task as single label multi-class classification, where an SVM and a CRF classifier are trained to recognise fifteen different categories and their results are combined. Our second approach trains individual binary classifiers (SVM and CRF) for each of the fifteen sentence categories and returns the union of their predictions as the final result. Finally, our third approach is a combination of binary and multi-class classifiers (SVM and CRF) trained on different subsets of the training data. We considered a number of different feature configurations. All three systems were tested on 300 unseen messages. Our second system had the best performance of the three, yielding an F1 score of 45.6% and a precision of 60.1%, whereas our best recall (43.6%) was obtained using the third system.
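The second approach, a union of per-category binary classifiers plus hand-written rules for sparse categories, can be sketched as follows. The classifiers are stubbed here as simple predicates standing in for the trained SVM/CRF models, and the categories, rules and example sentence are invented for illustration.

```python
# Minimal sketch of the "union of binary classifiers" idea: one binary decision
# per emotion category plus manually derived rules for very sparse categories,
# with the final label set taken as the union of everything that fires.
def union_of_binary_classifiers(sentence, classifiers, rules):
    """Return every category whose binary classifier or rule fires on the sentence."""
    labels = {cat for cat, clf in classifiers.items() if clf(sentence)}
    labels |= {cat for cat, rule in rules.items() if rule(sentence)}
    return labels

# Stand-ins for trained per-category classifiers (the paper trains SVMs and CRFs).
classifiers = {
    "hopelessness": lambda s: "no point" in s.lower(),
    "love": lambda s: "love" in s.lower(),
}
# Manually derived rules targeting very sparse categories.
rules = {
    "instructions": lambda s: s.lower().startswith("please"),
}

print(union_of_binary_classifiers("Please tell them I love them.", classifiers, rules))
# e.g. {'love', 'instructions'} (set order may vary)
```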
10
Guo Y, Korhonen A, Silins I, Stenius U. Weakly supervised learning of information structure of scientific abstracts--is it accurate enough to benefit real-world tasks in biomedicine? Bioinformatics 2011; 27:3179-85. [PMID: 21949269] [DOI: 10.1093/bioinformatics/btr536]
Abstract
MOTIVATION: Many practical tasks in biomedicine require accessing specific types of information in scientific literature; e.g. information about the methods, results or conclusions of the study in question. Several approaches have been developed to identify such information in scientific journal articles. The best of these have yielded promising results and proved useful for biomedical text mining tasks. However, relying on fully supervised machine learning (ML) and a large body of annotated data, existing approaches are expensive to develop and port to different tasks. A potential solution to this problem is to employ weakly supervised learning instead. In this article, we investigate a weakly supervised approach to identifying information structure according to a scheme called Argumentative Zoning (AZ). We apply four weakly supervised classifiers to biomedical abstracts and evaluate their performance both directly and in a real-life scenario in the context of cancer risk assessment. RESULTS: Our best weakly supervised classifier (based on the combination of active learning and self-training) performs well on the task, outperforming our best supervised classifier: it yields a high accuracy of 81% when just 10% of the labeled data is used for training. When cancer risk assessors are presented with the resulting annotated abstracts, they find relevant information in them significantly faster than when presented with unannotated abstracts. These results suggest that weakly supervised learning could be used to improve the practical usefulness of information structure for real-life tasks in biomedicine.
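The self-training half of the weakly supervised setup can be sketched briefly. The example below uses scikit-learn with a logistic regression classifier and toy sentences, none of which come from the paper; the paper's best system also incorporates active learning, which is omitted here.

```python
# Minimal self-training sketch: repeatedly fit on the labeled pool, then move
# the single most confidently predicted unlabeled sentence into the pool with
# its predicted label. Classifier choice and sentences are illustrative only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled = ["we investigated the effect of exposure", "the results show a clear increase"]
labels = ["METHOD", "RESULT"]
unlabeled = ["we measured the response of the cells", "a significant increase was observed"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())

for _ in range(3):  # a few self-training rounds
    model.fit(labeled, labels)
    if not unlabeled:
        break
    probs = model.predict_proba(unlabeled)
    best = int(np.argmax(probs.max(axis=1)))  # most confident unlabeled sentence
    labels.append(model.classes_[int(np.argmax(probs[best]))])
    labeled.append(unlabeled.pop(best))

print(list(zip(labeled, labels)))
```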
Affiliation(s)
- Yufan Guo
- Computer Laboratory, University of Cambridge, Cambridge CB3 0FD, UK