1
Wu H, Wang M, Wu J, Francis F, Chang YH, Shavick A, Dong H, Poon MTC, Fitzpatrick N, Levine AP, Slater LT, Handy A, Karwath A, Gkoutos GV, Chelala C, Shah AD, Stewart R, Collier N, Alex B, Whiteley W, Sudlow C, Roberts A, Dobson RJB. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. NPJ Digit Med 2022; 5:186. [PMID: 36544046 PMCID: PMC9770568 DOI: 10.1038/s41746-022-00730-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 06/20/2022] [Accepted: 11/29/2022] [Indexed: 12/24/2022] Open
Abstract
Much of the knowledge and information needed for enabling high-quality clinical research is stored in free-text format. Natural language processing (NLP) has been used to extract information from these sources at scale for several decades. This paper aims to present a comprehensive review of clinical NLP for the past 15 years in the UK to identify the community, depict its evolution, analyse methodologies and applications, and identify the main barriers. We collect a dataset of clinical NLP projects (n = 94; £ = 41.97 m) funded by UK funders or the European Union's funding programmes. Additionally, we extract details on 9 funders, 137 organisations, 139 persons and 431 research papers. Networks are created from timestamped data interlinking all entities, and network analysis is subsequently applied to generate insights. 431 publications are identified as part of a literature review, of which 107 are eligible for final analysis. Results show, not surprisingly, that clinical NLP in the UK has increased substantially in the last 15 years: the total budget in the period of 2019-2022 was 80 times that of 2007-2010. However, effort is required to deepen areas such as disease (sub-)phenotyping and to broaden application domains. There is also a need to improve links between academia and industry and to enable deployments in real-world settings for the realisation of clinical NLP's great potential in care delivery. The major barriers include research and development access to hospital data, lack of capable computational resources in the right places, the scarcity of labelled data and barriers to sharing of pretrained models.
Affiliation(s)
- Honghan Wu
- Institute of Health Informatics, University College London, London, UK.
- Minhong Wang
- Institute of Health Informatics, University College London, London, UK
- Jinge Wu
- Institute of Health Informatics, University College London, London, UK
- Usher Institute, University of Edinburgh, Edinburgh, UK
- Farah Francis
- Usher Institute, University of Edinburgh, Edinburgh, UK
- Yun-Hsuan Chang
- Institute of Health Informatics, University College London, London, UK
- Alex Shavick
- Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
- Hang Dong
- Usher Institute, University of Edinburgh, Edinburgh, UK
- Department of Computer Science, University of Oxford, Oxford, UK
- Adam P Levine
- Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
- Luke T Slater
- Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
- Alex Handy
- Institute of Health Informatics, University College London, London, UK
- University College London Hospitals NHS Trust, London, UK
- Andreas Karwath
- Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
- Georgios V Gkoutos
- Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
- Claude Chelala
- Centre for Tumour Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
- Anoop Dinesh Shah
- Institute of Health Informatics, University College London, London, UK
- Robert Stewart
- Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), King's College London, London, UK
- South London and Maudsley NHS Foundation Trust, London, UK
- Nigel Collier
- Theoretical and Applied Linguistics, Faculty of Modern & Medieval Languages & Linguistics, University of Cambridge, Cambridge, UK
- Beatrice Alex
- Edinburgh Futures Institute, University of Edinburgh, Edinburgh, UK
- Cathie Sudlow
- Usher Institute, University of Edinburgh, Edinburgh, UK
- Angus Roberts
- Department of Biostatistics & Health Informatics, King's College London, London, UK
- Richard J B Dobson
- Institute of Health Informatics, University College London, London, UK
- Department of Biostatistics & Health Informatics, King's College London, London, UK
2
Zafari H, Langlois S, Zulkernine F, Kosowan L, Singer A. AI in predicting COPD in the Canadian population. Biosystems 2021; 211:104585. [PMID: 34864143 DOI: 10.1016/j.biosystems.2021.104585] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/29/2021] [Revised: 11/17/2021] [Accepted: 11/23/2021] [Indexed: 12/12/2022]
Abstract
Chronic obstructive pulmonary disease (COPD) is a progressive lung disease that produces non-reversible airflow limitations. Approximately 10% of Canadians aged 35 years or older are living with COPD. Primary care is often the first contact an individual will have with the healthcare system, providing acute care, chronic disease management, and services aimed at health maintenance. This study used Electronic Medical Record (EMR) data from primary care clinics in seven provinces across Canada to develop predictive models to identify COPD in the Canadian population. The comprehensive nature of these primary care EMR data, which contain structured numeric, categorical, hybrid, and unstructured text data, enables the predictive models to capture symptoms of COPD and discriminate it from diseases with similar symptoms. We applied two supervised machine learning models, a Multilayer Neural Network (MLNN) model and an Extreme Gradient Boosting (XGB) model, to identify COPD patients. The XGB model achieved an accuracy of 86% in the test dataset compared to 83% achieved by the MLNN. Utilizing feature importance, we identified a set of key symptoms from the EMR for diagnosing COPD, which included medications, health conditions, risk factors, and patient age. Application of this XGB model to primary care structured EMR data can identify patients with COPD from others having similar chronic conditions for disease surveillance, and improve evidence-based care delivery.
Affiliation(s)
- Hasan Zafari
- School of Computing, Queen's University, Kingston, Ontario, Canada.
- Sarah Langlois
- School of Computing, Queen's University, Kingston, Ontario, Canada.
- Leanne Kosowan
- Department of Family Medicine, Max Rady College of Medicine, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, Manitoba, Canada.
- Alexander Singer
- Department of Family Medicine, Max Rady College of Medicine, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, Manitoba, Canada.
3
Alnazzawi N. Building a semantically annotated corpus for chronic disease complications using two document types. PLoS One 2021; 16:e0247319. [PMID: 33735207 PMCID: PMC7971867 DOI: 10.1371/journal.pone.0247319] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/11/2020] [Accepted: 02/04/2021] [Indexed: 11/19/2022] Open
Abstract
Narrative information in electronic health records (EHRs) contains a wealth of information related to patient health conditions. In addition, people use Twitter to express their experiences regarding personal health issues, such as medical complaints, symptoms, treatments, lifestyle, and other factors. Both genres of text include different types of health-related information concerning disease complications and risk factors. Knowing detailed information about controlling disease risk factors has a great impact on modifying these risks and subsequently preventing disease complications. Text-mining tools provide efficient solutions to extract and integrate vital information related to disease complications hidden in the large volume of the narrative text. However, the development of text-mining tools depends on the availability of an annotated corpus. In response, we have developed the PrevComp corpus, which is annotated with information relevant to the identification of disease complications, underlying risk factors, and prevention measures, in the context of the interaction between hypertension and diabetes. The corpus is unique and novel in terms of the very specific topic in the biomedical domain and as an integration of information from both EHRs and tweets collected from Twitter. The annotation scheme was designed with guidance by a domain expert, and two further domain experts performed the annotation, resulting in a high-quality annotation, with agreement rate F-scores as high as 0.60 and 0.75 for EHRs and tweets, respectively.
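The agreement figures above are reported as F-scores. As a rough illustration of how a pairwise agreement F-score can be computed, the sketch below treats one annotator's spans as the reference and scores the other against it under exact matching. The span offsets, the label names and the function name `agreement_f_score` are hypothetical and are not taken from the PrevComp corpus or its annotation scheme.

```python
def agreement_f_score(annotator_a, annotator_b):
    """Pairwise F-score between two annotation sets.

    Each annotation is a (start, end, label) tuple. Only exact matches
    count as agreement (an assumption; relaxed matching is also common).
    Annotator A is treated as the reference, B as the response.
    """
    a, b = set(annotator_a), set(annotator_b)
    if not a or not b:
        return 0.0
    matches = len(a & b)
    precision = matches / len(b)  # share of B's spans that A also marked
    recall = matches / len(a)     # share of A's spans that B also marked
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one document (offsets and labels are made up).
annotator_a = {
    (0, 12, "Complication"),
    (20, 31, "RiskFactor"),
    (40, 52, "Prevention"),
    (60, 70, "RiskFactor"),
}
annotator_b = {
    (0, 12, "Complication"),
    (20, 31, "RiskFactor"),
    (41, 52, "Prevention"),  # boundary differs by one character, so no match
}

score = agreement_f_score(annotator_a, annotator_b)
print(f"agreement F-score: {score:.3f}")
```

Because the F-score is symmetric in precision and recall, swapping which annotator is treated as the reference gives the same value.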
Affiliation(s)
- Noha Alnazzawi
- Department of Computer Science and Engineering, Royal Commission for Jubail and Yanbu, Yanbu University College, Yanbu Industrial City, Saudi Arabia
4
Ju M, Short AD, Thompson P, Bakerly ND, Gkoutos GV, Tsaprouni L, Ananiadou S. Annotating and detecting phenotypic information for chronic obstructive pulmonary disease. JAMIA Open 2020; 2:261-271. [PMID: 31984360 PMCID: PMC6951876 DOI: 10.1093/jamiaopen/ooz009] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/19/2018] [Revised: 02/21/2019] [Accepted: 03/19/2019] [Indexed: 12/29/2022] Open
Abstract
Objectives: Chronic obstructive pulmonary disease (COPD) phenotypes cover a range of lung abnormalities. To allow text mining methods to identify pertinent and potentially complex information about these phenotypes from textual data, we have developed a novel annotated corpus, which we use to train a neural network-based named entity recognizer to detect fine-grained COPD phenotypic information. Materials and methods: Since COPD phenotype descriptions often mention other concepts within them (proteins, treatments, etc.), our corpus annotations include both outermost phenotype descriptions and concepts nested within them. Our neural layered bidirectional long short-term memory conditional random field (BiLSTM-CRF) network firstly recognizes nested mentions, which are fed into subsequent BiLSTM-CRF layers, to help to recognize enclosing phenotype mentions. Results: Our corpus of 30 full papers (available at: http://www.nactem.ac.uk/COPD) is annotated by experts with 27 030 phenotype-related concept mentions, most of which are automatically linked to UMLS Metathesaurus concepts. When trained using the corpus, our BiLSTM-CRF network outperforms other popular approaches in recognizing detailed phenotypic information. Discussion: Information extracted by our method can facilitate efficient location and exploration of detailed information about phenotypes, for example, those specifically concerning reactions to treatments. Conclusion: The importance of our corpus for developing methods to extract fine-grained information about COPD phenotypes is demonstrated through its successful use to train a layered BiLSTM-CRF network to extract phenotypic information at various levels of granularity. The minimal human intervention needed for training should permit ready adaption to extracting phenotypic information about other diseases.
Affiliation(s)
- Meizhi Ju
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
- Andrea D Short
- Faculty of Biology, Medicine and Health, The University of Manchester, Manchester, UK
- Paul Thompson
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
- Nawar Diar Bakerly
- Salford Royal NHS Foundation Trust; and School of Health Sciences, The University of Manchester, Manchester, UK
- Georgios V Gkoutos
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, UK; Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK; MRC Health Data Research UK (HDR UK); NIHR Experimental Cancer Medicine Centre, Birmingham, UK; NIHR Surgical Reconstruction and Microbiology Research Centre, Birmingham, UK; NIHR Biomedical Research Centre, Birmingham, UK
- Loukia Tsaprouni
- School of Health Sciences, Centre for Life and Sport Sciences, Birmingham City University, Birmingham, UK
- Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
5
Kirk IK, Simon C, Banasik K, Holm PC, Haue AD, Jensen PB, Juhl Jensen L, Rodríguez CL, Pedersen MK, Eriksson R, Andersen HU, Almdal T, Bork-Jensen J, Grarup N, Borch-Johnsen K, Pedersen O, Pociot F, Hansen T, Bergholdt R, Rossing P, Brunak S. Linking glycemic dysregulation in diabetes to symptoms, comorbidities, and genetics through EHR data mining. eLife 2019; 8:44941. [PMID: 31818369 PMCID: PMC6904221 DOI: 10.7554/elife.44941] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 01/17/2019] [Accepted: 11/16/2019] [Indexed: 12/13/2022] Open
Abstract
Diabetes is a diverse and complex disease, with considerable variation in phenotypic manifestation and severity. This variation hampers the study of etiological differences and reduces the statistical power of analyses of associations to genetics, treatment outcomes, and complications. We address these issues through deep, fine-grained phenotypic stratification of a diabetes cohort. Text mining the electronic health records of 14,017 patients, we matched two controlled vocabularies (ICD-10 and a custom vocabulary developed at the clinical center Steno Diabetes Center Copenhagen) to clinical narratives spanning a 19-year period. The two matched vocabularies comprise over 20,000 medical terms describing symptoms, other diagnoses, and lifestyle factors. The cohort is genetically homogeneous (Caucasian diabetes patients from Denmark), so the resulting stratification is not driven by ethnic differences, but rather by inherently dissimilar progression patterns and lifestyle-related risk factors. Using unsupervised Markov clustering, we defined 71 clusters of at least 50 individuals within the diabetes spectrum. The clusters display both distinct and shared longitudinal glycemic dysregulation patterns, temporal co-occurrences of comorbidities, and associations to single nucleotide polymorphisms in or near genes relevant for diabetes comorbidities.
Affiliation(s)
- Isa Kristina Kirk
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
- Christian Simon
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
- Karina Banasik
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
- Peter Christoffer Holm
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
- Amalie Dahl Haue
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
- Peter Bjødstrup Jensen
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark; Odense Patient Data Explorative Network (OPEN), Odense University Hospital, Odense, Denmark
- Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
- Cristina Leal Rodríguez
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
- Mette Krogh Pedersen
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
- Robert Eriksson
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
- Thomas Almdal
- Steno Diabetes Center Copenhagen, Gentofte, Denmark; Department of Endocrinology, Rigshospitalet, Copenhagen, Denmark
- Jette Bork-Jensen
- Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen, Denmark
- Niels Grarup
- Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen, Denmark
- Oluf Pedersen
- Steno Diabetes Center Copenhagen, Gentofte, Denmark; Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen, Denmark
- Flemming Pociot
- Steno Diabetes Center Copenhagen, Gentofte, Denmark; Department of Clinical Medicine, Herlev-Gentofte Hospital, Herlev, Denmark
- Torben Hansen
- Steno Diabetes Center Copenhagen, Gentofte, Denmark; Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen, Denmark
- Peter Rossing
- Steno Diabetes Center Copenhagen, Gentofte, Denmark; Department of Clinical Medicine, University of Copenhagen, Copenhagen, Denmark
- Søren Brunak
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark; Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark
6
Haendel MA, McMurry JA, Relevo R, Mungall CJ, Robinson PN, Chute CG. A Census of Disease Ontologies. Annu Rev Biomed Data Sci 2018. [DOI: 10.1146/annurev-biodatasci-080917-013459] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 11/09/2022]
Abstract
For centuries, humans have sought to classify diseases based on phenotypic presentation and available treatments. Today, a wide landscape of strategies, resources, and tools exists to classify patients and diseases. Ontologies can provide a robust foundation of logic for precise stratification and classification along diverse axes such as etiology, development, treatment, and genetics. Disease and phenotype ontologies are used in four primary ways: (a) search, retrieval, and annotation of knowledge; (b) data integration and analysis; (c) clinical decision support; and (d) knowledge discovery. Computational inference can connect existing knowledge and generate new insights and hypotheses about drug targets, prognosis prediction, or diagnosis. In this review, we examine the rise of disease and phenotype ontologies and the diverse ways they are represented and applied in biomedicine.
Affiliation(s)
- Melissa A. Haendel
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, Oregon 97239, USA
- Linus Pauling Institute, Oregon State University, Corvallis, Oregon 97331, USA
- Julie A. McMurry
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, Oregon 97239, USA
- Rose Relevo
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, Oregon 97239, USA
- Christopher J. Mungall
- Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
- Christopher G. Chute
- School of Medicine, School of Public Health, and School of Nursing, Johns Hopkins University, Baltimore, Maryland 21205, USA
7
Ernecoff NC, Wessell KL, Gabriel S, Carey TS, Hanson LC. A Novel Screening Method to Identify Late-Stage Dementia Patients for Palliative Care Research and Practice. J Pain Symptom Manage 2018; 55:1152-1158.e1. [PMID: 29288881 PMCID: PMC6036617 DOI: 10.1016/j.jpainsymman.2017.12.480] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 11/06/2017] [Revised: 12/14/2017] [Accepted: 12/18/2017] [Indexed: 12/19/2022]
Abstract
CONTEXT: Investigators need novel methods for timely identification of patients with serious illness to test or implement new palliative care models. OBJECTIVES: The study's aim was to develop an electronic health record (EHR) phenotype to identify patients with late-stage dementia for a clinical trial of palliative care consultation. METHODS: We developed a computerized method to identify patients with dementia on hospital admission. Within a data warehouse derived from the hospital's EHR, we used search terms of age, admission date, and ICD-9 and ICD-10 diagnosis codes to create an EHR dementia phenotype, followed by brief medical record review to confirm late-stage dementia. We calculated positive predictive value, false discovery rate, and false negative rate of this novel screening method. RESULTS: The EHR phenotype screening method had a positive predictive value of 76.3% for dementia patients and 24.5% for late-stage dementia patients; a false discovery rate of 23.7% for dementia patients and 75.5% for late-stage dementia patients compared to physician assessment. The sensitivity of this screening method was 59.7% to identify hospitalized patients with dementia. Daily screening, including confirmatory chart reviews, averaged 20 minutes and was more feasible, efficient, and complete than manual screening. CONCLUSION: A novel method using an EHR phenotype plus brief medical record review is effective to identify hospitalized patients with late-stage dementia. In health care systems with similar clinical data warehouses, this method may be applied to serious illness populations to improve enrollment in clinical trials of palliative care or to facilitate access to palliative care services.
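The reported positive predictive values and false discovery rates are complementary (76.3% + 23.7% = 100% and 24.5% + 75.5% = 100%), which follows from the standard definitions. The sketch below shows those definitions on a confusion-matrix basis. The counts are hypothetical and chosen only for easy arithmetic; they are not the study's data, and the function name `screening_metrics` is an assumption for illustration.

```python
def screening_metrics(true_pos, false_pos, false_neg):
    """Standard screening metrics from confusion-matrix counts.

    PPV and FDR share the denominator (all flagged patients), so they sum to 1.
    Sensitivity and FNR share the denominator (all true cases), so they sum to 1.
    """
    flagged = true_pos + false_pos       # patients the screen flagged
    actual_cases = true_pos + false_neg  # patients who truly have the condition
    return {
        "ppv": true_pos / flagged,
        "fdr": false_pos / flagged,
        "sensitivity": true_pos / actual_cases,
        "fnr": false_neg / actual_cases,
    }


# Hypothetical counts (NOT taken from the study).
metrics = screening_metrics(true_pos=30, false_pos=10, false_neg=20)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```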
Affiliation(s)
- Natalie C Ernecoff
- Cecil G. Sheps Center for Health Services Research, University of North Carolina, Chapel Hill, North Carolina, USA; Department of Health Policy and Management, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, North Carolina, USA.
- Kathryn L Wessell
- Cecil G. Sheps Center for Health Services Research, University of North Carolina, Chapel Hill, North Carolina, USA
- Stacey Gabriel
- Cecil G. Sheps Center for Health Services Research, University of North Carolina, Chapel Hill, North Carolina, USA
- Timothy S Carey
- Cecil G. Sheps Center for Health Services Research, University of North Carolina, Chapel Hill, North Carolina, USA; Departments of Medicine and Social Medicine, School of Medicine, University of North Carolina, Chapel Hill, North Carolina, USA
- Laura C Hanson
- Cecil G. Sheps Center for Health Services Research, University of North Carolina, Chapel Hill, North Carolina, USA; Division of Geriatric Medicine & Palliative Care Program, University of North Carolina, Chapel Hill, North Carolina, USA
8
Britan A, Cusin I, Hinard V, Mottin L, Pasche E, Gobeill J, Rech de Laval V, Gleizes A, Teixeira D, Michel PA, Ruch P, Gaudet P. Accelerating annotation of articles via automated approaches: evaluation of the neXtA5 curation-support tool by neXtProt. Database (Oxford) 2018; 2018:5255187. [PMID: 30576492 PMCID: PMC6301339 DOI: 10.1093/database/bay129] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 07/21/2018] [Revised: 10/04/2018] [Accepted: 11/09/2018] [Indexed: 11/14/2022]
Abstract
The development of efficient text-mining tools promises to boost the curation workflow by significantly reducing the time needed to process the literature into biological databases. We have developed a curation support tool, neXtA5, that provides a search engine coupled with an annotation system directly integrated into a biocuration workflow. neXtA5 assists curation with modules optimized for the various curation tasks: document triage, entity recognition and information extraction. Here, we describe the evaluation of neXtA5 by expert curators. We first assessed the annotations of two independent curators to provide a baseline for comparison. To evaluate the performance of neXtA5, we submitted requests and compared the neXtA5 results with the manual curation. The analysis focuses on the usability of neXtA5 to support the curation of two types of data: biological processes (BPs) and diseases (Ds). We evaluated the relevance of the papers proposed as well as the recall and precision of the suggested annotations. The evaluation of document triage by neXtA5 precision showed that both curators agree with neXtA5 for 67 (BP) and 63% (D) of abstracts, while curators agree on accepting or rejecting an abstract ~80% of the time. Hence, the precision of the triage system is satisfactory. For concept extraction, curators approved 35 (BP) and 25% (D) of the neXtA5 annotations. Conversely, neXtA5 successfully annotated up to 36 (BP) and 68% (D) of the terms identified by curators. The user feedback obtained in these tests highlighted the need for improvement in the ranking function of neXtA5 annotations. Therefore, we transformed the information extraction component into an annotation ranking system. This improvement results in a top precision (precision at first rank) of 59 (D) and 63% (BP). These results suggest that when considering only the first extracted entity, the current system achieves a precision comparable with expert biocurators.
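The "precision at first rank" measure mentioned above can be illustrated with a short sketch: for each document, only the top-ranked suggested annotation is checked against the curator-approved terms. The document identifiers, terms and the function name `precision_at_first_rank` are hypothetical; this is not neXtA5's code or its evaluation data, and skipping documents with no suggestions is an assumption of this sketch.

```python
def precision_at_first_rank(ranked_suggestions, approved):
    """Fraction of documents whose top-ranked suggestion was approved.

    ranked_suggestions: dict mapping doc id -> list of terms, best first.
    approved: dict mapping doc id -> set of curator-approved terms.
    Documents with no suggestions are skipped (an assumption of this sketch).
    """
    scored_docs = [doc for doc, terms in ranked_suggestions.items() if terms]
    if not scored_docs:
        return 0.0
    hits = sum(
        1
        for doc in scored_docs
        if ranked_suggestions[doc][0] in approved.get(doc, set())
    )
    return hits / len(scored_docs)


# Hypothetical ranked suggestions and curator decisions.
ranked_suggestions = {
    "doc1": ["apoptosis", "cell migration"],
    "doc2": ["cell cycle", "apoptosis"],
    "doc3": ["autophagy"],
    "doc4": [],  # no suggestion produced, so not scored
}
approved = {
    "doc1": {"apoptosis"},
    "doc2": {"apoptosis"},  # correct term exists, but not at rank 1
    "doc3": {"autophagy"},
}

p_at_1 = precision_at_first_rank(ranked_suggestions, approved)
print(f"precision at first rank: {p_at_1:.2f}")
```

In this toy example `doc2` shows why ranking matters: the approved term is among the suggestions but does not count because it is not ranked first.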
Affiliation(s)
- Aurore Britan
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Isabelle Cusin
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Valérie Hinard
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Luc Mottin
- Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
- SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Emilie Pasche
- Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
- SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Julien Gobeill
- Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
- SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Valentine Rech de Laval
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Anne Gleizes
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Daniel Teixeira
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Pierre-André Michel
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Patrick Ruch
- Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
- SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Pascale Gaudet
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
9
Przybyła P, Shardlow M, Aubin S, Bossy R, Eckart de Castilho R, Piperidis S, McNaught J, Ananiadou S. Text mining resources for the life sciences. Database (Oxford) 2016; 2016:baw145. [PMID: 27888231 PMCID: PMC5199186 DOI: 10.1093/database/baw145] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Received: 09/08/2016] [Revised: 10/13/2016] [Accepted: 10/17/2016] [Indexed: 11/18/2022]
Abstract
Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable: those that have the crucial ability to share information, enabling smooth integration and reusability.
Affiliation(s)
- Piotr Przybyła
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
- Matthew Shardlow
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
- Sophie Aubin
- Institut National de la Recherche Agronomique, Jouy-en-Josas, France
- Robert Bossy
- Institut National de la Recherche Agronomique, Jouy-en-Josas, France
- Stelios Piperidis
- Institute for Language and Speech Processing, Athena Research Center, Athens, Greece
- John McNaught
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
- Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
10
Alnazzawi N, Thompson P, Ananiadou S. Mapping Phenotypic Information in Heterogeneous Textual Sources to a Domain-Specific Terminological Resource. PLoS One 2016; 11:e0162287. [PMID: 27643689 PMCID: PMC5028053 DOI: 10.1371/journal.pone.0162287] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 01/09/2016] [Accepted: 08/19/2016] [Indexed: 02/02/2023] Open
Abstract
Biomedical literature articles and narrative content from Electronic Health Records (EHRs) both constitute rich sources of disease-phenotype information. Phenotype concepts may be mentioned in text in multiple ways, using phrases with a variety of structures. This variability stems partly from the different backgrounds of the authors, but also from the different writing styles typically used in each text type. Since EHR narrative reports and literature articles contain different but complementary types of valuable information, combining details from each text type can help to uncover new disease-phenotype associations. However, the alternative ways in which the same concept may be mentioned in each source constitute a barrier to the automatic integration of information. Accordingly, identification of the unique concepts represented by phrases in text can help to bridge the gap between text types. We describe our development of a novel method, PhenoNorm, which integrates a number of different similarity measures to allow automatic linking of phenotype concept mentions to known concepts in the UMLS Metathesaurus, a biomedical terminological resource. PhenoNorm was developed using the PhenoCHF corpus, a collection of literature articles and narratives in EHRs annotated for phenotypic information relating to congestive heart failure (CHF). We evaluate the performance of PhenoNorm in linking CHF-related phenotype mentions to Metathesaurus concepts, using a newly enriched version of PhenoCHF, in which each phenotype mention has an expert-verified link to a concept in the UMLS Metathesaurus. We show that PhenoNorm outperforms a number of alternative methods applied to the same task. Furthermore, we demonstrate PhenoNorm's wider utility by evaluating its ability to link mentions of various other types of medically-related information, occurring in texts covering wider subject areas, to concepts in different terminological resources. We show that PhenoNorm can maintain performance levels, and that its accuracy compares favourably to other methods applied to these tasks.
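The abstract's general idea of combining several similarity measures to link a mention to a known concept can be illustrated with a minimal sketch. This is not PhenoNorm itself: the two measures (token-level Jaccard overlap and a character-level ratio from Python's `difflib`), the equal weights, the threshold and the toy concept identifiers are all assumptions chosen for illustration, and no real UMLS Metathesaurus data is used.

```python
from difflib import SequenceMatcher

# Toy concept inventory. Identifiers and names are illustrative placeholders,
# NOT real UMLS Metathesaurus entries.
CONCEPTS = {
    "C_TOY_001": "shortness of breath",
    "C_TOY_002": "peripheral oedema",
    "C_TOY_003": "chest pain",
}


def token_jaccard(a: str, b: str) -> float:
    """Word-level overlap: size of intersection over size of union."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def char_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def combined_score(mention: str, name: str,
                   w_token: float = 0.5, w_char: float = 0.5) -> float:
    """Weighted sum of the two measures (weights are an assumption)."""
    return w_token * token_jaccard(mention, name) + w_char * char_ratio(mention, name)


def link_mention(mention: str, concepts=CONCEPTS, threshold: float = 0.5):
    """Return (concept_id, score) for the best-scoring concept,
    or (None, score) when the best score falls below the threshold."""
    best_id, best_score = None, 0.0
    for concept_id, name in concepts.items():
        score = combined_score(mention, name)
        if score > best_score:
            best_id, best_score = concept_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


if __name__ == "__main__":
    for text in ["Chest Pain", "breath shortness", "xyz qqq"]:
        print(text, "->", link_mention(text))
```

Using more than one measure lets a word-order variant such as "breath shortness" still score well on token overlap even when its character-level similarity is lower, while the threshold leaves unrelated strings unlinked rather than forcing a match.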
Affiliation(s)
- Noha Alnazzawi
- National Centre for Text Mining, Manchester Institute of Biotechnology, Manchester University, Manchester, United Kingdom
- Paul Thompson
- National Centre for Text Mining, Manchester Institute of Biotechnology, Manchester University, Manchester, United Kingdom
- Sophia Ananiadou
- National Centre for Text Mining, Manchester Institute of Biotechnology, Manchester University, Manchester, United Kingdom
|
11
|
Wang Q, Abdul SS, Almeida L, Ananiadou S, Balderas-Martínez YI, Batista-Navarro R, Campos D, Chilton L, Chou HJ, Contreras G, Cooper L, Dai HJ, Ferrell B, Fluck J, Gama-Castro S, George N, Gkoutos G, Irin AK, Jensen LJ, Jimenez S, Jue TR, Keseler I, Madan S, Matos S, McQuilton P, Milacic M, Mort M, Natarajan J, Pafilis E, Pereira E, Rao S, Rinaldi F, Rothfels K, Salgado D, Silva RM, Singh O, Stefancsik R, Su CH, Subramani S, Tadepally HD, Tsaprouni L, Vasilevsky N, Wang X, Chatr-Aryamontri A, Laulederkind SJF, Matis-Mitchell S, McEntyre J, Orchard S, Pundir S, Rodriguez-Esteban R, Van Auken K, Lu Z, Schaeffer M, Wu CH, Hirschman L, Arighi CN. Overview of the interactive task in BioCreative V. Database: The Journal of Biological Databases and Curation 2016; 2016:baw119. [PMID: 27589961 PMCID: PMC5009325 DOI: 10.1093/database/baw119] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 04/04/2016] [Accepted: 07/28/2016] [Indexed: 11/14/2022]
Abstract
Fully automated text mining (TM) systems promote efficient literature searching, retrieval, and review but are not sufficient to produce ready-to-consume curated documents. These systems are not meant to replace biocurators, but instead to assist them in one or more literature curation steps. To do so, the user interface is an important aspect that needs to be considered for tool adoption. The BioCreative Interactive task (IAT) is a track designed for exploring user-system interactions, promoting development of useful TM tools, and providing a communication channel between the biocuration and the TM communities. In BioCreative V, the IAT track followed a format similar to previous interactive tracks, where the utility and usability of TM tools, as well as the generation of use cases, have been the focal points. The proposed curation tasks are user-centric and formally evaluated by biocurators. In BioCreative V IAT, seven TM systems and 43 biocurators participated. Two levels of user participation were offered to broaden curator involvement and obtain more feedback on usability aspects. The full level participation involved training on the system, curation of a set of documents with and without TM assistance, tracking of time-on-task, and completion of a user survey. The partial level participation was designed to focus on usability aspects of the interface and not the performance per se. In this case, biocurators navigated the system by performing pre-designed tasks and were then asked whether they were able to achieve each task and how difficult it was to complete. In this manuscript, we describe the development of the interactive task, from planning to execution, and discuss major findings for the systems tested. Database URL: http://www.biocreative.org
Affiliation(s)
- Qinghua Wang
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19711, USA; Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19711, USA
- Shabbir S Abdul
- International Centre of Health Information Technology, Taipei Medical University, Taipei, Taiwan
- Lara Almeida
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Sophia Ananiadou
- National Centre for Text Mining, University of Manchester, Manchester, UK
- Lucy Chilton
- Northern Institute for Cancer Research, Newcastle University, Newcastle, UK
- Hui-Jou Chou
- Rutgers University-Camden, Camden, NJ 08102, USA
- Gabriela Contreras
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, 04510 Ciudad de México, México
- Laurel Cooper
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331, USA
- Hong-Jie Dai
- Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan
- Barbra Ferrell
- College of Agriculture and Natural Resources, University of Delaware, Newark, DE 19711, USA
- Juliane Fluck
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, 53754 St. Augustin, Germany
- Socorro Gama-Castro
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, 04510 Ciudad de México, México
- Georgios Gkoutos
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham B15 2TT, UK; Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham B15 2TT, UK
- Afroza K Irin
- Life Science Informatics, University of Bonn, Bonn, Germany
- Lars J Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Silvia Jimenez
- Blue Brain Project, École Polytechnique Fédérale de Lausanne (EPFL) Biotech Campus, Geneva, Switzerland
- Toni R Jue
- Prince of Wales Clinical School, University of New South Wales NSW, Sydney, New South Wales, Australia
- Sumit Madan
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, 53754 St. Augustin, Germany
- Sérgio Matos
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Marija Milacic
- Department of Informatics and Bio-Computing, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
- Matthew Mort
- HGMD, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, UK
- Jeyakumar Natarajan
- Department of Bioinformatics, Bharathiar University, Coimbatore, Tamil Nadu, India
- Evangelos Pafilis
- Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece
- Emiliano Pereira
- Microbial Genomics and Bioinformatics Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
- Shruti Rao
- Innovation Center for Biomedical Informatics (ICBI), Georgetown University, Washington, DC 20007, USA
- Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland
- Karen Rothfels
- Department of Informatics and Bio-Computing, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
- David Salgado
- GMGF, Aix-Marseille Universite, 13385 Marseille, France; Inserm, UMR_S 910, 13385 Marseille, France
- Raquel M Silva
- Department of Medical Sciences, iBiMED & IEETA, University of Aveiro, 3810-193 Aveiro, Portugal
- Onkar Singh
- Taipei Medical University Graduate Institute of Biomedical Informatics, Taipei, Taiwan
- Chu-Hsien Su
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Suresh Subramani
- Department of Bioinformatics, Bharathiar University, Coimbatore, Tamil Nadu, India
- Loukia Tsaprouni
- Institute of Sport and Physical Activity Research (ISPAR), University of Bedfordshire, Bedford, UK
- Nicole Vasilevsky
- Ontology Development Group, Oregon Health & Science University, Portland, OR 97239, USA
- Xiaodong Wang
- WormBase Consortium, Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
- Sandra Orchard
- European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
- Sangya Pundir
- European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
- Kimberly Van Auken
- WormBase Consortium, Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Institutes of Health, Bethesda, MD 20894, USA
- Mary Schaeffer
- MaizeGDB USDA ARS and University of Missouri, Columbia, MO 65211, USA
- Cathy H Wu
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19711, USA; Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19711, USA
- Cecilia N Arighi
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19711, USA; Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19711, USA
|
12
|
Batista-Navarro R, Carter J, Ananiadou S. Argo: enabling the development of bespoke workflows and services for disease annotation. Database: The Journal of Biological Databases and Curation 2016; 2016:baw066. [PMID: 27189607 PMCID: PMC4869796 DOI: 10.1093/database/baw066] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 12/13/2015] [Accepted: 04/11/2016] [Indexed: 11/14/2022]
Abstract
Argo (http://argo.nactem.ac.uk) is a generic text mining workbench that can cater to a variety of use cases, including the semi-automatic annotation of literature. It enables its technical users to build their own customised text mining solutions by providing a wide array of interoperable and configurable elementary components that can be seamlessly integrated into processing workflows. With Argo's graphical annotation interface, domain experts can then make use of the workflows' automatically generated output to curate information of interest. With the continuously rising need to understand the aetiology of diseases as well as the demand for their informed diagnosis and personalised treatment, the curation of disease-relevant information from medical and clinical documents has become an indispensable scientific activity. In the Fifth BioCreative Challenge Evaluation Workshop (BioCreative V), there was substantial interest in the mining of literature for disease-relevant information. Apart from a panel discussion focussed on disease annotations, the chemical-disease relations (CDR) track was also organised to foster the sharing and advancement of disease annotation tools and resources. This article presents the application of Argo's capabilities to the literature-based annotation of diseases. As part of our participation in BioCreative V's User Interactive Track (IAT), we demonstrated and evaluated Argo's suitability for the semi-automatic curation of chronic obstructive pulmonary disease (COPD) phenotypes. Furthermore, the workbench facilitated the development of some of the CDR track's top-performing web services for normalising disease mentions against the Medical Subject Headings (MeSH) database. In this work, we highlight Argo's support for developing various types of bespoke workflows, ranging from ones which enabled us to easily incorporate information from various databases, to those which train and apply machine learning-based concept recognition models, through to user-interactive ones which allow human curators to manually provide their corrections to automatically generated annotations. Our participation in the BioCreative V challenges shows Argo's potential as an enabling technology for curating disease and phenotypic information from literature. Database URL: http://argo.nactem.ac.uk
Affiliation(s)
- Riza Batista-Navarro
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
- Jacob Carter
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
- Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
|
13
|
Soldatova LN, Collier N, Oellrich A, Groza T, Verspoor K, Rocca-Serra P, Dumontier M, Shah NH. Special issue on bio-ontologies and phenotypes. J Biomed Semantics 2015; 6:40. [PMID: 26682035 PMCID: PMC4682270 DOI: 10.1186/s13326-015-0040-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/19/2015] [Accepted: 11/15/2015] [Indexed: 11/10/2022] Open
Abstract
The bio-ontologies and phenotypes special issue includes eight papers selected from the 11 papers presented at the Bio-Ontologies SIG (Special Interest Group) and the Phenotype Day at ISMB (Intelligent Systems for Molecular Biology) conference in Boston in 2014. The selected papers span a wide range of topics including the automated re-use and update of ontologies, quality assessment of ontological resources, and the systematic description of phenotype variation, driven by manual, semi- and fully automatic means.
Affiliation(s)
- Tudor Groza
- The Garvan Institute of Medical Research, Sydney, Australia
|