1
|
Wassell M, Vitiello A, Butler-Henderson K, Verspoor K, Pollard H. Generalizability of a Musculoskeletal Therapist Electronic Health Record for Modelling Outcomes to Work-Related Musculoskeletal Disorders. J Occup Rehabil 2024:10.1007/s10926-024-10196-w. [PMID: 38739344 DOI: 10.1007/s10926-024-10196-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/07/2024] [Indexed: 05/14/2024]
Abstract
PURPOSE Electronic Health Records (EHRs) can contain vast amounts of clinical information that could be reused in modelling outcomes of work-related musculoskeletal disorders (WMSDs). Determining the generalizability of an EHR dataset is an important step in determining the appropriateness of its reuse. The study aims to describe the EHR dataset used by occupational musculoskeletal therapists and determine whether the EHR dataset is generalizable to the Australian workers' population and injury characteristics seen in workers' compensation claims. METHODS Variables were considered if they were associated with outcomes of WMSDs and data for the variable were available. Completeness and external validity assessment analysed frequency distributions, percentage of records and confidence intervals. RESULTS There were 48,434 patient care plans across 10 industries from 2014 to 2021. The EHR collects information related to clinical interventions, health and psychosocial factors, job demands, work accommodations as well as workplace culture, which have all been shown to be valuable variables in determining outcomes of WMSDs. Distributions of age, duration of employment, gender and region of birth were mostly similar to the Australian workforce. Upper limb WMSDs were more frequent in the EHR than in workers' compensation claims, whereas diagnoses were similar. CONCLUSION The study shows the EHR has strong potential to be used for further research into WMSDs as it has a similar population to the Australian workforce, manufacturing industry and workers' compensation claims. It contains many variables that may be relevant in modelling outcomes of WMSDs that are not typically available in existing datasets.
Affiliation(s)
- M Wassell
  - School of Computing Technologies, RMIT University, Melbourne, Australia
- A Vitiello
  - School of Health, Medical and Applied Sciences, Central Queensland University, Queensland, Australia
- K Butler-Henderson
  - STEM|Health and Biomedical Sciences, RMIT University, Melbourne, Australia
- K Verspoor
  - School of Computing Technologies, RMIT University, Melbourne, Australia
- H Pollard
  - Faculty of Health Sciences, Durban University of Technology, Durban, South Africa
2
|
Liu Y, Ritchie SC, Teo SM, Ruuskanen MO, Kambur O, Zhu Q, Sanders J, Vázquez-Baeza Y, Verspoor K, Jousilahti P, Lahti L, Niiranen T, Salomaa V, Havulinna AS, Knight R, Méric G, Inouye M. Integration of polygenic and gut metagenomic risk prediction for common diseases. Nat Aging 2024; 4:584-594. [PMID: 38528230 PMCID: PMC11031402 DOI: 10.1038/s43587-024-00590-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Accepted: 02/13/2024] [Indexed: 03/27/2024]
Abstract
Multiomics has shown promise in noninvasive risk profiling and early detection of various common diseases. In the present study, in a prospective population-based cohort with ~18 years of e-health record follow-up, we investigated the incremental and combined value of genomic and gut metagenomic risk assessment compared with conventional risk factors for predicting incident coronary artery disease (CAD), type 2 diabetes (T2D), Alzheimer disease and prostate cancer. We found that polygenic risk scores (PRSs) improved prediction over conventional risk factors for all diseases. Gut microbiome scores improved predictive capacity over baseline age for CAD, T2D and prostate cancer. Integrated risk models of PRSs, gut microbiome scores and conventional risk factors achieved the highest predictive performance for all diseases studied compared with models based on conventional risk factors alone. The present study demonstrates that integrated PRSs and gut metagenomic risk models improve the predictive value over conventional risk factors for common chronic diseases.
Affiliation(s)
- Yang Liu
  - Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
  - Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
  - Department of Clinical Pathology, Melbourne Medical School, University of Melbourne, Melbourne, Victoria, Australia
  - Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Scott C Ritchie
  - Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
  - Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
  - Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
  - British Heart Foundation Cambridge Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- Shu Mei Teo
  - Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
  - Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
  - Centre for Youth Mental Health, University of Melbourne, Melbourne, Victoria, Australia
- Matti O Ruuskanen
  - Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland
  - Department of Computing, University of Turku, Turku, Finland
- Oleg Kambur
  - Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland
- Qiyun Zhu
  - School of Life Sciences, Arizona State University, Tempe, AZ, USA
  - Biodesign Center for Fundamental and Applied Microbiomics, Arizona State University, Tempe, AZ, USA
- Jon Sanders
  - Department of Ecology and Evolutionary Biology, Cornell University, Ithaca, NY, USA
- Yoshiki Vázquez-Baeza
  - Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, USA
- Karin Verspoor
  - School of Computing Technologies, RMIT University, Melbourne, Victoria, Australia
  - School of Computing and Information Systems, University of Melbourne, Melbourne, Victoria, Australia
- Pekka Jousilahti
  - Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland
- Leo Lahti
  - Department of Computing, University of Turku, Turku, Finland
- Teemu Niiranen
  - Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland
  - Division of Medicine, Turku University Hospital and University of Turku, Turku, Finland
- Veikko Salomaa
  - Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland
- Aki S Havulinna
  - Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland
  - Institute for Molecular Medicine Finland, FIMM-HiLIFE, University of Helsinki, Helsinki, Finland
- Rob Knight
  - Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, USA
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
  - Department of Pediatrics, School of Medicine, University of California San Diego, La Jolla, CA, USA
- Guillaume Méric
  - Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
  - Central Clinical School, Monash University, Melbourne, Victoria, Australia
  - Department of Cardiometabolic Health, University of Melbourne, Melbourne, Victoria, Australia
  - Department of Cardiovascular Research, Translation and Implementation, La Trobe University, Melbourne, Victoria, Australia
  - Department of Medical Sciences, Molecular Epidemiology, Uppsala University, Uppsala, Sweden
- Michael Inouye
  - Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
  - Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
  - Department of Clinical Pathology, Melbourne Medical School, University of Melbourne, Melbourne, Victoria, Australia
  - Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
  - British Heart Foundation Cambridge Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
  - The Alan Turing Institute, London, UK
3
|
Liu J, Capurro D, Nguyen A, Verspoor K. Uncovering Variations in Clinical Notes for NLP Modeling. Stud Health Technol Inform 2024; 310:1460-1461. [PMID: 38269696 DOI: 10.3233/shti231244] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2024]
Abstract
Clinical text contains rich patient information and has attracted much research interest in applying Natural Language Processing (NLP) tools to model it. In this study, we quantified and analyzed the textual characteristics of five common clinical note types using multiple measurements, including lexical-level features, semantic content, and grammaticality. We found significant linguistic variation across clinical note types, although some types are more similar to each other than others.
Affiliation(s)
- Jinghui Liu
  - The University of Melbourne, Australia
  - CSIRO, Australia
- Karin Verspoor
  - RMIT University, Australia
  - The University of Melbourne, Australia
4
|
Khanina A, Rozova V, Elkins S, Verspoor K, Thursky K. Designing a Digital Health Solution: A Platform for Automated Surveillance of Fungal Infection. Stud Health Technol Inform 2024; 310:1454-1455. [PMID: 38269693 DOI: 10.3233/shti231241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2024]
Abstract
Surveillance of invasive fungal infection (IFI) requires laborious review of multiple sources of clinical information, while applying complex criteria to effectively identify relevant infections. These processes can be automated using artificial intelligence (AI) methodologies, including applying natural language processing (NLP) to clinical reports. However, developing a practically useful automated IFI surveillance tool requires consideration of the implementation context. We employed the Design Thinking Framework (DTF) to focus on the needs of end users of the tool to ensure sustained user engagement and enable its prospective validation. DTF allowed iterative generation of ideas and refinement of the final digital health solution. We believe this approach is key to increasing the likelihood that the solution will be implemented in clinical practice.
Affiliation(s)
- Anna Khanina
  - National Centre for Infections in Cancer, Peter MacCallum Cancer Centre, Melbourne, Australia
  - Department of Infectious Diseases, Peter MacCallum Cancer Centre, Melbourne, Australia
  - Sir Peter MacCallum Cancer Department of Oncology, University of Melbourne, Melbourne, Australia
- Vlada Rozova
  - National Centre for Infections in Cancer, Peter MacCallum Cancer Centre, Melbourne, Australia
  - School of Computing Technologies, RMIT University, Melbourne, Australia
  - School of Computing and Information Systems, University of Melbourne, Melbourne, Australia
- Sri Elkins
  - Guidance Group, Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
- Karin Verspoor
  - School of Computing Technologies, RMIT University, Melbourne, Australia
  - School of Computing and Information Systems, University of Melbourne, Melbourne, Australia
- Karin Thursky
  - National Centre for Infections in Cancer, Peter MacCallum Cancer Centre, Melbourne, Australia
  - Department of Infectious Diseases, Peter MacCallum Cancer Centre, Melbourne, Australia
  - Sir Peter MacCallum Cancer Department of Oncology, University of Melbourne, Melbourne, Australia
  - Guidance Group, Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
5
|
Wassell M, Murray JL, Kumar C, Verspoor K, Butler-Henderson K. Understanding Clinician EHR Data Quality for Reuse in Predictive Modelling. Stud Health Technol Inform 2024; 310:169-173. [PMID: 38269787 DOI: 10.3233/shti230949] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2024]
Abstract
It is imperative to build clinician trust to reuse ever-growing amounts of rich clinical data. Utilising a proprietary, structured electronic health record, we address data quality by assessing the plausibility of data entry by chiropractors, physical therapists and osteopaths to help determine if the data is fit for use in predicting outcomes of work-related musculoskeletal disorders using machine learning. For most variables assessed, individual clinician data entry correlated positively with the clinician group's data entry, indicating the data is fit for reuse. However, from the clinician's perspective, there were inconsistencies, which could lead to data mistrust. When assessing data quality in EHR studies, it is crucial to engage clinicians, who have a deep understanding of EHR use and can suggest improvements. Clinicians should be considered local knowledge experts.
6
|
Liu J, Capurro D, Nguyen A, Verspoor K. Attention-based multimodal fusion with contrast for robust clinical prediction in the face of missing modalities. J Biomed Inform 2023; 145:104466. [PMID: 37549722 DOI: 10.1016/j.jbi.2023.104466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2023] [Revised: 06/09/2023] [Accepted: 08/01/2023] [Indexed: 08/09/2023]
Abstract
OBJECTIVE With the increasing amount and growing variety of healthcare data, multimodal machine learning supporting integrated modeling of structured and unstructured data is an increasingly important tool for clinical machine learning tasks. However, it is non-trivial to manage the differences in dimensionality, volume, and temporal characteristics of data modalities in the context of a shared target task. Furthermore, patients can have substantial variations in the availability of data, while existing multimodal modeling methods typically assume data completeness and lack a mechanism to handle missing modalities. METHODS We propose a Transformer-based fusion model with modality-specific tokens that summarize the corresponding modalities to achieve effective cross-modal interaction accommodating missing modalities in the clinical context. The model is further refined by inter-modal, inter-sample contrastive learning to improve the representations for better predictive performance. We denote the model as Attention-based cRoss-MOdal fUsion with contRast (ARMOUR). We evaluate ARMOUR using two input modalities (structured measurements and unstructured text), six clinical prediction tasks, and two evaluation regimes, either including or excluding samples with missing modalities. RESULTS Our model shows improved performances over unimodal or multimodal baselines in both evaluation regimes, including or excluding patients with missing modalities in the input. The contrastive learning improves the representation power and is shown to be essential for better results. The simple setup of modality-specific tokens enables ARMOUR to handle patients with missing modalities and allows comparison with existing unimodal benchmark results. CONCLUSION We propose a multimodal model for robust clinical prediction to achieve improved performance while accommodating patients with missing modalities. 
This work could inspire future research to study the effective incorporation of multiple, more complex modalities of clinical data into a single model.
Affiliation(s)
- Jinghui Liu
  - Australian e-Health Research Centre, CSIRO, Queensland, Australia
  - School of Computing and Information Systems, The University of Melbourne, Victoria, Australia
- Daniel Capurro
  - School of Computing and Information Systems, The University of Melbourne, Victoria, Australia
  - Centre for Digital Transformation of Health, The University of Melbourne, Victoria, Australia
- Anthony Nguyen
  - Australian e-Health Research Centre, CSIRO, Queensland, Australia
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Victoria, Australia
  - School of Computing Technologies, RMIT University, Victoria, Australia
7
|
Pu Y, Beck D, Verspoor K. Graph embedding-based link prediction for literature-based discovery in Alzheimer's Disease. J Biomed Inform 2023; 145:104464. [PMID: 37541406 DOI: 10.1016/j.jbi.2023.104464] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/22/2023] [Revised: 07/29/2023] [Accepted: 07/30/2023] [Indexed: 08/06/2023]
Abstract
OBJECTIVE We explore the framing of literature-based discovery (LBD) as link prediction and graph embedding learning, with Alzheimer's Disease (AD) as our focus disease context. The key link prediction setting of prediction window length is specifically examined in the context of a time-sliced evaluation methodology. METHODS We propose a four-stage approach to explore literature-based discovery for Alzheimer's Disease, creating and analyzing a knowledge graph tailored to the AD context, and predicting and evaluating new knowledge based on time-sliced link prediction. The first stage is to collect an AD-specific corpus. The second stage involves constructing an AD knowledge graph with identified AD-specific concepts and relations from the corpus. In the third stage, 20 pairs of training and testing datasets are constructed with the time-slicing methodology. Finally, we infer new knowledge with graph embedding-based link prediction methods. We compare different link prediction methods in this context. The impact of limiting prediction evaluation of LBD models in the context of short-term and longer-term knowledge evolution for Alzheimer's Disease is assessed. RESULTS We constructed an AD corpus of over 16 k papers published in 1977-2021, and automatically annotated it with concepts and relations covering 11 AD-specific semantic entity types. The knowledge graph of Alzheimer's Disease derived from this resource consisted of ∼11 k nodes and ∼394 k edges, among which 34% were genotype-phenotype relationships, 57% were genotype-genotype relationships, and 9% were phenotype-phenotype relationships. A Structural Deep Network Embedding (SDNE) model consistently showed the best performance in terms of returning the most confident set of link predictions as time progresses over 20 years. 
A huge improvement in model performance was observed when changing the link prediction evaluation setting to consider a more distant future, reflecting the time required for knowledge accumulation. CONCLUSION Neural network graph-embedding link prediction methods show promise for the literature-based discovery context, although the prediction setting is extremely challenging, with graph densities of less than 1%. Varying prediction window length on the time-sliced evaluation methodology leads to hugely different results and interpretations of LBD studies. Our approach can be generalized to enable knowledge discovery for other diseases. AVAILABILITY Code, AD ontology, and data are available at https://github.com/READ-BioMed/readbiomed-lbd.
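As a rough arithmetic check on the "graph densities of less than 1%" remark, the sketch below computes the density of a simple undirected graph from the approximate node and edge counts quoted in the abstract (about 11 k nodes and 394 k edges). The function name and the simple-undirected-graph assumption are illustrative only; the paper's densities refer to its time-sliced graphs and may be defined differently.

```python
def undirected_density(n_nodes: int, n_edges: int) -> float:
    """Density of a simple undirected graph: observed edges / possible edges."""
    if n_nodes < 2:
        return 0.0
    possible_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible_edges


# Approximate counts quoted in the abstract (assumption: simple undirected graph).
ad_graph_density = undirected_density(11_000, 394_000)
print(f"density = {ad_graph_density:.4%}")
```

Under these assumptions the density comes out well below 1%, which is consistent with the abstract's description of link prediction in this setting as extremely challenging: almost all candidate node pairs are non-edges.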
Affiliation(s)
- Yiyuan Pu
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia
- Daniel Beck
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia
  - School of Computing Technologies, RMIT University, Melbourne, Victoria, Australia
8
|
Affiliation(s)
- Enrico W Coiera
  - Centre for Health Informatics, Macquarie University, Sydney, NSW
  - RMIT University, Melbourne, VIC
9
|
Šuster S, Baldwin T, Verspoor K. Analysis of predictive performance and reliability of classifiers for quality assessment of medical evidence revealed important variation by medical area. J Clin Epidemiol 2023; 159:58-69. [PMID: 37120028 DOI: 10.1016/j.jclinepi.2023.04.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2022] [Revised: 03/30/2023] [Accepted: 04/18/2023] [Indexed: 05/01/2023]
Abstract
OBJECTIVES A major obstacle in the deployment of models for automated quality assessment is their reliability. We analyze their calibration and selective classification performance. STUDY DESIGN AND SETTING We examine two systems for assessing the quality of medical evidence, EvidenceGRADEr and RobotReviewer, both developed from the Cochrane Database of Systematic Reviews (CDSR) to measure strength of bodies of evidence and risk of bias (RoB) of individual studies, respectively. We report their calibration error and Brier scores, present their reliability diagrams, and analyze the risk-coverage trade-off in selective classification. RESULTS The models are reasonably well calibrated on most quality criteria (expected calibration error [ECE] 0.04-0.09 for EvidenceGRADEr, 0.03-0.10 for RobotReviewer). However, we discover that both calibration and predictive performance vary significantly by medical area. This has ramifications for the application of such models in practice, as average performance is a poor indicator of group-level performance (e.g., health and safety at work, allergy and intolerance, and public health see much worse performance than cancer, pain and anesthesia, and neurology). We explore the reasons behind this disparity. CONCLUSION Practitioners adopting automated quality assessment should expect large fluctuations in system reliability and predictive performance depending on the medical area. Prospective indicators of such behavior should be further researched.
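For readers unfamiliar with the two reliability measures named in this abstract, the following is a minimal sketch of the Brier score and expected calibration error (ECE) for a binary classifier. It assumes equal-width confidence bins and a 0.5 decision threshold; the function names, the bin count, and the binary simplification are illustrative assumptions, not the evaluation code used in the study.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: weighted mean gap between accuracy and confidence per bin.

    Confidence is the probability assigned to the predicted class, and bins
    are equal-width over [0, 1] (both choices are assumptions of this sketch).
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        predicted = 1 if p >= 0.5 else 0
        confidence = p if predicted == 1 else 1.0 - p
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, 1.0 if predicted == y else 0.0))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_confidence)
    return ece


# Toy data (hypothetical): ten predictions at 0.9 confidence, nine of them correct.
toy_probs = [0.9] * 10
toy_labels = [1] * 9 + [0]
toy_brier = brier_score(toy_probs, toy_labels)
toy_ece = expected_calibration_error(toy_probs, toy_labels)
```

In the toy data the stated confidence matches the observed accuracy, so the ECE is essentially zero even though the Brier score is not. This is why the two measures are reported side by side: ECE isolates calibration, while the Brier score also reflects how sharp the predictions are.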
Affiliation(s)
- Simon Šuster
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Timothy Baldwin
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
  - Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
- Karin Verspoor
  - School of Computing Technologies, RMIT University, Melbourne, Australia
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
10
|
El-Hayek C, Barzegar S, Faux N, Doyle K, Pillai P, Mutch SJ, Vaisey A, Ward R, Sanci L, Dunn AG, Hellard ME, Hocking JS, Verspoor K, Boyle DI. An evaluation of existing text de-identification tools for use with patient progress notes from Australian general practice. Int J Med Inform 2023; 173:105021. [PMID: 36870249 DOI: 10.1016/j.ijmedinf.2023.105021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2022] [Revised: 02/07/2023] [Accepted: 02/10/2023] [Indexed: 02/13/2023]
Abstract
INTRODUCTION Digitized patient progress notes from general practice represent a significant resource for clinical and public health research but cannot feasibly and ethically be used for these purposes without automated de-identification. Internationally, several open-source natural language processing tools have been developed, however, given wide variations in clinical documentation practices, these cannot be utilized without appropriate review. We evaluated the performance of four de-identification tools and assessed their suitability for customization to Australian general practice progress notes. METHODS Four tools were selected: three rule-based (HMS Scrubber, MIT De-id, Philter) and one machine learning (MIST). 300 patient progress notes from three general practice clinics were manually annotated with personally identifying information. We conducted a pairwise comparison between the manual annotations and patient identifiers automatically detected by each tool, measuring recall (sensitivity), precision (positive predictive value), f1-score (harmonic mean of precision and recall), and f2-score (weighs recall 2x higher than precision). Error analysis was also conducted to better understand each tool's structure and performance. RESULTS Manual annotation detected 701 identifiers in seven categories. The rule-based tools detected identifiers in six categories and MIST in three. Philter achieved the highest aggregate recall (67%) and the highest recall for NAME (87%). HMS Scrubber achieved the highest recall for DATE (94%) and all tools performed poorly on LOCATION. MIST achieved the highest precision for NAME and DATE while also achieving similar recall to the rule-based tools for DATE and highest recall for LOCATION. Philter had the lowest aggregate precision (37%), however preliminary adjustments of its rules and dictionaries showed a substantial reduction in false positives. 
CONCLUSION Existing off-the-shelf solutions for automated de-identification of clinical text are not immediately suitable for our context without modification. Philter is the most promising candidate due to its high recall and flexibility; however, it will require extensive revision of its pattern-matching rules and dictionaries.
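The evaluation metrics defined in this abstract (precision, recall, F1 as their harmonic mean, and F2 weighting recall more heavily) can be sketched as below. The function names are hypothetical, and the F-scores computed for Philter are plain arithmetic on the two aggregate figures quoted in the abstract (recall 67%, precision 37%); they are not values reported by the study.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: beta > 1 favours recall, beta = 1 is the harmonic mean (F1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Illustrative only: derived from the aggregate precision and recall in the abstract.
philter_f1 = f_beta(0.37, 0.67, beta=1.0)
philter_f2 = f_beta(0.37, 0.67, beta=2.0)
print(f"F1 = {philter_f1:.3f}, F2 = {philter_f2:.3f}")
```

Because missing an identifier is costlier than over-redacting in de-identification, a recall-weighted score such as F2 rewards a high-recall, low-precision tool more than F1 does, which matches the abstract's reasoning for preferring Philter.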
Affiliation(s)
- Carol El-Hayek
  - Burnet Institute, Melbourne, Australia
  - Melbourne School of Population and Global Health, University of Melbourne, Australia
  - School of Public Health and Preventive Medicine, Monash University, Australia
- Siamak Barzegar
  - School of Computing and Information Systems, University of Melbourne, Australia
- Noel Faux
  - Melbourne Data Analytics Platform, University of Melbourne, Australia
  - Florey Institute of Neuroscience and Mental Health, University of Melbourne, Australia
- Kim Doyle
  - Melbourne Data Analytics Platform, University of Melbourne, Australia
- Priyanka Pillai
  - Melbourne Data Analytics Platform, University of Melbourne, Australia
  - The Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
- Simon J Mutch
  - Melbourne Data Analytics Platform, University of Melbourne, Australia
- Alaina Vaisey
  - Melbourne School of Population and Global Health, University of Melbourne, Australia
- Roger Ward
  - Department of General Practice and Primary Care, University of Melbourne, Australia
- Lena Sanci
  - Department of General Practice and Primary Care, University of Melbourne, Australia
- Adam G Dunn
  - School of Medical Sciences, University of Sydney, Australia
- Margaret E Hellard
  - Burnet Institute, Melbourne, Australia
  - Melbourne School of Population and Global Health, University of Melbourne, Australia
  - School of Public Health and Preventive Medicine, Monash University, Australia
  - The Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
- Jane S Hocking
  - Melbourne School of Population and Global Health, University of Melbourne, Australia
- Karin Verspoor
  - School of Computing and Information Systems, University of Melbourne, Australia
  - School of Computing Technologies, RMIT University, Melbourne, Australia
- Douglas Ir Boyle
  - Department of General Practice and Primary Care, University of Melbourne, Australia
11
|
Šuster S, Baldwin T, Lau JH, Jimeno Yepes A, Martinez Iraola D, Otmakhova Y, Verspoor K. Automating Quality Assessment of Medical Evidence in Systematic Reviews: Model Development and Validation Study. J Med Internet Res 2023; 25:e35568. [PMID: 36722350 PMCID: PMC10131699 DOI: 10.2196/35568] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2021] [Revised: 01/18/2023] [Accepted: 01/31/2023] [Indexed: 02/01/2023] [Open access]
Abstract
BACKGROUND Assessment of the quality of medical evidence available on the web is a critical step in the preparation of systematic reviews. Existing tools that automate parts of this task validate the quality of individual studies but not of entire bodies of evidence and focus on a restricted set of quality criteria. OBJECTIVE We proposed a quality assessment task that provides an overall quality rating for each body of evidence (BoE), as well as finer-grained justification for different quality criteria according to the Grading of Recommendation, Assessment, Development, and Evaluation formalization framework. For this purpose, we constructed a new data set and developed a machine learning baseline system (EvidenceGRADEr). METHODS We algorithmically extracted quality-related data from all summaries of findings found in the Cochrane Database of Systematic Reviews. Each BoE was defined by a set of population, intervention, comparison, and outcome criteria and assigned a quality grade (high, moderate, low, or very low) together with quality criteria (justification) that influenced that decision. Different statistical data, metadata about the review, and parts of the review text were extracted as support for grading each BoE. After pruning the resulting data set with various quality checks, we used it to train several neural-model variants. The predictions were compared against the labels originally assigned by the authors of the systematic reviews. RESULTS Our quality assessment data set, Cochrane Database of Systematic Reviews Quality of Evidence, contains 13,440 instances, or BoEs labeled for quality, originating from 2252 systematic reviews published on the internet from 2002 to 2020. 
On the basis of a 10-fold cross-validation, the best neural binary classifiers for quality criteria detected risk of bias at 0.78 F1 (P=.68; R=0.92) and imprecision at 0.75 F1 (P=.66; R=0.86), while the performance on inconsistency, indirectness, and publication bias criteria was lower (F1 in the range of 0.3-0.4). The prediction of the overall quality grade into 1 of the 4 levels resulted in 0.5 F1. When casting the task as a binary problem by merging the Grading of Recommendation, Assessment, Development, and Evaluation classes (high+moderate vs low+very low-quality evidence), we attained 0.74 F1. We also found that the results varied depending on the supporting information that is provided as an input to the models. CONCLUSIONS Different factors affect the quality of evidence in the context of systematic reviews of medical evidence. Some of these (risk of bias and imprecision) can be automated with reasonable accuracy. Other quality dimensions such as indirectness, inconsistency, and publication bias prove more challenging for machine learning, largely because they are much rarer. This technology could substantially reduce reviewer workload in the future and expedite quality assessment as part of evidence synthesis.
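The reported F1 values can be cross-checked against the precision and recall figures quoted above, since F1 is the harmonic mean of the two. The following is a minimal sketch for that arithmetic only; the function and variable names are illustrative and are not taken from the EvidenceGRADEr code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as quoted in the abstract.
reported = {
    "risk of bias": (0.68, 0.92),
    "imprecision": (0.66, 0.86),
}

# Round to two decimals, the precision used in the abstract.
f1_by_criterion = {
    name: round(f1_score(p, r), 2) for name, (p, r) in reported.items()
}
print(f1_by_criterion)
```

Both recomputed values agree with the F1 scores stated in the abstract (0.78 and 0.75) at two decimal places.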
Affiliation(s)
- Simon Šuster
- School of Computing and Information Systems, University of Melbourne, Melbourne, Australia
| | - Timothy Baldwin
- School of Computing and Information Systems, University of Melbourne, Melbourne, Australia; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
| | - Jey Han Lau
- School of Computing and Information Systems, University of Melbourne, Melbourne, Australia
| | - Antonio Jimeno Yepes
- School of Computing and Information Systems, University of Melbourne, Melbourne, Australia; School of Computing Technologies, RMIT University, Melbourne, Australia
| | | | - Yulia Otmakhova
- School of Computing and Information Systems, University of Melbourne, Melbourne, Australia
| | - Karin Verspoor
- School of Computing Technologies, RMIT University, Melbourne, Australia
| |
|
12
|
Rozova V, Khanina A, Teng JC, Teh JSK, Worth LJ, Slavin MA, Thursky KA, Verspoor K. Detecting evidence of invasive fungal infections in cytology and histopathology reports enriched with concept-level annotations. J Biomed Inform 2023; 139:104293. [PMID: 36682389 DOI: 10.1016/j.jbi.2023.104293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2022] [Revised: 01/09/2023] [Accepted: 01/16/2023] [Indexed: 01/22/2023]
Abstract
Invasive fungal infections (IFIs) are particularly dangerous to high-risk patients with haematological malignancies and are responsible for excessive mortality and delays in cancer therapy. Surveillance of IFI in clinical settings offers an opportunity to identify potential risk factors and evaluate new therapeutic strategies. However, manual surveillance is both time- and resource-intensive. As part of a broader project that aims to develop a system for automated IFI surveillance by leveraging electronic medical records, we present our approach to detecting evidence of IFI in the key diagnostic domain of histopathology. Using natural language processing (NLP), we analysed cytology and histopathology reports to identify IFI-positive reports. We compared a conventional bag-of-words classification model to a method that relies on concept-level annotations. Although the investment to prepare data supporting concept annotations is substantial, extracting targeted information specific to IFI as a pre-processing step increased the performance of the classifier from a PR AUC of 0.84 to 0.92 and enabled model interpretability. We have made the annotated dataset of 283 reports, the Cytology and Histopathology IFI Reports corpus (CHIFIR), publicly available to allow the clinical NLP research community to build further on our results.
Affiliation(s)
- Vlada Rozova
- School of Computing Technologies, RMIT University, Melbourne, Australia; School of Computing and Information Systems, University of Melbourne, Melbourne, Australia; National Centre for Infections in Cancer, Peter MacCallum Cancer Centre, Melbourne, Australia.
| | - Anna Khanina
- National Centre for Infections in Cancer, Peter MacCallum Cancer Centre, Melbourne, Australia; Department of Infectious Diseases, Peter MacCallum Cancer Centre, Melbourne, Australia; Sir Peter MacCallum Department of Oncology, University of Melbourne, Melbourne, Australia
| | - Jasmine C Teng
- National Centre for Infections in Cancer, Peter MacCallum Cancer Centre, Melbourne, Australia; Department of Infectious Diseases, Peter MacCallum Cancer Centre, Melbourne, Australia
| | - Joanne S K Teh
- National Centre for Infections in Cancer, Peter MacCallum Cancer Centre, Melbourne, Australia; Department of Infectious Diseases, Peter MacCallum Cancer Centre, Melbourne, Australia
| | - Leon J Worth
- National Centre for Infections in Cancer, Peter MacCallum Cancer Centre, Melbourne, Australia; Department of Infectious Diseases, Peter MacCallum Cancer Centre, Melbourne, Australia; Sir Peter MacCallum Department of Oncology, University of Melbourne, Melbourne, Australia
| | - Monica A Slavin
- National Centre for Infections in Cancer, Peter MacCallum Cancer Centre, Melbourne, Australia; Department of Infectious Diseases, Peter MacCallum Cancer Centre, Melbourne, Australia; Sir Peter MacCallum Department of Oncology, University of Melbourne, Melbourne, Australia
| | - Karin A Thursky
- National Centre for Infections in Cancer, Peter MacCallum Cancer Centre, Melbourne, Australia; Department of Infectious Diseases, Peter MacCallum Cancer Centre, Melbourne, Australia; Sir Peter MacCallum Department of Oncology, University of Melbourne, Melbourne, Australia; National Centre for Antimicrobial Stewardship, Peter Doherty Institute for Infection and Immunity, University of Melbourne, Melbourne, Australia
| | - Karin Verspoor
- School of Computing Technologies, RMIT University, Melbourne, Australia; School of Computing and Information Systems, University of Melbourne, Melbourne, Australia.
| |
|
13
|
Jimeno Yepes AJ, Verspoor K. Classifying literature mentions of biological pathogens as experimentally studied using natural language processing. J Biomed Semantics 2023; 14:1. [PMID: 36721225 PMCID: PMC9889128 DOI: 10.1186/s13326-023-00282-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2022] [Accepted: 01/17/2023] [Indexed: 02/02/2023] Open
Abstract
BACKGROUND Information pertaining to mechanisms, management and treatment of disease-causing pathogens including viruses and bacteria is readily available from research publications indexed in MEDLINE. However, identifying the literature that specifically characterises these pathogens and their properties based on experimental research, important for understanding of the molecular basis of diseases caused by these agents, requires sifting through a large number of articles to exclude incidental mentions of the pathogens, or references to pathogens in other non-experimental contexts such as public health. OBJECTIVE In this work, we lay the foundations for the development of automatic methods for characterising mentions of pathogens in scientific literature, focusing on the task of identifying research that involves the experimental study of a pathogen. There are no manually annotated pathogen corpora available for this purpose, while such resources are necessary to support the development of machine learning-based models. We therefore aim to fill this gap, producing a large data set automatically from MEDLINE under some simplifying assumptions for the task definition, and using it to explore automatic methods that specifically support the detection of experimentally studied pathogen mentions in research publications. METHODS We developed a pathogen mention characterisation literature data set, READBiomed-Pathogens, automatically using NCBI resources, which we make available. Resources such as the NCBI Taxonomy, MeSH and GenBank can be used effectively to identify relevant literature about experimentally researched pathogens, more specifically using MeSH to link to MEDLINE citations including titles and abstracts with experimentally researched pathogens. 
We experiment with several machine learning-based natural language processing (NLP) algorithms leveraging this data set as training data, to model the task of detecting papers that specifically describe experimental study of a pathogen. RESULTS We show that our data set READBiomed-Pathogens can be used to explore natural language processing configurations for experimental pathogen mention characterisation. READBiomed-Pathogens includes citations related to organisms including bacteria, viruses, and a small number of toxins and other disease-causing agents. CONCLUSIONS We studied the characterisation of experimentally studied pathogens in scientific literature, developing several natural language processing methods supported by an automatically developed data set. As a core contribution of the work, we presented a methodology to automatically construct a data set for pathogen identification using existing biomedical resources. The data set and the annotation code are made publicly available. The performance of the pathogen mention identification and characterisation algorithms was additionally evaluated on a small manually annotated data set, which shows that the data set that we have generated allows characterising pathogens of interest. TRIAL REGISTRATION N/A.
Affiliation(s)
- Antonio Jose Jimeno Yepes
- School of Computing Technologies, RMIT University, Melbourne, Australia.
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.
| | - Karin Verspoor
- School of Computing Technologies, RMIT University, Melbourne, Australia
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
| |
|
14
|
Liu Y, Teo SM, Méric G, Tang HHF, Zhu Q, Sanders JG, Vázquez-Baeza Y, Verspoor K, Vartiainen VA, Jousilahti P, Lahti L, Niiranen T, Havulinna AS, Knight R, Salomaa V, Inouye M. The gut microbiome is a significant risk factor for future chronic lung disease. J Allergy Clin Immunol 2022; 151:943-952. [PMID: 36587850 PMCID: PMC10109092 DOI: 10.1016/j.jaci.2022.12.810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Revised: 11/21/2022] [Accepted: 12/05/2022] [Indexed: 12/30/2022]
Abstract
BACKGROUND The gut-lung axis is generally recognized, but there are few large studies of the gut microbiome and incident respiratory disease in adults. OBJECTIVE We sought to investigate the association and predictive capacity of the gut microbiome for incident asthma and chronic obstructive pulmonary disease (COPD). METHODS Shallow metagenomic sequencing was performed for stool samples from a prospective, population-based cohort (FINRISK02; N = 7115 adults) with linked national administrative health register-derived classifications for incident asthma and COPD up to 15 years after baseline. Generalized linear models and Cox regressions were used to assess associations of microbial taxa and diversity with disease occurrence. Predictive models were constructed using machine learning with extreme gradient boosting. Models considered taxa abundances individually and in combination with other risk factors, including sex, age, body mass index, and smoking status. RESULTS A total of 695 and 392 statistically significant associations were found between baseline taxonomic groups and incident asthma and COPD, respectively. Gradient boosting decision trees of baseline gut microbiome abundance predicted incident asthma and COPD in the validation data sets with mean areas under the curve of 0.608 and 0.780, respectively. Cox analysis showed that the baseline gut microbiome achieved higher predictive performance than individual conventional risk factors, with C-indices of 0.623 for asthma and 0.817 for COPD. The integration of the gut microbiome and conventional risk factors further improved prediction capacities. CONCLUSIONS The gut microbiome is a significant risk factor for incident asthma and incident COPD and is largely independent of conventional risk factors.
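For readers unfamiliar with the C-index quoted above, the sketch below shows one common definition (Harrell's concordance index) on made-up toy data. It is a simplified illustration, not the study's analysis code: pairs with tied event times are skipped, and all names and values are hypothetical.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Fraction of comparable pairs in which the earlier event has the higher risk."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that subject `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification: skip tied times; skip pairs where the earlier
        # subject was censored, because the true ordering is unknown.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied risk scores count as half
    return concordant / comparable if comparable else float("nan")


# Toy data (hypothetical): follow-up time, event indicator, predicted risk.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.623 and 0.817 should be read.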
Affiliation(s)
- Yang Liu
- Department of Clinical Pathology, Melbourne Medical School, The University of Melbourne, Melbourne, Australia; Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Australia.
| | - Shu Mei Teo
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Australia; Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom; Centre for Youth Mental Health, University of Melbourne, Melbourne, Australia
| | - Guillaume Méric
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Australia
| | - Howard H F Tang
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Australia; Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
| | - Qiyun Zhu
- School of Life Sciences, Arizona State University, Tempe, Ariz; Biodesign Center for Fundamental and Applied Microbiomics, Arizona State University, Tempe, Ariz
| | - Jon G Sanders
- Department of Ecology and Evolutionary Biology, Cornell University, Ithaca, NY
| | - Yoshiki Vázquez-Baeza
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, La Jolla, Calif
| | - Karin Verspoor
- School of Computing Technologies, RMIT University, Melbourne, Australia; School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
| | - Ville A Vartiainen
- Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland; Individualized Drug Therapy Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland; Department of Pulmonary Medicine, Heart and Lung Center, Helsinki University Hospital, Helsinki, Finland
| | - Pekka Jousilahti
- Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland
| | - Leo Lahti
- Department of Computing, University of Turku, Turku, Finland
| | - Teemu Niiranen
- Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland; Division of Medicine, Turku University Hospital and University of Turku, Turku, Finland
| | - Aki S Havulinna
- Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland; Institute for Molecular Medicine Finland, FIMM-HiLIFE, University of Helsinki, Helsinki, Finland
| | - Rob Knight
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, La Jolla, Calif; Department of Computer Science and Engineering, University of California San Diego, La Jolla, Calif; Department of Pediatrics, School of Medicine, University of California San Diego, La Jolla, Calif
| | - Veikko Salomaa
- Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland
| | - Michael Inouye
- Department of Clinical Pathology, Melbourne Medical School, The University of Melbourne, Melbourne, Australia; Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Australia; Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom; British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom; British Heart Foundation Cambridge Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, United Kingdom; Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, United Kingdom; The Alan Turing Institute, London, United Kingdom; Heart and Lung Research Institute, University of Cambridge, Cambridge, United Kingdom.
| |
|
15
|
Eysenbach G, Šuster S, Baldwin T, Verspoor K. Predicting Publication of Clinical Trials Using Structured and Unstructured Data: Model Development and Validation Study. J Med Internet Res 2022; 24:e38859. [PMID: 36563029 PMCID: PMC9823568 DOI: 10.2196/38859] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/19/2022] [Revised: 10/14/2022] [Accepted: 11/16/2022] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND Publication of registered clinical trials is a critical step in the timely dissemination of trial findings. However, a significant proportion of completed clinical trials are never published, motivating the need to analyze the factors behind success or failure to publish. This could inform study design, help regulatory decision-making, and improve resource allocation. It could also enhance our understanding of bias in the publication of trials and publication trends based on the research direction or strength of the findings. Although the publication of clinical trials has been addressed in several descriptive studies at an aggregate level, there is a lack of research on the predictive analysis of a trial's publishability given an individual (planned) clinical trial description. OBJECTIVE We aimed to conduct a study that combined structured and unstructured features relevant to publication status in a single predictive approach. Established natural language processing techniques as well as recent pretrained language models enabled us to incorporate information from the textual descriptions of clinical trials into a machine learning approach. We were particularly interested in whether and which textual features could improve the classification accuracy for publication outcomes. METHODS In this study, we used metadata from ClinicalTrials.gov (a registry of clinical trials) and MEDLINE (a database of academic journal articles) to build a data set of clinical trials (N=76,950) that contained the description of a registered trial and its publication outcome (27,702/76,950, 36% published and 49,248/76,950, 64% unpublished). This is the largest data set of its kind, which we released as part of this work. The publication outcome in the data set was identified from MEDLINE based on clinical trial identifiers. 
We carried out a descriptive analysis and predicted the publication outcome using 2 approaches: a neural network with a large domain-specific language model and a random forest classifier using a weighted bag-of-words representation of text. RESULTS First, our analysis of the newly created data set corroborates several findings from the existing literature regarding attributes associated with a higher publication rate. Second, a crucial observation from our predictive modeling was that the addition of textual features (eg, eligibility criteria) offers consistent improvements over using only structured data (F1-score=0.62-0.64 vs F1-score=0.61 without textual features). Both pretrained language models and more basic word-based representations provide high-utility text representations, with no significant empirical difference between the two. CONCLUSIONS Different factors affect the publication of a registered clinical trial. Our approach to predictive modeling combines heterogeneous features, both structured and unstructured. We show that methods from natural language processing can provide effective textual features to enable more accurate prediction of publication success, which has not been explored for this task previously.
Affiliation(s)
| | - Simon Šuster
- School of Computing and Information Systems, University of Melbourne, Melbourne, Australia
| | - Timothy Baldwin
- School of Computing and Information Systems, University of Melbourne, Melbourne, Australia; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
| | - Karin Verspoor
- School of Computing Technologies, RMIT University, Melbourne, Australia
| |
|
16
|
Ghosh Roy G, Geard N, Verspoor K, He S. MPVNN: Mutated Pathway Visible Neural Network architecture for interpretable prediction of cancer-specific survival risk. Bioinformatics 2022; 38:5026-5032. [PMID: 36124954 DOI: 10.1093/bioinformatics/btac636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2022] [Revised: 08/04/2022] [Accepted: 09/16/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Survival risk prediction using gene expression data is important in making treatment decisions in cancer. Standard neural network (NN) survival analysis models are black boxes with a lack of interpretability. More interpretable visible neural network architectures are designed using biological pathway knowledge. But they do not model how pathway structures can change for particular cancer types. RESULTS We propose a novel Mutated Pathway Visible Neural Network (MPVNN) architecture, designed using prior signaling pathway knowledge and random replacement of known pathway edges using gene mutation data simulating signal flow disruption. As a case study, we use the PI3K-Akt pathway and demonstrate overall improved cancer-specific survival risk prediction of MPVNN over other similar-sized NN and standard survival analysis methods. We show that trained MPVNN architecture interpretation, which points to smaller sets of genes connected by signal flow within the PI3K-Akt pathway that is important in risk prediction for particular cancer types, is reliable. AVAILABILITY AND IMPLEMENTATION The data and code are available at https://github.com/gourabghoshroy/MPVNN. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gourab Ghosh Roy
- School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK; School of Computing and Information Systems, University of Melbourne, Melbourne 3052, Australia
| | - Nicholas Geard
- School of Computing and Information Systems, University of Melbourne, Melbourne 3052, Australia
| | - Karin Verspoor
- School of Computing and Information Systems, University of Melbourne, Melbourne 3052, Australia; School of Computing Technologies, RMIT University, Melbourne 3000, Australia
| | - Shan He
- School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK
| |
|
17
|
Goudey B, Geard N, Verspoor K, Zobel J. Propagation, detection and correction of errors using the sequence database network. Brief Bioinform 2022; 23:6764545. [PMID: 36266246 PMCID: PMC9677457 DOI: 10.1093/bib/bbac416] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/10/2022] [Revised: 07/31/2022] [Accepted: 08/28/2022] [Indexed: 12/14/2022] Open
Abstract
Nucleotide and protein sequences stored in public databases are the cornerstone of many bioinformatics analyses. The records containing these sequences are prone to a wide range of errors, including incorrect functional annotation, sequence contamination and taxonomic misclassification. One source of information that can help to detect errors is the strong interdependency between records. Novel sequences in one database draw their annotations from existing records, may generate new records in multiple other locations and will have varying degrees of similarity with existing records across a range of attributes. A network perspective of these relationships between sequence records, within and across databases, offers new opportunities to detect, or even correct, erroneous entries and more broadly to make inferences about record quality. Here, we describe this novel perspective of sequence database records as a rich network, which we call the sequence database network, and illustrate the opportunities this perspective offers for quantification of database quality and detection of spurious entries. We provide an overview of the relevant databases and describe how the interdependencies between sequence records across these databases can be exploited by network analyses. We review the process of sequence annotation and provide a classification of sources of error, highlighting propagation as a major source. We illustrate the value of a network perspective through three case studies that use network analysis to detect errors, and explore the quality and quantity of critical relationships that would inform such network analyses. This systematic description of a network perspective of sequence database records provides a novel direction to combat the proliferation of errors within these critical bioinformatics resources.
Affiliation(s)
- Benjamin Goudey
- Corresponding author. Benjamin Goudey, School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010,
| | - Nicholas Geard
- School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010
| | - Karin Verspoor
- School of Computing Technologies, RMIT University Melbourne, Victoria, 3000
| | - Justin Zobel
- School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010
| |
|
18
|
Liu J, Capurro D, Nguyen A, Verspoor K. "Note Bloat" impacts deep learning-based NLP models for clinical prediction tasks. J Biomed Inform 2022; 133:104149. [PMID: 35878821 DOI: 10.1016/j.jbi.2022.104149] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/22/2022] [Revised: 05/28/2022] [Accepted: 07/19/2022] [Indexed: 10/17/2022]
Abstract
One unintended consequence of Electronic Health Record (EHR) implementation is the overuse of content-importing technology, such as copy-and-paste, that creates "bloated" notes containing large amounts of textual redundancy. Despite the rising interest in applying machine learning models to learn from real-patient data, it is unclear how the phenomenon of note bloat might affect the Natural Language Processing (NLP) models derived from these notes. Therefore, in this work we examine the impact of redundancy on deep learning-based NLP models, considering four clinical prediction tasks using a publicly available EHR database. We applied two deduplication methods to the hospital notes, identifying large quantities of redundancy, and found that removing the redundancy usually has little negative impact on downstream performance, and can in certain circumstances help models achieve significantly better results. We also showed that it is possible to attack model predictions by simply adding note duplicates, turning correct predictions made by trained models into wrong ones. In conclusion, we demonstrated that EHR text redundancy substantively affects NLP models for clinical prediction tasks, showing that awareness of clinical contexts and robust modeling methods are important for creating effective and reliable NLP systems in healthcare contexts.
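To make the idea of note deduplication concrete, the sketch below removes sentences that already appeared in earlier notes of the same patient. The abstract does not specify the two deduplication methods used in the study, so this exact-match, sentence-level approach is an assumption chosen for illustration; the function name and example notes are hypothetical.

```python
import re


def deduplicate_notes(notes):
    """Drop sentences that already occurred in earlier notes (exact match
    after lowercasing and whitespace normalisation)."""
    seen = set()
    cleaned = []
    for note in notes:
        kept = []
        # Naive sentence split on whitespace that follows ., ! or ?
        for sentence in re.split(r"(?<=[.!?])\s+", note.strip()):
            key = " ".join(sentence.lower().split())
            if key and key not in seen:
                seen.add(key)
                kept.append(sentence)
        cleaned.append(" ".join(kept))
    return cleaned


# Hypothetical notes in chronological order; the second copies the first.
notes = [
    "Patient admitted with chest pain. ECG normal.",
    "Patient admitted with chest pain. ECG normal. Troponin negative.",
]
cleaned = deduplicate_notes(notes)
print(cleaned)
```

Exact matching misses near-duplicates such as lightly edited copy-and-paste text, so a real pipeline would likely need fuzzy or n-gram based matching.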
Affiliation(s)
- Jinghui Liu
- School of Computing and Information Systems, The University of Melbourne, Victoria, Australia; Australian e-Health Research Centre, CSIRO, Brisbane, Australia.
| | - Daniel Capurro
- School of Computing and Information Systems, The University of Melbourne, Victoria, Australia; Centre for Digital Transformation of Health, Melbourne Medical School, The University of Melbourne, Victoria, Australia.
| | - Anthony Nguyen
- Australian e-Health Research Centre, CSIRO, Brisbane, Australia.
| | - Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Victoria, Australia; Centre for Digital Transformation of Health, Melbourne Medical School, The University of Melbourne, Victoria, Australia; School of Computing Technologies, RMIT University, Victoria, Australia.
| |
|
19
|
Lederman A, Lederman R, Verspoor K. Tasks as needs: reframing the paradigm of clinical natural language processing research for real-world decision support. J Am Med Inform Assoc 2022; 29:1810-1817. [PMID: 35848784 PMCID: PMC9471702 DOI: 10.1093/jamia/ocac121] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 03/19/2022] [Revised: 06/06/2022] [Accepted: 07/04/2022] [Indexed: 12/13/2022] Open
Abstract
Electronic medical records are increasingly used to store patient information in hospitals and other clinical settings. There has been a corresponding proliferation of clinical natural language processing (cNLP) systems aimed at using text data in these records to improve clinical decision-making, in comparison to manual clinician search and clinical judgment alone. However, these systems have delivered marginal practical utility and are rarely deployed into healthcare settings, leading to proposals for technical and structural improvements. In this paper, we argue that this reflects a violation of Friedman's "Fundamental Theorem of Biomedical Informatics," and that a deeper epistemological change must occur in the cNLP field, as a parallel step alongside any technical or structural improvements. We propose that researchers shift away from designing cNLP systems independent of clinical needs, in which cNLP tasks are ends in themselves ("tasks as decisions"), and toward systems that are directly guided by the needs of clinicians in realistic decision-making contexts ("tasks as needs"). A case study example illustrates the potential benefits of developing cNLP systems that are designed to more directly support clinical needs.
Affiliation(s)
- Asher Lederman
- Faculty of Engineering and IT, School of Computing and Information Systems, University of Melbourne, Melbourne, Australia
| | - Reeva Lederman
- Faculty of Engineering and IT, School of Computing and Information Systems, University of Melbourne, Melbourne, Australia
| | - Karin Verspoor
- STEM College, School of Computing Technologies, RMIT University, Melbourne, Australia
| |
|
20
|
Chen J, Goudey B, Zobel J, Geard N, Verspoor K. Exploring automatic inconsistency detection for literature-based gene ontology annotation. Bioinformatics 2022; 38:i273-i281. [PMID: 35758780 PMCID: PMC9235499 DOI: 10.1093/bioinformatics/btac230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/08/2022] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Literature-based gene ontology annotations (GOA) are biological database records that use controlled vocabulary to uniformly represent gene function information described in the primary literature. Assuring the quality of GOA is crucial for supporting biological research. However, a range of different kinds of inconsistencies between the literature used as evidence and the annotated GO terms can be identified; these have not been systematically studied at record level. The existing manual-curation approach to GOA consistency assurance is inefficient and unable to keep pace with the rate of updates to gene function knowledge. Automatic tools are therefore needed to assist with GOA consistency assurance. This article presents an exploration of different GOA inconsistencies and an early feasibility study of automatic inconsistency detection. RESULTS We have created a reliable synthetic dataset to simulate four realistic types of GOA inconsistency in biological databases. Three automatic approaches are proposed. They provide reasonable performance on the task of distinguishing the four types of inconsistency and are directly applicable to detecting inconsistencies in real-world GOA database records. Major challenges resulting from such inconsistencies in the context of several specific application settings are reported. This is the first study to introduce automatic approaches designed to address the challenges in current GOA quality assurance workflows. The data underlying this article are available on GitHub at https://github.com/jiyuc/AutoGOAConsistency.
Affiliation(s)
- Jiyu Chen
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Benjamin Goudey
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Justin Zobel
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Nicholas Geard
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia; School of Computer Technologies, RMIT University, Melbourne, VIC 3000, Australia
|
21
|
Liu Y, Méric G, Havulinna AS, Teo SM, Åberg F, Ruuskanen M, Sanders J, Zhu Q, Tripathi A, Verspoor K, Cheng S, Jain M, Jousilahti P, Vázquez-Baeza Y, Loomba R, Lahti L, Niiranen T, Salomaa V, Knight R, Inouye M. Early prediction of incident liver disease using conventional risk factors and gut-microbiome-augmented gradient boosting. Cell Metab 2022; 34:719-730.e4. [PMID: 35354069 PMCID: PMC9097589 DOI: 10.1016/j.cmet.2022.03.002] [Citation(s) in RCA: 29] [Impact Index Per Article: 14.5] [Received: 08/31/2021] [Revised: 01/06/2022] [Accepted: 03/08/2022] [Indexed: 02/08/2023]
Abstract
The gut microbiome has shown promise as a predictive biomarker for various diseases. However, the potential of gut microbiota for prospective risk prediction of liver disease has not been assessed. Here, we utilized shallow shotgun metagenomic sequencing of a large population-based cohort (N > 7,000) with ∼15 years of follow-up in combination with machine learning to investigate the predictive capacity of gut microbial predictors individually and in conjunction with conventional risk factors for incident liver disease. Separately, conventional and microbial factors showed comparable predictive capacity. However, microbiome augmentation of conventional risk factors using machine learning significantly improved the performance. Similarly, disease-free survival analysis showed significantly improved stratification using microbiome-augmented models. Investigation of predictive microbial signatures revealed previously unknown taxa for liver disease, as well as those previously associated with hepatic function and disease. This study supports the potential clinical validity of gut metagenomic sequencing to complement conventional risk factors for prediction of liver diseases.
Affiliation(s)
- Yang Liu
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia; Department of Clinical Pathology, Melbourne Medical School, The University of Melbourne, Melbourne, VIC, Australia.
- Guillaume Méric
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia; Department of Clinical Pathology, Melbourne Medical School, The University of Melbourne, Melbourne, VIC, Australia; Baker Department of Cardiometabolic Health, The University of Melbourne, Melbourne, VIC, Australia; Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, VIC, Australia
- Aki S Havulinna
- Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland; Institute of Molecular Medicine Finland, University of Helsinki, Helsinki, Finland
- Shu Mei Teo
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia; Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Fredrik Åberg
- Transplantation and Liver Surgery Clinic, Helsinki University Hospital, University of Helsinki, Helsinki, Finland
- Matti Ruuskanen
- Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland; Department of Internal Medicine, University of Turku, Turku, Finland
- Jon Sanders
- Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, CA, USA
- Qiyun Zhu
- Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, CA, USA
- Anupriya Tripathi
- Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, CA, USA; Division of Biological Sciences, University of California, San Diego, La Jolla, CA, USA
- Karin Verspoor
- School of Computing and Information Systems, University of Melbourne, Melbourne, VIC, Australia; School of Computing Technologies, RMIT University, Melbourne, VIC, Australia
- Susan Cheng
- Smidt Heart Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
- Mohit Jain
- Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, CA, USA; Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, USA
- Pekka Jousilahti
- Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland
- Yoshiki Vázquez-Baeza
- Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, USA; Department of Computer Science & Engineering, Jacobs School of Engineering, University of California, San Diego, La Jolla, CA, USA
- Rohit Loomba
- NAFLD Research Center, Department of Medicine, University of California, San Diego, La Jolla, CA, USA
- Leo Lahti
- Department of Computing, University of Turku, Turku, Finland
- Teemu Niiranen
- Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland; Department of Internal Medicine, University of Turku, Turku, Finland; Division of Medicine, Turku University Hospital, Turku, Finland
- Veikko Salomaa
- Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland
- Rob Knight
- Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, CA, USA; Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, USA; Department of Computer Science & Engineering, Jacobs School of Engineering, University of California, San Diego, La Jolla, CA, USA
- Michael Inouye
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia; Department of Clinical Pathology, Melbourne Medical School, The University of Melbourne, Melbourne, VIC, Australia; Baker Department of Cardiometabolic Health, The University of Melbourne, Melbourne, VIC, Australia; Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; Health Data Research UK Cambridge, Wellcome Genome Campus, University of Cambridge, Cambridge, UK; British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK; The Alan Turing Institute, London, UK.
|
22
|
Hur B, Hardefeldt LY, Verspoor K, Baldwin T, Gilkerson JR. Overcoming challenges in extracting prescribing habits from veterinary clinics using big data and deep learning. Aust Vet J 2022; 100:220-222. [DOI: 10.1111/avj.13145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2021] [Accepted: 01/02/2022] [Indexed: 11/27/2022]
Affiliation(s)
- B Hur
- Asia‐Pacific Centre for Animal Health, Melbourne Veterinary School, University of Melbourne, Melbourne, Victoria, Australia
- School of Computing and Information Systems, University of Melbourne, Melbourne, Victoria, Australia
- LY Hardefeldt
- Asia‐Pacific Centre for Animal Health, Melbourne Veterinary School, University of Melbourne, Melbourne, Victoria, Australia
- K Verspoor
- School of Computing and Information Systems, University of Melbourne, Melbourne, Victoria, Australia
- School of Computing Technologies, RMIT University, Melbourne, Victoria, Australia
- T Baldwin
- School of Computing and Information Systems, University of Melbourne, Melbourne, Victoria, Australia
- JR Gilkerson
- Asia‐Pacific Centre for Animal Health, Melbourne Veterinary School, University of Melbourne, Melbourne, Victoria, Australia
|
23
|
Cao K, Verspoor K, Sahebjada S, Baird PN. Accuracy of Machine Learning Assisted Detection of Keratoconus: A Systematic Review and Meta-Analysis. J Clin Med 2022; 11:jcm11030478. [PMID: 35159930 PMCID: PMC8836961 DOI: 10.3390/jcm11030478] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/29/2021] [Revised: 01/10/2022] [Accepted: 01/13/2022] [Indexed: 12/26/2022] Open
Abstract
(1) Background: The objective of this review was to synthesize available data on the use of machine learning to evaluate its accuracy (as determined by pooled sensitivity and specificity) in detecting keratoconus (KC), and to measure the reporting completeness of machine learning models in KC based on the TRIPOD (transparent reporting of multivariable prediction models for individual prognosis or diagnosis) statement. (2) Methods: Two independent reviewers searched the electronic databases for all potential articles on machine learning and KC published prior to 2021. The TRIPOD 29-item checklist was used to evaluate the studies' adherence to reporting guidelines, and the adherence rate for each item was computed. We conducted a meta-analysis to determine the pooled sensitivity and specificity of machine learning models for detecting KC. (3) Results: Thirty-five studies were included in this review. Thirty studies evaluated machine learning models for detecting KC eyes from controls and 14 studies evaluated machine learning models for detecting early KC eyes from controls. The pooled sensitivity for detecting KC was 0.970 (95% CI 0.949–0.982), with a pooled specificity of 0.985 (95% CI 0.971–0.993), whereas the pooled sensitivity for detecting early KC was 0.882 (95% CI 0.822–0.923), with a pooled specificity of 0.947 (95% CI 0.914–0.967). Between 3% and 48% of TRIPOD items were adhered to in studies, and the average (median) adherence rate for a single TRIPOD item was 23% across all studies. (4) Conclusions: The application of machine learning models has the potential to make the diagnosis and monitoring of KC more efficient, resulting in reduced vision loss for patients. This review provides current information on the machine learning models that have been developed for detecting KC and early KC. Presently, the machine learning models performed poorly in identifying early KC from control eyes, and many of these research studies did not follow established reporting standards, resulting in the failure of clinical translation of these machine learning models. We present possible approaches by which future studies of both KC and early KC models can improve, so that machine learning models are used more efficiently and widely in the diagnostic process.
Affiliation(s)
- Ke Cao
- Centre for Eye Research Australia, Melbourne, VIC 3002, Australia; (K.C.); (S.S.)
- Department of Surgery, Ophthalmology, The University of Melbourne, Melbourne, VIC 3002, Australia
- Karin Verspoor
- School of Computing Technologies, RMIT University, Melbourne, VIC 3000, Australia;
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC 3010, Australia
- Srujana Sahebjada
- Centre for Eye Research Australia, Melbourne, VIC 3002, Australia; (K.C.); (S.S.)
- Department of Surgery, Ophthalmology, The University of Melbourne, Melbourne, VIC 3002, Australia
- Paul N. Baird
- Department of Surgery, Ophthalmology, The University of Melbourne, Melbourne, VIC 3002, Australia
- Correspondence: ; Tel.: +61-3-9929-8613
|
24
|
Elangovan A, Li Y, Pires DEV, Davis MJ, Verspoor K. Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT. BMC Bioinformatics 2022; 23:4. [PMID: 34983371 PMCID: PMC8729035 DOI: 10.1186/s12859-021-04504-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2021] [Accepted: 11/30/2021] [Indexed: 11/10/2022] Open
Abstract
MOTIVATION Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. A range of protein functions are mediated and regulated by protein interactions through post-translational modifications (PTM). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, and this annotation is mainly performed through manual curation, which is neither time- nor cost-effective. Here we aim to facilitate annotation by extracting PPIs along with their pairwise PTM from the literature, using deep learning on distantly supervised training data to aid human curation. METHOD We use the IntAct PPI database to create a distantly supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the ensemble average confidence approach with confidence variation to counteract the effects of class imbalance when extracting high-confidence predictions. RESULTS AND CONCLUSION The PPI-BioBERT-x10 model evaluated on the test set achieved a modest micro-averaged F1 of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence and low variation to identify high-quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts and extracted 1.6 million PTM-PPI predictions (546507 unique PTM-PPI triplets), and filtered these to [Formula: see text] (4584 unique) high-confidence predictions. Of the 5700, human evaluation on a small randomly sampled subset shows that the precision drops to 33.7% despite confidence calibration, highlighting the challenges of generalisability beyond the test set even with confidence calibration. We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
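The filtering idea described in this abstract, keeping only predictions on which the ensemble members are both confident on average and in close agreement, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `select_high_quality`, the thresholds `min_mean` and `max_std`, and the toy probabilities are all assumptions made for the example.

```python
from statistics import mean, pstdev


def select_high_quality(ensemble_probs, min_mean=0.9, max_std=0.05):
    """Keep predictions with high mean confidence and low spread.

    ensemble_probs: list of lists; each inner list holds one probability
    per ensemble member for the predicted class of a single example.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    kept = []
    for idx, probs in enumerate(ensemble_probs):
        avg = mean(probs)       # ensemble average confidence
        spread = pstdev(probs)  # confidence variation across members
        if avg >= min_mean and spread <= max_std:
            kept.append((idx, avg, spread))
    return kept


# Toy data (hypothetical): three examples scored by a four-member ensemble.
examples = [
    [0.97, 0.95, 0.96, 0.98],  # confident and consistent: kept
    [0.99, 0.75, 0.99, 0.99],  # high average, but members disagree: dropped
    [0.55, 0.52, 0.58, 0.50],  # consistent, but low confidence: dropped
]

selected = select_high_quality(examples)
print([idx for idx, _, _ in selected])
```

Tightening either threshold trades coverage for precision, which mirrors the trade-off the abstract reports when only a fraction of test predictions is retained.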
Affiliation(s)
- Aparna Elangovan
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Yuan Li
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Douglas E. V. Pires
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Melissa J. Davis
- The Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
- Department of Clinical Pathology, Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, Melbourne, Australia
- Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- School of Computing Technologies, RMIT University, Melbourne, Australia
|
25
|
Rozova V, Witt K, Robinson J, Li Y, Verspoor K. Detection of self-harm and suicidal ideation in emergency department triage notes. J Am Med Inform Assoc 2021; 29:472-480. [PMID: 34897466 PMCID: PMC8800520 DOI: 10.1093/jamia/ocab261] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/19/2021] [Revised: 09/30/2021] [Accepted: 11/11/2021] [Indexed: 12/15/2022] Open
Abstract
OBJECTIVE Accurate identification of self-harm presentations to Emergency Departments (ED) can lead to more timely mental health support, aid in understanding the burden of suicidal intent in a population, and support impact evaluation of public health initiatives related to suicide prevention. Given the lack of manual self-harm reporting in ED, we aim to develop an automated system for the detection of self-harm presentations directly from ED triage notes. MATERIALS AND METHODS We frame this as supervised classification using natural language processing (NLP), utilizing a large data set of 477 627 free-text triage notes from ED presentations in 2012-2018 to The Royal Melbourne Hospital, Australia. The data were highly imbalanced, with only 1.4% of triage notes relating to self-harm. We explored various preprocessing techniques, including spelling correction, negation detection, bigram replacement, and clinical concept recognition, and several machine learning methods. RESULTS Our results show that machine learning methods dramatically outperform keyword-based methods. We achieved the best results with a calibrated Gradient Boosting model, showing 90% Precision and 90% Recall (PR-AUC 0.87) on blind test data. Prospective validation of the model achieved similar results (88% Precision; 89% Recall). DISCUSSION ED notes are noisy texts, and simple token-based models work best. Negation detection and concept recognition did not change the results, while bigram replacement significantly impaired model performance. CONCLUSION This first NLP-based classifier for self-harm in ED notes has practical value for identifying patients who would benefit from mental health follow-up in ED, and for supporting surveillance of self-harm and suicide prevention efforts in the population.
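Because only 1.4% of notes are positive, the abstract reports precision and recall instead of accuracy. The short sketch below illustrates why, using made-up confusion counts chosen to match the reported prevalence and the 90%/90% figures; the function name `precision_recall` and all counts are hypothetical and are not taken from the study.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical test set: 10,000 notes at the reported 1.4% prevalence.
TOTAL, POSITIVES = 10_000, 140

# Baseline that labels every note as negative: high accuracy, zero recall.
baseline_accuracy = (TOTAL - POSITIVES) / TOTAL
baseline_precision, baseline_recall = precision_recall(tp=0, fp=0, fn=POSITIVES)

# Hypothetical classifier: finds 126 of the 140 positives with 14 false alarms.
model_precision, model_recall = precision_recall(tp=126, fp=14, fn=14)

print(f"baseline accuracy={baseline_accuracy:.3f} recall={baseline_recall:.2f}")
print(f"model precision={model_precision:.2f} recall={model_recall:.2f}")
```

The always-negative baseline scores 98.6% accuracy while detecting no self-harm presentations, which is why accuracy alone is uninformative at this class balance.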
Affiliation(s)
- Vlada Rozova
- Corresponding Author: Vlada Rozova, PhD, School of Computing Technologies, RMIT University, 124 La Trobe Street, Melbourne, VIC 3000, Australia;
- Katrina Witt
- Orygen, Melbourne, Victoria, Australia; Centre for Youth Mental Health, The University of Melbourne, Melbourne, Victoria, Australia
- Jo Robinson
- Orygen, Melbourne, Victoria, Australia; Centre for Youth Mental Health, The University of Melbourne, Melbourne, Victoria, Australia
- Yan Li
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia
- Karin Verspoor
- School of Computing Technologies, RMIT University, Melbourne, Victoria, Australia; School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia
|
26
|
Zhai Z, Druckenbrodt C, Thorne C, Akhondi SA, Nguyen DQ, Cohn T, Verspoor K. ChemTables: a dataset for semantic classification on tables in chemical patents. J Cheminform 2021; 13:97. [PMID: 34895295 PMCID: PMC8665561 DOI: 10.1186/s13321-021-00568-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/07/2021] [Accepted: 11/06/2021] [Indexed: 11/10/2022] Open
Abstract
Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorisation of tables based on the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. To develop and evaluate methods for the table classification task, we created a new dataset, called CHEMTABLES, which consists of 788 chemical patent tables labelled with their content type. We introduce this dataset in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT, to CHEMTABLES. The best performing model, Table-BERT, achieves a performance of 88.66 micro-averaged [Formula: see text] score on the table classification task. The CHEMTABLES dataset is publicly available at https://doi.org/10.17632/g7tjh7tbrj.3, subject to the CC BY NC 3.0 license. Code and models evaluated in this work are in a GitHub repository at https://github.com/zenanz/ChemTables.
Affiliation(s)
- Zenan Zhai
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Camilo Thorne
- Elsevier-Data Science, Life Science, Amsterdam, The Netherlands
- Dat Quoc Nguyen
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- VinAI Research, Hanoi, Vietnam
- Trevor Cohn
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Present Address: School of Computing Technologies, RMIT University, Melbourne, Australia
|
27
|
Chen J, Geard N, Zobel J, Verspoor K. Automatic consistency assurance for literature-based gene ontology annotation. BMC Bioinformatics 2021; 22:565. [PMID: 34823464 PMCID: PMC8620237 DOI: 10.1186/s12859-021-04479-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/12/2021] [Accepted: 11/15/2021] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between GO annotations and the associated evidence texts identified by expert curators is reliable but time-consuming, and is infeasible in the context of rapidly growing biological literature. A key challenge is maintaining consistency of existing GO annotations as new studies are published and the GO vocabulary is updated. RESULTS In this work, we introduce a formalisation of biological database annotation inconsistencies, identifying four distinct types of inconsistency. We propose a novel and efficient method using state-of-the-art text mining models to automatically distinguish between consistent GO annotation and the different types of inconsistent GO annotation. We evaluate this method using a synthetic dataset generated by directed manipulation of instances in an existing corpus, BC4GO. We provide a detailed error analysis demonstrating that the method achieves high precision on its more confident predictions. CONCLUSIONS Two models built using our method for distinct annotation consistency identification tasks achieved high precision and were robust to updates in the GO vocabulary. Our approach demonstrates clear value for human-in-the-loop curation scenarios.
Affiliation(s)
- Jiyu Chen
- School of Computing and Information Systems, University of Melbourne, Melbourne, 3010, Australia
- Nicholas Geard
- School of Computing and Information Systems, University of Melbourne, Melbourne, 3010, Australia
- Justin Zobel
- School of Computing and Information Systems, University of Melbourne, Melbourne, 3010, Australia
- Karin Verspoor
- School of Computing and Information Systems, University of Melbourne, Melbourne, 3010, Australia; School of Computing Technologies, RMIT University, Melbourne, VIC, 3000, Australia.
|
28
|
Cao K, Verspoor K, Chan E, Daniell M, Sahebjada S, Baird PN. Machine learning with a reduced dimensionality representation of comprehensive Pentacam tomography parameters to identify subclinical keratoconus. Comput Biol Med 2021; 138:104884. [PMID: 34607273 DOI: 10.1016/j.compbiomed.2021.104884] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 07/13/2021] [Revised: 09/15/2021] [Accepted: 09/19/2021] [Indexed: 12/26/2022]
Abstract
PURPOSE To investigate the performance of a machine learning model based on a reduced dimensionality parameter space derived from complete Pentacam parameters to identify subclinical keratoconus (KC). METHODS All 1692 available parameters were obtained from the Pentacam imaging machine on 145 subclinical KC and 122 control eyes. We applied principal component analysis (PCA) to the complete Pentacam dataset to reduce its parameter dimensionality. Subsequently, we investigated the performance of the random forest algorithm with increasing numbers of components to identify the optimal number for distinguishing subclinical KC from control eyes. RESULTS The dimensionality of the complete set of 1692 Pentacam parameters was reduced to 267 principal components using PCA. A selection of 15 of these principal components explained over 85% of the variance of the original Pentacam-derived parameters; used as input to train a random forest model, they achieved the best accuracy of 98% in detecting subclinical KC eyes. The resulting model also reached a high sensitivity of 97% in identifying subclinical KC and a specificity of 98% in recognizing control eyes. CONCLUSIONS A random forest-based model trained using a modest number of components derived from a reduced dimensionality representation of complete Pentacam system parameters allowed for high accuracy of subclinical KC identification.
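The statement that a subset of principal components "explained over 85% of the variance" rests on cumulative explained variance. The sketch below shows that calculation on toy eigenvalues; it is only an illustration under stated assumptions, since the function name `components_for_variance`, the 0.85 default threshold as a selection rule, and the eigenvalues are invented here, and the study itself chose the number of components by random forest performance.

```python
from itertools import accumulate


def components_for_variance(eigenvalues, threshold=0.85):
    """Smallest number of leading components whose cumulative share of
    total variance reaches the threshold (eigenvalues = PCA variances)."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    ordered = sorted(eigenvalues, reverse=True)  # largest variance first
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Toy eigenvalues (hypothetical), not values from the Pentacam dataset.
toy_eigenvalues = [5.0, 2.0, 1.6, 0.7, 0.4, 0.2, 0.1]
k = components_for_variance(toy_eigenvalues)
print(k)  # the first three components cover about 86% of the variance
```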
Affiliation(s)
- Ke Cao
- Centre for Eye Research Australia, Melbourne, Victoria, Australia; Department of Surgery, Ophthalmology, The University of Melbourne, Melbourne, Victoria, Australia
- Karin Verspoor
- School of Computing Technologies, RMIT University, Melbourne, Australia; School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Elsie Chan
- Centre for Eye Research Australia, Melbourne, Victoria, Australia; Department of Surgery, Ophthalmology, The University of Melbourne, Melbourne, Victoria, Australia; Royal Victorian Eye and Ear Hospital, Melbourne, Victoria, Australia
- Mark Daniell
- Centre for Eye Research Australia, Melbourne, Victoria, Australia; Department of Surgery, Ophthalmology, The University of Melbourne, Melbourne, Victoria, Australia; Royal Victorian Eye and Ear Hospital, Melbourne, Victoria, Australia
- Srujana Sahebjada
- Centre for Eye Research Australia, Melbourne, Victoria, Australia; Department of Surgery, Ophthalmology, The University of Melbourne, Melbourne, Victoria, Australia
- Paul N Baird
- Department of Surgery, Ophthalmology, The University of Melbourne, Melbourne, Victoria, Australia.
|
29
|
Ghosh Roy G, Geard N, Verspoor K, He S. PoLoBag: Polynomial Lasso Bagging for signed gene regulatory network inference from expression data. Bioinformatics 2021; 36:5187-5193. [PMID: 32697830 DOI: 10.1093/bioinformatics/btaa651] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/03/2020] [Revised: 06/06/2020] [Accepted: 07/16/2020] [Indexed: 02/01/2023] Open
Abstract
MOTIVATION Inferring gene regulatory networks (GRNs) from expression data is a significant systems biology problem. A useful inference algorithm should not only unveil the global structure of the regulatory mechanisms but also the details of regulatory interactions such as edge direction (from regulator to target) and sign (activation/inhibition). Many popular GRN inference algorithms cannot infer edge signs, and those that can infer signed GRNs cannot simultaneously infer edge directions or network cycles. RESULTS To address these limitations of existing algorithms, we propose Polynomial Lasso Bagging (PoLoBag) for signed GRN inference with both edge directions and network cycles. PoLoBag is an ensemble regression algorithm in a bagging framework where Lasso weights estimated on bootstrap samples are averaged. These bootstrap samples incorporate polynomial features to capture higher-order interactions. Results demonstrate that PoLoBag is consistently more accurate for signed inference than state-of-the-art algorithms on simulated and real-world expression datasets. AVAILABILITY AND IMPLEMENTATION Algorithm and data are freely available at https://github.com/gourabghoshroy/PoLoBag. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gourab Ghosh Roy
- School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK.,School of Computing and Information Systems, University of Melbourne, Melbourne, VIC 3052, Australia
| | - Nicholas Geard
- School of Computing and Information Systems, University of Melbourne, Melbourne, VIC 3052, Australia
| | - Karin Verspoor
- School of Computing and Information Systems, University of Melbourne, Melbourne, VIC 3052, Australia
| | - Shan He
- School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK
30
Abstract
OBJECTIVES We examine the knowledge ecosystem of COVID-19, focusing on clinical knowledge and the role of health informatics as enabling technology. We argue for commitment to the model of a global learning health system to facilitate rapid knowledge translation supporting health care decision making in the face of emerging diseases. METHODS AND RESULTS We frame the evolution of knowledge in the COVID-19 crisis in terms of learning theory, and present a view of what has occurred during the pandemic to rapidly derive and share knowledge as an (underdeveloped) instance of a global learning health system. We identify the key role of information technologies for electronic data capture and data sharing, computational modelling, evidence synthesis, and knowledge dissemination. We further highlight gaps in the system and barriers to full realisation of an efficient and effective global learning health system. CONCLUSIONS The need for a global knowledge ecosystem supporting rapid learning from clinical practice has become more apparent than ever during the COVID-19 pandemic. Continued effort to realise the vision of a global learning health system, including establishing effective approaches to data governance and ethics to support the system, is imperative to enable continuous improvement in our clinical care.
Affiliation(s)
- Karin Verspoor
- School of Computing Technologies, RMIT University, Melbourne VIC 3000 Australia
- Centre for Digital Transformation of Health, The University of Melbourne, Melbourne VIC 3010 Australia
- School of Computing and Information Systems, The University of Melbourne, Melbourne VIC 3010 Australia
31
Liu J, Capurro D, Nguyen A, Verspoor K. Early prediction of diagnostic-related groups and estimation of hospital cost by processing clinical notes. NPJ Digit Med 2021; 4:103. [PMID: 34211109 PMCID: PMC8249417 DOI: 10.1038/s41746-021-00474-9] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 01/11/2021] [Accepted: 06/08/2021] [Indexed: 11/09/2022] Open
Abstract
As healthcare providers receive fixed amounts of reimbursement for given services under DRG (Diagnosis-Related Groups) payment, DRG codes are valuable for cost monitoring and resource allocation. However, coding is typically performed retrospectively post-discharge. We seek to predict DRGs and DRG-based case mix index (CMI) at early inpatient admission using routine clinical text to estimate hospital cost in an acute setting. We examined a deep learning-based natural language processing (NLP) model to automatically predict per-episode DRGs and corresponding cost-reflecting weights on two cohorts (paid under Medicare Severity (MS) DRG or All Patient Refined (APR) DRG), without human coding efforts. It achieved macro-averaged area under the receiver operating characteristic curve (AUC) scores of 0·871 (SD 0·011) on MS-DRG and 0·884 (0·003) on APR-DRG in fivefold cross-validation experiments on the first day of ICU admission. When extended to simulated patient populations to estimate average cost-reflecting weights, the model increased its accuracy over time and obtained absolute CMI error of 2·40 (1·07%) and 12·79% (2·31%), respectively on the first day. As the model can adapt to variations in admission time and cohort size and requires no extra manual coding effort, it shows potential to help estimate costs for active patients to support better operational decision-making in hospitals.
Affiliation(s)
- Jinghui Liu
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Australian e-Health Research Centre, CSIRO, Brisbane, QLD, Australia
| | - Daniel Capurro
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Centre for Digital Transformation of Health, Melbourne Medical School, The University of Melbourne, Melbourne, VIC, Australia
| | - Anthony Nguyen
- Australian e-Health Research Centre, CSIRO, Brisbane, QLD, Australia
| | - Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia.
- Centre for Digital Transformation of Health, Melbourne Medical School, The University of Melbourne, Melbourne, VIC, Australia.
- School of Computing Technologies, RMIT University, Melbourne, VIC, Australia.
32
Abstract
The gene regulatory network (GRN) architecture plays a key role in explaining the biological differences between species. We aim to understand species differences in terms of some universally present dynamical properties of their gene regulatory systems. A network architectural feature associated with controlling system-level dynamical properties is the bow-tie, identified by a strongly connected subnetwork, the core layer, between two sets of nodes, the in and the out layers. Though a bow-tie architecture has been observed in many networks, its existence has not been extensively investigated in GRNs of species of widely varying biological complexity. We analyse publicly available GRNs of several well-studied species from prokaryotes to unicellular eukaryotes to multicellular organisms. In their GRNs, we find the existence of a bow-tie architecture with a distinct largest strongly connected core layer. We show that the bow-tie architecture is a characteristic feature of GRNs. We observe an increasing trend in the relative core size with species complexity. Using studied relationships of the core size with dynamical properties like robustness and fragility, flexibility, criticality, controllability and evolvability, we hypothesize how these regulatory system properties have emerged differently with biological complexity, based on the observed differences of the GRN bow-tie architectures.
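The bow-tie decomposition described in this abstract can be illustrated with a minimal sketch. It assumes the network is given as a list of directed (regulator, target) edges, takes the largest strongly connected component as the core, and labels nodes that can reach the core as the in layer and nodes reachable from the core as the out layer. The function names and the toy network are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def _adjacency(edges):
    """Build forward and reverse adjacency lists from (source, target) pairs."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    return fwd, rev


def strongly_connected_components(nodes, fwd, rev):
    """Kosaraju's algorithm, written iteratively to avoid recursion limits."""
    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd.get(start, ())))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd.get(nxt, ()))))
                    break
            else:
                order.append(node)
                stack.pop()
    # Pass 2: sweep the reversed graph in reverse completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = set(), [start]
        while todo:
            node = todo.pop()
            component.add(node)
            for nxt in rev.get(node, ()):
                if nxt not in assigned:
                    assigned.add(nxt)
                    todo.append(nxt)
        components.append(component)
    return components


def _reachable(starts, adj):
    """All nodes reachable from `starts` by following `adj` (starts included)."""
    seen, todo = set(starts), list(starts)
    while todo:
        node = todo.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def bow_tie(nodes, edges):
    """Split a directed graph into in, core, out and remaining nodes."""
    fwd, rev = _adjacency(edges)
    core = max(strongly_connected_components(nodes, fwd, rev), key=len)
    out_layer = _reachable(core, fwd) - core   # downstream of the core
    in_layer = _reachable(core, rev) - core    # upstream of the core
    other = set(nodes) - core - in_layer - out_layer
    return {"in": in_layer, "core": core, "out": out_layer, "other": other}


# Toy network: a regulates a two-gene feedback loop (b, c), which regulates d.
toy_nodes = ["a", "b", "c", "d", "e"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")]
layers = bow_tie(toy_nodes, toy_edges)
relative_core_size = len(layers["core"]) / len(toy_nodes)
```

In this sketch the relative core size is simply the fraction of nodes in the core; the normalisation used in the paper may differ.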
Affiliation(s)
- Gourab Ghosh Roy
- School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK.,School of Computing and Information Systems, University of Melbourne, Melbourne, Victoria, Australia
| | - Shan He
- School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK
| | - Nicholas Geard
- School of Computing and Information Systems, University of Melbourne, Melbourne, Victoria, Australia
| | - Karin Verspoor
- School of Computing and Information Systems, University of Melbourne, Melbourne, Victoria, Australia
33
Reddy S, Bhaskar R, Padmanabhan S, Verspoor K, Mamillapalli C, Lahoti R, Makinen VP, Pradhan S, Kushwah P, Sinha S. Use and validation of text mining and cluster algorithms to derive insights from Corona Virus Disease-2019 (COVID-19) medical literature. Comput Methods Programs Biomed Update 2021; 1:100010. [PMID: 34337589 DOI: 10.1016/j.cmpbup.2021.100014] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 02/11/2021] [Revised: 04/01/2021] [Accepted: 04/02/2021] [Indexed: 05/26/2023]
Abstract
The emergence of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) late last year has led not only to the world-wide coronavirus disease 2019 (COVID-19) pandemic but also to a deluge of biomedical literature. Following the release of the COVID-19 open research dataset (CORD-19) comprising over 200,000 scholarly articles, we, a multi-disciplinary team of data scientists, clinicians, medical researchers and software engineers, developed an innovative natural language processing (NLP) platform that combines an advanced search engine with a biomedical named entity recognition extraction package. In particular, the platform was developed to extract information relating to clinical risk factors for COVID-19 by presenting the results in a cluster format to support knowledge discovery. Here we describe the principles behind the development, the model and the results we obtained.
34
He J, Nguyen DQ, Akhondi SA, Druckenbrodt C, Thorne C, Hoessel R, Afzal Z, Zhai Z, Fang B, Yoshikawa H, Albahem A, Cavedon L, Cohn T, Baldwin T, Verspoor K. ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents. Front Res Metr Anal 2021; 6:654438. [PMID: 33870071 PMCID: PMC8028406 DOI: 10.3389/frma.2021.654438] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 01/16/2021] [Accepted: 02/24/2021] [Indexed: 11/21/2022] Open
Abstract
Chemical patents represent a valuable source of information about new chemical compounds, which is critical to the drug discovery process. Automated information extraction over chemical patents is, however, a challenging task due to the large volume of existing patents and the complex linguistic properties of chemical patents. The Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab 2020, part of the Conference and Labs of the Evaluation Forum 2020 (CLEF2020), was introduced to support the development of advanced text mining techniques for chemical patents. The ChEMU 2020 lab proposed two fundamental information extraction tasks focusing on chemical reaction processes described in chemical patents: (1) chemical named entity recognition, requiring identification of essential chemical entities and their roles in chemical reactions, as well as reaction conditions; and (2) event extraction, which aims at identification of event steps relating the entities involved in chemical reactions. The ChEMU 2020 lab received 37 team registrations and 46 runs. Overall, the performance of submissions for these tasks exceeded our expectations, with the top systems outperforming strong baselines. We further show the methods to be robust to variations in sampling of the test data. We provide a detailed overview of the ChEMU 2020 corpus and its annotation, showing that inter-annotator agreement is very strong. We also present the methods adopted by participants, provide a detailed analysis of their performance, and carefully consider the potential impact of data leakage on interpretation of the results. The ChEMU 2020 Lab has shown the viability of automated methods to support information extraction of key information in chemical patents.
Affiliation(s)
- Jiayuan He
- The University of Melbourne, Parkville, VIC, Australia.,RMIT University, Melbourne, VIC, Australia
| | - Dat Quoc Nguyen
- The University of Melbourne, Parkville, VIC, Australia.,VinAI Research, Hanoi, Vietnam
| | | | | | - Camilo Thorne
- Elsevier Information Systems GmbH, Frankfurt, Germany
| | - Ralph Hoessel
- Elsevier Information Systems GmbH, Frankfurt, Germany
| | | | - Zenan Zhai
- The University of Melbourne, Parkville, VIC, Australia
| | - Biaoyan Fang
- The University of Melbourne, Parkville, VIC, Australia
| | - Hiyori Yoshikawa
- The University of Melbourne, Parkville, VIC, Australia.,Fujitsu Laboratories Ltd., Tokyo, Japan
| | - Ameer Albahem
- The University of Melbourne, Parkville, VIC, Australia.,RMIT University, Melbourne, VIC, Australia
| | | | - Trevor Cohn
- The University of Melbourne, Parkville, VIC, Australia
| | | | - Karin Verspoor
- The University of Melbourne, Parkville, VIC, Australia.,RMIT University, Melbourne, VIC, Australia
35
Reddy S, Bhaskar R, Padmanabhan S, Verspoor K, Mamillapalli C, Lahoti R, Makinen VP, Pradhan S, Kushwah P, Sinha S. Use and validation of text mining and cluster algorithms to derive insights from Corona Virus Disease-2019 (COVID-19) medical literature. Comput Methods Programs Biomed Update 2021; 1:100010. [PMID: 34337589 PMCID: PMC8050406 DOI: 10.1016/j.cmpbup.2021.100010] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/11/2021] [Revised: 04/01/2021] [Accepted: 04/02/2021] [Indexed: 05/04/2023]
Abstract
The emergence of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) late last year has led not only to the world-wide coronavirus disease 2019 (COVID-19) pandemic but also to a deluge of biomedical literature. Following the release of the COVID-19 open research dataset (CORD-19) comprising over 200,000 scholarly articles, we, a multi-disciplinary team of data scientists, clinicians, medical researchers and software engineers, developed an innovative natural language processing (NLP) platform that combines an advanced search engine with a biomedical named entity recognition extraction package. In particular, the platform was developed to extract information relating to clinical risk factors for COVID-19 by presenting the results in a cluster format to support knowledge discovery. Here we describe the principles behind the development, the model and the results we obtained.
36
Robinson J, Witt K, Lamblin M, Spittal MJ, Carter G, Verspoor K, Page A, Rajaram G, Rozova V, Hill NTM, Pirkis J, Bleeker C, Pleban A, Knott JC. Development of a Self-Harm Monitoring System for Victoria. Int J Environ Res Public Health 2020; 17:ijerph17249385. [PMID: 33333970 PMCID: PMC7765445 DOI: 10.3390/ijerph17249385] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/05/2020] [Revised: 11/28/2020] [Accepted: 12/10/2020] [Indexed: 12/18/2022]
Abstract
The prevention of suicide and suicide-related behaviour are key policy priorities in Australia and internationally. The World Health Organization has recommended that member states develop self-harm surveillance systems as part of their suicide prevention efforts. This is also a priority under Australia’s Fifth National Mental Health and Suicide Prevention Plan. The aim of this paper is to describe the development of a state-based self-harm monitoring system in Victoria, Australia. In this system, data on all self-harm presentations are collected from eight hospital emergency departments in Victoria. A natural language processing classifier that uses machine learning to identify episodes of self-harm is currently being developed. This uses the free-text triage case notes, together with certain structured data fields, contained within the metadata of the incoming records. Post-processing is undertaken to identify primary mechanism of injury, substances consumed (including alcohol, illicit drugs and pharmaceutical preparations) and presence of psychiatric disorders. This system will ultimately leverage routinely collected data in combination with advanced artificial intelligence methods to support robust community-wide monitoring of self-harm. Once fully operational, this system will provide accurate and timely information on all presentations to participating emergency departments for self-harm, thereby providing a useful indicator for Australia’s suicide prevention efforts.
Affiliation(s)
- Jo Robinson
- Orygen, Parkville, VIC 3052, Australia; (K.W.); (M.L.); (G.R.); (N.T.M.H.); (C.B.)
- Centre for Youth Mental Health, The University of Melbourne, Parkville, VIC 3052, Australia
- Correspondence: ; Tel.: +61-393-420-2866
| | - Katrina Witt
- Orygen, Parkville, VIC 3052, Australia; (K.W.); (M.L.); (G.R.); (N.T.M.H.); (C.B.)
- Centre for Youth Mental Health, The University of Melbourne, Parkville, VIC 3052, Australia
| | - Michelle Lamblin
- Orygen, Parkville, VIC 3052, Australia; (K.W.); (M.L.); (G.R.); (N.T.M.H.); (C.B.)
- Centre for Youth Mental Health, The University of Melbourne, Parkville, VIC 3052, Australia
| | - Matthew J. Spittal
- Centre for Mental Health, Melbourne School of Population and Global Health, The University of Melbourne, Parkville, VIC 3010 Australia; (M.J.S.); (J.P.)
| | - Greg Carter
- Centre for Brain and Mental Health Research, Faculty of Health and Medicine, University of Newcastle, Callaghan, NSW 2308, Australia;
- Calvary Mater Newcastle, Callaghan, NSW 2308, Australia
| | - Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3052, Australia; (K.V.); (V.R.)
- Centre for Digital Transformation of Health, The University of Melbourne, Melbourne, VIC 3000, Australia
| | - Andrew Page
- Translational Health Research Institute, Western Sydney University, Campbelltown, NSW 2560, Australia;
| | - Gowri Rajaram
- Orygen, Parkville, VIC 3052, Australia; (K.W.); (M.L.); (G.R.); (N.T.M.H.); (C.B.)
- Centre for Youth Mental Health, The University of Melbourne, Parkville, VIC 3052, Australia
| | - Vlada Rozova
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3052, Australia; (K.V.); (V.R.)
| | - Nicole T. M. Hill
- Orygen, Parkville, VIC 3052, Australia; (K.W.); (M.L.); (G.R.); (N.T.M.H.); (C.B.)
- Centre for Youth Mental Health, The University of Melbourne, Parkville, VIC 3052, Australia
- Telethon Kids Institute, Nedlands, WA 6009, Australia
| | - Jane Pirkis
- Centre for Mental Health, Melbourne School of Population and Global Health, The University of Melbourne, Parkville, VIC 3010 Australia; (M.J.S.); (J.P.)
| | - Caitlin Bleeker
- Orygen, Parkville, VIC 3052, Australia; (K.W.); (M.L.); (G.R.); (N.T.M.H.); (C.B.)
- Centre for Youth Mental Health, The University of Melbourne, Parkville, VIC 3052, Australia
| | - Alex Pleban
- Mid-West Area Mental Health Service, Emergency Department, Sunshine Hospital, Sunshine, VIC 3021, Australia;
| | - Jonathan C. Knott
- Centre for Integrated Critical Care, Melbourne Medical School, The University of Melbourne, Parkville, VIC 3010, Australia;
37
Al Bkhetan Z, Chana G, Ramamohanarao K, Verspoor K, Goudey B. Evaluation of consensus strategies for haplotype phasing. Brief Bioinform 2020; 22:5998997. [PMID: 33236761 DOI: 10.1093/bib/bbaa280] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/13/2020] [Revised: 09/22/2020] [Accepted: 09/22/2020] [Indexed: 01/05/2023] Open
Abstract
Haplotype phasing is a critical step for many genetic applications but incorrect estimates of phase can negatively impact downstream analyses. One proposed strategy to improve phasing accuracy is to combine multiple independent phasing estimates to overcome the limitations of any individual estimate. However, such a strategy is yet to be thoroughly explored. This study provides a comprehensive evaluation of consensus strategies for haplotype phasing. We explore the performance of different consensus paradigms, and the effect of specific constituent tools, across several datasets with different characteristics and their impact on the downstream task of genotype imputation. Based on the outputs of existing phasing tools, we explore two different strategies to construct haplotype consensus estimators: voting across outputs from multiple phasing tools and multiple outputs of a single non-deterministic tool. We find that the consensus approach from multiple tools reduces SE by an average of 10% compared to any constituent tool when applied to European populations and has the highest accuracy regardless of population ethnicity, sample size, variant density or variant frequency. Furthermore, the consensus estimator improves the accuracy of the downstream task of genotype imputation carried out by the widely used Minimac3, pbwt and BEAGLE5 tools. Our results provide guidance on how to produce the most accurate phasing estimates and the trade-offs that a consensus approach may have. Our implementation of consensus haplotype phasing, consHap, is available freely at https://github.com/ziadbkh/consHap. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
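The idea of voting across the outputs of several phasing tools can be sketched as follows. This is a deliberately simplified illustration, not the consHap implementation: it assumes each tool reports one haplotype of a single individual as a string of 0/1 alleles over the same heterozygous sites (the second haplotype being its complement), aligns each estimate to the first one with a single global flip, and then takes a per-site majority. Real phasing errors are local switches, so a production method would need a finer alignment. All names are hypothetical.

```python
def complement(haplotype):
    """Return the partner haplotype at heterozygous sites (swap 0 and 1)."""
    return "".join("1" if allele == "0" else "0" for allele in haplotype)


def consensus_haplotype(estimates):
    """Per-site majority vote over haplotype estimates from several tools.

    Haplotype labels are arbitrary, so each estimate is first flipped, if
    needed, to the orientation that agrees best with the first estimate.
    Ties (possible with an even number of tools) resolve to "0" here, so an
    odd number of estimates is the safer choice.
    """
    if len({len(est) for est in estimates}) != 1:
        raise ValueError("need at least one estimate, all of equal length")

    reference = estimates[0]
    aligned = []
    for est in estimates:
        mismatches = sum(a != b for a, b in zip(est, reference))
        # Flip when the complement matches the reference at more sites.
        aligned.append(complement(est) if 2 * mismatches > len(est) else est)

    consensus = []
    for column in zip(*aligned):
        ones = column.count("1")
        consensus.append("1" if 2 * ones > len(column) else "0")
    return "".join(consensus)


# Three hypothetical tool outputs: the second uses the opposite labelling,
# and the third disagrees with the others at the third site.
tool_outputs = ["0101", "1010", "0111"]
voted = consensus_haplotype(tool_outputs)
```

The same function covers the second strategy mentioned in the abstract if the inputs are repeated runs of one non-deterministic tool rather than outputs of different tools.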
Affiliation(s)
- Ziad Al Bkhetan
- School of Computing and Information Systems at the University of Melbourne
| | | | | | - Karin Verspoor
- School of Computing and Information Systems at the University of Melbourne
| | - Benjamin Goudey
- IBM Research Australia and an Honorary Research Fellow at the School of Computing and Information Systems, University of Melbourne
38
Hardefeldt L, Hur B, Verspoor K, Baldwin T, Bailey KE, Scarborough R, Richards S, Billman-Jacobe H, Browning GF, Gilkerson J. Use of cefovecin in dogs and cats attending first-opinion veterinary practices in Australia. Vet Rec 2020; 187:e95. [PMID: 32826347 DOI: 10.1136/vr.105997] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/21/2020] [Revised: 06/13/2020] [Accepted: 07/13/2020] [Indexed: 12/12/2022]
Abstract
BACKGROUND Cefovecin is a long-acting third-generation cephalosporin commonly used in veterinary medicine. Third-generation cephalosporins are critically important antimicrobials that should only be used after culture and susceptibility testing. The authors describe the common indications for cefovecin use in dogs and cats, and the frequency of culture and susceptibility testing. MATERIALS AND METHODS A cross-sectional study was performed using clinical records extracted from VetCompass Australia. A previously described method was used to identify records containing cefovecin. The reason for cefovecin use was annotated in situ in each consultation text. RESULTS Over a six-month period (February and September 2018), 5180 (0.4 per cent) consultations involved cefovecin administration, of which 151 were excluded. Cats were administered cefovecin more frequently than dogs (1.9 per cent of cat consultations and 0.1 per cent of dog consultations). The most common reasons for cefovecin administration to cats were cat fight injuries and abscesses (28 per cent) and dermatitis (13 per cent). For dogs, the most common reasons for cefovecin administration were surgical prophylaxis (24 per cent) and dermatitis (19 per cent). Culture and susceptibility testing were reported in 16 cases (0.3 per cent). CONCLUSION Cefovecin is used in many scenarios in dogs and cats where antimicrobials may be either not indicated or where an antimicrobial of lower importance to human health is recommended.
Affiliation(s)
- Laura Hardefeldt
- National Centre for Antimicrobial Stewardship, Carlton, Victoria, Australia .,Asia Pacific Centre for Animal Health, University of Melbourne, Parkville, Victoria, Australia
| | - Brian Hur
- National Centre for Antimicrobial Stewardship, Carlton, Victoria, Australia.,Asia Pacific Centre for Animal Health, University of Melbourne, Parkville, Victoria, Australia.,School of Computing and Information Systems, University of Melbourne, Parkville, Victoria, Australia
| | - Karin Verspoor
- School of Computing and Information Systems, University of Melbourne, Parkville, Victoria, Australia.,Health and Biomedical Informatics Centre, University of Melbourne, Parkville, Victoria, Australia
| | - Timothy Baldwin
- School of Computing and Information Systems, University of Melbourne, Parkville, Victoria, Australia
| | - Kirsten E Bailey
- National Centre for Antimicrobial Stewardship, Carlton, Victoria, Australia.,Asia Pacific Centre for Animal Health, University of Melbourne, Parkville, Victoria, Australia
| | - Ri Scarborough
- National Centre for Antimicrobial Stewardship, Carlton, Victoria, Australia.,Asia Pacific Centre for Animal Health, University of Melbourne, Parkville, Victoria, Australia
| | - Suzanna Richards
- National Centre for Antimicrobial Stewardship, Carlton, Victoria, Australia.,Veterinary Biosciences, University of Melbourne, Parkville, Victoria, Australia
| | - Helen Billman-Jacobe
- National Centre for Antimicrobial Stewardship, Carlton, Victoria, Australia.,Asia Pacific Centre for Animal Health, University of Melbourne, Parkville, Victoria, Australia
| | - Glenn Francis Browning
- National Centre for Antimicrobial Stewardship, Carlton, Victoria, Australia.,Asia Pacific Centre for Animal Health, University of Melbourne, Parkville, Victoria, Australia
| | - James Gilkerson
- Asia Pacific Centre for Animal Health, University of Melbourne, Parkville, Victoria, Australia
39
Pedersen M, Verspoor K, Jenkinson M, Law M, Abbott DF, Jackson GD. Artificial intelligence for clinical decision support in neurology. Brain Commun 2020; 2:fcaa096. [PMID: 33134913 PMCID: PMC7585692 DOI: 10.1093/braincomms/fcaa096] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 05/19/2020] [Revised: 05/19/2020] [Accepted: 06/12/2020] [Indexed: 01/13/2023] Open
Abstract
Artificial intelligence is one of the most exciting methodological shifts in our era. It holds the potential to transform healthcare as we know it, to a system where humans and machines work together to provide better treatment for our patients. It is now clear that cutting edge artificial intelligence models in conjunction with high-quality clinical data will lead to improved prognostic and diagnostic models in neurological disease, facilitating expert-level clinical decision tools across healthcare settings. Despite the clinical promise of artificial intelligence, machine and deep-learning algorithms are not a one-size-fits-all solution for all types of clinical data and questions. In this article, we provide an overview of the core concepts of artificial intelligence, particularly contemporary deep-learning methods, to give clinicians and neuroscience researchers an appreciation of how artificial intelligence can be harnessed to support clinical decisions. We clarify and emphasize the data quality and the human expertise needed to build robust clinical artificial intelligence models in neurology. As artificial intelligence is a rapidly evolving field, we take the opportunity to iterate important ethical principles to guide the field of medicine as it moves into an artificial intelligence-enhanced future.
Affiliation(s)
- Mangor Pedersen
- The Florey Institute of Neuroscience and Mental Health, The University of Melbourne, Heidelberg, VIC 3084, Australia.,Department of Psychology, Auckland University of Technology (AUT), Auckland, 0627, New Zealand
| | - Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
| | - Mark Jenkinson
- Wellcome Centre for Integrative Neuroimaging, FMRIB, Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, OX3 9DU, UK.,South Australian Health and Medical Research Institute (SAHMRI), Adelaide, SA 5000, Australia.,Australian Institute for Machine Learning (AIML), The University of Adelaide, Adelaide, SA 5000, Australia
| | - Meng Law
- Department of Radiology, Alfred Hospital, Melbourne, VIC 3181, Australia.,Department of Electrical and Computer Systems Engineering, Monash University, Melbourne, VIC 3181, Australia.,Department of Neuroscience, Monash School of Medicine, Nursing and Health Sciences, Melbourne, VIC 3181, Australia
| | - David F Abbott
- The Florey Institute of Neuroscience and Mental Health, The University of Melbourne, Heidelberg, VIC 3084, Australia.,Department of Medicine Austin Health, The University of Melbourne, Heidelberg, VIC 3084, Australia
| | - Graeme D Jackson
- The Florey Institute of Neuroscience and Mental Health, The University of Melbourne, Heidelberg, VIC 3084, Australia.,Department of Medicine Austin Health, The University of Melbourne, Heidelberg, VIC 3084, Australia.,Department of Neurology, Austin Health, Heidelberg, VIC 3084, Australia
40
Cao K, Verspoor K, Sahebjada S, Baird PN. Evaluating the Performance of Various Machine Learning Algorithms to Detect Subclinical Keratoconus. Transl Vis Sci Technol 2020; 9:24. [PMID: 32818085 PMCID: PMC7396174 DOI: 10.1167/tvst.9.2.24] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 09/30/2019] [Accepted: 02/05/2020] [Indexed: 12/26/2022] Open
Abstract
PURPOSE Keratoconus (KC) represents one of the leading causes of corneal transplantation worldwide. Detecting subclinical KC would lead to better management to avoid the need for corneal grafts, but the condition is clinically challenging to diagnose. We wished to compare eight commonly used machine learning algorithms using a range of parameter combinations by applying them to our KC dataset and build models to better differentiate subclinical KC from non-KC eyes. METHODS Oculus Pentacam was used to obtain corneal parameters on 49 subclinical KC and 39 control eyes, along with clinical and demographic parameters. Eight machine learning methods were applied to build models to differentiate subclinical KC from control eyes. Dominant algorithms were trained with all combinations of the considered parameters to select important parameter combinations. The performance of each model was evaluated and compared. RESULTS Using a total of eleven parameters, random forest, support vector machine and k-nearest neighbors had better performance in detecting subclinical KC. The highest area under the curve of 0.97 for detecting subclinical KC was achieved using five parameters by the random forest method. The highest sensitivity (0.94) and specificity (0.90) were obtained by the support vector machine and the k-nearest neighbor model, respectively. CONCLUSIONS This study showed machine learning algorithms can be applied to identify subclinical KC using a minimal parameter set that is routinely collected during clinical eye examination. TRANSLATIONAL RELEVANCE Machine learning algorithms can be built using routinely collected clinical parameters that will assist in the objective detection of subclinical KC.
Affiliation(s)
- Ke Cao
- Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, Melbourne, Victoria, Australia.,Department of Surgery, Ophthalmology, The University of Melbourne, Melbourne, Victoria, Australia
| | - Karin Verspoor
- Department of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia
| | - Srujana Sahebjada
- Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, Melbourne, Victoria, Australia.,Department of Surgery, Ophthalmology, The University of Melbourne, Melbourne, Victoria, Australia
| | - Paul N Baird
- Department of Surgery, Ophthalmology, The University of Melbourne, Melbourne, Victoria, Australia
| |
|
41
|
Jose JM, Yilmaz E, Magalhães J, Castells P, Ferro N, Silva MJ, Martins F, Akhondi SA, Cohn T, Baldwin T, Verspoor K. ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. Lecture Notes in Computer Science 2020; 12036. [PMCID: PMC7148043 DOI: 10.1007/978-3-030-45442-5_74] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 12/23/2022]
Abstract
We introduce a new evaluation lab named ChEMU (Cheminformatics Elsevier Melbourne University), part of the 11th Conference and Labs of the Evaluation Forum (CLEF-2020). ChEMU involves two key information extraction tasks over chemical reactions from patents. Task 1—Named entity recognition—involves identifying chemical compounds as well as their types in context, i.e., to assign the label of a chemical compound according to the role which the compound plays within a chemical reaction. Task 2—Event extraction over chemical reactions—involves event trigger detection and argument recognition. We briefly present the motivations and goals of the ChEMU tasks, as well as resources and evaluation methodology.
|
42
|
Al Bkhetan Z, Zobel J, Kowalczyk A, Verspoor K, Goudey B. Exploring effective approaches for haplotype block phasing. BMC Bioinformatics 2019; 20:540. [PMID: 31666002 PMCID: PMC6822470 DOI: 10.1186/s12859-019-3095-8] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 02/25/2019] [Accepted: 09/10/2019] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND Knowledge of phase, the specific allele sequence on each copy of homologous chromosomes, is increasingly recognized as critical for detecting certain classes of disease-associated mutations. One approach for detecting such mutations is through phased haplotype association analysis. While the accuracy of methods for phasing genotype data has been widely explored, there has been little attention given to phasing accuracy at haplotype block scale. Understanding the combined impact of the accuracy of phasing tool and the method used to determine haplotype blocks on the error rate within the determined blocks is essential to conduct accurate haplotype analyses. RESULTS We present a systematic study exploring the relationship between seven widely used phasing methods and two common methods for determining haplotype blocks. The evaluation focuses on the number of haplotype blocks that are incorrectly phased. Insights from these results are used to develop a haplotype estimator based on a consensus of three tools. The consensus estimator achieved the most accurate phasing in all applied tests. Individually, EAGLE2, BEAGLE and SHAPEIT2 alternate in being the best performing tool in different scenarios. Determining haplotype blocks based on linkage disequilibrium leads to more correctly phased blocks compared to a sliding window approach. We find that there is little difference between phasing sections of a genome (e.g. a gene) compared to phasing entire chromosomes. Finally, we show that the location of phasing error varies when the tools are applied to the same data several times, a finding that could be important for downstream analyses. CONCLUSIONS The choice of phasing and block determination algorithms and their interaction impacts the accuracy of phased haplotype blocks. This work provides guidance and evidence for the different design choices needed for analyses using haplotype blocks. 
The study highlights a number of issues that may have limited the replicability of previous haplotype analysis.
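The consensus idea described in this abstract can be illustrated with a per-site majority vote over the outputs of three tools. This is a minimal sketch under stated assumptions and is not the authors' estimator: the function names, the encoding of one haplotype as a string of 0/1 alleles at heterozygous sites, and the orientation step against the first tool's output are all hypothetical choices made for illustration.

```python
from collections import Counter


def flip(hap: str) -> str:
    """Return the complementary haplotype at heterozygous sites (0 <-> 1)."""
    return hap.translate(str.maketrans("01", "10"))


def hamming(a: str, b: str) -> int:
    """Count the sites at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def orient(hap: str, reference: str) -> str:
    """Pick hap or its complement, whichever agrees with reference at more sites.

    Phasing tools may report the two haplotypes in either order, so outputs
    are aligned to a common orientation before voting (an assumed step).
    """
    flipped = flip(hap)
    return hap if hamming(hap, reference) <= hamming(flipped, reference) else flipped


def consensus_haplotype(calls: list[str]) -> str:
    """Majority vote per site over an odd number of tool outputs."""
    if len(calls) % 2 == 0:
        raise ValueError("use an odd number of tools so every vote has a majority")
    if len({len(c) for c in calls}) != 1:
        raise ValueError("all haplotypes must cover the same sites")
    reference = calls[0]
    aligned = [orient(c, reference) for c in calls]
    return "".join(Counter(site).most_common(1)[0][0] for site in zip(*aligned))


# Hypothetical outputs of three tools for four heterozygous sites.
tool_calls = ["0101", "0111", "1010"]
result = consensus_haplotype(tool_calls)
print(result)
```

In this toy input the third tool reports the complementary haplotype, so it is flipped before voting; the second tool is then outvoted at the one site where it disagrees.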
Affiliation(s)
- Ziad Al Bkhetan
- School of Computing & Information Systems, University of Melbourne, Parkville, 3010, Australia
| | - Justin Zobel
- School of Computing & Information Systems, University of Melbourne, Parkville, 3010, Australia
| | - Adam Kowalczyk
- School of Computing & Information Systems, University of Melbourne, Parkville, 3010, Australia; Centre for Neural Engineering, University of Melbourne, Carlton, 3053, Australia; Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, 00-662, Poland; Centre for Epidemiology and Biostatistics, The University of Melbourne, Parkville, 3010, Australia
| | - Karin Verspoor
- School of Computing & Information Systems, University of Melbourne, Parkville, 3010, Australia.
| | - Benjamin Goudey
- Centre for Epidemiology and Biostatistics, The University of Melbourne, Parkville, 3010, Australia; IBM Australia - Research, Southgate, 3006, Australia
| |
|
43
|
Hassanzadeh H, Nguyen A, Verspoor K. Quantifying semantic similarity of clinical evidence in the biomedical literature to facilitate related evidence synthesis. J Biomed Inform 2019; 100:103321. [PMID: 31676460 DOI: 10.1016/j.jbi.2019.103321] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 06/04/2019] [Revised: 09/28/2019] [Accepted: 10/25/2019] [Indexed: 10/25/2022]
Abstract
OBJECTIVE Published clinical trials and high quality peer reviewed medical publications are considered as the main sources of evidence used for synthesizing systematic reviews or practicing Evidence Based Medicine (EBM). Finding all relevant published evidence for a particular medical case is a time and labour intensive task, given the breadth of the biomedical literature. Automatic quantification of conceptual relationships between key clinical evidence within and across publications, despite variations in the expression of clinically-relevant concepts, can help to facilitate synthesis of evidence. In this study, we aim to provide an approach towards expediting evidence synthesis by quantifying semantic similarity of key evidence as expressed in the form of individual sentences. Such semantic textual similarity can be applied as a key approach for supporting selection of related studies. MATERIAL AND METHODS We propose a generalisable approach for quantifying semantic similarity of clinical evidence in the biomedical literature, specifically considering the similarity of sentences corresponding to a given type of evidence, such as clinical interventions, population information, clinical findings, etc. We develop three sets of generic, ontology-based, and vector-space models of similarity measures that make use of a variety of lexical, conceptual, and contextual information to quantify the similarity of full sentences containing clinical evidence. To understand the impact of different similarity measures on the overall evidence semantic similarity quantification, we provide a comparative analysis of these measures when used as input to an unsupervised linear interpolation and a supervised regression ensemble. In order to provide a reliable test-bed for this experiment, we generate a dataset of 1000 pairs of sentences from biomedical publications that are annotated by ten human experts. 
We also extend the experiments on an external dataset for further generalisability testing. RESULTS The combination of all diverse similarity measures showed stronger correlations with the gold standard similarity scores in the dataset than any individual kind of measure. Our approach reached near 0.80 average Pearson correlation across different clinical evidence types using the devised similarity measures. Although they were more effective when combined together, individual generic and vector-space measures also resulted in strong similarity quantification when used in both unsupervised and supervised models. On the external dataset, our similarity measures were highly competitive with the state-of-the-art approaches developed and trained specifically on that dataset for predicting semantic similarity. CONCLUSION Experimental results showed that the proposed semantic similarity quantification approach can effectively identify related clinical evidence that is reported in the literature. The comparison with a state-of-the-art method demonstrated the effectiveness of the approach, and experiments with an external dataset support its generalisability.
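The unsupervised combination step mentioned in this abstract, a linear interpolation of several similarity measures evaluated by Pearson correlation against gold scores, might look roughly like the following. This is a sketch only: the weights, the example scores and the function names are hypothetical and are not taken from the paper.

```python
from math import sqrt


def interpolate(scores: list[float], weights: list[float]) -> float:
    """Weighted average of several similarity scores for one sentence pair."""
    if len(scores) != len(weights):
        raise ValueError("one weight per similarity measure is required")
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores from three measures (lexical, ontology-based,
# vector-space) for four sentence pairs, plus made-up gold annotations.
pair_scores = [
    [0.9, 0.8, 0.7],
    [0.2, 0.4, 0.3],
    [0.6, 0.5, 0.7],
    [0.1, 0.2, 0.1],
]
gold = [4.5, 1.5, 3.0, 0.5]
weights = [1.0, 1.0, 1.0]  # equal weights, an assumption for illustration

combined = [interpolate(scores, weights) for scores in pair_scores]
correlation = pearson(combined, gold)
print(round(correlation, 3))
```

A supervised regression ensemble would instead learn the weights from annotated pairs; the evaluation step stays the same.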
Affiliation(s)
- Hamed Hassanzadeh
- The Australian e-Health Research Centre, CSIRO, Brisbane, QLD, Australia.
| | - Anthony Nguyen
- The Australian e-Health Research Centre, CSIRO, Brisbane, QLD, Australia.
| | - Karin Verspoor
- School of Computing and Information Systems, University of Melbourne, Melbourne, VIC, Australia.
| |
|
44
|
Lopez-Campos G, Kiossoglou P, Borda A, Hawthorne C, Gray K, Verspoor K. Characterizing the Scope of Exposome Research Through Topic Modeling and Ontology Analysis. Stud Health Technol Inform 2019; 264:1530-1531. [PMID: 31438216 DOI: 10.3233/shti190519] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/15/2022]
Abstract
Exposomics is a field of research which is receiving growing attention. In this work, we characterize the exposome research landscape and update our previous study of formal knowledge representation approaches to this field. We applied a deductive analysis using the National Center for Biomedical Ontology Recommender for comparability of the results generated from a literature dataset and newly available ontologies with our previously published work. We highlight the changes in ontology recommendations.
Affiliation(s)
- Guillermo Lopez-Campos
- Wellcome-Wolfson Institute for Experimental Medicine, Queen's University of Belfast, Belfast, Northern Ireland, United Kingdom; Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria, Australia
| | - Philip Kiossoglou
- Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria, Australia
| | - Ann Borda
- Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria, Australia
| | - Christopher Hawthorne
- Wellcome-Wolfson Institute for Experimental Medicine, Queen's University of Belfast, Belfast, Northern Ireland, United Kingdom
| | - Kathleen Gray
- Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria, Australia; School of Computing and Information Systems, The University of Melbourne, Parkville, Victoria, Australia
| | - Karin Verspoor
- Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria, Australia; School of Computing and Information Systems, The University of Melbourne, Parkville, Victoria, Australia
| |
|
45
|
Hur B, Hardefeldt LY, Verspoor K, Baldwin T, Gilkerson JR. Using natural language processing and VetCompass to understand antimicrobial usage patterns in Australia. Aust Vet J 2019; 97:298-300. [PMID: 31209869 DOI: 10.1111/avj.12836] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 01/07/2019] [Accepted: 02/16/2019] [Indexed: 11/30/2022]
Abstract
BACKGROUND Currently there is an incomplete understanding of antimicrobial usage patterns in veterinary clinics in Australia, but such knowledge is critical for the successful implementation and monitoring of antimicrobial stewardship programs. METHODS VetCompass Australia collects medical records from 181 clinics in Australia (as of May 2018). These records contain detailed information from individual consultations regarding the medications dispensed. One unique aspect of VetCompass Australia is its focus on applying natural language processing (NLP) and machine learning techniques to analyse the records, similar to efforts conducted in other medical studies. RESULTS The free text fields of 4,394,493 veterinary consultation records of dogs and cats between 2013 and 2018 were collated by VetCompass Australia, and NLP techniques were applied to enable querying of antimicrobial usage within these consultations. CONCLUSION The NLP algorithms developed matched antimicrobials in clinical records with 96.7% accuracy and an F1 score of 0.85, as evaluated relative to expert annotations. This dataset can be readily queried to demonstrate the antimicrobial usage patterns of companion animal practices throughout Australia.
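Accuracy and F1 score, both reported in this abstract, can differ noticeably when most records contain no match, because accuracy counts true negatives and F1 does not. The following sketch shows the two formulas side by side. The confusion-matrix counts are hypothetical and chosen only for illustration; they are not the study's data.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all decisions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; true negatives play no role."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical counts: most records contain no antimicrobial mention.
tp, fp, fn, tn = 40, 10, 10, 940

print(accuracy(tp, fp, fn, tn))  # 0.98
print(f1_score(tp, fp, fn))      # 0.8
```

With these made-up counts the many true negatives lift accuracy to 0.98 while F1 stays at 0.8, which is why the two figures are usually reported together.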
Affiliation(s)
- B Hur
- Asia-Pacific Centre for Animal Health, Melbourne Veterinary School, University of Melbourne, Parkville, Victoria, Australia; School of Computing and Information Systems, University of Melbourne, Parkville, VIC, Australia
| | - L Y Hardefeldt
- Asia-Pacific Centre for Animal Health, Melbourne Veterinary School, University of Melbourne, Parkville, Victoria, Australia
| | - K Verspoor
- School of Computing and Information Systems, University of Melbourne, Parkville, VIC, Australia; Health and Biomedical Informatics Centre, University of Melbourne, Parkville, VIC, Australia
| | - T Baldwin
- School of Computing and Information Systems, University of Melbourne, Parkville, VIC, Australia
| | - J R Gilkerson
- Asia-Pacific Centre for Animal Health, Melbourne Veterinary School, University of Melbourne, Parkville, Victoria, Australia
| |
|
46
|
Bouadjenek MR, Zobel J, Verspoor K. Automated assessment of biological database assertions using the scientific literature. BMC Bioinformatics 2019; 20:216. [PMID: 31035936 PMCID: PMC6489365 DOI: 10.1186/s12859-019-2801-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/14/2018] [Accepted: 04/09/2019] [Indexed: 12/27/2022] Open
Abstract
Background The large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. We explore in this paper an automated method for assessing the consistency of biological assertions, to assist biocurators, which we call BARC, Biocuration tool for Assessment of Relation Consistency. In this method a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we then use our novel set-based relevance algorithm SaBRA to retrieve pertinent literature, and apply a classifier to estimate the likelihood that this relation (assertion) is correct. Results Our experiments on assessing gene–disease relations and protein–protein interactions using the PubMed Central collection show that BARC can be effective at assisting curators to perform data cleansing. Specifically, the results obtained showed that BARC substantially outperforms the best baselines, with an improvement of F-measure of 3.5% and 13%, respectively, on gene-disease relations and protein-protein interactions. We have additionally carried out a feature analysis that showed that all feature types are informative, as are all fields of the documents. Conclusions BARC provides a clear benefit for the biocuration community, as there are no prior automated tools for identifying inconsistent assertions in large-scale biological databases.
Affiliation(s)
- Mohamed Reda Bouadjenek
- Department of Mechanical & Industrial Engineering, University of Toronto, Toronto, M5S 3G8, Canada.
| | - Justin Zobel
- School of Computing and Information Systems, University of Melbourne, Melbourne, 3010, Australia
| | - Karin Verspoor
- School of Computing and Information Systems, University of Melbourne, Melbourne, 3010, Australia
| |
|
47
|
Abstract
BACKGROUND Given the importance of relation or event extraction from biomedical research publications to support knowledge capture and synthesis, and the strong dependency of approaches to this information extraction task on syntactic information, it is valuable to understand which approaches to syntactic processing of biomedical text have the highest performance. RESULTS We perform an empirical study comparing state-of-the-art traditional feature-based and neural network-based models for two core natural language processing tasks of part-of-speech (POS) tagging and dependency parsing on two benchmark biomedical corpora, GENIA and CRAFT. To the best of our knowledge, there is no recent work making such comparisons in the biomedical context; specifically no detailed analysis of neural models on this data is available. Experimental results show that in general, the neural models outperform the feature-based models on both corpora. We also perform a task-oriented evaluation to investigate the influences of these models in a downstream application on biomedical event extraction, and show that better intrinsic parsing performance does not always imply better extrinsic event extraction performance. CONCLUSION We have presented a detailed empirical study comparing traditional feature-based and neural network-based models for POS tagging and dependency parsing in the biomedical context, and also investigated the influence of parser selection for a biomedical event extraction downstream task. AVAILABILITY OF DATA AND MATERIALS We make the retrained models available at https://github.com/datquocnguyen/BioPosDep.
Affiliation(s)
- Dat Quoc Nguyen
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.
| | - Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
| |
|
48
|
Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Wei CH, Comeau DC, Antunes R, Matos S, Chen Q, Elangovan A, Panyam NC, Verspoor K, Liu H, Wang Y, Liu Z, Altinel B, Hüsünbeyi ZM, Özgür A, Fergadis A, Wang CK, Dai HJ, Tran T, Kavuluru R, Luo L, Steppi A, Zhang J, Qu J, Lu Z. Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine. Database (Oxford) 2019; 2019:5303240. [PMID: 30689846 PMCID: PMC6348314 DOI: 10.1093/database/bay147] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 03/02/2018] [Accepted: 12/19/2018] [Indexed: 12/16/2022]
Abstract
The Precision Medicine Initiative is a multicenter effort that aims to formulate personalized treatments by leveraging individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of the BioCreative VI Precision Medicine Track was to extract this particular type of information. The track was organized in two tasks: (i) a document triage task, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations and (ii) a relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods. 
Given the level of participation and the individual team results we find the precision medicine track to be successful in engaging the text-mining research community. In the meantime, the track produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPI affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.
Affiliation(s)
- Rezarta Islamaj Dogan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | | | - Chih-Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Rui Antunes
- Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
| | - Sérgio Matos
- Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
| | - Qingyu Chen
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Aparna Elangovan
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Nagesh C Panyam
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Hongfang Liu
- Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
| | - Yanshan Wang
- Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
| | - Zhuang Liu
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Berna Altinel
- Department of Computer Engineering, Marmara University, Istanbul, Turkey
| | | | | | - Aris Fergadis
- School of Electrical and Computer Engineering, National Technical University of Athens, Zografou, Athens, Greece
| | - Chen-Kai Wang
- Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan
| | - Hong-Jie Dai
- Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
| | - Tung Tran
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
| | - Ramakanth Kavuluru
- Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
| | - Ling Luo
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Albert Steppi
- Department of Statistics, Florida State University, Florida, USA
| | - Jinfeng Zhang
- Department of Statistics, Florida State University, Florida, USA
| | - Jinchan Qu
- Department of Statistics, Florida State University, Florida, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| |
|
49
|
Abstract
Duplicate sequence records, that is, records having similar or identical sequences, are a challenge when searching biological sequence databases. They significantly increase database search time and can lead to uninformative search results containing similar sequences. Sequence clustering methods have been used to address this issue by grouping similar sequences into clusters. These clusters form a nonredundant database consisting of representatives (one record per cluster) and members (the remaining records in a cluster). In this approach, for nonredundant database search, users search against representatives first and optionally expand search results by exploring member records from matching clusters. Existing studies used Precision and Recall to assess the search effectiveness of nonredundant databases. However, the use of Precision and Recall does not model user behavior in practice and thus may not reflect practical search effectiveness. In this study, we first propose innovative evaluation metrics to measure search effectiveness. The findings are that (1) the Precision of expanded sets is consistently lower than that of representatives, with a decrease up to 7% at top ranks; and (2) Recall is uninformative because, for most queries, expanded sets return more records than does search of the original unclustered databases. Motivated by these findings, we propose a solution that returns a user-specified proportion of top similar records, modeled by a ranking function that aggregates sequence and annotation similarities. In experiments undertaken on UniProtKB/Swiss-Prot, the largest expert-curated protein database, we show that our method dramatically reduces the number of returned sequences, increases Precision by 3%, and does not impact effective search time.
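The proposed solution, returning a user-specified proportion of the most similar records ranked by a function that aggregates sequence and annotation similarity, could be sketched as follows. The weighted-sum form, the `alpha` parameter, the record tuples and the function names are assumptions made for illustration; the abstract does not specify the actual ranking function.

```python
import math


def combined_score(seq_sim: float, ann_sim: float, alpha: float = 0.5) -> float:
    """Weighted sum of sequence and annotation similarity (assumed form)."""
    return alpha * seq_sim + (1 - alpha) * ann_sim


def top_proportion(
    records: list[tuple[str, float, float]],
    proportion: float,
    alpha: float = 0.5,
) -> list[str]:
    """Return the ids of the top `proportion` of records by combined score.

    Each record is (record_id, sequence_similarity, annotation_similarity).
    """
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    ranked = sorted(
        records,
        key=lambda r: combined_score(r[1], r[2], alpha),
        reverse=True,
    )
    keep = math.ceil(proportion * len(ranked))
    return [record_id for record_id, _, _ in ranked[:keep]]


# Hypothetical member records of one matching cluster.
members = [
    ("A", 0.90, 0.80),
    ("B", 0.95, 0.10),
    ("C", 0.50, 0.50),
    ("D", 0.70, 0.90),
]
selected = top_proportion(members, proportion=0.5)
print(selected)  # ['A', 'D']
```

Record B has the highest sequence similarity but a low annotation similarity, so with equal weights it falls outside the top half; changing `alpha` shifts that trade-off.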
Affiliation(s)
- Qingyu Chen
- School of Computing and Information Systems, The University of Melbourne, Parkville, Australia
| | - Xiuzhen Zhang
- School of Science, RMIT University, Melbourne, Australia
| | - Yu Wan
- Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Parkville, Australia
| | - Justin Zobel
- School of Computing and Information Systems, The University of Melbourne, Parkville, Australia
| | - Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Parkville, Australia
| |
|
50
|
Khumrin P, Ryan A, Juddy T, Verspoor K. DrKnow: A Diagnostic Learning Tool with Feedback from Automated Clinical Decision Support. AMIA Annu Symp Proc 2018; 2018:1348-1357. [PMID: 30815179 PMCID: PMC6371235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Providing medical trainees with effective feedback is critical to the successful development of their diagnostic reasoning skills. We present the design of DrKnow, a web-based learning application that utilises a clinical decision support system (CDSS) and virtual cases to support the development of problem-solving and decision-making skills in medical students. Based on the clinical information they request and prioritise, DrKnow provides personalised feedback to help students develop differential and provisional diagnoses at key decision points as they work through the virtual cases. Once students make a final diagnosis, DrKnow presents students with information about their overall diagnostic performance as well as recommendations for diagnosing similar cases. This paper argues that designing DrKnow around a task-sensitive CDSS provides a suitable approach enabling positive student learning outcomes, while simultaneously overcoming the resource challenges of expert clinician-supported bedside teaching.
Affiliation(s)
- Piyapong Khumrin
- School of Computing and Information Systems, Melbourne School of Engineering, University of Melbourne, Australia
- Department of Physiology, Faculty of Medicine, Chiang Mai University, Thailand
| | - Anna Ryan
- Department of Medical Education, Faculty of Medicine, Dentistry and Health Sciences, University of Melbourne, Australia
| | - Terry Juddy
- Department of Medical Education, Faculty of Medicine, Dentistry and Health Sciences, University of Melbourne, Australia
| | - Karin Verspoor
- School of Computing and Information Systems, Melbourne School of Engineering, University of Melbourne, Australia
| |
|